A Comprehensive Digital Corpus of Song Ci Poetry for Computational Analysis

Yao Song

doi:10.5334/johd.405

Journal of Open Humanities Data

Volume 12 (2026): Issue 1

Open Access

Jan 2026

Abstract

This paper presents and describes a comprehensive, open-access digital corpus of Song Ci poetry, comprising over 20,000 poems from the Song Dynasty. The corpus was created by aggregating and standardizing texts from multiple public-domain sources into a single, curated SQLite database (‘ci_curated.db’). Each poem record includes the author’s name, rhythmic title (cipai), and poem text in machine-readable form. By providing a large-scale, well-documented corpus, this dataset supports a wide range of computational tasks in the digital humanities—including authorship attribution, stylometry, and the training of language models on classical Chinese—and facilitates reproducible and comparative research.
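As a rough illustration of how a record structure like the one described (author, cipai, poem text) might be read from SQLite, the sketch below builds a small in-memory database and counts poems per author. The table name `poems`, the column names `author`, `cipai`, and `text`, and the helper functions are assumptions made for this example; the actual schema of `ci_curated.db` is not specified in the abstract and may differ. The three rows are well-known opening lines used as demo data, not records drawn from the corpus.

```python
import sqlite3
from collections import Counter

# ASSUMPTION: table and column names are hypothetical; the real
# ci_curated.db schema may use different names.
SCHEMA = """
CREATE TABLE IF NOT EXISTS poems (
    id     INTEGER PRIMARY KEY,
    author TEXT NOT NULL,
    cipai  TEXT NOT NULL,
    text   TEXT NOT NULL
)
"""

# Demo rows for illustration only (not taken from the corpus itself).
DEMO_ROWS = [
    ("苏轼", "水调歌头", "明月几时有，把酒问青天。"),
    ("苏轼", "念奴娇", "大江东去，浪淘尽，千古风流人物。"),
    ("李清照", "如梦令", "昨夜雨疏风骤，浓睡不消残酒。"),
]


def build_demo_db():
    """Create an in-memory database with the assumed schema and demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO poems (author, cipai, text) VALUES (?, ?, ?)",
        DEMO_ROWS,
    )
    conn.commit()
    return conn


def load_poems(conn):
    """Return every poem as a dict with author, cipai, and text keys."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    rows = conn.execute("SELECT author, cipai, text FROM poems").fetchall()
    return [dict(row) for row in rows]


def poems_per_author(poems):
    """Count how many poems each author contributes."""
    return Counter(poem["author"] for poem in poems)


if __name__ == "__main__":
    connection = build_demo_db()
    records = load_poems(connection)
    for author, count in poems_per_author(records).most_common():
        print(author, count)
    connection.close()
```

To try this against the published file, `sqlite3.connect(":memory:")` could be replaced with a connection to a local copy of `ci_curated.db`, with the query adjusted to whatever table and column names that file actually uses.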