Abstract
This paper presents a comprehensive, open-access digital corpus of Song Ci poetry comprising over 20,000 poems from the Song Dynasty. The corpus was built by aggregating and standardizing texts from multiple public-domain sources into a single curated SQLite database (‘ci_curated.db’). Each poem record includes the author’s name, the rhythmic title (cipai), and the poem text in machine-readable form. As a large-scale, well-documented corpus, the dataset supports a range of computational tasks in the digital humanities, including authorship attribution, stylometry, and the training of language models on classical Chinese, and it facilitates reproducible and comparative research.
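
As a rough illustration of how records of this shape might be queried, the following Python sketch uses the standard-library sqlite3 module. The abstract does not specify the database schema, so the table name (poems) and column names (author, cipai, text) are assumptions made for illustration, as are the placeholder rows; the sketch runs against an in-memory stand-in rather than the real file.

```python
import sqlite3
from collections import Counter

# Hypothetical schema: the table name "poems" and the columns
# "author", "cipai", and "text" are assumptions, not the documented layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS poems (
    id     INTEGER PRIMARY KEY,
    author TEXT NOT NULL,
    cipai  TEXT NOT NULL,
    text   TEXT NOT NULL
)
"""


def poems_per_author(conn: sqlite3.Connection) -> Counter:
    """Count poems per author, a common first step for attribution work."""
    rows = conn.execute("SELECT author, COUNT(*) FROM poems GROUP BY author")
    return Counter(dict(rows.fetchall()))


def poems_by_cipai(conn: sqlite3.Connection, cipai: str) -> list:
    """Return (author, text) pairs for one rhythmic title, in insertion order."""
    query = "SELECT author, text FROM poems WHERE cipai = ? ORDER BY id"
    return conn.execute(query, (cipai,)).fetchall()


# In-memory stand-in with placeholder rows so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO poems (author, cipai, text) VALUES (?, ?, ?)",
    [
        ("author_a", "cipai_x", "placeholder text 1"),
        ("author_a", "cipai_y", "placeholder text 2"),
        ("author_b", "cipai_x", "placeholder text 3"),
    ],
)
conn.commit()

counts = poems_per_author(conn)
same_title = poems_by_cipai(conn, "cipai_x")
```

To try this against the actual corpus, one would replace ":memory:" with the path to ci_curated.db, skip the placeholder inserts, and adapt the table and column names to the schema the database really uses.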
