(1) Overview
Repository location
Context
The Song Dynasty (960–1279 CE) represents a zenith in Chinese literary history, distinguished by the flourishing of Ci (词) poetry, a literary form composed according to specific melodic patterns (cipai) (Fuller, 2018). Despite its cultural significance, computational access to this literary heritage has been hampered. Numerous digital versions of these works exist, but they are frequently dispersed across disparate online collections, exhibiting pervasive inconsistencies in formatting, metadata, and textual accuracy (Song et al., 2024). This fragmentation poses a considerable barrier to systematic research (Liu et al., 2025).
This dataset was developed to address these challenges by unifying scattered sources into a single, reliable, machine-readable corpus. A scarcity of high-quality data is a significant impediment to computational research; a unified resource of this kind enables large-scale quantitative inquiries into stylometry, authorship, and linguistic evolution. This paper describes the dataset, which has been used in various digital humanities explorations and Natural Language Processing (NLP) projects.
(2) Method
Steps
The creation of the Song Ci Corpus involved a multi-stage methodology centred on aggregation, cleaning, and standardization.
Source Aggregation: The foundational stage consisted of the large-scale collection of raw poetic texts from a wide array of existing public-domain sources. These included community-driven projects (such as the chinese-poetry GitHub repository; Github, 2025), public digital libraries, and digitized book editions. Aggregating across these heterogeneous sources was essential for approaching comprehensive coverage beyond that of any single antecedent collection.
Data Structuring and Cleaning: Computational scripts were used to parse the aggregated texts and structure them into a consistent schema. Essential metadata—namely the author’s name, the cipai (rhythmic title), and the poem body—were systematically extracted, and obvious non-poetic artifacts introduced during web scraping (e.g., navigation strings) were removed from the ‘content’ field.
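As an illustration of this structuring step, the following minimal Python sketch maps one raw record onto the author, rhythmic, and content fields and strips interface remnants. It is not the script used to build the corpus: the input keys (`author`, `rhythmic`, `paragraphs`), the function name `to_row`, and the list of navigation strings are all assumptions chosen for the example.

```python
import re

# Hypothetical interface remnants; the actual strings removed from the
# corpus are not listed in this paper.
ARTIFACT_PATTERN = re.compile(r"上一篇|下一篇|返回目录")


def to_row(raw: dict) -> dict:
    """Map one raw record onto the author / rhythmic / content schema.

    Assumes the raw record has "author", "rhythmic" and "paragraphs" keys,
    where "paragraphs" is a list of text lines.
    """
    cleaned = (
        ARTIFACT_PATTERN.sub("", line).strip()
        for line in raw.get("paragraphs", [])
    )
    return {
        "author": raw.get("author", "").strip(),
        "rhythmic": raw.get("rhythmic", "").strip(),
        # Lines that are empty after cleaning are dropped before joining.
        "content": "".join(line for line in cleaned if line),
    }


# Toy record with stray whitespace and one navigation string.
example = to_row({
    "author": " 苏轼 ",
    "rhythmic": "水调歌头",
    "paragraphs": ["明月几时有?把酒问青天。", "返回目录"],
})
```

Keeping the artifact patterns in one compiled expression makes it easy to extend the list as new remnants are found during spot checks.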
Standardization and Formatting: Substantial effort was dedicated to normalizing character encoding to UTF-8, resolving basic formatting irregularities, and deduplicating texts based on matching author and full-text fields. The final, cleaned corpus is released as a single authoritative SQLite database (‘ci_curated.db’), from which any auxiliary formats (such as JSON exports) can be generated as needed.
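The normalization and deduplication described above could be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the released pipeline: the GB18030 fallback is an assumed legacy encoding, and the function names `decode_to_text` and `deduplicate` are hypothetical. The deduplication key follows the text: an exact match on author and full poem text.

```python
def decode_to_text(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to GB18030.

    The fallback encoding is an assumption for this sketch; the source
    encodings actually encountered are not documented in this paper.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("gb18030")


def deduplicate(rows: list) -> list:
    """Keep the first row for each exact (author, content) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["author"], row["content"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Because the key requires an exact match, two transcriptions of the same poem that differ by a single character or punctuation mark are both retained under this approach.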
Sampling strategy
A statistical sampling strategy was not employed in the creation of this corpus. Rather, the objective was comprehensive inclusion. The methodology entailed collecting all available Song Ci poems from the aggregated sources to construct a corpus designed to represent the body of known extant works from the period.
Quality control
Quality control combined automated and manual procedures. Script-based checks were used to identify formatting anomalies, duplicated records, and obvious non-poetic interface remnants, which were then cleaned from the SQLite corpus. Spot checks were performed on sampled poems and author entries to verify the integrity of the extraction and cleaning pipeline. While the upstream digital collections are community-driven, the curation, cleaning, and structuring of this specific release were performed by the author.
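Script-based checks of this kind can be expressed as short SQL queries. The sketch below is a hedged illustration rather than the checks actually run: it assumes a `ci` table with `id`, `author`, `rhythmic`, and `content` columns, uses hypothetical function names, and runs against a small in-memory stand-in instead of ‘ci_curated.db’.

```python
import sqlite3


def duplicate_groups(conn):
    """Return (author, content, count) for poems stored more than once."""
    return conn.execute(
        "SELECT author, content, COUNT(*) FROM ci "
        "GROUP BY author, content HAVING COUNT(*) > 1"
    ).fetchall()


def empty_content_ids(conn):
    """Return ids of records whose content is missing or blank."""
    cursor = conn.execute(
        "SELECT id FROM ci WHERE content IS NULL OR TRIM(content) = ''"
    )
    return [row[0] for row in cursor]


def spot_check_sample(conn, n=5):
    """Draw n random records for manual inspection."""
    return conn.execute(
        "SELECT id, author, rhythmic, content FROM ci "
        "ORDER BY RANDOM() LIMIT ?",
        (n,),
    ).fetchall()


# Toy in-memory data (assumed schema) so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ci (id INTEGER PRIMARY KEY, author TEXT, "
    "rhythmic TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO ci (author, rhythmic, content) VALUES (?, ?, ?)",
    [
        ("苏轼", "水调歌头", "明月几时有?把酒问青天。"),
        ("苏轼", "水调歌头", "明月几时有?把酒问青天。"),  # deliberate duplicate
        ("辛弃疾", "青玉案", ""),  # deliberately empty content
    ],
)
```

To run the same checks against the released file, the in-memory connection would be replaced by `sqlite3.connect("ci_curated.db")`, provided the column names match those assumed here.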
(3) Dataset Description
Repository name
Zenodo
Object name
The repository contains a single authoritative data file for this release: ‘ci_curated.db’ (a curated SQLite database of poems and author metadata).
Format names and versions
SQLite (‘.db’). Any JSON or other exports are treated as convenience formats generated from this database.
Creation dates
The current version of the data reflects contributions up to the present.
Language
The data is presented in Classical Chinese.
License
The textual data contained in ‘ci_curated.db’ is dedicated to the public domain under a CC0 1.0 Universal license to maximize scholarly reuse. Any accompanying code or scripts in the repository are released under the MIT License.
Publication date
The dataset is a living collection and is updated periodically.
A sample row from the SQLite database (simplified) illustrates the schema for Su Shi’s Shuidiao Getou:
ci table (poems):
| ID | AUTHOR | RHYTHMIC | CONTENT |
|---|---|---|---|
| 123 | 苏轼 | 水调歌头 | 明月几时有?把酒问青天。… |
A separate row in the ‘ciauthor’ table stores Su Shi’s biographical information (‘name’, ‘long_desc’, and optional ‘short_desc’) and can be joined to the poem records via the author’s name.
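The join between poems and author biographies can be sketched with Python's built-in `sqlite3` module. The table and column names follow the sample above; the function name `poems_with_biography`, the placeholder biography text, and the in-memory stand-in database are assumptions made so that the example is self-contained.

```python
import sqlite3


def poems_with_biography(conn, author):
    """Return poems by one author together with the author's long_desc."""
    query = """
        SELECT ci.id, ci.author, ci.rhythmic, ci.content, ciauthor.long_desc
        FROM ci
        LEFT JOIN ciauthor ON ciauthor.name = ci.author
        WHERE ci.author = ?
    """
    return conn.execute(query, (author,)).fetchall()


# Toy in-memory stand-in for ci_curated.db (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ci (id INTEGER PRIMARY KEY, author TEXT, "
    "rhythmic TEXT, content TEXT)"
)
conn.execute(
    "CREATE TABLE ciauthor (name TEXT, long_desc TEXT, short_desc TEXT)"
)
conn.execute(
    "INSERT INTO ci (id, author, rhythmic, content) VALUES (?, ?, ?, ?)",
    (123, "苏轼", "水调歌头", "明月几时有?把酒问青天。"),
)
conn.execute(
    "INSERT INTO ciauthor (name, long_desc) VALUES (?, ?)",
    ("苏轼", "placeholder biography"),
)

rows = poems_with_biography(conn, "苏轼")
```

A `LEFT JOIN` is used so that poems whose author has no biography row are still returned, with `long_desc` set to `NULL`. Because the link is the author's name rather than a dedicated identifier, authors who share a name cannot be distinguished by this query.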
(4) Reuse Potential
The corpus’s primary value lies in its potential for reuse in comparative and exploratory research on classical Chinese poetry. Its scale and unified schema make it a useful resource for training and evaluating NLP models for tasks such as authorship attribution, topic modeling, and poetry generation (Taylor and Attridge, 1998). The coverage of over 20,000 poems enables large-scale stylometric analyses of lexical diversity, authorial style, and the evolution of the Ci form. The curated linkage between poems and author biographies also supports author-centric studies and digital pedagogy tools for classical Chinese (Fuller, 2018).
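As one example of such a stylometric measure, the sketch below computes a character-level type-token ratio (distinct characters divided by total characters) per author. It is an illustration only: the function name, the punctuation set, and the choice of characters rather than segmented words as tokens are assumptions, and the input is expected as (author, content) pairs such as those returned by a query on the poems table.

```python
from collections import defaultdict

# Assumed set of punctuation and whitespace to ignore when counting.
PUNCTUATION = set(",。?!、;:“”‘’()《》…,.?!;: \n")


def type_token_ratio_by_author(rows):
    """Compute distinct characters / total characters for each author.

    rows is an iterable of (author, content) pairs.
    """
    chars = defaultdict(list)
    for author, content in rows:
        chars[author].extend(c for c in content if c not in PUNCTUATION)
    # Authors with no countable characters are skipped to avoid division by zero.
    return {
        author: len(set(cs)) / len(cs)
        for author, cs in chars.items()
        if cs
    }
```

The type-token ratio falls as the amount of text grows, so comparisons between authors with very different output sizes would need equal-sized samples or a length-corrected measure.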
However, users should be aware of several limitations that constrain certain use cases. First, while extensive, the collection is not exhaustive of every extant Song Ci poem. Second, despite cleaning and normalization efforts, the data may still contain transcription errors or inconsistencies inherited from its original sources. Finally, biographical metadata and standardized author identifiers remain incomplete, which limits the suitability of the corpus for tasks such as constructing fully reliable social networks of poets without substantial external data enrichment.
Competing Interests
The author has no competing interests to declare.
Author Contributions
Yao SONG: Conceptualization, Data Curation, Writing (Original Draft; Review & Editing)
