A Variant Character Dataset for Historical Narratives of Middle and Late Imperial China

Jiwon Lee; Youngim Jung

doi:10.5334/johd.325

Abstract

Due to a tradition of valuing historical records, pre-modern China developed a wide range of historical narratives, not limited to the Official History. Computational analysis of Chinese historical narratives is hampered by the historical, regional, and orthographic diversity of classical Chinese characters.

We release a dataset of variant-to-representative character mapping pairs intended for the computational analysis of historical narratives produced over ten centuries, from the Tang to the Qing dynasties. The dataset was constructed by collecting variant character sets from authoritative sources, filtering out duplicates and non-target sets, and selecting a representative character for each set by applying prioritization principles. The result is 2,723 structured variant-to-representative character pairs that ensure consistency and usability for linguistic research and text processing.
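The construction steps described above (collect, filter, select a representative) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `is_target` filter, and the idea of ranking candidates by an ordered list of standard character tables are assumptions, and the sample characters are illustrative rather than taken from the dataset.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Set


def choose_representative(chars: FrozenSet[str], priority_tables: List[Set[str]]) -> str:
    """Pick the first character found in the highest-priority table (hypothetical rule)."""
    for table in priority_tables:
        hits = sorted(c for c in chars if c in table)
        if hits:
            return hits[0]
    # Fallback when no table contains any member: lowest code point.
    return min(chars)


def build_pairs(
    variant_sets: Iterable[Iterable[str]],
    priority_tables: List[Set[str]],
    is_target: Callable[[FrozenSet[str]], bool] = lambda s: True,
) -> Dict[str, str]:
    """Turn variant sets into variant -> representative pairs."""
    pairs: Dict[str, str] = {}
    seen: Set[FrozenSet[str]] = set()
    for chars in variant_sets:
        key = frozenset(chars)
        # Drop duplicates, single-character sets, and non-target sets.
        if len(key) < 2 or key in seen or not is_target(key):
            continue
        seen.add(key)
        rep = choose_representative(key, priority_tables)
        for ch in sorted(key):
            if ch != rep:
                pairs[ch] = rep
    return pairs


# Illustrative input only: two real variant pairs, one duplicate, one singleton.
sample_sets = [("為", "爲"), ("爲", "為"), ("真", "眞"), ("教",)]
sample_tables = [{"為", "真"}]  # stand-in for one standard character table
pairs = build_pairs(sample_sets, sample_tables)
print(pairs)
```

Keeping the priority tables as an ordered list makes the selection rule explicit and easy to change without touching the filtering logic.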

This dataset is highly effective for text similarity detection and text classification of historical narratives. It can also serve as a versatile tool for extending such analyses to pre-modern Korean and Japanese historical narratives, which exhibit strong intertextual connections with Chinese historical records.
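As a rough illustration of how such a mapping might support text similarity detection, the sketch below normalizes two strings with a small variant-to-representative table before comparing them. The table `PAIRS`, the helper names, and the choice of character-bigram Jaccard similarity are assumptions made for this example; the dataset's actual file format and the authors' similarity method are not specified here.

```python
from typing import Dict, Set

# Hypothetical excerpt of a variant -> representative mapping (illustrative only).
PAIRS: Dict[str, str] = {"爲": "為", "眞": "真"}


def normalize(text: str, pairs: Dict[str, str]) -> str:
    """Replace each variant character with its representative character."""
    return text.translate(str.maketrans(pairs))


def bigrams(text: str) -> Set[str]:
    """Return the set of overlapping character bigrams in the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over character bigrams."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


# The same phrase written with two variant forms of one character.
text_a = "天下爲公"
text_b = "天下為公"

raw_score = jaccard(text_a, text_b)
normalized_score = jaccard(normalize(text_a, PAIRS), normalize(text_b, PAIRS))
print(raw_score, normalized_score)
```

Without normalization the two strings share only one of five distinct bigrams; after mapping the variant to its representative they become identical, which is the effect a variant mapping is meant to have on reuse detection.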

