
A Variant Character Dataset for Historical Narratives of Middle and Late Imperial China

By: Jiwon Lee and Youngim Jung
Open Access | May 2025

Abstract

Due to a tradition of valuing historical records, pre-modern China developed a wide range of historical narratives, not limited to the Official History. Computational analysis of Chinese historical narratives is hampered by the historical, regional, and orthographic diversity of classical Chinese characters.

We release a dataset of variant-to-representative character mapping pairs intended for the computational analysis of historical narratives produced over ten centuries, from the Tang to the Qing dynasties. The dataset was constructed by collecting variant character sets from authoritative sources, filtering out duplicates and non-target sets, and selecting a representative character for each set by applying prioritization principles. The result is 2,723 structured variant-to-representative character pairs, which support consistency and usability in linguistic research and text processing.
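The construction steps described above (collect variant sets, remove duplicates, pick a representative) can be illustrated with a minimal Python sketch. The function name `build_pairs`, the sample variant sets, and the `priority` ranking are all hypothetical; the actual prioritization principles used for the dataset are not specified here, so a simple rank table stands in for them.

```python
def build_pairs(variant_sets, priority):
    """Return a {variant: representative} mapping from raw variant sets."""
    # Drop duplicate sets (same members in any order) and single-member sets.
    unique_sets = {frozenset(s) for s in variant_sets if len(set(s)) > 1}

    pairs = {}
    for chars in sorted(unique_sets, key=sorted):
        # Hypothetical rule: the lowest rank in `priority` wins;
        # unranked characters come last, ties are broken by code point.
        rep = min(chars, key=lambda c: (priority.get(c, len(priority)), ord(c)))
        for c in chars:
            if c != rep:
                pairs[c] = rep
    return pairs


# Illustrative input only, not taken from the released dataset.
variant_sets = [["為", "爲"], ["爲", "為"], ["群", "羣"], ["峰"]]
priority = {"為": 0, "群": 0}

pairs = build_pairs(variant_sets, priority)
```

In this sketch the duplicate set and the single-member set are discarded, and each remaining variant is mapped to exactly one representative character.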

This dataset is well suited to text similarity detection and text classification of historical narratives. In addition, it can serve as a versatile tool for extending such analyses to pre-modern Korean and Japanese historical narratives, which exhibit strong intertextual connections with Chinese historical records.
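As a rough illustration of how such a mapping might be used for similarity detection, the following Python sketch normalizes two strings before comparing them. The two mapping entries, the sample strings, and the helper names `normalize` and `bigram_jaccard` are assumptions made for this example; character-bigram Jaccard similarity is only one of many possible measures and is not claimed to be the method used by the authors.

```python
# Two illustrative entries; the released dataset contains 2,723 pairs.
pairs = {"爲": "為", "羣": "群"}
table = str.maketrans(pairs)


def normalize(text):
    """Replace every variant character with its representative."""
    return text.translate(table)


def bigram_jaccard(a, b):
    """Jaccard similarity over the sets of character bigrams of a and b."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


# Made-up sample strings that differ only in variant characters.
a = "羣臣皆以爲然"
b = "群臣皆以為然"

raw_score = bigram_jaccard(a, b)
normalized_score = bigram_jaccard(normalize(a), normalize(b))
```

Without normalization, the variant characters cause otherwise identical passages to share few bigrams; after normalization the two strings match exactly.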

DOI: https://doi.org/10.5334/johd.325 | Journal eISSN: 2059-481X
Language: English
Submitted on: Mar 12, 2025
Accepted on: Apr 19, 2025
Published on: May 21, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Jiwon Lee, Youngim Jung, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.