Abstract
Owing to a long tradition of valuing historical records, pre-modern China produced a wide range of historical narratives beyond the Official History. Computational analysis of these narratives, however, is hampered by the historical, regional, and orthographic diversity of classical Chinese characters.
We release a dataset of variant-to-representative character mapping pairs intended for the computational analysis of historical narratives produced over ten centuries, from the Tang to the Qing dynasties. The dataset was constructed by collecting variant character sets from authoritative sources, filtering out duplicates and non-target sets, and selecting a representative character for each set by applying prioritization principles. The resulting dataset contains 2,723 structured variant-to-representative character pairs, providing consistency and usability for linguistic research and text processing.
The dataset is highly effective for text similarity detection and text classification of historical narratives. It can also serve as a versatile tool for extending such analyses to pre-modern Korean and Japanese historical narratives, which exhibit strong intertextual connections with Chinese historical records.
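
As a minimal sketch of how such mapping pairs might be used before similarity detection, the Python example below replaces each variant character with its representative and then compares two passages by character-bigram overlap. The three pairs shown, the choice of which form counts as representative, the function names, and the Jaccard measure are all assumptions made for illustration; they are not taken from the dataset or from the method described here.

```python
# Hypothetical variant -> representative pairs, for illustration only.
# The actual dataset contains 2,723 pairs, and its choice of
# representative forms may differ from the one assumed here.
VARIANT_TO_REPRESENTATIVE = {
    "爲": "為",
    "峯": "峰",
    "羣": "群",
}

# Build the translation table once; every key is a single character.
_TABLE = str.maketrans(VARIANT_TO_REPRESENTATIVE)


def normalize(text: str) -> str:
    """Replace each variant character with its representative form."""
    return text.translate(_TABLE)


def char_bigrams(text: str) -> set:
    """Return the set of overlapping character bigrams in the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over character bigrams (an assumed measure)."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


if __name__ == "__main__":
    # Two made-up strings that differ only in character variants.
    passage_a = "羣峯爲首"
    passage_b = "群峰為首"
    print(jaccard(passage_a, passage_b))
    print(jaccard(normalize(passage_a), normalize(passage_b)))
```

In this toy case the two strings share no bigrams until the variants are mapped to a common form, after which they become identical. This is the kind of mismatch that normalization is meant to remove.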
