A Variant Character Dataset for Historical Narratives of Middle and Late Imperial China

By: Jiwon Lee and Youngim Jung
Open Access | May 2025

(1) Overview

Repository location

DOI: 10.5281/zenodo.14949172

Context

This dataset was developed as part of the project “A Study on Historical Romance in 16–17th Century China through Text Mining” (NRF-2023S1A5B5A16080679), supported by the National Research Foundation of Korea (NRF). The ultimate goal of this project is to analyze historical narratives from the Ming and Qing dynasties using text mining techniques, and this variant character dataset was created as a foundational resource for that purpose. Classical Chinese texts are notorious for containing numerous variant characters, which can affect computational analysis: when the same word appears in different variant forms, searches or algorithms may fail to recognize their connection. Modern encoding standards and OCR technology can exacerbate this issue. For instance, some rare variant characters in historical texts may not be assigned a Unicode code point, forcing digitization projects to substitute images in place of text. Conversely, characters that share the same meaning, pronunciation, and abstract shape and differ only in style are sometimes assigned distinct code points. For example, 説 (U+8AAC) and 說 (U+8AAA) are encoded separately because of stylistic differences, even though they are essentially the same character. While these multi-coded characters help preserve the original text forms, they introduce complexities in computational analysis. Additionally, OCR software often confuses similarly shaped characters, treating them as identical (Anderl, 2020). Numerous cases have been found in which 玄 (U+7384) is transcribed as 𤣥 (U+248E5), or 會 (U+6703) as 㑹 (U+3479), in digital transcriptions. Digital encoding issues and OCR errors arising from variant Chinese characters undermine the reliability of text mining results produced by researchers of Classical Chinese texts.
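The search problem described above can be illustrated with a minimal Python sketch. It uses only the two code points cited in the text (U+8AAC and U+8AAA); the two-character sample strings are made up for illustration and are not taken from the corpus.

```python
# The two stylistic forms cited in the text, written as escapes so that
# the code points are explicit.
FORM_A = "\u8aac"  # U+8AAC
FORM_B = "\u8aaa"  # U+8AAA

# Hypothetical transcription that uses one form, and a query typed with the other.
text = "\u5c0f" + FORM_A   # U+5C0F followed by U+8AAC
query = "\u5c0f" + FORM_B  # U+5C0F followed by U+8AAA

# A plain substring search compares code points, so the variant is missed.
found = query in text

print([f"U+{ord(c):04X}" for c in text])   # code points of the transcription
print([f"U+{ord(c):04X}" for c in query])  # code points of the query
print(found)
```

Because the comparison is made code point by code point, the two strings are treated as unrelated unless the variants are mapped to a common form first.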

To mitigate these problems, the Unicode Consortium has introduced datasets and standards to map relationships among variant characters (Unicode Consortium, 2024). However, these resources are not sufficient for the text analysis of Classical Chinese literature in the following respects: first, they remain small in scale and exclude many commonly used variant characters, especially those frequently encountered in real-world digital transcription; second, the types and forms of variant characters are so diverse that achieving a scholarly consensus on their classification remains challenging; finally, variant-to-representative character mapping tailored to historical periods and genres has not been considered at all. Prior studies employing text mining for Chinese classical texts acknowledged the issue of variant characters but did not offer specific solutions to it. A recent study has highlighted the limitations of the existing method, which simply replaces variant characters in Chinese classical texts, and has attempted context-based normalization. However, because texts differ from one another, the rules learned from one text do not yet transfer well to other texts (Kessler, 2024).

Accordingly, there is a need for expanded datasets that address variant characters in particular historical periods and genres. This dataset is based on the Unicode Consortium’s Unihan Variants but has been designed and expanded to support a range of digital humanities research on historical texts from China’s Middle and Late Imperial periods. An initial version of the dataset was used to examine how the Ming Dynasty historical romance The Story of Sui and Tang Dynasties reused textual content from the Song Dynasty historical work The Outline and Digest of the Comprehensive Mirror for Aid in Government, thereby shedding light on the intertextuality between the two texts (Lee, 2024).

(2) Method

The dataset was constructed through the steps below, as illustrated in Figure 1.

Figure 1. Construction Process & Basic Datasets, Principles.

Step 1. Data Collection

This step involves gathering variant character data from multiple authoritative sources:

  • Unihan Variants (Unicode Consortium): It contains 17,919 sets of variant-to-variant characters across six fields: kSemanticVariant, kSimplifiedVariant, kSpecializedSemanticVariant, kSpoofingVariant, kTraditionalVariant, and kZVariant. However, since duplicates exist in the form of both ‘A-B’ and ‘B-A’ and some sets appear in multiple fields, the actual number is considerably lower.

  • Comparison Table of Standard Characters, Traditional Characters, and Variant Characters (State Council of the PRC): It is included as an appendix to the Table of General Standard Chinese Characters, promulgated by the State Council of the People’s Republic of China (2013), from which 1,020 pairs of traditional-to-variant characters can be obtained.

  • Variants Extracted from Historical Narratives: It is created by extracting variant characters that frequently appear in the target corpus (official histories, quasi-histories and historical romances of middle and late imperial China), provided by Wikisource and Kanripo, two prominent open repositories for Chinese classical literature. It includes 141 pairs of variant-to-representative characters.
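The remark above about ‘A-B’ and ‘B-A’ duplicates can be sketched in Python as follows. This is a minimal sketch, not the authors' pipeline: the sample lines are modelled on the tab-separated layout of Unihan_Variants.txt (code point, field name, space-separated targets with optional `<source` tags), but the relationships themselves are illustrative and were not copied from the file. The name `parse_pairs` is hypothetical.

```python
# Illustrative lines in the style of Unihan_Variants.txt (assumed sample data).
SAMPLE = """\
U+90F7\tkZVariant\tU+9109
U+9109\tkZVariant\tU+90F7
U+5FB7\tkSemanticVariant\tU+60B3<kMatthews U+60EA<kMatthews
U+60B3\tkSemanticVariant\tU+5FB7<kMatthews
"""

def parse_pairs(text):
    """Collect unordered variant pairs, so 'A-B' and 'B-A' count once."""
    pairs = set()
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, _field, targets = line.split("\t")
        for target in targets.split():
            target = target.split("<", 1)[0]  # drop the source tag, if any
            if target != source:
                # frozenset ignores direction; the field name is dropped,
                # so a pair listed under several fields also counts once.
                pairs.add(frozenset((source, target)))
    return pairs

pairs = parse_pairs(SAMPLE)
print(len(pairs))  # five directed entries collapse to three unordered pairs
```

Storing each pair as a `frozenset` is one simple way to make the count independent of direction and of the field in which the pair appears.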

Step 2. Data Refinement

Once the raw data is collected, the next step is to refine the dataset by performing the following steps:

  • Deleting Non-Target Sets: Some variant relationships are irrelevant for our target corpus. Because our target corpus consists of pre-modern texts, we removed variant sets belonging to the kTraditionalVariant and kSimplifiedVariant fields in the Unihan Variants. Additionally, certain variant relationships only hold under specific semantic conditions, making them unlikely to appear in the target corpus; these sets were also filtered out.

  • Removing Duplicates: Since data is collected from multiple sources, many variant pairs overlap. Deduplication ensures that each pair is listed only once.

  • Integrating Variant Relationships: Variant pairs or sets are scattered across multiple sources, so we integrated them into a unified set.

As a result, 2,036 sets were obtained. As in the original sources, most sets consist of two characters, but some sets include up to eight characters, as shown in Table 1.

Table 1. Example of Variant Sets in Chinese Characters.

| EXAMPLES | # VARIANTS |
| --- | --- |
| U+90F7, U+9109 | 2 |
| U+5FB7, U+60B3, U+60EA | 3 |
| U+25128 (𥄨), U+2C461 (𬑡), U+7785, U+77C1 | 4 |
| U+3A57, U+3A66, U+643A, U+64D5, U+651C | 5 |
| U+21ED5 (𡻕), U+4E97, U+5C81, U+5D57, U+6B72, U+6B73 | 6 |
| U+25997 (𥦗), U+41AB, U+724E, U+7255, U+7A93, U+7A97, U+7ABB | 7 |
| U+21A34 (𡨴), U+21A4B (𡩋), U+5BCD, U+5BD5, U+5BD7, U+5BDC, U+5BE7, U+752F | 8 |
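The refinement in Step 2 (dropping non-target fields, then integrating scattered pairs into unified sets) can be sketched with a union-find structure. This is a minimal sketch under stated assumptions: the `records` list is hypothetical sample data, the function name `merge_sets` is illustrative, and the filtering of relationships that hold only under specific semantic conditions is not modelled.

```python
# Fields removed because the target corpus is pre-modern (see Step 2).
EXCLUDED_FIELDS = {"kTraditionalVariant", "kSimplifiedVariant"}

# Hypothetical (field, character A, character B) records from several sources.
records = [
    ("kZVariant", "U+5FB7", "U+60B3"),
    ("kSemanticVariant", "U+60B3", "U+60EA"),
    ("kZVariant", "U+90F7", "U+9109"),
    ("kSimplifiedVariant", "U+8AAA", "U+8BF4"),  # dropped by the field filter
]

def merge_sets(records):
    """Merge pairwise relations into variant sets (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for field, a, b in records:
        if field in EXCLUDED_FIELDS:
            continue
        parent[find(a)] = find(b)  # union the two components

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(group) for group in groups.values())

variant_sets = merge_sets(records)
print(variant_sets)
```

With this approach, a pair A-B from one source and a pair B-C from another end up in a single three-member set, which is how sets larger than two characters can arise.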

Selection strategy of representative Chinese character

To determine the most representative character, a set of prioritization principles is applied, as illustrated in Figure 2.

  • The first priority (labeled TW1) is any character included in the Table of Standard Forms of Common National Characters (4,808 characters), published by the Ministry of Education of the Republic of China (1982). If multiple characters within a single variant set meet this first priority, then the Basic Chinese Characters for Education defined by the Ministry of Education of the Republic of Korea (2007; levels 1 and 2, each containing 900 characters—labeled KO1 and KO2, respectively) are consulted in that order to select the representative character.

  • The second priority (labeled TW2) is any character included in the Table of Standard Forms of Less-Common National Characters (6,329 characters) from the Ministry of Education of the Republic of China (2017).

  • The third priority (labeled CJK) is any character in the CJK Unified Ideographs range (U+4E00 ~ U+9FFF).

  • The fourth priority (labeled CJKE) is any character in the CJK Unified Ideographs Extension ranges.

Figure 2. Representative Selection Process.
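The priority order described above can be sketched in Python as follows. This is a hedged sketch, not the authors' implementation: the sets `TW1`, `KO1`, `KO2`, and `TW2` are tiny stand-ins filled only with the code points marked "o" in Table 2 (the real tables contain thousands of characters), the function name is hypothetical, any character outside U+4E00–U+9FFF is treated as the CJKE fallback, and remaining ties are resolved by simply taking the first candidate rather than by structural simplicity.

```python
def chars(*code_points):
    return [chr(cp) for cp in code_points]

# Stand-in membership tables (assumption: only the marks visible in Table 2).
TW1 = set(chars(0x651C, 0x5CA9, 0x5DD6, 0x5384, 0x5443, 0x9628))
KO1 = set(chars(0x5DD6))
KO2 = set(chars(0x5384))
TW2 = set(chars(0x7481))

def select_representative(variants):
    """Return (representative, label) for one variant set."""
    in_tw1 = [c for c in variants if c in TW1]
    if len(in_tw1) > 1:
        # Several TW1 candidates: consult KO1, then KO2, in that order.
        for label, table in (("KO1", KO1), ("KO2", KO2)):
            hits = [c for c in in_tw1 if c in table]
            if hits:
                return hits[0], label
    if in_tw1:
        return in_tw1[0], "TW1"  # placeholder tie-break: first candidate
    in_tw2 = [c for c in variants if c in TW2]
    if in_tw2:
        return in_tw2[0], "TW2"
    in_cjk = [c for c in variants if 0x4E00 <= ord(c) <= 0x9FFF]
    if in_cjk:
        return in_cjk[0], "CJK"
    return variants[0], "CJKE"  # simplified fallback for extension ranges

groups = [
    chars(0x3A57, 0x3A66, 0x643A, 0x64D5, 0x651C),
    chars(0x55A6, 0x5CA9, 0x5D52, 0x5DD6, 0x5DD7),
    chars(0x5384, 0x5443, 0x6239, 0x9628, 0x9638),
    chars(0x2497C, 0x2498F, 0x249DA, 0x2ED28, 0x7481),
    chars(0x2EC60, 0x3766, 0x5BEF),
]
results = []
for group in groups:
    rep, label = select_representative(group)
    results.append((f"U+{ord(rep):04X}", label))
print(results)
```

The five groups are the variant sets listed in Table 2, so the printed labels can be compared directly against the TYPE column there.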

Table 2 presents practical examples of how a representative character is selected from among the variants.

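Table 2. Example of Representative Selection.

| CHARACTER | UNICODE | TW1 | KO1 | KO2 | TW2 | CJK | REPRESENTATIVE SELECTION | TYPE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | U+3A57 | x | | | | | | TW1 |
| | U+3A66 | x | | | | | | TW1 |
| | U+643A | x | | | | | | TW1 |
| | U+64D5 | x | | | | | | TW1 |
| | U+651C | o | | | | | o | TW1 |
| | U+55A6 | x | x | | | | | KO1 |
| | U+5CA9 | o | x | | | | | KO1 |
| | U+5D52 | x | x | | | | | KO1 |
| | U+5DD6 | o | o | | | | o | KO1 |
| | U+5DD7 | x | x | | | | | KO1 |
| | U+5384 | o | x | o | | | o | KO2 |
| | U+5443 | o | x | x | | | | KO2 |
| | U+6239 | x | x | x | | | | KO2 |
| | U+9628 | o | x | x | | | | KO2 |
| | U+9638 | x | x | x | | | | KO2 |
| 𤥼 | U+2497C | x | x | x | x | | | TW2 |
| 𤦏 | U+2498F | x | x | x | x | | | TW2 |
| 𤧚 | U+249DA | x | x | x | x | | | TW2 |
| 𮴨 | U+2ED28 | x | x | x | x | | | TW2 |
| | U+7481 | x | x | x | o | | o | TW2 |
| 𮱠 | U+2EC60 | x | x | x | x | x | | CJK |
| | U+3766 | x | x | x | x | x | | CJK |
| | U+5BEF | x | x | x | x | o | o | CJK |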

Mainland Chinese standards are not used as a selection principle because Mainland China employs simplified characters derived from pre-modern Chinese forms, which differ from the standardized traditional forms adopted in Taiwan and Korea. In rare cases where multiple candidates remain within the same priority level, the character with greater structural simplicity is chosen as the representative. When this tie-breaking rule is applied, we append “-1” to the relevant label (e.g., “TW1-1,” “TW2-1”) to indicate that structural simplicity served as the deciding factor. After the representative characters are selected, each set is organized in the form of “one representative character + multiple variants,” ensuring that every variant in the dataset is assigned a unique representative character, as shown in Table 3.

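Table 3. Variant-to-Representative Dataset.

| VARIANT | UNICODE | REPRESENTATIVE | UNICODE | TYPE | MATCHING TYPE |
| --- | --- | --- | --- | --- | --- |
| | U+90F7 | | U+9109 | TW1 | one-to-one |
| 𡨴 | U+21A34 | | U+5BE7 | TW1 | many-to-one |
| 𡩋 | U+21A4B | | U+5BE7 | TW1 | many-to-one |
| | U+5BCD | | U+5BE7 | TW1 | many-to-one |
| | U+5BD5 | | U+5BE7 | TW1 | many-to-one |
| | U+5BD7 | | U+5BE7 | TW1 | many-to-one |
| | U+5BDC | | U+5BE7 | TW1 | many-to-one |
| | U+752F | | U+5BE7 | TW1 | many-to-one |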

Outcome: Variant-to-Representative Dataset

After reviewing which variants are actually associated with each selected representative character, the dataset is ultimately composed of a collection of 1:1 pairs connecting each variant to its representative character. The completed dataset consists of 2,723 variant-to-representative character mapping pairs, with each pair explicitly identifying the variant and its corresponding representative character.
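How a variant set with a chosen representative unfolds into 1:1 pairs can be sketched as follows. This is a minimal sketch using the two sets shown in Table 3; the function name `to_pairs` and the row layout are assumptions for illustration and do not reflect the column names of the published Excel file.

```python
from collections import Counter

# (members of the variant set, representative) as code points, from Table 3.
sets_with_rep = [
    ([0x90F7, 0x9109], 0x9109),
    ([0x21A34, 0x21A4B, 0x5BCD, 0x5BD5, 0x5BD7, 0x5BDC, 0x5BE7, 0x752F], 0x5BE7),
]

def to_pairs(sets_with_rep):
    """Map every non-representative member to its representative."""
    pairs = {}
    for members, rep in sets_with_rep:
        for cp in members:
            if cp == rep:
                continue  # the representative is not its own variant
            if cp in pairs and pairs[cp] != rep:
                # Each variant must point to exactly one representative.
                raise ValueError(f"U+{cp:04X} has two representatives")
            pairs[cp] = rep
    return pairs

pairs = to_pairs(sets_with_rep)
fan_in = Counter(pairs.values())  # number of variants per representative
rows = [
    (f"U+{v:04X}", f"U+{r:04X}",
     "one-to-one" if fan_in[r] == 1 else "many-to-one")
    for v, r in pairs.items()
]
for row in rows:
    print(row)
```

A set of n characters contributes n - 1 pairs under this scheme, and the uniqueness check mirrors the requirement that every variant has a single representative.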

(3) Dataset Description

Repository name

Zenodo

Object name

A Variant Character Dataset for Historical Narratives of Middle and Late Imperial China

Format names and versions

Excel

Creation dates

2024-03-01~2025-03-12

Dataset creators

Jiwon Lee (Data Curator), Youngim Jung (Supervisor)

Language

Chinese, English (for variable names)

License

CC0

Publication date

2025-03-11

This dataset is deposited in Zenodo and publicly available at https://zenodo.org/records/14949172 under a CC0 license; reuse should be accompanied by an appropriate citation. The dataset can be read, edited, and saved with Microsoft Excel, Google Sheets, LibreOffice Calc, Apache OpenOffice Calc, or any other application compatible with Excel files.

(4) Reuse Potential

This dataset can be directly employed in the analysis of historical narratives from Middle and Late Imperial China. One of its primary applications involves substituting variant characters with their representative characters during text data preprocessing, thereby improving the accuracy of various analytical tasks (Lee, 2024). Moreover, it can be used in linguistic studies that compare the frequency and characteristics of variant characters across different periods, regions, and genres, and it also serves as a shared resource for interdisciplinary research in areas rich in Classical Chinese texts, including Korea, Japan, and Vietnam.
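The preprocessing use described above, substituting variants with their representatives, can be sketched with Python's built-in `str.translate`. This is a minimal sketch: the mapping holds only three pairs taken from Table 3, whereas in practice it would be loaded from the published file, whose column layout is not assumed here. The sample string and the function name `normalize` are illustrative.

```python
# Variant code point -> representative code point (three pairs from Table 3).
PAIRS = {
    0x90F7: 0x9109,
    0x21A34: 0x5BE7,
    0x752F: 0x5BE7,
}

def normalize(text):
    """Replace each variant with its representative; count the substitutions."""
    # str.translate accepts a dict keyed by code point, so characters
    # outside the Basic Multilingual Plane are handled the same way.
    normalized = text.translate(PAIRS)
    changed = sum(1 for before, after in zip(text, normalized) if before != after)
    return normalized, changed

# Made-up sample: two variants around one unaffected character (U+4EBA).
sample = "".join(chr(cp) for cp in (0x90F7, 0x4EBA, 0x21A34))
normalized, changed = normalize(sample)

print([f"U+{ord(c):04X}" for c in normalized])
print(changed)
```

Because every pair maps one character to one character, the normalized text keeps the same length as the input, which keeps character offsets aligned with the original transcription.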

In addition, reflecting broader discussions on the representation of the ancient world through data (Farina et al., 2024), this dataset exhibits strong potential for expansion and refinement. For instance, researchers may develop supplementary lists of variant characters relevant to other eras or genres, or establish connections with additional linguistic resources. Although this dataset does not encompass all variant characters found in Chinese texts from various regions and periods, sustained collaborative efforts within the research community can facilitate the ongoing growth and standardization of variant character mappings.

Competing interests

The authors have no competing interests to declare.

Author roles

Jiwon Lee: Data curation, Methodology, Writing-original draft

Youngim Jung: Methodology, Writing-review & editing

DOI: https://doi.org/10.5334/johd.325 | Journal eISSN: 2059-481X
Language: English
Submitted on: Mar 12, 2025
Accepted on: Apr 19, 2025
Published on: May 21, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Jiwon Lee, Youngim Jung, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.