Modernism’s Vector: Dataset for a Chinese-Language Journal in Late 1950s Malaya

Nicholas Y. H. Wong

doi:10.5334/johd.485


(1) Overview

Repository location

DOI: 10.5281/zenodo.18205257

Context

Chao Foon (Jiaofeng) is a Chinese-language print literary journal from Malaya (and postcolonial Malaysia) that still runs today, apart from a brief hiatus between 1999 and 2002. The GitHub repository contains .txt files for issues 1 through 100, organized by year (1955–1961). The first hundred issues were chosen because they cover the years in which the journal published content from and about Malaya; around issue 101, with a change of editor, it began introducing world literature. Launched in November 1955 as a twice-monthly publication, Chao Foon became a monthly in November 1958 (issue 73).

The history of Cold War literary modernism has been reconsidered through expanded archives, the remapping of publishing networks, and the exposure of the hegemony of institutions like the CIA (Barnhisel, 2015). The late 1950s were a high point of decolonization, in which Third World humanism influenced literary production (So, 2020). How does a Mahua (Chinese-Malayan) literary journal that is called “modernist” in decolonizing Malaya relate, or not relate, to such global discussions? Chao Foon, like other Malayan journals, has rarely been included in studies of Cold War modernism. The debates and frameworks of Cold War modernism tend to originate from Euro-American contexts, or from decolonizing French contexts.

This is the first dataset created for the study of Mahua literature, which is sometimes seen as a ‘minor’ field in Chinese studies. It can serve as a benchmark for computational literary studies of Mahua modernism. Users can run multiple tools on the dataset to evaluate how well they detect semantic change in time-sliced corpora, and to track emergent concepts in Chao Foon.

(2) Method

The dataset of .txt files was prepared as follows. PDFs of one hundred issues of Chao Foon were converted into JPEGs, and text was extracted with Optical Character Recognition (OCR) through the Google Cloud Vision API. Google OCR was mostly accurate, with around three to five errors per page. Manual proofreading addressed two issues: (1) typography and visual design, e.g., article titles and author/translator names set in calligraphy or a handwritten style were not readable by OCR and were manually reinserted into the scanned texts; (2) layout reconstruction, e.g., OCR scrambled the line order in poetry and rearranged the order of prose that was laid out in several columns. Proofreading also corrected OCR mistakes in punctuation, English letters, and Chinese characters that look similar, such as 件 / 伴 or 會 / 曾. Chinese variant characters, such as 「塲 (場)」「卽 (即)」「眞 (真)」「却 (卻)」, were standardized, and the character 「一」, which OCR omitted when it appeared at the start of a sentence, was restored.
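The variant standardization described above can be pictured as a character-level lookup. The following is a minimal sketch only, not the project's actual workflow (which the text describes as manual proofreading). The mapping covers just three of the pairs named above, and the names VARIANT_TO_STANDARD and standardize_variants, like the sample string, are hypothetical.

```python
# Hypothetical mapping: three of the variant pairs named in the text.
# A real table would be much longer and would need manual review.
VARIANT_TO_STANDARD = {
    "塲": "場",
    "眞": "真",
    "却": "卻",
}

# str.maketrans builds a one-to-one character translation table.
_TABLE = str.maketrans(VARIANT_TO_STANDARD)


def standardize_variants(text: str) -> str:
    """Replace each known variant character with its standard form."""
    return text.translate(_TABLE)


# Illustrative string, not taken from the corpus.
sample = "眞理却在市塲"
print(standardize_variants(sample))
```

A one-to-one table suits this case because each variant maps to exactly one standard character. Confusable pairs such as 件 / 伴 cannot be handled this way, since both characters are valid and only context shows which one was printed.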

To reuse this dataset, users can first consult README.md on GitHub for a user guide and project overview. DATA_STRUCTURE.md details the corpus .txt files, naming conventions, temporal coverage, and known anomalies. MODEL_TRAINING_SPECIFICATIONS.md summarizes model training parameters and key metrics (e.g., token counts, vocabulary sizes, and related measurements). PREPROCESSING_SPECIFICATIONS.md documents the preprocessing pipeline and model-ready data, including processed token counts and BERT-processed sentence counts. A companion website shows the potential use of the dataset through visualizations: https://dh-dataset-mahua-word-embeddings.vercel.app/. The site presents coverage statistics by year (1955–1961), model overviews for Word2Vec, FastText, and BERT, and interactive embedding visualizations (PCA, t-SNE, UMAP).

DATA_STRUCTURE.md is organized into three parts: (1) corpus/ holds the original .txt files, which are the main contribution of this dataset, organized by year under the naming convention YYYY-MM-first|second-issue-XXX.txt; (2) rationality-related/ displays the results of a sample query into issues 78 and 79 of Chao Foon, based on word embeddings and visualizations related to the token lixing (rationality), a keyword linked to modernism in the Mahua context; (3) yearly-based-model-data/ provides processed model data (with character counts) grouped by year for the embeddings. Model training data is recorded in the JSON files. For contextual embeddings like BERT, the pre-trained model bert-base-chinese was used. Sample preprocessing statistics for 1956 are as follows: Sentences: 1,708; Total Tokens: 36,845; Unique Vocabulary: 39,403 tokens.
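The naming convention above lends itself to building time slices programmatically. The sketch below assumes one reading of the convention: a four-digit year, a two-digit month, the literal word first or second, and an issue number of up to three digits. The example filenames are hypothetical and should be checked against DATA_STRUCTURE.md, which also lists known anomalies. The names FILENAME_RE, IssueFile, parse_issue_filename, and group_by_year are illustrative.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional

# Assumed reading of "YYYY-MM-first|second-issue-XXX.txt".
FILENAME_RE = re.compile(
    r"^(?P<year>\d{4})-(?P<month>\d{2})-(?P<half>first|second)"
    r"-issue-(?P<issue>\d{1,3})\.txt$"
)


class IssueFile(NamedTuple):
    year: int
    month: int
    half: str   # "first" or "second"
    issue: int


def parse_issue_filename(name: str) -> Optional[IssueFile]:
    """Return the parsed fields, or None if the name does not match."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return IssueFile(int(m["year"]), int(m["month"]), m["half"], int(m["issue"]))


def group_by_year(names: Iterable[str]) -> Dict[int, List[str]]:
    """Group matching filenames into yearly slices; skip everything else."""
    slices: Dict[int, List[str]] = defaultdict(list)
    for name in names:
        parsed = parse_issue_filename(name)
        if parsed is not None:
            slices[parsed.year].append(name)
    return dict(slices)


# Hypothetical filenames, built from issue dates given in the text.
names = ["1955-11-first-issue-001.txt", "1959-04-first-issue-078.txt", "README.md"]
print(group_by_year(names))
```

Files that do not match the pattern are skipped silently here. In practice it may be safer to log them so that irregular filenames are noticed rather than dropped.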

(3) Dataset Description

Repository name

GitHub: https://github.com/candyyetszyu/DH_Dataset_Mahua_word_embeddings; a companion website to facilitate exploration of the dataset and its visualizations: https://dh-dataset-mahua-word-embeddings.vercel.app/

Object name

“DH Dataset Mahua Word Embeddings: Enhanced Documentation and Dataset Guide” (on Zenodo); “DH Mahua Literary Journal Dataset” (on GitHub)

Format names and versions

TXT: Original literary journal texts; JSON: Processed data, embeddings, and model outputs

Creation dates

Start date: 2024-09-13; End date: 2025-04-05

Dataset creators

Nicholas Y. H. Wong: conceptualization, funding acquisition, project administration, supervision, validation.

Ye Tsz Yu Candy: data curation, formal analysis, visualization.

Allie Xiang Haiyin: checking of OCR-ed issues of Chao Foon.

Language

Chinese

License

Creative Commons Attribution 4.0 International

Publication date

2025-10-27

(4) Reuse Potential

The dataset of .txt files allows external researchers to train models and generate embeddings. (Embeddings convert text into high-dimensional vectors. The cosine similarity between two vectors indicates how closely related the corresponding words are, so a target word’s nearest neighbors reveal similar phrases or contexts in which it frequently appears.) Our companion website provides example queries to show how the data and models can be used in practice. A Word2Vec visualization correctly identified the concerns of Chao Foon issue 78 (April 1959), based on vector similarity with the term rationality (lixing 理性). This Word2Vec analysis corroborates current findings about the importance of rationality (lixing) in discussions of Mahua modernism. Mahua writers used rationality in opposition to what they saw as irrational expression in Cold War China’s literary “realism.” This explains Mahua writers’ unique view of rationality as a component of modernism. Such a view contradicts ideas about modernism from elsewhere. For example, abstract expressionism, a key movement in modernism, is not rationalistic.
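For readers new to embeddings, the cosine comparison mentioned in the parenthetical above can be written out as follows. This is the standard definition of cosine similarity, not a formula taken from the project's documentation; the vectors u and v and the three-dimensional numbers are purely illustrative.

```latex
\[
\cos(\mathbf{u}, \mathbf{v})
  = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
  = \frac{\sum_{i} u_i v_i}{\sqrt{\sum_{i} u_i^{2}} \, \sqrt{\sum_{i} v_i^{2}}}
\]

% Illustrative example with u = (1, 2, 2) and v = (2, 1, 2):
\[
\cos(\mathbf{u}, \mathbf{v})
  = \frac{1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2}{\sqrt{1 + 4 + 4} \, \sqrt{4 + 1 + 4}}
  = \frac{8}{3 \cdot 3}
  = \frac{8}{9} \approx 0.89
\]
```

Values near 1 mean the two vectors point in nearly the same direction, and values near 0 mean they are close to orthogonal. The measure ignores vector length, which is one reason it is commonly used to compare word vectors.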

On (Chinese) Word Embeddings

Users can use the dataset to generate word embeddings and ask questions about Mahua and global modernism. Word embeddings rely on the distributional hypothesis, the idea that words that appear in the same contexts tend to have similar meanings (Harris, 1954; Boleda, 2020; Lenci, 2018). Scholars have discussed the validity of the distributional hypothesis in studying shifts in language use (Arseniev-Koehler, 2024; Morrissey & Roe, 2024; Algee-Hewitt, 2023). Recent applications of word embeddings use ‘minor genres’ as their corpus, suggesting strong links between conceptual history and word embeddings in newspapers (Wevers & Koolen, 2020), scientific correspondences (McGillivray, 2024), and encyclopedia entries (Morrissey & Roe, 2024).
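The distributional hypothesis can be illustrated with a count-based toy example. The sketch below is a simplified stand-in for the intuition only, not Word2Vec, FastText, or BERT, which learn dense vectors by prediction. The three pre-segmented sentences are invented and do not come from Chao Foon, and the names context_vectors and cosine are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Count, for each token, the tokens found within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented, pre-segmented toy sentences (not corpus data).
TOY = [
    ["詩人", "追求", "理性"],
    ["作家", "追求", "理性"],
    ["椰樹", "生長", "南洋"],
]

vecs = context_vectors(TOY)
# 詩人 and 作家 share identical contexts; 詩人 and 椰樹 share none.
print(cosine(vecs["詩人"], vecs["作家"]))
print(cosine(vecs["詩人"], vecs["椰樹"]))
```

Real Chinese text must first be segmented into tokens, and the choice of segmenter and window size affects which words count as sharing a context. This is one of the preprocessing decisions users may wish to vary.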

Some believe that word and concept are functionally equivalent in collective usage (Regan, 2023), while others suggest an inexplicable “substrate” beyond words (de Bolla, 2013). Other scholars have criticized word-embedding methods for their affinity to, and assumptions of, structural linguistics, and doubt whether cosine similarity reflects similarity of contexts (Arseniev-Koehler, 2024). Word embeddings also reflect biases concerning, e.g., gender, race, class, and sexuality that are present in the texts they are trained on. Yet such biases are worth analyzing, and literary biases in Chao Foon can be analyzed further.

As a pilot study, the dataset is limited to a five-year period and to Mahua literary journals. Users also require facility with Chinese. Preprocessing decisions in the sample query reflect methodological choices. Chao Foon publishes a wide range of literary forms, such as manifestos, fiction, non-fiction, and poetry. Genre shapes both the way words are used and the interpretation of word embeddings.

Acknowledgements

I wish to thank the HKU Arts Tech Lab.

Competing interests

The author has no competing interests to declare.

Author Contributions

Conceptualization, Funding acquisition, Project administration, Supervision, Validation, Writing – Original Draft.