1. Introduction: The Poetics of Constraints
Computational Literary Studies (CLS) confronts a mismatch between machine learning’s demand for discrete, invariant “ground truths” and the contextual, interpretive nature of literary criticism (Long & So, 2016; Piper, 2020). While skeptics argue that digital tools sever texts from their contexts (Bode, 2018, pp. 17–35; Da, 2019) and lack human-like understanding (Bender & Koller, 2020), the counter-assumption that maximizing context invariably improves performance ignores the noise inherent in large-scale data processing, the varying amounts and types of computation required by different contexts, and, most importantly, the fundamental question of what “context” actually is and whether the answer is itself context-specific. This article addresses these tensions by operationalizing “context” as a tunable hyperparameter within a classification task and by exploring the question: what is the appropriate unit of analysis for modeling a literary feature that is itself scale-dependent?
One must first delineate the computational complexity of the task at hand. I focus on parallelism (dui’yu 對語, duizhang 對仗, or dui’ou 對偶), one of the defining features of Chinese regulated verse (lüshi 律詩) (Cai, 1989; Plaks, 1988). In the standard eight-line (“octave”) format, the poem is structurally divided into four couplets: the Head (shoulian 首聯), the Jaw (hanlian 頷聯), the Neck (jinglian 頸聯), and the Tail (weilian 尾聯). Prosodic law since the Tang (618–907) dictates that the “inner couplets” (the Jaw and the Neck) must exhibit strict parallelism, demanding a precise, one-to-one correspondence between words in identical positions across two lines, in addition to a consistent rhyme scheme and tonal arrangement (Watson, 1971, pp. 109–113). This structural principle can be observed in Bai Juyi’s 白居易 (772–846) poem Grass on the Ancient Plain: Farewell to a Friend 賦得古原草送別:
離離原上草,一歲一枯榮。
野火燒不盡,春風吹又生。
遠芳侵古道,晴翠接荒城。
又送王孫去,萋萋滿別情。
Lush and dense is the grass on the plain,
Each year it withers, each year it flourishes again.
Wildfires can burn but cannot slay;
When the spring breeze blows, the grass grows once more.
The distant fragrance encroaches on the ancient road,
The bright green meets the ruined town.
Once again, I see my noble friend off;
The teeming grass fills with the sorrow of parting.
In the third, parallel couplet, the initial adjective yuan 遠 (“distant”) maps to qing 晴 (“bright”), and the noun fang 芳 (“fragrance”) maps to cui 翠 (“greenery”); both pairs describe sensory qualities. The central verb qin 侵 (“encroaches”) mirrors jie 接 (“meets”), simultaneously contrasting and linking nature with culture. Finally, the object phrase gu dao 古道 (“ancient road”) is counterpoised by huang cheng 荒城 (“ruined town”). Each element in the first verse finds its counterpart in the second, forming what Yu-kung Kao describes as “a complete musical line, [where] both exposition and response form a self-contained cycle” (Kao, 1986, p. 337).
However, this simplicity is deceptive. While parallelism may appear to be a lexical game of retrieving words from a dictionary, it is fundamentally a context-dependent operation. As linguists have long noted, Chinese characters lack fixed morphological markers; their grammatical function is fluid and determined by their syntactic environment rather than inherent word class (Chao, 1968, p. 501; Pulleyblank, 1995, pp. 10–13). Consider the character cui 翠 in the sixth line. In isolation, it is ambiguous: it can function as an adjective (“green” or “emerald”) or as a noun (“greenery,” “jadeite,” or even “kingfisher feathers”). One cannot simply look at the character itself to determine its meaning; one may need to verify the opposing character in the parallel line. Because cui 翠 is paired with fang 芳, which functions here more readily as a noun (“fragrance”), cui is biased toward a nominal reading (“greenery”). Similarly, in the second couplet (lines 3 and 4), the character chun 春 could be a temporal noun (“springtime”), an attributive adjective (“vernal”), or even a temporal adverb (“during spring”); its classification as an adjective modifying feng 風 (“wind”) is supported by its parallel counterpart ye 野 (“wild”), which modifies huo 火 (“fire”). While ordinary reading is “linear and forward,” the reading of parallel structure “diverts the reader’s attention to the side, demanding that he pay attention to the corresponding lateral line” (Kao, 1986, p. 367); “parallelism always recycles the syntactic flow backwards onto itself” (Kao & Mei, 1971, p. 57). The verification of parallelism is not a static lookup but a mutual resolution of ambiguity.
This complexity is further compounded by the fact that the requirement for parallelism is itself structurally bounded (Fuller, 2017, pp. 31–33). For example, a poem of four rigidly parallel couplets is traditionally disparaged as “stiff”; the outer couplets are typically exempt (Lee et al., 2018). Moreover, our assumption that the second line provides a disambiguating signal for the first line or vice versa is itself dependent on our prior knowledge (and the resulting cognitive expectation) that the couplet is intended to be parallel. “Even when it survives independently, a couplet is always part of a complete poem… a couplet describing a ‘far’ scene will often be followed by a couplet describing a ‘near’ scene; a couplet on parting ‘here’ will ask a couplet speculating on the traveler’s state of mind ‘there.’” (Owen, 1985, pp. 91, 100). Judging the parallelism of a couplet, therefore, might require looking beyond the local relationships to the entire piece.
2. The Hierarchy of Meaning: Defining the Units of Analysis
My objective is to determine the semantic parallel status of a given couplet. To this end, I operationalize the “unit of analysis” by defining three distinct scopes of context, each representing a different hypothesis about the most effective scale for recovering the parallelism signal (Figures 1, 2, and 3). In all cases, the input sequence provided to the model includes poetic content and special tokens like [CLS] (classification) and [SEP] (separator), as defined in the BERT architecture (Devlin et al., 2019).

Figure 1
Character Model architecture.

Figure 2
Couplet Model architecture.

Figure 3
Poem-4 (left) and Poem-1 (right) Model architectures.
2.1 The Micro-Context: The Character Model
At the most granular level, we isolate the character, treating parallelism as a lexical association task.
The Hypothesis: Parallelism is primarily a matter of matching semantic categories. Therefore, a model should be capable of predicting validity based solely on the embeddings of two opposing characters.
The Task: Given a pair of characters, classify them as a parallel or non-parallel pair. This is implemented using a standard BERT sequence classifier, where the input consists of two characters concatenated with a separator token. The model optimizes a cross-entropy loss based on the pooled output of the [CLS] token to predict semantic compatibility.
2.2 The Meso-Context: The Couplet Model
The next level up is the couplet, the unit of local syntax where functional roles are determined.
The Hypothesis: Semantic parallelism is a structural phenomenon. A model requires the context of the full couplet to resolve ambiguity and accurately decode the feature.
The Task: Given two lines, determine if they collectively form a parallel couplet. I employ the same architecture as the Character Model but expand the input to full lines, allowing the self-attention mechanism to capture cross-line dependencies. The classification head projects the aggregated [CLS] embedding to a probability score representing the couplet’s parallelism status.
2.3 The Macro-Context: The Poem Models
Finally, we reach the whole poem, leveraging the full global context.
The Hypothesis: The parallelism status of a couplet is influenced by the poem’s overall structural integrity. A model with access to the entire poem may be able to leverage global patterns to improve local judgments. To test this hypothesis, I implement two distinct workflows:
The Poem-4 Model (Multi-Label): This model accepts the full poem as input and outputs four separate labels, designating the parallelism status for each of the four couplets. In the custom classifier, I extract the hidden states from the positions of four injected special tokens ([CP1], [CP2], [CP3], [CP4]) during the forward pass. These distinct embeddings are then projected through a shared linear layer to produce independent parallelism logits for each couplet.
The Poem-1 Model (Binary): This model accepts the full poem as input and outputs a single binary label: 1 (one) if the poem’s inner couplets are parallel, and 0 (zero) otherwise. I treat the entire poem as a single contiguous sequence, fine-tuning a standard BERT classifier where the global [CLS] embedding is used to capture the latent structural features required for this binary judgment.
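The distinctive part of the Poem-4 workflow, extracting the hidden states at the four injected special tokens and projecting them through a shared linear layer, can be sketched as follows. This is a minimal NumPy illustration, not the project's actual code: random values stand in for the encoder's outputs, and the token positions and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_size = 64, 768  # typical BERT-base dimensions

# Stand-in for the encoder's final-layer hidden states for one poem,
# shape (seq_len, hidden_size); in practice these come from SikuBERT.
hidden_states = rng.standard_normal((seq_len, hidden_size))

# Hypothetical positions of the injected [CP1]..[CP4] tokens in the input.
cp_positions = [1, 14, 27, 40]

# A single linear layer shared across all four couplet positions.
W = rng.standard_normal((hidden_size, 2)) * 0.02  # (hidden, n_classes)
b = np.zeros(2)

# Gather the embedding at each [CPi] position and project it to
# independent parallel/non-parallel logits, one pair per couplet.
cp_embeddings = hidden_states[cp_positions]  # (4, hidden_size)
logits = cp_embeddings @ W + b               # (4, 2)
predictions = logits.argmax(axis=-1)         # (4,) one binary label per couplet
print(logits.shape, predictions.shape)
```

The key design point is that the four couplet decisions share one projection matrix but are computed from four distinct token embeddings, so each couplet receives its own label while drawing on the full poem's context.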
3. Methodology and Experimental Design
To empirically test the impact of different context windows, I aggregated the raw text of over 140,000 poems spanning the Tang, Song, Yuan, Ming, and Qing periods from public repositories, applying a regex filter to exclude non-Chinese characters, removing duplicates via full-string hashing, and filtering the corpus to isolate pentasyllabic octave poems. To establish a “ground truth,” I employed a SikuBERT-based classifier previously fine-tuned for this philological task (Kurzynski et al., 2024; 2025). This “teacher” model generated binary labels (parallel/not parallel) for every couplet in the dataset, thereby encoding the feature of parallelism at a fixed, couplet-level scale. Any poem where the prediction confidence fell below 0.8 for any constituent couplet was discarded. This “silver standard” approach (“silver” because not error-free) allowed me to generate the large-scale labeled data necessary for multi-level training.
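The filtering steps described above might be sketched as follows. This is a simplified illustration under stated assumptions: the regex, helper names, and line-splitting convention are mine, not the project's actual preprocessing code.

```python
import hashlib
import re

# Keep only CJK Unified Ideographs plus the two standard punctuation marks
# (an assumption; the actual filter may differ).
HAN_RE = re.compile(r"^[\u4e00-\u9fff,。]+$")

def is_pentasyllabic_octave(poem: str) -> bool:
    """True if the poem has exactly eight lines of five characters each."""
    lines = [l for l in re.split(r"[,。]", poem) if l]
    return len(lines) == 8 and all(len(l) == 5 for l in lines)

def preprocess(poems):
    """Filter to clean, deduplicated pentasyllabic octaves."""
    seen, kept = set(), []
    for poem in poems:
        if not HAN_RE.match(poem):
            continue  # drop poems containing non-Chinese characters
        h = hashlib.sha256(poem.encode("utf-8")).hexdigest()
        if h in seen:
            continue  # drop full-string duplicates
        seen.add(h)
        if is_pentasyllabic_octave(poem):
            kept.append(poem)
    return kept

poem = "離離原上草,一歲一枯榮。野火燒不盡,春風吹又生。遠芳侵古道,晴翠接荒城。又送王孫去,萋萋滿別情。"
print(len(preprocess([poem, poem, "short"])))  # duplicate and malformed entries removed
```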
The filtered data was then structured into a tripartite hierarchy to support different context windows, encompassing four distinct datasets. For character-level training, I used a community detection method described in an earlier study (Kurzynski et al., 2025). Specifically, I drew an edge between any two characters that occurred at the same positions in inner (nominally parallel) couplets in the dataset and then applied the Louvain method for community detection (Blondel et al., 2008) to discover groups of frequently matching characters. In the next step, I selected only those couplets where either all character pairs shared a community (perfect parallelism) or none did (complete lack of parallelism), and from these strictly filtered couplets, I extracted individual character pairs to serve as positive and negative examples. For couplet-level training, I aggregated full couplet strings with their corresponding binary labels. For poem-level tasks, I created two distinct datasets: a sequence labeling dataset (four labels per poem) and a binary classification dataset (one label per poem). Finally, I fine-tuned separate instances of a SikuBERT-based model (Wang et al., 2022) for each unit of analysis. The base experiment for all models consisted of a single training epoch; the Poem-1 and Poem-4 models were also trained for a second epoch for comparison. Where possible, both train and test splits (9,000 and 1,000 examples, respectively) were balanced to provide an equal number of binary labels for each category (see Limitations). The performance of all models was then compared on the unified task of determining couplet parallelism (as labeled by the “teacher” model) using three inference strategies:
Direct Evaluation: The Couplet Model was evaluated directly on a test set of couplets.
Bottom-Up Aggregation:
• I used the Character Model to generate an aggregated prediction for the couplet, which was determined by a majority vote: the couplet was classified as parallel only if at least three of the character pairs were predicted as parallel.
• I used the Couplet Model to predict the status of the entire poem. If the two inner couplets were predicted as parallel, the entire poem was classified as “regulated”.
Top-Down Inference: I used the Poem Models to infer predictions for the specific couplet:
• Via Poem-4: The full poem was fed to the model and the labels for each couplet were extracted.
• Via Poem-1: If the model classified the poem as “regulated,” I inferred that both inner couplets were parallel. This inferred status was then compared against the ground truth.
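The cross-scale inference strategies reduce to simple mappings over already-computed model predictions. The following sketch makes those mappings explicit (function names are hypothetical; the majority-vote threshold and the inner-couplet rule are taken directly from the descriptions above):

```python
def char_to_couplet(pair_predictions):
    """Bottom-up: a couplet counts as parallel iff at least three of its
    five character pairs were predicted parallel (majority vote)."""
    return int(sum(pair_predictions) >= 3)

def couplet_to_poem(couplet_predictions):
    """Bottom-up: a poem counts as "regulated" iff both inner couplets
    (the Jaw and the Neck, indices 1 and 2) were predicted parallel."""
    return int(couplet_predictions[1] == 1 and couplet_predictions[2] == 1)

def poem1_to_couplets(poem_prediction):
    """Top-down: a "regulated" verdict implies both inner couplets are
    parallel; a negative verdict implies they are not."""
    return [poem_prediction, poem_prediction]  # (Jaw, Neck)

print(char_to_couplet([1, 1, 0, 1, 0]))  # → 1 (3 of 5 pairs parallel)
print(couplet_to_poem([0, 1, 1, 0]))     # → 1 (both inner couplets parallel)
```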
By comparing the accuracy of these methods on the exact same test set of couplets, we can measure the performance loss under deliberate misalignment. The entire pipeline was repeated for 100 trials, re-sampling new train/test splits for each run.
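Returning to the character-level dataset construction described earlier, the strict filtering step, keeping only couplets where all character pairs share a Louvain community or none do, can be sketched as follows. The community assignments shown are hypothetical stand-ins for the actual Louvain output.

```python
def strict_filter(couplet, community):
    """Return "positive" if every character pair in the couplet shares a
    community, "negative" if no pair does, and None otherwise (discarded)."""
    shared = [community.get(a) is not None and community.get(a) == community.get(b)
              for a, b in zip(*couplet)]
    if all(shared):
        return "positive"
    if not any(shared):
        return "negative"
    return None

def build_char_pair_examples(couplets, community):
    """Extract (char_1, char_2, label) examples from strictly filtered couplets."""
    examples = []
    for couplet in couplets:
        verdict = strict_filter(couplet, community)
        if verdict is None:
            continue  # mixed couplets are discarded
        label = 1 if verdict == "positive" else 0
        examples.extend((a, b, label) for a, b in zip(*couplet))
    return examples

# Hypothetical community assignments; in the study these come from the
# Louvain method applied to a co-occurrence graph of parallel couplets.
community = {"遠": 0, "晴": 0, "芳": 1, "翠": 1, "侵": 2, "接": 2,
             "古": 3, "荒": 3, "道": 4, "城": 4}

print(strict_filter(("遠芳侵古道", "晴翠接荒城"), community))  # → "positive"
```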
4. Results and Discussion: The Goldilocks Hypothesis of Context Alignment
The results are summarized in Table 1 and Figure 4. The following discussion explores the performance of each model as a decoder of the teacher’s signal, considering the trade-offs between accuracy and other forms of analytical insight.
Table 1
Models and inference pipelines, with performance metrics (mean ± standard deviation) acquired from 100 trials.
| MODEL/INFERENCE | ACCURACY | PRECISION | RECALL | F1 |
|---|---|---|---|---|
| Character | 0.932 ± 0.034 | 0.930 ± 0.041 | 0.936 ± 0.037 | 0.932 ± 0.032 |
| Couplet | 0.947 ± 0.019 | 0.923 ± 0.040 | 0.979 ± 0.021 | 0.949 ± 0.017 |
| Char → Couplet | 0.881 ± 0.026 | 0.897 ± 0.038 | 0.864 ± 0.046 | 0.879 ± 0.026 |
| Poem-1 | 0.886 ± 0.029 | 0.860 ± 0.059 | 0.932 ± 0.042 | 0.892 ± 0.022 |
| Poem-1 (2 epochs) | 0.905 ± 0.022 | 0.881 ± 0.045 | 0.941 ± 0.037 | 0.908 ± 0.019 |
| Couplet → Poem-1 | 0.839 ± 0.051 | 0.777 ± 0.073 | 0.969 ± 0.031 | 0.860 ± 0.035 |
| Char → Poem-1 | 0.808 ± 0.027 | 0.847 ± 0.042 | 0.760 ± 0.076 | 0.797 ± 0.036 |
| Poem-4 | 0.696 ± 0.017 | 0.742 ± 0.055 | 0.601 ± 0.102 | 0.655 ± 0.048 |
| Poem-4 (2 epochs) | 0.735 ± 0.022 | 0.759 ± 0.052 | 0.686 ± 0.077 | 0.715 ± 0.031 |
| Poem-4 → Poem-1 | 0.659 ± 0.027 | 0.673 ± 0.066 | 0.674 ± 0.144 | 0.657 ± 0.061 |

Figure 4
F1 Score Distribution by Target. Metrics computed against silver labels generated by the teacher model.
4.1 The Goldilocks Hypothesis: Optimal Context at the Meso-Level
As anticipated by the experimental design, the Couplet Model, which operates at the same meso-scale at which the parallelism signal was originally encoded, consistently outperformed all other architectures on the primary task of recovering the logic of the teacher classifier. In this experimental setup, the couplet provides “just enough” information to resolve local syntactic ambiguities without introducing the noise of broader, potentially irrelevant, structural information, and thus establishes a clear baseline against which we can measure and theorize the specific performance trade-offs of misaligned models. This result suggests a “Goldilocks hypothesis” for contextual modeling: the most effective context is not necessarily the largest possible, but the one that is aligned with the scale of the phenomenon under investigation.
4.2 The Lexical Gap: Failures of the Micro-Context (Character → Couplet)
While the Character Model achieved high accuracy on its own task of classifying individual character pairs, its performance dropped when its judgments were aggregated to classify a whole couplet. One possible explanation is that the model suffers from a “lexical gap” as it lacks the syntactic context necessary to resolve ambiguity. For example, in the couplet 高樓邀落月,疊鼓送殘更 (“The high tower invites the sinking moon; Repeated drums send off the fading watch”), the Character Model often fails, misinterpreting characters like luo 落 (“to fall”) and can 殘 (“remnant”), which function as adjectival modifiers within the line; this information is only available at higher scales. Similarly, the Character Model struggles with metonymic parallelism, as seen in 遺文誦史漢,奇思探莊騷 (“Reciting the Shiji and Hanshu from bequeathed texts, Exploring the Zhuangzi and Li Sao with wondrous thought”), because the connection between Shiji and Hanshu (shi 史, han 漢) and Zhuangzi and Li Sao (zhuang 莊, sao 騷) is contextual and syntactic, not purely lexical.
4.3 The Structural Shadow: Interference from the Macro-Context (Poem → Couplet)
Counterintuitively, a larger context window hindered classification performance rather than improving it. Both Poem models performed worse on couplet classification than the dedicated Couplet Model, especially in terms of precision (frequently misclassifying non-regulated poems as regulated). The Poem-1 model must expend its limited capacity over a large number of tokens to learn the general rule that the inner couplets of a lüshi (as labeled by the “teacher” model) are parallel, a structural constraint that may divert computational resources away from analyzing the specific semantic content of individual couplets. This explanation is also supported by the fact that applying more computation (training for two epochs) to the Poem-1 and Poem-4 models leads to higher accuracy on the test set. The Poem-1 model may have also learned cross-couplet relationships that, while useful for the whole-poem task, proved detrimental to individual couplet predictions.
This latter phenomenon is evident in what could be termed a “structural shadow,” where a poem’s overall formal regularity biases the model’s judgment of a specific couplet. Consider Huangfu Xiao’s 皇甫涍 (1497–1546?) poem In Remembrance of Qiao Baiyan, the Sima 有懷喬白巖司馬:
風肅園陵樹,霜天戰角悲。
後湖衰草色,空對漢官儀。
辛苦趨朝日,羈危扈聖時。
江都千萬舸,老淚不禁垂。
The wind blows solemn through the mausoleum trees;
In the frosty sky, the war horns sound with grief.
By the Rear Lake, the grass has faded to a withered hue,
Alone, I face the Han official ceremonial.
You toiled bitterly, day after day, to attend court;
In perilous times you escorted the Sage.
Now at Jiangdu, amidst myriads of boats,
My old tears cannot help but fall.
In the second couplet (後湖衰草色,空對漢官儀), the Houhu 後湖 (“Rear Lake”) is a proper noun, while kong dui 空對 (“lonely facing”) is an adverb-verb construction. They are functionally incompatible, an error the Couplet Model correctly identifies across multiple trials. The Poem-1 Model, however, is likely “biased” by the macro-context, for instance, the fact that the third couplet is parallel; the poem’s otherwise rigid structure casts a shadow of validity over the non-parallel second couplet (the kong dui might be a deliberate wordplay on “empty couplet,” signifying the missing other). In the Poem-4 model, this phenomenon is further compounded by data sparsity. Because poems with parallel outer couplets are relatively rare (Lee et al., 2018), the Poem-4 model learns a spurious positional heuristic, effectively learning to predict 0 (not parallel) for the first and fourth couplets and 1 (parallel) for the inner couplets, regardless of content. This failure is not merely a flaw in the model but a finding from the benchmark itself, demonstrating that the model’s real-world performance is inextricably linked to the statistical realities of the available textual archive. How a digital humanist defines the scale of a literary phenomenon directly impacts the kind of training data required for a model to decode it.
5. Implications for Digital Humanities Benchmarking
The above results suggest that the ability to computationally recover a given feature depends on modeling alignment; as such, they gesture towards a regulative logic: to analyze a literary phenomenon, we proceed as if it were encoded at a particular scale and then seek a model architecture aligned with this postulated structure. A classifier designed to detect irony in dialogue may require a different contextual window from one designed to track the evolution of narrative voice. It should be noted that such a regulative argument remains agnostic as to the actual ontological status of the analyzed feature and pertains only to classifiers sensu stricto, i.e., task-specific models with a restricted output space, rather than general-purpose models only employed for classification tasks, in which case the output space is much larger (the model’s entire vocabulary). This perspective transforms the classifier from a passive discoverer of textual truth into an active participant in the hermeneutic process, testing not only the text itself but also the very viability of the scalar assumptions built into its architecture.
The results also demonstrate that digital humanities (DH) benchmarking should be a multidimensional evaluation of a model’s fitness for different scholarly purposes, one that weighs not only accuracy but also the distinction between interpretability and explainability (Rudin, 2019). In this project, the Character → Couplet classification offers structural interpretability: because the final decision is a simple aggregation of five independent character-pair judgments, we can transparently trace a couplet’s classification back to the specific predictions made for each pair. In contrast, the Poem-1 classification is much harder to interpret; we cannot determine whether the model’s validation stems from a genuine detection of parallel syntax, a recognition of the author’s style, or a simple reliance on the poem’s rhyme scheme. That said, the aligned models are more amenable to explainability, allowing for the use of post-hoc techniques to generate approximations (explanations) of their behavior.
In particular, the large-context models such as Couplet or Poem-1/Poem-4 offer geometric insights into how literary phenomena are represented internally by the Transformer models. Analysis of the Poem-1 model confirms an earlier result (Kurzynski et al., 2025): parallel couplets (unlike non-parallel ones) exhibit geometric alignment across the two lines, where corresponding positions are attended to in patterned ways, and the “key” vectors for matched characters point in similar directions in the model’s learned space. In the present study, Poem-1 learns that the inner couplets determine the “regulatedness” of the poem and consequently directs almost all its attention to them (Figure 5). Within these inner parallel couplets, the model distributes attention isomorphically across heads: higher attention on the first character of the first line, for instance, frequently corresponds to higher attention on the first character of the second line, and so on. Such insights into vector poetics carry cognitive implications, insofar as artificial models offer computational analogies to the human brain (Gärdenfors, 2000), but are by necessity unavailable to small-context models.

Figure 5
Attention distribution in a regulated poem, top layer of the fine-tuned SikuBERT classifier (Poem-1). Each heatmap is a head, each row is a couplet, each cell is a Chinese character or punctuation. Darker color indicates higher attention score from the [CLS] token. The [CLS] and [SEP] tokens have been removed for better visibility. Notice the isomorphic attention distribution in the inner couplets: in Head 1 (top left), for example, the third (parallel) couplet elicits higher attention at positions 1, 2, and 5 in both lines. Punctuation marks often serve as “attention sinks,” providing a stable anchor for information flow across layers.
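Heatmaps like those in Figure 5 can be derived from a BERT-style encoder's attention tensors roughly as follows. This is a NumPy sketch with random values standing in for the actual attention weights; the token positions are assumptions ([CLS] at index 0, [SEP] last).

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, n_heads, seq_len = 12, 12, 50  # BERT-base-like dimensions

# Stand-in for the per-layer attention tensors a BERT-style encoder returns
# when asked for attentions: one (heads, seq, seq) array per layer.
attentions = [rng.random((n_heads, seq_len, seq_len)) for _ in range(n_layers)]

# Row 0 of each head's matrix holds the attention paid *from* the [CLS]
# token (assumed at position 0) to every position in the sequence.
cls_attention = attentions[-1][:, 0, :]  # top layer, shape (heads, seq_len)

# Drop the [CLS] and [SEP] columns before plotting, as in Figure 5
# (here [SEP] is assumed to be the final token).
special_positions = {0, seq_len - 1}
keep = [i for i in range(seq_len) if i not in special_positions]
heatmap = cls_attention[:, keep]  # (heads, seq_len - 2), one row per character
print(heatmap.shape)
```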
6. Conclusion
The “contextual turn” in natural language processing (NLP), exemplified by the groundbreaking discovery of contextual word embeddings (see, e.g., Peters et al., 2018), has provided the humanities with powerful analytical tools, yet it also invites a re-evaluation of our methodological frameworks. The present multi-scale analysis points to an important trade-off at the core of computational literary study: while aligning the model’s scale with the feature’s scale might be necessary for maximizing performance and enabling post-hoc explainability, a deliberately misaligned model, despite its lower accuracy, can offer greater interpretability by making its decision points transparent. A distinctive contribution of humanistic benchmarking to literary studies, therefore, might lie less in a monolithic pursuit of accuracy than in articulating, and empirically testing, how different scales of modeling serve different scholarly aims, including prediction, explanation, and interpretation. In this way, the very act of designing a computational model transforms the question of “context” from a silent assumption of disciplinary intimacy into an explicit and contestable methodological choice.
7. Limitations
Several limitations of this project should be acknowledged. First, the established pipeline risks a form of circularity: training student models on teacher-generated labels can propagate teacher biases. The paper does not claim to define a final human gold standard of “regulatedness,” even though the “teacher” model had been previously evaluated on human-labeled data; instead, it claims to benchmark model behavior under nested contexts for a historically theorized feature. In other words, the accuracy of the Couplet Model reflects primarily its ability to distill the teacher’s labeling rule. Moreover, this study operationalizes (Moretti, 2013) parallelism as a binary feature (parallel/not parallel) for the sake of experimental control, which is a necessary simplification that elides the nuanced reality of both poetic practice and critical judgment. Sinological scholarship has long recognized parallelism as a spectrum, encompassing “loose” (kuan dui 寬對) or “flowing” (liushui dui 流水對) parallelism, and even human readers may disagree on borderline cases (Cai, 2008; Yu, 1987).
Furthermore, this project targets semantic parallelism and does not encode tonal templates (ping ze 平仄) or rhyme categories. Since tonal constraints are central to regulated verse, incorporating phonological features could plausibly improve poem-level discrimination by penalizing semantically “plausible” but prosodically invalid poems and providing a more disambiguating signal. At the same time, some of the attention resources otherwise focused on the inner couplets would then be reallocated to other lines to process the ending tokens of all four couplets. This might require yet another modification of classifier architecture or additional computational resources.
Finally, a note on the training process is in order. Due to data sparsity, the Character Model’s training proved unstable. This was mitigated by random seed adjustments in cases where instability was detected, but the problem suggests that the task of classifying two isolated characters might be ill-posed for a BERT-based architecture, originally designed for longer sequences. Conversely, octave poems with a parallel final couplet are rare in the corpus, which makes class balance difficult to achieve. Consequently, the Poem-4 model learns a simple positional heuristic rather than the actual poetic patterns, achieving deceptively high accuracy simply by learning to predict “0” (not parallel) for the last couplet. This phenomenon, however, further demonstrates the importance of aligning literary features, modeling architectures, and training data.
Additional Files
The datasets accompanying this manuscript can be found at https://doi.org/10.5281/zenodo.18214284.
The GitHub repository is available at https://github.com/mcjkurz/parallelism-benchmark.
Acknowledgements
The author would like to thank the anonymous reviewers and the editors of this issue for their helpful comments and questions. Special thanks are also due to Xiaotong Xu 徐曉童 and Yu Feng 馮宇 for their assistance in data curation.
Competing Interests
The author has no competing interests to declare.
Author Contributions
Maciej Kurzynski — Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Visualization, Writing – Original Draft, Writing – Review & Editing.
