
“aSimMatrix” Dimensions: A Scalable Framework for Benchmarking Intertextual Similarity

By: Shellie Audsley  
Open Access | Feb 2026


(1) Context and motivation

“[I]ntertextuality” (Kristeva, 1977/1981)—the concept that textual meaning is always conditioned by other texts—has engendered structuralist rethinking that sought to formalize its unstable nature (Genette, 1982/1997). Its most overt manifestations as localized forms of allusion (embedded quotations, citations) have invited scrutiny by algorithmic means: Mahadevan et al. (2025) studied fragments of text reuse in a database of newspapers, and Romanello (2016) established citational networks across a corpus of classical texts. Direct quotations can be thought of as the most concrete textual phenomena representing the idea of intertextual relatedness. When processed computationally as discrete units of word combinations, they lend themselves to established methods of fuzzy string-matching and entity extraction. These fragmentary proofs of readerly engagement, however, leave unanswered questions of authorial intent and readerly perception and competence (Schubert, 2020). On the textual surface, the presence of recycled words may signal “imitation” or “parody” (Kristeva, 1977/1981, p. 73); between the lines, the cursory repetition of a quotation across two texts may not necessarily trigger the metaphorical ringing of bells that allows readerly associations to be made. Beyond surface references, the more evasive dimensions of intertextuality are encapsulated in implicit references, thematic echoes, stylistic imitation and genre engagement of varying degrees of diffusion and fuzziness (Hinds, 1998). These amorphous cognitive-linguistic units of conceptual meaning involve and effect intertextual relations where semantic meaning may be conditioned by broader patterns of symbolic recurrence and deviation.
Intertextuality raises many questions about readerly and algorithmic recall and the nature of hermeneutic decomposition—what constitutes a meaningful reference, when neither lexical co-occurrences nor semantic resemblance alone can define the full conditions of its existence? This benchmark considers existing notions of intertextual similarity as they are reflected in literary theory and constrained by the epistemic implications of their associated technical apparatus.

Literary historians in the digital humanities (DH) have grappled with conceptions of intertextual properties as “inexact similarities” (Quantitative Criticism Labs [QCL]; Chaudhary & Dexter, 2023)—textual patterns that can be visualized to trace intellectual evolution (Underwood, 2019) and semantic similitude that can be modelled (Forstall & Scheirer, 2019). The breadth of methodologies towards intertextuality across DH and natural language processing (NLP) (Duan, 2025) continues to see innovation in algorithmic designs for similarity scoring (Xing, 2025), engagement with new sentence-embedding models for embedding information and structuring (meta)data (Johnson et al., 2025; Smiley, 2025) and hybridized systems for detecting asymmetric referential and paraphrastic clues (Lau & McManus, 2024). The growing popularity of pre-trained, transformer-based deep-learning language models,1 which encode semantic information by statistically deriving correlations among linguistic elements, afforded a way of processing texts without the fragmenting tendency of existing lexical approaches. This is not to say that the classical, parametric machine learning methods implemented in the bulk of DH scholarship are inherently unsuited for nuanced literary compositions. Cooney et al. (2008), for instance, investigated a possible case of plagiarism (lexical and semantic overlaps) concerning the Encyclopédie by clustering thousands of entries and possible sources using pre-defined topical and stylometric features (using the CLUTO software). To locate the Encyclopédie’s unattributed sources, Horton et al. (2010) adapted bioinformatic techniques for sequence alignment2 to track recycled texts. Seeking patterns underneath the Encyclopédie’s formal surface, Roe et al. (2016) adapted Latent Dirichlet Allocation (LDA)-based topic modelling to expose discursive practices that permeate the boundaries of the text’s ontological scheme.

Old tools have been appropriated to new critical ends: the capacity of n-gram matching, which facilitated early corpus-wide search tools in classical texts (the Tesserae project [Coffee et al., 2012])—when expanded through Latent Semantic Indexing (LSI) to handle looser textual parallels (Scheirer et al., 2016) and orthographical inconsistencies (Miller et al., 2025)—afforded new questions concerning intertextual transmission across broader historical corpora. Conversely, new tools may reiterate familiar configurations of critical outlook. Studies of genre evolution have compared the composition of entire books through static (Underwood, 2019) and book-level contextual embeddings (Barré, 2024) alike. Evolving technical frameworks continue to be occupied in (if not constrained by statistical necessity to) approaches that rely on localized units of similarity—“chunked” at different word/paragraph levels—to project the effects of intertextual similarity. Each uniform span that contains possible instances of lexical matches, co-occurring linguistic features and/or computed semantic similarity is often unified through a static, unweighted and additive logic.

The present framework for modelling intertextual similarity approaches scalability—sustainability, robustness, scope—by reflecting on the efficacy of tools whose modularity enables them to be used across variations of interpretative focal lengths. For this we turn to the computational task of gauging Semantic Textual Similarity (STS), wherein a similarity score is computed for pairwise sentences (typically) using SBERT. This facilitates a comparison of sentential similarity which can then be aligned with human judgement—through an explicit formulation of the dimensions of similarity (the Similarity Matrix, Figure 1). Despite the opacity of the models’ internal arithmetic operations, SBERT’s mode of capturing sentence-wide meaning partially mirrors readerly recognition of intertextual patterns in the traditional literary task of close reading. Both involve the plausible interrelation of judiciously selected parts (lexical elements, textual features) into a systematic whole (as semantic meaning, similarity, interpretive understanding etc.) within manageably isolated sentences and/or texts. The paraphrastic pairs of literary references collected in the benchmark dataset localize these sites of interest which form the building blocks of wider analysis—in a manner digestible for both the machine and human reader.

johd-12-486-g1.png
Figure 1

Similarity Matrix. The leftmost column and row headers jointly provide a notational scheme for labelling and scoring the intertextual pairs in the dataset; common intertextual entities or referential relations are sorted into the corresponding conceptual space of similarity, including specific phenomena like “heteroglossia” (Bakhtin, 1981) in the box for “Phrase-Parallel”.

The intuitiveness and associative sense-making demanded in literary study—in close reading and intertextual analysis alike—poses a particular challenge for transformer-based models. The “compositionality gap” (Press et al., 2023, p. 5687) reflects the struggle of transformer models to properly aggregate facts into logical conclusions in complex inductive reasoning tasks (Peng et al., 2024; Luo et al., 2024; Khandelwal et al., 2018). Literary analysis involves a mode of inductive reasoning (Goodman, 1954)—processes of interpretive decomposition and abstraction informed by the rules of grammar, the heuristics of genre conventions and the second-order conceptual permutations they endlessly facilitate. In understanding intertextual similarity, then, we may learn more about compositionality and give form to abstract textual elements and their interdependencies at scale. This benchmark asks: how can any compositional error in SBERT’s generation of similarity scores be accounted for through human modes of interpretive composition? How may a theory-driven account of the spectrums of literary similarity explain any specious correlations in mechanical modes of inference?

The idea of benchmarking is not confined to the measurement of individual tools’ efficacy in capturing isolated aspects of similarity (which implies their scalability for DH/NLP workflows). It maps out conceptions of resemblance beyond opaque correlations made within the sandbox of statistical studies, and expands existing STS-adjacent benchmarks (STS-B, SuperGLUE, SemEval Task 1 [Takahashi et al., 2024], STS3k [Fodor et al., 2025]) that are less well-suited for aiding in the detection of what may constitute meaningful similarity in literary language. In consolidating theories of intertextuality into a flexible yet systematic framework that can accommodate changes in technology, we align the utility of computational tools with the conceptual complexities of intertextuality. The proposed Similarity Matrix (Figure 1), with a preliminary dataset that exemplifies various aspects of intertextual practice, reframes the workings of critical heuristics into a concrete act of evaluation. The gradations of intertextual similarity, which computational methods have variously reckoned as degree and scope, may be rethought in terms of their nuanced and asymmetric dimensions, structures and contexts.

(2) Dataset description

This is a literary dataset for pairwise semantic textual similarity (STS) tasks, containing 41 textually-varied instances of references and parallels from Lord Byron’s Don Juan (1819–24)—collected from Peter Cochran’s open edition of the text (2009).3 The examples are scored in degrees of abstractness according to the Similarity Matrix (Figure 1 in this paper; in Sheet 2 labelled “similarity-matrix” in the dataset). The dataset encourages reuse, expansion and adaptation 1) in studies of multilingual intertextual phenomena across different historical periods: bibliographical metadata (down to the stanza number) can be used to visualize readership networks, where applicable, and the distribution of in-text intertextual occurrences; 2) for evaluating the efficacy of DH/NLP models and pipelines in accounting for complex linguistic expressions that deviate from existing STS datasets; 3) in adversarial and alignment experiments for stress-testing DH/NLP model performance. The Excel format preserves additional typographical annotations to pinpoint areas of similarity for further manipulation: underlined text indicates semantic parallel, boldface shows direct mirroring, red means contradiction at the word/phrase levels (“WP”, “PhM” etc.).

Repository location

https://doi.org/10.7910/DVN/WZRBME

Repository name

Dataverse

Object name

aSimMatrix-benchmark-dataset-1 (aSimMatrix Benchmark Dataset of Intertextual Parallels)

Format names and versions

Excel (.xlsx)

Creation dates

2025-10-29 to 2025-11-17

Dataset creators

Shellie Audsley (Faculty of English, University of Cambridge, Cambridge, United Kingdom).

Language

English, with a few text snippets in French and Italian.

License

CC-BY

Publication date

2025-11-19

(3) Method

(3.1) Delimiting Textual Similarity: “aSimMatrix”, A Scalable Benchmark with a Similarity Matrix

Scholars of (post-)structuralist theory and digital humanists occupied in intertextuality studies have engaged with various aspects of the concept—typifying the kinds of intertextual relations among overlapping genres like parodic “transformation” and adaptative “imitation” (Genette, 1982/1997, p. 25). Schubert (2020) has proposed a conception of intertextuality as a network—wherein referential connections are the (weighted) edges representing degrees of semantic engagement. Similarly, Trillini and Quassdorf (2010) have consolidated formal and functional facets of quotation/citation alongside other extractable elements into a comprehensively annotated dataset for the readers of Shakespeare. The continued proliferation of such theoretical parameters reflects the difficulty of systemizing the inherent fluidity of the textual condition, even as algorithms and models capture information at unprecedented scale. Like other concepts, intertextuality’s many properties, dimensions and inferable connotations bear on its meaning and interpretation—depending on which (combinations) of these are attended to. Deeper, extratextual parallels exist, without explicit quotations and other overt manifestations, beyond the confines of a formalist approach to close reading.

Similarity is a key sign of intertextual relations, and a favored critical apparatus in traditional literary analysis: semantic affinity and deviation readily extrapolate to broader patterns of literary understanding, stylistic influence and genre evolution. Hume’s (1739) naming of “Resemblance, Contiguity in time or place, and Cause and Effect” as the basis of associative thoughts reflects a view—not unlike later technical ideas (about data) underlying the manifold hypothesis—that the mind clusters (perceived) concepts based on their repeated co-occurrences and similarity. To establish linkages among possible sites of textual interest and possible sources for readerly consideration, authoritative editions of texts recover textual echoes in their footnotes through editorial reference tracing and literary knowledge. The rigor of the assessment and validation of similarity involved in this process offers a foundation for benchmarking computational measurements of semantic similarity.

This benchmark presents a collection of 41 instances of intertextual pairs identified as references/allusions/echoes in the bibliographic footnotes in Peter Cochran’s (2009) authoritative edition of Byron’s Don Juan (1819–24). Selected for its variety of asymmetrical intertextual practice and colloquialism, this mock epic (a fundamentally imitative genre) is described as “[hospitable] to verbal registers and generic derivations—including … gazette reportage, gothic fiction” (Stabler & Hopps, 2024). The evaluation task follows the STS-B format: a score is generated for pairwise comparison over the sentences/stanzas, which are notated according to what kinds of lexical, semantic, syntactical and intertextual elements are most prominently present. In this analysis, similarity scores generated by the list of models (Table 1) are compared with normalized annotator scores, which are in practice the intertextual (aSimMatrix) similarity scores calculated according to the Similarity Matrix (Figure 1) and formula (Figure 4). This preliminary dataset includes outlier textual examples of poetic-from-prose thematic echo (id-21), groups of references to a symbol/trope (group-id-7) and multilingual paraphrases/versification of maxims (id-38).

Table 1

Model details.

MODEL | MODEL DETAILS
Word2Vec | (out of the box, no adjustment)
Base SBERT* | all-MiniLM-L6-v2
MPNet (masked model)* | all-mpnet-base-v2
Multilingual MPNet* | paraphrase-multilingual-mpnet-base-v2
Question-Answer & Retrieval* | multi-qa-mpnet-base-dot-v1
Distilled Question-Answer & Retrieval* | multi-qa-distilbert-cos-v1
E5* | e5-base-v2

Note: * SBERT family

The Similarity Matrix (Figure 1) guides the evaluation of the general efficacy of STS workflows in terms of which fuzzy dimension(s) are captured, to what extent and based on which intertextual entities/elements/relations—providing a frame of reference for future technical modifications aimed at exploring additional categories of intertextuality. The gap between the indeterminacy of linguistic effects and perceptual biases can be bridged through the metaphorical “middle terms” of the linguistic features and conceptual registers collected across the Similarity Matrix and dataset; the intensity of their perceived effect and formal recognizability informs how weights are assigned during the calculation of a similarity score. Aimed at reconciling the uneven forms, distributions and effects of intertextual similarity, this implementation of the idea catalogues the linguistic-structural levels within texts, at which readerly or machine attention can be directed, alongside a spectrum of similarity. The Similarity Matrix’s conceptual underpinnings are derived by decomposing the hermeneutic heuristics of close reading and categorizing intertextual phenomena implicit in literary writings and explicit in structuralist theory (e.g. Bakhtin, 1981).

Familiar properties of semantic parallels in critical discussions of intertextuality are located along the vertical “structural levels” of a text’s grammatical structure where they come into effect, and arranged along a horizontal scale of parallel threshold, with a Fuzziness Score indicative of the degree of indefiniteness in their formal manifestation (and concomitant effects). Reflecting the many assumptions about where intertextuality occurs in previous studies, the column corresponds to the level of chunking/analytical unit—lexical overlap/mirroring (direct quotations), phrasal parallel (paraphrases), thematic resemblance etc.

Coupled with the metaphorical x-axis of fuzziness threshold, this y-axis of textual depth shows how a unit of parallel can be perceivable as complete at a local or global level (e.g. imagery), engendering a granular assortment of readerly extractable entities of similitude. The matrix references the structural prerequisites for intertextual meaningfulness and their role in the more latent forms of intertextuality. The symbolic weight of words may require the length of a narrative to be established, and certain conceptual and cultural units (tropes, archetypes and genre) may only develop diachronically through iterations at a broader—and correspondingly more abstracted—structural level.

(3.2) Metrics rationale

Similarity comes in degrees of mirroring, parallel and resemblance, and can be measured in its absence (0) and negation (-1). The parallel thresholds account for the magnitude of fuzziness (formal recognizability) of the presence, extent and distribution of elements. This rating (-1 to 3) reflects the abstractness of a target fragment’s constituent semantic properties and the likelihood that a typical reader may recognize them. Some forms of intertextual correspondence, such as tropes, depend on pre-existing familiarity or readerly aptitude. On the other hand, diffused textual elements concatenate into less overt, more structurally encoded patterns—which can still have a profound though less readily recognizable effect. This scale incorporates observations about the direction and manner of referential relations: “a single verbal echo of an earlier phrase to systemic and sustained engagement” (Chaudhary & Dexter, 2023), in uneven acts of condensing multiple sources, or as a standalone text that refers to a portion of another text (Steyer, 2015 [cited in Kuznetsov et al., 2022]).

Words transcend dictionary definitions; the crux of literary criticism is to uncover the implications underneath their arrangement, permutation and connotations, which defer and destabilize meaning in ways that state-of-the-art contextual embeddings cannot be seen to capture. Between “similar” and “dissimilar” words can be those that are related and therefore “not dissimilar” (e.g. “annoyed”, “furious”, “appalled”). The opposite of “similarity” may also go beyond its primary definitions to connote contrast (e.g., “appeased” and “violent” can convey a sense of contradiction in the conceptual dimension of “emotional intensity”), rather than simple irrelevance (“violent” and “hungry”). SBERT models conflate diverging semantic senses when polysemous words combine into sentences: 1) the woman says that a person can be violently hungry, and 2) “‘starvation has long incited one to violence,’ says the woman” [base SBERT score: 0.6566].

Similarity scores are opaque signals of aggregated semantics, stylistics and meta-contexts (direct speech vs statement), often through uncertain or brittle combinations of detected and correlated features. The reader’s logical interpretation of word relation and context is equally hard to lay bare. On one hand, algorithms may attend to 1) counts of lexical recurrence (e.g. the repetition of “one”; detectable by n-gram matches, measured as symbolic tokens by Jaccard similarity), 2) semantic overlap (“hungry” and “starvation”, “violent” and “violence”; as sub-symbolic representations by cosine similarity [i.e. calculated from the geometric proximity of word vectors in vector space]); 3) the syntactical arrangement of parts. On the other hand, human inferences of meaningful text reuse are likely informed and influenced by prior understanding of (non)idiomatic expression, stylistic effect, rhetorical function etc. Varying degrees of emphasis can be placed on one semantic parallel above a structural one, which contributes to the variability of similarity scores. For a literary example, see Byron’s parody of Keatsian poetics (id-35), where a meaningful word/lexical mirroring (“palsied”) is used to briefly register his referent while the overall paragraph/stanza is couched in a bathetic style. Figure 2 illustrates the tension between localized word-mirroring and a given structural/chunking-level, and its impact on the overall similarity score:

johd-12-486-g2.png
Figure 2

A portion of the dataset illustrating how word-mirroring is weighted down by paragraph-level sense of opposition.
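The two algorithmic signals named above—Jaccard similarity over symbolic tokens and cosine similarity over vectors—can be sketched in miniature. The sentence pair is the polysemy example from Section 3.2, tokenized naively; the vectors passed to `cosine` are illustrative placeholders, not trained embeddings.

```python
import math

def jaccard(tokens_a, tokens_b):
    """Symbolic overlap: shared tokens over the union of tokens."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Sub-symbolic overlap: geometric proximity of two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms

s1 = "the woman says that a person can be violently hungry".split()
s2 = "starvation has long incited one to violence says the woman".split()
# Only "says", "the" and "woman" recur on the surface: 3 / 17 unique tokens.
print(round(jaccard(s1, s2), 3))  # 0.176
```

The gap between this low lexical score and the 0.6566 that base SBERT assigns the same pair illustrates how much of the model’s judgment rests on sub-symbolic correlation rather than surface recurrence.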

We will return to the previous example to demonstrate the proposed annotative approach—a process of feature-effect interpretation that approximates a reader’s internal scale. The sum (indeed synthesis) of observed meaningful types of intertextual features/match at a given structural-level is weighted according to the overall effect/sense perceived, as demonstrated in Figure 3:

johd-12-486-g3.png
Figure 3

A demonstration of one way to label intertextual elements and calculate a similarity score, compared with basic NLP approaches to STS.

The (intertextual) similarity score derived using the Similarity Matrix (Figure 1) constructs intertextual similarity as the sum of meaningful (optionally and variably weighted) localized elements, which is dynamically scaled by their overall effect within the unit of observation/focal span. This simultaneously accounts for variable kinds/levels/hierarchies of recognizable formal features and aggregated effect/perception intensity—asymmetrical dimensions which jointly and dynamically contribute to meaningfulness.

Figure 4—a rough formalization of our generalizable framework—illustrates the relations of the components (shared intertextual elements/labels, their Fuzziness Scores, optional weighting according to localized and globalized intertextual effects) used for calculating the intertextual (aSimMatrix) similarity score. Overall, this STS-based approach recognizes intertextual meaningfulness as dependent on the lexical, semantic, relational, conceptual and hermeneutic contexts of the referent/source sentence, thereby providing a more accurate and interpretable means of representing comparable similarity than lexical and semantic methods.

johd-12-486-g4.png
Figure 4

A conceptual representation of the proposed intertextual similarity (aSimMatrix) score.
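As a rough illustration of the logic that Figure 4 formalizes, the sketch below treats the score as a weighted sum of shared elements’ Fuzziness Scores, scaled by an overall-effect multiplier. The function name, the element dictionary, the clamping and the normalization are illustrative assumptions of one possible reading, not the paper’s exact formula; the labels “WP” and “Pa-R” follow the dataset’s notational scheme.

```python
def asimmatrix_score(elements, global_effect=1.0, max_fuzziness=3.0):
    """Toy rendering of the aSimMatrix idea: sum the (optionally weighted)
    Fuzziness Scores of shared intertextual elements, scale by an
    overall-effect multiplier, and clamp into [-1, 1].
    `elements`: label -> (fuzziness, weight), e.g. {"WP": (1.0, 1.0)}."""
    weighted = sum(fuzz * w for fuzz, w in elements.values())
    total_weight = sum(w for _, w in elements.values()) or 1.0
    raw = global_effect * weighted / (total_weight * max_fuzziness)
    return max(-1.0, min(1.0, raw))

# A concrete word-level mirroring ("WP") plus a diffuse paragraph-level
# resemblance ("Pa-R"), damped by a weakly oppositional overall sense.
pair = {"WP": (1.0, 1.0), "Pa-R": (3.0, 0.5)}
print(round(asimmatrix_score(pair, global_effect=0.5), 3))  # 0.278
```

The `global_effect` multiplier is where the dynamic scaling described above enters: a paragraph-level sense of opposition can push it negative, weighting down even a strong localized mirroring, as in the Figure 2 example.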

(3.3) Embedding models

Owing to the multiplicity of research goals in DH and the high data demand for training custom models for specific historical corpora, only a few out-of-the-box embedding models will be tested alongside n-gram matching to provide a general baseline for comparing machine measures and human judgement: Word2Vec, six sentence-models from the SBERT family and, informally, a commercial LLM via Retrieval-Augmented Generation (RAG) (Claude Sonnet 4.5) for direct score output with justification.4

(4) Results and discussion

(4.1) Quote fragments as lexical patterns: n-grams and static embeddings

As expected, the use of n-grams for detecting lexical matches was effective, capturing direct reuse (e.g. ids: 2, 4, 6, 13) in the forms of phrasal mirroring and partial quotations in the collection of literary paraphrases. Contrary to the dense embeddings produced by Word2Vec and transformers, n-grams supply a fixed definition of intertextual similarity as contiguous formal matching/mirroring—the degree of repetition in the form of overlapping word sequences (or other tokenized units) across texts. The drop in transformer-based similarity score in id-13, where the paraphrastic pair contains no less noticeable lexical overlap than ids-2, 4 and 6, suggests that SBERT models fail to capture the more concrete segments of text reuse. This also supports findings that n-gram-based methods like Term Frequency-Inverse Document Frequency (TF-IDF) are superior to transformer-based models for highly localized similarity measurement (Joshi et al., 2020).
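The contiguous-matching definition of similarity that n-grams supply can be sketched as follows; the two lines of text are invented toy examples, not entries from the dataset.

```python
def ngrams(tokens, n):
    """All contiguous n-token windows in a sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_ngrams(text_a, text_b, n=3):
    """Direct reuse as overlapping word sequences across two texts."""
    return ngrams(text_a.lower().split(), n) & ngrams(text_b.lower().split(), n)

source = "fare thee well and if for ever still for ever fare thee well"
reuse = "he sighed fare thee well then turned away for ever"
print(shared_ngrams(source, reuse))  # {('fare', 'thee', 'well')}
```

Interpretability follows directly from this design: every match is a concrete, quotable span, which is what makes the corpus-wide visualizations in Figure 5 possible.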

The interpretability of n-grams facilitates the visualization of lexical matches across the corpus. Juxtaposed with annotator labels from the dataset, Figure 5 offers an overview of the dynamics between lexical patterns and types of reader-perceived labels. Even in this small dataset, formally concrete and perceivably corresponding units of meaning show no fixed patterns: while localized within sentences/lines, they are scattered to different extents, at times positioned at different ends of a sentence.

johd-12-486-g5.png
Figure 5

A sample of pairwise n-gram and label distributions.

The static word embeddings of Word2Vec, GloVe and FastText are technically not suited for STS tasks. The lack of correspondence between the similarity scores of the pre-trained Word2Vec and SBERT is no surprise: the former encodes word-meaning from the fixed context of its training corpus (contemporary English) and excludes the syntactical information captured in transformer models. Adopting such a model for similarity measurement implies an epistemic anchoring of similarity to lexical meaning that is pre-determined in the initial textual contexts of its training corpus. Static embeddings by default contradict the deconstructionist rejection of fixed meanings, and are ill-suited to handle basic disambiguation. Greater ambiguity exists in phrases like “Lady’s fan [object]” (id-1), which would be semantically dissimilar to “Lady’s fan [human admirer]”, even when the former objectifyingly, or by metonymy, signifies the latter. Evaluating the semantic similarity of intertextual pairs that come from different historical periods and genres (ids-1, 2), where the modes of expression differ and semantic drift may have occurred, adds to the challenge of translating semantic nuance to numbers.
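The disambiguation failure can be made concrete with a toy static lexicon: one vector per surface form, pooled by averaging in the style of Word2Vec sentence vectors. All vector values are invented for illustration, not trained embeddings.

```python
import math

# Toy static lexicon: exactly one vector per surface form, as in
# Word2Vec/GloVe. "fan" gets a single entry whatever it means in context.
lex = {
    "lady's": [0.5, 0.5],
    "fan": [0.9, 0.1],
    "folding": [0.8, 0.2],
    "devoted": [0.1, 0.9],
}

def embed(words):
    """Word2Vec-style sentence vector: mean of known word vectors."""
    vecs = [lex[w] for w in words if w in lex]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# "fan" the object vs "fan" the admirer: the shared static vector pulls
# the two senses together regardless of surrounding context.
print(round(cos(embed(["lady's", "folding", "fan"]),
                embed(["lady's", "devoted", "fan"])), 3))
```

The two pooled sentences remain highly similar even though the intended senses of “fan” diverge, which is the epistemic anchoring to pre-determined lexical meaning described above.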

(4.2) Vague Likeness Only: SBERT models

The creators of SBERT define meaningful semantic textual similarity as the proximity of sentence embeddings in vector space (Reimers & Gurevych, 2019), in effect a method to expand the encoded meaning of one sentential unit through drawing on that of a related one. Practically, the focus on rendering sentential units as cosine similarity scores facilitates a range of large-scale classification and information retrieval tasks. Epistemically, this methodology sets up the notion of textual similarity as condensed, numerically comparable chunks of texts, though it assumes that semantic (and the concomitant conceptual) coherence occurs at the level of the sentence(s). For the study of intertextuality, these models can be tuned for multilingual inputs, allowing for translational and transnational studies across broader corpora.

Across our literary dataset, SBERT models output a score range of around 0.54 to 0.96. While these models are theoretically able to output cosine similarity scores with zero and negative values to signal semantic unrelatedness and opposition,5 the absence of negative values associated with oppositional paraphrases in our dataset highlights an inefficacy that goes beyond model performance. These models’ seeming overestimation of similarity can be interpreted as an epistemic misalignment between the statistical encoding of meaning and readerly processing of semantics. The shared opacity of these mechanical and mental workings may be lessened as we untangle their technologically-determined assumptions about dis/similarity alongside ours. The underlying technical constraint is twofold. First, the loss function with which the models are trained, together with default settings, makes negative numerical outputs relatively unusual.6 Secondly, the practical objective of SBERT in semantic search and STS tasks to prioritize similarity—rather than detect or pinpoint semantic contrast—carries an implicit definition of similarity not only as semantic resemblance but also relevance. This is further reflected in existing STS benchmarks, which use a zero-to-five scale to mark dissimilar sentential pairs and do not score oppositional pairs (Cer et al., 2017; May, 2021).

The consistently high score (average 0.9) of E5 further supports Smiley’s (2025) observation of its propensity to overstate similarity. Brute-force score normalization reveals MPNet as the model that corresponds better than the other models to human scores, reflected also in its sensitivity to the control entry (id-10).7 Paraphrase-multilingual-mpnet worked well with a syntactically and semantically very close paraphrase between English and the original French maxim (ids-36, 37), though its efficacy when dealing with looser multilingual echoes (id-22) is indeterminate. The adaptation of this task in a question-answer format for an LLM (Claude Sonnet 4.5) generated similar results. The brief justifications it provided summarized the chief semantic meaning, but did not draw on any possibly encoded information in its training to inform an output concerning style, tone, rhetorical function, etc.
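The brute-force score normalization mentioned above can be as simple as min-max rescaling each model’s raw scores before comparing them with annotator scores; the numbers below are invented stand-ins for model outputs, chosen only to mimic E5’s compressed range.

```python
def minmax(scores):
    """Rescale raw scores to [0, 1] so that a model with a compressed
    output range (e.g. E5's ~0.9 average) becomes comparable with others."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# Invented raw outputs for the same five intertextual pairs.
e5_raw = [0.88, 0.90, 0.92, 0.89, 0.95]
print([round(s, 2) for s in minmax(e5_raw)])  # [0.0, 0.29, 0.57, 0.14, 1.0]
```

After rescaling, what remains comparable across models is the relative ordering of pairs, which is why a rank-sensitive comparison against the normalized annotator scores is the fairer test of alignment.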

The utility of STS tools (SBERT models) for aiding intertextuality studies through the detection of embedded semantic similarity is often undermined by the very way they encode syntactical/contextual information—and the combined goal of capturing semantic richness and supporting efficient text processing. In too closely attending to sententially localized units of meaning, they “lose sight” of broader aspects of semantic interrelations that contribute to and inflect conceptual coherence. A more comprehensive and accurate modelling of intertextual similarity invites further thinking not only about the scope but also the nature and mechanism of similarity—the ways it concatenates over great textual distances, is defined through oppositionality, and other interpretive possibilities represented in literary texts.

(4.3) The extent of alignment

To evaluate how the machine similarity scores are aligned with human-defined dimensions, I have inspected the score variations across clusters of parallels that are editorially determined to be recycling a core idea/trope, and which contain varying compositions of lexical overlap, semantic dispersal, tonal changes, emotional emphasis, etc. The “household gods shivered” (group-id-1) idea in canto I stanza 36, which was later rewritten into Byron’s letter and Marino Faliero (1821), was noted as a development from the figure of Priam in the Aeneid and Lovel in Walter Scott’s The Antiquary (1816). The dissimilarity which the models located between the two stanzas and an earlier intertextual echo (id-7) prompted a rudimentary probing test8 on base SBERT and MPNet, which revealed that base SBERT showed a relative sensitivity to stop words. Yet it is unclear whether the models registered the diffused distribution of meaningful semantic overlap or the corresponding theme of ruin (“hearthstone … Ashes … Solitude … home”).

A more extreme instance of self-reuse (id-30; “a certain age”; Figure 6) between Don Juan and Beppo (1817) illustrates the models’ inability to register semantic/conceptual difference through a lower score. Although the pair is stylistically similar and has high lexical, semantic and syntactic overlap, it produces a contrary meaning in its arrangement of component parts. SBERT models do not appear to encode stylistic information alongside semantic content as well as one might expect.

Figure 6

An example of semantic/conceptual difference.

Byron’s reworking of Percy Bysshe Shelley’s “Ozymandias” (1818) (id-23) contained first and foremost an explicit reference to “Old Egypt’s King”, whose collocation with “Man” establishes a paragraph-level thematic resemblance (Pa-R) with “Ozymandias” that fully comes into view at the last line. While this intertextual pair has a greater foundation of parallelism than examples with simple lexical overlap (id-15), base SBERT’s lower scoring here suggests a fissure between its mechanism of dissecting textual similarity and ours.

Examples of false positives are many (scores of 0.6–0.7). A concealed referential joke (id-16) (“he … heard a voice in all the wind” and “I swear it to the trees … the winds”) may well be lost on the modern reader who is less familiar than Byron’s contemporaries with The Marriage of Figaro (translated by Thomas Holcroft, 1785). Without a tangible parallel other than word repetition, Byron’s depiction of extratextual auditory transmission draws on the play’s suggestive context to connote Juan’s teenage absent-mindedness. Most models yielded surprisingly high scores (0.7+; id-19) despite great variance across language, form (dialogue vs lyric) and content. A substantial parallel to a shipwreck survival account in a prose source (group-id-8) scored similarly highly despite the formal difference, owing to its syntactic parallels and the position of semantically related words, except for the base SBERT model, which detected a notable change (0.6–0.8) between this and a subsequent stanza.

(5) Implications/Applications

This overview of existing theoretical approaches and computational methods surrounding intertextuality presents a Similarity Matrix (Figure 1) alongside an editorially-curated set of referential and paraphrastic parallels for evaluating variance in notions of semantic textual similarity. In light of the asymmetric epistemic conditions imposed by algorithmic tools and the heuristics of human interpretive intuition, it formulates a framework for decomposing complex textual nuance and phenomena into multifaceted dimensions of similarity. The Similarity Matrix encompasses nebulous conceptions and elements of genre and examples of what Forstall and Scheirer (2019) in Quantitative Intertextuality refer to as “the presence of the unsaid” (p. 45)—the latent effects and indefinite implications of reused units of meaning which feed back into still more historically-contextualized, globalized patterns of meaning. Oftentimes, the omitted, displaced and deviant word contributes to meaning as much as polysemic, ambiguous sentences do.

In crystallizing post-structuralist approaches to analysis into a mechanistic method of interpretation, this benchmark recalls the idea of “algorithmic criticism” (p. xi) proposed by Stephen Ramsay in Reading Machines (2011)—which reflects on quantitatively-oriented approaches in connection with the programmatic nature of hermeneutics. The difference between human and mechanical modes of attention is best reflected in the former’s adeptness at detecting abstract semantic relatedness beyond ostensible similarity, and distinguishing surface co-occurrences of textual artefacts from more meaningful correlations. The findings in this benchmark show that text chunking strategies facilitate different conceptual vantage points, as they delimit coherent units of meaning as word- or document-spans. Where book-level embeddings such as Barrè’s (2024) provide a top-down outline of the formal and commercial aspects of genre history, the embeddings produced at the word and sentence levels illustrate how internal facets of semantic signals become lost even in small spans of meaning—not to mention more extended context(s).
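The point about chunking granularity can be made concrete with a minimal sketch: the same passage delimited as fixed-width word windows versus sentence spans yields different candidate units of meaning for embedding. Both splitters below are naive, illustrative stand-ins for a real tokenizer and sentence segmenter, and the sample text is invented for demonstration.

```python
# Sketch of chunking granularity: word-level windows cut across sentence
# boundaries, while sentence-level spans follow them, so the two
# strategies hand different "units of meaning" to an embedding model.
import re

def word_spans(text, width):
    """Fixed-width word windows (word-level chunking)."""
    words = text.split()
    return [" ".join(words[i:i + width]) for i in range(0, len(words), width)]

def sentence_spans(text):
    """Naive sentence segmentation on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

text = "The hearthstone is cold. Ashes remain. Solitude keeps the home."
print(word_spans(text, 4))   # windows cut across sentence boundaries
print(sentence_spans(text))  # spans follow sentence boundaries
```

A window that straddles two sentences blends their semantic signals into one vector; a sentence span keeps them apart. Either choice predetermines what kind of similarity can later be measured.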

The more amorphous instances of intertextuality—wherein linguistic registers are contingent on subtle contextual shifts (e.g. Byron’s parody of Keats in id-7)—reflect yet another dimension of the compositionality problem which underlies the arithmetic operations of transformer-based models. Beyond correlations of lexical, semantic, syntactical, topical, stylistic and topological features, meaningful intertextuality rests also on the non-uniform fluctuations amidst lexical-semantic relations; the discernment of out-of-place elements that are meaningful by virtue of being deviant (as in parodic segments) requires deep contextual inference and understanding of pre-existing norms and referents. While the similarity scores produced by SBERT cannot be seen as reliable proxies for humanly-perceivable gradations of similarity, its statistical abstractions can be aligned with a common, techno-epistemic scale containing explicit categories of perception. With the Similarity Matrix (Figure 1), we may understand what types of similarity can be measured, to what degree and to what interpretative end—while keeping track of intertextual entities and networks that await mapping outside of STS tasks.

The theoretical speculation of this framework is that dense yet fuzzy concepts like “intertextual similarity” can be modelled given a comprehensively broad and manageably transparent delineation of their ontological structure, components and inter-dependencies—as asymmetrical as they may be. From there, further formalization and practical implementations can be established through what I imagine will be increasingly hybrid configurations of lexical, semantic and syntactic similarity models—operationalizing theories of semiotics, reader response, cognitive linguistics, epistemology, the philosophy of language and knowledge representation. As computational methods become better suited to literary corpora and to detecting and inferring (even false-positive) textual reuse and correspondence, the full nature of intertextual conditions may be discovered across as yet unknown axes of extra-semantic understanding.

Notes

[1] Including “Bidirectional Encoder Representations from Transformers” (BERT), Sentence Transformers (Sentence-BERT, henceforth SBERT, particularly the family of SBERT models analyzed; the general-purpose, standard SBERT model [all-MiniLM-L6-v2] will be referred to as “base SBERT”; see Table 1 for the Model List), large language models (LLMs) which can connect to certain NLP libraries (e.g. LangExtract [Goel, 2025]), sparse-attention Longformer, etc.

[2] Specifically, Pairwise Alignment for Intertextual Relations (PAIR) and PhiloLogic Alignment (PhiloLine).

[3] https://petercochran.wordpress.com/byron-2/byrons-works/ (Last accessed: 08 January 2025).

[4] The LLM response, while unstable, may reflect embedding proximity and is therefore included for reference even though it is not applicable in many DH tasks. Notably, a recent benchmark on “interpretive reasoning” was conducted by Sui et al. (2025). The prompt used alongside a simplified dataset (in the .csv format) was: Your task is to: 1. Compare text snippets in Column B (Paraphrase) and Column C (Original) from the dataset 2. Calculate Semantic Textual Similarity (STS) scores for each row 3. Use the similarity matrix to identify similarity types 4. Provide brief justifications and output results in CSV format.

[5] Conceived of as embeddings that point to diametrically opposed directions, forming a straight line in vector space. Examples of low-score, dissimilar sentences: “Semantic Search”, SBERT.net, https://www.sbert.net/examples/sentence_transformer/applications/semantic-search/README.html (last accessed: 11 January 2026). Rafael Guerra’s post (2023) offers a detailed explanation (https://medium.com/@rgalvg/from-physics-to-data-science-the-beauty-and-power-of-cosine-similarity-f23e276afe29 [last accessed: 11 January 2026]).
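The geometry this note describes can be verified directly: cosine similarity returns 1 for identical vectors and −1 for vectors pointing in diametrically opposed directions (i.e. lying on the same line through the origin). A minimal, dependency-free sketch with an arbitrary example vector:

```python
# Cosine similarity from first principles: dot(u, v) / (|u| * |v|).
# Opposed embeddings (u and -u) lie on one line and score -1.
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [0.3, -1.2, 0.7]            # arbitrary example embedding
opposed = [-x for x in u]       # same line, opposite direction

print(cosine(u, u))        # ≈ 1.0 (identical)
print(cosine(u, opposed))  # ≈ -1.0 (diametrically opposed)
```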

[6] A full explication of the technical details is beyond the scope of this discussion; for relevant details, see the SBERT documentation, “ContrastiveLoss” in “Losses”, SBERT.net, https://sbert.net/docs/package_reference/sentence_transformer/losses.html (last accessed: 11 January 2026).

[7] This may be attributed to model size: MPNet’s vectors have 768 dimensions, while the base/standard SBERT model (MiniLM) outputs 384 dimensions.

[8] This is to check whether lexical or syntactical features are overemphasized. Shared words and stop words were removed separately and the sentence was split, followed by a re-calculation of the scores.
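A minimal sketch of this probing procedure follows, with a simple token-overlap (Jaccard) scorer standing in for an SBERT model; the stop-word list is a tiny illustrative subset, and the example pair is hypothetical rather than drawn from the dataset.

```python
# Probing sketch (note 8): remove shared words and stop words separately,
# split the sentence, and re-score each variant to see which features
# drive the score. Jaccard overlap stands in for SBERT scoring here.

STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}

def jaccard(a, b):
    """Token-overlap similarity, a stand-in for model scoring."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def strip_stop_words(text):
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

def strip_shared_words(text, other):
    shared = set(text.lower().split()) & set(other.lower().split())
    return " ".join(w for w in text.split() if w.lower() not in shared)

def split_halves(text):
    words = text.split()
    mid = len(words) // 2
    return " ".join(words[:mid]), " ".join(words[mid:])

def probe(a, b):
    """Re-score variants of a pair to expose what drives the score."""
    return {
        "original": jaccard(a, b),
        "no_stop_words": jaccard(strip_stop_words(a), strip_stop_words(b)),
        "no_shared_words": jaccard(strip_shared_words(a, b),
                                   strip_shared_words(b, a)),
        "split_max": max(jaccard(half, b) for half in split_halves(a)),
    }

print(probe("the household gods shivered", "the gods of the hearth trembled"))
```

If removing stop words collapses the score while removing shared content words barely moves it, the scorer (here, the stand-in; in the article, the model under test) is leaning on function words rather than meaningful overlap.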

Acknowledgements

I am particularly grateful for the detailed feedback and recommended readings provided by the anonymous reviewers, which have significantly contributed to the refinement of my ideas.

Competing Interests

The author has no competing interests to declare.

Author Contributions

Shellie Audsley: Conceptualization, Data curation, Formal analysis, Methodology, Writing – original draft.

DOI: https://doi.org/10.5334/johd.486 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 19, 2025 | Accepted on: Jan 23, 2026 | Published on: Feb 18, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Shellie Audsley, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.