“aSimMatrix” Dimensions: A Scalable Framework for Benchmarking Intertextual Similarity

Shellie Audsley

doi:10.5334/johd.486

Abstract

Intertextuality, understood both as a textual condition and as a set of interrelations between texts, remains a theory of interest for studying literary networks, influence, genre, cultural semiotics and the history of ideas. Its quantifiable manifestations (text reuse, paraphrases, allusions, thematic parallels and echoes) are illuminated to varying degrees by an assortment of natural language processing (NLP) techniques for Semantic Textual Similarity (STS) measurement and retrieval. The variety of reference tracing tasks is mirrored in the diversity of paradigms that address heterogeneous research questions and corpora, all operating with distinct epistemic assumptions and constraints. This discussion paper works towards a multilayered, model-agnostic framework for benchmarking methods and models used in the systematic mapping of intertextual similarity. The framework accounts for a spectrum of formal, lexical, semantic and stylometric clues, discourse levels and gradations of correspondence in a Similarity Matrix (aSimMatrix), which is central to the benchmark. Revolving around issues of interpretability and the compositional gap that underlie classical and transformer-based embedding models, this article considers the mechanical determination of semantic interrelations alongside the associative, pattern-seeking yet logical nature of interpretation itself.

The accompanying dataset is a small selection of 41 editorially curated textual parallels from Lord Byron’s mock epic Don Juan (1819–24), illustrative of various types of referential interrelations and of the proposed similarity scoring metrics. Preliminary similarity detection tests performed using n-gram and Word2Vec methods, six Sentence Transformers (SBERT) models and, informally, a commercial large language model (LLM) suggest a statistical tendency in the pre-trained models towards similarity overestimation, reflecting the systems’ specious correlation of semantic relevance with human conceptions of resemblance.
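As a rough illustration of the kind of lexical baseline mentioned above, the following sketch computes a word n-gram overlap score and a bag-of-words cosine similarity for a pair of passages. It is not the article's benchmark code: the function names (`ngram_jaccard`, `bow_cosine`), the choice of Jaccard overlap, the simple tokenizer and the two toy passages are all assumptions made for this example, and the passages are invented rather than taken from the Don Juan dataset.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase a passage and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(words, n):
    """Return the set of word n-grams in a token list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_jaccard(a, b, n=2):
    """Jaccard overlap of word n-grams: shared n-grams / all n-grams."""
    grams_a, grams_b = ngrams(tokens(a), n), ngrams(tokens(b), n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def bow_cosine(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented toy passages (hypothetical, not from the accompanying dataset).
source = "the sea was calm and the night was clear"
echo = "the night was clear and the sea was calm"

print("bigram Jaccard:", ngram_jaccard(source, echo))
print("bag-of-words cosine:", bow_cosine(source, echo))
```

On this toy pair the two clauses are merely swapped, so the bag-of-words cosine is effectively 1.0 while the bigram overlap is 7/9. The example is only meant to show that order-insensitive measures can report near-identity where an order-sensitive one does not; it says nothing about the scores reported for the models tested in the article.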

