“aSimMatrix” Dimensions: A Scalable Framework for Benchmarking Intertextual Similarity

By: Shellie Audsley  
Open Access | Feb 2026

Abstract

Intertextuality, understood both as a textual condition and as a set of interrelations between texts, remains a theory of interest for studying literary networks, influence, genre, cultural semiotics and the history of ideas. As quantifiable textual phenomena (text reuse, paraphrases, allusions, thematic parallels and echoes), intertextual relations are illuminated to varying degrees by an assortment of natural language processing (NLP) techniques for Semantic Textual Similarity (STS) measurement and retrieval. The variety of reference tracing tasks is mirrored in the diversity of paradigms that address heterogeneous research questions and corpora, all operating with distinct epistemic assumptions and constraints. This discussion paper works towards a multilayered, model-agnostic framework for benchmarking methods and models used in the systematic mapping of intertextual similarity. The framework accounts for a spectrum of formal, lexical, semantic and stylometric clues, discourse levels and gradations of correspondence in a Similarity Matrix (aSimMatrix), which is central to the benchmark. Revolving around issues of interpretability and the compositional gap that underlie classical and transformer-based embedding models, this article considers the mechanical determination of semantic interrelations alongside the associative, pattern-seeking yet logical nature of interpretation itself.

The accompanying dataset is a small selection of 41 editorially curated textual parallels from Lord Byron’s mock epic Don Juan (1819-24), illustrative of various types of referential interrelations and the proposed similarity scoring metrics. Preliminary similarity detection tests performed using n-gram, Word2Vec, six Sentence Transformers (SBERT) models and, informally, a commercial large language model (LLM) suggest a statistical tendency in the pre-trained models towards similarity overestimation, reflecting the systems’ specious correlation of semantic relevance with human conceptions of resemblance.

DOI: https://doi.org/10.5334/johd.486 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 19, 2025 | Accepted on: Jan 23, 2026 | Published on: Feb 18, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Shellie Audsley, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.