Abstract
Intertextuality, understood both as a textual condition and as a set of interrelations between texts, remains a theory of interest for studying literary networks, influence, genre, cultural semiotics and the history of ideas. As quantifiable textual phenomena, intertextual relations take the form of text reuse, paraphrase, allusion, thematic parallels and echoes, and are illuminated to varying degrees by an assortment of natural language processing (NLP) techniques for Semantic Textual Similarity (STS) measurement and retrieval. The variety of reference-tracing tasks is mirrored in the diversity of paradigms that address heterogeneous research questions and corpora, each operating with distinct epistemic assumptions and constraints. This discussion paper works towards a multilayered, model-agnostic framework for benchmarking the methods and models used in the systematic mapping of intertextual similarity. Central to the benchmark is a Similarity Matrix (aSimMatrix), which accounts for a spectrum of formal, lexical, semantic and stylometric clues, discourse levels and gradations of correspondence. Revolving around issues of interpretability and the compositional gap that underlie classical and transformer-based embedding models, this article considers the mechanical determination of semantic interrelations alongside the associative, pattern-seeking yet logical nature of interpretation itself.
The accompanying dataset is a small selection of 41 editorially curated textual parallels from Lord Byron’s mock epic Don Juan (1819–24), illustrating various types of referential interrelations and the proposed similarity scoring metrics. Preliminary similarity detection tests using n-gram, Word2Vec, six Sentence Transformers (SBERT) models and, informally, a commercial large language model (LLM) suggest a statistical tendency in the pre-trained models towards similarity overestimation, reflecting the systems’ specious correlation of semantic relevance with human conceptions of resemblance.
