Towards Multidisciplinary Annotation of Fluid Historical Corpora Using Linked Data

Anne Breitbarth; Julie M. Birkholz; Joren Six; Steven Vanderputten

doi:10.5334/johd.539

1 Context and Motivation

Historical sources are the foundation of historical, linguistic, philological, and literary research, the study of written practices, and research in Digital Humanities. Annotated historical texts, in particular, are essential for reliably identifying, disambiguating, and contextualising structural, thematic, and contextual patterns. Across disciplines, however, annotated corpora have traditionally been constructed and exploited in different ways. Linguists typically enrich digitised texts with token-level annotations (e.g., lemmatisation, part-of-speech and morphological tagging) and span-level structures (e.g., syntactic parsing), supplemented by text-level metadata (e.g., authorship, date of composition). Historians and literary scholars, by contrast, tend to focus on such text-level metadata (e.g., provenance, besides authorship and others) and ideas and arguments embedded in the texts. Despite the growth of annotated datasets and occasional multilayer resources, knowledge and data largely remain siloed along disciplinary lines (B. Bauer, 2025), preventing interdisciplinary research.

As research in the humanities becomes more data-intensive and interconnected, the need to examine relations between annotated phenomena has come into focus. Two fundamental issues impede progress: diversity of material (concerning the physical object, e.g., manuscript) and textual (the text as attested in a material object) witnesses, and seemingly disparate yet related annotations. The realisation that “text” is an abstract historical construct, and that the transmission of texts is highly fluid (i.e., carrying textual variation) and plural (i.e., attested in more than one material form) (Nichols, 1990, 2024), raises the question of which realisation to include when working with historical corpora. Many projects rely on a single transcribed version (often of an edition; increasingly of a manuscript), even when multiple manuscript or printed versions survive. Editions aiming at reconstructing an archetype frequently draw on a restricted subset of witnesses, thereby occluding a substantial part of the material record, besides constituting a subjective interpretation. In extreme cases, the resulting picture of the linguistic or stylistic properties of a period can be distorted. A growing body of scholarship argues that all known material witnesses of a text should be afforded equal authority and that the full range of variation attested across witnesses ought to be annotated and compared (McGann, 1991; Nichols, 1990, 2024). While there are projects that align transcriptions of parallel manuscript versions of a text, sometimes including a link to images of the manuscripts themselves,¹ most of these are not enriched with linguistic, historical, or literary annotations, or with extensive metadata, which would be needed to inform interdisciplinary research agendas.²

To give a concrete example of variation between material witnesses of a text, consider the Miracula Sancti Remacli (BHL 7120–7138) (Bibliotheca Hagiographica Latina, 1898–99 & 1986), a medieval text recounting the posthumous miracles of Saint Remaclus of Stavelot (present-day Belgium) that underwent multiple stages of expansion and rewriting. This is interesting from a linguistic perspective, for instance: random sampling of seven tenth-century manuscripts, close in time to the text’s composition, reveals significant syntactic deviations from the preferred edition (Acta Sanctorum Septembris 1; Pien et al., 1746–1762, 1746, col. 696–721) and its exemplar, an early modern copy (ms. 137.8 of the Collectanea Bollandiana) of a lost medieval manuscript. Where the tenth-century manuscripts frequently show head-dependent order (e.g., ad persoluendum diuinum officium, fuissemus reuersi, precio sui meriti), the edition and its exemplar show dependent-head order in the same instances (ad divinum officium persolvendum, reversi fuissemus, sui meriti precio). This suggests that the edition is not suitable as a source for studying the syntax of tenth-century Latin. At the same time, there are indications that the linguistic variation is structured according to language-external variables, e.g., time and place of composition, institutional background, or intended usage of the manuscript. Historians and philologists will likewise be interested in tracking variation across the manuscript record, as the compositional form of the text also evolves as it circulates and is reworked. For the Miracula Sancti Remacli, eight distinct redactions can be identified, marked by expansions and reorderings across stages. Each compositional variant presents a subtly or drastically distinct account, and there is enormous variation in the material aspects (manuscript size and layout), all of which has significant implications for how we must understand the text’s perception and use by medieval audiences.
Differences in the portrayal of the saint’s agency, the community, and other themes prompt questions about the intentions behind each redaction. Notably, new manuscripts of earlier stages continued to be produced—many at Stavelot itself—even as later versions were already in circulation, raising further historical questions about parallel accounts of sanctity and cultic practice circulating simultaneously (Vanderputten & Breitbarth, 2026). While up to the twelfth century, smaller libelli dominate the material transmission (Poulin, 2006), later collections of hagiographic texts used larger, less portable formats, indicating that the usage of these texts evolved (Philippart, 1977; De Valeriola & Dubuisson, 2025).

The chronological and geographical spread of this unusually complex set of redactions offers further insight into shifting institutional networks and channels of exchange. Figure 1 sketches this for the manuscript transmission of the Miracula, showing markedly different geographical patterns between the 10th–13th and 15th–19th centuries. Transmission trajectories (such as which institutions commissioned or copied particular redactions, or which manuscripts travelled together in composite volumes) add further layers of evidence. Taken together, these data help illuminate differing uses and receptions of the text. The example of the Miracula Sancti Remacli therefore underscores the need to take textual fluidity and material plurality seriously. The material ecology of historical texts records their responsiveness to ideological, stylistic, linguistic, and material pressures. The tight connection between linguistic variation, geographical provenance, institutional embedding, and manuscript circulation and use makes interdisciplinary collaboration imperative.

Figure 1: Spatial and chronological distribution of manuscript witnesses of the *Miracula Sancti Remacli*, mapped onto 10th-century Lotharingia (Germanic/Romance language boundary in red).

However, a second, equally significant impediment is the lack of interoperability among existing annotated historical corpora. Most platforms emerged from single-discipline projects with specific research questions and consequently support mono- or duodisciplinary annotation workflows. Bringing such resources together typically demands substantial alignment of annotation protocols, and even then they remain ‘insular’ solutions with limited capacity to interoperate with other datasets and platforms (Borek et al., 2025), despite calls across disciplinary boundaries for studying relations between contexts and domains (Birkholz & Budke, 2021; Jenset & McGillivray, 2017; Luger & Vogeler, 2025; Maza, 2017). The interpretative depth of historical texts can be considerably enhanced by embedding them in a semantically rich framework that supports multidisciplinary annotation (cf. also Balck et al., 2023) and that recognises textual fluidity and material plurality. Doing so amounts to taking the historical sociolinguistics adage to ‘use all the data’ (Lauersdorf, 2019, 2021) to its logical conclusion. In what follows, we outline how a digital research infrastructure distinguishing but linking abstract texts, their redactional states, the individual witnesses, and their material carriers can provide the backbone for integrating diverse witnesses and seemingly disparate yet overlapping annotations.

2 Infrastructure design

To address the challenges outlined in Section 1, we propose a research infrastructure that does not impose fixed hierarchical relationships between objects but instead connects heterogeneous data points through interoperable, semantically explicit links. The design rests on three defining principles: (a) it interlinks annotations from multiple fields of expertise (e.g., history, linguistics, literary studies, and material philology); (b) it enables parallel annotation of images and transcriptions of all extant textual witnesses, thereby making textual fluidity and material plurality analytically tractable; and (c) it is reusable across textual traditions, periods, and cultural contexts. These principles are operationalised within a technical framework that integrates outputs from several annotation environments—both open-source and licensed—linking them to digital surrogates and metadata of material witnesses through a shared Linked Data ontology.

2.1 Annotation layers and data formats

The infrastructure treats each annotation layer as a distinct but interoperable component. Where possible, existing tools are used to produce these layers:

Digital images of material witnesses (e.g., manuscripts and printed editions), referenced through IIIF manifests.
Material (structural) metadata, including date, format, persistent identifiers (PID/PURL), and holding institution of material witnesses.
Textual (descriptive) metadata, such as place or scriptorium of origin, scribal attributions, and other descriptive information on the texts contained in/instantiated in the material witnesses.
Digital transcriptions, generated through semi-automatic handwritten text recognition (HTR) and optical character recognition (OCR).
Historical and literary annotations, such as named entities (persons, places, organisations, ideas, attitudes), relationships between entities, and instances of textual reuse.
Linguistic annotations, including part-of-speech (POS) tags, lemmata, and Universal Dependencies (UD) labels.

A central property of the proposed infrastructure is that it mediates between heterogeneous disciplinary practices by relying on widely adopted community standards. The infrastructure is designed as an integration layer that connects existing representations and annotations through explicit semantic mappings. In line with CLARIN principles promoting sustainable and reusable language resources (De Jong et al., 2018), the annotation layers are produced using established platforms suited to different disciplinary needs. For the prototype corpus (Section 3), these include:

Madoc: a collaborative IIIF annotation environment for enriching manifests with textual and material metadata.³
Transkribus and eScriptorium: platforms for automatic text recognition (ATR), both of which allow iterative model retraining to accommodate palaeographic variation; the output is subsequently corrected manually and exported as XML or .txt.⁴
LabelStudio: for semi-automatic historical and literary annotation, exported as JSON.⁵
Tycho Brahe platform: for semi-automatic POS-tagging and UD parsing, exporting XML and CoNLL-U data.⁶

Digital images of manuscripts and printed editions are accessed through the International Image Interoperability Framework (IIIF) (IIIF Consortium, 2023). IIIF manifests provide stable identifiers for images, canvases, and regions, and enable the reuse of images hosted by memory institutions without local duplication. Region‑based annotations produced in IIIF-compliant environments (e.g., for layout description, material features, or alignment with transcriptions) are retained as first‑class objects in the infrastructure and linked to other annotation layers via persistent URIs. IIIF thus functions as the backbone for referencing material witnesses at the level of both whole objects and spatially delimited fragments.

Textual transcriptions are typically produced in XML‑based formats, most commonly TEI‑XML when exported from handwritten text recognition (HTR) or OCR platforms (TEI Consortium, 2023). These XML representations preserve structural information (e.g., divisions, line breaks, abbreviations). In addition, the linguistic annotation platform (Tycho Brahe) uses the tab-separated CoNLL-U format.⁷ CoNLL-U files are plain text (UTF-8) in which each word or token occupies its own line and sentences are separated by a blank line. Each token line has ten columns, which record linguistic information such as the word form itself, its lemma, its part-of-speech tag, its morphological features, and its syntactic relations. The annotation platform internally converts CoNLL-U into XML, and exports both formats. Within the proposed infrastructure, XML is treated as an authoritative source format for text content and internal structure. Rather than being queried directly, however, XML elements such as tokens, spans, or structural units are transformed into RDF resources (see below) and linked back to their XML source via stable identifiers, allowing both representations to coexist.
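The transformation of CoNLL-U tokens into linkable resources can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's pipeline: the base URI, the `ex:` property names, the identifiers `w1` and `s1`, and the analysis of the sample phrase are all hypothetical.

```python
# Hypothetical base URI and "ex:" property names; not the project's vocabulary.
BASE = "https://example.org/corpus/"
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Illustrative analysis of "fuissemus reuersi"; not taken from the project's data.
SAMPLE = (
    "# sent_id = s1\n"
    "1\tfuissemus\tsum\tAUX\t_\t_\t2\taux\t_\t_\n"
    "2\treuersi\treuertor\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)


def parse_conllu(text):
    """Yield (sent_id, tokens) per sentence; tokens are dicts keyed by FIELDS."""
    sent_id, tokens = None, []
    for line in text.splitlines():
        if not line.strip():            # a blank line closes a sentence
            if tokens:
                yield sent_id, tokens
            sent_id, tokens = None, []
        elif line.startswith("#"):      # comment or sentence-level metadata
            key, _, value = line[1:].partition("=")
            if key.strip() == "sent_id":
                sent_id = value.strip()
        else:
            columns = line.split("\t")
            if len(columns) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(columns)}")
            tokens.append(dict(zip(FIELDS, columns)))
    if tokens:                          # the file may lack a trailing blank line
        yield sent_id, tokens


def tokens_to_triples(witness_id, sent_id, tokens):
    """Turn word tokens into (subject, predicate, object) triples."""
    def uri(token_id):
        return f"{BASE}{witness_id}/{sent_id}/t{token_id}"

    triples = []
    for tok in tokens:
        if not tok["id"].isdigit():     # skip multiword ranges and empty nodes
            continue
        subject = uri(tok["id"])
        triples.append((subject, "ex:form", tok["form"]))
        triples.append((subject, "ex:lemma", tok["lemma"]))
        triples.append((subject, "ex:upos", tok["upos"]))
        if tok["head"].isdigit() and tok["head"] != "0":  # the root has no head
            triples.append((subject, "ex:hasHead", uri(tok["head"])))
            triples.append((subject, "ex:deprel", tok["deprel"]))
    return triples


sentences = list(parse_conllu(SAMPLE))
triples = [t for sid, toks in sentences
           for t in tokens_to_triples("w1", sid, toks)]
```

Because every token receives its own URI, annotations from other layers can point at the same token without touching the CoNLL-U or XML source.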

Several annotation tools used in the workflow export data in JSON‑based formats, typically optimised for task‑specific annotation (e.g., named‑entity recognition, relation annotation, classification). JSON is widely used as a lightweight data‑exchange format in web applications and annotation platforms (Bray, 2017). These JSON structures are not replaced or normalised globally. Instead, they are ingested through platform‑specific mappings that extract identifiers, annotation targets, and semantic categories. JSON therefore serves as an intermediate exchange format.
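A platform-specific mapping of this kind might look like the sketch below. It assumes a JSON export shaped like a Label Studio task list with span labels; the sample record, the witness URI, and the `#char=start,end` target scheme are hypothetical and serve only to show which three pieces of information (identifier, target, category) are extracted.

```python
import json

# Invented sample in the general shape of a Label Studio span-label export.
EXPORT_JSON = """
[
  {
    "id": 1,
    "data": {"text": "sanctus Remaclus"},
    "annotations": [
      {"result": [
        {"id": "r1", "from_name": "label", "to_name": "text", "type": "labels",
         "value": {"start": 8, "end": 16, "text": "Remaclus", "labels": ["PER"]}}
      ]}
    ]
  }
]
"""


def map_span_labels(tasks, witness_uri):
    """Extract identifier, target, and category from each labelled span."""
    records = []
    for task in tasks:
        for annotation in task.get("annotations", []):
            for region in annotation.get("result", []):
                if region.get("type") != "labels":   # ignore relations etc.
                    continue
                value = region["value"]
                for label in value.get("labels", []):
                    records.append({
                        # hypothetical identifier scheme
                        "annotation_id": f"{witness_uri}/ner/{task['id']}-{region['id']}",
                        # hypothetical character-offset target
                        "target": f"{witness_uri}#char={value['start']},{value['end']}",
                        "category": label,
                        "text": value.get("text"),
                    })
    return records


records = map_span_labels(json.loads(EXPORT_JSON),
                          "https://example.org/corpus/w1")
```

The original JSON is left untouched; only the extracted records would be passed on to the shared graph.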

2.2 Interoperability and Linked Open Data

To move from the currently siloed data to an interconnected structure, the infrastructure integrates this information both technically and ontologically. Integration between platforms is achieved through a two-step process. First, the output of each annotation platform is extracted via its API, typically using REST (Representational State Transfer) or gRPC (a remote procedure call framework originally developed at Google) (Johansson & Isabella, 2023).

In a second step, a conceptual model is needed that defines the logic of the relations and specifies the necessary (meta)data to be included. To synthesise the multimodal, multilevel annotations, the infrastructure employs a Linked Data ontology that combines digital images of manuscripts and editions, transcriptions, material and textual metadata, and multidisciplinary annotations (token-, span-, and text-based) into a single graph-based model. This situates the proposed model alongside existing initiatives that have developed ontologies or semantic frameworks for manuscript and textual description, including Biblissima,⁸ Mapping Manuscript Migrations (Burrows et al., 2020), Digital Scriptorium,⁹ MEMO (Franzini et al., 2016), and the Scholastic Commentaries and Texts Archive (SCTA) (Witt, 2024). These projects have made foundational contributions to modelling manuscripts, provenance, and complex textual structures, and their work directly informs our approach. However, they typically focus on manuscript‑level description, provenance, or specific scholarly genres and do not aim to integrate fine‑grained linguistic, historical, literary, and material annotations across all surviving witnesses of a text. The ontology proposed here is therefore complementary rather than competitive: it reuses and aligns with existing models where appropriate, while extending them to support the explicit representation of textual fluidity, parallel redactions, and cross‑disciplinary annotation within a shared semantic framework. The resulting ontology will then be mapped to a standard vocabulary.

The ontology underpinning this infrastructure is conceived as a lightweight, modular integration model rather than a comprehensive replacement for existing standards. At its core are classes representing text (an abstract historical construct), redaction (a coherent textual state), witness (a concrete textual instantiation), and material object (manuscripts and printed artefacts), alongside entities for tokens and token spans, and annotations of all these with textual and material metadata, POS-tags, UD-labels, agents, and places. Key relations between entities include isWitnessOf, isRedactionOf, isPartOf, isVariantOf, and annotates, allowing annotations at any level—from individual words to composite codices—to be linked across layers. Where appropriate, the model aligns with established ontologies and conceptual frameworks: CIDOC‑CRM provides patterns for modelling material objects, production events, and agents;¹⁰ FRBR/LRM informs distinctions between abstract works and their instantiations (IFLA, 2017, 2026); and IIIF identifiers anchor references to digital surrogates (IIIF Consortium, 2023). Rather than enforcing a single inheritance hierarchy, the ontology functions as an alignment layer that connects discipline‑specific representations through shared identifiers and explicitly typed relations. Together, these components allow for the representation of historical texts as richly interlinked, non-hierarchical networks of evidence. They ensure that linguistic, historical, literary, and material annotations can be queried in combination across all witnesses, making the infrastructure suitable for wide-ranging research and highly reusable across projects.
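To make the core classes and relations more tangible, the following sketch builds a handful of triples and serialises them as N-Triples. The relation names come from the description above, but the namespace, the identifiers, the direction of each relation (e.g., a witness being `isPartOf` a material object), and the `upos` property are assumptions made for illustration rather than the ontology's actual definitions.

```python
EX = "https://example.org/mpatch/"   # hypothetical namespace
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def iri(local_name):
    """Format a term in the hypothetical namespace as an N-Triples IRI."""
    return f"<{EX}{local_name}>"


def literal(text):
    """Format a plain string as an N-Triples literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Assumed reading: witness -> redaction -> text, with tokens inside witnesses.
graph = [
    (iri("text/miracula"), RDF_TYPE, iri("Text")),
    (iri("redaction/r1"), iri("isRedactionOf"), iri("text/miracula")),
    (iri("witness/w1"), iri("isWitnessOf"), iri("redaction/r1")),
    (iri("witness/w1"), iri("isPartOf"), iri("object/ms1")),
    (iri("token/w1-t1"), iri("isPartOf"), iri("witness/w1")),
    (iri("anno/a1"), iri("annotates"), iri("token/w1-t1")),
    (iri("anno/a1"), iri("upos"), literal("AUX")),
]


def to_ntriples(triples):
    """Serialise triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


ntriples = to_ntriples(graph)
```

Because annotations are resources in their own right, an annotation on a token and one on a whole codex use the same `annotates` relation and differ only in their target.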

The canonical integration layer of the infrastructure is RDF (Resource Description Framework) (W3C, 2014). All entities—material witnesses, texts, redactions, tokens, annotations—are identified by persistent URIs (Uniform Resource Identifiers) and connected through typed relations expressed as RDF triples (Heath & Bizer, 2011). This approach enables the infrastructure to function as Linked Open Data (LOD), supporting semantic interoperability, reuse, and alignment with external datasets (Berners-Lee, 2006), and avoiding “point-to-point” integrations of platform-specific formats, as each platform is allowed to remain independently maintained while still contributing to a common data ecosystem. A reasoning engine mediates the alignment of concepts across layers, ensuring that relationships—such as identity, part–whole structures, and cross-witness correspondences—are encoded consistently. The RDF model does not aim to replace existing standards such as TEI, but to mediate between them by making relationships explicit and machine‑interpretable. XML, JSON, and IIIF resources thus remain authoritative within their respective domains, while RDF provides a shared semantic layer across disciplinary boundaries.

By grounding the infrastructure in open standards, data can be accessed through multiple mechanisms: SPARQL queries over the RDF graph (W3C, 2013), API‑based harvesting of specific layers, or reuse of original XML, JSON, and IIIF resources. This layered standards strategy supports long‑term sustainability, allows projects to adopt only those components relevant to their needs, and ensures that annotations produced within the infrastructure remain reusable beyond the lifetime of any single platform or corpus.
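The kind of combined query this enables can be illustrated without a triple store. The sketch below is a small in-memory matcher that stands in for a SPARQL basic graph pattern; it is not a SPARQL implementation, and the toy graph, the `producedAt` and `upos` properties, and the place labels are hypothetical.

```python
# Toy graph with shortened identifiers; all values are invented.
GRAPH = [
    ("w1", "isWitnessOf", "r1"),
    ("w2", "isWitnessOf", "r2"),
    ("r1", "isRedactionOf", "miracula"),
    ("r2", "isRedactionOf", "miracula"),
    ("w1", "producedAt", "placeA"),
    ("w2", "producedAt", "placeB"),
    ("t1", "isPartOf", "w1"),
    ("t1", "upos", "AUX"),
    ("t2", "isPartOf", "w2"),
    ("t2", "upos", "VERB"),
]


def match(graph, patterns, binding=None):
    """Yield variable bindings that satisfy every triple pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        candidate = dict(binding)
        consistent = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # bind the variable, or check it against an earlier binding
                if candidate.setdefault(term, value) != value:
                    consistent = False
                    break
            elif term != value:
                consistent = False
                break
        if consistent:
            yield from match(graph, rest, candidate)


# Roughly: SELECT ?w ?r ?t WHERE { ?w :isWitnessOf ?r . ?r :isRedactionOf :miracula .
#          ?w :producedAt :placeA . ?t :isPartOf ?w . ?t :upos "AUX" }
QUERY = [
    ("?w", "isWitnessOf", "?r"),
    ("?r", "isRedactionOf", "miracula"),
    ("?w", "producedAt", "placeA"),
    ("?t", "isPartOf", "?w"),
    ("?t", "upos", "AUX"),
]

results = list(match(GRAPH, QUERY))
```

A single pattern thus joins transmission structure, material metadata, and linguistic annotation, which is the combination that separate platform exports cannot offer on their own.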

2.3 Front end

To make the Linked Data infrastructure accessible and usable in practice, the envisaged system will provide both machine-oriented and human-oriented interfaces. Machine access will be supported through data dumps, APIs with varying levels of expressivity, and a constrained SPARQL endpoint, enabling programmatic querying, harvesting, and large-scale reuse of the structured data. Depending on the needs of the interface, appropriate indexes will be created on the server-side triple store to optimise performance. For human access, the infrastructure will deploy a customised SAMPO-UI instance, a user-friendly interface designed for exploratory search, faceted browsing, and the visualisation of relationships within Linked Data graphs (Hyvönen, 2022).¹¹ The interface is intended to serve as a portal to the planned SPARQL endpoint and will enable users to construct and export new corpora, inspect annotation layers, and navigate between material witnesses, transcriptions, and annotations. While it will initially be implemented for the prototype corpus (Section 3), the front end is conceived as a reusable, configurable component for any project adopting the underlying ontology and workflow. Figure 2 illustrates the overall workflow and modular architecture.

2.4 Implementation status

Funding for the development of the proposed infrastructure was secured through the Research Foundation–Flanders (FWO) in late April 2026 (M-PATCH: Multidisciplinary Platform for Annotating Textual Corpora and Histories, grant number I004326N). The project is thus currently transitioning from conceptual design to active implementation. Several core components are already available and in use, namely the annotation and capture platforms discussed in Section 2.1: Madoc, Transkribus, eScriptorium, LabelStudio, and the Tycho Brahe platform. Building on these existing tools, dedicated pipelines for structured annotation, quality control, and manual correction across all annotation layers are currently being developed. In parallel, the Ghent Centre for Digital Humanities (GhentCDH) has initiated work on the Linked Data integration layer, including the RDF data model (cf. Section 2.2), the population of the triple store, and the implementation of a SPARQL endpoint (cf. Section 2.3). These components constitute the key innovative contribution of M-PATCH, as they enable the semantic integration and joint querying of heterogeneous annotations (cf. Section 4).

3 Dataset description: prototype corpus

To evaluate the capabilities and generalisability of the proposed infrastructure, we selected a prototype dataset consisting of 189 Latin hagiographic texts from Lotharingia, encompassing saints’ lives, miracle collections, and cult narratives (T. Bauer, 1997; Brunhölzl, 1996; Herrick, 2020; Krönert, 2021; Philippart, 1994–2020) and dating from the Long Tenth Century (circa 880–1030). This corpus is particularly well suited to testing the Linked Data model for two reasons.

First, medieval hagiographic writing is a paradigm case of textual fluidity and material plurality. Numerous studies have shown that these texts were frequently rewritten, expanded, and reorganised to address new institutional, devotional, or political needs (Goullet, 2005). The 189 texts in the corpus survive in circa 1,400 manuscript witnesses, and appear in 794 printed editions. This diversity makes this a rigorous testbed for modelling variation and the factors influencing it.

Second, the corpus allows exploration of a wide range of factors. Hagiographies were widely disseminated across political, institutional, geographical, linguistic and other boundaries, over an extremely long temporal span: many texts were copied and reworked for nearly a millennium. Variation may also reflect scribal multilingualism: Lotharingia was a contact zone between Romance and Germanic linguistic areas, and Latin was no longer a spoken vernacular in this period (Barrau, 2011; Clackson & Horrocks, 2011; Wright, 2002). Finally, we see that hagiographies typically eluded the expectations of textual rigidity that applied to the reproduction of biblical texts. This means that the composition, style, spelling, syntax, and in some cases also the narrative contents of the works could be adapted to match the commissioners’ preferences and requirements. The corpus therefore provides abundant opportunities to study how language-external factors shape textual practices.

While the texts have been examined individually in historical and literary scholarship, they have not previously been treated within a unified, multidisciplinary framework. The Linked Data model enables the annotation layers (linguistic, historical, literary, and material) to inform one another. To continue with the example discussed above, the annotation of named entities might point to significant developments regarding gender that are of interest for historical analysis, while the textual or material metadata do not allow the texts to be situated safely in space and time. An analysis of the linguistic variation that draws on the linguistic annotation layers, for instance by applying and further refining methods from computational dialectology (cf. also Section 4 below), may help resolve this issue. The linking of different disciplinary layers of annotation and metadata therefore not only facilitates traditional forms of analysis but also enables new computational approaches such as semi-automated topic modelling, the detection of literary motifs, sentiment analysis, network analysis, and quantitative investigation of linguistic variation and change.

Because markables and their attributes are represented as nodes and relations in the knowledge graph, they can function both as explanatory factors and as response variables in multivariate analyses. This integrated structure supports research questions that would previously have required multiple incompatible corpora. For instance, modelling linguistic variation as a function of geographical provenance or institutional networks may reveal the formation of communities of practice (Eckert & McConnell-Ginet, 1992; Putzu, 2026). The use of Semantic Web technologies further enables linking with external resources, such as authority files, historical databases, and repositories containing named entities, events, or intertextual references. By treating each witness as an independent but interlinked object, the infrastructure enables rigorous comparative analysis across the entire textual tradition.

A substantial part of the data for the prototype corpus has already been brought together by the first and last authors prior to the integration phase. For each of the 189 texts, normalised digital transcriptions of the preferential edition (according to current disciplinary standards) are available and serve as a stable reference layer for annotation and experimentation. In addition, a relational database was compiled, containing extensive descriptive and historical metadata for the entire corpus (ca. 1,400 manuscript witnesses and 794 printed editions). Digital surrogates have been identified at scale: IIIF resources for 435 manuscripts and scans of 699 early printed editions have been located. Supplementary image material has been sourced for an additional 244 manuscripts, providing broad coverage of the material record. Of the annotation processes, Named Entity Recognition (NER) and syntactic annotation are in progress for selected sub‑corpora based on the transcribed editions, and will be incrementally expanded. Pilot transcription experiments using Transkribus, eScriptorium, and Google AI Studio have been conducted, including the training and refinement of custom HTR/OCR models. All required annotation platforms are operational. The semantic linking of these transcription, annotation, and metadata layers, together with the implementation of the Linked Data model, triple store, and SPARQL endpoint, constitutes the core objective of the recently awarded M-PATCH project, and is scheduled for development during the project’s runtime (2026–2030).

4 Applications and outlook

The infrastructure described above facilitates the creation of richly annotated corpora of historical texts that acknowledge both textual fluidity and material plurality. By adopting a non-hierarchical model of textual transmission and enabling integration of multidisciplinary annotation layers, the framework supports research that is difficult or impossible to conduct using existing mono- or duodisciplinary platforms. This approach to enriching monodisciplinary data from a multidisciplinary perspective is key to advancing the state of the art in historical text research. Unlike other models that are currently available, the infrastructure will also create an environment in which text corpora are transcribed and annotated with an a priori view to multidisciplinary annotation, which will inevitably change attitudes towards and approaches to presenting the texts, and open up new research avenues. Annotation layers may originate within a single project or be drawn from separate initiatives and time periods; in both cases, the Linked Data model encourages the reuse and integration of previously created annotations, vocabularies, and ontologies through SPARQL queries leveraging the SAMPO interface.

By relying on open standards and semantic technologies, the model enables researchers to discover connections between data types and annotation layers that would otherwise remain hidden. Machine-readable data structures allow new datasets to be integrated without extensive custom programming, reducing the overhead required for long-term sustainability and future expansion. Beyond addressing conceptual and methodological bottlenecks in the multidisciplinary study of historical texts, and beyond enabling the reuse of all available information in a transparent manner, the infrastructure opens new research avenues in computational humanities. After discussing an illustrative end-to-end use scenario, we propose three examples for future (and planned) applications.

4.1 End‑to‑end use scenario

A concrete example of a use scenario would be a collaboration between a historical linguist investigating word order variation, and a historian researching evidence of ideological, personal, and institutional networking patterns, in a body of tenth-century Latin hagiographies. In a first step, they select a set of relevant manuscript witnesses through their IIIF manifests, filtered by material characteristics such as date range, place of production, and codicological features recorded in the material metadata layer. In a next step, they retrieve the transcriptions linked to these material witnesses, and access span- and token‑level annotations via the Linked Data graph. The historian would for instance extract named entity annotations, while the linguist would do so for the linguistic annotations (lemmata, POS tags, UD relations). The crucial advantage of the infrastructure lies in its potential for cross-disciplinary enrichment: textual and material metadata can inform the linguistic analysis, e.g., as language-external factors in modelling the determinants of linguistic variation such as the formation of institutional networks or communities of practice as suggested above. Conversely, systematic patterns of linguistic variation detected across witnesses may prompt the re‑evaluation and refinement of uncertain datings, localisations, or assumptions about scribal practice encoded in the textual or material metadata. In this way, disciplinary annotations do not merely coexist but actively inform and refine one another through their integration in the shared Linked Data model. Such a mutually constraining and mutually enriching interaction between linguistic analysis, material evidence, and historical interpretation would be impractical to realise using isolated annotation platforms or flat data representations.
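The steps of this scenario can be sketched in a few lines, here with in-memory toy data in place of the Linked Data graph. Everything in the example is hypothetical: the witness identifiers, years, place labels, and manifest URLs are invented, and the measure (the share of auxiliaries that precede their head, as in fuissemus reuersi versus reversi fuissemus) is only one possible operationalisation of word order variation.

```python
# Invented material metadata; places are placeholders, not real localisations.
WITNESSES = [
    {"id": "w1", "year": 975, "place": "A",
     "manifest": "https://example.org/iiif/w1/manifest.json"},
    {"id": "w2", "year": 990, "place": "B",
     "manifest": "https://example.org/iiif/w2/manifest.json"},
    {"id": "e1", "year": 1746, "place": "C",
     "manifest": "https://example.org/iiif/e1/manifest.json"},
]

# One sentence per witness as (id, form, upos, head, deprel);
# token ids and heads are sentence-internal, as in CoNLL-U.
TOKENS = {
    "w1": [(1, "fuissemus", "AUX", 2, "aux"), (2, "reuersi", "VERB", 0, "root")],
    "w2": [(1, "fuissemus", "AUX", 2, "aux"), (2, "reuersi", "VERB", 0, "root")],
    "e1": [(1, "reversi", "VERB", 0, "root"), (2, "fuissemus", "AUX", 1, "aux")],
}


def select_witnesses(witnesses, first_year, last_year):
    """Step 1: filter witnesses by a date range from the material metadata."""
    return [w for w in witnesses if first_year <= w["year"] <= last_year]


def aux_first_rate(tokens):
    """Step 2: share of auxiliaries preceding their head; None if none occur."""
    pairs = [(tid, head) for tid, _form, _upos, head, deprel in tokens
             if deprel == "aux"]
    if not pairs:
        return None
    return sum(tid < head for tid, head in pairs) / len(pairs)


def rate_by_place(witnesses, tokens_by_witness):
    """Step 3: average the linguistic measure per place of production."""
    rates = {}
    for witness in witnesses:
        rate = aux_first_rate(tokens_by_witness.get(witness["id"], []))
        if rate is not None:
            rates.setdefault(witness["place"], []).append(rate)
    return {place: sum(values) / len(values) for place, values in rates.items()}


tenth_century = select_witnesses(WITNESSES, 900, 1000)
by_place = rate_by_place(tenth_century, TOKENS)
all_places = rate_by_place(WITNESSES, TOKENS)
```

In the envisaged infrastructure, the first step would draw on the material metadata layer and the second on the linguistic layer; the third step is where the two meet.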

4.2 Refining computational dialectology and modelling linguistic similarity

The richness and granularity of the combined annotations allow for the development of new methods in computational dialectology, including refined metrics for mapping the similarity of linguistic varieties and modelling the factors that shape patterns of variation. Because many linguistic varieties in medieval manuscript traditions are shaped by external factors—such as institutional networks, scribal communities, or the geographical proximity of scriptoria—the identification of larger-scale patterns can feed back into historical and cultural analysis. The Linked Data representation makes it possible to conduct such analyses across all witnesses, rather than on isolated case studies or arbitrarily selected subsets. An FWO postdoctoral project in the field of computational dialectology proposing to work with the prototype corpus has recently been granted and will commence working on this from October 2026 (grant number 1262627N).
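As a deliberately simple illustration of what a similarity metric over witnesses could look like, the sketch below computes the Manhattan distance between relative-frequency profiles of linguistic features. It is not the method of the project mentioned above; the feature names and counts are invented.

```python
def profile(counts):
    """Convert raw feature counts into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {feature: 0.0 for feature in counts}
    return {feature: n / total for feature, n in counts.items()}


def manhattan_distance(counts_a, counts_b):
    """Sum of absolute differences between two relative-frequency profiles."""
    a, b = profile(counts_a), profile(counts_b)
    features = set(a) | set(b)
    return sum(abs(a.get(f, 0.0) - b.get(f, 0.0)) for f in features)


# Invented counts of two word-order variants in three hypothetical witnesses.
FEATURES = {
    "w1": {"aux_first": 3, "aux_last": 1},
    "w2": {"aux_first": 3, "aux_last": 1},
    "e1": {"aux_first": 1, "aux_last": 3},
}

distance_w1_w2 = manhattan_distance(FEATURES["w1"], FEATURES["w2"])
distance_w1_e1 = manhattan_distance(FEATURES["w1"], FEATURES["e1"])
```

Pairwise distances of this kind could then be related to the geographical or institutional metadata of the witnesses, which is where the linked representation becomes useful.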

4.3 Simulation-assisted modelling of scribal and textual variation

A second application concerns the simulation and quantitative modelling of scribal behaviour and textual variation. Traditionally, manuscript transmission has been examined qualitatively, within localised chronological and geographical contexts, and based on small samples. This has obscured broader structural tendencies and the interplay of factors such as political boundaries, linguistic contact zones, and institutional networks. By training machine learning models on the linked annotations, it becomes possible to predict patterns of variation as a function of the textual, linguistic, and material characteristics encoded in the graph. This in turn enables the generation of “virtual” manuscript versions, which can be used to calibrate algorithms for stemmatology or other reconstruction tasks. Similarly, the multivariate modelling of variation can support the contextual classification of manuscripts with unknown or uncertain origins (e.g., dating, location, or scriptorium identification). A collaborative project between Sébastien de Valeriola, Steven Vanderputten, and Anne Breitbarth, using the prototype corpus and in particular its metadata, is currently being prepared.

4.4 Robustness meta-analyses of annotation practices

A third avenue concerns robustness analysis across annotation layers. The field of digital humanities lacks systematic studies evaluating the impact of annotation quality, uncertainty, and interpretative choices on downstream results (Piotrowski & Xanthos, 2020). The Linked Data infrastructure enables meta-analyses that assess how different annotation practices, or uncertainties in dating, localisation, or codicological characteristics, affect outcomes such as automated author identification, topic modelling, or stemmatic reconstruction. Applying the same quantitative methods across all attested versions of a text provides a robust means of evaluating the stability or sensitivity of the results (Herrera Malatesta & de Valeriola, 2024). Such analyses strengthen theoretical models and improve the scientific rigour and reproducibility of computational approaches. This line of research is integrated into Sébastien de Valeriola’s lab.
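
The idea of applying one method to every attested version and inspecting the spread of the results can be sketched as follows. The measure (a type-token ratio), the witness labels, and the placeholder tokens are assumptions chosen purely for illustration; any downstream method could take the measure's place.

```python
from statistics import mean

# Placeholder token sequences for three witnesses of one text (invented).
WITNESS_TOKENS = {
    "w1": ["a", "b", "a", "c"],
    "w2": ["a", "b", "c", "d"],
    "w3": ["a", "a", "a", "b"],
}


def type_token_ratio(tokens):
    """Share of distinct tokens; stands in for any downstream measure."""
    return len(set(tokens)) / len(tokens)


# Apply the same measure to every witness.
results = {w: type_token_ratio(t) for w, t in WITNESS_TOKENS.items()}

# Summarise how much the outcome depends on the witness chosen.
average = mean(results.values())
spread = max(results.values()) - min(results.values())

print(results, average, spread)
```

A small spread would suggest that conclusions drawn from a single witness are stable, whereas a large spread signals that the choice of witness (or of annotation practice) materially affects the outcome.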

In addition to enabling these research applications, the infrastructure feeds back into the broader Semantic Web ecosystem. By developing reusable ontological patterns for historical textual data, the project contributes to community standards that can be adopted across corpora and disciplines. As new modules and annotation layers are added, the model can be extended to support a wide range of research questions, encouraging cumulative and interdisciplinary knowledge production. The framework thus provides not only a solution for the multidisciplinary study of medieval textual traditions but also a general blueprint for the construction of semantically interoperable corpora in the humanities.

Notes

[1] E.g., Biblia Medieval: https://corpus.bibliamedieval.es/; Homer Multitext Project: https://www.homermultitext.org/; Parzival-Projekt: https://parzival.unibe.ch/parzdb/index.php.

[2] For an exception, see the Wiedererzählen im Norden (WiN) project (https://www.uni-goettingen.de/de/607371.html), which provides a parallel corpus of Early New High German and Middle Low German translations of early modern printed narrative texts with annotations for linguistic, literary, and translation studies, but without images of the prints.

[3] https://madoc.digirati.com.

[4] https://www.transkribus.org and https://gitlab.com/scripta/escriptorium.

[5] https://labelstud.io.

[6] https://www.tycho.iel.unicamp.br/search/.

[7] https://universaldependencies.org/format.html.

[8] https://portail.biblissima.fr/en.

[9] https://digital-scriptorium.org.

[10] https://www.cidoc-crm.org.

[11] https://seco.cs.aalto.fi/tools/sampo-ui/.

Acknowledgements

We thank François De Vriendt for facilitating access to the archives of the Bollandist Society, and Barbara McGillivray for encouraging us to write this paper.

Author Contributions

Anne Breitbarth: conceptualisation, writing – original draft, review & editing, data transcription & annotation, figures, code.

Julie M. Birkholz & Joren Six: technical conceptualisation, writing – review & editing.

Steven Vanderputten: project leader, conceptualisation, writing – original draft, review & editing, metadata logging & analysis.