1 Context and motivation
1.1 The Promise and Challenge of LLMs in Digital Scholarly Editing
The emergence of Large Language Models (LLMs) has sparked polarised reactions within the Digital Humanities (DH) community. Sceptics dismiss these technologies as unreliable black boxes, unsuitable for rigorous scholarly work, while enthusiasts herald them as transformative tools that could reshape humanities research practices. Between these extremes lies a crucial need for systematic, empirical evaluation that moves beyond subjective assessment and anecdotal evidence to understand the actual capabilities and limitations of LLMs for scholarly tasks.
Text annotation using the community standard of the Text Encoding Initiative (TEI) represents a particularly compelling test case for assessing LLM capabilities in DH workflows. The labour-intensive nature of TEI annotation has long been a critical bottleneck for digital edition projects, making reliable automation assessment essential for responsible resource allocation. Text encoding demands not merely rule-based pattern matching, but genuine semantic understanding – for instance, recognising the communicative function of a letter’s closing formula, or disambiguating historical person names across variant spellings. Recent advances in contextual reasoning and semantic interpretation across multilingual content suggest that these models may finally possess the contextual awareness needed to support annotation tasks requiring semantic understanding (Brown et al., 2020; Scholger et al., 2024; Wei et al., 2022), including historical documents with additional orthographical, terminological, and code-switching challenges (McGillivray et al., 2020).
Recent Natural Language Processing (NLP) research demonstrating LLMs’ potential for annotation tasks (Ding et al., 2023; Gilardi et al., 2023) suggests these capabilities might also transfer to structured scholarly encoding required for digital editions. Additionally, initial experiments within the DH community in the past two years have demonstrated the feasibility of LLM-assisted automated XML markup (DeRose, 2024), knowledge extraction from TEI editions (De Cristofaro & Zilio, 2025; Santini, 2024), and TEI encoding for specific projects and document types (Pollin et al., 2023; Scholger et al., 2024; Strutz, 2025; Pollin et al., 2024; Pollin et al., 2025). These preliminary investigations confirm that LLMs can generate structurally valid TEI markup and identify semantic features such as named entities and document structures.
However, LLMs also exhibit behaviours that conflict with scholarly editing principles. Content alteration, including the normalisation of historical spelling through adaptation to modern orthography and the insertion of plausible, yet fictitious text, compromises source fidelity. Systematic biases towards modern language conventions can erase linguistic features that might be the actual research focus. The inconsistent application of encoding decisions undermines the methodological consistency required by scholarly editions.
Current experiments evaluate effectiveness through manually crafted success-rate evaluations (De Cristofaro & Zilio, 2025) or focus on isolated subtasks such as Named Entity Recognition (NER) or categorisation, where single quantitative metrics like F1 scores suffice (Pagel et al., 2024; Rastinger, 2024). The Humanities Data Benchmark1 covers only image-processing tasks – metadata extraction and Optical Character Recognition (OCR)/Handwritten Text Recognition (HTR) – while omitting TEI encoding or similar text-to-structure evaluations. Without systematic assessment, basic questions remain unanswered: Which aspects of TEI encoding can LLMs reliably perform? Where do they consistently fail? Addressing these questions first requires defining what constitutes TEI encoding quality and how to assess it at scale while maintaining sensitivity to the interpretive flexibility inherent in scholarly editing.
1.2 The Evaluation Gap
The challenge lies not merely in generating TEI encodings with LLMs, but in systematically assessing their quality – a task complicated by both the technical characteristics of TEI and the tension between quantitative and qualitative traditions. TEI encoding presents unique evaluation challenges that distinguish it from typical natural language processing tasks. First, TEI’s hierarchical XML structure creates cascading dependencies, where a single syntactic error can invalidate an entire document, yet well-formed XML provides no guarantee of semantic appropriateness. Second, TEI deliberately accommodates interpretive multiplicity: the Guidelines explicitly permit multiple valid encoding approaches for identical textual phenomena, reflecting diverse scholarly perspectives and project-specific methodologies (Cummings, 2019). A person’s name might legitimately appear as <persName>, <name type="person">, or even <rs type="person"> depending on editorial approach – variations that computational metrics might penalise as errors despite representing acceptable scholarly choices. Third, historical documents introduce linguistic complexities including non-standard orthography, period-specific vocabulary, and multilingual code-switching (McGillivray et al., 2020) that challenge both automated processing and evaluation.
These complexities reflect a methodological divide between computational and humanities evaluation cultures. Standard NLP evaluation approaches, as surveyed by Chang et al. (2023) and Guo et al. (2023), emphasise quantitative metrics (precision, recall, F1 scores2) applied against gold standard annotations treated as singular ground truth.3 Such approaches excel at scalability and reproducibility, but often assume flat annotation structures and focus on one aspect only. Conversely, digital scholarly edition evaluation, grounded in the traditions articulated by Henny (2018) and applied through review frameworks like Review of Infrastructure for Digital Editions (RIDE) (Sahle, 2014), privileges qualitative expert judgement, methodological transparency, and contextual appropriateness. Although sensitive to interpretive nuance and editorial variation, this approach lacks scalability. As Dobson (2020) argues, computational methods in humanities research must prioritise interpretability over technical performance metrics alone.
However, this evaluation gap has practical consequences: without reliable assessment frameworks, DH projects cannot determine whether LLM assistance offers productivity gains or introduces unacceptable quality risks. This paper addresses this issue by developing a stratified framework for assessing LLM-generated TEI encoding. The framework uses systematic, quantitative metrics informed by humanities scholarship principles. First, we establish a taxonomy that decomposes TEI encoding into subtasks ranging from syntactic XML generation through semantic recognition to interpretive processing. This stratification enables targeted evaluation strategies matched to task characteristics, identifying which aspects can be assessed through automated validation and which require expert human oversight. Second, we develop and implement evaluation metrics that translate TEI encoding quality criteria into measurable indicators looking at multiple complementary aspects. These metrics leverage computational approaches, while embedding humanities values through relaxed matching algorithms4 that accommodate legitimate interpretive variation and error weighting that reflects scholarly priorities. The framework also explores whether these metrics can be synthesised into composite quality scores. Third, we demonstrate the framework’s practical applicability through case study validation on a challenging historical corpus: the Joseph von Hammer-Purgstall correspondence, comprising multilingual 18th–19th century letter transcriptions. By evaluating four diverse models (GPT-5-mini, Claude Sonnet 4.5, Qwen3, OLMo2) across 100 representative documents, we provide systematic cross-model comparison, identify reliable automation targets and persistent failure patterns, and establish metrics that might be used as quantitative benchmarks for TEI encoding.
2 Method
2.1 Task Taxonomy
DH literature often categorises annotation through the lens of the hermeneutic process – Franken et al. (2020) distinguish between ‘process-oriented’ (exploratory, theory-generating) and ‘product-oriented’ (consistent, category-applying) annotation. General scholarly editing models (Pollin et al., 2025) describe pipelines evolving from raw transcription to annotated text, but identify only NER and entity linking as concrete annotation tasks. However, evaluating LLM performance requires a different perspective: rather than treating TEI encoding as monolithic, we decompose the encoding process and product into ‘measurable’ sub-tasks that LLMs perform to transform plain text transcriptions into scholarly TEI XML. We identify eight task dimensions (numbered 0–7), each aligned with an evaluation dimension (see Table 1). Although demonstrated for correspondence encoding, these categories generalise to other textual genres.
Table 1
Task Taxonomy for TEI Encoding.
| DIMENSION | TASK CATEGORY | ENCODING TASKS |
|---|---|---|
| 0 | Format Conversion | Transforming plain text into valid XML |
| 1 | Source Preservation | Preserving evidence of the source’s textual characteristics |
| 2 | Schema Application | Selecting and applying appropriate TEI elements and attributes according to TEI P5 Guidelines and project-specific constraints |
| 3 | Structural Markup | Constructing document scaffolding: segmenting texts into structural units (e.g., <div>, <opener>, <closer>, paragraph boundaries), and ensuring correct hierarchy and ordering |
| 4 | Semantic Markup | Annotating meaning-bearing spans and editorial phenomena, including named entities, temporal expressions, discourse markers, etc. |
| 5 | Contextual Enrichment | Linking entities to authority records, resolving references, and normalisations |
| 6 | Metadata Management | Extracting and normalising descriptive or administrative metadata from sources, and enriching records with external information |
| 7 | Collection Management | Maintaining consistent encoding depth and conventions across documents, monitoring quality drift, and checking interoperability standards |
The task taxonomy is not meant to be a strict classification, but guides understanding of encoding complexity, and addresses both required LLM capabilities and systematic failure modes. Consider encoding a salutation in a letter ‘I remain, Sir, yr humble servant’: an LLM might successfully identify the salutation semantically (demonstrating Dimension 3 and 4 capability) yet fail to generate valid XML syntax with an unclosed tag like <salute>I remain, Sir, yr humble servant (Dimension 0 failure). Alternatively, it might produce well-formed XML but expand the abbreviation ‘yr’ to ‘your’ (Dimension 1 error), or select an inappropriate element like <salutation> instead of the valid TEI element <salute> (Dimension 2 failure). Even with correct syntax and schema compliance, the model might incorrectly identify salutation boundaries, encoding only ‘yr humble servant’ instead of the full ‘I remain, Sir, yr humble servant’ (Dimension 4 failure, but debatable depending on the project-specific guidelines). This multi-dimensional approach reflects the inherent complexity of TEI encoding, where technical correctness, textual and structural fidelity, semantic accuracy, interpretive appropriateness, and systematic consistency must all be assessed to determine encoding quality.
2.2 Evaluation Framework Design
Having established what LLMs must encode (Section 2.1), we now address how to systematically evaluate automated encoding performance. Rather than applying uniform metrics across all dimensions, the framework matches evaluation approaches to task characteristics: Dimensions 0–2 employ fully automated, reference-free validation. Dimensions 3–4 use computational metrics with ground truth comparison. Dimensions 5–6 require expert-centred review, while Dimension 7 relies on automated statistical analysis. This paper discusses only the first five dimensions, as the current use case processes input text without adding external knowledge. Each dimension description addresses: scope (what aspects are evaluated), methods (how evaluation is conducted), justification (why we use automated or reference-based assessment), metrics (how performance is quantified), and limitations (what are the inherent boundaries of the approach).
2.2.1 Dimension 0: Syntactic Validation
Dimension 0 establishes the technical foundation for all following evaluation dimensions by assessing XML well-formedness – proper tag structure, nesting, and character encoding – as errors block subsequent processing. The evaluation employs an XML parser for validation, supplemented by comprehensive error categorisation (tag structure, character encoding, attribute syntax, document structure) when documents are malformed. Additionally, normalised string comparison accounts for legitimate XML transformations (character entity encoding, Unicode normalisation, whitespace handling), while detecting substantive alterations. Syntactic processing enables fully automated assessment, requiring no domain expertise. The dimension reports a binary pass/fail status, with error categorisation distinguishing files correctable via formatting tools from those requiring manual intervention. However, well-formed XML guarantees neither project rule adherence, nor structural validity or semantic appropriateness, and cascading parse errors may require iterative debugging.
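As a minimal illustration of this gate, well-formedness checking and normalised string comparison can be sketched with Python's standard library (the function names are our own, not the framework's API):

```python
import unicodedata
import xml.etree.ElementTree as ET

def check_wellformed(xml_string):
    """Dimension 0 gate: return (True, None) if the XML parses,
    otherwise (False, parser error message)."""
    try:
        ET.fromstring(xml_string)
        return True, None
    except ET.ParseError as err:
        return False, str(err)

def normalised(text):
    """NFC-normalise and collapse whitespace, so that legitimate XML
    transformations do not count as substantive alterations."""
    return " ".join(unicodedata.normalize("NFC", text).split())

ok, err = check_wellformed("<salute>I remain, Sir, yr humble servant")
# the unclosed <salute> tag fails the validity gate (ok is False)
```

A production validator would additionally categorise the parser errors (tag structure, character encoding, attribute syntax, document structure) as described above.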
2.2.2 Dimension 1: Source Fidelity Evaluation
Dimension 1 evaluates a fundamental DH requirement: fidelity to the source text, detecting alterations to input, e.g., manuscript transcriptions. This assessment addresses documented LLM tendencies towards hypernormalisation and hallucination. The evaluation extracts plain text content from the original transcription and the LLM-generated output, normalises XML entities and Unicode (NFC), and removes whitespace, then computes similarity ratios between these processed strings. This is complemented by a character-level diff analysis, categorising alterations by operation type (substitutions, insertions, deletions). Preserving input text is critical for scholarly editions, where historical orthography constitutes research data that LLMs often misinterpret as errors requiring correction. Beyond binary pass/fail status, we report text similarity scores, alteration type analysis, and detailed content differences. However, even high similarity scores (98–99%) may mask critical alterations, as the evaluation cannot distinguish substantive scholarly concerns from minor punctuation variations that might represent acceptable quality trade-offs given potential automation time savings.
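A simplified version of this pipeline, using Python's difflib (the actual framework may differ in its normalisation details), could look like:

```python
import re
import unicodedata
from difflib import SequenceMatcher

def source_fidelity(source_text, llm_text):
    """Compare NFC-normalised, whitespace-free strings and return the
    similarity ratio plus the non-equal edit operations (substitutions,
    insertions, deletions) for the alteration-type analysis."""
    def prep(s):
        return re.sub(r"\s+", "", unicodedata.normalize("NFC", s))
    a, b = prep(source_text), prep(llm_text)
    matcher = SequenceMatcher(None, a, b)
    alterations = [(op, a[i1:i2], b[j1:j2])
                   for op, i1, i2, j1, j2 in matcher.get_opcodes()
                   if op != "equal"]
    return matcher.ratio(), alterations

ratio, alterations = source_fidelity("yr humble servant", "your humble servant")
# the silent expansion of 'yr' appears as an insertion and lowers the ratio
```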
2.2.3 Dimension 2: Schema Compliance and Standard Usage Evaluation
Dimension 2 validates conformity to TEI standards and project-specific customisations. This includes TEI P5 schema compliance (valid elements, attributes, nesting) and/or project-specific constraint validation where applicable. Together, these ensure community and cross-project interoperability as well as project-internal consistency. Schema validation employs, for instance, the Jing RelaxNG validator5 for rule-based structural assessment, following a decision logic that accommodates diverse project approaches: standard TEI projects undergo TEI P5 validation only; extended TEI receives both TEI P5 and project validation; projects with customised schemas deviating from TEI P5 receive project validation only. This automated assessment reveals whether LLMs possess TEI knowledge and are capable of applying rules from prompts. The evaluation reports binary compliance status (valid/invalid). Detailed violation breakdowns are currently unavailable as validators either suppress subsequent errors (Jing validator) or have issues with the error reporting quality.6 However, schema compliance validates only general structural requirements. The actual document structure of the input text can only be assessed against a ground truth. Also, schema validation alone cannot ensure semantic appropriateness – that is, whether the chosen TEI elements accurately reflect the meaning and structure observed in the annotated text. This makes schema validation a necessary but insufficient condition for assessing encoding quality.
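The decision logic and the external validator call might be sketched as follows (the mode names are illustrative; Jing is invoked as a Java jar and returns exit code 0 for valid documents):

```python
import subprocess

def applicable_schemas(mode):
    """Map a project's validation mode to the schemas to run,
    following the decision logic described above."""
    return {
        "standard_tei": ["tei_p5"],              # TEI P5 validation only
        "extended_tei": ["tei_p5", "project"],   # both TEI P5 and project schema
        "customised": ["project"],               # project validation only
    }[mode]

def validate_with_jing(schema_path, xml_path, jing_jar="jing.jar"):
    """Run the Jing RelaxNG validator as an external process
    (assumes Java and a local jing.jar); exit code 0 means valid."""
    result = subprocess.run(
        ["java", "-jar", jing_jar, schema_path, xml_path],
        capture_output=True, text=True)
    return result.returncode == 0, result.stdout
```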
2.2.4 Dimension 3: Structural Fidelity Evaluation
While schema validation (Dimension 2) ensures valid nesting structures, it cannot tell whether the LLM has identified the correct quantity or positioning of elements. This represents a first level of semantic correctness, as structural encoding reflects interpretation of the communicative functions. Dimension 3 evaluation extracts hierarchical element sequences from LLM output and reference files, and applies a longest common subsequence (LCS) algorithm to compute structural similarity. This flags omissions, over-identification, and structural normalisation tendencies, where models impose conventional patterns onto non-standard arrangements (e.g., relocating <dateline> from <opener> to <closer>). Such changes also raise Dimension 1 issues, but remain undetected in Dimension 2 as they are still schema-compliant. We rejected XMLDiff and Tree Edit Distance as their operation-based scoring punishes element boundary shifts twice (deletion and reinsertion), even for defensible editorial choices like variations in salutation boundaries. Instead, our LCS metric is order-preserving, length-normalised but boundary-tolerant: elements retaining the same relative order receive credit even if intermediate elements differ. This keeps the focus on hierarchical similarity, leaving content divergences to Dimension 4. The evaluation also provides a per-element-type report identifying systematically missing or over-identified components, and XMLDiff operation categorisation to identify systematic LLM behaviours. However, this dimension requires reference encoding for comparison, limiting scalability, and cannot distinguish legitimate editorial variations or over-identification from encoding errors without project-specific context.
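The order-preserving, length-normalised similarity can be sketched as a classic LCS over depth-first element-tag sequences (the normalisation shown here, 2·LCS/(|a|+|b|), is one reasonable reading of the description above, not necessarily the framework's exact formula):

```python
def lcs_length(a, b):
    """Dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def structural_similarity(llm_tags, ref_tags):
    """Length-normalised LCS: elements in the same relative order get
    credit even if intermediate elements differ."""
    if not llm_tags and not ref_tags:
        return 1.0
    return 2 * lcs_length(llm_tags, ref_tags) / (len(llm_tags) + len(ref_tags))

ref = ["div", "opener", "dateline", "salute", "p", "closer"]
llm = ["div", "opener", "salute", "p", "closer", "dateline"]
# a relocated <dateline> costs one element of overlap, rather than the
# double deletion-plus-reinsertion penalty of tree edit distance
```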
2.2.5 Dimension 4: Semantic Content Accuracy Evaluation
Dimension 4 evaluates the element-content relationship. It assesses whether LLMs associate appropriate text content with element types through substring analysis and reference comparison, examining the element content rather than its position. The evaluation automatically identifies all text-bearing elements in both LLM output and manually crafted reference files. The comparison workflow extracts elements by type using XPath queries, then applies a best-match pairing: for each LLM element, the system finds the best content match based on substring relationships, avoiding duplicate pairings. The element-level content matching identifies four match types: exact match (identical content after normalisation), over-inclusion (LLM content contains reference plus additional text), under-inclusion (LLM content captures substring of reference), and no match (no substring relationship exists). Since element boundaries are often flexible, the evaluation flags all substring relationships as practical matches and reports inclusion/coverage ratios. As a pragmatic approach to distinguish between boundary errors and hallucinations, cross-dimensional context is incorporated. When content is preserved in Dimension 1, partial matches likely indicate boundary discrepancies; otherwise, they may signal hallucinations. An adjusted semantic score then applies weights for these practical matches appropriately (0.8 for preserved content, 0.6 for altered content). Metrics include exact match rate and practical match rate for elements exhibiting substring relationships, calculated only for elements present in both files, since missing elements are captured in Dimension 3. The report also shows inclusion ratios and coverage ratios, quantifying partial match quality, and per-element-type analysis identifying systematically challenging elements. A drawback of this dimension is that again, reference encodings are required. 
Moreover, substring analysis cannot detect semantic inappropriateness, where content is technically present but contextually wrong; human-in-the-loop review therefore remains necessary to assess reliably whether contextual understanding translates into correct component identification and boundary recognition.
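The four match types reduce to substring checks on normalised content; a minimal sketch (normalisation simplified relative to the framework):

```python
import unicodedata

def classify_match(llm_content, ref_content):
    """Classify the content relationship between a paired LLM element
    and its reference counterpart."""
    def norm(s):
        return " ".join(unicodedata.normalize("NFC", s).split())
    a, b = norm(llm_content), norm(ref_content)
    if a == b:
        return "exact"
    if b in a:
        return "over-inclusion"   # LLM span contains reference plus extra text
    if a in b:
        return "under-inclusion"  # LLM span captures only part of the reference
    return "no-match"

classify_match("your humble servant", "I remain, Sir, your humble servant")
# an under-inclusion: with D1 content preservation confirmed, this reads
# as a boundary discrepancy (weight 0.8) rather than a hallucination
```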
2.2.6 Proposed Composite Scoring
While the established evaluation framework identifies key observable dimensions, this section introduces a tiered scoring system that combines individual dimension assessments into a composite quality metric (Figure 1). This accounts for blocking failures, and applies penalties or adapted weighting considering post-processing difficulty.

Figure 1
Evaluation framework and tiered scoring, with D0 as validity gate, D1–D4 contributing to the final score via the core quality score (equal 33.33% weighting of D1, D3, D4), and D2 applying a multiplicative adjustment factor. D1 also provides context-aware weighting for D4.
Tier 0: Validity Gate. Files failing XML well-formedness validation (D0) receive the status INVALID with no numeric score, as malformed XML cannot be processed by standard parsers. This blocking condition reflects technical necessity, rather than quality judgement.
Tier 1: Core Quality Score. For well-formed files, three dimensions contribute equally to core quality assessment (0–100 scale):
Source Fidelity (D1): 33.33% – Measures fidelity to source transcription
Structural Accuracy (D3): 33.33% – LCS-based hierarchical similarity
Semantic Accuracy (D4): 33.33% – Pragmatic content matching using Macro F17
The D4 component employs context-aware weighting: when D1 confirms content preservation, practical matches (partial overlaps) receive weight 0.8, interpreted as boundary detection issues; when source fidelity fails, practical matches receive weight 0.6, reflecting reduced confidence against potential hallucinations. This cross-dimensional interpretation distinguishes boundary errors from potential hallucinations.
The core quality score is calculated as:

Core Quality Score = (1/3) × D1 + (1/3) × D3 + (1/3) × D4
The equal weighting reflects a principled starting point: without empirical evidence favouring specific weightings for TEI encoding, uniform distribution avoids unjustified bias toward any dimension, while providing a baseline that projects can adapt to specific editorial priorities. Future work should empirically validate whether alternative weightings better predict human quality judgements.
Tier 2: Schema Compliance Adjustment. Schema validation (D2) results apply a multiplicative adjustment reflecting post-processing difficulty. User-configured validation modes (TEI-only, project-specific, both, or none) determine the applicable schema. Compliance factors are: full compliance = 1.0 (baseline requirement), non-compliance = 0.75 (25% penalty). Unlike D0 failures, which render documents entirely unprocessable, schema-invalid files remain parseable and retain valuable structural and semantic information. The 25% penalty acknowledges that schema violations require manual intervention, while recognising that schema violations – like using non-TEI element names – already cascade into D3 and D4 scores: misnamed elements reduce structural similarity and semantic accuracy (elements matched by type find no reference counterparts). The 25% adjustment therefore signals schema non-compliance without over-penalising errors already captured in core dimensions.
The final overall score is:

Final Score = Core Quality Score × D2 Compliance Factor
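Under these definitions, the tiered computation can be sketched end-to-end (the signature and the aggregation of D4 exact and practical match rates are illustrative simplifications of the framework's reports):

```python
def final_score(d1, d3, d4_exact, d4_practical, source_faithful, schema_valid):
    """Tier 1 core quality (equal thirds of D1, D3, D4 on a 0-100 scale)
    followed by the Tier 2 multiplicative schema adjustment.
    All dimension inputs are rates in [0, 1]; files failing the D0
    validity gate never reach this function."""
    practical_weight = 0.8 if source_faithful else 0.6  # D1-conditioned D4 weighting
    d4 = d4_exact + practical_weight * d4_practical
    core = (d1 + d3 + d4) / 3 * 100
    return core * (1.0 if schema_valid else 0.75)

final_score(d1=1.0, d3=1.0, d4_exact=1.0, d4_practical=0.0,
            source_faithful=True, schema_valid=False)
# a perfect core score still loses 25% to schema non-compliance
```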
2.3 Implementation Strategy
The framework uses a modular Python architecture organised into four core layers: dimension-specific evaluation modules, data models with standardised result and error structures, reporting modules, and utility components (namespace handling, XML parsing via lxml, schema validation via Jing, etc.). Each dimension operates independently, with standardised interfaces, enabling selective evaluation based on project requirements. The architecture supports genre-agnostic assessment through centralised path configuration for input, reference, schema, and output directories. Bundled resources include TEI P5 schemas and the Jing validator. Output generation provides multi-format reporting with detailed error logs: JSON for programmatic processing, and Excel for human analysis.
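The standardised interfaces might, for example, centre on a shared result structure such as this hypothetical sketch (field names are ours, not the released API):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DimensionResult:
    """Standardised per-dimension result exchanged between evaluation
    modules and the reporting layer (illustrative field names)."""
    dimension: int                 # 0-4 in the current use case
    passed: bool                   # binary pass/fail status
    score: Optional[float] = None  # e.g. a similarity ratio; None for D0
    errors: list = field(default_factory=list)  # categorised error messages

result = DimensionResult(dimension=0, passed=False)
result.errors.append("tag structure: unclosed element")
```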
The complete framework, including documentation and example datasets, is available as open-source software under an MIT licence at https://github.com/strubrina/tei-evaluation, supporting community adoption, refinement, and extension to diverse digital edition projects.
3 Empirical validation: The Hammer-Purgstall case study
To demonstrate the diagnostic capabilities for identifying encoding issues, we conduct a systematic evaluation of LLM-generated TEI encoding for the Hammer-Purgstall correspondence.
3.1 Dataset Description
| FIELD | DETAILS |
|---|---|
| Repository location | |
| Repository name | Zenodo |
| Object name | Hammer-Purgstall Correspondence TEI Evaluation Dataset |
| hpe-correspondence-metadata.json | JSON correspondence metadata (language, sender, recipient, date, letter length, etc.) for the main sample of 100 letters |
| hpe-correspondence-transcriptions.zip | Plain text files with letter transcriptions |
| hpe-correspondence-tei-reference.zip | TEI XML files with manually crafted TEI annotations |
| hpe-correspondence-llm-encodings.zip | TEI XML files with LLM-generated encodings from four models |
| hpe-correspondence-evaluation.zip | Evaluation results in JSON format with aggregate Excel reports |
| hpe-prompt-templates.zip | Five prompt scenarios for LLM processing, including encoding instructions & few-shot samples |
| Format names and versions | Plain text (UTF-8), TEI XML (P5), JSON |
| Creation dates | 2025-10-01 to 2025-11-17 |
| Dataset creators | Sabrina Strutz (dataset compilation, reference TEI encoding), Alexandra Wagner (reference TEI encoding) |
| Language | German (primary), English, French, Italian |
| License | CC-BY 4.0 |
| Publication date | 2025-11-18 |
Description
The evaluation dataset derives from the correspondence of Joseph von Hammer-Purgstall (1774–1856), an Austrian orientalist, historian, and diplomat whose letter exchange comprises over 8,500 handwritten documents spanning six decades (1790–1850). The correspondence exhibits significant linguistic diversity, containing letters primarily in German alongside English, French, and Italian, with instances of code-switching (including Latin, Greek, and Arabic text segments) reflecting the multilingual scholarly networks of early 19th-century Europe. More than 3,400 letters have been transcribed and edited for print publication (Höflechner, 2021), providing a substantial corpus for digital edition development.
From this corpus, a representative sample of 100 letters was selected through systematic, stratified sampling addressing three prioritised criteria: (1) language distribution – ensuring proportional representation of the main letter languages; (2) writer diversity – including letters from multiple correspondents to capture varied writing conventions and structural patterns rather than idiosyncrasies of single individuals; (3) letter length variation – balancing shorter and longer documents (measured by an estimated token8 count of letter body content). The 8,692 catalogue entries were reduced to 2,984 processable letters through exclusion of untranscribed documents, non-letter materials, and letters with graphic elements requiring separate OCR/HTR processing beyond current scope. Statistical analysis validated sample representativeness across language distribution, writer diversity, and text length. XPath queries on the reference encodings supported the analysis of certain structural features (dateline positioning, different combinations of inline or offset salute elements, presence of address and/or postscript elements, etc.). This confirmed that both the 10-letter and the 100-letter sample capture similar structural diversity.
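Proportional stratified sampling over a single criterion can be illustrated as follows (the study stratified jointly over language, writer, and length; this sketch shows the language criterion only, with illustrative field names and quotas):

```python
import random
from collections import defaultdict

def stratified_sample(letters, k, key, seed=0):
    """Draw k letters with per-stratum quotas proportional to stratum size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for letter in letters:
        strata[key(letter)].append(letter)
    sample = []
    for group in strata.values():
        quota = max(1, round(k * len(group) / len(letters)))
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample[:k]

catalogue = [{"lang": "de"}] * 70 + [{"lang": "fr"}] * 20 + [{"lang": "en"}] * 10
picked = stratified_sample(catalogue, 10, key=lambda letter: letter["lang"])
# proportional quotas: 7 German, 2 French, 1 English
```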
The dataset comprises four components supporting the evaluation: (1) Input component – 100 plain text letter transcriptions (UTF-8) that were cleaned of any editorial interventions and footnotes; (2) Reference component – 100 manually encoded TEI XML (P5) reference annotations, following TEI Guidelines for correspondence as well as a customised project schema, serving as evaluation baseline; (3) Output component – LLM-generated TEI XML from four models totalling 400 encodings; (4) Evaluation component – assessment results across all framework dimensions in JSON format with aggregate Excel reports and comparative visualisations. This multi-component structure supports reproducible evaluation, cross-model comparison, and framework validation, while enabling community reuse for benchmark development. The released dataset and evaluation code support testing different prompting strategies – modifying encoding rules and/or few-shot9 examples – to investigate the sensitivity of LLMs to these variations. Further LLMs not included in the current evaluation could also be assessed. A more labour-intensive reuse scenario would involve applying the evaluation framework to other text genres with distinct microstructures, defined in the TEI Guidelines (e.g., dictionary entries). However, this requires defining new encoding rules and examples, as well as manually encoding reference annotations for the target genre.
3.2 Experimental Design
The empirical validation employed a two-stage approach: an initial 10-letter sample for prompt refinement and framework validation, followed by comprehensive evaluation across the full 100-letter dataset. This incremental methodology reduced computational costs associated with large-scale LLM processing, while validating the methodology before full dataset evaluation.
To identify optimal prompt-model pairings, the preliminary sample was tested with five prompt configurations with varying encoding rule complexity (simple vs. detailed) and few-shot example set sizes (two to five input-output pairs). Since letter structure encoding is relatively straightforward, we deliberately complicated the task by instructing LLMs to allow salute elements within paragraphs. This addresses a known TEI limitation: <salute> and <p> are both block-level elements that currently cannot be nested (Forney et al., 2020), and existing workarounds are inadequate for inline salutations embedded mid-sentence or within paragraphs. In addition, we asked the models not to use <ab> as a wrapper for <address>.
Model selection balanced state-of-the-art performance, openness, and academic reproducibility, as well as benchmark proxies. We examined leaderboards like LLM Arena10 for coding and overall capability, or StructEval-T11 for structured output benchmarks (Yang et al., 2025). The latter served only as a selection proxy; actual generation used standard text completion without structured output enforcement, allowing D0 results to reflect inherent XML generation capability. For the openness aspect, we consulted the European Open Source AI Index,12 which catalogues the most open models. Based on these criteria, we selected two proprietary frontier models – not the flagship offerings (GPT-5, Claude Opus), but capable, cost-manageable alternatives suitable for research budgets – and two open-weight models. Since open-source models are known to lag behind their proprietary counterparts (Somala & Emberson, 2025), we do not expect them to outperform the frontier models. Their inclusion addresses a practical question for DH projects: how suitable are open alternatives when budget constraints, data privacy requirements, or institutional policies preclude proprietary APIs? The selection includes: GPT-5-mini, a less expensive, mid-sized frontier model from OpenAI’s GPT series, which are currently the best proprietary models for format conversion tasks; Claude Sonnet 4.5, a strong yet cost-manageable alternative to Claude Opus, and a leading model for coding and overall capability; the quantised13 Qwen3-14B-Q6 model combining novel architecture, scale, and competitive leaderboard performance among open-source systems; and another quantised open model, namely OLMo2-32B-instruct-Q4, with the highest openness index in the European Open Source AI Index. Both local models were run on a GPU with 24GB VRAM.
Detailed information on model versions and configurations is available in the processing metadata within the evaluation results deposited in the Zenodo repository.14
4 Results and discussion
4.1 Prompt Refinement Evaluation & Empirical Application
The performance of the model-prompt pairings – visualised across five runs in Figures 2, 3, 4, 5 – reveals clear capability stratification. Each visualisation provides a three-tiered assessment: (1) a validation summary (top left), tracking the number of letters that are well-formed (D0), schema-valid (D2), completely source-faithful (D1), and structurally identical to the gold standard XML trees (D3); (2) a performance scoreboard (top right) with mean percentages for source fidelity (D1), structural similarity (D3), and semantic content accuracy (D4) – all contributing to the final score; and (3) a score distribution matrix (bottom), illustrating file frequency across performance deciles, which indicates how closely LLM-generated encodings overlap with the ground truth.
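The first two tiers of this assessment can be approximated with standard XML tooling. The sketch below (function names are our own, and project-specific schema validation against the TEI customisation is omitted) checks well-formedness (D0) and whitespace-normalised source fidelity (D1):

```python
import xml.etree.ElementTree as ET

def check_d0(xml_string):
    """D0: well-formedness -- can the generated output be parsed at all?"""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

def check_d1(xml_string, source_text):
    """D1: source fidelity -- does the concatenated text content of the
    encoding reproduce the transcription (whitespace-normalised)?"""
    root = ET.fromstring(xml_string)
    extracted = " ".join("".join(root.itertext()).split())
    return extracted == " ".join(source_text.split())
```

An unescaped ampersand, for instance, already fails at D0, which is one reason malformed files cascade into the later dimensions.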
The overall score distribution in Figure 2 shows that OLMo’s results highlight critical prompt limitations: the model failed to generate evaluable letters under five-shot prompting conditions and exhibited high failure rates under the other prompt scenarios as well. For D0, it successfully processed nine letters under two-shot conditions, but yielded no valid XML with five-shot prompts, as these exceeded the context window.15 In D2, by contrast, OLMo generally benefited from input-output pairs, as evidenced by its failure to produce a single schema-valid TEI when provided only with encoding rules. Moreover, it generated no outputs with perfectly maintained source text (D1) or expected XML structure (D3), even under its optimal two-shot configuration.

Figure 2
OLMo2 processings with five different prompt configurations (10-letter sample).
For Qwen (Figure 3), which consistently produced well-formed XML across all prompt conditions, few-shot examples proved more influential than detailed encoding rules for ensuring schema adherence – achieving eight to ten valid letters versus only three to five in zero-shot configurations. Qwen demonstrated strong source fidelity, achieving a minimum of seven perfect matches in most runs. Interestingly, the run with the highest source fidelity achieved the lowest overall performance. Furthermore, XML trees exhibited higher overlap with the gold standard when few-shot examples were provided.

Figure 3
Qwen3 processings (no thinking) with five different prompt configurations (10-letter sample).
GPT (Figure 4) demonstrated robust syntactic processing, achieving 100% well-formedness across all prompt scenarios. It produced fewer project-valid TEIs with less detailed encoding rules, and exhibited reduced structural similarity only when no input-output examples were provided. Like Qwen, GPT achieved high D1 scores, contributing to high overall performance.

Figure 4
GPT-5-mini processings with five different prompt configurations (10-letter sample).
Claude (Figure 5) consistently achieved 100% well-formedness and high schema adherence in few-shot scenarios. Uniquely, D1 revealed a highly systematic alteration pattern: while the model produced six letters with 100% matches, analysis revealed that it consistently converted right single quotation marks to straight apostrophes in the remaining samples. Regarding D3, Claude showed significant gains when provided with examples, matching GPT and Qwen’s pattern.

Figure 5
Claude Sonnet 4.5 processings with five different prompt configurations (10-letter sample).
Ultimately, the composite overall score ranked Claude’s five-shot prompt with detailed rules highest (94.0%) due to superior structural fidelity (D3), approximately 7% higher than GPT’s best result in this dimension. Despite using fewer parameters and a quantised local model version, Qwen’s results proved competitive with proprietary models, while OLMo lagged significantly behind. The comparison also reveals optimal prompt configurations varied by model: GPT and Claude achieved peak performance when prompted with detailed encoding rules and five input-output pairs, whereas OLMo’s context window limitations permitted only two examples. Interestingly, Qwen yielded higher performance with shorter encoding rules. These findings influenced model and configuration selection for the larger 100-letter sample.
In terms of framework validation, evaluation revealed interdependencies between dimensions that require methodological consideration:
D0 → D2 cascading: Malformed files cannot be schema-validated, so D0 failures obscure potential D2 issues. Files that might satisfy TEI requirements are flagged as validation failures due to parsing errors.
D1 ↔ D4 asymmetries: Content alterations affect multiple dimensions. For example, apostrophe substitution lowers both D1 (string mismatch) and D4 (element matching yields only practical matches due to character differences).
D1–D3–D4 diagnostic overlap: When textual content is entirely omitted, all three dimensions flag the issue. However, this overlap proves diagnostically valuable: D4 distinguishes content loss (no match) from boundary errors (practical match).
D3–D4 penalty imbalances: 1) Both dimensions capture missing elements, creating potential double-counting. However, while limiting D4 to elements present in both files would avoid overlap, this would also yield misleadingly high semantic accuracy. 2) Nested element shifts (e.g., <salute> moved from <p> to <closer>) trigger only D3 penalties despite requiring case-by-case correction. Conversely, boundary variations (e.g., a <salute> encompassing an entire concluding paragraph vs. only the salutation itself) incur both D3 and D4 penalties despite potentially representing defensible editorial choices.
These asymmetries raise open questions: Should we accept these discrepancies, exclude one dimension from scoring, or adjust weights?
4.2 Cross-Model Comparison
The 100-letter dataset (Figure 6) confirms the 10-letter-sample trends, with slight ranking changes: GPT achieved the highest aggregate score (93.6%), followed by Claude (92.4%), Qwen (87.5%), and OLMo (54.9%). While proprietary models generated exact reference matches for ~30% of documents and 91–99% matches for ~40%, these quantitative successes require qualification, as even scores exceeding 90% can demand substantial post-processing for reference quality.

Figure 6
Cross-model comparison with 100-letter dataset.
The proposed multi-dimensional analysis facilitates diagnostic comparisons across five key areas, revealing distinct capability profiles beyond aggregate scores:
Proprietary models and Qwen consistently generated well-formed XML (D0), although Claude and Qwen shared a specific syntactic error: failure to escape ampersands. OLMo’s 25% XML malformation rate correlated with longer letters (>1500 tokens), indicating context window limitations.
D1 showed nuanced differences in the correctability of content alterations and in hypernormalisation tendencies. GPT and Qwen achieved high source fidelity rates, yet their sporadic character-level substitutions and content insertions require case-by-case verification, resisting automated post-processing. Conversely, Claude’s lower rate belied the benign quality of its errors: its systematic apostrophe replacement (curly to straight quotes) and unescaped ampersands are correctable through simple find-replace operations or XML escaping scripts.
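To illustrate why these particular errors are mechanically correctable, a minimal post-processing sketch might look as follows. It assumes the source transcriptions use typographic apostrophes throughout and that straight single quotes do not occur in attribute values – assumptions that would need verification per corpus:

```python
import re

def repair_systematic_errors(raw):
    """Undo two systematic, find-replace-correctable alterations:
    straight apostrophes (Claude) and unescaped ampersands."""
    # restore typographic apostrophes (U+2019)
    fixed = raw.replace("'", "\u2019")
    # escape bare ampersands that do not already start an entity reference
    fixed = re.sub(r"&(?!#?\w+;)", "&amp;", fixed)
    return fixed
```

Sporadic, unsystematic substitutions of the kind GPT and Qwen produced admit no such rule and are exactly what resists this style of automation.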
D2 confirmed models’ capacity to apply encoding rules, with performance significantly enhanced by few-shot examples. While Qwen experienced more project validity issues than its proprietary counterparts, OLMo’s five-shot configurations were not evaluable due to context window limitations that produced cascading malformed XML. Furthermore, OLMo’s use of non-TEI elements, like <br/> and <postscriptum>, indicates lacking TEI knowledge and/or rule application capability.
D3 issues across GPT and Claude primarily involved added <salute> or <address> elements, and modified <p> counts. Claude maintained superior structural fidelity, despite lower D1 and D4 scores from apostrophe replacement. OLMo overfitted16 to input-output examples, frequently adding <address> elements, reflecting their 50% prevalence in the two-shot prompt examples.
Analysis revealed that D3 and, especially, D4 exposed systematic difficulties that quantitative metrics alone cannot fully characterise. Operating at interpretive boundaries, these dimensions require targeted human investigation to distinguish legitimate editorial choices from systematic failure.
4.3 Limitations
The framework established a foundation for error pattern analysis through detailed dimensional reports, which help distinguish capability limitations from acceptable scholarly alternatives. However, statistical correlation analysis remains necessary.
The assessment capabilities encounter several methodological boundaries. D4 practical match evaluation treats all element types uniformly, yet acceptability of over/under-inclusion varies contextually – salutations permit interpretive boundary variation, while <p> elements demand stricter matching. Current implementation cannot distinguish these nuances. Additionally, D4’s adjusted semantic score applies document-level content preservation (D1) to weight element-level matching. However, content alterations may occur in different elements than those evaluated, potentially misattributing reliability penalties. Element-level content preservation checking would provide more precise weighting, but increases computational complexity.
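The weighting issue can be made concrete with a simplified sketch of the adjusted semantic score. The exact/practical match distinction, the 0.5 practical-match weight, and the (tag, text) element representation are illustrative assumptions here, not the framework’s actual parameters:

```python
def _norm(text):
    """Collapse whitespace and typographic apostrophes for 'practical' matching."""
    return " ".join(text.replace("\u2019", "'").split())

def adjusted_d4(gold, pred, d1_similarity):
    """gold/pred: lists of (tag, text) pairs; d1_similarity: document-level
    content preservation in [0, 1], used to weight the element-level rate."""
    pred_by_tag = {}
    for tag, text in pred:
        pred_by_tag.setdefault(tag, []).append(text)
    total = 0.0
    for tag, text in gold:
        candidates = pred_by_tag.get(tag, [])
        if text in candidates:
            total += 1.0   # exact match
        elif any(_norm(text) == _norm(c) for c in candidates):
            total += 0.5   # practical match (illustrative weight)
    rate = total / len(gold) if gold else 1.0
    return rate * d1_similarity  # document-level D1 weighting
```

The final line is precisely where penalties can be misattributed: a D1 drop caused by an alteration inside one element also discounts every perfectly matched element in the same document.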
The present evaluation relies on single-encoder reference standards, yet TEI encoding inherently permits interpretive diversity. Inter-annotator agreement baselines could establish expected human variation ranges to contextualise LLM performance: 85% structural accuracy might match or significantly underperform human encoders working independently. Human encoders themselves may produce schema-non-compliant markup without real-time validation, suggesting LLM error patterns may partially reflect general task difficulty, rather than capability deficits.
This study still lacks output determinism and reliability assessment through repeated test runs. Substantial stochastic variation means apparent improvements might reflect coincidental generation rather than genuine optimisation, affecting confidence in comparative rankings and prompt configuration recommendations. Systematic testing across multiple runs would be necessary to distinguish stable capability patterns and reliability from random variation.
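A first step toward such reliability assessment could be as simple as aggregating the overall score across repeated runs; what counts as an acceptably small spread is itself a judgement call, not a given:

```python
import statistics

def run_stability(scores):
    """Mean and sample standard deviation of overall scores across repeated
    runs. A spread comparable to the gap between two configurations warns
    that their ranking may be stochastic rather than genuine."""
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread
```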
We acknowledge that larger open models might narrow the gap with proprietary systems. However, local hosting was prioritised to ensure reproducible conditions, stable batch processing, and observable processing times without API latency. Our infrastructure constrained model selection to those under 20GB, necessitating smaller parameter counts or quantisation.
Beyond methodological constraints, the dataset itself carries limitations affecting generalisability. The reference encodings reflect single-encoder decisions, and certain encoding choices – particularly salutation boundaries – remain subject to scholarly debate. Moreover, the sample represents only the Hammer-Purgstall correspondence. Structural features typical of this corpus (e.g., dateline positioning, address conventions, postscript frequency) and orthographical or terminological patterns may not align with other correspondence collections from different periods, regions, or social contexts, limiting direct transferability of performance benchmarks.
4.4 Conclusions and Future Work
The empirical evaluation reveals that current LLMs can generate well-formed XML and apply TEI schema rules with a high degree of accuracy when provided with few-shot examples. However, dimensions critical to humanistic scholarship – source fidelity, structural interpretation, and semantic accuracy – show more variable performance. Even high-scoring outputs frequently require post-processing and exhibit concerning source fidelity issues, which many humanists would consider the most important criterion. While LLMs demonstrate potential as encoding assistants, the results suggest they are not yet reliable as autonomous encoders for scholarly editions.
Future research should establish human encoder benchmarks to contextualise LLM performance and implement evaluation after systematic post-processing – such as fixing minor XML well-formedness issues or Unicode adaptations – to address cascading errors like those observed in Claude’s results. Iterative evaluation-revision loops could enable models to self-correct based on dimensional feedback. Moreover, future evaluations should also consider the environmental impact of LLM-assisted workflows, comparing computational costs against labour savings to inform sustainable adoption decisions. The framework could further be extended in several ways: testing additional genres and encoding complexities; analysing whether structural complexity (e.g., letters with postscripts, multiple datelines) or different languages affect performance; investigating document-specific versus model-specific failure patterns; and incorporating attribute evaluation currently excluded from assessment. Looking ahead, we hope that the presented work supports TEI encoders to make an informed decision about the integration of LLMs as encoding assistants and also provides methodological foundations for Digital Humanities benchmarking.
GenAI Declaration
The author used generative AI tools to identify and correct errors and typos, rephrase content, and condense sections of the manuscript. All content was originally authored by humans and verified for accuracy.
Notes
[2] Standard machine learning metrics: precision measures the accuracy of positive identifications, recall measures completeness of identification, and F1 is their harmonic mean.
[4] In contrast to strict string matching, which only recognises identical character sequences, relaxed matching permits approximate matches through substring matching.
[7] Macro F1 treats all classes equally regardless of their frequency, ensuring balanced evaluation across both common and rare categories.
[8] Tokens are the basic unit of text processing in LLMs and may correspond to whole words or subword fragments.
[9] n-shot prompting refers to prompting strategies in which the model is provided either with no examples (zero-shot) or with a handful of examples (one-shot, two-shot, few-shot) of the task at hand.
[13] Quantisation is a model compression technique that lowers weight precision to reduce model size and computational cost.
Acknowledgements
The author acknowledges the financial support of the University of Graz.
Competing Interests
The author has no competing interests to declare.
Author Contributions
Sabrina Strutz: Conceptualisation, Data Curation, Formal Analysis, Investigation, Methodology, Software, Visualisation, Writing – Original Draft.
