Abstract
The integration of Large Language Models (LLMs) into digital scholarly editing workflows presents opportunities for automating labour-intensive encoding tasks using the Text Encoding Initiative (TEI). Yet the field lacks systematic evaluation frameworks for assessing the quality of LLM-generated encoding at scale. This paper addresses that gap by developing a comprehensive evaluation framework that bridges computational methods with the principles of humanities scholarship. While traditional Natural Language Processing (NLP) evaluation metrics remain valuable for specific aspects, they are insufficient for a holistic assessment of TEI encoding, which poses unique evaluation challenges: hierarchical XML structures and an interpretive flexibility that permits multiple valid encodings of the same source. Moreover, LLM-generated encodings exhibit concerning behaviours, including content alteration, systematic bias towards modern language conventions, and inconsistent application of encoding decisions. This research develops a stratified evaluation methodology that assesses LLM-generated TEI documents across multiple dimensions: syntactic validity, source fidelity, schema compliance, structural fidelity, and semantic recognition. The framework aims to identify which of these dimensions can be reliably assessed through automated validation, enabling scalable evaluation, and which require targeted human-in-the-loop review. Empirical validation employs the Joseph von Hammer-Purgstall correspondence, a multilingual 18th–19th-century letter corpus, to test the framework across multiple state-of-the-art models (GPT-5-mini, Claude Sonnet 4.5, Qwen3-14B, OLMo2-32B). By providing reusable evaluation strategies, the framework enables cross-model comparison and iterative improvement of LLM-assisted encoding workflows, contributing foundational infrastructure for benchmarking practices in Digital Humanities.
