Abstract
The integration of Large Language Models (LLMs) into digital scholarly editing workflows presents opportunities for automating labour-intensive encoding tasks using the Text Encoding Initiative (TEI). Yet the field lacks systematic evaluation frameworks for assessing the quality of LLM-generated encoding at scale. This paper addresses that gap by developing a comprehensive evaluation framework that bridges computational methods with the principles of humanities scholarship. While traditional Natural Language Processing (NLP) evaluation metrics remain valuable for specific aspects, they are insufficient for a holistic assessment of TEI encoding, which poses unique evaluation challenges: hierarchical XML structures and an interpretive flexibility that permits multiple valid encodings of the same source. Moreover, LLM-generated encodings exhibit concerning behaviours, including content alteration, systematic bias towards modern language conventions, and inconsistent application of encoding decisions. This research develops a stratified evaluation methodology that assesses LLM-generated TEI documents across multiple dimensions: syntactic validity, source fidelity, schema compliance, structural fidelity, and semantic recognition. The framework aims to identify which of these dimensions can be reliably assessed through automated validation, enabling scalable evaluation, and which require targeted human-in-the-loop review. Empirical validation employs the Joseph von Hammer-Purgstall correspondence, a multilingual 18th–19th-century letter corpus, to test the framework across multiple state-of-the-art models (GPT-5-mini, Claude Sonnet 4.5, Qwen3-14B, OLMo2-32B). By providing reusable evaluation strategies, the framework enables cross-model comparison and iterative improvement of LLM-assisted encoding workflows, contributing foundational infrastructure for benchmarking practices in Digital Humanities.
