Semi-Automatic Annotation of Babylonian Cuneiform Texts

Tero Alstola; Aleksi Sahala; Jonathan Valk; Matthew Ong

doi:10.5334/johd.494

Abstract

This article presents the methods and workflow for semi-automatic linguistic annotation of Akkadian cuneiform texts and a Neo-Babylonian corpus created with them. The backbone of our workflow is BabyLemmatizer, a neural annotation pipeline developed especially for the purpose of annotating cuneiform texts. We used the lemmatizer to annotate a corpus of 6,099 Babylonian archival texts from the first millennium BCE. As the texts contained words and word forms not available in the training data of the lemmatizer, we manually added the most common out-of-vocabulary words to the lemmatizer’s override lexicon in an iterative process. The annotated texts are available as CoNLL-U files for computational analysis, but we also wanted to make our data available to the wider community of philologists and historians. Therefore, the texts are published in the corpus search tool Korp and partially on the Open Richly Annotated Cuneiform Corpus (Oracc). Moreover, we have created word co-occurrence networks that are well suited for the exploration of lexical semantics. Our raw datasets, their online editions on Korp and Oracc, and semantic networks can be used for teaching purposes in Assyriology, linguistics, and digital humanities.
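
The iterative step described above, in which the most common out-of-vocabulary words are identified and then added manually to an override lexicon, can be illustrated with a minimal sketch. It is not the BabyLemmatizer implementation and does not use its API. It assumes only the standard ten-column, tab-separated CoNLL-U token layout. The function names, the `known_forms` set standing in for the training vocabulary, and the placeholder tokens (`alpha`, `beta`, `gamma`) are all hypothetical.

```python
from collections import Counter


def read_conllu_forms(text):
    """Yield the FORM column (second field) of each token line in CoNLL-U text."""
    for line in text.splitlines():
        # Skip blank sentence separators and '#' comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Keep only ordinary tokens: ten columns and a plain integer ID.
        # This skips multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        yield cols[1]


def most_common_oov(text, known_forms, n=10):
    """Return the n most frequent forms that are absent from known_forms."""
    counts = Counter(
        form for form in read_conllu_forms(text) if form not in known_forms
    )
    return counts.most_common(n)


def _tok(index, form):
    """Build one placeholder CoNLL-U token line with unspecified annotations."""
    return "\t".join([str(index), form] + ["_"] * 8)


# Placeholder data for illustration only; these are not real corpus tokens.
SAMPLE = "\n".join([
    "# sent_id = 1",
    _tok(1, "alpha"), _tok(2, "beta"), _tok(3, "beta"),
    "",
    "# sent_id = 2",
    _tok(1, "gamma"), _tok(2, "beta"), _tok(3, "alpha"),
])

if __name__ == "__main__":
    # Forms missing from the (hypothetical) known vocabulary, most frequent first.
    print(most_common_oov(SAMPLE, known_forms={"alpha"}, n=2))
```

In such a loop, the top-ranked forms would be reviewed by hand, added to the lexicon, and the ranking recomputed on the next pass, so that manual effort goes first to the words that affect the most tokens.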

