Semi-Automatic Annotation of Babylonian Cuneiform Texts

Open Access | Mar 2026

Abstract

This article presents the methods and workflow for the semi-automatic linguistic annotation of Akkadian cuneiform texts, together with a Neo-Babylonian corpus created with them. The backbone of our workflow is BabyLemmatizer, a neural annotation pipeline developed specifically for annotating cuneiform texts. We used the lemmatizer to annotate a corpus of 6,099 Babylonian archival texts from the first millennium BCE. Because the texts contained words and word forms that were not present in the lemmatizer’s training data, we manually added the most common out-of-vocabulary words to the lemmatizer’s override lexicon in an iterative process. The annotated texts are available as CoNLL-U files for computational analysis, but we also wanted to make our data available to the wider community of philologists and historians. The texts are therefore published in the corpus search tool Korp and, in part, on the Open Richly Annotated Cuneiform Corpus (Oracc). In addition, we have created word co-occurrence networks that are well suited to exploring lexical semantics. Our raw datasets, their online editions on Korp and Oracc, and the semantic networks can be used for teaching in Assyriology, linguistics, and digital humanities.
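As a rough illustration of how lemmatized CoNLL-U data can feed a word co-occurrence network, the following minimal Python sketch reads lemmas from CoNLL-U text and counts lemma pairs that occur within a small window of each other. It is not the authors' pipeline: the function names `read_lemmas` and `cooccurrences`, the window size, the use of raw counts as edge weights, and the two-sentence sample (invented for the example, not taken from the corpus) are all assumptions.

```python
from collections import Counter

# Invented toy sample in CoNLL-U layout (10 tab-separated columns per token).
# It is NOT taken from the published corpus; forms equal lemmas for simplicity.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tkaspu\tkaspu\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "2\tšarru\tšarru\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "3\tnadānu\tnadānu\tVERB\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = demo-2\n"
    "1\tkaspu\tkaspu\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "2\tbītu\tbītu\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "3\tnadānu\tnadānu\tVERB\t_\t_\t_\t_\t_\t_\n"
)


def read_lemmas(conllu_text):
    """Yield one list of lemmas (CoNLL-U column 3) per sentence."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Keep plain word lines only; skip ranges ("1-2") and empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        sentence.append(cols[2])
    if sentence:
        yield sentence


def cooccurrences(sentences, window=2):
    """Count unordered lemma pairs at most `window` tokens apart in a sentence."""
    counts = Counter()
    for lemmas in sentences:
        for i, left in enumerate(lemmas):
            for right in lemmas[i + 1 : i + 1 + window]:
                if left != right:
                    counts[tuple(sorted((left, right)))] += 1
    return counts


if __name__ == "__main__":
    pair_counts = cooccurrences(read_lemmas(SAMPLE), window=2)
    for pair, n in pair_counts.most_common():
        print(pair, n)
```

Each counted pair can be read as a weighted edge between two lemma nodes. A real analysis would probably replace raw counts with an association measure and filter out very frequent function words, but those choices are outside what the abstract specifies.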

DOI: https://doi.org/10.5334/johd.494 | Journal eISSN: 2059-481X
Language: English
Submitted on: Dec 11, 2025 | Accepted on: Feb 3, 2026 | Published on: Mar 6, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Tero Alstola, Aleksi Sahala, Jonathan Valk, Matthew Ong, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.