Abstract
This article presents methods and a workflow for the semi-automatic linguistic annotation of Akkadian cuneiform texts, together with a Neo-Babylonian corpus created with them. The backbone of our workflow is BabyLemmatizer, a neural annotation pipeline developed specifically for annotating cuneiform texts. We used the lemmatizer to annotate a corpus of 6,099 Babylonian archival texts from the first millennium BCE. Because the texts contained words and word forms absent from the lemmatizer’s training data, we iteratively added the most common out-of-vocabulary words to its override lexicon by hand. The annotated texts are available as CoNLL-U files for computational analysis, but we also wanted to make our data accessible to the wider community of philologists and historians. The texts are therefore published in the corpus search tool Korp and, in part, on the Open Richly Annotated Cuneiform Corpus (Oracc). Moreover, we have created word co-occurrence networks that are well suited to exploring lexical semantics. Our raw datasets, their online editions on Korp and Oracc, and the semantic networks can be used for teaching in Assyriology, linguistics, and digital humanities.
