
Semi-Automatic Annotation of Babylonian Cuneiform Texts

Open Access | March 2026

1 Context and motivation

Linguistic annotation is a prerequisite for many tasks in corpus linguistics and the computational analysis of ancient texts. Tens of thousands of annotated Akkadian cuneiform texts are available in the Open Richly Annotated Cuneiform Corpus (Oracc),[1] Archibab,[2] and the Electronic Babylonian Library (eBL),[3] but the chronological coverage of these projects is uneven. For example, only a small percentage of some 50,000 Babylonian archival texts from c. 620 BCE to 80 CE (henceforth “Neo-Babylonian texts”; see Hackl, 2021; Jursa, 2005) are available in an annotated format. This is in stark contrast to the Old Babylonian and Neo-Assyrian periods, a significant portion of whose rich textual records is available in annotated form via Archibab and Oracc. Recently, the eBL project has begun lemmatizing Neo-Babylonian archival documents, including texts in our corpus.

Computational Assyriology is not a new field (Gardin & Garelli, 1961; Parpola, 1970), but it has been growing at an unprecedented pace during the past decade. In addition to using computers to aid philological and historical work, network analysis and computational linguistics have been used increasingly as key research methods in Assyriological publications (Gutherz et al., 2023; King & Pirngruber, 2022; Lahnakoski et al., 2024). The application of computational methods to the Neo-Babylonian texts has been one of the research goals of the Centre of Excellence in Ancient Near Eastern Empires, but the lack of linguistically annotated texts has hindered computational research into this rich corpus. In order to create a sufficiently large corpus while avoiding the cumbersome and time-consuming work of manual annotation, we created a workflow for semi-automatic annotation of large corpora of transliterated cuneiform texts.

This article presents our annotation methods and workflow and the data we created with them. Furthermore, it discusses the platforms for making the annotated texts available to scholars working with traditional philological and historical methods. In addition to publishing the annotated texts as CoNLL-U files, we have also made them available on the corpus search tool Korp and partially on Oracc. The semantics of the texts can be explored in interactive word co-occurrence networks.

2 Dataset

Overview and origin

The corpus consists of 6,099 cuneiform texts (ca. 464,000 words) that are divided into two groups and ten sub-corpora (Table 1). The two groups reflect the origin of the texts. The first group of 2,830 texts originates from the Achemenet project.[4] These texts were written in the Persian period (539–330 BCE), and they were delivered to us as a database export in several XLSX files. The second group of 3,269 texts is more diverse, stretching from the Neo-Babylonian (626–539 BCE) to the Parthian period (141 BCE – c. 80 CE). We call this group “BALT: Babylonian Administrative and Legal Texts.” More than half of these texts are the legacy data of the late János Everling, who was a pioneer in making transliterated cuneiform texts available online. The remaining BALT texts were collected from the recent publications of Hackl et al. (2011), Hackl et al. (2014), Levavi (2018), and Waerzeggers (2014).

Table 1

Distribution of the texts by genre and sub-corpus.

                      ADMINISTRATIVE  LEGAL  LETTER  LITERARY  OFFICIAL  SCHOLARLY  UNCERTAIN
Achemenet
  CT 55                          160    111      18         0         0          1          5
  Jursa, Bēl-rēmanni              13    121       4         0         0          0         18
  Murašû archive                   0    768       0         0         0          0          4
  Strassmaier                    108    559      33         0         0          1        710
  YOS 7                          196      0       0         0         0          0          0
BALT
  Everling                      1535    727      11         2         1          3        314
  Hackl et al., 2011               0      0      29         0         0          0          0
  Hackl et al., 2014               0      0     242         0         0          0          0
  Levavi, 2018                     0      0     216         0         0          0          0
  Waerzeggers, 2014                0    180       0         0         0          0          9
Total                           2012   2466     553         2         1          5       1060

The great majority of extant first-millennium Babylonian cuneiform tablets are archival texts, comprising legal and administrative documents and letters (Jursa, 2005). Neo-Babylonian royal inscriptions have recently been published as annotated editions (Munich Open-Access Cuneiform Corpus Initiative, 2015–2025), and the literary and scientific texts are being annotated on eBL and in several Oracc projects. Most of the texts in our corpus are legal and administrative documents, but we have purposefully included a large sample of letters to diversify the contents and vocabulary of the corpus. The genre distribution of the corpus can be seen in Table 1. About one thousand texts are classified as being of unknown genre. This does not indicate that their genre could not be determined, but rather that the information is not included in the metadata available to us. More detailed information about the corpus is available in the online repository.

Data and models for computational linguistics

The primary aim of the project was to annotate transliterated cuneiform texts with lemmas and part-of-speech tags, so that the texts could be used for various tasks in computational linguistics. Following the workflow described in section 3, the texts were annotated semi-automatically using BabyLemmatizer 2 (Sahala & Lindén, 2023). The annotated CoNLL-U files of Achemenet are available as dataset A on Zenodo, and the CoNLL-U and metadata files of BALT as dataset B. The source code of BabyLemmatizer 2 is available on GitHub (Sahala, 2024), and the Neo-Babylonian model for BabyLemmatizer as dataset C on Zenodo. The lemmatizer and the model can be used to annotate new batches of transliterated Neo-Babylonian cuneiform texts.

The Oracc project

For years, Oracc has been the primary platform for publishing linguistically annotated cuneiform texts online. To enhance the visibility and accessibility of our data, we made part of the BALT corpus (sub-corpora Everling, Levavi, and Waerzeggers) available on Oracc (Alstola et al., 2025a). Using the Oracc web interface, the user can read the texts in transliteration and partially in translation, perform simple searches, and use dictionaries to explore the attestation of words in the corpus (Figure 1). All the Oracc content is downloadable in a standardized format as JSON files, making our texts available for users who want to use them as part of a larger cuneiform dataset. The raw files used to create the Oracc project are available as dataset D on Zenodo.

Figure 1

The BALT project on Oracc, showing an annotated edition of a text.

The corpus search tool Korp

Philological work on texts greatly benefits from flexible search capabilities. Therefore, we made the annotated texts available on the Korp service of the Language Bank of Finland (Borin et al., 2012). Korp allows the researcher to conduct simple or more complicated searches in one or more corpora, and shows the results as a keyword-in-context list (Figure 2). The simple and intuitive graphical user interface makes the texts easily available to all Assyriologists. We provide metadata for each text, and the user can follow links to access other online resources related to the texts. Korp also displays statistics about the results of a search query and allows the user to download both the results and the related statistics. Achemenet (Alstola et al., 2025b) and BALT (Alstola et al., 2025c) are available as separate corpora on Korp. The user can easily modify their search to target either or both corpora, or one or more of their sub-corpora.

Figure 2

Korp’s user interface, showing the search results for the Akkadian word suluppu, “date (fruit).”

Semantic networks

Work on lexical semantics can be cumbersome, especially if a researcher is not intimately familiar with the corpus that they are investigating. When developing or starting a new research project, it can prove useful to freely explore the semantic relationships in a corpus and see which words typically cluster together. The visualization of words as a semantic co-occurrence network provides an intuitive platform for such exploration, and when used in tandem with philological methods it enables the user to toggle efficiently between micro and macro perspectives (Alstola et al., 2023). Using all the texts in our corpus, we built lexical semantic networks that show which words in the corpus typically co-occur (Alstola et al., 2025d). There are three different networks.[5] The first contains all nouns, verbs, and adjectives, and it is available in Akkadian and in English translation. The second includes common nouns, verbs, and adjectives but excludes proper nouns; it is also available in Akkadian and in English translation. The third network contains only proper nouns. Personal names are not included in the networks, because they are numerous and would therefore obscure semantic structures. The networks serve different research questions, as the inclusion of proper nouns may affect co-occurrence patterns significantly. The networks are available via an online interface (Figure 3), from which they can be downloaded as GEXF (Graph Exchange XML Format) files for further analysis. The raw data used to create the networks is available as dataset E on Zenodo.

Figure 3

Semantic network showing the Akkadian words that most typically co-occur with the word libbu, “inner body, interior.”

Dataset description

Repository name

Zenodo

Object name and location

Dataset A: “Linguistically Annotated Achemenet Babylonian Texts” (https://doi.org/10.5281/zenodo.14223709).

Dataset B: “BALT: Babylonian Administrative and Legal Texts” (https://doi.org/10.5281/zenodo.14186072).

Dataset C: “Neo-Babylonian Model for BabyLemmatizer 2.1” (https://doi.org/10.5281/zenodo.14978872).

Dataset D: “BALT: Babylonian Administrative and Legal Texts on Oracc” (https://doi.org/10.5281/zenodo.15496287).

Dataset E: “Neo-Babylonian Lexical Networks – the Dataset” (https://doi.org/10.5281/zenodo.15355779).

Format names and versions

ATF, CoNLL-U, CSV, GEXF, IPYNB, JAVA, JSON, PT, PY, TSV, TXT, VRT, XML, XLSX, YAML

Creation dates

2018-10-18 – 2025-05-23

Dataset creators

Tero Alstola (University of Helsinki) was responsible for conceptualization, data curation, investigation, methodology, project administration, and validation. Aleksi Sahala (University of Helsinki) was responsible for data curation, investigation, methodology, and software. Jonathan Valk (Spanish National Research Council, CSIC) was responsible for data curation, investigation, and validation. Matthew Ong (University of Helsinki) was responsible for data curation, investigation, methodology, software, and validation. Sam Hardwick (CSC – IT Center for Science) was responsible for methodology and software for creating semantic networks. Linda Leinonen, Matias Sakko, Senja Salmi, and Repekka Uotila (University of Helsinki) worked on data curation.

Language

Akkadian, English

License

Datasets A, B, and E: Creative Commons Attribution 4.0. Datasets C and D: Creative Commons Attribution Share Alike 3.0.

Publication date

Dataset A: 2025-02-13

Dataset B: 2025-02-13

Dataset C: 2025-03-06

Dataset D: 2025-05-23

Dataset E: 2025-05-22

3 Workflow and methods

Preprocessing

We received the original transliterations in various file formats: some were plain text, while others were DOC(X), HTML, or PDF files. Moreover, different authors had used different conventions for transliterating Akkadian cuneiform into the Latin script. We therefore used both automated and manual means to convert the files to plain text and to standardize the transliteration conventions. We opted for the ASCII Transliteration Format (ATF) used in Oracc, as it is widely utilized and well documented, and because there is an easy-to-use validity checker for ATF files (Tinney & Robson, 2019). DOC(X) and PDF files were first converted manually to plain text, and text and HTML files were subsequently converted to ATF with automated scripts. Because of discrepancies in the original files, the automatically generated ATF files were not fully compatible with Oracc standards. We used the Oracc validity checker to track down and manually correct these issues. Validity tests ensured that the transliteration follows a defined syntax and that all encountered cuneiform sign values are attested in the Oracc sign list.[6] After the ATF files passed the validity test, they were automatically converted to CoNLL-U files for linguistic annotation. A collection of scripts used to manipulate the files is available in datasets A and B on Zenodo.
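
To make the final conversion step concrete, the following minimal sketch turns numbered ATF transliteration lines into bare CoNLL-U token rows. It is a simplified stand-in for the project’s actual scripts (available in datasets A and B); the function name and regular expression are ours.

import re

def atf_to_conllu(atf_lines):
    """Turn numbered ATF transliteration lines into bare CoNLL-U rows.

    Each transliterated word becomes one token; LEMMA, UPOS, and the
    remaining fields are left as '_' placeholders for BabyLemmatizer.
    """
    rows, token_id = [], 1
    for line in atf_lines:
        m = re.match(r"^\d+'?\.\s+(.*)$", line)  # e.g. "6. {m}{d}EN-iq-bi A ..."
        if not m:
            continue  # skip non-transliteration lines ($ comments, #lem, etc.)
        for form in m.group(1).split():
            rows.append(f"{token_id}\t{form}\t_\t_\t_\t_\t_\t_\t_\t_")
            token_id += 1
    return "\n".join(rows)

print(atf_to_conllu(["6. {m}{d}EN-iq-bi A {m}{d}AG.GI"]))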

Lemmatization and part-of-speech tagging

We provided our corpus with lemmatization, part-of-speech (POS) tagging, and phonemic transcription using BabyLemmatizer 2, a neural annotation pipeline (Sahala & Lindén, 2023). For lemmatization and POS tagging, we trained our initial model on Oracc’s Babylonian data from the first millennium BCE (downloaded in 2023), comprising ca. 715,000 tokens (dataset C). Consequently, we use the POS tags defined by Oracc (Robson, 2019; Robson & Tinney, 2019). Because personal names were inadequately represented in the training data, we used transliteration-lemma lists from Prosobab (Waerzeggers & Groß, 2019) as part of the lemmatizer’s override lexicon. We used the default context windows in BabyLemmatizer, namely one preceding and one following word for the tagger and the two preceding and two following POS tags for the lemmatizer. In our annotation process, we first annotated one sub-corpus of the whole dataset. We then manually corrected the most common lemmas and POS tags to which BabyLemmatizer assigned the lowest confidence scores, and imported these corrections back into the lemmatizer’s override lexicon to be taken into account in the following sub-corpora. At this stage, we also detected and corrected some additional issues in transliteration. Once all the sub-corpora had been lemmatized, we manually corrected some repetitive but ambiguous lemmas and POS tags that could not be corrected with the override lexicon.
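
As an illustration of this correction loop, the sketch below collects the most frequent low-confidence (form, lemma, POS) triples from a CoNLL-U file for manual review. It assumes, purely for illustration, that the pipeline writes a confidence score into the MISC column as score=N; BabyLemmatizer’s actual output format may differ.

from collections import Counter

def low_confidence_candidates(conllu_path, threshold=2, top_n=50):
    """Rank (form, lemma, POS) triples whose confidence is at or below threshold."""
    counts = Counter()
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 10 or not cols[0].isdigit():
                continue  # skip comments, blank lines, and multiword ranges
            form, lemma, upos, misc = cols[1], cols[2], cols[3], cols[9]
            score = next((int(kv.split("=")[1]) for kv in misc.split("|")
                          if kv.startswith("score=")), None)
            if score is not None and score <= threshold:
                counts[(form, lemma, upos)] += 1
    return counts.most_common(top_n)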

Experiments on phonemic transcription

Phonemic transcription of the corpus was not our original aim, and it is not available in datasets A and B, on Korp, or in the semantic networks. However, Oracc requires transcription as part of linguistic annotation, and thus we had to provide it for the BALT project on Oracc. Although a model for predicting Akkadian phonemic transcriptions already exists (Sahala et al., 2020), we chose to train a new custom model using BabyLemmatizer to streamline our annotation process. Generally, automatic phonemic transcription of Akkadian is challenging because the opaque writing system does not always distinguish clearly between different vowel lengths (short, long, overlong), and because the relations between logograms and their transcriptions are suppletive and predictable only from sentence context.

Our phonemic transcription model is based on Oracc’s Babylonian data from the first millennium BCE (downloaded in December 2024). It consists of 914,000 tokens split into 80/10/10 training/development/test sets. Because the model is rudimentary and we did not develop it further, we do not publish it as part of our dataset. We tried several ways of mapping transliterations to their phonemic transcriptions. One was to predict transcriptions for syllabic transliterations as simple one-to-one mappings while increasing the context for logograms to the two preceding and following words. Another involved using context windows of a fixed size to provide information on the surrounding transliterated or lemmatized words in order to predict the correct transcription in context. We also experimented with data augmentation methods for logograms, such as artificially replacing syllabically spelled words with logograms to generate more training data. Although augmentation improved the results in some cases, the improvement always fell within the margin of error. We therefore chose the simplest approach: we first predicted POS tags using the default BabyLemmatizer model, and then predicted transcriptions from transliterations using the POS tags of the two preceding and following words for context. Unlike lemmas and POS tags, transcriptions were not manually corrected at any stage.
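
The sketch below shows how such context-augmented training pairs could be constructed: each example couples a transliterated form, together with the POS tags of its two neighbours on either side, with its phonemic transcription. The data layout and padding symbol are our assumptions, not BabyLemmatizer’s actual input format.

def transcription_examples(tokens, pad="<P>"):
    """tokens: list of (translit, pos, transcription) triples for one text."""
    pos_tags = [pad, pad] + [t[1] for t in tokens] + [pad, pad]
    examples = []
    for i, (translit, _, transcription) in enumerate(tokens):
        context = pos_tags[i:i + 2] + pos_tags[i + 3:i + 5]  # two left, two right
        examples.append((translit, context, transcription))
    return examples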

Metadata creation

We also wanted to provide our texts with metadata concerning their genre, place and time of writing, publication history, etc. This data was available to us in different formats, and our rights to distribute it varied.

For Achemenet, the database export included metadata for each text. We harmonized the metadata slightly and translated some terminology from French to English to make it easily accessible on Korp. Furthermore, some metadata was transformed to make it compatible with the conventions used by the Cuneiform Digital Library Initiative (CDLI).[7] For example, information on a king’s reign was converted to a standardized name for a historical period. Artifact ID numbers from CDLI were also added to texts when available. The metadata for Achemenet texts is available on Korp, and the scripts used to manipulate it are available in dataset A on Zenodo.
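
As a toy illustration of this kind of conversion, the snippet below maps a king’s name to a CDLI-style period label. The two entries are examples only; the exact labels in CDLI’s conventions and in our scripts may differ.

# Hypothetical reign-to-period lookup used for metadata harmonization.
REIGN_TO_PERIOD = {
    "Nebuchadnezzar II": "Neo-Babylonian (ca. 626-539 BC)",
    "Darius I": "Achaemenid (547-331 BC)",
}

def period_for(king):
    return REIGN_TO_PERIOD.get(king, "uncertain")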

For the BALT texts we used the existing metadata from CDLI, augmented it, or created new metadata entries for CDLI. For the creation of new CDLI metadata, we primarily used the information provided in the original publications. Occasionally, we also used metadata records from the NaBuCCo project (Abraham & Jursa, 2025). The quantity and quality of the metadata depend on the available source material, as we did not have the resources to consult the text editions themselves and create our own metadata for each text. Data collection and manipulation were primarily manual, but facilitated by some scripting and the use of a large language model (Microsoft Copilot). We created new CDLI entries for almost a thousand texts and updated the existing entries of some 300 texts. The metadata for BALT texts is available in dataset B, Korp, and CDLI.

The Oracc project

BabyLemmatizer produces linguistic annotation in CoNLL-U files, but Oracc uses ATF files as its input file format. Because our CoNLL-U files were based directly on validated ATF files, lemmas, POS tags, and transcribed forms could be reinserted back into the ATF files as annotation metadata.

In ATF, annotations are provided line by line in parallel with the transliteration. For example, line 6 of P248168 = UCP 9/1, I, 47 is annotated as follows:

6. {m}{d}EN-iq-bi A {m}{d}AG.GI

    #lem: +Bel-iqbi[1]PN$; +māru[son//son]N'N$mār; +Nabu-ušallim[1]PN$

The top line is the transliteration, while the bottom line provides the annotation. Each token in the bottom line specifies the lemma, meaning, POS, and transcription of the token above. The meaning in fact specifies a pair of features, namely the guide word (i.e., basic meaning) and the context-specific sense. Parallel to this are general and context-specific POS tags. In manual annotation, the general and context-specific sense and POS tags of a token are based on expert judgment in consultation with a standard dictionary reference such as the Concise Dictionary of Akkadian (CDA; see Black et al., 2000). In our automatic processing, we use a list of lemma-guide word pairs created from the CDA data at eBL, and set the context-specific sense equal to the guide word. Similarly, the context-specific POS is set equal to the general POS. Unfortunately, this method introduces occasional mislabeling. For personal names (PN), the guide word and context-specific sense are reduced to a single numeral distinguishing individual persons. In our case, we set all numbers equal to the default “1”.
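
A minimal sketch of this construction step follows; it reproduces the format of the example above, but the function and data layout are illustrative rather than the project’s actual code.

def lem_entry(lemma, guide, pos, form=None):
    """Build one #lem: token, e.g. +māru[son//son]N'N$mār or +Bel-iqbi[1]PN$."""
    if pos == "PN":  # personal names: guide word and sense collapse to "1"
        return f"+{lemma}[1]PN$"
    # context-specific sense and POS are set equal to guide word and POS
    return f"+{lemma}[{guide}//{guide}]{pos}'{pos}$" + (form or "")

tokens = [("Bel-iqbi", "1", "PN", None),
          ("māru", "son", "N", "mār"),
          ("Nabu-ušallim", "1", "PN", None)]
print("#lem: " + "; ".join(lem_entry(*t) for t in tokens))
# -> #lem: +Bel-iqbi[1]PN$; +māru[son//son]N'N$mār; +Nabu-ušallim[1]PN$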

After constructing the annotation lines, certain adjustments were made to the ATF files by processing scripts and manual operations so that they would be uniform in format and pass the obligatory syntax check. Transliterations provided by Levavi (2018) and Waerzeggers (2014) came with translations, which were inserted in the ATF files. The resulting files were uploaded to the Oracc server, where they were combined with text metadata. Finally, the files were used to generate the glossary files and web pages available to the user. A set of scripts used to create the Oracc ATF files is available in dataset D on Zenodo.

Korp and semantic networks

Korp is part of the research infrastructure provided by FIN-CLARIN at the Language Bank of Finland.[8] We therefore only needed to prepare input files in the VRT format, while the personnel at FIN-CLARIN took care of making the dataset available. More importantly, FIN-CLARIN maintains Korp itself, which ensures the long-term accessibility of the service and data.

Our lexical semantic networks show which words typically co-occur in the same context. We used Pmizer (Sahala, 2019) to calculate similarity scores with a version of pointwise mutual information (PMI; Church & Hanks, 1990) for each pair of words in the corpus. For each word, we picked the 15 collocates with the highest PMI scores and used the scores as edge weights in the networks. We created three separate networks: 1) proper nouns; 2) verbs, adjectives, and common nouns; and 3) verbs, adjectives, and all nouns. Personal names are not included in any network. Networks 2 and 3 are available in two versions, one showing the words in Akkadian and the other in English translation. Using the JavaScript GEXF Viewer for Gephi (Bastian et al., 2009; Velt, 2011), Sam Hardwick (CSC – IT Center for Science) developed an interactive online platform for exploring the networks. The scripts used to create the platform are available in dataset E and Hardwick (2025). An older version of the scripts can be accessed in Sahala et al. (2022).
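
The sketch below illustrates the pipeline in miniature: score co-occurring word pairs with a plain PMI estimate, keep each word’s 15 highest-scoring collocates, and write the result as a GEXF graph. The window size and frequency threshold are assumed parameters, and the code is an approximation rather than a reimplementation of Pmizer.

import math
from collections import Counter

import networkx as nx  # third-party; pip install networkx

def pmi_network(texts, window=10, top_k=15, min_count=5):
    """texts: lists of lemmas. Returns a graph of each word's top collocates."""
    word_freq, pair_freq, total = Counter(), Counter(), 0
    for words in texts:
        total += len(words)
        word_freq.update(words)
        for i, w in enumerate(words):  # co-occurrence within a sliding window
            for v in words[i + 1:i + 1 + window]:
                if w != v:
                    pair_freq[tuple(sorted((w, v)))] += 1
    candidates = {}
    for (w, v), c in pair_freq.items():
        if word_freq[w] < min_count or word_freq[v] < min_count:
            continue
        pmi = math.log2(c * total / (word_freq[w] * word_freq[v]))
        if pmi > 0:
            candidates.setdefault(w, []).append((pmi, v))
            candidates.setdefault(v, []).append((pmi, w))
    graph = nx.Graph()
    for w, scored in candidates.items():  # keep the 15 strongest collocates
        for pmi, v in sorted(scored, reverse=True)[:top_k]:
            graph.add_edge(w, v, weight=pmi)
    return graph

# nx.write_gexf(pmi_network(corpus_texts), "network.gexf")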

4 Results

We evaluated the accuracy of lemmatization and POS tagging with two samples of 500 randomly selected words, one from the Achemenet texts and the other from the BALT texts. In the evaluation, we discarded words that had a broken or otherwise corrupted transliteration or that were attested in a badly damaged context. These account for 2.3% of the words in the two samples. The results of the evaluation are presented in Table 2. In the two samples, 92.3% of words had been assigned a correct lemma and POS tag. The results for the Achemenet texts are very similar to those we achieved using a previous version of BabyLemmatizer (Sahala et al., 2023).
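
For clarity about what the joint figure in Table 2 measures (lemma and POS both correct on the same token), a minimal version of the accuracy computation might look as follows; extraction of the fields from the evaluation samples is omitted and the data layout is assumed.

def annotation_accuracy(gold, pred):
    """gold, pred: aligned lists of (lemma, pos) pairs for the sampled tokens."""
    n = len(gold)
    lemma = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    pos = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n
    joint = sum(g == p for g, p in zip(gold, pred)) / n  # lemma AND POS correct
    return lemma, pos, joint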

Table 2

Accuracy of lemmatization and POS tagging.

              LEMMA    POS      LEMMA + POS
BALT          93.4%    95.9%    90.9%
Achemenet     96.3%    96.1%    93.7%
Total         94.9%    96.0%    92.3%

The accuracy of POS tagging was practically identical for the BALT and Achemenet texts, but the quality of lemmatization was better in the Achemenet corpus. This appears to be due to chance differences between the two text groups. Most importantly, the spelling “IGI-ir” is regularly (232 times) used to write the verb mahāru in the BALT texts, but rarely (6 times) in the Achemenet texts. Because the logogram IGI can also represent the verb amāru, our lemmatizer ascribed the erroneous lemma amāru to the spelling “IGI-ir” and assigned it a confidence score of 4. Since words with the highest confidence scores were not checked manually, this repetitive mistake was able to creep into the corpus. It was similarly difficult for the lemmatizer to decide whether the logogram “IGI” should be lemmatized as mahru or pānu and whether the word ša is a determinative or relative pronoun. In the lemmatizer’s defense, these determinations are not always easy for qualified philologists to make either. The issues primarily arise from inadequate training data, which is a common challenge in the linguistic annotation of ancient languages. In future work, it is advisable to make manual spot checks on words with higher confidence scores to detect possible repetitive errors.

A significant proportion of incorrect lemmas and POS tags relate to personal names and other proper nouns. Oracc assigns different POS tags to personal names and to the so-called lineage names that were used by the urban upper class of Babylonian society (Nielsen, 2011). Lineage names typically appear as the last element in a three-part genealogy beginning with a person’s name and patronym. However, because the patronym was sometimes omitted, the last element of a two-part genealogy can sometimes represent a lineage name rather than a patronym. This ambiguity resulted in a number of erroneous POS tags. Moreover, lineage names often relate to professions, which also causes ambiguity in their lemmatization and POS tagging. The wealth of different proper nouns and the variation in their spelling led to a number of erroneous lemmatizations as well.

Finally, the conjugation of Babylonian verbs can be quite complicated, and forms such as “in-na-aʾ” from nadānu and “it-tal-ku-ú” from alāku resulted in erroneous lemmatizations. To minimize manual work, we did not add the most uncommon out-of-vocabulary words to the override lexicon. As a result, some rare out-of-vocabulary words are given lemmas that may not even exist in the Akkadian language. This is again caused by inadequate training data, and the issue can only be fixed by manual work on the override lexicon. Each research group needs to consider their needs and resources in this regard: the cleaner the data needs to be, the more manual work is required.

Phonemic transcription of Neo-Babylonian is complicated by the inherent distance between cuneiform orthography and phonological reality, a challenge common to many historical and modern writing systems. In addition to logograms, syllabic spellings can be imprecise and use conservative forms. For example, short final vowels had mostly dropped out in pronunciation, but they are still regularly represented in writing, albeit in a haphazard manner (Hackl, 2021). This issue is not treated systematically in the training data, which makes the transcription of logographic spellings particularly difficult. It was thus expected that BabyLemmatizer would not produce transcriptions of the same quality as its lemmas and POS tags. We evaluated the quality of the transcriptions using a sample of 50 words each from BALT and Achemenet. The accuracy of the transcriptions was ca. 70%. Errors mostly related to proper nouns, rare words, and the endings of words written logographically.

Our methods and workflow provided satisfactory results for our present needs, but there is room for further development. For example, the lemmatizer should not annotate individual signs in damaged contexts, as these do not represent full words that can be analyzed meaningfully. Here we refer to cases where the scholar preparing the transliteration has been able to read individual signs but not to join them together into meaningful words. Currently, the lemmatizer annotates each such sign as a separate word, which is clearly erroneous. Furthermore, the output of the lemmatizer is not fully aligned with the Oracc guidelines on linguistic annotation (Robson, 2019). This relates to issues such as inconsistency in the lemmatization and POS tagging of statives: in our dataset, they are sometimes classified as adjectives (parsu) and sometimes as verbs (parāsu). Both classifications are linguistically possible, but the Oracc guidelines recommend always treating statives as adjectives. Improving our workflow and lemmatizer in this regard would ensure consistency and compatibility with the existing Oracc data. Finally, transcriptions of personal names could easily be improved by using lemmas as transcriptions. Personal names are not declined like other nouns, and the Prosobab data used in the override lexicon for lemmas and POS tags contains many names not included in the training data for transcriptions.

5 Relevance and applications

Our methods and workflow offer a means for the semi-automatic annotation of large corpora of Akkadian cuneiform texts. Given that the lemmatizer, its documentation, and a Neo-Babylonian model are openly available, the most obvious task to which our tools are suited is the lemmatization of further Babylonian texts from the first millennium BCE. However, the lemmatizer can be relatively easily applied to annotate data from other periods, provided that sufficient training data is available. For periods and genres with limited training data, manual annotation of some data is necessary before the lemmatizer can achieve results equivalent to those we report here. Our workflow and methods can be most fruitfully applied by a team that has expertise in both Assyriology and computational linguistics.

The annotated data can be used with both philological and computational methods in research and teaching. First, flexible corpus queries can be performed in Korp. Because Korp enables both simple and advanced searches, it can be used in classroom settings and for philological research. Most simply, the user can search for transliterated or lemmatized words and delimit the search by word- or text-specific features such as part of speech, genre, or provenance. If one is interested in an expression that consists of two or more words, several elements can be combined in a single query. For example, one could search for Neo-Babylonian guarantee clauses by looking for the lemmas pūtu and našû and allowing 0–10 words to appear between them in the text (see the sketch below). In addition to displaying the results of a query in the keyword-in-context view, Korp shows statistics about the results, allowing the user to control the way the statistics are compiled. We can see, for example, that pūtu šanû našû (“one guarantees for the other”) is the most common lemma combination used in guarantee clauses in our corpus. For the Achemenet texts, Korp provides a search feature that is not available on the Achemenet website. Links to the texts on Achemenet give the user easy access to the original editions.
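
In Korp’s extended search, such a query corresponds to a CQP-style expression, sketched here in a Python snippet; the attribute name lemma is assumed and may differ in the actual corpus configuration.

# Guarantee clause: pūtu followed by našû within at most ten intervening words.
query = '[lemma = "pūtu"] []{0,10} [lemma = "našû"]'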

Assyriologists have come to know Oracc as the platform through which to access annotated online editions of cuneiform texts. Due to copyright restrictions, we could only make part of our corpus available on Oracc. Nevertheless, the BALT project adds a non-negligible 2,990 texts to Oracc. Users can browse transliterations and translations, view glossary items, see attestation data for individual terms, and search by keyword. This makes it suitable for teaching purposes and simple corpus queries. The possibility of downloading the annotated texts as JSON files makes our texts easily available to computational Assyriologists who use Oracc data in their research.

Semantic networks are used optimally in tandem with corpus searches in Korp. The networks provide a macro perspective on lexical semantics in our corpus and are well suited for exploratory research. After detecting a phenomenon that is worth studying further, the user can simply click the word of interest to access the relevant search results in Korp. The networks also have great potential as a teaching tool, as they can illustrate semantic phenomena for students with limited knowledge of the source material. For example, students can use the networks to detect a few interrelated words they find interesting, search for their occurrences in Korp, and then move on to study the original texts in which the words are attested.

Finally, the annotations can be downloaded as a set of CoNLL-U files for maximum flexibility in query and data manipulation. These files are well suited for many tasks in computational linguistics. These include morpho-syntactic annotation (Luukko et al., 2020; Ong & Gordin, 2024), the creation of linked open data from text metadata and annotations (Khan et al., 2022; Nurmikko-Fuller, 2023), and the study of lexical semantics with tools like pointwise mutual information and word vectors (Bennett, 2023; Svärd et al., 2021).

Notes

[1] http://oracc.org/ (last accessed 21 January 2026).

[2] https://www.archibab.fr/ (last accessed 21 January 2026).

[3] https://www.ebl.lmu.de/ (last accessed 21 January 2026).

[4] http://www.achemenet.com/ (last accessed 21 January 2026).

[5] http://urn.fi/urn:nbn:fi:lb-2025052001 (last accessed 21 January 2026).

[6] http://oracc.org/osl/ (last accessed 21 January 2026).

[7] https://cdli.earth/ (last accessed 21 January 2026).

[8] https://www.kielipankki.fi/language-bank/ (last accessed 21 January 2026).

Acknowledgements

We have annotated texts that were transliterated by numerous colleagues. We wish to thank the Achemenet project as well as Johannes Hackl, Bojana Janković, Michael Jursa, Yuval Levavi, Martina Schmidl, and Caroline Waerzeggers for permission to use their transliterations. János Everling’s legacy data is published in an annotated format to honor his pioneering work in making transliterated cuneiform texts available online. We extend our thanks to Kathleen Abraham, Michael Jursa, and Shai Gordin for giving us access to the NaBuCCo metadata for certain texts. Finally, we thank Niek Veldhuis and Heidi Jauhiainen for their help at various stages of the project, especially for writing some useful scripts for data manipulation.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Tero Alstola: conceptualization, data curation, investigation, methodology, project administration, validation, writing – original draft, writing – review & editing

Aleksi Sahala: data curation, investigation, methodology, software, writing – original draft

Jonathan Valk: data curation, investigation, validation, writing – review & editing

Matthew Ong: data curation, investigation, methodology, software, validation, writing – original draft

DOI: https://doi.org/10.5334/johd.494 | Journal eISSN: 2059-481X
Language: English
Submitted on: Dec 11, 2025 | Accepted on: Feb 3, 2026 | Published on: Mar 6, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Tero Alstola, Aleksi Sahala, Jonathan Valk, Matthew Ong, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.