1 Overview
Repository location
Zenodo: https://doi.org/10.5281/zenodo.10018951
GitHub: https://github.com/ElectronicBabylonianLiterature/transliterated-fragments
Context
Cuneiform writing was originally developed around 3200 BCE in southern Mesopotamia (modern-day Iraq) to write the Sumerian language, and was later adapted for languages such as Hittite, Hurrian, Luwian, Urartian, and most prominently Akkadian, one of the best attested languages from antiquity. Cuneiform consists of wedge-shaped imprints (Latin cuneus, "wedge") pressed into damp clay tablets (see Figure 1). The most recent known cuneiform tablet is dated to 75 CE (Sachs, 1976). The tablets that have been excavated are today stored in different museums worldwide (Streck, 2010), some of which have started digitizing their collections. A major digitization effort, spearheaded by the CDLI (Cuneiform Digital Library Initiative, 2023), has resulted in the digital capture of tens of thousands of tablets over the past 30 years.

Figure 1
Fragment ND.5513 with ATF transliteration, displayed in the browser (https://www.ebl.lmu.de/fragmentarium/ND.5513).
The representation of the signs on a tablet in Latin characters is called transliteration. A digital transliteration system that includes markup for all phonetic and graphic features, the ASCII-Transliteration-Format (ATF), was created by the CDLI in the late 1990s and early 2000s; the name is now anachronistic, since the format can use Unicode. The format was later modified by Tinney (2023) and named Oracc ATF.
CDLI, a database that contains records for more than 300,000 tablets, represents the largest publicly available digital collection of photographs and transliterations of cuneiform tablets. About 50 percent of the roughly half a million cuneiform tablets which have been excavated so far have not yet been transliterated or published (Streck, 2010).
The dataset in this paper was collected as part of the Electronic Babylonian Literature (eBL) project. The core of eBL is its online platform, which provides easy access to an extensive collection of transliterations of cuneiform tablets, along with tools that allow users to search the data and produce new transliterations of formerly unedited tablets. In this way, eBL seeks to offer a solution to the challenges posed by the fragmentary nature of the Mesopotamian written documentation. The eBL website and associated software projects are open-source, and the individual records can be freely accessed through the browser (cf. Figure 1). The public API, together with the Python code presented in this paper, aims to make the entire dataset easy to access and process using our ATF parser.
2 Method
Sources
The catalogue of the eBL platform currently contains 262,717 records of cuneiform tablets, comprising the cuneiform collections of the British Museum, the Penn Museum, the Yale Babylonian Collection, the Hilprecht Collection, and the Vorderasiatisches Museum, among others. Of these, almost 25,000 are available in transliteration, and 52,105 in photographs; eBL is authorized to use and display the images with the consent of the specific museum mentioned in each image. The initial list of museum numbers of the eBL platform was compiled using the catalogues of the CDLI, The British Museum Digital Collections (2023), Yale Babylonian Collection (2023) and other published and unpublished catalogues; the fields of the catalogue have been populated by the eBL team, which has also produced the transliterations. New tablets are constantly added, and each document is subject to a careful revision process by the team before being entered into the database.
Steps
Around 20,000 photos have been produced by photographers working for the eBL project. They cannot be reproduced without the explicit consent of the collections in which the objects are kept. The transliterations in the dataset have been produced by Assyriologists working at the eBL project, starting in 2018. In addition, transliterations have been entered by Assyriologists working at the projects Edition of the Omen Series Šumma Alu (Mittermayer, 2017–2021); Typology and potential of the excerpt tablets of Šumma alu (Mittermayer, 2022–2023); Introducing Assyrian Medicine: Healthcare Fit for a King (Taylor, 2020–2023); Reading the Library of Ashurbanipal: A Multi-sectional Analysis of Assyriology’s Foundational Corpus (Taylor & Jiménez, 2020–2023); and Cuneiform Artefacts of Iraq in Context (Jiménez, Sallaberger, & Radner, 2023–2046). Many of the over 25,000 transliterations have been produced solely on the basis of the photographs and have not been checked against the originals in museums. The transliterations are created using an online ATF editor that is part of the eBL platform. Once saved, the transliterations are parsed into a JSON tree by our ATF parser and stored in the database.
Quality Control
A permission and revision system was implemented at the beginning of the project to maintain high quality of the data. Each transliteration is reviewed by another expert and changes are tracked, documenting the edit history of each document.
2.1 Dataset Description
2.2 Description
The dataset is a single JSON file which contains a list of objects (so-called “fragments”, since most cuneiform tablets in the dataset are fragmentary). Each fragment contains an id (e.g. ND.5513) which can be used to find the fragment in the browser (see Figure 1), a short description, metadata such as the name of the collection, the museum and information on the publication history. There is additional information on the editors and the edit history of the transliteration, specified under “records”, the genre, script type, pointers to external collections containing the item and many more properties. The transliteration of the fragment is saved as the “atf” property (as plain text, i.e. a string) which can be parsed into a JSON tree, as explained in detail below.
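The structure described above can be explored with a few lines of Python. The following is a minimal sketch, not part of the eBL fragments Python code: it assumes the file is a JSON list of objects and that the transliteration is stored under "atf", as stated above. The exact key of the identifier is an assumption (the sketch tries "_id" and then "id"), and the sample values are invented so that the block runs without the real file.

```python
import json
import tempfile
from pathlib import Path


def load_fragments(path):
    """Load the dataset file and return its list of fragment objects."""
    with open(path, encoding="utf-8") as handle:
        fragments = json.load(handle)
    if not isinstance(fragments, list):
        raise ValueError("expected a JSON list of fragment objects")
    return fragments


def fragment_id(fragment):
    """Return the identifier; the key name ("_id" or "id") is an assumption."""
    return fragment.get("_id", fragment.get("id"))


def index_atf_by_id(fragments):
    """Map each fragment id to its transliteration string (the "atf" property)."""
    return {fragment_id(f): f.get("atf", "") for f in fragments}


# Tiny stand-in for fragments.json; all values are invented for illustration.
sample = [
    {"_id": "X.1", "atf": "1. a-na\n2. be-li2", "records": []},
    {"_id": "X.2", "atf": "", "records": []},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "fragments.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    atf_by_id = index_atf_by_id(load_fragments(sample_path))

print(len(atf_by_id), "fragments loaded")
```

To work with the real data, pass the path of the downloaded fragments.json to `load_fragments` in place of the temporary sample file.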
3 Downloading and processing the data
The eBL fragments Python code (see Section 1, Repository location) can be used to download and parse all openly available transliterated documents using our public API. The eBL-ATF parser, which is an integral part of the eBL-API, has been made accessible as a standalone Python package in the eBL fragments Python code. Since eBL-ATF is a superset of standard ATF, the latter can be easily converted to eBL-ATF. For details on the parser and compatibility with Oracc ATF, the reader is referred to our documentation. The dataset at Zenodo contains all the fragments available on the 1st of September 2023. To obtain an up-to-date version, the eBL fragments Python code should be used.
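For quick statistics that do not need a full parse, the raw "atf" string can also be inspected line by line. The sketch below is a rough heuristic and not the eBL-ATF parser: it relies only on general ATF conventions (structure lines begin with "@", state lines with "$", notes and similar lines with "#", and text lines with a line label ending in a period), and it ignores other line types that eBL-ATF may define. The function names and the example string are invented for illustration.

```python
import re
from collections import Counter

# A text line starts with a line label such as "1." or "2'." followed by a space.
TEXT_LINE = re.compile(r"^\S+\.\s")


def classify_line(line):
    """Assign a coarse category to one ATF line based on its first characters."""
    stripped = line.strip()
    if not stripped:
        return "empty"
    if stripped.startswith("@"):
        return "structure"
    if stripped.startswith("$"):
        return "state"
    if stripped.startswith("#"):
        return "note"
    if TEXT_LINE.match(stripped):
        return "text"
    return "other"


def count_line_kinds(atf):
    """Count how many lines of each category an ATF string contains."""
    return Counter(classify_line(line) for line in atf.splitlines())


# Invented example; it does not reproduce any fragment in the dataset.
example_atf = "\n".join(
    [
        "@obverse",
        "1'. [...] a-na",
        "2'. be-li2 [...]",
        "$ rest of obverse broken",
        "#note: invented example",
    ]
)

counts = count_line_kinds(example_atf)
print(counts["text"], "text lines")
```

Anything beyond such counts, for example access to individual signs or readings, should go through the parser shipped with the eBL fragments Python code, which handles the full syntax.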
Object name
fragments.json
Format names and versions
JSON.
Creation dates
2018-05-29 to 2023-08-31.
Dataset creators
Sophie Cohen – Data curation
Zsombor Földi – Data curation
Ekaterine Gogokhia – Data curation
Aino Hätinen – Data curation
Adrian Heinrich – Data curation
Tonio Mitto – Data curation
Felix Müller – Data curation
Jeremiah Peterson – Data curation
Geraldina Rozzi – Data curation
Luis Sáenz – Data curation
Babette Schnitzlein – Data curation
Krisztián Simkó – Data curation
Henry Stadhouders – Data curation
Catherine Mittermayer – Data curation
Fabienne Huber Vuillet – Data curation
Kaira Boddy – Data curation
Jon Taylor – Data curation
Enrique Jiménez – Data curation, Project administration, Writing – review & editing, Funding acquisition
Language
English.
License
eBL fragments Python code: MIT License
Data (fragments.json): Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
Photographs: Reproduction of the images requires explicit consent from the funding projects, the relevant institutions, and the institutions in which the cuneiform tablets are kept. Users are directed to review the conditions for image reproduction in the image captions.
Repository name
Zenodo, GitHub
Publication date
2023-08-31
4 Reuse Potential
For traditional philology the dataset is of enormous value, since it gives access to tens of thousands of previously unpublished cuneiform tablets. It has been estimated that the tablets preserved in the world’s museums contain about 10,000,000 words in total (Streck, 2010, 54–55); the dataset published here, compiled mostly from scratch, contains 350,000 lines that were previously inaccessible. This wealth of new data has already advanced the reconstruction of broken tablets from their fragments: during the compilation of the corpus alone, 1,250 “joins” (i.e., fragments that belong together) have been detected. The dataset has also been used to identify the content of fragments that would otherwise be difficult to identify.
NLP tasks for cuneiform scripts include, among others, generating automatic transliterations from signs to readings (Gordin et al., 2020), restoring damaged signs (Fetaya, Lifshitz, Aaron, & Gordin, 2020), matching fragments with their corresponding parts to reconstruct complete tablets, and machine translation from Akkadian to English (Gutherz, Gordin, Sáenz, Levy, & Berant, 2023). For an overview of different NLP tasks in Assyriology see Sahala (2021). The images can be used for semi-supervised or unsupervised OCR methods (Rusakov, Somel, Fink, & Müller, 2020). For recent advances in visual methods for cuneiform script see Bogacz and Mara (2022).
Acknowledgements
The photographs of tablets from The British Museum’s Kuyunjik collection were produced in 2009–2013, as part of the ongoing “Ashurbanipal Library Project” (2002–present), thanks to funding provided by The Andrew Mellon Foundation. The photographs were produced by Marieka Arksey, Kristin A. Phelps, Sarah Readings, and Ana Tam, with the assistance of Alberto Giannese, Gina Konstantopoulos, Chiara Salvador, and Mathilde Touillon-Ricci. They are displayed on the eBL website courtesy of Dr. Jon Taylor, director of the “Ashurbanipal Library Project.” The photographs of The British Museum’s Babylon collection were taken by Alberto Giannese and Ivor Kerslake (eBL Project, 2019–present). The photographs of the tablets in the Iraq Museum were produced by Anmar A. Fadhil (University of Baghdad – eBL Project) and are displayed by permission of the State Board of Antiquities and Heritage and The Iraq Museum. The photographs of the tablets in the Yale Babylonian Collection were produced by Klaus Wagensonner (Yale University) and are used with the kind permission of Agnete W. Lassen (Associate Curator of the Yale Babylonian Collection, Yale Peabody Museum).
Funding Information
The research has been supported by a Sofja Kovalevskaja Award (Alexander von Humboldt Foundation).
Competing Interests
The authors have no competing interests to declare.
Author Contributions
Yunus Cobanoglu – Software, Writing – original draft
Jussi Laasonen – Software
Fabian Simonjetz – Software, Writing – review & editing
Ilya Khait – Software
Sophie Cohen – Data curation
Zsombor Földi – Data curation
Aino Hätinen – Data curation
Adrian Heinrich – Data curation
Tonio Mitto – Data curation
Geraldina Rozzi – Data curation
Luis Sáenz – Data curation, Writing – review & editing
Enrique Jiménez – Data curation, Project administration, Writing – review & editing, Funding acquisition
