A Large-Scale Dataset of Annotated Cuneiform Sign Images for Digital Palaeography

Or Lewenstein; Daniel López; Cyrill Dankwardt; Mays Fadhil Alrawi; Louisa Grill; Brian Mak; Albert Setälä; Fiammetta Gori; Aino Hätinen; Felix Rauchhaus; Zsombor Földi; Enrique Jiménez

doi:10.5334/johd.503

1 Overview

Repository location

DOI: 10.5281/zenodo.17949595

Context

Cuneiform is the oldest known writing system, and also one of the most long-lived (ca. 3200 BCE–75 CE). The script is named after the distinctive shape of its characters (Latin cuneus, “wedge”), which are composed of individual wedge-shaped impressions made in clay with a stylus (Finkel & Taylor, 2015, 75). The most common medium for cuneiform writing was clay tablets. In the course of editing these tablets, philologists create scientific transliterations of the cuneiform text, i.e. they assign to each individual character its corresponding phonetic value in Latin script. The vast majority of non-administrative tablets are undated, and dating these texts often poses a considerable challenge for researchers. Palaeography, the study of ancient writing, offers a potential method for assigning approximate dates to these undated documents, because the forms of cuneiform signs evolved over the centuries. For example, the group of four vertical wedges used in the sign az is known to progressively shift to the right over time (see Table 1). However, the practical application of palaeographic dating is often limited by the scarcity of securely dated signs available for comparison.

Table 1

Neo- and Late Babylonian dated instances of the sign az in the dataset, as displayed on the eBL platform (last accessed: 3 March 2026).


BM.34110 obv. 11	YBC.4111 obv. 7	BM.33541 l. 11
(12 January 553 BCE)	(13 June 528 BCE)	(16 September 287 BCE)

BM.33491 obv. 11	BM.33013 l. 6	BM.33014 l. 9
(ca. 125 BCE)	(18 May 93 BCE)	(19 June 93 BCE)

As far back as the 1920s, the French Assyriologist Charles Fossey (1869–1946) attempted to assemble a comprehensive compilation of cuneiform characters from all time periods (see Figure 1)—the last such effort to create a systematic palaeographic reference tool. For various reasons, not least that he worked from hand copies rather than from original tablets or photographs, his compilation is heavily outdated and of limited practical value today. Nevertheless, with more than 30,000 instances of signs, it remains the only available resource of its kind.

Figure 1

Excerpt from section on the sign az in Fossey’s sign list (Fossey, 1926).

Since there is no standardized modern repertoire of cuneiform sign forms, there are no clear criteria when it comes to dating a tablet on palaeographic grounds. As a result, researchers in the field of cuneiform studies sometimes reach quite different conclusions, depending on their experience and field of expertise. Certain supposedly “diagnostic” sign forms are frequently discussed in passing notes and used for chronological classification by specialists (Jursa, 2015; Peterson, 2011, 154 fn. 5). However, evidence for the actual diagnostic significance of these forms remains lacking, as many were used across multiple time periods. Reliable dating would require analyzing a vast number of different sign forms—a task too large in scale for manual analysis.

It has become increasingly apparent in recent years that digital methods are essential for palaeographic analysis at the scale required for reliable dating. A pioneer in the field of digital palaeography was the Cuneiform Digital Palaeography Project (CDPP; 2002–2004) at the University of Birmingham. This project produced a database of 11,610 characters (Hügel, 2014) and a series of articles describing its infrastructure (Arvanitis et al., 2002; Woolley et al., 2002). The Late Babylonian Signs project (LaBaSi; 2014–2017) focused on administrative texts from the so-called long sixth century, creating a reference tool with 12,897 sign examples. The project Computer-unterstützte Keilschriftanalyse (CuKa; 2018–2023) has made available a dataset of 22,800 cuneiform signs from Hittite tablets (ca. 1595–1000 BCE), as well as innovative instruments for classifying signs (Rusakov et al., 2021) and for data augmentation (Rest et al., 2022).

We present here the largest dataset of cuneiform sign photographs to date, featuring 158,946 instances (see Figure 2). Alongside the much smaller CDPP dataset, it is the only resource covering multiple periods of cuneiform writing. The collection is designed to support scholars in dating tablets and to enable computer-assisted analysis of the script.

Figure 2

Distribution of the cuneiform sign dataset across historical periods.

After appropriate data preparation (image normalization, segmentation, class assignment), this dataset would also be an ideal basis for training a powerful OCR model. Such a model would not only enable new quantitative analyses of the development of cuneiform script over thousands of years, but also accelerate mass digitization (see Section 4, Reuse Potential).

2 Method

Steps

The majority of the signs (133,916, i.e., 85.24%) were annotated manually. The process began with creating editions of cuneiform tablets on the eBL platform. These editions, formatted in the eBL flavor of the ATF format,¹ were produced following standard domain practices. Metadata was simultaneously created to assign each tablet to its respective historical period, among other attributes. Within each tablet, all eligible cuneiform signs were subsequently annotated based on the existing transliteration using the eBL sign annotation tool, which is built upon the react-image-annotation library.² This tool enables the tagging of cuneiform tablet images through bounding boxes that surround each individual sign (see Figure 3). While these bounding boxes may overlap, each contains exactly one complete sign, ensuring that every sign image corresponds directly to a specific sign in a line of transliteration within the associated eBL entry. In total, it took ca. 70 months to produce the transliterations of the cuneiform tablets and to annotate the photos according to the transliteration.

Figure 3

eBL sign annotation tool (https://www.ebl.lmu.de/library/BM.33013/annotate) (last accessed: 3 March 2026).

Sampling Strategy

As a rule, signs were selected for annotation only when they were fully preserved, clearly legible, well lit, and horizontally aligned. The annotations follow the eBL sign list,³ which is based on the conventions of the standard cuneiform sign list (Borger, 2010). The few signs that coalesce in ligatures were split only when their constituent signs could be reliably separated into distinct bounding boxes.

Quality Control

A subset of common signs was selectively sampled and manually inspected by the team to identify potential annotation errors. Any errors detected during this assessment were corrected.

Semi-automatic extraction

Signs were also extracted using a semi-automated method. The starting point was a list of sign coordinates obtained through optical character recognition (OCR) and made available by Cobanoglu et al. (2024). That list contained 1,091,033 signs (including a small number of duplicates, estimated at 1–2 percent).

These signs were further filtered to aid the subsequent step of manual checking. Signs were excluded if they:

were already manually annotated, or
occurred in tablets that still had no transliteration on the eBL platform (as of December 2025).

After this step, 240,455 signs remained. Furthermore, a sign was excluded unless it and its two neighboring signs appeared in the same (monotonic) order in the transliteration. For example, ri in the OCR sign sequence below would be included because bi, ri, and en appear in order in the transliterated signs:

signs from OCR = ['DI', 'BI', 'RI', 'EN']
transliterated signs = ['BI', 'DI', 'RI', 'EN', 'KA', 'RA']
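
The order check can be illustrated with a minimal Python sketch. It is one possible reading of the rule described above, not the project's actual script (which is available in the GitHub repository): each OCR sign is kept only if it and its immediate neighbors occur, in the same order, somewhere in the transliterated sequence. The function names and the treatment of the first and last sign (a shortened two-sign window) are assumptions made for illustration.

```python
def is_ordered_subsequence(window, reference):
    """Return True if the items of `window` occur in `reference` in the same order.

    Other signs may stand between them in `reference`.
    """
    remaining = iter(reference)
    # `sign in remaining` consumes the iterator up to the first match,
    # so each later sign must be found after the previous one.
    return all(sign in remaining for sign in window)


def order_check(ocr_signs, transliterated_signs):
    """Return one boolean per OCR sign: True if the sign passes the order check."""
    keep = []
    for i in range(len(ocr_signs)):
        # The sign plus its left and right neighbor (assumption: the window
        # is simply shorter at the start and end of the sequence).
        window = ocr_signs[max(0, i - 1): i + 2]
        keep.append(is_ordered_subsequence(window, transliterated_signs))
    return keep


ocr = ["DI", "BI", "RI", "EN"]
transliterated = ["BI", "DI", "RI", "EN", "KA", "RA"]
print(order_check(ocr, transliterated))  # [False, False, True, True]
```

Under these assumptions, RI is kept because BI, RI, and EN occur in that order in the transliteration, whereas DI and BI are dropped because their relative order is reversed.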

Scripts containing these filtering steps can be found on the eBL platform’s GitHub repository.⁴ After the order check, 40,872 signs remained. These were then manually filtered, leaving 23,186 signs (i.e., 14.76% of the total), which were added to the eBL signs collection (type-property: OCR). A JSON file of these signs has been uploaded to Zenodo.⁵
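
The figures reported for the semi-automatic pipeline can be checked with a few lines of arithmetic. The sketch below uses only counts stated in the text; the stage labels and function name are illustrative. Taking the sum of the manually annotated and OCR-derived signs (157,102) as the denominator is an assumption, one that reproduces the two quoted shares.

```python
# Counts reported in the text for each stage of the semi-automatic pipeline.
STAGES = [
    ("OCR sign list", 1_091_033),
    ("after automatic exclusions", 240_455),
    ("after order check", 40_872),
    ("after manual filtering", 23_186),
]


def retention(stages):
    """Return (label, count, percent kept relative to the previous stage)."""
    rows = []
    previous = None
    for label, count in stages:
        kept = None if previous is None else round(100 * count / previous, 2)
        rows.append((label, count, kept))
        previous = count
    return rows


for label, count, kept in retention(STAGES):
    suffix = "" if kept is None else f" ({kept}% of previous stage)"
    print(f"{label}: {count:,}{suffix}")

# Shares of manually annotated and OCR-derived signs.
# Assumption: the denominator is the sum of these two groups.
manual, ocr = 133_916, 23_186
manual_share = round(100 * manual / (manual + ocr), 2)  # 85.24
ocr_share = round(100 * ocr / (manual + ocr), 2)        # 14.76
print(manual_share, ocr_share)
```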

3 Dataset Description

Repository name

The dataset is hosted on the Electronic Babylonian Library (eBL) platform, a digital research environment for cuneiform studies. The core dataset of annotated sign crops and metadata is publicly accessible via its API and downloadable from the platform’s GitHub repository.⁶ A snapshot of the dataset (date: 2025-12-19) has also been archived on Zenodo.⁷

Object name

The primary dataset consists of two main components:

Compressed archive of sign images: A collection of 158,946 individual .jpg image files. Each file is named with a unique ID (e.g., K.2210_0.jpg) corresponding to the cropped sign image from a cuneiform tablet.
Annotation metadata file: A structured JSON file containing the complete metadata for all annotated signs. The file format is a list of JSON objects, one per sign annotation, as shown in the provided sample.

Format names and versions

Images: JPEG (.jpg), RGB color format. Dimensions vary per individual sign bounding box.
Metadata: UTF-8 encoded JSON (.json).
Distribution: The complete dataset is packaged and distributed as a .tar.gz archive.
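
Because the images are distributed as a single .tar.gz archive, they can be read one by one without unpacking everything to disk. The following is a minimal sketch using Python's standard tarfile module; the function name is illustrative, and the internal directory layout of the archive is not assumed (any member ending in .jpg is returned).

```python
import tarfile


def iter_sign_images(archive_path):
    """Yield (member name, raw JPEG bytes) for every .jpg file in a .tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and non-image files such as the metadata JSON.
            if member.isfile() and member.name.lower().endswith(".jpg"):
                handle = archive.extractfile(member)
                yield member.name, handle.read()
```

A call such as `for name, data in iter_sign_images("signs.tar.gz"): ...` (the archive name here is hypothetical) then gives access to each crop in turn; the bytes can be passed to any image library for decoding.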

Creation dates

Data creation and annotation were carried out between March 2020 and December 2025. The manual annotation process spanned approximately 70 months of effort. The semi-automated extraction and validation of additional signs were completed in mid-2025.

Dataset creators

Principal Investigator: Enrique Jiménez (LMU Munich) – Project oversight and design.
Annotation Supervision: Zsombor Földi (LMU Munich) – Manual transliteration of tablet texts (ATF format) and supervision of the annotation process.
Senior Researchers: Fiammetta Gori, Aino Hätinen, Felix Rauchhaus (all LMU Munich) – Manual transliteration of tablet texts (ATF format).
Annotators (Student Assistants): Cyrill Dankwardt, Louisa Grill, Or Lewenstein, Mays Fadhil Alrawi, Albert Setälä (all LMU Munich) – Annotation of sign bounding boxes and metadata using the eBL platform tools.
Data Scientists/Engineers: Wentao Che, Yunus Cobanoglu, Brian Mak, Asim Niaz (all LMU Munich) – Development of the annotation platform, data pipeline architecture, and semi-automated extraction/filtering scripts.
All authors of this paper contributed to various aspects of dataset creation.

Language

The metadata is in English; the annotated cuneiform tablets are written in Akkadian and Sumerian. Sign identifiers (signName) use the standard Assyriological nomenclature (based on Borger, 2010).

License

The image crops in the dataset come from photographs of cuneiform tablets taken in various collections between 2009 and 2025. These include the British Museum (both from the “Ashurbanipal Library Project” and the eBL’s photography of the Babylon Collection), the Iraq Museum (from the “Cuneiform Artefacts of Iraq in Context” Project), the Yale Babylonian Collection, the Hilprecht Collection in Jena, and several other collections. The photographs of these tablets are published under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license, which allows for free sharing and adaptation of the material for non-commercial purposes, provided appropriate credit is given. Accordingly, the cropped regions derived from these photographs are also made available under the same CC BY-NC 4.0 license.

Publication date

The first version of this dataset, associated with Cobanoglu et al. (2024), was published on 22 February 2024 (https://zenodo.org/records/10693601) and comprised 52,102 signs. The current version, published on 16 December 2025, comprises 158,946 signs and represents a substantial expansion of the original dataset.

Detailed Data Structure

The JSON metadata for each sign image follows this schema for every entry in the JSON array:

{
      "_id": "K.2210_0", // Unique identifier (tabletNumber_sequentialIndex)
      "tabletNumber": "K.2210", // Catalogue number
      "label": "", // Optional label (line/position information)
      "signName": "SAG", // Standard sign name
      "type": "MANUALLY_ANNOTATED_SIGN", // Annotation type
      "value": "SAG", // Transliteration value
      "genres": [], // Genres property from the fragments collection
      "script": "NA", // Script metadata from the fragments collection
      "imageFilename": "K.2210_0.jpg" // Corresponding image filename
}
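
As an illustration of how this schema might be used, the following minimal Python sketch loads the metadata and selects entries by sign name, script, and annotation type. The function names are illustrative, and the sketch assumes that the fields have the plain string form shown in the sample entry; the actual file may differ in detail.

```python
import json


def load_annotations(path):
    """Load the metadata file: a JSON array with one object per sign annotation."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def select_signs(entries, sign_name=None, script=None, annotation_type=None):
    """Return the entries that match every criterion that is not None."""
    selected = []
    for entry in entries:
        if sign_name is not None and entry.get("signName") != sign_name:
            continue
        if script is not None and entry.get("script") != script:
            continue
        if annotation_type is not None and entry.get("type") != annotation_type:
            continue
        selected.append(entry)
    return selected


# Example with an in-memory entry modeled on the sample above.
sample = [{
    "_id": "K.2210_0",
    "tabletNumber": "K.2210",
    "signName": "SAG",
    "type": "MANUALLY_ANNOTATED_SIGN",
    "value": "SAG",
    "genres": [],
    "script": "NA",
    "imageFilename": "K.2210_0.jpg",
}]
matches = select_signs(sample, sign_name="SAG", script="NA")
print([entry["imageFilename"] for entry in matches])  # ['K.2210_0.jpg']
```

The imageFilename field then links each selected entry to its crop in the image archive.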

Dataset Statistics

Total sign crops: 158,946
Unique artefacts: 9,276
Average signs per tablet: 17
Annotation types: MANUALLY_ANNOTATED_SIGN, MANUALLY_ANNOTATED_SIGN_DAMAGED, NUMBER, OCR, RULING_LINE, SURFACE_LABEL, UNCLEAR_SIGN
Data size: 11.11 GB for images + 36.9 MB for metadata
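
The summary figures above can in principle be recomputed from the metadata file. The sketch below is a hedged illustration that assumes the tabletNumber and type fields of the schema shown earlier; the function name is illustrative. With the reported totals, 158,946 / 9,276 gives about 17.1 signs per artefact, consistent with the rounded average of 17.

```python
from collections import Counter


def summarize(entries):
    """Compute basic statistics from a list of sign annotation objects."""
    per_tablet = Counter(entry["tabletNumber"] for entry in entries)
    per_type = Counter(entry["type"] for entry in entries)
    mean = len(entries) / len(per_tablet) if per_tablet else 0.0
    return {
        "total_sign_crops": len(entries),
        "unique_artefacts": len(per_tablet),
        "mean_signs_per_artefact": mean,
        "annotation_types": dict(per_type),
    }


# Check of the reported average using the totals given in the text.
reported_mean = 158_946 / 9_276
print(round(reported_mean, 1))  # 17.1
```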

4 Reuse Potential

Traditional uses

As mentioned above (see Context), variations in so-called “diagnostic” signs (see Table 1) can be used not only to assign tablets to a specific time period, but also to identify scribal hands and even regional variation. The data can also be linked or reused in other digital portals, such as ORACC. The ORACC Sign List, for example, provides instances of such usage, as does the LaBaSi project mentioned above.

Computational uses

One potential computational application of this dataset is the training and evaluation of machine learning models for optical character recognition (OCR) on cuneiform tablets. Earlier, smaller versions of this dataset were instrumental in benchmarking state-of-the-art sign detection methods, including both two-stage and single-stage object detection architectures, to automate the localization and classification of signs on raw 2D images (Cobanoglu et al., 2024). Earlier versions of the dataset also facilitated research into stylistic analysis of the script: Yugay et al. (2024) demonstrated its utility by training ResNet-based convolutional neural networks to distinguish between Neo-Assyrian and Neo-Babylonian sign forms. This approach could be extended to other historical periods and used to uncover dominant modes of sign variation. Finally, accurate sign detection models trained on this corpus are a prerequisite for downstream natural language processing (NLP) tasks, such as automatic transliteration and the reconstruction of fragmented texts.
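
When the dataset is used to train and evaluate classifiers, one practical consideration is how to divide it into training and test portions. The sketch below is not a procedure described by the authors; it illustrates one common precaution, namely assigning whole tablets to either portion so that crops from the same tablet (and thus probably the same scribal hand and photograph) do not appear on both sides. The function name and the default test fraction are assumptions.

```python
import hashlib


def split_by_tablet(entries, test_fraction=0.2):
    """Split annotation entries into (train, test) so that no tablet is shared.

    Each tablet is assigned deterministically by hashing its tabletNumber,
    so the split is reproducible without storing a random seed.
    """
    train, test = [], []
    for entry in entries:
        digest = hashlib.sha256(entry["tabletNumber"].encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1).
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (test if bucket < test_fraction else train).append(entry)
    return train, test
```

Because the assignment is made per tablet, the realized share of test signs only approximates `test_fraction`, especially when a few tablets contribute many signs.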

Supplementary Files

A snapshot of the dataset (date: 2025-12-19) has been archived on Zenodo: 10.5281/zenodo.17949595. An associated GitHub repository contains the code and instructions associated with this paper: https://github.com/ElectronicBabylonianLiterature/cuneiform-ocr-data.

Notes

[1] https://github.com/ElectronicBabylonianLiterature/ebl-api/blob/master/docs/ebl-atf.md (last accessed: 3 March 2026).

[2] https://github.com/Secretmapper/react-image-annotation (last accessed: 3 March 2026).

[3] https://www.ebl.lmu.de/signs (last accessed: 3 March 2026).

[4] https://github.com/ElectronicBabylonianLiterature/cuneiform-ocr-data (last accessed: 3 March 2026).

[5] https://zenodo.org/records/17871055.

[6] https://github.com/ElectronicBabylonianLiterature/cuneiform-ocr-data (last accessed: 3 March 2026).

[7] DOI: 10.5281/zenodo.17949595.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Or Lewenstein: Data curation, Investigation, Validation, Writing – original draft. Daniel López: Methodology, Writing – review & editing. Cyrill Dankwardt: Data curation, Investigation, Validation. Mays Fadhil Alrawi: Data curation, Investigation, Validation. Louisa Grill: Data curation, Investigation, Validation. Brian Mak: Data curation, Software, Methodology, Writing – original draft. Albert Setälä: Data curation, Investigation, Validation. Fiammetta Gori: Data curation, Investigation. Aino Hätinen: Data curation, Investigation. Felix Rauchhaus: Data curation, Investigation. Zsombor Földi: Supervision, Data curation, Investigation, Validation. Enrique Jiménez: Conceptualization, Funding acquisition, Project administration, Supervision, Writing – review & editing.