
The Mainz Cuneiform Benchmark Dataset Series: Sign Annotations of 3D Rendered Tablets

Open Access | Dec 2025


(1) Overview

Context

Spatial coverage

Frau Professor Hilprecht Collection of Babylonian Antiquities

Friedrich-Schiller-Universität Jena

Fürstengraben 6, 07743 Jena, Thuringia, Germany

Northern Boundary: 50.9300953

Southern Boundary: 50.9302309

Eastern Boundary: 11.5893116

Western Boundary: 11.589562

Haft Tappeh and Choghazanbil Museum

Susa-Shushtar Road, Haft Tappeh, Khuzestan Province, Iran

Northern Boundary: 32.0800798

Southern Boundary: 32.0804943

Eastern Boundary: 48.3296285

Western Boundary: 48.3301068

Temporal coverage

The MaiCuBeDa dataset [1] contains a selection of cuneiform tablets from the “Frau Professor Hilprecht Collection of Babylonian Antiquities” [2]. These tablets cover a broad chronological range, from the Ur III period (2112–2004 BC) and the Old Assyrian period (2025–1363 BC) to the Middle Assyrian period (1400–1000 BC). All dates refer to the so-called middle chronology.

The MaiCuBeDa HT dataset [3] contains annotations on cuneiform tablets of the Haft Tappeh corpus from the Middle Elamite period (1500–1400 BC).

(2) Methods

Both datasets adhere to a set of guidelines for the creation [4] and annotation of renderings of 3D models that emerged from previous work, the baselines of which were published in [5]. Annotations were created on the surfaces of the selected cuneiform tablets, rendered using the software GigaMesh [6], following the same approach as the previously published HeiCuBeDa dataset [9].

The annotations were then cropped using a Python script that is part of the Cuneur Cuneiform Annotator application, together with metadata and further generated contents, as described in the following paragraphs.

Steps

The annotations of these corpora are based on the following sources:

  1. Transliterations of the tablets: for MaiCuBeDa HT, transliterations published by the project “Digitale Edition der Keilschrifttexte aus Haft Tappeh” in Elamica 12 [7] and 13 [8], which served as templates for the annotations; for MaiCuBeDa, transliterations extracted from the Cuneiform Digital Library Initiative.

  2. 2D renderings of the 3D scans of the tablets, made with the open-source software GigaMesh and published as the HeiCuBeDa dataset [9] for MaiCuBeDa and as publications in heidICON [10] for the Haft Tappeh Collection.1

Using the Cuneur Cuneiform Annotator2 application, the annotations were made by manually reading the cuneiform signs on the renderings and checking the rendering against the transliteration. If the transliteration matched the text, the signs were marked and annotated with the following metadata:

  • – Sign reading: The reading of the sign, which is unambiguously convertible to a Unicode codepoint if one is defined for the respective sign

  • – Tablet surfaces: Front, back, left, right, top, bottom

  • – Line number: The number of the line per surface

  • – Character Index: The index of the annotated sign in the line

  • – Word Index: The index of the word in the respective line

  • – Relative character Index: The index of a sign relative to its word

  • – A set of tags indicating damages in the annotated sign: E.g., damaged, erased

If the transliteration did not match the text written on the cuneiform tablet, the transliteration was augmented by an expert to fit the correct meaning of the signs.

The aforementioned annotated metadata was applied afterwards.

Missing spaces on cuneiform tablets were treated as one character/word index to be consistent with transliteration notations.

The annotations are saved in the JSON-LD format [11], incorporating the Web Annotation Data Model [12] and containing the aforementioned annotation contents, as well as the sign name and Unicode codepoint reference of the cuneiform sign in the transliteration. The rendering position is fixed using an SVGSelector [13], which defines the position as a 2D polygon in pixel coordinates. The cropping process can therefore also be reproduced with different image scaling if needed. In addition, we provide IIIF links to the images on the University Library of Heidelberg’s IIIF service.
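To illustrate the structure described above, the following sketch builds a minimal Web Annotation with an SvgSelector and extracts its pixel-coordinate polygon. The field values, URL, and helper function are illustrative assumptions, not taken from the actual dataset.

```python
import re

# Minimal Web Annotation carrying an SvgSelector (illustrative values only;
# the real dataset's annotations contain further fields and IIIF links).
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": {"type": "TextualBody", "value": "sign reading"},
    "target": {
        "source": "https://iiif.example.org/tablet/full/full/0/default.png",
        "selector": {
            "type": "SvgSelector",
            "value": "<svg><polygon points='10,10 60,12 58,40 12,38'/></svg>",
        },
    },
}

def polygon_points(anno):
    """Extract the 2D polygon (pixel coordinates) from the SvgSelector value."""
    svg = anno["target"]["selector"]["value"]
    pts = re.search(r"points='([^']+)'", svg).group(1)
    return [tuple(map(float, p.split(","))) for p in pts.split()]

print(polygon_points(annotation))
# [(10.0, 10.0), (60.0, 12.0), (58.0, 40.0), (12.0, 38.0)]
```

Because the polygon is stored in pixel coordinates of the rendering, the cropping step can be re-run at any image scale by rescaling these points accordingly.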

By modifying only the link, users can take advantage of the features of a IIIF service, e.g., to retrieve different image resolutions and formats if these suit the machine learning task better than the default PNG images we provide. From the annotation exports created in this way, the MaiCuBeDa and MaiCuBeDa HT 3D datasets provide cropped annotation PNG images for:

  • – Lines on the cuneiform tablet (cf. Figure 1)

  • – Words on the cuneiform tablet (cf. Figure 2)

  • – Signs on the cuneiform tablet (cf. Figure 3)

  • – Single wedges of a cuneiform sign for a selected set of 30 tablets of the MaiCuBeDa dataset, which have been fully annotated (cf. Figure 4)
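The IIIF links mentioned above can be adapted by rewriting the size and format segments of the IIIF Image API URL pattern {identifier}/{region}/{size}/{rotation}/{quality}.{format}. The sketch below demonstrates this on a made-up placeholder URL, not an actual MaiCuBeDa link:

```python
# Sketch of rewriting a IIIF Image API URL to request a different size and
# format; the base URL below is a placeholder, not a real dataset link.
def iiif_variant(url: str, size: str = "full", fmt: str = "png") -> str:
    """Replace the size and format segments of a IIIF Image API URL.

    IIIF URLs end in .../{region}/{size}/{rotation}/{quality}.{format}.
    """
    parts = url.rsplit("/", 4)  # [prefix+identifier, region, size, rotation, quality.format]
    quality = parts[4].rsplit(".", 1)[0]
    parts[2] = size
    parts[4] = f"{quality}.{fmt}"
    return "/".join(parts)

url = "https://iiif.example.org/tablet123/full/full/0/default.png"
print(iiif_variant(url, size="!512,512", fmt="jpg"))
# https://iiif.example.org/tablet123/full/!512,512/0/default.jpg
```

Size values such as !512,512 (fit within a bounding box) and formats such as jpg follow the IIIF Image API; which of them a given server supports is advertised in its info.json response.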

Figure 1

Line 4 of the front surface of cuneiform tablet HS 1004.

Figure 2

Word 2 of line 4 for the front surface of cuneiform tablet HS 1004.

Figure 3

Sign 3 of line 4 of the front surface of cuneiform tablet HS 1004.

Figure 4

Single wedge of cuneiform tablet HS 903.

Each image includes the aforementioned annotation information in its filename and metadata. For MaiCuBeDa and MaiCuBeDa HT, we provide polygon cropping annotations and bounding box annotations (cf. Figure 5) of the individual signs and wedge samples. In a bounding box annotation, the bounding box of the polygonal annotation is used for cropping.

Figure 5

Bounding box cropping of signs (left) vs. polygon croppings of signs (right).

For polygon-cropping annotations, any areas outside the polygon in the bounding box are set to transparent.
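The transparency behaviour described here can be reproduced, for example, with Pillow; the following sketch is an illustrative re-implementation, not the authors' cropping script, and uses a synthetic image and polygon:

```python
# Illustrative polygon cropping with a transparent exterior (Pillow);
# the image and polygon are synthetic stand-ins, not dataset contents.
from PIL import Image, ImageDraw

def crop_polygon(img, polygon):
    """Crop to the polygon's bounding box; pixels outside the polygon become transparent."""
    mask = Image.new("L", img.size, 0)            # 0 = fully transparent
    ImageDraw.Draw(mask).polygon(polygon, fill=255)
    rgba = img.convert("RGBA")
    rgba.putalpha(mask)                           # alpha channel from the mask
    xs, ys = zip(*polygon)
    return rgba.crop((min(xs), min(ys), max(xs), max(ys)))

tablet = Image.new("RGB", (100, 80), "gray")      # stand-in for a rendering
sign = crop_polygon(tablet, [(10, 10), (60, 12), (58, 40), (12, 38)])
print(sign.size, sign.mode)  # (50, 30) RGBA
```

A bounding box annotation corresponds to skipping the mask step and cropping directly to (min(xs), min(ys), max(xs), max(ys)).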

In addition, metadata about every annotation is exported, which includes the time period of the annotation, the genre of the text, and an IIIF [14] link that resolves the annotation on the University of Heidelberg’s IIIF server.

Finally, the collection of all annotations and their connection to the transliterations is saved as a knowledge graph dump in RDF [15]. In this way, the set of annotations can be queried in RDF and uploaded to a triple store database.

To facilitate the correction of annotations and potential future iterations of the datasets, we publish the Cuneiform Annotator Instances on GitLab as Cuneur for MaiCuBeDa3 and Cuneur for MaiCuBeDa HT.4 Assyriologists and other interested scholars can explore the annotations in the context of their rendering and may suggest changes to the annotations as issues on GitLab. In this way, we encourage discussion on the annotations and provide a way to optimize the dataset in future iterations.

Sampling strategy

For the MaiCuBeDa dataset, we selected a set of cuneiform tablets from the Hilprecht Collection, which were available as 3D models and transliterations in the Cuneiform Digital Library Initiative (CDLI)5 repository at the time of dataset creation. A subset of these texts has been annotated to target a completed annotation set size of approximately 30,000 annotations and to cover a relatively broad range of individual signs and time periods.

The corpus of cuneiform tablets annotated for MaiCuBeDa HT, drawn from the 523 scans of 470 tablets of the Haft Tappeh Collection, was selected based on the digital availability of transliterations created in the research project “Die digitale Edition der Keilschrifttexte aus Haft Tappeh.”6 In contrast to MaiCuBeDa, these annotations represent, to our knowledge, the first and only annotation corpus of cuneiform signs from the Middle Elamite period.

Quality Control

Quality control steps were implemented in the Cuneur Cuneiform Annotator software,7 part of the Cuneur family of tools,8 which allows annotated signs to be highlighted in the transliteration and, vice versa, on the image. In this way, annotations could be checked for correct positioning and for consistency between the assigned transliteration and the annotated reading. In addition, Cuneur highlights inconsistencies in these assignments, allowing the annotating scholars to spot annotation mistakes at a glance.

Once annotations have been finalized, their correctness can be checked in Cuneur's Result View dialog, which shows all annotations per reading or per Unicode codepoint. In this way, mislabeled signs can be identified and corrected.

While these validation steps can ensure consistency and completeness to the extent of the annotator's abilities, they cannot catch mistakes in the transliteration itself, nor indexing or matching mistakes made by the annotator. To minimize correct but inconsistent sign-value usage (e.g., ESZ5 instead of 3(DISZ)) and misspellings with respect to CDLI ATF9 conformity (e.g., ha instead of h,a for ḫa), the Result View dialog was used to produce a list of all transliterations in the corpus. This list was then inspected, and inconsistencies were corrected manually. Possible further mistakes, if found, will be corrected in future revisions of the dataset.

Constraints

The following constraints could be noted when working with the generated data:

  • – Color information is not present on cuneiform 3D renderings. To integrate color information, equivalent photos would need to be annotated and added to the dataset.

  • – Due to the nature of structured light scanning technologies, certain surface properties, such as reflectance, are not acquired during the scanning process and may therefore be missing in annotations.

  • – Annotation constraints: As annotations are created manually with human effort, mistakes may occur in the precision of the annotations (annotations may not completely encompass the intended area) and in the annotation content (e.g., incorrect line or character indices may be annotated).

    Experts have attempted to mitigate annotation errors with manual checks, as outlined in the previous section.

(3) Dataset description MaiCuBeDa

Object name

Mainz Cuneiform Benchmark Dataset (MaiCuBeDa)

Data type

  • – Annotation images of 3D renderings of a cuneiform tablet

  • – Annotations as JSON-LD

  • – Metadata in CSV

  • – A knowledge graph representation of the dataset in RDF

Format names and versions

PNG [16]

JSON-LD [17]

CSV [18]

Terse Triple Language (TTL) [19]

Creation dates

14/01/2021 – 31/08/2023

Dataset Creators

Timo Homburg, Hochschule Mainz: Creation of annotations and publication

Hubert Mara, FU Berlin: Data management and publication

Language

Sumerian, Akkadian

License

CC BY-SA 4.0

Repository location

10.11588/DATA/QSNIQ2

Publication date

30/08/2023

(4) Dataset description MaiCuBeDa HT

Object name

Mainz Cuneiform Benchmark Dataset for the Haft Tappeh Collection (MaiCuBeDa HT)

Data type

  • – Annotation images of 3D renderings of cuneiform tablets

  • – Annotations as JSON-LD

  • – Metadata in CSV

  • – A knowledge graph representation of the dataset in RDF

Format names and versions

PNG [16]

JSON-LD [17]

CSV [18]

Terse Triple Language (TTL) [19]

Creation dates

01/09/2022 – 30/06/2025

Dataset Creators

Lukas Ahlborn, JGU Mainz: Creation of annotations and publication

Timo Homburg, Hochschule Mainz: Generation of annotation images and dataset preparation

Language

Akkadian

License

CC BY-SA 4.0

Repository location

10.11588/DATA/8TCR5C

Publication date

23/07/2025

(5) Reuse potential

The reuse potential of the datasets primarily lies in machine learning applications for cuneiform sign recognition.

Classifications have already been applied to recognize cuneiform signs, as demonstrated by published approaches using MaiCuBeDa [20, 21], since the appropriate metadata, in the form of a sign name and Unicode codepoint, is included in each annotation. Such classification methods are of broader relevance, as evidenced by [22, 23, 24, 25], which address various image types as well as stroke and cuneiform sign annotation.
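As a minimal sketch of such a classification task, the following trains a classifier on synthetic stand-ins for cropped sign images; the data, the placeholder readings, and the model choice are assumptions, not the cited approaches:

```python
# Minimal sign-classification sketch: synthetic arrays stand in for the
# cropped PNG images, and the labels are invented placeholder readings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Pretend: 40 grayscale 32x32 crops of two sign classes with different brightness
images = np.concatenate([rng.normal(0.3, 0.1, (20, 32, 32)),
                         rng.normal(0.7, 0.1, (20, 32, 32))])
labels = np.array(["an"] * 20 + ["ba"] * 20)   # placeholder sign readings

X = images.reshape(len(images), -1)            # flatten pixels into feature vectors
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.score(X, labels))                    # training accuracy
```

In practice, the sign reading or Unicode codepoint from each image's metadata would serve as the label, and a convolutional model would replace the linear classifier.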

Similar classifications can be performed at the word level, since word boundaries are clearly marked in the dataset.

Finally, classifications based on further metadata, such as the sign’s reading, its time period [26], or the genre of the text in which the sign/word/line is included, can be performed.

For the MaiCuBeDa dataset, it is also possible to conduct cuneiform language and dialect identification classification tasks [27, 28], since the language used to write the cuneiform tablet (Sumerian/Akkadian) is included in the metadata.

While a similar classification might also be conducted on the MaiCuBeDa HT dataset, it contains only one language, Akkadian, so it is useful mainly as training data for the classification of other, similar multi-language datasets.

Since both datasets also include a knowledge graph representation, subcorpora can be easily queried using SPARQL queries on the knowledge graph data dump itself or by filtering the metadata provided in the CSV files included in the dataset.
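Filtering the CSV metadata could look like the following sketch; the column names and values are assumptions, so check the headers of the shipped CSV files:

```python
# Hypothetical metadata filtering with pandas; the columns "period" and
# "tablet" are assumed here, not taken from the actual CSV schema.
import io
import pandas as pd

csv_text = """tablet,period,language
HS 1004,Ur III,Sumerian
HS 903,Old Assyrian,Akkadian
"""
meta = pd.read_csv(io.StringIO(csv_text))   # in practice: pd.read_csv("metadata.csv")

ur3 = meta[meta["period"] == "Ur III"]      # subcorpus of Ur III tablets
print(ur3["tablet"].tolist())               # ['HS 1004']
```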

Questions about the dataset, such as how many cuneiform tablets from different time periods or proveniences are present, can also be answered by formulating queries or by training models on the available metadata. Last but not least, the knowledge graph layer provides a basis for interoperability, so that future datasets can be modeled in the same way, i.e., knowledge graph contents, metadata, and annotation contents may be seamlessly combined.

To that end, the publications also contain Python scripts that have been used for metadata generation.

In addition, it should be noted that the two datasets serve different classification needs.

While MaiCuBeDa presents the user with a dataset spanning many time periods and languages, it will be well-suited as a basis for creating machine learning classifiers that can distinguish between those attributes.

MaiCuBeDa HT is best used as a training dataset for the specific time period and area, i.e., Middle Elamite Akkadian, since it includes only annotations of this kind.

However, due to the variability of cuneiform signs across time and space, data samples from different time periods and locations are invaluable for enabling a more generalizable classification approach that can handle more cuneiform sign variations.

Hence, MaiCuBeDa HT provides a very detailed, manually curated dataset for work in this direction.

Finally, both datasets provide future Assyriologists with the opportunity to validate or refine paleographic research, as they can underpin paleographic observations with data points on the occurrences of cuneiform sign variants. By making the underlying data openly available, the datasets render future paleographic analyses reproducible [29], something that analogue datasets currently used in the field, e.g., [30], cannot achieve. They thus represent an essential supplement to the forthcoming printed publications on this subject and an opportunity to use these corpora for a more precise grouping of, possibly digitally modelled [31, 32, 33], paleographic sign variants, which could further be matched in classification tasks.

The easily accessible web interfaces of both datasets also allow annotations to be viewed in the context of their cuneiform tablets and to spot, report, and correct annotation mistakes in future iterations of the datasets.

Notes

[6] Funded by German Research Foundation, project number 424957759.

Acknowledgements

Doris Prechel: Supervision of the annotation process of MaiCuBeDa HT.

Tim Brandes: Supervision of the annotation process of MaiCuBeDa HT.

Competing Interests

The authors have no competing interests to declare.

DOI: https://doi.org/10.5334/joad.172 | Journal eISSN: 2049-1565
Language: English
Submitted on: Aug 6, 2025
Accepted on: Dec 9, 2025
Published on: Dec 18, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Timo Homburg, Lukas Ahlborn, Kai-Christian Bruhn, Hubert Mara, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.