The Basel Land Records Ground Truth: An Annotated Dataset for Information Extraction on German-Language Administrative Records

Open Access | Dec 2025

1 Overview

Repository location

https://doi.org/10.5281/zenodo.16919653

Context

This dataset was created as part of the research project Economies of Space (Hodel et al., 2024). The project uses the Historical Land Records of Basel to evaluate strategies for accessing large-scale historical corpora, handling premodern German, and conducting analyses from a historian’s perspective. The Historical Land Records of Basel are a reference tool based on index cards containing source-based excerpts from the Basel-Stadt State Archive. The collection has long served as a valuable resource for research in history, economics, and sociology (Füglister, 1981; Staehelin, 1990).

The corpus aggregates information drawn from court books and rent rolls dating from the 13th century to the 1800s. It features about 3,000 Latin-language entries, but most entries are written in German. Typically, each index card records a transaction, a property seizure for unpaid rents, a court case, or a list of rents paid or unpaid. In total, the corpus comprises approximately 220,000 index cards, which are organized by property into dossiers that include coordinate information (Staatsarchiv Basel-Stadt, 2025). Because the corpus consists of excerpts collected between 1895 and 1977, rather than original documents, the degree of correspondence with the sources varies. Some entries reproduce the original wording almost verbatim, while others provide condensed summaries. This variation affects the language as well: in some cases, more modern variants of words are used, but more often, the spelling and vocabulary are kept in line with the original.

For example, Figure 1 shows an index card documenting a property purchase from the year 1510. The text records the seller, Hans Ludwig von Schandorff, and the buyers, Hans Hermanzwiler, a vineyard worker (Rebknecht), and his wife, Margreth, along with details of the property and its purchase price. The location of the property is described in detail, and the entry notes that no interest payments were owed on it (ist zehend frei). This format is representative of many entries in the Historical Land Records. The high density of information contained within the corpus renders it particularly well-suited to both computational and historical analysis.

Figure 1

An index card describing the sale of a property in the year 1510 (State Archive of Basel-Stadt HGB 11/34, p. 4). The main text, date of the original entry, and the source are written in ink, while the marginalia and footnotes contain cross-references. The dataset focuses on the main text but retains date and source information as metadata.

This dataset was created to serve as training and evaluation data for machine learning algorithms, with the goal of facilitating automated information extraction. It has been used internally to train a number of models that automatically annotate entities and events (Prada Ziegler, 2024a, 2024b) [1]. The full corpus of the Historical Land Records has since been automatically annotated. The project has used this resource to produce a number of conference contributions focusing on the historical analysis and use of automatically generated deep annotations (Hitz & Hodel, 2025; Hitz et al., 2024) and visualizations (Hitz & Aeby, 2025).

2 Method

Steps

As a preparatory step toward processing the complete corpus of the Historical Land Records, we developed a handwritten text recognition (HTR) model using PyLaia (Puigcerver, 2017) within the Transkribus framework. The model achieved a character error rate (CER) as low as 3.5% on the relevant text regions.

We randomly selected 1,000 transcribed documents dated between 1400 and 1700 (the research project focuses on this timeframe) [2], excluding any Latin-language documents. Including Latin-language documents was beyond the scope of the project because of limited resources. Within the research timeframe of 1400–1700, 2,494 documents are written in Latin, compared to 71,794 in German. More than half of the Latin documents date to before 1450. No manual post-correction was applied to the transcriptions, so that models trained on this dataset would learn to accommodate HTR noise. We discarded a number of documents that were unreadable due to poor HTR or layout analysis [3], as well as Latin documents not caught by the previous filter. The final dataset contains 829 documents.
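The proportions implied by these counts can be worked out directly. The following sketch uses only the figures reported above; the variable names are illustrative and not part of the dataset.

```python
# Counts reported in the text for the 1400-1700 research timeframe.
latin_docs = 2_494
german_docs = 71_794

# Sampling and filtering figures reported in the text.
sampled_docs = 1_000
final_docs = 829

total_docs = latin_docs + german_docs
latin_share = latin_docs / total_docs
discarded_docs = sampled_docs - final_docs
retention_rate = final_docs / sampled_docs

print(f"Documents in timeframe: {total_docs:,}")
print(f"Latin share of timeframe: {latin_share:.2%}")
print(f"Discarded after sampling: {discarded_docs}")
print(f"Retention rate of the sample: {retention_rate:.1%}")
```

Latin documents thus make up roughly 3.4% of the documents in the research timeframe, and 171 of the 1,000 sampled documents were discarded before annotation.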

To train models capable of extracting the dense information contained in these documents, we developed an annotation system designed to identify events, actors, and objects in the text. We recognized that classic named entity recognition would not suffice for our needs. Going back to the example from section 1, much information is represented in a nested manner, such as the occupation of Hans Hermanzwiler or the location of the property. Inspired by the ACE guidelines, we wrote custom annotation guidelines that focused on nested entity annotation. In addition, the dataset features event and relationship annotations which highlight the interactions between entities, such as property sales or obligations to pay dues. Figure 2 demonstrates the annotation concept on a shortened version of the above-mentioned index card. The nested annotation not only appends information directly to entity references, but also establishes relationships between nested entities, which can easily be extracted (e.g., the Rebgarten is located by the Spitalschüren). The event annotation, anchored by the event trigger and connected by the role tag, highlights how the entities interact with each other. For further information about the annotation, please refer to the annotation documentation in the associated repository. During the project, we shared and reviewed our annotation guidelines with other researchers working with historical text and published them online [4].
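To make the nesting concrete, the sketch below models the shortened example in plain Python. The class names `Reference` and `Event`, their fields, and the role labels `seller`, `buyer` and `property` are hypothetical and chosen for illustration only; they are not the BeNASch XML schema, for which the annotation documentation remains authoritative. The assignment of the entity classes `per` and `fac` to the individual mentions is likewise an assumption, and the trigger wording of the card is deliberately left empty.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class Reference:
    """An entity reference that may contain further references (hypothetical model)."""
    text: str
    entity_class: str  # assumed class label, e.g. "per" or "fac"
    attributes: List[str] = field(default_factory=list)  # e.g. an occupation
    nested: List["Reference"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Reference"]]:
        """Yield (depth, reference) for this reference and everything nested in it."""
        yield depth, self
        for child in self.nested:
            yield from child.walk(depth + 1)


@dataclass
class Event:
    """An event linking references through named roles (hypothetical model)."""
    event_class: str
    roles: Dict[str, List[Reference]]
    trigger: Optional[str] = None  # trigger wording not reproduced here


seller = Reference("Hans Ludwig von Schandorff", "per")
buyer = Reference("Hans Hermanzwiler", "per", attributes=["Rebknecht"])
wife = Reference("Margreth", "per")
# The location is nested inside the property it describes.
garden = Reference("Rebgarten", "fac", nested=[Reference("Spitalschüren", "fac")])

purchase = Event(
    "property-purchase",
    roles={"seller": [seller], "buyer": [buyer, wife], "property": [garden]},
)

for depth, ref in garden.walk():
    print("  " * depth + f"{ref.entity_class}: {ref.text}")
```

Walking the nested structure recovers the relationship between the property and its location without a separate relation record, which is the point of annotating entities in a nested manner.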

Figure 2

Inline annotation in shortened form (only element and class tags) of a property purchase. Reference elements mark entities while appositions and attributes add additional information. The trigger and role information describe the interaction between entities and values.

Annotation was done with INCEpTION (Klie et al., 2018). Given the project’s structure, our goal was to generate enough training data to quickly train a model to annotate the full corpus. Considering that the dataset was only a by-product of the project, the annotation procedure did not follow common standards. Only one person annotated each document, and no multi-editor validation was conducted; as a result, this dataset may contain more noise than would typically be expected [5]. During post-processing, we automatically validated the dataset as far as possible to ensure compliance with the annotation guidelines. For example, we made sure that every reference element contains a head element.
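A check of this kind can be sketched with Python's standard library. This is not the project's validation script: the element names `Reference` and `Head` and the sample markup are hypothetical placeholders rather than the actual tags of the published files, and the sketch assumes that the head must be a direct child of its reference.

```python
import xml.etree.ElementTree as ET

REFERENCE_TAG = "Reference"  # hypothetical tag name
HEAD_TAG = "Head"            # hypothetical tag name


def references_without_head(root: ET.Element) -> list:
    """Return every reference element that has no direct head child."""
    return [
        ref for ref in root.iter(REFERENCE_TAG)
        if ref.find(HEAD_TAG) is None
    ]


# Invented sample markup: the first reference and its nested reference both
# have a head, the last reference does not.
sample = """
<Document>
  <Reference class="per"><Head>Hans</Head>
    <Reference class="fac"><Head>Rebgarten</Head></Reference>
  </Reference>
  <Reference class="per">Margreth</Reference>
</Document>
""".strip()

missing = references_without_head(ET.fromstring(sample))
for ref in missing:
    print(f"Reference without head: {ref.text!r}")
```

Because `iter` visits nested elements as well, the same function covers references at any depth.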

3 Dataset Description

Repository name

Zenodo

Object name

The Basel Land Records Ground Truth: An Annotated Dataset for Information Extraction on German-language Administrative Records.

Format names and versions

The benasch folder holds XML files containing the full annotation. We also provide two TEI-formatted file collections, tei_simple and tei_nested, which feature a reduced tagset more commonly seen in humanities projects. The annotation documentation, provided as a PDF, describes the file formats in more detail.
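Since all three collections are XML-based, a first look at their tagsets needs nothing beyond the Python standard library. The sketch below counts element names across a folder without presupposing any tag names. The local path `dataset/benasch` and the `.xml` file extension are assumptions about how the archive was unpacked.

```python
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET


def tag_counts(folder: Path) -> Counter:
    """Count element names in all XML files below `folder`, ignoring namespaces."""
    counts: Counter = Counter()
    for path in sorted(folder.rglob("*.xml")):
        for element in ET.parse(path).getroot().iter():
            # ElementTree writes namespaced tags as "{uri}name"; keep only the name.
            counts[element.tag.rsplit("}", 1)[-1]] += 1
    return counts


if __name__ == "__main__":
    folder = Path("dataset/benasch")  # assumed location of the unpacked folder
    if folder.is_dir():
        for tag, count in tag_counts(folder).most_common(10):
            print(f"{tag}: {count}")
```

Running the same function on tei_simple and tei_nested gives a quick impression of how far the tagset is reduced in the TEI versions.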

Creation dates

2023-01-01 – 2025-08-01.

Dataset creators

  • Ismail Prada Ziegler: Management, Curation, Edition, Software, Annotation Guideline Design and Annotation.

  • Benjamin Hitz: Curation, Annotation Guideline Design, Historical Expertise and Annotation.

  • Katrin Fuchs: Annotation and Transcription.

  • Aline Vonwiller: Historical Expertise and Transcription.

  • Jonas Aeby: Software and Automated Transcription.

All creators were previously or are currently associated with the University of Basel.

Language

English

License

CC BY 4.0

Publication date

2025-08-21

4 Reuse Potential

The dataset comprises 829 entries, including 54,926 tokens and 30,020 annotations. See Table 1 for the reference classes as well as total counts, and the number of named entity references. 5,453 out of 7,982 (68.32%) references are nested within other entity references. A total of 1,203 events and 3,405 relations have been annotated, with the most common relations and events listed in Table 2. To our knowledge, no other premodern German-language dataset of this size provides annotations of comparable depth.
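The figures in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts reported in the text and in Table 1; the variable names are illustrative.

```python
# Figures reported in the text.
documents = 829
tokens = 54_926
annotations = 30_020
nested_references = 5_453

# Table 1: class -> (all references, named entity references).
table_1 = {
    "per": (3_746, 2_582),
    "fac": (2_699, 1_114),
    "org": (708, 386),
    "loc": (333, 299),
    "gpe-loc": (324, 308),
    "gpe-org": (136, 132),
    "unk": (22, 4),
    "gpe-gpe": (14, 11),
}

total_references = sum(count for count, _ in table_1.values())
total_named = sum(named for _, named in table_1.values())

# Both sums should match the "total" row of Table 1.
assert total_references == 7_982
assert total_named == 4_836

print(f"Nested share: {nested_references / total_references:.2%}")
print(f"Named entity share: {total_named / total_references:.2%}")
print(f"Tokens per document: {tokens / documents:.1f}")
print(f"Annotations per document: {annotations / documents:.1f}")
```

The per-class counts add up to the reported totals, and the nested share reproduces the 68.32% stated above.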

Table 1

Counts of entity reference elements by class.

| Class | Count | NE count | Description |
|---|---|---|---|
| per | 3746 | 2582 | Person |
| fac | 2699 | 1114 | Facility (e.g. houses, fountains) |
| org | 708 | 386 | Organization (e.g. monasteries) |
| loc | 333 | 299 | Non-Facility Location (e.g. rivers, hills) |
| gpe-loc | 324 | 308 | Geopolitical entity as place |
| gpe-org | 136 | 132 | Geopolitical entity as organization |
| unk | 22 | 4 | Unknown Entity Class |
| gpe-gpe | 14 | 11 | Geopolitical entity without context |
| total | 7982 | 4836 | |

Table 2

Counts of relations and events by class. Only classes that occur more than 50 times are listed. A full list with more detailed descriptions can be found in the annotation documentation in the data repository.

| Class | Type | Count | Description |
|---|---|---|---|
| topological | relation | 1061 | Spatial connection between entities. |
| ownership | relation | 711 | Possession or legal control of a property. |
| due-obligation | relation | 559 | Obligation to pay rent between two parties. |
| family | relation | 466 | Familial relationship. |
| due-payment | event | 461 | Payment in relation to an obligation to pay rent. |
| property-purchase | event | 225 | Acquisition of a property in exchange for money. |
| part_whole | relation | 199 | One entity forming part of another. |
| employment | relation | 145 | Link between employer and employee. |
| civic-affiliation | relation | 110 | Citizenship or similar relations to a GPE. |
| seizure | event | 108 | Confiscation of property due to unpaid rents. |
| inheritance | event | 94 | Transfer of property through succession. |
| rent-purchase | event | 74 | Loan-taking with a property as security. |
| representation | relation | 69 | One entity acting on behalf of another. |
| membership | relation | 59 | Belonging to a group or organization. |
| litigation | event | 52 | Legal dispute or court case. |

Despite its limitations, namely that it is somewhat noisy and focused on a single city, this dataset represents an important resource for historical NLP research. It can be used to evaluate systems for nested sequence tagging, named entity recognition in low-resource scenarios, or systems that aim to tackle the challenges of inconsistent grammar and high levels of noise. During our work on this corpus, we observed that other cities have very similar datasets in terms of domain, language, and document structure, for example the land records of the city of Vienna (Grünwald et al., 2021; Helmchen, 2025). Models trained on our dataset can serve as a foundation for these other corpora, allowing fine-tuning to create specialized models with less additional annotation. The dataset can also be used alongside other premodern German-language documents to create more generalized models. Finally, researchers using the BeNASch annotation guidelines may use this dataset as a repository of examples and as an aid in defining their own project-specific guidelines. Research groups that work with corpora from the same domain of historical land records can also consider using the same guidelines and the same classes for events and relations.

Notes

[1] In preparation for this publication, the dataset has since been revised.

[2] Documents were selected in batches of 200; early batches were not limited by the timeframe; the dataset thus includes a few entries from before 1400 and after 1700.

[3] While CER overall was very low, a few documents written with a particular pen were practically unusable.

[4] https://dhbern.github.io/BeNASch/ (Accessed: 2025-08-28). Note that the guidelines are only available in German at the time of writing, but a translated version is planned.

[5] Most documents were nevertheless seen in multiple iterations, e.g. after a change to the guidelines or when event annotations were added to documents that previously had only entity annotations.

Acknowledgements

We would like to thank everyone who contributed to the BeNASch working group for their advice and for the time they took to help with the guidelines – especially Dominic Weber and Christa Schneider. Furthermore, we would like to thank Jonas Aeby, Aline Vonwiller and Tobias Hodel for their help with the preparation of the dataset.

Competing interests

The authors have no competing interests to declare.

Author Contributions

Ismail Prada Ziegler: Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Writing – original draft, Writing – review & editing.

Benjamin Hitz: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – review & editing.

Katrin Fuchs: Data curation.

DOI: https://doi.org/10.5334/johd.387 | Journal eISSN: 2059-481X
Language: English
Submitted on: Sep 1, 2025 | Accepted on: Nov 3, 2025 | Published on: Dec 29, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Ismail Prada Ziegler, Benjamin Hitz, Katrin Fuchs, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.