
From Experiments to Epistemic Practice: The RISE Humanities Data Benchmark

Open Access | Mar 2026

Figures & Tables

Table 1

Overview of benchmark datasets and tasks. A companion data paper provides a more detailed table of the available datasets (Hindermann, Marti, Kasper, & Bosse, 2026).

Bibliographic Data: Metadata extraction from bibliographies
Data: 5 images, 67 elements, JPG, approx. 1743 × 2888 px, 350 KB each.
Content: Pages from Bibliography of Works in the Philosophy of History, 1945–1957.
Design: The dataset was created as a proof of concept in the context of a master’s thesis and targets the extraction of structured bibliographic information from semi-structured, directory-like printed text. The benchmark focuses on structural inference rather than linguistic prompting and omits an explicit prompt in favor of a predefined Pydantic output schema.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)
Blacklist Cards: NER extraction from index cards
Data: 33 images, JPG, approx. 1788 × 1305 px, 590 KB each.
Content: Swiss federal index cards for companies on a British trade blacklist (1940s).
Design: The dataset was created in the context of an ongoing dissertation and reflects a historically common archival document type. The cards were manually selected and curated to capture the combination of typed and handwritten entries typical of mid-20th-century administrative records, which are not readily available for bulk download.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)
Book Advert XMLs: Data correction of LLM-generated XML files
Data: 50 JSON files containing XML structures.
Content: Faulty LLM-generated XML structures derived from historical sources.
Design: This dataset was created to introduce text-only benchmarks into the framework as part of an ongoing research project. It targets a reasoning task situated in a workflow where large language models must detect and correct structural errors in previously generated XML, reflecting common data transformation and processing challenges in digital humanities pipelines.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)
Business Letters: NER extraction from correspondence
Data: 57 letters, 98 images, JPG, approx. 2479 × 3508 px, 600 KB each.
Content: Collection of letters from the Basler Rheinschifffahrt-Aktiengesellschaft.
Design: This dataset was designed as a multi-stage benchmark reflecting a common humanities scenario centered on historical correspondence. The benchmark evaluates sequential tasks including metadata extraction, named-entity recognition, and signature-based person identification, addressing both linguistic variation and document-level decision making.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)
Company Lists: Company data extraction from list-like materials
Data: 15 images, JPG, approx. 1868 × 2931 px, 360 KB each.
Content: Pages from Trade index: classified handbook of the members of the British Chamber of Commerce for Switzerland.
Design: This dataset was developed in the context of an ongoing dissertation project focused on historical company networks. It evaluates metadata extraction from multilingual company lists with strongly varying layouts, testing a model’s ability to infer a consistent schema despite substantial variation in formatting and presentation.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)
Fraktur Adverts: Page segmentation and Fraktur text transcription
Data: 5 images, JPG, approx. 5000 × 8267 px, 10 MB each.
Content: Pages from the Basler Avisblatt, an early advertisement newspaper published in Basel, Switzerland.
Design: This dataset was developed as part of ongoing research at the University of Basel and targets historical advertisements printed in Fraktur typeface. It evaluates advert segmentation and text recognition under conditions of dense layout and strong typographic variability, while demonstrating that aggressive image down-scaling has a limited effect on model performance.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)
Library Cards: Metadata extraction from multilingual sources
Data: 263 images, JPG, approx. 976 × 579 px, 10 KB each.
Content: Library cards with dissertation thesis information.
Design: This dataset was created as a feasibility study to assess whether unique dissertation records can be identified and structured within a large-scale historical card catalog comprising approximately 700,000 entries. A random sample was selected to capture multilingual content, inconsistent layouts, and a mixture of handwritten, typed, and printed text, reflecting realistic challenges in large archival metadata extraction tasks.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)
Medieval Manuscripts: Handwritten text recognition
Data: 12 images, JPG, approx. 1872 × 2808 px, 1.1 MB each.
Content: Pages from Pilgerreisen nach Jerusalem 1440 und 1453.
Design: This dataset was prepared in the context of a university seminar and focuses on late medieval manuscript material. It targets segmentation and text recognition of handwritten sources that are difficult to read due to script variability and the presence of marginalia, reflecting common challenges in medieval manuscript analysis.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)
Figure 1

Images in images/ are sent to an LLM with a prompt from prompts/ and compared against a ground truth file in ground_truths/. Each result follows the structure defined by dataclass.py and is scored using the implementation in benchmark.py.
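The flow described in this caption can be sketched as a minimal, hypothetical Python loop. All names below (CardResult, run_model, score) are invented for illustration and stand in for the repository's dataclass.py and benchmark.py; the LLM call is stubbed out.

```python
from dataclasses import dataclass, fields

# Hypothetical sketch of the flow in Figure 1: an image plus a prompt
# goes to a model, and the structured result is scored against a
# ground-truth record of the same shape. Names are illustrative only.

@dataclass
class CardResult:
    author: str
    title: str

def run_model(image_path: str) -> CardResult:
    # Placeholder for the real LLM call, which would receive the image
    # from images/ and a prompt from prompts/.
    return CardResult(author="Doe, Jane", title="A Survey")

def score(prediction: CardResult, truth: CardResult) -> float:
    # Fraction of schema fields that match the ground truth exactly.
    names = [f.name for f in fields(CardResult)]
    hits = sum(getattr(prediction, n) == getattr(truth, n) for n in names)
    return hits / len(names)

truth = CardResult(author="Doe, Jane", title="A Study")
print(score(run_model("images/00624122.jpg"), truth))  # 0.5
```

Real benchmarks in the framework use richer schemas and field-specific scoring, but the overall shape of the evaluation loop is the same.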

Listing 1

The benchmark class (without actual implementation) from the Library Cards benchmark v0.4.1.

Listing 2

The Pydantic data class (without docstrings) from the Library Cards benchmark v0.4.1.
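The listing itself is reproduced as an image; purely for illustration, a Pydantic model of roughly this shape might look as follows. The field names here are invented and are not those of the actual Listing 2.

```python
from typing import Optional
from pydantic import BaseModel

class LibraryCard(BaseModel):
    # Field names are hypothetical; the actual schema is in Listing 2.
    # Optional fields with None defaults accommodate cards where a
    # value is absent or illegible.
    author: Optional[str] = None
    title: Optional[str] = None
    year: Optional[int] = None
    language: Optional[str] = None

card = LibraryCard(author="Doe, Jane", title="A Study", year=1923)
print(card.year)  # 1923
```

Passing such a schema to the model (instead of a free-form prompt) constrains the output to validated, typed fields, which is what makes field-by-field scoring against the ground truth straightforward.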

Listing 3

The ground truth for card 00624122 (see Figure 2) from the Library Cards benchmark v0.4.1.

Figure 2

The card 00624122 from the Library Cards benchmark v0.4.1.

Figure 3

Two letterheads and a valediction from the Business Letters benchmark v0.4.1. The strings “Max Oettinger” (Figure 3a), “Artur Oettinger-Meili” (Figure 3b), and “A. Oettinger” (Figure 3c) are name variants referring to the same individual, namely Artur Oettinger-Meili. He served as managing director (Geschäftsführer) of the Basler Personenschifffahrtsgesellschaft from 1925 to 1938; the variant “Max” results from an error on the part of the senders.

Figure 4

Screenshot from the front-end.7 The graph shows cumulative test results of the Business Letters benchmark v0.4.1 over time. Connecting lines show test runs of the same models on different dates, with colors indicating providers. The graph is interactive; only one part is shown here.

Figure 5

Comparison of average model scores of Google and OpenAI across all benchmarks. The results show substantial task-specific variance: Google achieves higher scores on Fraktur Adverts (77.1 versus OpenAI 20.6), Medieval Manuscripts (63.3 versus OpenAI 51.4), and Library Cards (82.5 versus OpenAI 77.2), while OpenAI performs better on Bibliographic Data (57.7 versus Google 36.1) and Book Advert XMLs (85.8 versus Google 80.5). Differences are less pronounced (<5 points) for Blacklist Cards (Google 89.9, OpenAI 85.5), Business Letters (OpenAI 53.2, Google 49.7), and Company Lists (Google 42.6, OpenAI 39.6). This comparison underscores the need for systematic evaluation of humanities-relevant LLM tasks.

DOI: https://doi.org/10.5334/johd.470 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 15, 2025 | Accepted on: Jan 21, 2026 | Published on: Mar 2, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Maximilian Hindermann, Lea Katharina Kasper, Sorin Marti, Arno Bosse, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.