Table 1
Overview of benchmark datasets and tasks. A companion data paper provides a more detailed table of the available datasets (Hindermann, Marti, Kasper, & Bosse, 2026).
| Bibliographic Data: Metadata extraction from bibliographies | |
| Data | 5 images, 67 elements, JPG, approx. 1743 × 2888 px, 350 KB each. |
| Content | Pages from Bibliography of Works in the Philosophy of History, 1945–1957. |
| Design | The dataset was created as a proof of concept in the context of a master’s thesis and targets the extraction of structured bibliographic information from semi-structured, directory-like printed text. The benchmark focuses on structural inference rather than linguistic prompting and omits an explicit prompt in favor of a predefined Pydantic output schema. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
| Blacklist Cards: NER extraction from index cards | |
| Data | 33 images, JPG, approx. 1788 × 1305 px, 590 KB each. |
| Content | Swiss federal index cards for companies on a British black list for trade (1940s). |
| Design | The dataset was created in the context of an ongoing dissertation and reflects a historically common archival document type. The cards were manually selected and curated to capture the combination of typed and handwritten entries typical for mid-20th-century administrative records, which are not readily available for bulk download. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
| Book Advert XMLs: Data correction of LLM-generated XML files | |
| Data | 50 JSON files containing XML structures. |
| Content | Faulty LLM-generated XML structures from historical sources. |
| Design | This dataset was created to introduce text-only benchmarks into the framework as part of an ongoing research project. It targets a reasoning task situated in a workflow, where large language models are required to detect and correct structural errors in previously generated XML, reflecting common data transformation and processing challenges in digital humanities pipelines. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
| Business Letters: NER extraction from correspondence | |
| Data | 57 letters, 98 images, JPG, approx. 2479 × 3508 px, 600 KB each. |
| Content | Collection of letters from the Basler Rheinschifffahrt-Aktiengesellschaft. |
| Design | This dataset was designed as a multi-stage benchmark reflecting a common humanities scenario centered on historical correspondence. The benchmark evaluates sequential tasks including metadata extraction, named-entity recognition, and signature-based person identification, addressing both linguistic variation and document-level decision making. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
| Company Lists: Company data extraction from list-like materials | |
| Data | 15 images, JPG, approx. 1868 × 2931 px, 360 KB each. |
| Content | Pages from Trade index: classified handbook of the members of the British Chamber of Commerce for Switzerland. |
| Design | This dataset was developed in the context of an ongoing dissertation project focused on historical company networks. It evaluates metadata extraction from multilingual company lists with strongly varying layouts, testing a model’s ability to infer a consistent schema despite substantial variation in formatting and presentation. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
| Fraktur Adverts: Page segmentation and Fraktur text transcription | |
| Data | 5 images, JPG, approx. 5000 × 8267 px, 10 MB each. |
| Content | Pages from the Basler Avisblatt, an early advertisement newspaper published in Basel, Switzerland. |
| Design | This dataset was developed as part of ongoing research at the University of Basel and targets historical advertisements printed in Fraktur typeface. It evaluates advert segmentation and text recognition under conditions of dense layout and strong typographic variability, while demonstrating that aggressive image down-scaling has a limited effect on model performance. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
| Library Cards: Metadata extraction from multilingual sources | |
| Data | 263 images, JPG, approx. 976 × 579 px, 10 KB each. |
| Content | Library cards with dissertation thesis information. |
| Design | This dataset was created as a feasibility study to assess whether unique dissertation records can be identified and structured within a large-scale historical card catalog comprising approximately 700,000 entries. A random sample was selected to capture multilingual content, inconsistent layouts, and a mixture of handwritten, typed, and printed text, reflecting realistic challenges in large archival metadata extraction tasks. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
| Medieval Manuscripts: Handwritten text recognition | |
| Data | 12 images, JPG, approx. 1872 × 2808 px, 1.1 MB each. |
| Content | Pages from Pilgerreisen nach Jerusalem 1440 und 1453. |
| Design | This dataset was prepared in the context of a university seminar and focuses on late medieval manuscript material. It targets segmentation and text recognition of handwritten sources that are difficult to read due to script variability and the presence of marginalia, reflecting common challenges in medieval manuscript analysis. |
| Further Info | Dataset Description, Test Results, GitHub (last accessed 2026-01-16) |
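Several of the benchmarks above replace or supplement a natural-language prompt with a predefined Pydantic output schema (see the Bibliographic Data design notes). A minimal sketch of what such a schema could look like is given below; the class and field names are illustrative assumptions, not the benchmark's actual schema.

```python
from typing import List, Optional
from pydantic import BaseModel

# Hypothetical output schema for extracting structured bibliographic
# entries from a scanned page; field names are illustrative only.
class BibliographicEntry(BaseModel):
    author: str
    title: str
    year: Optional[int] = None  # year may be missing on the page

class BibliographyPage(BaseModel):
    entries: List[BibliographicEntry]

# Constructing an instance validates the data against the schema.
page = BibliographyPage(entries=[
    BibliographicEntry(author="Doe, J.", title="On History", year=1950)
])
```

Passing a schema of this kind as the model's structured-output format is what allows a benchmark to test structural inference while omitting an explicit prompt.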

Figure 1
Images in images/ are sent to an LLM with a prompt from prompts/ and compared against a ground truth file in ground_truths/. Each result follows the structure defined by dataclass.py and is scored using the implementation in benchmark.py.
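The evaluation flow in Figure 1 can be sketched as a simple loop. The directory names follow the caption; `call_llm` and the toy `score` function are placeholders for illustration, not the framework's actual API in benchmark.py.

```python
import json
from pathlib import Path

def score(result: dict, truth: dict) -> float:
    # Toy field-level accuracy, illustrative only: fraction of keys
    # on which the model output and the ground truth agree.
    keys = set(result) | set(truth)
    return sum(result.get(k) == truth.get(k) for k in keys) / max(len(keys), 1)

def run_benchmark(root: Path, call_llm) -> dict:
    # Sketch of the Figure 1 pipeline: each image in images/ is sent
    # to the LLM together with a prompt from prompts/, and the result
    # is compared against the matching file in ground_truths/.
    scores = {}
    prompt = (root / "prompts" / "prompt.txt").read_text()
    for image in sorted((root / "images").glob("*.jpg")):
        result = call_llm(image, prompt)  # structured output per dataclass.py
        truth = json.loads((root / "ground_truths" / f"{image.stem}.json").read_text())
        scores[image.stem] = score(result, truth)
    return scores
```

In the actual framework the result object follows the structure defined by dataclass.py and the scoring logic lives in benchmark.py; this sketch only mirrors the data flow.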

Listing 1
The benchmark class (without actual implementation) from the Library Cards benchmark v0.4.1.

Listing 2
The Pydantic data class (without docstrings) from the Library Cards benchmark v0.4.1.

Listing 3
The ground truth for card 00624122 (see Figure 2) from the Library Cards benchmark v0.4.1.

Figure 2
The card 00624122 from the Library Cards benchmark v0.4.1.

Figure 3
Two letterheads and a valediction from the Business Letters benchmark v0.4.1. The strings “Max Oettinger” (Figure 3a), “Artur Oettinger-Meili” (Figure 3b), and “A. Oettinger” (Figure 3c) are name variants that all refer to the same individual, Artur Oettinger-Meili. He served as managing director (Geschäftsführer) of the Basler Personenschifffahrtsgesellschaft from 1925 to 1938; the variant “Max” results from an error on the part of the senders.

Figure 4
Screenshot from the front-end.7 The graph shows cumulative test results of the Business Letters benchmark v0.4.1 over time. Connecting lines show test runs of the same models on different dates, with colors indicating providers. The graph is interactive; only one part is shown here.

Figure 5
Comparison of average model scores of Google and OpenAI across all benchmarks. The results show substantial task-specific variance: Google achieves higher scores on Fraktur Adverts (77.1 versus OpenAI 20.6), Medieval Manuscripts (63.3 versus OpenAI 51.4), and Library Cards (82.5 versus OpenAI 77.2), while OpenAI performs better on Bibliographic Data (57.7 versus Google 36.1) and Book Advert XMLs (85.8 versus Google 80.5). Differences are less pronounced (under five points) for Blacklist Cards (Google 89.9, OpenAI 85.5), Business Letters (OpenAI 53.2, Google 49.7), and Company Lists (Google 42.6, OpenAI 39.6). This comparison underscores the need for systematic evaluation of humanities-relevant LLM tasks.
