The RISE Humanities Data Benchmark: A Framework for Evaluating Large Language Models for Humanities Tasks

Maximilian Hindermann; Sorin Marti; Lea Katharina Kasper; Arno Bosse

doi:10.5334/johd.481

Figures & Tables

Overview of the Bibliographic Data benchmark in the visualization frontend.

Best performance per model and benchmark dataset, taking the maximum score over all recorded configurations per model and dataset. Null values (N/A) indicate tests that have not yet been run or can no longer be run because the corresponding model has been deprecated.
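
The aggregation described in this caption can be sketched as follows. This is a minimal illustration, not the framework's own code: the record layout, the model and dataset names, and the helper names `best_scores` and `render` are all hypothetical.

```python
# Hypothetical run records: (model, dataset, configuration id, score in percent).
# None of these values are taken from the benchmark's recorded results.
runs = [
    ("model-a", "bibliographic_data", "cfg-1", 61.2),
    ("model-a", "bibliographic_data", "cfg-2", 64.0),
    ("model-a", "library_cards", "cfg-1", 80.5),
    ("model-b", "bibliographic_data", "cfg-1", 55.1),
]


def best_scores(runs, models, datasets):
    """Return {model: {dataset: best score or None}} over all configurations."""
    best = {}
    for model, dataset, _config, score in runs:
        key = (model, dataset)
        if key not in best or score > best[key]:
            best[key] = score
    # Pairs without any recorded run stay None and are rendered as N/A.
    return {m: {d: best.get((m, d)) for d in datasets} for m in models}


def render(value):
    """Format a score for display, using N/A for missing results."""
    return "N/A" if value is None else f"{value:.2f}%"


table = best_scores(
    runs,
    models=["model-a", "model-b"],
    datasets=["bibliographic_data", "library_cards"],
)
for model, row in table.items():
    print(model, {dataset: render(score) for dataset, score in row.items()})
```

Keeping missing pairs as `None` rather than 0 separates tests that were never run from tests that scored poorly, which is the distinction the N/A entries express.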

Table 1

Overview of benchmark datasets, tasks, and scoring schemes.

BIBLIOGRAPHIC DATA: METADATA EXTRACTION FROM BIBLIOGRAPHIES
Data	5 images, 67 elements, JPG, approx. 1743 × 2888 px, 350 KB each.
Content	Pages from Bibliography of Works in the Philosophy of History, 1945–1957.
Source	http://www.jstor.org/stable/2504495
Language	English
Content Type	Printed text. Directory-like.
Task	Transcription; metadata extraction.
Expected output	JSON list of structured items; output defined via Pydantic class.
Scoring	Fuzzy matching on each key in the JSON items.
Test runs	77 runs.
Best result	71.43%, OpenAI, gpt-4o.
Design	The dataset was created as a proof of concept in the context of a master’s thesis and targets the extraction of structured bibliographic information from semi-structured, directory-like printed text. The benchmark focuses on structural inference rather than linguistic prompting and omits an explicit prompt in favor of a predefined Pydantic output schema.
Further Info	Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
BLACKLIST CARDS: NER EXTRACTION FROM INDEX CARDS
Data	33 images, JPG, approx. 1788 × 1305 px, 590 KB each.
Content	Swiss federal index cards for companies on a British black list for trade (1940s).
Source	https://www.recherche.bar.admin.ch/recherche/#/de/archiv/einheit/31240458 (last accessed 2026-01-12)
Language	Mostly German, some French
Content Type	Typed and handwritten text. Stamped dates.
Task	Transcription; metadata extraction.
Expected output	JSON object with predefined keys; output defined via Pydantic class.
Scoring	Fuzzy matching on each key in the JSON object.
Test runs	38 runs.
Best result	95.65%, OpenAI, gpt-4.1-mini.
Design	The dataset was created in the context of an ongoing dissertation and reflects a historically common archival document type. The cards were manually selected and curated to capture the combination of typed and handwritten entries typical of mid-20th-century administrative records, which are not readily available for bulk download.
Further Info	Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
BOOK ADVERT XMLS: DATA CORRECTION OF LLM-GENERATED XML FILES
Data	50 JSON files containing XML structures.
Content	Faulty LLM-generated XML structures from historical sources.
Source	Data from the digitized Basler Avisblatt, namely book advertisements, extracted as XML.
Language	Early Modern German
Content Type	Digital data. Plain text in JSON files.
Task	Correct XML structure (add closing tags, remove faulty tags).
Expected output	JSON object with correct XML as string; output defined via Pydantic class.
Scoring	Fuzzy matching on whole XML string after removing white spaces and setting all characters to lowercase.
Test runs	40 runs.
Best result	97.47%, Anthropic, Claude Sonnet 4.5.
Design	This dataset was created to introduce text-only benchmarks into the framework as part of an ongoing research project. It targets a reasoning task situated in a workflow, where large language models are required to detect and correct structural errors in previously generated XML, reflecting common data transformation and processing challenges in digital humanities pipelines.
Further Info	Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
BUSINESS LETTERS: NER EXTRACTION FROM CORRESPONDENCE
Data	57 letters, 98 images, JPG, approx. 2479 × 3508 px, 600 KB each.
Content	Collection of letters from the Basler Rheinschifffahrt-Aktiengesellschaft.
Source	http://dx.doi.org/10.7891/e-manuscripta-54917
Language	German
Content Type	Typed, printed and handwritten text. Signatures.
Task	Metadata extraction, person matching, signature recognition.
Expected output	JSON object with predefined keys; output defined via Pydantic class.
Scoring	F1 score.
Test runs	212 runs.
Best result	77.00%, OpenAI, gpt-5.
Design	This dataset was designed as a multi-stage benchmark reflecting a common humanities scenario centered on historical correspondence. The benchmark evaluates sequential tasks including metadata extraction, named-entity recognition, and signature-based person identification, addressing both linguistic variation and document-level decision making.
Further Info	Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
COMPANY LISTS: COMPANY DATA EXTRACTION FROM LIST-LIKE MATERIALS
Data	15 images, JPG, approx. 1868 × 2931 px, 360 KB each.
Content	Pages from Trade index: classified handbook of the members of the British Chamber of Commerce for Switzerland.
Source	https://doi.org/10.7891/e-manuscripta-174832
Language	English and German
Content Type	Printed lists with strongly varying layout.
Task	Metadata extraction with varying layouts.
Expected output	JSON list of structured items; output defined via Pydantic class.
Scoring	F1 score.
Test runs	76 runs.
Best result	58.40%, OpenAI, gpt-5.
Design	This dataset was developed in the context of an ongoing dissertation project focused on historical company networks. It evaluates metadata extraction from multilingual company lists with strongly varying layouts, testing a model’s ability to infer a consistent schema despite substantial variation in formatting and presentation.
Further Info	Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
FRAKTUR ADVERTS: PAGE SEGMENTATION AND FRAKTUR TEXT TRANSCRIPTION
Data	5 images, JPG, approx. 5000 × 8267 px, 10 MB each.
Content	Pages from the Basler Avisblatt, an early advertisement newspaper published in Basel, Switzerland.
Source	https://avisblatt.dg-basel.hasdai.org (last accessed 2026-01-12)
Language	Early Modern German
Content Type	Printed text in Fraktur typeface.
Task	Segmentation of adverts and text recognition.
Expected output	JSON list of structured items; output defined via Pydantic class.
Scoring	F1 score & Character Error Rate (CER).
Test runs	91 runs.
Best result	95.70%, Google, gemini-2.0-pro-exp-02-05.
Design	This dataset was developed as part of ongoing research at the University of Basel and targets historical advertisements printed in Fraktur typeface. It evaluates advert segmentation and text recognition under conditions of dense layout and strong typographic variability, while demonstrating that aggressive image downscaling has a limited effect on model performance.
Further Info	Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
LIBRARY CARDS: METADATA EXTRACTION FROM MULTILINGUAL SOURCES
Data	263 images, JPG, approx. 976 × 579 px, 10 KB each.
Content	Library cards with dissertation information.
Source	https://ub.unibas.ch/cmsdata/spezialkataloge/ipac/searchform.php?KatalogID=ak2 (last accessed 2026-01-12)
Language	German, French, English, Latin, Greek, and other European languages
Content Type	Typed and handwritten multilingual text.
Task	Metadata extraction.
Expected output	JSON object with predefined keys; output defined via Pydantic class.
Scoring	F1 score.
Test runs	61 runs.
Best result	89.51%, OpenAI, gpt-5.
Design	This dataset was created as a feasibility study to assess whether unique dissertation records can be identified and structured within a large-scale historical card catalog comprising approximately 700,000 entries. A random sample was selected to capture multilingual content, inconsistent layouts, and a mixture of handwritten, typed, and printed text, reflecting realistic challenges in large archival metadata extraction tasks.
Further Info	Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
MEDIEVAL MANUSCRIPTS: HANDWRITTEN TEXT RECOGNITION
Data	12 images, JPG, approx. 1872 × 2808 px, 1.1 MB each.
Content	Pages from Pilgerreisen nach Jerusalem 1440 und 1453.
Source	https://www.e-codices.ch/de/description/ubb/H-V-0015/HAN (last accessed 2026-01-12)
Language	Medieval German
Content Type	Handwritten text.
Task	Segmentation & Text recognition.
Expected output	JSON object with predefined keys; output defined via Pydantic class.
Scoring	Fuzzy matching & Character Error Rate (CER).
Test runs	38 runs.
Best result	76.90%, OpenAI, gpt-4.1-mini.
Design	This dataset was prepared in the context of a university seminar and focuses on late medieval manuscript material. It targets segmentation and text recognition of handwritten sources that are difficult to read due to script variability and the presence of marginalia, reflecting common challenges in medieval manuscript analysis.
Further Info	Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
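
The scoring schemes named in Table 1 (fuzzy matching per key, fuzzy matching on a normalized XML string, Character Error Rate, and F1) can be sketched roughly as below. This is an illustration under stated assumptions, not the benchmark's implementation: `difflib.SequenceMatcher` from the Python standard library stands in for whatever fuzzy matcher the framework uses, CER is computed as Levenshtein distance divided by reference length, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; difflib is an assumed stand-in matcher."""
    return SequenceMatcher(None, a, b).ratio()


def per_key_fuzzy(predicted: dict, truth: dict) -> float:
    """Average fuzzy similarity over the keys of the ground-truth object."""
    if not truth:
        return 1.0
    scores = [
        fuzzy_ratio(str(predicted.get(key, "")), str(value))
        for key, value in truth.items()
    ]
    return sum(scores) / len(scores)


def normalize_xml(xml: str) -> str:
    """Remove all whitespace and lowercase the string before comparison."""
    return "".join(xml.split()).lower()


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character Error Rate: edit distance relative to reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


def f1(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0
```

For example, `normalize_xml("<A>\n <B>x y</B>\n</A>")` yields `"<a><b>xy</b></a>"`, so two XML strings that differ only in indentation or letter case compare as identical before fuzzy matching is applied. Note that CER is an error measure (lower is better), while the fuzzy and F1 scores are similarity measures (higher is better).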

Table 2

Contributors to the project. Contributor roles follow the CRediT taxonomy.

NAME	PRIMARY ROLES
Anthea Alberto	Data Curation; Validation
Sven Burkhardt	Validation
Eric Decker	Data Curation; Validation
Pema Frick	Data Curation; Validation; Formal Analysis; Software
Maximilian Hindermann	Conceptualization; Methodology; Software; Formal Analysis
Lea Katharina Kasper	Data Curation; Validation; Formal Analysis
José Luis Losada Palenzuela	Data Curation; Validation
Sorin Marti	Conceptualization; Software; Formal Analysis; Visualization
Gabriel Müller	Data Curation; Validation
Ina Serif	Data Curation; Validation; Formal Analysis
Elena Spadini	Data Curation; Validation
