The RISE Humanities Data Benchmark: A Framework for Evaluating Large Language Models for Humanities Tasks

Open Access | Feb 2026

Figures & Tables

Figure 1

Execution flow of a benchmark test run.

Figure 2

Overview of the Bibliographic Data benchmark in the visualization frontend.

Figure 3

Best performance per model and benchmark dataset, taking the maximum score over all recorded configurations per model and dataset. Null values (N/A) indicate tests that have not yet been run or can no longer be run because the corresponding model has been deprecated.
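The aggregation described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the framework's own code: the record layout, the model and dataset names, and the helper functions `best_scores` and `lookup` are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical run records: (model, dataset, configuration id, score in percent).
Run = Tuple[str, str, str, float]

runs: List[Run] = [
    ("model-a", "dataset-x", "cfg-1", 61.0),
    ("model-a", "dataset-x", "cfg-2", 70.5),
    ("model-a", "dataset-y", "cfg-1", 40.0),
    ("model-b", "dataset-x", "cfg-1", 55.0),
]


def best_scores(records: List[Run]) -> Dict[Tuple[str, str], float]:
    """Keep the maximum score over all recorded configurations per (model, dataset)."""
    best: Dict[Tuple[str, str], float] = {}
    for model, dataset, _config, score in records:
        key = (model, dataset)
        if key not in best or score > best[key]:
            best[key] = score
    return best


def lookup(best: Dict[Tuple[str, str], float], model: str, dataset: str) -> Optional[float]:
    """Return the best score, or None (displayed as N/A) if no run was recorded."""
    return best.get((model, dataset))


best = best_scores(runs)
print(lookup(best, "model-a", "dataset-x"))  # best of cfg-1 and cfg-2
print(lookup(best, "model-b", "dataset-y"))  # no recorded run
```

A missing combination is represented here as `None`, which corresponds to the N/A cells for tests that were never run or whose model has been deprecated.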

Table 1

Overview of benchmark datasets, tasks, and scoring schemes.

BIBLIOGRAPHIC DATA: METADATA EXTRACTION FROM BIBLIOGRAPHIES
Data: 5 images, 67 elements, JPG, approx. 1743 × 2888 px, 350 KB each.
Content: Pages from Bibliography of Works in the Philosophy of History, 1945–1957.
Source: http://www.jstor.org/stable/2504495
Language: English
Content Type: Printed text. Directory-like.
Task: Transcription; metadata extraction.
Expected output: JSON list of structured items; output defined via Pydantic class.
Scoring: Fuzzy matching on each key in the JSON items.
Test runs: 77 runs.
Best result: 71.43%, OpenAI, gpt-4o.
Design: The dataset was created as a proof of concept in the context of a master’s thesis and targets the extraction of structured bibliographic information from semi-structured, directory-like printed text. The benchmark focuses on structural inference rather than linguistic prompting and omits an explicit prompt in favor of a predefined Pydantic output schema.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
BLACKLIST CARDS: NER EXTRACTION FROM INDEX CARDS
Data: 33 images, JPG, approx. 1788 × 1305 px, 590 KB each.
Content: Swiss federal index cards for companies on a British black list for trade (1940s).
Source: https://www.recherche.bar.admin.ch/recherche/#/de/archiv/einheit/31240458 (last accessed 2026-01-12)
Language: Mostly German, some French
Content Type: Typed and handwritten text. Stamped dates.
Task: Transcription; metadata extraction.
Expected output: JSON object with predefined keys; output defined via Pydantic class.
Scoring: Fuzzy matching on each key in the JSON object.
Test runs: 38 runs.
Best result: 95.65%, OpenAI, gpt-4.1-mini.
Design: The dataset was created in the context of an ongoing dissertation and reflects a historically common archival document type. The cards were manually selected and curated to capture the combination of typed and handwritten entries typical for mid-20th-century administrative records, which are not readily available for bulk download.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
BOOK ADVERT XMLS: DATA CORRECTION OF LLM-GENERATED XML FILES
Data: 50 JSON files containing XML structures.
Content: Faulty LLM-generated XML structures from historical sources.
Source: Data from the digitized Basler Avisblatt, namely book advertisements, extracted as XML.
Language: Early Modern German
Content Type: Digital data. Plain text in JSON files.
Task: Correct XML structure (add closing tags, remove faulty tags).
Expected output: JSON object with correct XML as string; output defined via Pydantic class.
Scoring: Fuzzy matching on whole XML string after removing white spaces and setting all characters to lowercase.
Test runs: 40 runs.
Best result: 97.47%, Anthropic, Claude Sonnet 4.5.
Design: This dataset was created to introduce text-only benchmarks into the framework as part of an ongoing research project. It targets a reasoning task situated in a workflow, where large language models are required to detect and correct structural errors in previously generated XML, reflecting common data transformation and processing challenges in digital humanities pipelines.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
BUSINESS LETTERS: NER EXTRACTION FROM CORRESPONDENCE
Data: 57 letters, 98 images, JPG, approx. 2479 × 3508 px, 600 KB each.
Content: Collection of letters from the Basler Rheinschifffahrt-Aktiengesellschaft.
Source: http://dx.doi.org/10.7891/e-manuscripta-54917
Language: German
Content Type: Typed, printed and handwritten text. Signatures.
Task: Metadata extraction, person matching, signature recognition.
Expected output: JSON object with predefined keys; output defined via Pydantic class.
Scoring: F1 score.
Test runs: 212 runs.
Best result: 77.00%, OpenAI, gpt-5.
Design: This dataset was designed as a multi-stage benchmark reflecting a common humanities scenario centered on historical correspondence. The benchmark evaluates sequential tasks including metadata extraction, named-entity recognition, and signature-based person identification, addressing both linguistic variation and document-level decision making.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
COMPANY LISTS: COMPANY DATA EXTRACTION FROM LIST-LIKE MATERIALS
Data: 15 images, JPG, approx. 1868 × 2931 px, 360 KB each.
Content: Pages from Trade index: classified handbook of the members of the British Chamber of Commerce for Switzerland.
Source: https://doi.org/10.7891/e-manuscripta-174832
Language: English and German
Content Type: Printed lists with strongly varying layout.
Task: Metadata extraction with varying layouts.
Expected output: JSON list of structured items; output defined via Pydantic class.
Scoring: F1 score.
Test runs: 76 runs.
Best result: 58.40%, OpenAI, gpt-5.
Design: This dataset was developed in the context of an ongoing dissertation project focused on historical company networks. It evaluates metadata extraction from multilingual company lists with strongly varying layouts, testing a model’s ability to infer a consistent schema despite substantial variation in formatting and presentation.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
FRAKTUR ADVERTS: PAGE SEGMENTATION AND FRAKTUR TEXT TRANSCRIPTION
Data: 5 images, JPG, approx. 5000 × 8267 px, 10 MB each.
Content: Pages from the Basler Avisblatt, an early advertisement newspaper published in Basel, Switzerland.
Source: https://avisblatt.dg-basel.hasdai.org (last accessed 2026-01-12)
Language: Early Modern German
Content Type: Printed text in Fraktur typeface.
Task: Segmentation of adverts and text recognition.
Expected output: JSON list of structured items; output defined via Pydantic class.
Scoring: F1 score & Character Error Rate (CER).
Test runs: 91 runs.
Best result: 95.70%, Google, gemini-2.0-pro-exp-02-05.
Design: This dataset was developed as part of ongoing research at the University of Basel and targets historical advertisements printed in Fraktur typeface. It evaluates advert segmentation and text recognition under conditions of dense layout and strong typographic variability, while demonstrating that aggressive image downscaling has a limited effect on model performance.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
LIBRARY CARDS: METADATA EXTRACTION FROM MULTILINGUAL SOURCES
Data: 263 images, JPG, approx. 976 × 579 px, 10 KB each.
Content: Library cards with dissertation thesis information.
Source: https://ub.unibas.ch/cmsdata/spezialkataloge/ipac/searchform.php?KatalogID=ak2 (last accessed 2026-01-12)
Language: German, French, English, Latin, Greek, and other European languages
Content Type: Typed and handwritten multilingual text.
Task: Metadata extraction.
Expected output: JSON object with predefined keys; output defined via Pydantic class.
Scoring: F1 score.
Test runs: 61 runs.
Best result: 89.51%, OpenAI, gpt-5.
Design: This dataset was created as a feasibility study to assess whether unique dissertation records can be identified and structured within a large-scale historical card catalog comprising approximately 700,000 entries. A random sample was selected to capture multilingual content, inconsistent layouts, and a mixture of handwritten, typed, and printed text, reflecting realistic challenges in large archival metadata extraction tasks.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
MEDIEVAL MANUSCRIPTS: HANDWRITTEN TEXT RECOGNITION
Data: 12 images, JPG, approx. 1872 × 2808 px, 1.1 MB each.
Content: Pages from Pilgerreisen nach Jerusalem 1440 und 1453.
Source: https://www.e-codices.ch/de/description/ubb/H-V-0015/HAN (last accessed 2026-01-12)
Language: Medieval German
Content Type: Handwritten text.
Task: Segmentation & text recognition.
Expected output: JSON object with predefined keys; output defined via Pydantic class.
Scoring: Fuzzy matching & Character Error Rate (CER).
Test runs: 38 runs.
Best result: 76.90%, OpenAI, gpt-4.1-mini.
Design: This dataset was prepared in the context of a university seminar and focuses on late medieval manuscript material. It targets segmentation and text recognition of handwritten sources that are difficult to read due to script variability and the presence of marginalia, reflecting common challenges in medieval manuscript analysis.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-12)
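
The scoring schemes named in Table 1 (per-key fuzzy matching, fuzzy matching on a normalized XML string, Character Error Rate, and F1 score) can be sketched roughly as below. This is an illustration under stated assumptions, not the benchmark's implementation: the table does not say which string-similarity measure is used, so Python's `difflib.SequenceMatcher` stands in for it; averaging over keys and the way true/false positives are counted are likewise assumptions, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; difflib is an assumed stand-in for the real matcher."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_key_score(predicted: dict, expected: dict) -> float:
    """Average fuzzy similarity over the expected keys; a missing key scores 0."""
    if not expected:
        return 1.0
    total = 0.0
    for key, truth in expected.items():
        if key in predicted:
            total += fuzzy_ratio(str(predicted[key]), str(truth))
    return total / len(expected)


def normalize_xml(xml: str) -> str:
    """Remove all white space and lowercase, as stated for Book Advert XMLs."""
    return "".join(xml.split()).lower()


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(prediction: str, reference: str) -> float:
    """Character Error Rate: edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, reference) / len(reference)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Usage with made-up values.
print(fuzzy_key_score({"title": "History"}, {"title": "History", "year": "1950"}))
print(normalize_xml("<Item>\n  <Title>Buch</Title>\n</Item>"))
print(cer("abcd", "abce"))
print(f1_score(tp=8, fp=2, fn=2))
```

Note that CER is an error measure (lower is better), whereas the fuzzy and F1 scores are similarity measures (higher is better); how the benchmark combines them into a single percentage is not stated in the table.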
Table 2

Contributors to the project. Contributor roles follow the CRediT taxonomy.

NAME | PRIMARY ROLES
Anthea Alberto | Data Curation; Validation
Sven Burkhardt | Validation
Eric Decker | Data Curation; Validation
Pema Frick | Data Curation; Validation; Formal Analysis; Software
Maximilian Hindermann | Conceptualization; Methodology; Software; Formal Analysis
Lea Katharina Kasper | Data Curation; Validation; Formal Analysis
José Luis Losada Palenzuela | Data Curation; Validation
Sorin Marti | Conceptualization; Software; Formal Analysis; Visualization
Gabriel Müller | Data Curation; Validation
Ina Serif | Data Curation; Validation; Formal Analysis
Elena Spadini | Data Curation; Validation
DOI: https://doi.org/10.5334/johd.481 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 15, 2025 | Accepted on: Jan 7, 2026 | Published on: Feb 4, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Maximilian Hindermann, Sorin Marti, Lea Katharina Kasper, Arno Bosse, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.