
From Experiments to Epistemic Practice: The RISE Humanities Data Benchmark


1 Benchmarking within the Humanities: Context and Rationale

The impulse to order complexity—to make sense of multiplicity by arranging, naming, and comparing—runs deep in the humanities. Long before the language of data and metrics, scholars constructed catalogs, taxonomies, and controlled vocabularies to render heterogeneous source material intelligible. Such instruments did more than impose order; they defined what could be connected, distinguished, related, or compared. As practices of ordering and classification became more formalized, the collection of data increasingly adopted the same categorical logic, and scholars grew more attentive to the conditions under which data were produced, recognizing that the reliability of interpretation and analysis rests not only on conceptual rigor but on the coherence, consistency, and traceability of the underlying data. The inclusion of digital methods not just in the analysis of research data but also in the acts of collection, extraction, categorization, and structuring has amplified the need to measure the performance of such tools and workflows. Such measurements are a prerequisite for producing reliable and reproducible results. The RISE Humanities Data Benchmark1 (Hindermann, Marti, Alberto, et al., 2026) draws from this epistemic tradition, translating the humanistic concern for transparency and comparability into a framework for evaluating computational methods against representative (digital) humanities tasks and datasets.

1.1 From Interpretive Tasks to Measurable Benchmarks

Researchers often want to carry out data-centric tasks such as extracting, transcribing, structuring, categorizing, and normalizing research data from source material. Such tasks are inherently interpretive, as the heterogeneity, material condition, and contextual variation of the sources introduce a degree of complexity that cannot be fully resolved through procedural rules. Benchmarking can respond to this condition by formalizing those aspects of data processing that can be rendered measurable while explicitly documenting how complexity and heterogeneity are managed in that process. When generative LLMs are used to perform such tasks, the evaluation becomes significantly more complex: we now need to compare systems that operate probabilistically and to take into account sampling and model parameters, prompts, quantized versions of the same model, and even the choice of inference engine. All of these factors can affect accuracy, bias, and reproducibility. The infrastructure for a benchmark should be able to operationalize such comparison at the tool level and allow the negotiation between structure and variability to become a transparent, comparable dimension of the evaluation itself.

Recent works have offered methodological frameworks for integrating LLMs into humanities and social science research workflows (Abdurahman et al., 2025; Karjus, 2025), but comparable infrastructure for evaluating the performance of generative LLMs on humanities data remains largely absent (Dobson, 2020; Simons et al., 2025). Most current benchmarks either prioritize scale (Hauser et al., 2024; Kang et al., 2025) or focus on specific domains (e.g., Ziems et al., 2024; Spinaci et al., 2025; Hamilton et al., 2025; Khadangi et al., 2025). Recently, Bamman et al. (2024) compare prompting-based and supervised approaches across ten cultural analytics classification tasks, finding that prompting performs competitively for established tasks but lags behind supervised methods for de novo problems lacking precedent in training data. But their focus on text classification is distinct from the document understanding tasks we regularly encounter in our consulting practice, which we feel remain comparatively underserved.

The RISE Benchmark suite was developed to help address this gap. Each benchmark operationalizes a discrete research task such as transcription, metadata extraction, classification, normalization, or enrichment by defining a standardized input (image, text, and prompt), an expected output, a task-specific ground truth, and a scoring function for quantitative evaluation. The ground truth is not intended to capture the full interpretive complexity of the source material; rather, it represents the explicitly defined, correct output for that specific operational step as defined by the researcher. It functions as a validation target: the benchmark measures how well a model can replicate a discrete, clearly defined task, which in turn allows researchers to assess the model’s utility for their work. The suite thus serves both ex ante and ex post functions: it provides measurable evidence for testing, selecting, and configuring models before their application, and enables the assessment of data accuracy and model performance during or after a project.

1.2 Practice-Driven Benchmarking: From Consulting to Systematic Evaluation

At the University of Basel, the Research and Infrastructure Support (RISE) team provides consulting services for digital projects in the humanities and social sciences. This regularly involves evaluating the suitability of computational methods for specific research contexts. Previously, assessments were conducted on a case-by-case basis, evaluating whether specific models were suitable for particular data types and research tasks. While such ad hoc testing did produce a corpus of tacit methodological knowledge—for example, which models handled historical handwriting reliably, which configurations balanced cost and accuracy, or which visual formats caused systematic failure—drawing on this knowledge proved inefficient in practice. What was needed were consistent and rapid comparisons of test results across LLMs under reproducible conditions.

To facilitate this, we developed an accessible, modular, technical framework that could aggregate ad hoc empirical observations about model performance, data structure, and evaluation criteria beyond individual tests and projects. Our goal was to promote a virtuous feedback loop between applied research questions and research support and enable researchers to apply and validate their methodological ideas. Over time, the results from the suite would allow project-level findings to be aggregated across comparable task types and sources, yielding more generalizable insights into LLM performance.

Applied digital humanities research relies on the operationalization of concrete data-centric tasks such as transcription, normalization, metadata extraction, and classification: activities that link directly to the creation and validation of research data. These processes require models to interact with heterogeneous, often degraded—faded, damaged, or low-quality—sources and multi-modal materials combining textual, visual, and structural information whose formats, vocabularies, and contextual meanings vary across periods.

In recognition of this, each benchmark is built around a collection of small, representative sample datasets and associated tasks. Rather than organizing benchmarks thematically, we drew them directly from concrete humanities research projects that RISE has previously curated and supervised. These derive mainly from historical sources: business correspondence, printed texts in Fraktur, medieval manuscripts, list-like materials such as 19th-century bibliographic records and early 20th-century company directories, and index card-based sources like library cards or blacklist datasets. Sampling strategies at the level of individual documents or records vary depending on source characteristics and research goals: random or stratified sampling is used where datasets are sufficiently uniform; elsewhere, samples are selected through targeted curation to capture typical variation; and where data volumes are limited, the complete dataset is included without further sampling. Individual tasks are defined through a set of prompts and the representation, in structured form, of the expected outputs from the model after it has processed the sources. For example, an LLM might recognize a location on a scanned index card and return it in a correctly structured form. Depending on the research context, processing such material often requires models to perform several operations—recognizing, segmenting, and classifying content within a single step—rather than completing discrete tasks sequentially.
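As a minimal, purely illustrative sketch (field names and values are invented, not taken from the benchmark data), such a structured representation could be as simple as:

    # Hypothetical structured output for one scanned index card;
    # field names and values are invented for illustration.
    card_result = {
        "company": "Muster & Cie.",
        "location": "Basel",
    }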

External contributions to these datasets are always welcome—the collection has already been expanded several times in the last year. See Table 1 for an overview of the sample datasets; a companion data paper provides more details (Hindermann, Marti, Kasper, & Bosse, 2026).2

Table 1

Overview of benchmark datasets and tasks. A companion data paper provides a more detailed table of the available datasets (Hindermann, Marti, Kasper, & Bosse, 2026).

Bibliographic Data: Metadata extraction from bibliographies
Data: 5 images, 67 elements, JPG, approx. 1743 × 2888 px, 350 KB each.
Content: Pages from Bibliography of Works in the Philosophy of History, 1945–1957.
Design: The dataset was created as a proof of concept in the context of a master’s thesis and targets the extraction of structured bibliographic information from semi-structured, directory-like printed text. The benchmark focuses on structural inference rather than linguistic prompting and omits an explicit prompt in favor of a predefined Pydantic output schema.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)

Blacklist Cards: NER extraction from index cards
Data: 33 images, JPG, approx. 1788 × 1305 px, 590 KB each.
Content: Swiss federal index cards for companies on a British black list for trade (1940s).
Design: The dataset was created in the context of an ongoing dissertation and reflects a historically common archival document type. The cards were manually selected and curated to capture the combination of typed and handwritten entries typical for mid-20th-century administrative records, which are not readily available for bulk download.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)

Book Advert XMLs: Data correction of LLM-generated XML files
Data: 50 JSON files containing XML structures.
Content: Faulty LLM-generated XML structures from historical sources.
Design: This dataset was created to introduce text-only benchmarks into the framework as part of an ongoing research project. It targets a reasoning task situated in a workflow, where large language models are required to detect and correct structural errors in previously generated XML, reflecting common data transformation and processing challenges in digital humanities pipelines.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)

Business Letters: NER extraction from correspondence
Data: 57 letters, 98 images, JPG, approx. 2479 × 3508 px, 600 KB each.
Content: Collection of letters from the Basler Rheinschifffahrt-Aktiengesellschaft.
Design: This dataset was designed as a multi-stage benchmark reflecting a common humanities scenario centered on historical correspondence. The benchmark evaluates sequential tasks including metadata extraction, named-entity recognition, and signature-based person identification, addressing both linguistic variation and document-level decision making.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)

Company Lists: Company data extraction from list-like materials
Data: 15 images, JPG, approx. 1868 × 2931 px, 360 KB each.
Content: Pages from Trade index: classified handbook of the members of the British Chamber of Commerce for Switzerland.
Design: This dataset was developed in the context of an ongoing dissertation project focused on historical company networks. It evaluates metadata extraction from multilingual company lists with strongly varying layouts, testing a model’s ability to infer a consistent schema despite substantial variation in formatting and presentation.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)

Fraktur Adverts: Page segmentation and Fraktur text transcription
Data: 5 images, JPG, approx. 5000 × 8267 px, 10 MB each.
Content: Pages from the Basler Avisblatt, an early advertisement newspaper published in Basel, Switzerland.
Design: This dataset was developed as part of ongoing research at the University of Basel and targets historical advertisements printed in Fraktur typeface. It evaluates advert segmentation and text recognition under conditions of dense layout and strong typographic variability, while demonstrating that aggressive image down-scaling has a limited effect on model performance.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)

Library Cards: Metadata extraction from multilingual sources
Data: 263 images, JPG, approx. 976 × 579 px, 10 KB each.
Content: Library cards with dissertation thesis information.
Design: This dataset was created as a feasibility study to assess whether unique dissertation records can be identified and structured within a large-scale historical card catalog comprising approximately 700,000 entries. A random sample was selected to capture multilingual content, inconsistent layouts, and a mixture of handwritten, typed, and printed text, reflecting realistic challenges in large archival metadata extraction tasks.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)

Medieval Manuscripts: Handwritten text recognition
Data: 12 images, JPG, approx. 1872 × 2808 px, 1.1 MB each.
Content: Pages from Pilgerreisen nach Jerusalem 1440 und 1453.
Design: This dataset was prepared in the context of a university seminar and focuses on late medieval manuscript material. It targets segmentation and text recognition of handwritten sources that are difficult to read due to script variability and the presence of marginalia, reflecting common challenges in medieval manuscript analysis.
Further Info: Dataset Description, Test Results, GitHub (last accessed 2026-01-16)

Finally, the tests define the concrete conditions under which a benchmark is executed. Each test specifies the provider, model, prompt, temperature, output schema or data-class, and other runtime parameters. This design enables a single dataset—with its corresponding ground truth and scoring logic—to be evaluated across multiple LLMs using identical workflows. Comparative testing thus becomes straightforward and fully reproducible, requiring no adjustments to the underlying code.
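As an illustration, a single test entry could look roughly like the following sketch, written here as a Python dictionary; the actual configuration format and field names used in the repository may differ:

    # Hypothetical test definition; the suite's real configuration file
    # may use other field names and a different file format.
    test_entry = {
        "id": "library_cards_gpt-4o_minimal_prompt",  # unique test ID
        "benchmark": "library_cards",                 # dataset plus scoring logic
        "provider": "openai",                         # API provider
        "model": "gpt-4o",                            # model offered by that provider
        "prompt": "minimal.txt",                      # prompt file from prompts/
        "temperature": 0.0,                           # sampling temperature
        "dataclass": "LibraryCard",                   # structured output schema
    }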

To facilitate sharing and comparison of results, benchmark outputs are rendered through a dedicated presentation interface (NDR-Core), also developed and supported by RISE.3 This interface aggregates scores, costs, and processing times across benchmarks, visualizes model performance along multiple dimensions, and enables researchers to explore trade-offs before applying models to their own data. Whilst this front-end provides a valuable means of exploring results, it is not essential for running benchmarks. All required components are hosted on GitHub, with data automatically synchronized to the front-end at regular intervals. This architecture combines a dynamic, user-friendly interface with a stable, version-controlled data repository. Regular GitHub releases are automatically pushed to Zenodo to ensure the project remains citable and FAIR-compliant.

The benchmark framework itself handles the complexities of LLM API interactions and repeatable workflows, providing a provider-agnostic interface for running identical steps across different models and vendors. In practice, creating a benchmark involves only a few clear steps: selecting the data to process (currently mostly images), defining a schema for the desired output, preparing a ground-truth file for each data item, and implementing an abstract benchmark class that defines the scoring logic. Once this structure is in place, users can add tests—each representing a concrete run with a chosen provider and model—without changing any code. Complex evaluation workflows thus require minimal scripting beyond the scoring implementation, whilst remaining fully comparable across providers.
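A scoring implementation of this kind might be sketched as follows; class and method names are illustrative assumptions rather than the suite's actual API:

    # Hypothetical sketch of a benchmark class with custom scoring logic;
    # names and signatures are assumptions, not the suite's actual interface.
    class Benchmark:
        """Stand-in for the suite's abstract benchmark base class."""
        def score(self, response: dict, ground_truth: dict) -> float:
            raise NotImplementedError

    class LibraryCardsBenchmark(Benchmark):
        FIELDS = ("author", "title", "year", "place")

        def score(self, response: dict, ground_truth: dict) -> float:
            # Share of metadata fields that exactly match the ground truth.
            hits = sum(
                1 for field in self.FIELDS
                if str(response.get(field, "")).strip()
                == str(ground_truth.get(field, "")).strip()
            )
            return hits / len(self.FIELDS)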

2 Method: System Design and Implementation

2.1 Benchmark Structure and Execution

As illustrated in Figure 1, a defining feature of our benchmarks is their structured JSON output, generated by providing predefined schemas to the model context. These schemas, which can be implemented either as Pydantic data classes or as standard JSON Schema (see Listing 2), enable structural validation of the results and allow performance to be measured against ground-truth data following the same schema (see Listing 3). All benchmarks follow a consistent directory structure containing the images and/or text files to be analyzed, prompts describing the task and output format/schema,4 ground truth, and optional Python scripts that define structured output or custom scoring, allowing researchers to adapt existing data or create new tasks with minimal effort. Experiments are conducted as tests. Tests are identified by unique IDs and defined in a shared configuration file that specifies a benchmark, provider, model, prompt, and additional parameters (such as temperature, data class type, or evaluation rules). Model responses are evaluated automatically using task-specific scoring functions triggered by the test configuration in use (see Listing 1).5 For each benchmark and class of configuration, performance can be compared between models or providers. By creating tests with one variable parameter (e.g., the prompt), researchers can compare and optimize models directly. The modular design supports both individual experimentation and collective comparison, providing shared infrastructure for systematic LLM evaluation in humanities research.
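To make the role of the schema concrete, the following simplified sketch pairs a Pydantic data class (v2 syntax) with a matching ground-truth record; both are deliberately reduced illustrations and do not reproduce the actual Library Cards schema shown in Listing 2:

    # Illustrative only; the real schema and ground-truth files are more detailed.
    from typing import Optional
    from pydantic import BaseModel

    class DissertationCard(BaseModel):
        author: str
        title: str
        year: Optional[int] = None
        place: Optional[str] = None

    # The model's JSON response is validated against the schema ...
    response = DissertationCard.model_validate(
        {"author": "Muster, Anna", "title": "Studien zur Rheinschifffahrt", "year": 1923}
    )

    # ... and then compared field by field with a ground truth of the same shape.
    ground_truth = DissertationCard(
        author="Muster, Anna", title="Studien zur Rheinschifffahrt", year=1923, place="Basel"
    )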

Figure 1

Images in images/ are sent to an LLM with a prompt from prompts/; the model’s response is compared against a ground truth file in ground_truths/. Each result follows the structure defined by dataclass.py and is scored using the implementation in benchmark.py.

Listing 1

The benchmark class (without actual implementation) from the Library Cards benchmark v0.4.1.

Listing 2

The Pydantic data class (without docstrings) from the Library Cards benchmark v0.4.1.

Listing 3

The ground truth for card 00624122 (see Figure 2) from the Library Cards benchmark v0.4.1.

Figure 2

The card 00624122 from the Library Cards benchmark v0.4.1.

Beyond internal performance metrics dependent on ground truth, the suite records external indicators relevant to the feasibility of employing a model in practice: runtime, token usage, and monetary cost. These measurements enable quantitative comparisons using standardized ratios: cost efficiency expressed as monetary cost per performance unit (e.g., dollars per F1 point or per character error rate improvement) and time efficiency as processing duration per item or per accuracy threshold achieved. By normalizing across different scoring scales, these metrics support systematic comparison of model trade-offs both between individual benchmarks and across model families. Considered together, internal and external metrics address questions that typically arise during project consultations: Which model offers sufficient accuracy for the specific material? Will large-scale processing of my materials remain financially viable? How long will it take to complete a run using available infrastructure?
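As a simple illustration of such ratios (all numbers invented for the example), the derived metrics reduce to elementary arithmetic:

    # Illustrative cost- and time-efficiency ratios; all numbers are invented.
    run_cost_usd = 0.42      # total API cost of one test run
    f1_score = 84.0          # aggregate F1 of that run, in points (0-100)
    runtime_s = 380.0        # wall-clock duration of the run in seconds
    n_items = 263            # number of processed items (e.g., library cards)

    cost_per_f1_point = run_cost_usd / f1_score   # dollars per F1 point
    seconds_per_item = runtime_s / n_items        # processing time per item

    print(f"{cost_per_f1_point:.4f} USD per F1 point, {seconds_per_item:.2f} s per item")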

2.2 Reproducibility and FAIRness

The benchmark suite reproduces experiments through version-controlled test configurations and benchmark components: model input and generated responses are stored in raw form (see Hindermann, 2024), while evaluation of model responses against a benchmark’s scoring function and ground truth is automated and deterministic.

Our benchmark suite follows the FAIR principles for research software (FAIR4RS; see Barker et al., 2022). Each benchmark contains standardized metadata describing the employed materials as a dataset (including provenance and licensing), the ground truth creation workflow and its data format, the scoring criteria and how they were implemented, and the benchmark’s creators and their contribution type (following Treloar et al., 2007). In line with software-engineering principles of modularity, the suite externalizes all API-access functionality into a dedicated Python package (Marti, 2025). Separating API logic from benchmark definitions maintains interoperability and allows the benchmarks to remain compatible with evolving model providers and future extensions. The benchmark suite is hosted on a public GitHub repository with semantically versioned releases automatically pushed to Zenodo.

These principles make the evaluative choices in a benchmark transparent. Reproducibility is not treated as exact replication but as the capacity to trace, examine, and adapt the interpretive decisions underlying the creation of ground truth—allowing researchers working with similar materials or facing comparable methodological choices to build on prior work.

2.3 Community Contribution and Adaptation

From its inception, the benchmark suite was conceived as a community resource. The repository provides a contribution guide and benchmark template that enable researchers to create new tasks derived from their own projects. Adding a benchmark requires only a directory with the standard structure, a short description, and defined evaluation criteria. A lightweight review process ensures legal compliance and necessary metadata before integration.
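Based on the layout summarized in Figure 1, a contributed benchmark directory might look roughly as follows; file names beyond those mentioned in Figure 1 are assumptions:

    my_benchmark/
        images/           # source images or text files to be analyzed
        prompts/          # task prompts, including the expected output format
        ground_truths/    # one ground-truth file per data item
        dataclass.py      # structured output schema (Pydantic or JSON Schema)
        benchmark.py      # scoring logic for this benchmark
        README.md         # short description, provenance, licensing, contributors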

Researchers who prefer to work independently can fork the repository, adapt it to their data, and conduct benchmarks locally without submitting them for inclusion. This flexibility allows the framework to function both as shared public infrastructure and as a private research tool for institutions conducting internal evaluations. The open structure encourages informal experiments to evolve into reusable benchmarks, gradually building a shared evidence base that reflects the diversity of humanities data and practices.

3 From Practical Hurdles to Epistemic Practice

Developing and deploying the RISE Humanities Data Benchmark suite revealed significant challenges in using LLMs for humanities research. This discussion examines two representative issues: the pragmatic problem of assessing the feasibility of drawing on LLMs for specific humanities tasks without adequate evaluation frameworks, and the epistemic challenge of adapting benchmarking practices to domains where interpretive choices are fundamental rather than incidental. Both issues are illustrated by three consulting cases that informed our benchmark development and were integrated as benchmarks (see also Table 1):

  • Fraktur Adverts: Segment and transcribe classified advertisements from the Basler Avisblatt (1729–1845), printed in Fraktur typeface. Challenge: How to handle historical spelling, typographical errors, and section hierarchies?

  • Library Cards: Extract structured bibliographic metadata from the University Library Basel’s digitized collection of multilingual library index cards of dissertations up to 1980 (printed, handwritten, or both). Challenge: How to deal with multiple historically evolved metadata standards and practices on 700,000 index cards? How to map these historical metadata standards to RDA-DACH?6

  • Business Letters: Undertake named entity recognition on a collection of business letters from the archive of the Rheinschifffahrt-Aktiengesellschaft (1926–1932). Challenge: How to identify the same individuals, places, and organizations throughout the collection when these appeared in typed text, handwritten notes, personal signatures, with variant spellings, different languages, or alternative names (e.g., due to marriage or company mergers)?

3.1 The Pragmatic Challenge

The most immediate practical barrier we encounter is not technical complexity but the absence of systematic methods for assessing whether LLMs can adequately perform specific humanities-relevant tasks. Increasingly, projects approaching RISE for consultation challenge us with a simply stated but difficult question: Can this workflow be feasibly and reliably automated with current LLMs? The answer depends not only on a model’s capability but also on context-specific standards of accuracy, domain conventions, and the heterogeneity of historical sources.

In practice, projects must determine whether LLMs can:

  • segment and transcribe Fraktur advertisements while preserving historical spelling conventions;

  • extract consistent bibliographic metadata from handwritten and inconsistently formatted index cards;

  • recognize entities across corpus-specific variations.

General-purpose LLM benchmarks offer no insight into these domain-specific problems, and ad hoc experimentation performed by individual project teams is often costly, unsystematic, and ultimately inconclusive. Projects also lack a framework for defining context-appropriate success criteria. For example, transcribing Fraktur advertisements for basic text retrieval requires a very different error tolerance than producing a critical scholarly edition. Similarly, the level of precision required for extracting structured data from index cards varies according to the downstream use: general bibliographic discovery requires less precision than catalog ingestion.

In these situations, projects tend either to avoid computational methods altogether due to uncertainty about feasibility and cost, or to commit resources to computational pipelines that ultimately fail to meet their quality requirements. In our consulting practice we regularly encounter both patterns: projects that invested in LLM-based workflows which proved insufficiently reliable for their purposes, and projects that continued using resource-intensive manual procedures even though partial or full automation would have met their needs. Our response has been to design targeted evaluations that combine representative (random or stratified) samples from project data with explicit, context-sensitive performance thresholds.

With our benchmark suite we were able to systematize and express these individual, pragmatic interpretative judgments in the framework of a reusable evaluation infrastructure. Each experiment leading to an individual evaluation contributes to accumulated methodological knowledge—not because success criteria or datasets transfer directly, but because the evaluation design process, sampling strategies, and implementation frameworks can be directly compared and adapted to new contexts. For example, the Fraktur benchmark’s approach to handling historical spelling variations is able to inform evaluation design for other historical transcription projects, while the catalog card extraction framework provides templates for structured metadata evaluation across different bibliographic contexts.

This approach supports decision-making for projects considering computational replacements for manual workflows, while acknowledging that broader questions of interpretive adequacy and contextual understanding remain beyond the scope of automated evaluation.

3.2 The Epistemic Challenge

The most significant epistemic challenge we encountered in this context stemmed from the uncritical importation of benchmarking practices from NLP and computer vision into humanities contexts. Most standard LLM benchmarking assumes clear-cut, objective ranking. Typical (digital) humanities tasks such as transcription, metadata extraction or entity recognition, however, resist this form of evaluation. When transcribing Fraktur advertisements, should historical spelling errors be preserved or corrected? Should archaic spellings like “Preiß” be standardized to modern “Preis”? How should damaged or unclear Fraktur characters be represented? For information extraction from index cards, should author names follow modern standardization or preserve historical formatting? Should “ca. 1850” be normalized to “1850” or retained as approximate dating? When multiple valid interpretations exist, aggregate performance scores mislead by obscuring the contextual and interpretive dimensions that define scholarly quality. For example, as illustrated in Figure 3, should the typewritten addressee “Max Oettinger” (see Figure 3a), the handwritten signature “Artur Oettinger-Meili” (see Figure 3b), and the abbreviated form “A. Oettinger” (see Figure 3c) be treated as referring to the same individual when contextual evidence suggests identity, even though the variants differ in form and medium and, in one case, reflect an error on the part of the sender?

Figure 3

Two letterheads and a valediction from the Business Letters benchmark v0.4.1. The strings “Max Oettinger” (Figure 3a), “Artur Oettinger-Meili” (Figure 3b), and “A. Oettinger” (Figure 3c) are name variants referring to the same individual, namely Artur Oettinger-Meili. He served as managing director (Geschäftsführer) of the Basler Personenschifffahrtsgesellschaft from 1925 to 1938; the variant “Max” results from an error on the part of the senders.

We address this by treating a benchmark as an explicit epistemic practice rather than a neutral measurement. Ground truth, in this framework, is not about absolute correctness but represents the operationalized version of specific, scholar-defined interpretive choices made in collaboration with domain specialists. For the Fraktur Adverts benchmark, domain experts established explicit policies: preserve all historical spellings, maintain original punctuation, and expand abbreviations only when these are unambiguous. The Library Cards benchmark required bibliographic specialists to define standardization rules for author names, date formatting, and incomplete entries. The Business Letters benchmark developed systematic workflows for entity identity decisions, including confidence thresholds for uncertain matches. This approach forces researchers to articulate decisions that would otherwise remain implicit, transforming tacit methodological knowledge from consulting practice into transparent parameters that can be examined, debated, and adapted by others.

This creates a clear separation between interpretive work (defining what counts as a correct transcription or an accurate entity normalization) and measurement (determining whether a model achieves that standard). Our benchmarks do not claim that a particular interpretation of “Preiß” versus “Preis” is objectively correct, but rather demonstrate that models can reliably execute discrete, well-defined tasks given specific research contexts within reproducible evaluation conditions. Rather than identifying the ‘right’ way to translate interpretive choices into concrete rules, we make this translation process transparent, shifting the burden from hidden judgment to explicit, defensible scholarly choices.

This approach necessarily focuses on measurable components (such as character-level transcription accuracy or metadata field extraction) rather than holistic insights like interpretive adequacy or contextual understanding of historical advertising practices. Automated scoring provides objective measurement, but only of those aspects that scholars can translate into discrete, comparable criteria. Digital humanities projects already depend on reliable execution of these component tasks; we make existing scholarly choices reusable and systematic rather than adding new interpretive work.
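For instance, a tolerant field-level comparison of the kind described in the notes (fuzzy string matching for structured metadata) might be sketched with Python's standard library as follows; this is a simplified illustration, not the suite's actual matching algorithm:

    # Simplified fuzzy field comparison; not the suite's actual implementation.
    from difflib import SequenceMatcher

    def field_matches(predicted: str, expected: str, threshold: float = 0.9) -> bool:
        # Treat two field values as equivalent if they are nearly identical
        # after trivial normalization (case and surrounding whitespace).
        a, b = predicted.strip().lower(), expected.strip().lower()
        return SequenceMatcher(None, a, b).ratio() >= threshold

    # "Basel." and "Basel" count as a match under this threshold, whereas
    # "Preiß" and "Preis" (similarity 0.8) do not; raising or lowering the
    # threshold is exactly the kind of explicit, scholar-defined choice
    # the benchmarks are meant to document.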

The suite supports alternative ground truths for existing benchmarks, allowing researchers who pursue different transcription strategies (e.g., normalization versus preservation of historical spellings) to measure model performance against their own explicitly defined interpretive rules. Rather than reducing interpretive plurality to a single score, the framework presents different scholarly approaches as versioned, comparable alternatives. This shifts the focus from identifying ‘best’ performance to examining how different scholarly decisions affect the reliability of LLM results. The suite thus provides methodological transparency by measuring computational alignment with scholar-defined tasks without claiming universal validity.

4 Results and Implications

The results of our tests reveal systematic differences in how LLMs process data. Variations in model responses emerge not only in performance, but also in output consistency within the same task, as well as in cost-effectiveness and latency (see Figure 4). This highlights a key challenge for humanities research projects: balancing performance with feasibility and available resources. Multi-modal tasks—especially those involving degraded or visually complex sources, such as the Fraktur Adverts benchmark—show greater variability in performance across providers, whereas simpler tasks, such as the extraction of relatively few and well-structured text fields in the Library Cards benchmark, yield largely consistent results (see Figure 5). This clear dependence on source characteristics underscores the need to test multiple providers and model configurations systematically rather than assuming uniform performance across task types and source material.

Figure 4

Screenshot from the front-end.7 The graph shows accumulated test results of the Business Letters benchmark v0.4.1 over time. Connecting lines show test runs of the same models on different dates, with colors indicating providers. The graph is interactive—only one part is shown here.

Figure 5

Comparison of average model scores of Google and OpenAI across all benchmarks. The results show substantial task-specific variance: Google achieves higher scores on Fraktur Adverts (77.1 versus OpenAI 20.6), Medieval Manuscripts (63.3 versus OpenAI 51.4), and Library Cards (82.5 versus OpenAI 77.2), while OpenAI performs better on Bibliographic Data (57.7 versus Google 36.1) and Book Advert XML (85.8 versus Google 80.5). Differences are less pronounced (<5%) for Blacklist Cards (Google 89.9, OpenAI 85.5), Business Letters (OpenAI 53.2, Google 49.7), and Company Lists (Google 42.6, OpenAI 39.6). This comparison underscores the need for systematic evaluation of humanities-relevant LLM tasks.

The Business Letters benchmark illustrates these limitations particularly well. Here, the model must process long, heterogeneous letters combining typed and handwritten elements, multiple languages, and frequent references to persons, companies, and places. Across all providers, performance remains comparatively weak, indicating that such complex correspondence corpora could usefully be examined using modular benchmarking, in which sub-tasks—such as segmentation, transcription, and entity recognition—are evaluated separately. This approach clarifies where specific errors occur, shows which parts of a workflow can be reliably automated by LLMs, and indicates where rule-based or semi-automatic methods remain preferable. More broadly, such task decomposition provides a useful outlook for future research workflows, allowing complex analytical processes to be organized as staged procedures that combine generative models with structured, deterministic components. The integration of time and cost metrics reveals marked differences in cost between providers, while variations in processing time are less pronounced. In our benchmarks, datasets are modest in size and turnaround times are generally brief, so these factors have not been substantial enough to influence whether a task is practically feasible. All in all, the benchmarks highlight that model performance is highly task- and source-dependent. Complex materials require differentiated, often modular approaches, while simpler and well-structured data can be handled more consistently. Benchmarking thus provides a practical means of mapping these boundaries to guide decisions about integrating computational methods into research workflows.

4.1 Future Directions

Building on these insights, future development of the benchmark suite will pursue two complementary directions: broadening the range of evaluated models, languages, and data types, while refining evaluation methods to assess robustness and optimize prompting strategies. New benchmarks will extend coverage across model generations, languages, and data types—including audio transcription and more text-to-text tasks—to capture a broader range of humanities workflows. Particular attention will be given to evaluating prompts, examining how minimal versus detailed task descriptions influence performance, and to robustness testing, which assesses sensitivity to image quality, layout variation, and handwriting differences. We also plan to extend the benchmark datasets by supporting multiple task definitions (via alternative data classes and corresponding ground truth), allowing the same underlying dataset to be evaluated under different task formulations; this also provides a lightweight path to add baseline tasks (e.g., language identification or document type classification) without requiring entirely new datasets.

Beyond expanding existing benchmark types, we plan to introduce mixed-media benchmarks that combine textual and image context data, and agentic tool-use benchmarks that assess multi-step workflows with external tools. The latter will enable metadata enrichment via deterministic authority-data services (e.g., Geonames look-ups, Wikidata queries) whilst leaving Named Entity Recognition to LLMs—avoiding the tendency of models to invent identifiers when none exist. We will also scale the framework by increasing both the number and size of datasets, incorporating standardized BenchmarkCards (see Sokol et al., 2024) to streamline documentation. The goal is to trace how models evolve over time and across providers: adaptive benchmarking will enable longitudinal comparison, whilst periodic re-runs on fixed samples will quantify performance drift and stability. In parallel, stronger community participation will help ensure that new benchmarks reflect the diversity of research practices, data conditions, and interpretive conventions in the humanities.

4.2 Conclusions

The RISE Humanities Data Benchmark suite measures what can be operationalized—accuracy, cost, speed, and structural consistency—whilst making the interpretive choices underlying these measurements explicit and reproducible. But its value lies equally in what it cannot measure: interpretive adequacy, disciplinary appropriateness, and scholarly judgment remain outside its scope. By documenting where automation fails, which materials resist comparison, and which task formulations align with which evaluative standards, the suite provides the empirical ground on which such judgments can be made more deliberately. This matters because computational methods enter humanities research through local, often undocumented decisions—about which models to trust, which materials to process, and which errors to tolerate. The suite cannot make these decisions, but it can render them contestable by providing structured evidence that others can verify, challenge, and extend.

Notes

[1] All software, data, and benchmark definitions are openly available at https://github.com/RISE-UNIBAS/humanities_data_benchmark. Results and visualizations can be explored through the presentation front-end at https://rise-services.rise.unibas.ch/benchmarks/.

[2] The comparatively strong presence of economic and commercial materials reflects the research context of the contributing projects rather than a predefined domain focus.

[3] NDR Core (Marti, 2024) is a Django-based framework for publishing and exploring research data in the humanities. It provides modular components for data ingestion, search, and visualization, and serves as a web application for hosting benchmark results and related metadata.

[4] Prompts are written in a minimal, task- and output-explicit manner. Prompt variants were explored on an exploratory, benchmark-specific basis during implementation. In many cases, performance was primarily constrained by the output schema and data structure, and prompt variation did not lead to substantial changes. The test-based design of the suite nevertheless allows prompt variants to be compared and evaluated in a structured manner.

[5] For example, character and word error rates for transcription tasks, F1 scores per data class field for classification tasks, and custom field-level matching algorithms for structured metadata extraction that accommodate formatting variations through fuzzy string comparison.

[6] The present-day metadata standard RDA-DACH is available (in German and French) at https://sta.dnb.de/doc/RDA.

[8] https://avisblatt.dg-basel.hasdai.org (last accessed 2026-01-16).

Acknowledgements

The RISE Humanities Data Benchmark uses data from Printed Markets – The Basler Avisblatt,8 the forthcoming dissertation project of Lea Katharina Kasper, and curated collections of the University Library Basel.

We thank the contributors to the RISE Humanities Data Benchmark, Anthea Alberto, Sven Burkhardt, Eric Decker, Pema Frick, José Luis Losada Palenzuela, Gabriel Müller, Ina Serif, and Elena Spadini, for their contributions to dataset creation, annotation, software development, and evaluation design.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Maximilian Hindermann: Conceptualization; Methodology; Software; Data curation; Writing–original draft; Writing–review & editing; Supervision. Lea Katharina Kasper: Conceptualization; Data curation; Writing–original draft. Sorin Marti: Conceptualization; Methodology; Software; Data curation; Visualization; Writing–review & editing. Arno Bosse: Writing–review & editing.

DOI: https://doi.org/10.5334/johd.470 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 15, 2025 | Accepted on: Jan 21, 2026 | Published on: Mar 2, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Maximilian Hindermann, Lea Katharina Kasper, Sorin Marti, Arno Bosse, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.