Abstract
The RISE Humanities Data Benchmark is a framework and a collection of curated datasets for evaluating large language models (LLMs) on humanities-related tasks. The datasets are designed to be small and task-specific, each accompanied by a manually verified ground truth. An accompanying tool systematically submits the datasets to various LLM providers and models using shared prompts and configurations, then automatically scores the results against the ground truths. The results are published and made searchable through a web interface. The framework aims to promote greater reproducibility, transparency, and consistency in LLM-based data processing in the humanities.
