The RISE Humanities Data Benchmark: A Framework for Evaluating Large Language Models for Humanities Tasks

Maximilian Hindermann; Sorin Marti; Lea Katharina Kasper; Arno Bosse

doi:10.5334/johd.481

References

Abdurahman, S., Salkhordeh Ziabari, A., Moore, A. K., Bartels, D. M., & Dehghani, M. (2025). A primer for evaluating large language models in social science research. Advances in Methods and Practices in Psychological Science, 8(2). 10.1177/25152459251325174
Greif, G., Griesshaber, N., & Greif, R. (2025, April). Multimodal LLMs for OCR, OCR post-correction, and named entity recognition in historical documents. arXiv. 10.48550/arXiv.2504.00414
Hauser, J., Kondor, D., Reddish, J., Benam, M., Cioni, E., Villa, F., … del Rio-Chanona, R. M. (2024, November). Large language models’ expert-level global history knowledge benchmark (hiST-LLM). In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track. Retrieved from https://openreview.net/forum?id=xlKeMuyoZ5#discussion
Kang, Z., Gong, J., Yan, J., Xia, W., Wang, Y., Wang, Z., … Li, X. (2025, June). HSSBench: Benchmarking humanities and social sciences ability for multimodal large language models. arXiv. 10.48550/arXiv.2506.03922
Karjus, A. (2025, February). Machine-assisted quantitizing designs: Augmenting humanities and social sciences with artificial intelligence. Humanities and Social Sciences Communications, 12(1), 277. 10.1057/s41599-025-04503-w
Kraus, F., Blumenröhr, N., Götzelmann, G., Tonne, D., & Streit, A. (2025). A gold standard benchmark dataset for digital humanities. In OM-2024: The 19th International Workshop on Ontology Matching collocated with the 23rd International Semantic Web Conference (ISWC 2024). November 11th, Baltimore, USA. 10.5445/ir/1000178023
Simons, A., Zichert, M., & Wüthrich, A. (2025, June). Large language models for history, philosophy, and sociology of science: Interpretive uses, methodological challenges, and critical perspectives. arXiv. 10.48550/arXiv.2506.12242
Sokol, A., Daly, E., Hind, M., Piorkowski, D., Zhang, X., Moniz, N., & Chawla, N. (2025). BenchmarkCards: Standardized documentation for large language model benchmarks. arXiv. 10.48550/arXiv.2410.12974
Spinaci, G., Klic, L., & Colavizza, G. (2025, September). Benchmarking vision–language and multimodal large language models in zero-shot and few-shot scenarios: A study on Christian iconography. arXiv. 10.63744/oxWtm5MhhwBH
Ziems, C., Held, W., Shaikh, O., Chen, J., Zhang, Z., & Yang, D. (2024, March). Can large language models transform computational social science? Computational Linguistics, 50(1), 237–291. 10.1162/coli_a_00502
