
The RISE Humanities Data Benchmark: A Framework for Evaluating Large Language Models for Humanities Tasks

Open Access | Feb 2026

DOI: https://doi.org/10.5334/johd.481 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 15, 2025 | Accepted on: Jan 7, 2026 | Published on: Feb 4, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Maximilian Hindermann, Sorin Marti, Lea Katharina Kasper, Arno Bosse, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.