Abstract
Large language models (LLMs) are increasingly used in historical research and pedagogy, yet assessments of their capabilities tell conflicting stories. Leading benchmarks suggest LLMs achieve expert-level performance in historical domains, while other evaluations reveal dramatic failures on core forms of historical reasoning. This paper examines five benchmarks assessing the historical competencies of LLMs to identify methodological approaches for devising assessments that reveal rather than obscure model limitations. Comparative analysis shows a collapse in LLM performance as assessments move from contaminated to decontaminated datasets, from Western to global knowledge domains, and from multiple-choice questions to open-ended responses. These patterns expose a fundamental gap between pattern recognition, where models excel, and historical reasoning, which remains beyond the current capabilities of “frontier” models. In making these claims, the paper introduces four dimensions of assessment for developing future benchmarks. As LLMs become more deeply integrated into our digital lives, rigorous benchmarking is essential for matching their actual strengths to appropriate tasks while reserving human judgment for the interpretative work at the heart of humanistic scholarship.
