Benchmarking as Source Criticism: From Recognition to Reasoning in LLM Assessment

Daniel Hutchinson

doi:10.5334/johd.489

References

Chartier, M., Dakkoune, N., Bourgeois, G., & Jean, S. (2025). HiBenchLLM: Historical inquiry benchmarking for large language models. Data & Knowledge Engineering, 156, 102383. 10.1016/j.datak.2024.102383
Chen, X., Zhou, S., Liang, K., Yuan, D., Chen, H., Sun, X., Meng, L., & Liu, X. (2025). Putting on the thinking hats: A survey on chain-of-thought fine-tuning from the perspective of human reasoning mechanism. arXiv. 10.48550/arXiv.2510.13170
De Ninno, F., & Lacriola, M. (2025). Mussolini and ChatGPT: Examining the risks of AI writing historical narratives on Fascism. Journal of Modern Italian Studies, 30(2), 187–209. 10.1080/1354571X.2025.2457250
Hauser, J., Kondor, D., Reddish, J., Benam, M., Cioni, E., Villa, F., Bennett, J. S., Hoyer, D., Francois, P., Turchin, P., & del Rio-Chanona, R. M. (2024). Large language models’ expert-level global history knowledge benchmark (HiST-LLM). In Proceedings of the 38th International Conference on Neural Information Processing Systems (Article 1016). Last accessed February 18, 2026. https://dl.acm.org/doi/10.5555/3737916.3738932
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. arXiv. 10.48550/arXiv.2009.03300
Hutchinson, D. (2024). Mapping the latent past: Assessing large language models as digital tools through source criticism. Journal of Digital History, 3(1). 10.1515/jdh-2023-0018
Li, W., Ma, R., Wu, J., Gu, C., Peng, J., Len, J., Zhang, S., Yan, H., Lin, D., & He, C. (2024). FoundaBench: Evaluating Chinese fundamental knowledge capabilities of large language models. arXiv. 10.48550/arXiv.2404.18359
Marshall, L. (2020). The strange world of AP U.S. history. Contingent Magazine. Last accessed February 18, 2026. https://contingentmagazine.org/2020/10/20/apush/
Mollick, E. (2024). Co-intelligence: Living and working with AI. London: Ebury Publishing.
Qiu, J., Xiao, F., Wang, Y., Mao, Y., Chen, Y., Juan, X., Zhang, S., Wang, S., Qi, X., Zhang, T., Yao, Z., Guo, J., Lu, Y., Argon, C., Cui, J., Chen, D., Zhou, J., Zhou, S., Zhou, Z., … Wang, M. (2025). On path to multimodal historical reasoning: HistBench and HistAgent. arXiv. 10.48550/arXiv.2505.20246
Shao, Y., Jiang, Y., Kanell, T., Xu, P., Khattab, O., & Lam, M. (2024). Assisting in writing Wikipedia-like articles from scratch with large language models. arXiv. 10.48550/arXiv.2402.14207
Stanford Center for Research on Foundation Models. (n.d.). HELM MMLU leaderboard. Retrieved November 23, 2025, from https://crfm.stanford.edu/helm/mmlu/latest/
Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., Li, T., Ku, M., Wang, K., Zhuang, A., Fan, R., Yue, X., & Chen, W. (2024). MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. arXiv. 10.48550/arXiv.2406.01574
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv. 10.48550/arXiv.2201.11903
Wong, A. (2018). The controversy over just how much history AP World History should cover. The Atlantic. Last accessed February 18, 2026. https://www.theatlantic.com/education/archive/2018/06/ap-world-history-controversy/562778/
Xu, R., Wang, Z., Fan, R.-Z., & Liu, P. (2024). Benchmarking benchmark leakage in large language models. arXiv. 10.48550/arXiv.2404.18824
Zaagsma, G. (2023). Digital history and the politics of digitization. Digital Scholarship in the Humanities, 37(3), 830–851. 10.1093/llc/fqac050
Zhao, Q., Huang, Y., Lv, T., Cui, L., Sun, Q., Mao, S., Zhang, X., Xin, Y., Yin, Q., Li, S., & Wei, F. (2024). MMLU-CF: A contamination-free multi-task language understanding benchmark. arXiv. 10.48550/arXiv.2412.15194
