Benchmarking as Source Criticism: From Recognition to Reasoning in LLM Assessment

Open Access | Mar 2026

References

  1. Chartier, M., Dakkoune, N., Bourgeois, G., & Jean, S. (2025). HiBenchLLM: Historical inquiry benchmarking for large language models. Data & Knowledge Engineering, 156, 102383. 10.1016/j.datak.2024.102383
  2. Chen, X., Zhou, S., Liang, K., Yuan, D., Chen, H., Sun, X., Meng, L., & Liu, X. (2025). Putting on the thinking hats: A survey on chain-of-thought fine-tuning from the perspective of human reasoning mechanism. arXiv. 10.48550/arXiv.2510.13170
  3. De Ninno, F., & Lacriola, M. (2025). Mussolini and ChatGPT: Examining the risks of AI writing historical narratives on Fascism. Journal of Modern Italian Studies, 30(2), 187–209. 10.1080/1354571X.2025.2457250
  4. Hauser, J., Kondor, D., Reddish, J., Benam, M., Cioni, E., Villa, F., Bennett, J. S., Hoyer, D., Francois, P., Turchin, P., & del Rio-Chanona, R. M. (2024). Large language models’ expert-level global history knowledge benchmark (HiST-LLM). In Proceedings of the 38th International Conference on Neural Information Processing Systems (Article 1016). Last accessed February 18, 2026. https://dl.acm.org/doi/10.5555/3737916.3738932
  5. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. arXiv. 10.48550/arXiv.2009.03300
  6. Hutchinson, D. (2024). Mapping the latent past: Assessing large language models as digital tools through source criticism. Journal of Digital History, 3(1). 10.1515/jdh-2023-0018
  7. Li, W., Ma, R., Wu, J., Gu, C., Peng, J., Len, J., Zhang, S., Yan, H., Lin, D., & He, C. (2024). FoundaBench: Evaluating Chinese fundamental knowledge capabilities of large language models. arXiv. 10.48550/arXiv.2404.18359
  8. Marshall, L. (2020). The strange world of AP U.S. history. Contingent Magazine. Last accessed February 18, 2026. https://contingentmagazine.org/2020/10/20/apush/
  9. Mollick, E. (2024). Co-intelligence: Living and working with AI. London: Ebury Publishing.
  10. Qiu, J., Xiao, F., Wang, Y., Mao, Y., Chen, Y., Juan, X., Zhang, S., Wang, S., Qi, X., Zhang, T., Yao, Z., Guo, J., Lu, Y., Argon, C., Cui, J., Chen, D., Zhou, J., Zhou, S., Zhou, Z., … Wang, M. (2025). On path to multimodal historical reasoning: HistBench and HistAgent. arXiv. 10.48550/arXiv.2505.20246
  11. Shao, Y., Jiang, Y., Kanell, T., Xu, P., Khattab, O., & Lam, M. (2024). Assisting in writing Wikipedia-like articles from scratch with large language models. arXiv. 10.48550/arXiv.2402.14207
  12. Stanford Center for Research on Foundation Models. (n.d.). HELM MMLU leaderboard. Retrieved November 23, 2025, from https://crfm.stanford.edu/helm/mmlu/latest/
  13. Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., Li, T., Ku, M., Wang, K., Zhuang, A., Fan, R., Yue, X., & Chen, W. (2024). MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. arXiv. 10.48550/arXiv.2406.01574
  14. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv. 10.48550/arXiv.2201.11903
  15. Wong, A. (2018). The controversy over just how much history AP World History should cover. The Atlantic. Last accessed February 18, 2026. https://www.theatlantic.com/education/archive/2018/06/ap-world-history-controversy/562778/
  16. Xu, R., Wang, Z., Fan, R.-Z., & Liu, P. (2024). Benchmarking benchmark leakage in large language models. arXiv. 10.48550/arXiv.2404.18824
  17. Zaagsma, G. (2023). Digital history and the politics of digitization. Digital Scholarship in the Humanities, 37(3), 830–851. 10.1093/llc/fqac050
  18. Zhao, Q., Huang, Y., Lv, T., Cui, L., Sun, Q., Mao, S., Zhang, X., Xin, Y., Yin, Q., Li, S., & Wei, F. (2024). MMLU-CF: A contamination-free multi-task language understanding benchmark. arXiv. 10.48550/arXiv.2412.15194
DOI: https://doi.org/10.5334/johd.489 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 24, 2025 | Accepted on: Jan 29, 2026 | Published on: Mar 13, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Daniel Hutchinson, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.