References
- Chartier, M., Dakkoune, N., Bourgeois, G., & Jean, S. (2025). HiBenchLLM: Historical inquiry benchmarking for large language models. Data & Knowledge Engineering, 156, 102383. 10.1016/j.datak.2024.102383
- Chen, X., Zhou, S., Liang, K., Yuan, D., Chen, H., Sun, X., Meng, L., & Liu, X. (2025). Putting on the thinking hats: A survey on chain-of-thought fine-tuning from the perspective of human reasoning mechanism. arXiv. 10.48550/arXiv.2510.13170
- De Ninno, F., & Lacriola, M. (2025). Mussolini and ChatGPT: Examining the risks of AI writing historical narratives on Fascism. Journal of Modern Italian Studies, 30(2), 187–209. 10.1080/1354571X.2025.2457250
- Hauser, J., Kondor, D., Reddish, J., Benam, M., Cioni, E., Villa, F., Bennett, J. S., Hoyer, D., Francois, P., Turchin, P., & del Rio-Chanona, R. M. (2024). Large language models’ expert-level global history knowledge benchmark (HiST-LLM). In Proceedings of the 38th International Conference on Neural Information Processing Systems (Article 1016). Last accessed February 18, 2026. https://dl.acm.org/doi/10.5555/3737916.3738932
- Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. arXiv. 10.48550/arXiv.2009.03300
- Hutchinson, D. (2024). Mapping the latent past: Assessing large language models as digital tools through source criticism. Journal of Digital History, 3(1). 10.1515/jdh-2023-0018
- Li, W., Ma, R., Wu, J., Gu, C., Peng, J., Len, J., Zhang, S., Yan, H., Lin, D., & He, C. (2024). FoundaBench: Evaluating Chinese fundamental knowledge capabilities of large language models. arXiv. 10.48550/arXiv.2404.18359
- Marshall, L. (2020). The strange world of AP U.S. history. Contingent Magazine. Last accessed February 18, 2026. https://contingentmagazine.org/2020/10/20/apush/
- Mollick, E. (2024). Co-intelligence: Living and working with AI. London: Ebury Publishing.
- Qiu, J., Xiao, F., Wang, Y., Mao, Y., Chen, Y., Juan, X., Zhang, S., Wang, S., Qi, X., Zhang, T., Yao, Z., Guo, J., Lu, Y., Argon, C., Cui, J., Chen, D., Zhou, J., Zhou, S., Zhou, Z., … Wang, M. (2025). On path to multimodal historical reasoning: HistBench and HistAgent. arXiv. 10.48550/arXiv.2505.20246
- Shao, Y., Jiang, Y., Kanell, T., Xu, P., Khattab, O., & Lam, M. (2024). Assisting in writing Wikipedia-like articles from scratch with large language models. arXiv. 10.48550/arXiv.2402.14207
- Stanford Center for Research on Foundation Models. (n.d.). HELM MMLU leaderboard. Retrieved November 23, 2025, from https://crfm.stanford.edu/helm/mmlu/latest/
- Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., Li, T., Ku, M., Wang, K., Zhuang, A., Fan, R., Yue, X., & Chen, W. (2024). MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. arXiv. 10.48550/arXiv.2406.01574
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv. 10.48550/arXiv.2201.11903
- Wong, A. (2018). The controversy over just how much history AP World History should cover. The Atlantic. Last accessed February 18, 2026. https://www.theatlantic.com/education/archive/2018/06/ap-world-history-controversy/562778/
- Xu, R., Wang, Z., Fan, R.-Z., & Liu, P. (2024). Benchmarking benchmark leakage in large language models. arXiv. 10.48550/arXiv.2404.18824
- Zaagsma, G. (2023). Digital history and the politics of digitization. Digital Scholarship in the Humanities, 37(3), 830–851. 10.1093/llc/fqac050
- Zhao, Q., Huang, Y., Lv, T., Cui, L., Sun, Q., Mao, S., Zhang, X., Xin, Y., Yin, Q., Li, S., & Wei, F. (2024). MMLU-CF: A contamination-free multi-task language understanding benchmark. arXiv. 10.48550/arXiv.2412.15194
