(1) Context and Motivation
Generative artificial intelligence (AI) and large language models (LLMs) are increasingly common tools in humanistic research and pedagogy. As these technologies reshape scholarship and teaching, determining their capabilities and limitations becomes critical. LLMs demonstrate remarkable abilities across a range of tasks, yet the actual capacities of so-called “frontier” models remain uncertain – especially in core practices like analytical reasoning and interpretive synthesis. Effective use requires understanding which tasks align with model training and which exceed current capabilities.
Digital humanists insist that algorithmic systems be evaluated not just for accuracy but for the assumptions and social dynamics they encode. Evaluating LLMs requires the same scrutiny. LLM benchmarks measuring how different models perform across various knowledge domains, such as history, offer an approach to this challenge. Yet benchmarks tell conflicting stories. The most widely used benchmark for history suggests LLMs possess expert-level capacities, with leading models exceeding 90% accuracy. But other assessments reveal significant shortcomings, with models struggling to achieve basic competence on core historical reasoning tasks. What then do benchmarks actually measure? Whose historical knowledge do these metrics encode? What limitations do such benchmarks illuminate and obscure? Considering these questions not only helps scholars match tasks to capabilities, but can also inform debates about the effective and ethical use of LLMs in historical scholarship.
This paper examines benchmark studies to identify which methodological dimensions enable effective LLM assessment. These studies reveal a systematic pattern: performance collapses as assessments move from contaminated to decontaminated datasets, from Western to global knowledge domains, from multiple-choice questions to open-ended responses, and from factual recall to historiographical interpretation. Charting this collapse reveals that models excel at matching patterns in questions resembling their training data, a capability that produces impressive multiple-choice scores. Yet historical reasoning demands synthesis across evidence, construction of causal arguments, and development of contextual explanations. At present, these studies suggest that such competencies remain beyond today’s “frontier” models.
However, rapid LLM development requires equally robust innovation in evaluation methods. “Reasoning” models, retrieval-augmented generation (RAG), and agentic systems have expanded LLM capabilities while introducing new vulnerabilities. As these technologies develop, so too must the approaches used by digital humanists to effectively measure, assess, and critique these systems. While such approaches may require highly technical frameworks, the core humanistic practice of source criticism offers an accessible means of understanding both the strengths and weaknesses of generative AI.
(2) From Source Criticism to Historical Benchmarking
While LLM benchmarks seek to quantitatively measure effectiveness across domains, their construction demands qualitative judgments about what to measure. What counts as “historical knowledge,” which curricula merit inclusion, and what forms of competence demand measurement? Just as important is the composition of LLMs themselves. Trained on internet-scale data sources, LLMs reflect the expansive diversity of the digital world. Yet there are gaps and distortions in these sources. Those same features inform the inherited “knowledge” of LLMs and their outputs. LLMs can thus be understood as historical sources, shaped by their training data and the contingencies of their creation. Accordingly, source criticism offers insights into both benchmarks and the LLMs they measure.
First, data provenance is crucial to benchmark effectiveness. A major risk in LLM evaluation is benchmark contamination, where models have been exposed to the very measures used to evaluate them. Does a model’s performance reflect genuine capability, or is it simply reproducing data encountered during training? LLMs are not trained on uniform distributions of human knowledge but on materials dominated by specific places, periods, and languages. Benchmarks often replicate these patterns, emphasizing particular aspects of our shared cultural heritage.
The distinction between factual recognition and historical reasoning is also crucial. How benchmarks measure competency shapes what insights they offer. Benchmarks employing multiple-choice questions test very different capabilities than those demanding open-ended synthesis. The first approach allows models to match question patterns to factual information in their training data and select correct answers from provided options. The second approach demands weighing evidence, developing analytical frameworks, and constructing causal narratives. While multiple-choice formats usefully measure factual recall, they cannot assess whether models understand how historical knowledge is constructed or if they can synthesize evidence into original arguments. Format thus determines whether benchmarks measure memorized patterns or historical thinking.
In examining historical benchmarking of LLMs, this study offers the evaluation framework provided in Table 1:
Table 1
Framework for Evaluating LLM Benchmarks for Historical Knowledge.
| FRAMEWORK DIMENSION | KEY QUESTION | WHAT THIS DIMENSION TESTS | ASSESSMENT APPROACH |
|---|---|---|---|
| Contamination Resistance | Does performance reflect model memorization or historical reasoning? | Whether models encountered evaluation questions during training. | Compare performance on contaminated vs. decontaminated datasets. |
| Content Diversity | Does model competence generalize beyond well-represented domains? | How performance varies across geographic, temporal, and linguistic contexts. | Test across global knowledge domains, not just Western and well-represented domains. |
| Format Diversity | Can pattern recognition transfer to reasoning? | Whether LLMs can extend factual recall into analytical examination. | Compare multiple-choice identification (recognition) vs. open-ended synthesis (reasoning). |
| Epistemological Sophistication | Can models distinguish how we know from what we know? | Whether models understand how historical knowledge is constructed. | Examine capacities for source criticism, evidential weighing, and inferential reasoning. |
The benchmarks examined here each illuminate distinct dimensions of this framework. By applying these criteria, we can move beyond surface-level performance scores to understand what different assessments actually measure, and what they obscure. We begin with the most widely cited benchmark, which reveals the challenge of contamination resistance.
(3) What Do 90% Scores Really Measure? MMLU and Contamination Resistance
Developed in 2021 by researchers led by Dan Hendrycks, the Massive Multitask Language Understanding (MMLU) benchmark has become a standard measurement for assessing generative AI across 57 academic fields. MMLU measures LLM competencies through nearly 16,000 questions ranging from elementary to postgraduate difficulty. Yet its reliance on publicly available curricula creates validity concerns that illuminate why contamination resistance matters for rigorous evaluation. For example, the MMLU evaluates historical knowledge through some 600 questions taken from the Advanced Placement (A.P.) curricula for U.S., European, and World history. Critically, these assessments were assembled from the open web, where extensive test prep materials are publicly available. This public availability complicates assessment (Hendrycks et al., 2021).
In this assessment, LLMs receive an excerpt from a historical source followed by a multiple-choice question with four possible answers:
U.S. History Benchmark, Question 5:
This question refers to the following information.
“I was once a tool of oppression
And as green as a sucker could be
And monopolies banded together
To beat a poor hayseed like me.”
“The railroads and old party bosses
Together did sweetly agree;
And they thought there would be little trouble
In working a hayseed like me. …”
The Hayseed
The song, and the movement that it was connected to, highlight which of the following developments in the broader society in the late 1800s?
A: Corruption in government, especially as it related to big business, energised the public to demand increased popular control and reform of local, state, and national governments.
B: A large-scale movement of struggling African American and white farmers, as well as urban factory workers, was able to exert a great deal of leverage over federal legislation.
C: The two-party system of the era broke down and led to the emergence of an additional major party that was able to win control of Congress within ten years of its founding.
D: Continued skirmishes on the frontier in the 1890s with American Indians created a sense of fear and bitterness among western farmers.
Correct Answer: A
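To make the evaluation procedure concrete, the sketch below shows how an item of this kind can be rendered as a prompt and scored automatically. It is a minimal illustration only: the item text is abbreviated, and `query_model` is a hypothetical stand-in for any LLM API rather than the harness used by Hendrycks et al. (2021).

```python
# Minimal sketch of multiple-choice evaluation in the MMLU style.
# `query_model` is a hypothetical LLM API call assumed to return a short
# text completion (e.g., a single letter); it is not defined here.

def format_item(passage: str, question: str, choices: dict[str, str]) -> str:
    """Render a source excerpt, question, and labeled options as one prompt."""
    options = "\n".join(f"{label}: {text}" for label, text in choices.items())
    return (
        f"{passage}\n\n{question}\n{options}\n"
        "Answer with the letter of the correct option.\nAnswer:"
    )

def score_item(model_output: str, correct_label: str) -> bool:
    """Credit the model only if its first letter matches the keyed answer."""
    predicted = model_output.strip()[:1].upper()
    return predicted == correct_label

# Abbreviated version of the 'Hayseed' item shown above.
prompt = format_item(
    passage='"I was once a tool of oppression / And as green as a sucker could be..."',
    question="The song, and the movement it was connected to, highlight which development?",
    choices={"A": "Corruption in government energised demands for reform...",
             "B": "A farmer-labor movement exerted leverage over federal legislation...",
             "C": "The two-party system broke down...",
             "D": "Frontier skirmishes created fear among western farmers..."},
)
# accuracy = mean(score_item(query_model(p), key) for p, key in test_set)
```

Because the model needs only to emit a letter matching the keyed answer, scoring is trivially automatable, which helps explain why multiple-choice benchmarks scale so easily.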
MMLU was first tested in 2021 against the then-leading LLM, GPT-3. Accuracy of 25% represented random chance; 90% represented estimated “expert-level” performance. The MMLU researchers derived this threshold by aggregating 95th percentile human performance across professional examinations and academic subjects. For history, this means the 90% threshold actually reflects strong A.P. exam performance, not professional historical expertise, making “expert-level” a misleading descriptor. Despite this conflation, the threshold provides a consistent baseline for comparing model performance across academic domains and tracking capabilities over time (Hendrycks et al., 2021).
GPT-3 achieved over 50% accuracy across three history subfields, placing history among the top third of disciplines tested, though falling short of the 90% threshold (Hendrycks et al., 2021). Within a few years, however, GPT-3’s successors crossed this threshold decisively. The data in Figure 1 below, drawn from a 2025 replication study by Stanford’s Center for Research on Foundation Models, show that leading LLMs now achieve 90%+ accuracy on all three history assessments (Stanford CRFM, 2025: European History, US History, and World History sections).

Figure 1
Accuracy of Selected LLMs on History Questions in the MMLU Benchmark: HELM Subject Leaderboards (High School History), January 2025. GPT-3 data from Hendrycks et al. (2021); all other models from Stanford CRFM (2025).
In just a few years, commercial and open-source models alike have exceeded the 90% threshold on all three history subject exams. Similar progress has been made by LLMs in the other academic fields tested by this benchmark. MMLU’s accessibility, versatility, and scope have made it the most widely used AI benchmark for historical knowledge and other knowledge domains.
Yet this widespread adoption makes the assessment’s methodological limitations especially significant. Given the scale of LLM training sets, many models have been trained on the very questions meant to test their competencies, a phenomenon known as “benchmark leakage” (Xu et al., 2024). The A.P. curriculum’s significant online presence, through test prep materials, study apps, and uploaded exams, makes such contamination nearly inevitable. Indeed, the evidence for leakage is compelling. One study found that when prompted with MMLU questions without answer choices, some LLMs reproduced the exact answer options 52–57% of the time, strong evidence of training data contamination (Zhao et al., 2024). This challenge exposes a fundamental tension in benchmarking web-scale models: evaluation datasets must be accessible enough for reproducible research, yet protected enough to prevent absorption into training corpora.
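The leakage probe reported by Zhao et al. (2024) can be approximated in a few lines: prompt the model with the question stem alone and check whether its completion reproduces the withheld answer options. The sketch below is an illustrative simplification rather than the authors’ exact protocol; `query_model` and the matching threshold are assumptions.

```python
# Illustrative contamination probe: prompt with the question stem only and
# check whether the completion reproduces the withheld answer options.
# `query_model` is a hypothetical LLM call; the matching rule is a
# simplification of the procedure reported by Zhao et al. (2024).

def reproduces_options(completion: str, answer_options: list[str],
                       min_matches: int = 3) -> bool:
    """Flag a completion containing most of the original options verbatim."""
    matches = sum(1 for option in answer_options
                  if option.lower() in completion.lower())
    return matches >= min_matches

def contamination_rate(items: list[dict], query_model) -> float:
    """Share of items whose answer options the model can regenerate unseen."""
    flagged = 0
    for item in items:
        stem_only_prompt = f"{item['question']}\nList the answer choices for this question:"
        completion = query_model(stem_only_prompt)
        if reproduces_options(completion, item["options"]):
            flagged += 1
    return flagged / len(items) if items else 0.0
```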
Beyond direct contamination, curriculum alignment presents a second validity concern. Even without encountering specific MMLU questions, models are trained extensively on primary sources, textbooks, and instructional materials central to A.P. curricula. This creates a teach-to-test dynamic: models absorb not just evaluation questions but the entire curriculum. This alignment differs from direct contamination. Models may not have seen specific MMLU questions, but their training saturates them with the historical content, interpretive frameworks, and question patterns that shape A.P. curricula. High scores may thus reflect training data alignment with A.P.’s particular historical geography rather than general historical competence.
Researchers responded with revisions aimed at eliminating benchmark leakage and increasing difficulty. MMLU-Pro (2024) increased rigor by filtering half of the easiest questions and expanding answer choices from four to ten options (Wang et al., 2024). MMLU-CF (2024) took a more comprehensive approach, creating 20,000 new evaluation questions through sophisticated data decontamination methods and establishing closed question sets withheld from future LLM training datasets. History represented one of the largest categories in MMLU-CF, constituting 11% of the benchmark, or over 2,000 questions (Zhao et al., 2024).
As the data in Figure 2 demonstrates, performance dropped substantially from 90%+ to the 68–77% range across models, still well above the 25% random baseline but revealing clear contamination effects. These results suggest LLMs possess genuine facility with pattern recognition, at least within a constrained geography of knowledge.

Figure 2
Comparison of LLM Performance on MMLU Variants: Overall Accuracy vs. History Questions. MMLU-Pro data (left) from Wang et al. (2024); MMLU-CF data (right) from Zhao et al. (2024).
However, whose historical knowledge do these patterns encode? This question introduces the second dimension of our framework: content diversity. Do models perform equally across different geographic, temporal, and linguistic domains, or do their capabilities reflect the provinciality of their training data? Both the benchmark and the models it measures inherit the same geographic constraints: training data and evaluation materials draw from identical historical geographies. This alignment makes impressive scores less revealing than they might initially suggest. MMLU’s history questions, rooted in A.P. curricula, represent a particular distribution of historical knowledge. While rigorous, A.P. emphasizes Western historical frames shaped by U.S. educational politics (Marshall, 2020; Wong, 2018). LLM training data does not transcend this geography but reinforces it. Internet-scale datasets are vast, yet they privilege certain domains over others. Models thus inherit a geography reflecting not universal competence but the specific patterns of which histories have been digitized, excelling where training data is abundant.
As one of the earliest LLM benchmarks, MMLU provided a valuable baseline for assessing historical competencies in generative AI. The limitations identified above constrain what conclusions can be drawn from these assessments. A.P. materials represent valuable pedagogical assessments for secondary students but are not scholarly evaluation tools. They emphasize breadth over depth, favor recognition over reasoning, and reflect curricular politics rather than disciplinary practices. As publicly available educational materials, A.P. items sit at lower tiers of an evidence hierarchy – below the expert-constructed, decontamination-resistant assessments required for robust evaluation. Triangulation with expert-constructed benchmarks confirms these limitations: the same models that achieve 90%+ on MMLU score 46% or below on curator-reviewed global history assessments and 38% on expert-evaluated open-ended questions, a performance collapse examined in subsequent sections. Whether pattern recognition within a specific historical geography translates to broad competence is the question the next benchmark addresses through global history assessment.
(4) Testing Content Diversity: Global and Cross-Cultural Benchmarks
The History Seshat Test for LLMs (HiST-LLM) directly tests whether LLM historical competencies extend beyond Western knowledge domains. Unlike MMLU’s reliance on A.P. curricula, HiST-LLM draws from the Seshat Global History Databank, a structured repository containing 36,000 data points covering 600 historical societies across every United Nations region. These data range from the Neolithic period to the early modern era. Critically, this dataset remained off the open web until release, substantially reducing contamination risk (Hauser et al., 2024).
HiST-LLM uses multiple-choice questions, but unlike MMLU it requires models to distinguish aspects of the historical record itself. Questions ask what can be directly evidenced in the historical record and what must be inferred. Rather than asking “did this society use writing?”, the benchmark asks whether writing was directly attested, inferred present, inferred absent, or directly evidenced as absent. This approach is meant to mirror how historians distinguish direct evidence from inference in their own analyses.
Consider the following example. Here an LLM is queried using chain of thought (CoT) prompting, which asks models to articulate their reasoning process in natural language (often called a “reasoning trace”) before providing an answer. This technique has been shown to improve performance on complex reasoning tasks, and serves as the foundation for the development of “reasoning” LLMs (Wei et al., 2022).
The characteristic ’Shields’ is categorized under ’Armor’. Was it present, inferred present, inferred absent, or absent for the polity called ’Latium- Bronze Age’, during the time frame from 1800 BCE to 900 BCE?
Options:
A: Present,
B: Inferred Present,
C: Inferred Absent,
D: Absent
Reasoning and evidence: Weapons, statuettes, and “double shields” found in male burials suspected to infer elite military or religious status.
Answer: A
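Issuing such a query programmatically requires only modest prompt scaffolding. The sketch below illustrates one way a chain-of-thought prompt of this kind might be constructed and its final answer extracted; the instruction wording and the `query_model` call are illustrative assumptions, not the HiST-LLM evaluation harness.

```python
# Minimal chain-of-thought (CoT) prompt for a HiST-LLM-style item.
# The instruction wording is illustrative; `query_model` is a hypothetical LLM call.

COT_TEMPLATE = (
    "The characteristic '{variable}' is categorized under '{category}'. "
    "Was it present, inferred present, inferred absent, or absent for the polity "
    "called '{polity}', during the time frame from {start} to {end}?\n"
    "Options:\nA: Present\nB: Inferred Present\nC: Inferred Absent\nD: Absent\n"
    "Think step by step about the available evidence, then give a final line "
    "of the form 'Answer: <letter>'."
)

def ask_with_cot(query_model, **fields) -> tuple[str, str]:
    """Return the model's reasoning trace and its extracted final answer letter."""
    completion = query_model(COT_TEMPLATE.format(**fields))
    answer = ""
    for line in completion.splitlines():
        if line.strip().lower().startswith("answer:"):
            answer = line.split(":", 1)[1].strip()[:1].upper()
    return completion, answer

# Illustrative call using the Latium question above:
# trace, answer = ask_with_cot(query_model, variable="Shields", category="Armor",
#                              polity="Latium- Bronze Age", start="1800 BCE", end="900 BCE")
```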
While this structure reflects expert coders’ epistemological judgments, the multiple-choice format leaves unclear whether models engage in similar reasoning. When tested on a simplified two-choice format (present/absent only), accuracy improved from 34–46% to 58–63%, suggesting inference categories add substantial difficulty. However, the benchmark did not examine the reasoning traces to assess how models evaluate historical evidence. HiST-LLM’s primary strength thus lies in testing factual knowledge across global domains rather than epistemological sophistication.
This global scope helps map the geographies of knowledge contained in LLM training data. While LLM training corpora are vast, they reflect the uneven distribution of the digitized past. Models excel when encountering extensively documented domains but struggle when sources are scarce. Benchmarks like HiST-LLM chart not just what models “know” but whose histories have been digitized.
HiST-LLM carries important methodological limitations. The Seshat dataset was compiled primarily from English-language scholarship, resulting in less comprehensive coverage for non-English-speaking regions. The Seshat dataset contains far more “present” codes (47%) than “absent” codes (20%). To prevent models from achieving inflated scores by guessing “present,” the benchmark uses balanced accuracy, a statistical adjustment accounting for this imbalance. These limitations do not undermine the benchmark’s utility but contextualize its findings: like all historical datasets, HiST-LLM reflects accessible evidence rather than the totality of human experience (Hauser et al., 2024).
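Balanced accuracy is simply the unweighted mean of per-class recall, so a model cannot inflate its score by defaulting to the dominant “present” label. A minimal sketch follows, using illustrative class counts; only the 47% and 20% figures come from the benchmark description.

```python
# Balanced accuracy: the unweighted mean of per-class recall, so frequent
# classes such as "present" do not dominate the score.

def balanced_accuracy(gold: list[str], predicted: list[str]) -> float:
    classes = sorted(set(gold))
    recalls = []
    for cls in classes:
        relevant = [(g, p) for g, p in zip(gold, predicted) if g == cls]
        correct = sum(1 for g, p in relevant if g == p)
        recalls.append(correct / len(relevant))
    return sum(recalls) / len(recalls)

# Illustrative class counts: only the 47% "present" and 20% "absent" shares
# are reported in the benchmark; the inferred categories are assumed here.
gold = (["present"] * 47 + ["absent"] * 20 +
        ["inferred_present"] * 18 + ["inferred_absent"] * 15)
always_present = ["present"] * len(gold)
print(balanced_accuracy(gold, always_present))  # 0.25
```

Under this metric, a model that always answers “present” earns 25% in the four-choice setting, no better than random guessing.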
The measures illustrated in Figure 3 confirm these limitations. Researchers tested seven leading LLMs: performance ranged from 34% (Llama-3.1-8B) to just 46% (GPT-4-Turbo). Such results remain well above random chance but fall far short of the 90% threshold established by frameworks like the MMLU. The same models that achieved 68–77% on decontaminated Western-focused benchmarks fell to 46% or below on global assessments. Such findings illustrate the importance of testing for content diversity: models trained predominantly on Western historical materials struggle when that geographic advantage disappears.

Figure 3
Balanced Accuracy of LLMs on HiST-LLM Global History Benchmark (4-Choice Setting). Data from Hauser et al. (2024).
Figure 4 illustrates regional disparities, revealing performance gaps across geography and historical periods. Models performed strongest in North America (41–46%) and Latin America (39–49%) and weakest in Oceania (32–40%) and Sub-Saharan Africa (35–43%). While absolute differences are modest, they suggest a consistent pattern: models perform best where English-language scholarship is most abundant.

Figure 4
Balanced Accuracy of Selected LLMs by Region: HiST-LLM Global History Benchmark (4-Choice). Data from Hauser et al. (2024).
These patterns manifest in sometimes surprising ways. Models achieved the highest regional performance in Latin America, likely reflecting Seshat’s dataset composition. Seshat’s Latin American coverage concentrates on major pre-Columbian civilizations, such as the Aztec, Inca, and Maya empires, which feature prominently in world history curricula and English-language scholarship. Thus, Latin America’s high scores reflect not a broad-based assessment of the region’s diversity, but alignment between benchmark measures and training data emphasis on noteworthy “world civilizations.” Even within a global benchmark, representation remains uneven, favoring societies prominent in world history curricula and LLM training data.
Lower performance in Oceania and Sub-Saharan Africa suggests another dimension of training data influence. Long-standing inequalities in digitization investment shape which societies’ histories become available at the scale necessary for LLM training (Zaagsma, 2023). While many training data corpora remain proprietary, consistent patterns across benchmarks suggest digital divides become embedded in model architectures. These gaps in digitized knowledge become gaps in model performance, encoding historical inequalities into computational systems.
(5) Cross-Cultural Diversity: The FoundaBench Chinese History Benchmark
Yet these disparities raise a question HiST-LLM alone cannot answer: do performance gaps reflect inherent properties of different historical domains, or uneven training data distribution? FoundaBench addresses this question through controlled comparison. This Chinese knowledge benchmark employs 200 multiple-choice questions on Chinese history at middle and high school levels, using methods similar to MMLU’s but grounded in Chinese educational curricula. This design isolates training data composition as the key variable: similar question format, different linguistic corpus.
The benchmark’s results in Figure 5 support the training data hypothesis. Researchers found that LLMs trained extensively on Chinese-language data outperformed vastly larger multilingual models like GPT-4 (Li et al., 2024). These findings challenge assumptions about the importance of model scaling. Model size typically correlates with improved performance, as larger models train on more extensive datasets for longer computational periods. GPT-4 is estimated at one trillion parameters. Yet InternBench models with 123 and 70 billion parameters surpassed GPT-4 on Chinese history; even the comparatively tiny 14-billion-parameter Qwen model performed comparably to GPT-4.

Figure 5
Accuracy of LLMs on FoundaBench History Subjects (CircularEval): Chinese Models Highlighted. Data from Li et al. (2024).
This pattern reveals that training data constraints apply broadly. Where HiST-LLM shows Western-trained models struggling with global knowledge, FoundaBench shows the reverse: Chinese-trained models excel in Chinese history regardless of size, while larger Western-trained models falter. Importantly, this is not evidence of superior Chinese model architectures or training processes. Rather, both illustrate the same phenomenon: LLM “historical knowledge” reflects linguistic and geographic hierarchies in training data, not universal competences or computational scale.
Content diversity thus reveals a consistent pattern: models excel where training data is abundant, regardless of computational scale. Training data composition, and not architectural sophistication, determines which historical domains models can navigate. Yet a fundamental question remains: can pattern recognition within familiar domains translate to historical reasoning beyond them? Moving from recognition to reasoning requires different assessment formats, a shift examined through benchmarks employing open-ended questions rather than multiple-choice options.
(6) Beyond Multiple Choice: The Challenge of Evaluating Open-Ended Responses
The framework dimensions of format diversity and epistemological sophistication test whether models can move beyond pattern recognition to historical reasoning. HiBenchLLM primarily illuminates format diversity through open-ended questions, while a study of Italian Fascism probes epistemological sophistication by testing whether models can distinguish assertion from argument. Both assess LLMs through open-ended questions requiring synthesis, revealing whether models can construct rather than merely recognize historical arguments. Expert-driven assessment and close reading of LLM outputs from these two complementary studies illustrate a dramatic collapse in performance across different historical domains and methodological approaches.
(6.1) Interpretive Collapse: HiBenchLLM on French Regional History
The HiBenchLLM study confronts the format diversity dimension by shifting from breadth to depth and from recognition to generation. Researchers challenged 14 leading LLMs on the regional history of Poitou, France, through open-ended prompts requiring qualitative analysis and contextual understanding (Chartier et al., 2025). HiBenchLLM tests whether LLMs can construct historically specific explanations when answers must be generated rather than selected.
Where multiple-choice benchmarks can be scored automatically, open-ended questions demand qualitative judgment. Researchers developed a nuanced evaluation scale recognizing that historical questions rarely admit simple right-or-wrong answers. Responses were evaluated for accuracy, comprehensiveness, and analytical depth rather than binary correctness. The benchmark’s 62 questions spanned a spectrum of complexity. Some queries test straightforward factual recall, such as dates, locations, and biographical details. Others focus on interpretive questions requiring synthesis and argumentation. Such questions demand that models marshal evidence and construct coherent explanations, core skills that define historical reasoning. Researchers also tested response consistency by creating multiple variations of each question with different phrasing or period-specific language. If models possessed grounded “knowledge” of these concepts, variations should make little difference. Yet they showed significant divergence.
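The reliability criterion can be stated compactly: a question counts as reliably answered only if every phrasing variant receives a correct response. The sketch below expresses that rule in schematic form; `evaluate_response` stands in for the expert grading the study actually employed, and `query_model` is a hypothetical LLM call.

```python
# Reliability across question variants: a question counts as reliable only if
# every paraphrased variant is answered correctly. `evaluate_response` stands
# in for expert grading and is assumed to return True or False.

def question_is_reliable(variants: list[str], query_model, evaluate_response) -> bool:
    """True only if every variant of the question receives a correct response."""
    return all(evaluate_response(variant, query_model(variant)) for variant in variants)

def reliability_share(questions: list[list[str]], query_model, evaluate_response) -> float:
    """Share of questions whose variants are all answered correctly (cf. Figure 6)."""
    reliable = sum(question_is_reliable(v, query_model, evaluate_response)
                   for v in questions)
    return reliable / len(questions) if questions else 0.0
```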
The results of the benchmark shown in Figure 6 demonstrate the challenges LLMs encountered on this assessment. Even Gemini, the best-performing model, provided reliable results less than 40% of the time. The highly variable nature of LLM responses raises significant concerns about the consistency of these technologies for higher-order historical analysis. This limitation compounds another: some question types proved more challenging than others, revealing which historical competencies remain most elusive for LLMs.

Figure 6
Reliability of LLMs on HiBenchLLM French Regional History Benchmark (Share of Questions With 100% Correct Variants). Data from Chartier et al. (2025).
Figure 7 illustrates how this variability reveals which historical competencies remain on the “jagged frontiers” of LLM capabilities (Mollick, 2024). On quantitative questions assessing specific dates, numbers, or factual intervals, models averaged just 12% reliability. LLMs achieved greater success in metadata and list-generation tasks (42% average), identifying relevant details (battles, figures, events) without determining relationships among them. This pattern suggests models excel at retrieval and listing, tasks that leverage the remarkable breadth of LLM training data. But these same models struggle when synthesis and contextual understanding are required. When asked for definitions or descriptions requiring contextual understanding, average model performance dropped to 34%.

Figure 7
Reliability by Question Type on HiBenchLLM French Regional History Benchmark: Top LLMs and Top-5 Average. Data from Chartier et al. (2025).
The most revealing failures emerged on questions demanding interpretive synthesis, what the researchers classified as “broad problem” questions. These questions prompted models to construct historical arguments across several frames of reference, such as: “To what extent was craftsmanship crucial in the economic life of Poitou villages from 1500 to 1800?” Such queries demanded more than robust training data. Models needed to weigh evidence, consider counterarguments, and develop analytical frameworks.
Performance collapsed: models averaged only 16% reliability, with 38% of responses classified as “approximate or partially correct.” Another 24% of responses were simply incorrect, often because models “produced lengthy and generic responses without addressing specificities or reasoning about related concepts” (Chartier et al., 2025). Models can recall relevant information but struggle to organize it into a coherent analysis. On the tasks that matter most for historical scholarship (synthesis, interpretation, analytical reasoning) even frontier models performed poorly.
This performance collapse on interpretive questions reveals format diversity’s crucial role in benchmark validity. Multiple-choice assessments, even decontaminated ones testing global knowledge, still provide scaffolding through answer options that LLMs can navigate via their training data. Open-ended questions requiring synthesis remove this scaffolding, exposing whether models can construct rather than merely recognize answers. For benchmark designers, format diversity functions as a validity check: models appearing competent on recognition tasks may lack interpretive capabilities that genuine expertise demands, suggesting multiple-choice scores overestimate performance on tasks requiring original analysis.
This raises a further question: even when models generate responses that appear factually accurate, can they distinguish how we know from what we know? The capacity to evaluate sources, weigh competing interpretations, and distinguish assertion from argument represents the final dimension of this framework.
(6.2) Violent Ambiguities: Benchmarking Italian Fascism
Format diversity reveals that models struggle to transform factual recall into interpretive synthesis. The final framework dimension, epistemological sophistication, poses a related challenge: can models distinguish not only what is asserted but how we know? A recent study examined LLM performance on Italian Fascism, testing whether LLMs can weigh historical interpretation against historiographical debate, contested evidence, and public memory (De Ninno & Lacriola, 2025).
The authors queried the ChatGPT model family (GPT-3.5, GPT-4, GPT-4o) on 84 open-ended questions and analyzed 252 responses. The results illustrated in Figure 8 proved sobering: the authors judged 143 of 252 responses (56.7%) as containing errors. The analytical shortcomings identified in this study mirrored those in HiBenchLLM: factual errors, chronological inconsistencies, and generic outputs applicable to any authoritarian regime. However, the most persistent flaw in ChatGPT was not factual errors but misleading interpretations. Even the advanced GPT-4o produced “interpretively ambiguous” accounts of history in over a third of responses, meaning users are frequently presented with authoritative-sounding responses that require professional historical training to debunk.

Figure 8
Error Categories in ChatGPT Responses on Italian Fascism. Data from De Ninno and Lacriola (2025).
Yet the study’s most revealing findings emerged from patterns of historiographical distortion rather than error rates. For example, when asked about the development of Italy’s 1938 racial laws, all three GPT versions attributed them primarily to external pressure from Nazi Germany. This reproduced dated interpretations that downplayed the role of the Fascist regime and the domestic politics that influenced it. When models identified bibliographic works to support their views, they consistently privileged older English-language scholarship over recent Italian works, and even then frequently misapplied their arguments. For instance, all three models cited Michele Sarfatti’s authoritative work on Italian Fascism, yet failed to apply his core argument emphasizing the domestic factors that drove the regime. This pattern extended across economic and political history: models described Fascist corporatism using the regime’s own rhetoric rather than scholarly analysis, and concentrated on 1938–1943 while neglecting earlier foundational periods. Such distortions suggest models privilege widely circulated narratives over specialized scholarship. Worse still, models consistently downplayed the violence of the Fascist regime. These shortcomings on a topic demanding careful evaluation of evidence and historical nuance should give historians pause in deploying LLMs for higher-order analytical tasks.
Epistemological sophistication thus exposes the deepest limitation of current LLMs: models can recognize and cite scholarship but struggle to engage with historiographical arguments. They privilege circulation over insight, reproducing widely available interpretations rather than weighing evidence and contested claims, the core practices that distinguish historical knowledge from historical information.
(7) From Critique to Framework: Principles for Effective Historical Assessment of LLMs
Collectively, the four dimensions examined here – contamination resistance, content diversity, format diversity, and epistemological sophistication – reveal systematic patterns in the capabilities and limitations of LLMs across multiple scales of analysis. The implications of these findings for benchmark design, model development, and historical practice merit careful consideration.
Performance across these benchmarks reveals systematic collapse as assessments demand synthesis, source criticism, and historiographical judgment, the core competencies defining historical reasoning. These benchmarks chart this progression: from 90%+ on contaminated MMLU questions to 68–77% after decontamination, 46% on global history (HiST-LLM), 38% on open-ended questions (HiBenchLLM), and 16% on interpretive synthesis – a decline mapping the shift from pattern recognition to historical reasoning. Qualitative analysis of Italian Fascism responses exposes failures beyond accuracy: models with 44% accuracy consistently misattributed causation, cited works without engaging arguments, and privileged popular narratives over specialized scholarship.
As Table 2 documents, these studies reveal a clear progression: from contaminated to decontaminated datasets, Western to global content, multiple-choice to open-ended formats, and factual recall to historiographical sophistication. Yet, even as these benchmarks expose current limitations, model architectures continue to evolve, raising the question of whether new capabilities demand new assessment frameworks.
Table 2
Four-Dimensional Framework for Historical LLM Assessment with Design Implications.
| DIMENSION | DEFINITION | KEY FINDINGS | DESIGN IMPLICATIONS |
|---|---|---|---|
| Contamination Resistance | Extent to which benchmark isolates genuine capabilities from memorization | 90%+ → 68–77% accuracy after decontamination | Curator-reviewed questions; temporal holdout sets; avoid widely distributed datasets |
| Content Diversity | Geographic, temporal, linguistic breadth of evaluation dataset | HiST-LLM Latin America anomaly (49%) reflects Seshat’s pre-Columbian focus; FoundaBench shows training data composition > computational scale | Include underrepresented regions/periods; document source coverage; test across languages |
| Format Diversity | Range of response types (MC, open-ended, structured) and complexity levels | HiBenchLLM: 38% overall → 16% interpretive synthesis; MC provides scaffolding that inflates scores | Multiple formats per domain; emphasize generation over selection; include multi-step reasoning |
| Epistemological Sophistication | Assessment of source criticism, historiographical judgment, evidential reasoning | Italian Fascism: 56.7% error rate; models cite but cannot apply historiographic arguments; privilege older English-language scholarship and outdated framings | Test interpretation vs. fact; require source evaluation; probe historiographical awareness |
The development of “reasoning” models represents one such architectural frontier. As seen in the HiST-LLM benchmark, earlier models achieved improved performance through chain-of-thought (CoT) prompting, which generated “reasoning traces” that provided models the means to scaffold their analysis for complex queries. Newer models trained with large-scale reinforcement learning now produce these reasoning traces natively, enabling them to plan multi-step solutions and engage in more complex problem-solving behaviors. For historical assessment, this development creates opportunities to evaluate not just final outputs but the analytical processes that generate them (Chen et al., 2025). Evaluators can examine how models marshal evidence, construct causal narratives, and justify interpretive claims across chains of inference. However, the benchmarks examined here predate these architectural innovations, leaving key questions for future research. Can reasoning traces lead to genuine analytical complexity, or do they merely produce more elaborate forms of pattern-matching? Do extended inference chains lead toward historical insight or sophisticated errors? Future benchmarks should treat reasoning traces as assessable artifacts, developing rubrics that evaluate whether models’ step-by-step reasoning reflects genuine historical thinking.
Similarly, RAG represents an architectural response to the content diversity limitations revealed by these benchmarks. Rather than relying solely on fixed training data, RAG systems query external knowledge sources to supply relevant evidence to LLMs, theoretically enabling models to access information beyond their training data. This approach promises to address the geographic and temporal constraints on LLM performance examined in these benchmarks. Yet assessing the effectiveness of these approaches requires new evaluation frameworks. For example, one early RAG system is STORM, designed by the Stanford Open Virtual Assistant Lab (Shao et al., 2024). This system generates Wikipedia-style articles through a multi-step retrieval and research process that synthesizes information from web sources. While impressive in scope, the developers found that access to external sources introduces new challenges even as it resolves others. Expert evaluation of STORM-generated articles revealed two systematic failure modes that parallel the epistemological limitations identified in this study. First, models created what evaluators identified as “red herring fallacies”, or unverifiable connections between different pieces of retrieved information. This pattern mirrors the shortcomings observed in the Italian Fascism study: LLMs could identify and reference relevant sources, but could not evaluate whether those sources actually supported their inferences. Second, models transferred bias and emotional tone directly from internet sources into their articles. Evaluators characterized these outputs as “emotional” or “unneutral,” demonstrating that models struggle with the critical filtering necessary to evaluate source quality or maintain historiographical balance. These findings suggest that RAG shifts rather than solves the evaluation challenge. Instead of assessing what models learned during their training, new measures must now assess how models synthesize dynamically retrieved data. Future benchmarks must therefore probe whether RAG systems can critically evaluate source reliability, distinguish justified inference from speculative association, and maintain historiographical neutrality when synthesizing information from biased retrieval results.
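In schematic form, a RAG pipeline interposes a retrieval step between the user’s question and the model’s generation. The sketch below illustrates the basic pattern only; the toy lexical retriever, the document structure, and the `query_model` call are hypothetical placeholders rather than STORM’s actual architecture.

```python
# Schematic retrieval-augmented generation (RAG) pipeline. The retriever,
# document structure, and `query_model` call are hypothetical placeholders.

def retrieve(question: str, documents: list[dict], top_k: int = 3) -> list[dict]:
    """Toy lexical retriever: rank documents by word overlap with the question."""
    terms = set(question.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(terms & set(d["text"].lower().split())),
                    reverse=True)
    return scored[:top_k]

def answer_with_rag(question: str, documents: list[dict], query_model) -> str:
    """Prepend retrieved passages (with identifiers) to the prompt before generation."""
    passages = retrieve(question, documents)
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    prompt = (f"Use only the sources below, and cite them by identifier.\n"
              f"{context}\n\nQuestion: {question}\nAnswer:")
    return query_model(prompt)
```

The evaluation challenge described above sits in the final step: whether the generated answer is actually warranted by the retrieved passages, rather than merely fluent, is precisely what new benchmarks must assess.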
Beyond architectural evolution, the expanding multimodal capabilities of LLMs introduce further assessment challenges. Media like audio, images, and videos have joined text to become legible sources for LLM systems. Each modality demands distinct evaluation approaches for source processing and interpretation (Hutchinson, 2024). HistBench, a recently developed benchmark of 414 expert-authored questions grounded in sources spanning 29 languages and multiple modalities, provides a rigorous testbed for these capabilities (Qiu et al., 2025). Its assessment framework uses approaches parallel to those examined here: evaluating content diversity across languages and regions, format diversity through multimodal sources, and epistemological sophistication through tasks requiring historical analysis and source interpretation. Performance across model types reveals both incremental progress and persistent limitations. LLMs equipped with basic web search achieved only 18.6% accuracy (GPT-4o). Equipping the same model with HistAgent, a sophisticated RAG system integrating optical character recognition, multilingual translation, reverse image search, and academic literature retrieval, improves performance to 27.5%. Yet even reasoning-optimized models designed for extended inference reached only 32.4% accuracy (GPT-o3). This progression demonstrates that architectural innovations and enhanced retrieval infrastructure yield measurable gains, but fundamental limitations in historical reasoning persist. Models can access, transcribe, and cite sources with increasing sophistication, yet they still struggle to critically evaluate source reliability, distinguish justified inference from speculation, or synthesize interpretive arguments. Assessment frameworks grounded in disciplinary methodology remain essential for distinguishing technical capability from genuine historical reasoning.
The four-dimensional framework developed here demonstrates how source criticism provides analytical tools for evaluating LLMs and their benchmarks. By interrogating training data provenance, content distributions, question formats, and epistemological sophistication, historians and digital humanists can assess both model capabilities and the benchmarks designed to measure them. As architectures evolve, these assessment frameworks must evolve as well, maintaining critical scrutiny of both what models know and how they know it.
Competing Interests
The author has no competing interests to declare.
Author Contributions
Daniel Hutchinson: Conceptualization, Methodology, Investigation, Data curation, Formal analysis, Visualization, Writing – original draft, Writing – review & editing.
