Bias in Psychology: A Critical, Historical and Empirical Review

By Lee Jussim and Nathan Honeycutt

Open Access | July 2024

People are subject to many biases in thinking, perception, judgment, and decision making. And yet, humanity has built great civilizations, produced amazing technology, visited the Moon in person, and the Earth now sustains over 8 billion people. Clearly, people are also able to figure out a great many things and navigate their worlds successfully. Bias, then, is not the whole story, and, perhaps, not even the main one.

The present review is a cautionary analysis of the dangers of scientific overreach, showing how, time and time again, for nearly a century, there have been great outbursts of research in psychology on a variety of biases. We also show how, in each case, many of the original and most dramatic claims proved unjustified. It often took years, sometimes decades, for the depth of critiques of major claims in the bias literatures to percolate through and be understood by the wider scientific community. Whereas the outbursts received massive attention at the time, and the critiques generally received lukewarm attention at best, in nearly every case, the waves of enthusiasm for some then-new form of bias proved overwrought. It is not that the biases were necessarily shown to be false, though some were. But even when the biases found held up to critical scrutiny, they often were either not as powerful as originally claimed, or, sometimes, rather than producing a reign of error, they proved to serve people well in the real world. There are three scientific messages to be gleaned from this cautionary review: 1. Many of the original claims that emerged from bias research often reached conclusions about the power of subjectivity and bias that were not justified; 2. The next time bias effects appear, scientists should approach apparently amazing, dramatic, world-changing findings with perhaps more skepticism than has been common; and 3. It may be wise to hold off telling the world about these amazing, dramatic, world-changing findings as if they can be taken at face value and be used for effective real-world interventions until the wider scientific community has had many years, often decades, to skeptically vet whether they hold up to what early advocates claim them to be.

Paroxysms of “Bias”

What is “Bias”?

Before embarking on this review, however, it might be useful to point out that there are many different definitions and conceptualizations of bias in the literature. Bias can be, and has been, defined as deviation from formal logic or from normative statistical models, as preference, as tendency, as systematic error, as unfair discrimination, and in many other ways (see, e.g., Banaji & Greenwald, 2016; Kahneman, Slovic & Tversky, 1982). In this review, we consider something to be a bias whenever researchers themselves referred to whatever phenomenon they were addressing as some sort of bias. Table 1 provides a list of some phenomena in psychological research referred to as bias.

Table 1

Some Biases in Social and Cognitive Psychology.

acquiescence bias | halo effect | naïve realism
anchoring | hindsight bias | outcome bias
availability heuristic | hot hand fallacy | outgroup homogeneity
base-rate fallacy | hypothesis-confirming bias | overconfidence
belief perseverance | illusion of control | pluralistic ignorance
biased assimilation | illusory correlation | prejudice
confirmation bias | implicit bias | representativeness
conjunction fallacy | ingroup bias | self-consistency bias
correspondence bias | just world bias | self-fulfilling prophecy
dogmatism | labeling effects | self-serving bias
ethnocentrism | law of small numbers | sexism, racism
expectancy bias | linguistic bias | social desirability
false consensus | microaggressions | stereotype exaggeration
false uniqueness | mindlessness | stereotype-confirming biases
fixed pie bias | misanthropic bias | system justification
fundamental attribution error | myside bias | unrealistic optimism

Psychological research has examined biases for nearly a century (e.g., LaPiere, 1936; Bruner, 1957). Critiques can be readily found elsewhere (e.g., Corneille & Hütter, 2020; Jussim, 2012; Krueger & Funder, 2004) and are consistent with the theme of the present paper: that evidence for errors and biases is often weaker than is typically claimed.

Psychology has been intermittently punctuated by great outbursts of research on various sorts of biases, usually accompanied by grandiose claims about their pervasiveness, power and social importance. We review some of those outbursts, documenting how many of the cases fit a pattern that Paul Meehl (1978, p. 806–807) aptly described:

I consider it unnecessary to persuade you that most so-called “theories” in the soft areas of psychology (clinical, counseling, social, personality, community, and school psychology) are scientifically unimpressive and technologically worthless… Perhaps the easiest way to convince yourself is by scanning the literature of soft psychology over the last 30 years and noticing what happens to theories. Most of them suffer the fate that General MacArthur ascribed to old generals—They never die, they just slowly fade away.

Meehl (1990, p. 196) in a paper titled “Why Summaries of Research on Psychological Theories are Often Uninterpretable” also wrote:

…theories in the “soft areas” of psychology have a tendency to go through periods of initial enthusiasm leading to large amounts of empirical investigation with ambiguous over-all results. This period of infatuation is followed by various kinds of amendment and the proliferation of ad hoc hypotheses. Finally, in the long run, experimenters lose interest rather than deliberately discard a theory as clearly falsified.

Before proceeding to the main review, two points and one term need to be clarified. First, Meehl addressed psychology generally, and was not focused on bias per se. Furthermore, although he referred specifically to theory, we interpret his passage broadly to refer to claims and conclusions (which usually occur within some theoretical context). Thus, one could view the present review as something of, if not exactly a “case study,” then perhaps a “domain study” of the relevance of Meehl’s analysis for psychology writ large. However, our review focuses on bias, rather than psychology writ large, so we leave it to readers to reach their own judgments about the relevance of Meehl’s analysis to other domains of psychological research. Our view is that his general analysis applies well to much of the work on both old and new biases that we review. After great outbursts and hundreds or maybe thousands of scholarly articles, the validity or generality of many of the original claims was ultimately far more ambiguous than originally suggested. Second, our review is on bias per se; it does not address the decades-old debate about whether humans are fundamentally rational or irrational. Indeed, there are many conditions under which a variety of biases may be considered rational (e.g., Tappin, Pennycook & Rand, 2020), although this issue is beyond the scope of the present review.

Last, we use the term “Wow Effect!” to describe the academic rhetoric around certain findings. Jussim, Crawford, Anglin, Stevens & Duarte (2016, p. 118) described a “Wow Effect!” as “some novel result that comes to be seen as having far-reaching theoretical, methodological, or practical implications. It is the type of work likely to be emulated, massively cited, and highly funded.” We show with respect to bias (as Jussim et al., 2016, showed with respect to a range of other areas of research) that many (though not all) supposed Wow Effects! failed to deliver on the claims of their advocates.

Old Biases

The first part of our review focuses on “old biases.” Much of this work was conducted between 1940 and 1990. It was characterized by interest in perceptual and cognitive biases reflecting basic features of human psychology, generally bearing little or no relevance to the political or ideological issues, agendas, or advocacy that characterize the “new biases” we review later.

The New Look and Its Discontents

The first such explosion was the New Look in Perception of the 1940s and 1950s, which produced one of the earliest Wow Effects! in psychology: that emotions and motivations influence basic perception (vision, hearing, etc., Bruner, 1957). At the time, this was a Wow Effect! because perception was presumed to be a largely objective process of neural reception of external stimuli, unaffected by fears, needs or motivations, so the New Look aspired to turn this view on its head. It is now widely recognized that this effort largely failed (Jussim, 2012). Cole and Balcetis (2021, p. 131) put it this way: “The accumulating studies in the late 1940’s and 1950’s claimed to have amassed evidence of visual perception being infiltrated by perceivers’ current states. Yet despite its initial wildfire popularity, the New Look in perception ultimately gave way to withering critiques in the late 1950’s…” For example, because all studies relied on explicit reports, none could rule out the possibility that the supposed effects on perception instead reflected memory, judgment, or, when potentially sensitive stimuli such as profanity or sex were used, socially desirable responding (Jussim, 2012). As a result, effects on perception per se were never actually demonstrated.

This area made a smaller comeback under the “social priming” and “automaticity” umbrella (e.g., Bargh & Chartrand, 1999). These efforts also failed. Much social priming work could not be replicated by other researchers (e.g., Doyen, Klein, Pichon & Cleeremans, 2012). Even when the findings replicated, conclusions regarding effects on perception were found to be unjustified because they were shown to reflect higher order cognitive processes rather than perception (Firestone & Scholl, 2016). A recent review (Cole & Balcetis, 2021) claims that more modern research justifies the conclusion that motivations influence basic perception, but that claim is too recent to have been subject to the intense critical scrutiny that torpedoed prior such claims. Whether it will survive such scrutiny, or fail like its predecessors, remains to be seen.

Heuristics and Biases

Kahneman and Tversky (e.g., Kahneman et al. 1982) discovered and described a slew of deviations from normative models of rationality which they generally referred to as “heuristics and biases.” Many are listed in Table 1. This set off another explosion of research on error and bias. Detailed reviews can be found for psychology (Shah & Oppenheimer, 2008), economics (Harvey, 1998), law (Peer & Gamliel, 2013) and many other fields.

This explosion produced soaring testaments to the power and pervasiveness of error and bias (see, e.g., Jussim, 1991 for quotes). In contrast to The New Look, research on these types of heuristics and biases was generally replicable, never halted and continues today (Atanasiu, Ruotsalainen, & Khapova, 2023). Nonetheless, the next decades included the discovery that much of what appeared to be error or bias under specific lab conditions actually served people quite well (see Gigerenzer & Gaissmaier, 2011; Jussim, 1991; Krueger & Funder, 2004 for reviews). For example, “improper linear models” (ones that violate optimal statistical principles) performed almost as well and sometimes better than proper statistical models, such as ordinary least squares regression (Dawes, 1979). Gigerenzer & Gaissmaier (2011, p. 451) summarize related work that tested “…formal models of heuristic inference, including in business organizations, health care, and legal institutions. This research indicates that…individuals and organizations often rely on simple heuristics in an adaptive way…” Jussim (2012) concluded not that biases do not occur, but that accuracy dominates bias and self-fulfilling prophecy. Jussim (1991) and DiDonato, Ullrich and Krueger (2011) showed how, sometimes, the biases produced by social stereotypes can improve the accuracy of judgments.

Thus, much of the early glorification of findings of error and bias has disappeared from the more recent literature, replaced by acknowledgment that the critics were mostly right (e.g., Hjeij & Vilks, 2023). As Krueger and Funder (2004, p. 313) put it, exaggerated emphasis on error and bias had many shortcomings, including “…frequently erroneous imputations of error, findings of mutually contradictory errors, incoherent interpretations of error, an inability to explain the sources of behavioral or cognitive achievement, and the inhibition of generalized theory.” It is in the latter spirit that we present a tool for evaluating how biased people actually are in any given study, one that shows that, often (though not always), even when biases occur, unbiased judgments are far more substantial.

The Goodness of Judgment Index

In this section we present a mathematical tool for extracting information about unbiased responding from studies focusing exclusively on bias. Doing so can correct the impression that classic studies provide evidence of powerful biases when they do not. This is an easy mistake to make. When articles focus exclusively on bias, it is natural for readers to come away interpreting the article as finding nothing but bias.

The Goodness of Judgment Index (GJI) can provide a corrective to this unfortunate misrepresentation of the literature. It allows researchers to identify how much evidence of accuracy, agreement, or unbiased responding occurred in studies exclusively reporting bias. Only by obtaining quantitative assessments of how much error and bias occurred relative to how much accuracy or unbiased responding occurred can scientists reach appropriate conclusions about the relative power of bias or accuracy in any study.

The GJI was inspired by and modeled after goodness-of-fit index (GFI) tests in structural equation modeling (SEM; see, e.g., Bollen, 1989). The key question addressed by GFI tests was “how much better is this model than nothing at all?” rather than “does the model statistically deviate from perfection?” (e.g., Bentler & Bonnet, 1980). Although actual tests typically involve additional adjustments and computations not shown here, the core idea was this:

Equation 1: Prototype for SEM

[ChiSquare(null model) - ChiSquare(hypothesized model)] / ChiSquare(null model)

The null model assumes 0 covariance among all variables; as such it is incapable of explaining any covariance among the variables. The chi-square test for the null model is typically very large because it explains none of the covariances. The better the fit of the hypothesized model, the lower its chi-square. A perfectly fitting model would have a chi-square of 0 because it would fully account for all covariances. In such cases, Equation 1 equals 1.0. Equation 1 ranges from 0 to 1, with higher values indicating a better fitting model and meaning that the hypothesized model captured more of the covariances. This inspired the GJI:

Equation 2: Basic GJI

GJI = [(Maximal Possible Judgmental Error) - (Actual Judgmental Error)] / (Maximal Possible Judgmental Error)

The GJI ranges from 0 (complete judgmental error) to 1 (perfect accuracy, no error). The GJI can also be adapted to different types of judgment, and need not be restricted to errors per se. For example, in some studies, there may be no objective truth criterion. Instead, there are often just differences between how different people, or people in different situations, view, perceive, evaluate, or judge some stimulus or event. In such cases, the GJI becomes:

Equation 3: GJI for Bias and (Dis)Agreement

GJI = [(Maximal Possible Judgmental Disagreement) - (Observed Judgmental Disagreement)] / (Maximal Possible Judgmental Disagreement)

The less disagreement, the higher the GJI. Consider a hypothetical stereotype experiment wherein people judge a target with some behavior or attribute. People are randomly assigned to conditions wherein the target is labeled as a member of Group A or Group B. Typically, such studies exclusively assess bias – do people judge the target differently when labeled as a member of Group A or Group B? That is fine as far as it goes, but the GJI permits extraction of information about agreement from such studies.

What is a “good” or “bad” GJI? In SEM, goodness-of-fit indices of .9 or higher are generally viewed as markers of good fit, and those below .8 as not good (Bentler & Bonnet, 1980; Bollen, 1989). How well that applies to the GJI will probably require extensive use to determine how it performs in different contexts. Of course, the GJI should be interpreted in the context of whatever is being studied. Nonetheless, if we use the SEM guidelines for good/bad fit as a heuristic starting point, GJIs above .9 would indicate minimal bias, and those below .8 substantial bias (with those between .8 and .9 on the border). Furthermore, the GJI provides neither contextual information nor evaluations of “importance.”
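For readers who prefer code to prose formulas, the following sketch (in Python; our illustration, as the paper itself provides no code) implements Equations 2 and 3 and the SEM-inspired interpretation heuristic described above. The function names and the mapping of cutoffs to verbal labels are our own choices.

```python
def goodness_of_judgment_index(max_error: float, observed_error: float) -> float:
    """Basic GJI (Equations 2 and 3): the share of possible error or disagreement
    that did NOT occur, ranging from 0 (maximal error) to 1 (no error)."""
    if max_error <= 0:
        raise ValueError("Maximal possible error/disagreement must be positive.")
    if not 0 <= observed_error <= max_error:
        raise ValueError("Observed error must lie between 0 and the maximum.")
    return (max_error - observed_error) / max_error


def interpret_gji(gji: float) -> str:
    """Heuristic labels borrowed from SEM goodness-of-fit conventions (see text)."""
    if gji >= 0.9:
        return "minimal bias"
    if gji >= 0.8:
        return "borderline"
    return "substantial bias"
```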

With these limitations in mind, we next illustrate the utility of the GJI, with three examples from studies published over a span of 60 years. These are examples, not a representative sample. There are two main reasons we use the GJI to analyze evidence from these three articles. The first is illustrative: Simply to show how the GJI can be used to extract evidence of reasonable or unbiased responding from studies that focused exclusively on bias. The second reason is more substantive: It shows how, for two of these studies, there was far more evidence of unbiased responding than bias, even though they are usually interpreted and cited as evidence of substantial or dramatic biases. This, in turn, raises a question: does a similar pattern characterize other work on bias (though answering that question is beyond the scope of the present review)?

Hastorf and Cantril (1954)

This classic investigation of bias examined the aftermath of a controversial 1951 football game between Dartmouth and Princeton. There were injuries, accusations of foul play, and outrage expressed at each school. Hastorf and Cantril (1954) showed a film of the game to 48 Dartmouth and 49 Princeton students. For each play, the students rated whether they saw an infraction committed by a player on the Dartmouth and/or Princeton team. Table 2 presents their main results. The key finding was that Princeton students saw the Dartmouth team as committing more than twice as many infractions as did the Dartmouth students. Hastorf and Cantril’s (1954, p. 133) conclusions, still echoed in modern scholarship (e.g., Van Baar & FeldmanHall, 2022), included:

Table 2

Hastorf and Cantril’s (1954) Results.

Perceiver Group | Perceived Infractions by Dartmouth Team | Perceived Infractions by Princeton Team
Dartmouth Students (N = 48) | 4.3* | 4.4
Princeton Students (N = 49) | 9.8* | 4.2

* The difference between the starred means was reported as “significant at the .01 level,” though Hastorf and Cantril did not report what statistical test they performed.

“There is no such ‘thing’ as a ‘game’ existing ‘out there’ in its own right which people merely “observe.”

“The ‘thing’ simply is not the same for different people…”

The GJI tells a different story. But to do so first requires making explicit features of the study that may not be obvious because Hastorf and Cantril (1954) did not articulate them. First, they did not have a reality criterion. We don’t know how many infractions either team committed. The only thing we know is the extent to which the students disagreed about infractions committed by the teams. Thus, it is the (dis)agreement form of the GJI that must be applied.

Second, to use the GJI, one needs to know “maximal possible judgmental disagreement.” There are several possible answers to this question for perceived infractions in a football game. To illustrate the substantive contribution of the GJI, in each case, when there are alternatives, we select the option most supportive of Hastorf and Cantril’s (1954) conclusions. That is, if Alternative A would lead to a conclusion of more bias and less agreement than would Alternative B, we work through the example with Alternative A.

The first choice for maximal possible judgmental disagreement involves the number of plays in the game. We found no box score reporting the number of plays. A typical college football game in the 21st century has over 100 plays. For example, the 2024 national championship game between Michigan and Washington had over 120 plays (Covers, 2024).

When the maximal possible disagreement goes up (given a constant level of disagreement), so does the GJI. Thus, higher levels of maximal possible disagreement would make people look less biased than would lower levels of maximal possible disagreement. Because we do not know the number of plays in the Princeton v. Dartmouth game of 1951, we conservatively estimate that number as 60, keeping in mind that choosing a lower number favors finding bias.

The second choice involves how many infractions could be committed on each play by each team. There is no formal limit to this. In principle, it is possible that all 11 players on one team each committed multiple infractions. For example, with 60 plays, if one assumes at most one infraction per play for the entire team, maximal possible judgmental disagreement is 60 per team. If one instead assumes perceivers could “see” each of the 11 players committing 10 infractions per play, then there could be 110 infractions per play per team, or 6,600 infractions for that team over the game.

However, in these examples, we wish to compute the GJI in a manner maximally favorable to bias conclusions. Therefore, we estimate this value as one infraction per team per play, or 60 for the entire game. Any number higher than 60 will produce a higher GJI than the one we actually compute here. With these preliminaries addressed, we now have a conservative estimate for maximal possible disagreement – 60.

Next, we obtain the observed judgmental disagreement from Table 2. For judgments of the Princeton team, this value was 0.2 (4.4 infractions perceived by Dartmouth students, minus 4.2 infractions perceived by Princeton students). Now we compute the GJI:

Equation 4: The GJI for Judgments of Princeton Team Infractions

GJI = (60 - 0.2) / 60 = 59.8 / 60 = 0.997

With respect to infractions committed by the Princeton team, the students fell the tiniest hair short of showing no bias at all. Perhaps this computation was not even necessary; Hastorf and Cantril (1954) themselves did not make much of this tiny difference. Nonetheless, they did not acknowledge that it is starkly contrary to their overwrought claims about the game “not being the same” for different perceivers: at least with respect to the Princeton team (i.e., half the game), the game was almost completely “the same” for different perceivers.

For perceptions of the Dartmouth team, again, we obtain observed judgmental disagreement from Table 2. That value is 5.5 (9.8 infractions perceived by Princeton students minus 4.3 infractions perceived by Dartmouth students). Now we compute the GJI:

Equation 5: The GJI for Judgments of Dartmouth Team Infractions

GJI = (60 - 5.5) / 60 = 54.5 / 60 ≈ 0.908

Even for the Dartmouth team, although there was some bias, the GJI shows that the two groups of students overwhelmingly saw the same game. These results were obtained after making assumptions maximally favorable to Hastorf and Cantril’s (1954) conclusions. Any other set of plausible assumptions (more plays, more possible infractions per play) would produce far higher GJIs, indicating that their students saw an even more nearly identical game than our analyses suggest. Thus, under any plausible assumptions, the GJI shows that their conclusions emphasizing bias and subjectivity were overstated.
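As a check on Equations 4 and 5, the helper sketched earlier reproduces these values (assuming, as in the text, a conservative maximum of 60 infractions per team).

```python
# Hastorf and Cantril (1954): disagreement about each team's infractions.
princeton_gji = goodness_of_judgment_index(max_error=60, observed_error=4.4 - 4.2)
dartmouth_gji = goodness_of_judgment_index(max_error=60, observed_error=9.8 - 4.3)
print(round(princeton_gji, 3), interpret_gji(princeton_gji))  # 0.997 minimal bias
print(round(dartmouth_gji, 3), interpret_gji(dartmouth_gji))  # 0.908 minimal bias
```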

Krueger and Rothbart (1988)

We selected one of the results from this study because it can be used to illustrate how to use the GJI with a different type of dependent variable: a seven-point scale. The full paper addressed conditions under which individuating information eliminated stereotype biases, but we focus on only one of the results finding bias. In their first experiment, in one condition, people rated the aggressiveness of a man or a woman who had engaged in a single behavior highly diagnostic of aggressiveness. Krueger and Rothbart (1988) did not report the exact means, but their Figure 1 displays them, and we estimated the means from that figure.

In this particular condition, people rated the male target as more aggressive (from the Figure, we estimate that mean as 5.1) than the female target (from the Figure, we estimate that mean as 4.7) which contributed to a statistically significant finding of gender bias. Thus, in this case the disagreement is 0.4. Because this is on a seven-point scale, the maximal possible disagreement is 6 (7–1 = 6). Thus, the GJI becomes:

Equation 6: The GJI for Gender Bias in One Condition of Krueger & Rothbart’s (1988) First Study

GJI = (6 - 0.4) / 6 = 5.6 / 6 ≈ 0.933

Again, this does not negate the evidence of bias. It does show, however, that even in their study finding bias, perceivers judged the targets as being far more similar than different.
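The same helper handles Equation 6 once the scale range is used as the maximum (recall that 5.1 and 4.7 are our estimates from Krueger and Rothbart’s figure).

```python
# Seven-point scale: maximal possible disagreement is the scale range, 7 - 1 = 6.
kr_gji = goodness_of_judgment_index(max_error=7 - 1, observed_error=5.1 - 4.7)
print(round(kr_gji, 3), interpret_gji(kr_gji))  # 0.933 minimal bias
```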

Kahan et al. (2016)

We selected one of the results from Kahan, Hoffman, Evans, Devins, Lucci & Cheng (2016) to illustrate how the GJI can be used when participants estimate percentages. One result compared people with differing worldviews as measured by the Cultural Cognition Worldview Scale (which correlated with left/right politics). People high in “Hierarchical Communitarianism” (rightish) were compared to those high in “Egalitarian Individualism” (leftish) with respect to judgments of an ethical violation by a police officer who inappropriately disclosed confidential information to members of a pro-life center. The largest bias was among the student sample. The outcome is whether participants saw the disclosure of confidential information as a legal violation.

Among the students, 86% of Egalitarian Individualists and 63% of Hierarchical Communitarians saw the disclosure as a legal violation, which was justifiably interpreted as bias. But how much bias relative to unbiased responding? Maximal possible disagreement in this case is 100% (it is hypothetically possible that 0% of one group and 100% of the other would have judged the disclosure illegal). Observed disagreement was 86% - 63% = 23%. So:

Equation 7: GJI for Student Sample Perceptions of Legal Violations in the Pro-Life Center Scenario

GJI = (100 - 23) / 100 = 77 / 100 = 0.77

This is the first example here wherein the original interpretation of substantial bias is confirmed as per the SEM/GFI heuristic of “less than .8 is bad fit.” This is also the subgroup that showed the largest bias on this outcome, and, even here, the GJI indicates that there was more agreement than bias. This finding is consistent with a considerable body of research finding that political biases are often more powerful than many other biases typically studied (e.g., Carmines & Nassar, 2022; Iyengar & Westwood, 2015).
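For percentage outcomes, the maximum is simply 100 percentage points, so Equation 7 reduces to a one-line call to the same helper.

```python
# Kahan et al. (2016), student sample: 86% vs. 63% judging the disclosure illegal.
kahan_gji = goodness_of_judgment_index(max_error=100, observed_error=86 - 63)
print(round(kahan_gji, 2), interpret_gji(kahan_gji))  # 0.77 substantial bias
```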

Conclusions about the GJI

These are only three examples of calculating and interpreting the GJI. Given the simplicity of the calculation, it stands as a capable tool for quantifying, independent of p-values or statistical tests, the amount of bias relative to accuracy and agreement. Using the GJI could help correct and properly contextualize findings of bias that, historically, may have been overstated.

Nonetheless, both what constitutes a “large” or “small” bias, and whether either is “important” depends on context and the goals of judgment and decision making. A large bias in deciding whether to bring an umbrella to work may be of trivial import; a tiny bias in calculating the distance for a Moon landing could produce disaster. The GJI addresses the size of bias relative to accuracy or agreement; it provides no information about the importance of any particular bias.

Although we have focused on maximal possible disagreement as the denominator for the GJI, it might be possible to use other denominators. For example, one might use maximal empirical disagreement or disagreement that results from random responding. Any change to the denominator would, of course, change the meaning of the GJI.

Certainly, researchers could use different denominators if they articulated compelling reasons for and a viable path to doing so. Absent that, however, we recommend defaulting to the maximal possible disagreement or error as the denominator. The approach we took here is simple and objective: If one uses maximal possible error or disagreement, the values can usually be obtained by identifying the upper and lower bounds of the measure – no subjectivity or complex modeling required. Simple “improper” models often outperform proper statistical models (Dawes, 1979; Czerlinski, Gigerenzer & Goldstein, 1999). It is in this spirit that we proposed the GJI.

The New Wave of Bias Research: Social Justice

Social psychology has addressed issues of what is now called “social justice” from its earliest publications (e.g., LaPiere, 1936). Nonetheless, around 2000, a new outburst of interest in bias emerged framed around phenomena related to social justice. “Social justice” is fundamentally a political term devoid of scientific meaning. What constitutes justice is moral, legal, and political, not something that can be resolved scientifically. Of course, this does not mean social science can say nothing about issues related to justice (social, or any other type). Perhaps the most common type of “social justice” research in psychology relates to inequality, discrimination, prejudice, gaps, and the like.

We next turn to some of the major phenomena that came to the fore in the outburst of research related to social justice that started around 2000. The first three (implicit bias, stereotype threat, and microaggressions) all were widely interpreted as producing Wow Effects! This included the first paper on the workhorse method for studying “implicit bias,” the implicit association test (Greenwald, McGhee & Schwartz 1998), the first two papers on stereotype threat (Spencer, Steele & Quinn, 1999; Steele & Aronson 1995) and the paper that put microaggressions on the map for most psychologists (Sue et al. 2007). As per Google Scholar, these four papers have been cited approximately 40,000 times as of this writing. Furthermore, all three areas of research have been presumed to be sufficiently sound to justify interventions to change the real world, such as diversity trainings, implicit bias trainings, and microaggression trainings. A closer look at all three areas raises questions about whether any of this was scientifically justified.

Stereotype Threat, Implicit Bias, and Microaggressions

This wave of bias research shows many of the same Meehl-like earmarks as the earlier waves. Masses of empirical studies have been inspired by stereotype threat, implicit bias, and microaggressions. Next, therefore, we briefly summarize how researchers studying these concepts, like those who produced prior research on bias, have sold far more than they have, so far, delivered.

There are only two pre-registered attempts to replicate stereotype threat effects for women and math; both failed (Finnigan & Corker, 2016; Flore, Mulder & Wicherts, 2018). With respect to race and stereotype threat, the overstatement and overselling of findings (in journal articles, textbooks, and the popular press) manifests as the claim that, but for stereotype threat, Black and White standardized test scores would be equal (see Sackett, Hardison & Cullen, 2004; Tomeh & Sackett, 2022 for content analyses and quotes). However, no such result was ever reported in any study (see Jussim et al., 2016; Sackett et al., 2004 for reviews). We know of no published pre-registered attempts to replicate the stereotype threat effect for race. Until such work is published, given the original overselling and the failed replications for women and math, the reality of the phenomenon is suspect.

The consequences of researchers overselling findings manifest differently, but in many ways more seriously, for implicit bias and microaggressions. For both, advocates promised to detect pervasive unconscious racism (Banaji & Greenwald, 2016; Williams, 2020). Unfortunately, despite large numbers of publications, neither has delivered on this promise. Neither has demonstrated anything “unconscious.” People can predict their IAT scores well (Hahn, Judd, Hirsh & Blair, 2014). We know of no empirical study that has even attempted to test whether microaggressions are unconscious (see Cantu & Jussim, 2021; Lilienfeld, 2017). If the “unconscious” aspects of implicit bias and microaggressions at best have never been empirically demonstrated, and at worst do not exist, then claims that these concepts involve unconscious racism are wildly overstated and scientifically unjustified.

Both have also so far failed to produce empirical research supporting the requisite causal relations to make many of their core claims viable. For example, there is no evidence that implicit bias causes discrimination (Jussim, Careem, Goldberg, Honeycutt & Stevens, in press) or that racism causes microaggressions (Cantu & Jussim, 2021; Lilienfeld, 2017). Stereotype threat, implicit bias, and microaggressions, then, appear to fit the Meehlian observation about areas “characterized by initial enthusiasm and ambiguous overall results.” Perhaps this is why implicit bias has been called “delusive” (Corneille & Hütter, 2020) and a line of research suffering from “degeneration” (Cyrus-Lai et al. 2022). Accordingly, implicit bias and microaggression researchers have been characterized as having made strong claims based on weak or inadequate evidence (Blanton, Jaccard, Klick, Mellers, Mitchell & Tetlock, 2009; Lilienfeld, 2017).

Perhaps, some might argue, whether the literatures on stereotype threat, implicit bias and microaggressions are oversold and weakly evidenced is beside the point. Gender and racial discrimination, this argument goes, are so well-established that something must explain them. Next, therefore, we review recent research on discrimination.

Audit Studies

Our review relies heavily on two meta-analyses of audit studies, one each on gender and racial discrimination in the workplace. Audit studies are one of the strongest methodological tools available for assessing discrimination. First, they are experiments, so they are well suited to testing whether bias causes unequal outcomes. Typically, targets who are otherwise identical or equivalent differ on some demographic characteristic and apply for something (such as a job). If Bob receives more callbacks or interviews than Barbara, this can be attributed to sex discrimination. For racial discrimination studies, names can be strongly linked to racial/ethnic groups through pre-testing. If Greg (established as conveying being White) receives more callbacks or interviews than does Jamal (established as conveying being Black), the result can be attributed to racial discrimination. Second, they are conducted in the real world, for example, by having fictitious targets apply for advertised jobs. These two strengths (strong methods for causal inference and tests conducted in the real world) render audit studies one of the best ways to test for discrimination.

Gender Discrimination

Workplace gender discrimination

A recent meta-analysis examined 85 audit studies, including over 360,000 job applications, conducted from 1976 to 2020 (Schaerer et al, 2023). The audit studies examined whether otherwise equivalent men or women were more likely to receive a callback after applying for a job.

The meta-analysis had two unique strengths that, in our view, render it one of the strongest meta-analyses on this, or any other, topic yet performed. First, the methods and analyses were pre-registered, thereby precluding the undisclosed flexibility that can permit researchers to cherry-pick findings to support a narrative and enhance chances of publication. Few existing meta-analyses in psychology have been pre-registered.

Second, they hired a “red team” – a panel of experts paid to critically evaluate the proposed methods and analyses, and the draft of the report after the study was conducted. The red team included four women and one man. Three had expertise in gender studies. One was a qualitative researcher and another a librarian (for critical feedback regarding the comprehensiveness of the literature search). The goal was to obtain critical peer review before the study was conducted to improve it.

There were several key results:

  1. Overall, men were statistically significantly less likely than women to receive a callback (odds ratio of .91). This finding was reduced to nonsignificance (no bias in either direction) when certain controls were added. Even then, however, there was no bias against women.

  2. Men were much less likely than women to receive a callback for female-typed jobs (odds ratio of .75).

  3. There were no statistically significant differences in the likelihood of men or women receiving callbacks for male-typical or gender-balanced jobs.

  4. Analysis of the discrimination trend over time found that women were disadvantaged in studies conducted before 2009. After 2009 the trend reversed, with a slight tendency to favor women.

  5. They also had laypeople (N = 499) and academics (N = 312) predict the results of the meta-analysis. Both groups predicted large biases favoring men and erroneously predicted that this bias persisted into the present. Last, expertise made no difference: academics who had published on gender were as inaccurate as those who had not.

The authors concluded: “Contrary to the beliefs of laypeople and academics revealed in our forecasting survey, after years of widespread gender bias in so many aspects of professional life, at least some societies have clearly moved closer to equal treatment when it comes to applying for many jobs.” Their results, however, do raise an interesting question: Why do so many people, especially academics who should know better, vastly overestimate sex discrimination? Although we do not aspire to a full answer to this question, the next sections provide evidence for one likely contributor: academics ignore evidence disconfirming biases against women in academia.

Gender bias in peer review

By “peer review” here we include not only biases in academic publishing, but also in grants and hiring. Although reviewing that vast literature is beyond the scope of the present paper, there are ample studies showing biases against women (e.g., Moss-Racusin et al, 2012), biases against men (e.g., Lloyd, 1990; Williams & Ceci, 2015), and no bias (e.g. Forscher, Cox, Brauer, & Devine, 2019).

A particularly striking contrast can be found between Moss-Racusin et al. (2012) and Williams and Ceci (2015), both of whom studied faculty in science fields. Moss-Racusin et al. (2012) performed a single study (N = 127) and found a male applicant for a lab manager position was evaluated more positively than was an identical female applicant. Williams and Ceci (2015) performed five studies (total N = 873) and found biases favoring women for a faculty position.

In addition to the opposite findings, this contrast is striking because, despite both being published in the same journal only a few years apart, Moss-Racusin et al. (2012) has been cited (as of this writing, January 21, 2024), over 3900 times, whereas Williams and Ceci (2015), 497 times. If we only consider citations since 2016 (after the Williams and Ceci paper had been out for a year), the counts are 3350 and 470. Thus, there are almost 3000 academic papers that cited Moss-Racusin et al. (2012) without even acknowledging the existence of Williams and Ceci’s (2015) opposite evidence. This pattern may help explain Schaerer et al.’s (2023) findings that academics vastly overestimated biases against women. Academic publishing may be filled with “scientific” articles testifying to the enduring power of biases against women, not because that is the weight of the evidence, but because evidence of biases favoring women is ignored.

This, however, is a comparison of only one pair of studies. It provides no evidence regarding whether such a pattern is common. To address this limitation, Honeycutt and Jussim (2020) identified all the papers they could find on gender bias in peer review published in 2015 or earlier (to allow time for citations to accumulate). Among those found, four reported biases favoring men and six reported either unbiased responding or biases favoring women. They then examined their citation counts and sample sizes (one important indicator of study credibility).

The four studies showing biases against women had median citation counts of 51.5 per year and a median sample size of 182.5; the studies showing egalitarian responding or biases against men had median citation counts of 9 per year and a median sample size of 2311.5. The studies failing to show biases against women, by virtue of their vastly larger sample sizes, are far more credible. Yet those studies receive less than one fifth the academic citations received by papers finding biases against women. Even unbiased academics, when reading this literature, will find vastly more articles referencing biases against women than articles referencing unbiased responding or biases against men. This could partially explain why the academics’ predictions about gender bias were so poor in the Schaerer et al. (2023) study.

Claims of sexism in science also have another problem: a recent series of far more highly powered close attempts to replicate Moss-Racusin et al. (2012) not only failed to do so, they found the opposite: biases favoring women. This is discussed next.

Moss-Racusin replication

Because of the conflicting findings and small scale of Moss-Racusin et al. (2012), Honeycutt, Jussim, Careem & Lewis (2024) conducted three large-sample direct replications. These studies were conducted under a registered replication report format (meaning that all methods and planned analyses were submitted as a proposal) and, on the basis of that proposal, the report received an “in principle” acceptance at Metapsychology. Although the final research report is not yet submitted, the studies and certain key analyses have been completed, and we report results for one of them here (results were very similar across the three studies).

Honeycutt et al.’s (2024) first replication study included over 500 faculty in biology, chemistry and physics. It failed to replicate any of Moss-Racusin et al.’s primary findings. That is, it found biases favoring women for the four primary variables on which Moss-Racusin et al. (2012) found biases favoring men. The differences were not large (effect sizes, d, were typically in the .2–.3 range), but all were statistically significant at p < .01.

We speculate that there are three possible explanations for the discrepancy between these findings and those of Moss-Racusin et al. (2012): 1. something was wrong with their study; 2. something was wrong with Honeycutt et al.’s (2024) study; or 3. both studies are valid, but something changed from the early 2010s to the early 2020s. In light of Schaerer et al.’s (2023) findings, we suspect the third explanation is the most likely.

The Racial Discrimination Paradox

Audit studies find substantial racial discrimination

The most recent review and meta-analysis of which we are aware found 21 audit studies of racial discrimination in hiring since 1989 (Quillian et al. 2017). The studies included over 55,000 applications submitted for over 26,000 jobs. Its main findings were:

  1. White applicants received 36% more callbacks than did Black applicants.

  2. This difference remained steady from 1989 to 2015. There was weak evidence that it had increased over that time.

Thus, according to this high-quality meta-analysis, there was consistent evidence of substantial and meaningful racial discrimination. In contrast, however, several recent studies found evidence of extremely low levels of discrimination; these are reviewed next.

Recent studies showing very low levels of discrimination

Three recent papers including nine separate studies found low levels of racial discrimination. The first found anti-Black discrimination 1.3 percent of the time (Peyton & Huber, 2021). This study used the ultimatum game to test for anti-Black discrimination. In the game, a first player proposes to the second how to divide some money. For example, the first player may be given a dollar to divide and offers 30 cents to the second. If the second player accepts, the first receives 70 cents and the second 30 cents. If the second rejects this division, neither gets anything.

Participants played the ultimatum game 25 times with either Black or White partners. The total number of offers accepted or refused was over 18,000. Racial discrimination was determined by the frequency with which White players rejected offers from Black players that would have been accepted from White players. This happened 1.3 percent of the time.

Campbell and Brauer (2021) conducted surveys, experiments, and a meta-analysis examining discrimination at the University of Wisconsin-Madison. We focus exclusively on the seven experiments addressing racial discrimination (including discrimination against Muslims). All studies examined naturally occurring interactions on campus, such as door-holding, asking directions, and sitting next to a target on a bus.

One study found that students held a door for a White person 5% more often than for a Black person. Another found that a White actor requesting directions received them 9% more often than an Asian actor and 6% more often than a Muslim actor. Another found that a White actor received help 18% more often than did a Muslim actor, but 20% less often than did an Asian actor. Another found that a Muslim actor was treated with more social distance on a bus 6% of the time.

They also conducted two audit studies regarding job applications. One found that a White applicant received 7% more responses than did an Arab applicant. Another found that a White applicant received 8% more responses than did a Black applicant. Simply averaging the differences across all studies produces an overall discrimination rate of about 8%.

Nodtveldt et al. (2021) examined discrimination in the selection of Airbnb listings among a nationally representative sample of 801 Norwegians. The host was identified as either ethnically Norwegian or ethnically Somali. Overall, there was a 9.3 percent preference for the listing with the ethnically Norwegian host.

The discrimination paradox

Taking these findings together, we have a paradox. Quillian et al. (2017), in a high-quality meta-analysis, found job discrimination at 36%; the recent studies reviewed in detail here (Campbell & Brauer, 2021; Nodtveldt et al., 2021; Peyton & Huber, 2021) found discrimination at very low levels, typically in the single digits.

It is possible to generate explanations that render these discrepant patterns sensible. Perhaps there are deep flaws in the studies finding little discrimination but not in the ones finding substantial discrimination (or vice versa) and we simply failed to uncover them. Perhaps differences between the studies (their samples, the situations they examined, etc.) explain the differences. Next, however, we argue that such explanations are not necessary because the findings are actually completely compatible without recourse to additional explanatory variables.

Resolving the discrimination paradox

The key to resolving the paradox is understanding that discrimination can be assessed at two levels of analysis. The 36% figure obtained by Quillian et al. (2017) refers to differences in callbacks received by Black and White applicants. It is a difference between the experiences of Black and White applicants. It is not the difference between the responses of companies to Black and White applicants. In contrast, the three papers finding single-digit discrimination (Campbell & Brauer, 2021; Nodtveldt et al., 2021; Peyton & Huber, 2021) assessed acts by potential perpetrators of discrimination (Nodtveldt et al., 2021, reported the results both ways).

The importance of this difference can be seen with an example that starts with Quillian et al.’s (2017) figure of White applicants receiving 36% more callbacks than did Black applicants. First consider a simple hypothetical:

There are 1000 applicants for a type of job. There are 500 Black and 500 White applicants with equivalent records. In this hypothetical, they receive a total of 236 callbacks.

  1. If there were no discrimination, Black and White applicants would receive identical numbers of callbacks, 118 in each case.

  2. Were the Quillian et al. (2017) levels of discrimination to occur, with White applicants receiving 36% more callbacks, 136 White applicants would receive callbacks; among Black applicants, 100 would receive callbacks.

  3. In this hypothetical, discriminatory acts occurred 18 out of 1000 times, or 1.8% of the time (the 136 callbacks to White applicants equal the 118 egalitarian callbacks plus 18 discriminatory ones).

This resolves the discrimination paradox because it shows how 36% discrimination from the target’s standpoint can result from acts of discrimination occurring only 1.8% of the time. There is no substantial conflict between the results of Quillian et al.’s (2017) meta-analysis and those of the recent studies finding single-digit levels of discrimination (Campbell & Brauer, 2021; Nodtveldt et al., 2021; Peyton & Huber, 2021).
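The arithmetic in the hypothetical above is easy to verify directly. The sketch below (our illustration, under the hypothetical’s assumptions of equal-sized applicant pools and a fixed total number of callbacks) converts a target-level callback gap into the implied rate of discriminatory acts.

```python
def discriminatory_act_rate(total_applicants: int, total_callbacks: int,
                            callback_ratio: float) -> float:
    """Rate of discriminatory acts implied by a target-level callback gap.

    callback_ratio is White callbacks divided by Black callbacks
    (1.36 corresponds to 'White applicants received 36% more callbacks').
    """
    black_callbacks = total_callbacks / (1 + callback_ratio)
    white_callbacks = total_callbacks - black_callbacks
    egalitarian_callbacks = total_callbacks / 2  # each group's share with no bias
    discriminatory_acts = white_callbacks - egalitarian_callbacks
    return discriminatory_acts / total_applicants


print(round(discriminatory_act_rate(1000, 236, 1.36), 3))  # 0.018 -> 1.8% of applications
```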

Implications

One implication is that minimal levels of acts of discrimination can have a substantial impact on the targets of discrimination. There are longstanding debates about whether small biases are important. Our resolution to the discrimination paradox suggests that some small biases produce larger disparities than one might assume when one discovers that acts of discrimination occur in the single digits.

Another implication is that it can explain why certain types of diversity initiatives are not effective. Diversity training (Devine & Ash, 2022) and implicit bias training (Paluck, Porat, Clark & Green, 2021) are generally ineffective. Our resolution to the discrimination paradox suggests that they are not likely to accomplish very much because of a floor effect – there are too few acts of discrimination for even an otherwise effective intervention to accomplish much.

Conclusion

This review provides a cautionary tale regarding overwrought claims about lay biases. We have reviewed evidence that psychology has been periodically overrun with excess enthusiasm for bias, expressed as large waves of research supposedly demonstrating such biases, typically accompanied by leaping to unjustified conclusions about their credibility, power, or pervasiveness. This was found to be true of waves of research on how motivation influences basic perception, heuristics and biases in judgment and decision-making, and a slew of biases involved in social justice. In some cases, the flaws and weaknesses of the research were sufficiently severe that the bodies of research may not have produced anything standing the tests of time, skeptical vetting, and falsification. In others, work touted as revealing powerful and important biases captured researchers’ imaginations largely by virtue of ignoring work producing conflicting findings. The work revealing heuristics or biases was sound, but many of those heuristics and biases were subsequently found to serve people well in the real world.

One important limitation of this review is that whereas it did address the meaning of research on a variety of biases, it did not address fundamental issues of human (ir)rationality, which were beyond the scope of the present review. Another limitation is that it did not aspire to reach general conclusions about the power of biases compared to accuracy. Included in the present review was evidence of both strong biases (Kahan et al, 2016; Quillian et al., 2017) and modest ones (Hastorf & Cantril, 1954; Krueger & Rothbart 1988). Although we identified overwrought or unjustified claims about bias emerging from research in a range of areas (New Look, social perception, implicit bias, microaggressions, and stereotype threat), reaching conclusions about the power of bias more generally was beyond the scope of this review.

Nonetheless, in addition to critically evaluating individual studies, the present review addressed bias as found in historically major areas of research (New Look, social perception, implicit bias, microaggressions, stereotype threat, discrimination). This review indicated that many of the core claims in those areas did not hold up to critical scrutiny. As such, this review has an important scientific implication: when Wow Effects! appear in the literature, perhaps they should be treated with more skepticism than has been common.

It also has an important implication for applications: rather than rush to change the world based on compelling narratives emerging from preliminary research seeming to provide Wow Effects!, it may be wise to hold off telling the world that these supposedly amazing, dramatic, world-changing findings can be taken at face value. It would be similarly wise to exercise the intellectual restraint necessary to hold off designing large-scale, real-world interventions (such as implicit bias or microaggression trainings) until the wider scientific community has had many years, often decades, to skeptically vet whether they hold up to what early advocates claim them to be. Instead, experimental pilot programs could be designed to examine the effectiveness of such interventions; if such pilot programs produce their intended benefits, prove cost-effective, and lack unintended negative consequences, they could then be scaled up to determine if larger programs are similarly effective. Organizations should implement such programs only if there is clear evidence that large interventions are actually effective (this analysis presumes that those organizations care about effectiveness; if they implement these programs to achieve other goals, e.g., public relations, that is beyond the scope of the present review).

In 2004, Krueger and Funder called for a more balanced psychology. Balance did not mean denying bias when it manifested. But, as they wrote (p. 313, abstract): “A more balanced social psychology would yield not only a more positive view of human nature, but also an improved understanding of the bases of good behavior and accurate judgment, coherent explanations of occasional lapses, and theoretically grounded suggestions for improvement.” Although it is not clear that academic psychologists have heeded their call, nearly 100 years of psychological research on a variety of biases has vindicated it.

Acknowledgements

This paper was based on a talk given at the festschrift held in honor of Joachim Krueger at Monte Verità, sponsored by the University of Zurich.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Lee Jussim: Conceptualization (lead), mathematical models (lead), formal analysis (equal), original draft (lead), review and editing (supporting).

Nathan Honeycutt: Formal analysis (equal), original draft (supporting), review and editing (lead).

DOI: https://doi.org/10.5334/spo.77 | Journal eISSN: 2752-5341
Submitted on: Jan 30, 2024 | Accepted on: Jun 12, 2024 | Published on: Jul 1, 2024
Published by: Ubiquity Press

© 2024 Lee Jussim, Nathan Honeycutt, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.