
How many participants do we have to include in properly powered experiments? A tutorial of power analysis with reference tables

By: Marc Brysbaert  
Open Access | Jul 2019

Figures & Tables

Figure 1

What happens to the significance of an effect when a study becomes more powerful? Red areas are p < .05, two-tailed t-test; green area is not significant.

Table 1

The outcome in terms of p-values a researcher can expect as a function of the effect size at the population level (no effect, or an effect of d = .4) and the number of participants tested in a two-tailed test. The outcome remains the same across sample sizes when there is no effect at the population level, but it shifts towards smaller p-values in line with the hypothesis when there is an effect at the population level. For N = 10, the statistical test will be significant at p < .05 in 15 + 7 + 2 = 24% of the studies (so, this study has a power of 24%). For N = 30, the test will be significant in 24 + 21 + 14 = 59% of the studies. For N = 100, the test will be significant in 6 + 16 + 76 = 98% of the studies, of which the majority will be significant at p < .001. At the same time, even for this overpowered study researchers have a 7% chance of finding a p-value hovering around .05.

Outcome                                  | N = 10, d = 0 | N = 10, d = .4 | N = 30, d = 0 | N = 30, d = .4 | N = 100, d = 0 | N = 100, d = .4
p < .001 against hypothesis              | 0.0005        | ≈0%            | 0.0005        | ≈0%            | 0.0005         | ≈0%
.001 ≤ p < .01 against hypothesis        | 0.0045        | ≈0%            | 0.0045        | ≈0%            | 0.0045         | ≈0%
.01 ≤ p < .05 against hypothesis         | 0.0200        | 0.0006         | 0.0200        | ≈0%            | 0.0200         | ≈0%
.05 ≤ p < .10 against hypothesis         | 0.0250        | 0.0012         | 0.0250        | ≈0%            | 0.0250         | ≈0%
p ≥ .10 against hypothesis               | 0.4500        | 0.1011         | 0.4500        | 0.0142         | 0.4500         | ≈0%
p ≥ .10 in line with hypothesis          | 0.4500        | 0.5451         | 0.4500        | 0.2783         | 0.4500         | 0.0092
.05 ≤ p < .10 in line with hypothesis    | 0.0250        | 0.1085         | 0.0250        | 0.1162         | 0.0250         | 0.0114
.01 ≤ p < .05 in line with hypothesis    | 0.0200        | 0.1486         | 0.0200        | 0.2412         | 0.0200         | 0.0565
.001 ≤ p < .01 in line with hypothesis   | 0.0045        | 0.0735         | 0.0045        | 0.2144         | 0.0045         | 0.1618
p < .001 in line with hypothesis         | 0.0005        | 0.0214         | 0.0005        | 0.1357         | 0.0005         | 0.7610
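The entries in Table 1 can be approximated with a short simulation. The sketch below assumes a two-tailed one-sample test on standardised scores with the population standard deviation treated as known (a z-test); the exact procedure behind the published numbers is described in the article, so the simulation is illustrative only.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2019)

def p_value_table(d, n, n_sims=50_000):
    """Simulate n_sims studies with n observations drawn from N(d, 1),
    test the mean against 0 with a two-tailed z-test (sigma assumed known),
    and return the proportion of p-values falling in the bins of Table 1."""
    means = rng.normal(loc=d, scale=1.0, size=(n_sims, n)).mean(axis=1)
    p = 2 * stats.norm.sf(np.abs(means) * np.sqrt(n))   # two-tailed p-value
    in_line = means > 0                                  # sample effect in the hypothesised direction
    edges = [0.0, .001, .01, .05, .10, 1.01]
    labels = ["p < .001", ".001-.01", ".01-.05", ".05-.10", ">= .10"]
    out = {}
    for lab, lo, hi in zip(labels, edges[:-1], edges[1:]):
        band = (p >= lo) & (p < hi)
        out[lab] = {"in line": (band & in_line).mean(),
                    "against": (band & ~in_line).mean()}
    return out

for n in (10, 30, 100):
    res = p_value_table(d=0.4, n=n)
    power = sum(res[k]["in line"] + res[k]["against"]
                for k in ("p < .001", ".001-.01", ".01-.05"))
    print(f"N = {n}: estimated power = {power:.2f}")

With d = .4 this reproduces the roughly 24%, 59%, and 98% power figures mentioned in the caption; with d = 0 each column simply returns the nominal error rates.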
Figure 2

Tens of studies on the difference in vocabulary size between old and young adults. Positive effect sizes indicate that old adults know more words than young adults. Each circle represents a study. Circles at the bottom come from studies with few participants (about 20); circles at the top come from large studies (300 participants or more).

Source: Verhaeghen (2003).

Figure 3

Output of G*Power when we ask for the required sample size for f = .2 and three independent groups. This number is an underestimate because the f-value for a design with d = .4 between the extreme conditions and a smaller effect size for the in-between condition is slightly lower than f = .2. In addition, this concerns only the power of the omnibus ANOVA test, with no guarantee that the population pattern will be observed in pairwise comparisons of the sample data.
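As a rough cross-check of this output, the total sample size for the omnibus one-way ANOVA can also be computed analytically. The sketch below assumes the statsmodels package is available; it mirrors the f = .2, three-group, alpha = .05, power = .80 scenario entered in G*Power.

# Analytic cross-check of the G*Power screenshot (assumes statsmodels is installed).
from statsmodels.stats.power import FTestAnovaPower

# Total N for the omnibus one-way ANOVA: Cohen's f = .2, 3 groups, alpha = .05, power = .80.
n_total = FTestAnovaPower().solve_power(effect_size=0.2, k_groups=3,
                                        alpha=0.05, power=0.80)
print(round(n_total))   # total number of participants across the three groups

As the caption notes, this only powers the omnibus F-test; pairwise comparisons between the conditions need their own power analysis.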

Table 2

Example of data (reaction times) from a word recognition experiment as a function of prime type (related, unrelated).

Participant    | Related | Unrelated | Priming
p1             | 638     | 654       | 16
p2             | 701     | 751       | 50
p3             | 597     | 623       | 26
p4             | 640     | 641       | 1
p5             | 756     | 760       | 4
p6             | 589     | 613       | 24
p7             | 635     | 665       | 30
p8             | 678     | 701       | 23
p9             | 659     | 668       | 9
p10            | 597     | 584       | –13
Mean           | 649     | 666       | 17
Standard dev.  | 52.2    | 57.2      | 17.7
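A minimal sketch of how the effect sizes behind Table 2 can be computed, assuming numpy and scipy are available: dz divides the mean priming effect by the standard deviation of the difference scores, whereas dav divides it by the average of the two condition standard deviations.

import numpy as np
from scipy import stats

related   = np.array([638, 701, 597, 640, 756, 589, 635, 678, 659, 597])
unrelated = np.array([654, 751, 623, 641, 760, 613, 665, 701, 668, 584])
priming   = unrelated - related                     # the Priming column of Table 2

t, p = stats.ttest_rel(unrelated, related)          # paired t-test, df = 9
d_z  = priming.mean() / priming.std(ddof=1)         # 17 / 17.7 ≈ 0.96
d_av = priming.mean() / ((related.std(ddof=1) + unrelated.std(ddof=1)) / 2)   # 17 / 54.7 ≈ 0.31

print(f"t(9) = {t:.2f}, p = {p:.3f}, dz = {d_z:.2f}, dav = {d_av:.2f}")

dz is about three times larger than dav here because the related and unrelated scores correlate strongly (r ≈ .95 in these data), which is the distinction illustrated in Figure 4 below.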
Figure 4

If one increases the correlation among the repeated measurements, G*Power indicates that fewer observations are needed. This is because G*Power takes the effect size to be dav, whereas the user often assumes it is dz.

Table 3

Correlations observed between the levels of a repeated-measures factor in a number of studies with different dependent variables.

Study                       | Dependent variable                              | Correlation
Camerer et al. (2018)       |                                                 |
    Aviezer et al. (2012)   | Valence ratings                                 | –0.85
    Duncan et al. (2012)    | Similarity identification                       | 0.89
    Kovacs et al. (2010)    | Reaction time to visual stimuli                 | 0.84
    Sparrow et al. (2011)   | Reaction time to visual stimuli                 | 0.81
Zwaan et al. (2018)         |                                                 |
    associative priming     | Reaction time to visual stimuli (Session 1)     | 0.89
                            | Reaction time to visual stimuli (Session 2)     | 0.93
    false memories          | Correct related-unrelated lures (Session 1)     | –0.47
                            | Correct related-unrelated lures (Session 2)     | –0.14
    flanker task            | RT stimulus congruent incongruent (Session 1)   | 0.95
                            | RT stimulus congruent incongruent (Session 2)   | 0.93
    shape simulation        | RT to shape matching sentence (Session 1)       | 0.89
                            | RT to shape matching sentence (Session 2)       | 0.92
    spacing effect          | Memory of massed v. spaced items (Session 1)    | 0.35
                            | Memory of massed v. spaced items (Session 2)    | 0.55
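The correlations in Table 3 translate directly into the gain a repeated-measures design offers. A small sketch under the common assumption that both conditions have roughly equal standard deviations: the standard deviation of the difference scores is then SD·sqrt(2(1 − r)), so dz = dav / sqrt(2(1 − r)).

import math

def dz_from_dav(d_av, r):
    """Convert d_av (difference / average condition SD) to d_z
    (difference / SD of the difference scores), assuming equal condition SDs."""
    return d_av / math.sqrt(2 * (1 - r))

# With the high correlations typical of the RT tasks in Table 3,
# a modest d_av corresponds to a much larger d_z:
for r in (0.5, 0.8, 0.9):
    print(r, round(dz_from_dav(0.4, r), 2))   # 0.4, 0.63, 0.89

This is also why the G*Power output in Figure 4 shows fewer required participants as the correlation increases: the program treats the entered effect size as dav and converts it internally.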
Figure 5

Different types of interactions researchers may be interested in. Left panel: fully crossed interaction. Middle panel: the effect of A is only present for one level of B. Right panel: The effect of A is very strong for one level of B and only half as strong for the other level.

Figure 6

Figure illustrating how you can convince yourself with G*Power that two groups of 15 participants allow you to find effect sizes of f = .23 in a split-plot design with two groups and five levels of the repeated-measures variable. When seeing this type of output, it is good to keep in mind that you need 50 participants for a typical effect in a t-test with related samples. This not only sounds too good to be true, it is also too good to be true.

Table 4

Responses of six participants (P1–P6) to four repetitions of the same stimulus (S1–S4).

     | S1 | S2 | S3 | S4
P1   | 7  | 5  | 10 | 6
P2   | 9  | 11 | 8  | 10
P3   | 12 | 9  | 17 | 14
P4   | 11 | 8  | 6  | 18
P5   | 7  | 3  | 3  | 6
P6   | 10 | 14 | 7  | 14
Table 5

First lines of the long notation of Table 4. Lines with missing values are simply left out.

Participant | Condition | Response
P1          | S1        | 7
P1          | S2        | 5
P1          | S3        | 10
P1          | S4        | 6
P2          | S1        | 9
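Reshaping Table 4 into the long notation of Table 5 is a one-liner in most software. Below is a sketch with pandas (assumed available), using the values of Table 4:

import pandas as pd

# Table 4 in wide format: one row per participant, one column per stimulus.
wide = pd.DataFrame({"Participant": ["P1", "P2", "P3", "P4", "P5", "P6"],
                     "S1": [7, 9, 12, 11, 7, 10],
                     "S2": [5, 11, 9, 8, 3, 14],
                     "S3": [10, 8, 17, 6, 3, 7],
                     "S4": [6, 10, 14, 18, 6, 14]})

# Long format: one row per participant x stimulus; missing responses would simply drop out.
long = (wide.melt(id_vars="Participant", var_name="Condition", value_name="Response")
            .dropna()
            .sort_values(["Participant", "Condition"]))
print(long.head())   # the first lines correspond to Table 5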
Figure 7

Illustration of how you can find a high correlation in the entire dataset (because of the group difference) and a low correlation within each group (because of the range restriction within groups).

Table 6

Intraclass correlations for designs with one repeated-measures factor (2 levels). The experiments of Zwaan et al. included more variables, but these did not affect the results much, so that the design could be reduced to a one-way design. The table shows the intraclass correlations, which mostly reach the desired level of ICC2 = .80 when they are calculated within conditions. The table also shows that the average reported effect size was dz = .76. If the experiments had been based on a single observation per condition, the average effect size would have been dz = .24, illustrating the gain that can be made by having multiple observations per participant per condition.

Study                       | Dependent variable                              | N participants | N conditions | N obs/cond | ICC1 (entire dataset) | ICC2 (entire dataset) | ICC1 (within condition) | ICC2 (within condition) | d (N = 1) | d (N = tot)
Camerer et al. (2018)       |                                                 |                |              |            |      |      |      |      |      |
    Aviezer et al. (2012)   | Valence ratings                                 | 14             | 2            | 88         | 0.01 | 0.60 | 0.23 | 0.96 | 0.94 | 1.43
    Kovacs et al. (2010)    | Reaction time to visual stimuli                 | 95             | 2            | 5          | 0.41 | 0.87 | 0.40 | 0.76 | 0.29 | 0.72
    Sparrow et al. (2011)   | Reaction time to visual stimuli                 | 234            | 2            | 8 & 16     | 0.10 | 0.91 | 0.10 | 0.81 | 0.03 | 0.10
Zwaan et al. (2018)         |                                                 |                |              |            |      |      |      |      |      |
    associative priming     | Reaction time to visual stimuli (Session 1)     | 160            | 2 × 2 × 2    | 30         | 0.28 | 0.96 | 0.31 | 0.93 | 0.17 | 0.81
                            | Reaction time to visual stimuli (Session 2)     |                |              |            | 0.30 | 0.96 | 0.31 | 0.92 | 0.18 | 0.94
    false memories          | Correct related-unrelated lures (Session 1)     | 160            | 2 × 2 × 2    | 9          | 0.02 | 0.29 | 0.27 | 0.77 | 0.20 | 0.97
                            | Correct related-unrelated lures (Session 2)     |                |              |            | 0.06 | 0.54 | 0.26 | 0.75 | 0.31 | 1.17
    flanker task            | RT stimulus congruent incongruent (Session 1)   | 160            | 2 × 2 × 2    | 32         | 0.40 | 0.98 | 0.41 | 0.96 | 0.15 | 0.70
                            | RT stimulus congruent incongruent (Session 2)   |                |              |            | 0.29 | 0.96 | 0.29 | 0.92 | 0.13 | 0.52
    shape simulation        | RT to shape matching sentence (Session 1)       | 160            | 2 × 2 × 2    | 15         | 0.42 | 0.93 | 0.42 | 0.91 | 0.08 | 0.27
                            | RT to shape matching sentence (Session 2)       |                |              |            | 0.49 | 0.97 | 0.50 | 0.94 | 0.17 | 0.50
    spacing effect          | Memory of massed v. spaced items (Session 1)    | 160            | 2 × 2 × 2    | 40         | 0.03 | 0.54 | 0.03 | 0.40 | 0.20 | 0.87
                            | Memory of massed v. spaced items (Session 2)    |                |              |            | 0.06 | 0.84 | 0.07 | 0.75 | 0.21 | 0.92
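The ICC1 and ICC2 values reported here can be estimated from a participants × observations matrix with a one-way random-effects ANOVA. The sketch below shows one standard way to do this (ICC1 = reliability of a single observation, ICC2 = reliability of the mean of the k observations); the exact routine behind the published values may differ, so treat it as illustrative.

import numpy as np

def icc_one_way(data):
    """data: n_participants x k_observations array for one condition.
    Returns (ICC1, ICC2) from a one-way random-effects ANOVA."""
    n, k = data.shape
    grand = data.mean()
    ms_between = k * ((data.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_within = ((data - data.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc2 = (ms_between - ms_within) / ms_between
    return icc1, icc2

# Example with the toy responses of Table 4, treated as one condition with k = 4 observations:
table4 = np.array([[7, 5, 10, 6], [9, 11, 8, 10], [12, 9, 17, 14],
                   [11, 8, 6, 18], [7, 3, 3, 6], [10, 14, 7, 14]])
print(icc_one_way(table4))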
Table 7

Intraclass correlations for between-groups designs. The table shows that the intraclass correlations mostly reached the desired level of ICC2 = .80 when multiple observations were made per condition and all the items were used. In general, this improved the interpretation (see in particular the study of Pyc & Rawson, 2010). At the same time, for some dependent variables (e.g., rating scales) rather stable data can be obtained with a few questions (see the study by Wilson et al., 2014).

Study                            | Dependent variable        | N participants | N conditions | N obs/cond | ICC1 (entire dataset) | ICC2 (entire dataset) | ICC1 (within condition) | ICC2 (within condition) | d (N = 1) | d (N = tot)
Camerer et al. (2018)            |                           |                |              |            |      |      |      |      |       |
    Ackerman et al. (2010)       | Evaluating job candidates | 599            | 2            | 8          | 0.47 | 0.87 | 0.47 | 0.88 | 0.10  | 0.13
    Gervais & Norenzayan (2012)  | Belief in God             | 531            | 2            | 1          | NA   | NA   | NA   | NA   | –0.07 | –0.07
    Karpicke & Blunt (2011)      | Text memory               | 49             | 2            | 1          | NA   | NA   | NA   | NA   | 0.83  | 0.83
    Kidd & Castano (2013)        | Emotion recognition       | 714            | 2            | 36         | 0.09 | 0.78 | 0.09 | 0.78 | –0.03 | –0.08
    Morewedge et al. (2010)      | M&Ms eaten                | 89             | 2            | 1          | NA   | NA   | NA   | NA   | 0.75  | 0.75
    Pyc & Rawson (2010)          | Word translations         | 306            | 2            | 48         | 0.14 | 0.89 | 0.14 | 0.88 | 0.12  | 0.30
    Shah et al. (2012)           | Dots-mixed task           | 619            | 2            | 1          | NA   | NA   | NA   | NA   | –0.03 | –0.03
    Wilson et al. (2014)         | Enjoyment ratings         | 39             | 2            | 3          | 0.82 | 0.93 | 0.73 | 0.89 | 1.32  | 1.44
Table 8

Numbers of participants required for various designs when d = .4, .5, and .6 and the data are analyzed with traditional, frequentist statistics (p < .05). The numbers of the d = .4 column are the default numbers to use. The higher values of d require dependent variables with a reliability of .8 at least. Therefore, authors using these estimates must present evidence about the reliability of their variables. This can easily be done by calculating the ICC1 and ICC2 values discussed above.

Traditional, frequentist analysis (p < .05)

                                           d = .4 | d = .5 | d = .6
1 variable between-groups
    • 2 levels                                200 |    130 |     90
    • 2 levels, null hypothesis               860 |    860 |    860
    • 3 levels (I = II > III)                 435 |    285 |    195
    • 3 levels (I > II > III)                1740 |   1125 |    795
1 variable within-groups
    • 2 levels                                 52 |     34 |     24
    • 2 levels, null hypothesis               215 |    215 |    215
    • 3 levels (I = II > III)                  75 |     50 |     35
    • 3 levels (I > II > III)                 300 |    195 |    130
Correlation                                   195 |    125 |     85
2 × 2 repeated measures
    • Main effect one variable                 27 |     18 |     13
    • Interaction (d v. 0)                    110 |     75 |     50
2 × 2 split-plot
    • Main effect between
        ◦ r = .5                              150 |    100 |     70
        ◦ r = .9                              190 |    120 |     90
    • Main effect repeated-measure             55 |     34 |     24
    • Interaction (d v. 0)                    200 |    130 |     90
    • ANCOVA
        ◦ r_rep_measure = .5                  160 |    100 |     70
        ◦ r_rep_measure = .9                  200 |    130 |     90
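For the simplest designs, the entries in the d = .4 column can be cross-checked analytically. The sketch below assumes the statsmodels package; it reproduces the two-group and the paired two-level entries, and shows how the numbers grow when power is raised to .90 (cf. the table with power = .90 further down).

from statsmodels.stats.power import TTestIndPower, TTestPower

# Two independent groups, d = .4, alpha = .05, power = .80:
n_per_group = TTestIndPower().solve_power(effect_size=0.4, alpha=0.05, power=0.80)
print(round(n_per_group))   # about 100 per group, i.e. about 200 participants in total

# One within-participants factor with 2 levels (paired t-test on dz = .4):
n_pairs = TTestPower().solve_power(effect_size=0.4, alpha=0.05, power=0.80)
print(round(n_pairs))       # close to the 52 participants listed above

# Raising power to .90 for the two-group design:
print(round(TTestIndPower().solve_power(effect_size=0.4, alpha=0.05, power=0.90)))  # ~132 per group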
Table 9

Numbers of participants required for various designs when d = .4, .5, and .6 and the data are analyzed with Bayesian statistics (BF > 10). The numbers of the d = .4 column are the default numbers to use. The higher values of d require dependent variables with a reliability of .8 at least. Therefore, authors using these estimates must present evidence about the reliability of their variables. This can easily be done by calculating the ICC1 and ICC2 values discussed above.

Bayesian analysis (BF > 10)

                                           d = .4 | d = .5 | d = .6
1 variable between-groups
    • 2 levels                                380 |    240 |    170
    • 2 levels, null hypothesis              2400 |   2400 |   2400
    • 3 levels (I = II > III)                 690 |    450 |    300
    • 3 levels (I > II > III)                2850 |   1800 |   1200
1 variable within-groups
    • 2 levels                                100 |     65 |     45
    • 2 levels, null hypothesis               720 |    720 |    720
    • 3 levels (I = II > III)                 125 |     80 |     55
    • 3 levels (I > II > III)                 540 |    340 |    240
Correlation                                   370 |    230 |    160
2 × 2 repeated measures
    • Main effect one variable                 52 |     32 |     23
    • Interaction (d v. 0)                    210 |    130 |     85
2 × 2 split-plot
    • Main effect between
        ◦ r = .5                              290 |    190 |    130
        ◦ r = .9                              360 |    220 |    160
    • Main effect repeated-measure            100 |     66 |     46
    • Interaction (d v. 0)                    390 |    250 |    170
    • ANCOVA
        ◦ r_rep_measure = .5                  300 |    190 |    130
        ◦ r_rep_measure = .9                  380 |    230 |    170
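There is no direct G*Power equivalent for Bayes factor design analysis, but entries like these can be approximated by simulation. The sketch below assumes the pingouin package for the default (JZS, r = .707) Bayes factor of a one-sample t-test; the priors and software behind the published numbers may differ, so the result is only indicative.

import numpy as np
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(72)

def prob_bf10_above(d, n, criterion=10, n_sims=2000):
    """Estimate how often n difference scores with true effect d
    produce a default Bayes factor BF10 larger than the criterion."""
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(loc=d, scale=1.0, size=n)
        tstat = stats.ttest_1samp(x, 0.0).statistic
        hits += float(pg.bayesfactor_ttest(tstat, n)) > criterion
    return hits / n_sims

# With the 100 participants listed for the two-level within-groups design:
print(prob_bf10_above(d=0.4, n=100))   # roughly .80 if the entry targets an 80% chance of BF > 10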
Table 10

Numbers of participants required for various designs when d = .4 and power is increased to 90%. The latter decreases the chance of missing an effect that is present in the population.

                                           d = .4, power = .9, p < .05 | d = .4, power = .9, BF > 10
1 variable between-groups
    • 2 levels                                 264 |    480
    • 2 levels, null hypothesis               1084 |   3600
    • 3 levels (I = II > III)                  570 |    840
    • 3 levels (I > II > III)                 2160 |   3450
1 variable within-groups
    • 2 levels                                  70 |    130
    • 2 levels, null hypothesis                271 |   1800
    • 3 levels (I = II > III)                  100 |    150
    • 3 levels (I > II > III)                  360 |    610
Correlation                                    260 |    460
2 × 2 repeated measures
    • Main effect one variable                  35 |     65
    • Interaction (d v. 0)                     145 |    270
2 × 2 split-plot
    • Main effect between
        ◦ r = .5                               200 |    360
        ◦ r = .9                               250 |    450
    • Main effect repeated-measure              70 |    130
    • Interaction (d v. 0)                     300 |    540
    • ANCOVA
        ◦ r_rep_measure = .5                   210 |    360
        ◦ r_rep_measure = .9                   260 |    460
Table 11

Comparison of dz and dav for repeated-measures designs in Zwaan et al. (2018). The comparisons show that dav < dz when r > .5 (see Table 3). Reporting dav in addition to dz for pairwise comparisons allows readers to compare the effect sizes of repeated-measures studies with those of between-groups studies.

Study                    | Dependent variable                              | N participants | N conditions | N obs/cond | dz (N = 1) | dz (N = tot) | dav
Zwaan et al. (2018)      |                                                 |                |              |            |            |              |
    associative priming  | Reaction time to visual stimuli (Session 1)     | 160            | 2 × 2 × 2    | 30         | 0.17       | 0.81         | 0.37
                         | Reaction time to visual stimuli (Session 2)     |                |              |            | 0.18       | 0.94         | 0.35
    false memories       | Correct related-unrelated lures (Session 1)     | 160            | 2 × 2 × 2    | 9          | 0.20       | 0.97         | 1.66
                         | Correct related-unrelated lures (Session 2)     |                |              |            | 0.31       | 1.17         | 1.76
    flanker task         | RT stimulus congruent incongruent (Session 1)   | 160            | 2 × 2 × 2    | 32         | 0.15       | 0.70         | 0.44
                         | RT stimulus congruent incongruent (Session 2)   |                |              |            | 0.13       | 0.52         | 0.20
    shape simulation     | RT to shape matching sentence (Session 1)       | 160            | 2 × 2 × 2    | 15         | 0.08       | 0.27         | 0.13
                         | RT to shape matching sentence (Session 2)       |                |              |            | 0.17       | 0.50         | 0.21
    spacing effect       | Memory of massed v. spaced items (Session 1)    | 160            | 2 × 2 × 2    | 40         | 0.20       | 0.87         | 1.00
                         | Memory of massed v. spaced items (Session 2)    |                |              |            | 0.21       | 0.92         | 0.87
DOI: https://doi.org/10.5334/joc.72 | Journal eISSN: 2514-4820
Language: English
Submitted on: Dec 17, 2018
Accepted on: May 22, 2019
Published on: Jul 19, 2019
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2019 Marc Brysbaert, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.