psymetadata: An R Package Containing Open Datasets from Meta-Analyses in Psychology

Josue E. Rodriguez; Donald R. Williams

doi:10.5334/jopd.61

Background

Meta-analysis is an indispensable technique for summarizing a set of similar primary studies that share a common topic. Key goals include determining the average effect size (Hedges & Pigott, 2001), determining whether study characteristics moderate the effect size (Hedges & Pigott, 2005), and quantifying the extent of heterogeneity in the observed effects (Higgins & Thompson, 2002; Ioannidis et al., 2007). These applications make meta-analysis pivotal to psychological research. Accordingly, it is essential to (1) teach psychological researchers how to be informed consumers of meta-analyses and (2) develop improved statistical methods for conducting meta-analyses. Both efforts depend on the availability of high-quality, easily accessible data.

To help meet these needs, we developed the R package psymetadata, which contains 22 recent open-source datasets from meta-analyses of the psychological literature. These data were collected through the Open Science Framework (OSF) and span areas such as social, developmental, and cognitive psychology, among others.¹ The purpose of collecting these datasets was twofold: first, to provide psychologists with easily accessible empirical data that facilitate “real-world” examples in pedagogical settings concerning meta-analysis, and second, to enable methodological researchers to illustrate novel statistical techniques on archetypal psychological data. For example, the data contained in this package can be used throughout an introductory meta-analysis course or in a research article to show that a particular meta-analytic method has desirable statistical properties. Further, when conducting a Bayesian meta-analysis, informative priors can be elicited based on such data (e.g., van Erp et al., 2017). Notably, similar efforts already exist, such as the metadat R package (White et al., 2021), albeit without a focus on psychological data. Therefore, the psymetadata package is of wide relevance to learners, teachers, researchers, and practitioners of meta-analysis alike.

Methods

Study design

Traditionally, meta-analytic techniques include only one effect size per study because otherwise the classical assumption of independence among effect sizes is violated. However, the collected datasets may contain more than one effect size per study, that is, non-independent effect sizes. As we describe in the section Reuse potential, this feature provides great flexibility in how the datasets can be used. Moreover, each dataset contains several variables suitable for use in moderator analysis. Lastly, in meta-analysis, the primary outcome under study is the average effect size. The effect sizes in psymetadata currently include Hedges’ g, Cohen’s d, and Pearson’s r, among others.

Time of data collection

All datasets were collected between March 2021 and May 2021.

Location of data collection

The collected datasets were all freely available on the OSF. The original OSF repository for each dataset can be found in the package documentation (https://cran.r-project.org/web/packages/psymetadata/psymetadata.pdf).

Sampling, sample and data collection

A convenience sample was obtained by searching the keyword “meta-analysis” on the OSF and clicking through as many results as we could manage over the period of data collection. Each result was checked for whether it had openly available data and whether a codebook could be found either in the manuscript or the supplemental materials. We only collected datasets where there was a corresponding codebook. This procedure resulted in 22 datasets.

Materials/Survey instruments

No materials were used aside from the computers used to search, download, and clean the data.

Quality Control

The data were collected with diligence and care. All dataset names follow the convention [firstauthor][year]. For example, the dataset collected from Barroso et al. (2021) is named barroso2021 (see Table 1 for all authors and years). Variables common to all datasets were renamed according to the conventions of the popular metafor R package (Viechtbauer, 2010). Specifically, each dataset contains at least the following variables:

study_id: Unique identifier for each study.
es_id: Unique identifier for each effect size.
yi: The estimated effect size.
vi: The estimated variance of the effect size.

Variables that were included in the original dataset but whose definitions were either unavailable or unclear were excluded from the final version included in psymetadata.

Table 1

Datasets included in psymetadata.

AUTHOR(S)	YEAR	TOPIC
Agadullina & Lovakov	2018	Out-group entitativity and prejudice
Aksayli et al.	2019	The cognitive and academic benefits of Cogmed
Barroso et al.	2021	Math anxiety and math achievement
Coles et al.	2019	Facial feedback
Gamble et al.	2019	Specificity of future thinking in depression
Gnambs	2020	The color red and cognitive performance
Lowe et al.	2021	The advantage of bilingualism in children
MacCann et al.	2020	Student emotional intelligence and academic performance
Maldonado et al.	2020	Age differences in executive function
The ManyBabies Consortium	2020	Variation in infancy research
Klein et al.	2018	Variation in replicability across samples and settings
Noble et al.	2019	Shared reading and language development
Nuijten et al.	2020	Intelligence
Sala et al.	2019	Working-memory training and near- and far-transfer measures
Schroeder et al.	2020	Transcranial direct current stimulation and inhibitory control
Spaniol & Danielsson	2020	Executive function components in intellectual disability
Stasielowicz	2019	Goal orientation and performance adaptation
Stasielowicz	2020	Cognitive ability in performance adaptation
Steffens et al.	2021	Social Identity Theory and leadership
Stramaccia et al.	2020	Memory suppression
Wibbelink et al.	2017	Juvenile recidivism

[i] Note: Stasielowicz (2019) contained two datasets.

Data anonymisation and ethical issues

The collected datasets are all secondary and only contain study-level information.

Existing use of data

Each of the collected datasets belonged to a primary meta-analysis. These original works are cited in the references and are listed in the documentation of the psymetadata package. Further, an applied example using one of the datasets (Gnambs, 2020) was included in Williams et al. (2021).

Dataset description and access

All datasets in psymetadata share a similar structure. To demonstrate this structure, Code 1 shows a truncated version of the coles2019 dataset originally used to conduct a meta-analysis examining the facial feedback hypothesis (Coles et al., 2019). As previously mentioned, each dataset contains the columns es_id, study_id, yi, and vi. These variables correspond to the unique identifier for the effect size, the unique identifier for the study from which the effect size was collected, the effect size, and the variance of the effect size, respectively. Additionally, most datasets contain information pertaining to the year of publication and the authors of the study from which the effect sizes were obtained. In the coles2019 dataset, the year column denotes the year the effect size was published. The remaining variables, in this case file_drawer and w_v_b, correspond to moderator variables. For example, one may want to test whether the average effect size varies according to whether a study was published in a peer-reviewed journal (file_drawer), or whether the study design was within- or between-participants (w_v_b).

CODE 1. EXAMPLE DATA FROM COLES ET AL. (2019).
es_id	study_id	yi	vi	year	file_drawer	w_v_b
1	1	0.020	0.013	2013	yes	within
2	2	0.179	0.050	1998	no	between
3	3	1.019	0.085	2014	no	between
4	3	0.074	0.069	2014	no	between
5	3	1.074	0.131	2014	no	between
6	3	0.202	0.079	2014	no	between
.	.	.	.	.	.	.
.	.	.	.	.	.	.
.	.	.	.	.	.	.
284	138	-0.0049	0.0098	2009	no	within
285	139	0.5374	0.0440	1997	no	within
286	140	-0.2377	0.1222	2002	no	between
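
As a rough illustration of this structure, the first six rows displayed in Code 1 can be typed into a data frame by hand and inspected with base R. This is only a sketch: the object name coles_head is hypothetical, the rows are a small subset, and in practice one would work with the full coles2019 object from the package.

```r
# First six rows of Code 1, entered by hand (not the full coles2019 dataset)
coles_head <- data.frame(
  es_id       = 1:6,
  study_id    = c(1, 2, 3, 3, 3, 3),
  yi          = c(0.020, 0.179, 1.019, 0.074, 1.074, 0.202),
  vi          = c(0.013, 0.050, 0.085, 0.069, 0.131, 0.079),
  year        = c(2013, 1998, 2014, 2014, 2014, 2014),
  file_drawer = c("yes", "no", "no", "no", "no", "no"),
  w_v_b       = c("within", "between", "between", "between", "between", "between")
)

# The four variables that every dataset is described as sharing
common_vars <- c("study_id", "es_id", "yi", "vi")
has_common  <- all(common_vars %in% names(coles_head))

# Number of effect sizes contributed by each study;
# a count above 1 signals dependent effect sizes within that study
n_es_per_study <- table(coles_head$study_id)
```

In this subset, study 3 contributes several effect sizes, which is the kind of dependence discussed under Study design.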

Repository location

The data can be accessed by installing the R package psymetadata from CRAN, using install.packages("psymetadata"), or from GitHub. Alternatively, individual data files may be downloaded directly from the GitHub repository.
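
A minimal sketch of the CRAN route follows. The install call is commented out so that the snippet has no side effects, and the check with requireNamespace() is our own addition so that the code also runs on a machine where the package is not installed; it assumes that coles2019 is exported as a regular package dataset with the four common columns described above.

```r
# Install from CRAN once (left commented out to avoid side effects)
# install.packages("psymetadata")

# Load a dataset only if the package is available on this machine
has_pkg <- requireNamespace("psymetadata", quietly = TRUE)
if (has_pkg) {
  data("coles2019", package = "psymetadata")
  # Inspect the four variables shared by all datasets
  str(coles2019[, c("study_id", "es_id", "yi", "vi")])
}
```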

Object/file name

All datasets are stored in the “Data” folder of the psymetadata GitHub repository, with file names following the format [author][year].rda.

Data type

All datasets are secondary.

Format names and versions

The data are saved in the R data format (i.e., the .rda file extension). Accessing files in this format requires using the R programming language (R Core Team, 2021).
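
For readers unfamiliar with the .rda format, the sketch below shows a round trip with base R's save() and load(). The data frame toy2021 and its values are made up purely for illustration and are not part of psymetadata; an individual file downloaded from the repository would be read with load() in the same way.

```r
# Made-up stand-in data frame (hypothetical; not one of the package datasets)
toy2021 <- data.frame(study_id = c(1, 2), es_id = c(1, 2),
                      yi = c(0.10, 0.25), vi = c(0.02, 0.03))

# Write an .rda file to a temporary location
rda_path <- file.path(tempdir(), "toy2021.rda")
save(toy2021, file = rda_path)

# load() restores objects under their saved names and returns those names
rm(toy2021)
loaded_names <- load(rda_path)
```

Because load() restores objects under the names they were saved with, a file named [author][year].rda is expected to yield an object of the same name, although this is worth confirming from the value load() returns.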

Language

The data are saved in American English.

License

The data are distributed under the GNU General Public License Version 2.

Limits to sharing

There are no limitations on the sharing of these data.

Publication date

The psymetadata package was originally published to CRAN on 31/05/2021.

FAIR data/Codebook

For each dataset, the variable names, variable definitions, topic, and reference(s) have been documented. The documentation is available for all datasets (https://cran.r-project.org/web/packages/psymetadata/psymetadata.pdf). The documentation of a given dataset can also be accessed using the ? function in R (e.g., ?coles2019).

Reuse potential

The psymetadata package contains 22 datasets, each of which includes multiple, dependent effect sizes and moderator variables. This affords a great deal of flexibility in teaching a variety of common techniques. For instance, if one were to either average effect sizes within studies or select a single effect size per study, then classical fixed-effects and random-effects meta-analysis (Borenstein et al., 2009) can be taught. On the other hand, these datasets may be used to demonstrate methods that explicitly account for dependent effect sizes, such as robust variance estimation (Hedges et al., 2010) or three-level meta-analysis (Assink & Wibbelink, 2016). Of course, additional techniques may be taught with these data, including moderator analysis (Hedges & Pigott, 2005), subgroup analysis (Borenstein & Higgins, 2013), testing for publication bias (Copas, 1999; Sutton, 2000), and Bayesian meta-analysis, among many others.
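
The first of these teaching routes can be sketched in base R using only the six rows printed in Code 1. Several choices here are our assumptions rather than part of the package: the within-study correlation rho = 0.5 is arbitrary, the helper aggregate_study() is hypothetical, and the DerSimonian-Laird estimator of the between-study variance is just one common option chosen for brevity. With so few studies the numbers are for illustration only.

```r
# First six rows of Code 1, entered by hand (illustrative subset only)
dat <- data.frame(
  study_id = c(1, 2, 3, 3, 3, 3),
  yi       = c(0.020, 0.179, 1.019, 0.074, 1.074, 0.202),
  vi       = c(0.013, 0.050, 0.085, 0.069, 0.131, 0.079)
)

# Assumed correlation among effect sizes from the same study (not from the source)
rho <- 0.5

# Collapse one study's rows to a single averaged effect size and its variance
aggregate_study <- function(d, rho) {
  m  <- nrow(d)
  se <- sqrt(d$vi)
  # Covariance matrix: vi on the diagonal, rho * se_i * se_j elsewhere
  V  <- rho * outer(se, se)
  diag(V) <- d$vi
  # Variance of a simple mean is the sum of all covariances divided by m^2
  data.frame(study_id = d$study_id[1], yi = mean(d$yi), vi = sum(V) / m^2)
}
agg <- do.call(rbind, lapply(split(dat, dat$study_id), aggregate_study, rho = rho))

# Fixed-effect (inverse-variance weighted) estimate
w     <- 1 / agg$vi
mu_fe <- sum(w * agg$yi) / sum(w)
se_fe <- sqrt(1 / sum(w))

# Random-effects estimate with the DerSimonian-Laird estimator of tau^2
k     <- nrow(agg)
Q     <- sum(w * (agg$yi - mu_fe)^2)
C     <- sum(w) - sum(w^2) / sum(w)
tau2  <- max(0, (Q - (k - 1)) / C)
w_re  <- 1 / (agg$vi + tau2)
mu_re <- sum(w_re * agg$yi) / sum(w_re)
se_re <- sqrt(1 / sum(w_re))
```

Selecting a single effect size per study instead would replace the aggregation step with, for example, keeping the first row of each study; the pooling steps stay the same.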

For methodological researchers, illustrative examples are commonly employed to demonstrate that novel methodologies have desirable statistical properties and are suitable for studying psychological phenomena. For example, Williams et al. (2021) used the gnambs2020 dataset to show how accounting for group differences in between-study heterogeneity may have profound implications for the resulting inferences of a meta-analysis. Further, priors for Bayesian meta-analyses can be determined by using these data. One can imagine that a future meta-analysis studying whether various developmental psychology studies replicate (e.g., The ManyBabies Consortium, 2020) may rely on a random-effects model to do so. By using, say, the manyBabies2020 dataset, informed priors may be determined for the overall effect size or for the between-study heterogeneity (e.g., van Erp et al., 2017). Finally, the open-source nature of the package allows researchers to contribute their own meta-analytic datasets to psymetadata by following the steps outlined on the psymetadata GitHub repository.

Notes

[1] The specific topics, along with the authors and years of each meta-analysis, are shown in Table 1.

Funding Information

DRW was supported by a National Science Foundation Graduate Research Fellowship under Grant No. 1650042.

Competing Interests

The authors have no competing interests to declare.

Author Contribution

JER and DRW collected and cleaned the data. JER and DRW coded the corresponding R package.

Peer Review Comments

Journal of Open Psychology Data has blind peer review, which is unblinded upon article acceptance. The editorial history of this article can be downloaded here:

PR File 1

Peer Review Comments. DOI: https://doi.org/10.5334/jopd.61.pr1