Comprehensive Bibliographic Dataset on Open Science

Piotr Wciślik; Magdalena Wnuk; Cezary Rosiński; Maciej Maryl

doi:10.5334/johd.511

(1) Overview

Repository location

Context

The dataset was produced as a knowledge resource for the SCIROS project (Strategic Collaboration for Interdisciplinary Research on Open Science in the Social Sciences and Humanities) while conducting a systematic literature review that addressed critical gaps in the understanding of the theory, practice, and infrastructures of open science in the context of the Humanities.

(2) Method

Steps

This data paper incorporates Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Page et al., 2021). The following steps were used to produce the dataset:

1. Cross-platform query

The query applied the same set of keywords across three bibliographic databases: Scopus, the Web of Science (WoS), and OpenAlex. The design of the cross-platform query is further described in the next section. The query yielded 1,481 records in WoS; 2,293 records in Scopus; and 1,348 records in OpenAlex.

2. Formal relevance screening and integration

Each set of records was initially screened for relevance based on formal characteristics, such as the type of record (e.g., “meeting abstract”) or duplicate records within each dataset (e.g., articles duplicated as book chapters). This step eliminated 27 records for WoS, 30 records for Scopus, and 70 records for OpenAlex. The metadata of each of the three datasets was then mapped onto a common metadata model and uploaded into a single spreadsheet consisting of 4,996 records, including duplicates.
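The mapping onto a common metadata model can be pictured with the minimal sketch below. It is an illustration only: the source field names in FIELD_MAPS and the four-column common model are assumptions made for the example, not the actual SCIROS data model or the platforms' full export schemas.

```python
# Hypothetical field names: the real export columns of each platform differ
# and the actual SCIROS common model has many more columns.
FIELD_MAPS = {
    "wos": {"TI": "title", "AB": "abstract", "DI": "doi"},
    "scopus": {"Title": "title", "Abstract": "abstract", "DOI": "doi"},
    "openalex": {"display_name": "title", "abstract": "abstract", "doi": "doi"},
}

COMMON_FIELDS = ("source", "title", "abstract", "doi")


def to_common(record, source):
    """Rename one source record's fields to the common model.

    Fields without a mapping are dropped; missing fields become ''.
    """
    mapping = FIELD_MAPS[source]
    common = {field: "" for field in COMMON_FIELDS}
    common["source"] = source
    for src_field, value in record.items():
        if src_field in mapping:
            common[mapping[src_field]] = value
    return common


def integrate(datasets):
    """Merge all screened records into one list of rows, duplicates included."""
    rows = []
    for source, records in datasets.items():
        rows.extend(to_common(record, source) for record in records)
    return rows
```

Keeping the source name in every row is what later allows duplicates from different platforms to be recognised and reconciled.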

3. Systematic relevance screening

Screening for relevance was undertaken ahead of deduplication, for reasons described below. The procedure, which consisted of summing the scores assigned by two reviewers, is described in detail in the section “Quality control”. It resulted in 1,095 records, including duplicates.

4. Production of composite records: deduplication and reconciliation

The three datasets differed in metadata quality and in their data models. For this reason, data reconciliation was carried out together with deduplication in this step, according to the procedure described in the section “Quality control”. This procedure produced 727 composite records, some of which combine information from two or more source datasets in different data formats. As a result, the records are human-readable but require further data normalisation for machine-assisted reuse. This limitation is partially mitigated by the creation of the Zotero library (described below).

5. Upload to the Zotero library

The 727 composite records were added as a seed collection to the SCIROS Zotero library, which is the public-facing, open resource for the project (SCIROS, 2025). The bulk of the records were added automatically using Zotero’s “magic wand” functionality, i.e., their metadata and full texts (where available) were ingested into the library based on DOIs. The remaining 104 records, which did not have a DOI (or had a broken one), were ingested from the internet using the Zotero connector or, in the few cases where the connector did not yield satisfactory results, typed in manually.

The SCIROS Zotero library is, by design, a living resource that is collectively and continuously curated by the project’s team and partners. The Zenodo files represent an archived version of the library. At the point of archiving, it numbered 754 records after deduplication.

Sampling strategy

The sampling strategy employed a data retrieval method designed for the SHAPE-ID project (Wciślik et al., 2020), and is described in detail on the SCIROS website (Rosiński et al., 2025). The main innovation of the strategy was to use the same query (in terms of keywords and their structure) across three platforms that differ in their data retrieval capabilities.

The discovery services of the proprietary Scopus and WoS databases offer similar capabilities: in addition to using filters, the researcher can build complex, nested query strings that combine Boolean operators, proximity operators, and wildcards using brackets. However, proximity operators are not available for searching their open data competitor, OpenAlex. A further innovation of this sampling strategy was therefore to emulate the function of proximity operators in Python on the OpenAlex dataset.

Query in Scopus and WoS

In both the Scopus and WoS cases, we used three sets of keywords:

Set A = Open*, Citizen

Set B = Scien*, Data, Access, Method*, Research, Humanities, Scholar*, Infrastructure*

Set C = Theor*, Understanding*, Concept*, Philosoph*, Criti*, Value*, Ethic*, epistem*, Manifest*, Meaning*, Idea*, Premise*, Discourse*

While sets A and B identify the subject of open science in its various manifestations (open access, open data, open research infrastructures, etc.), set C narrows down the focus to those resources which address what open science means (theory, philosophy, ethics, epistemology, etc.). The composition of the three sets relied on our knowledge of the subjects and was aimed at striking the right balance between false positives and false negatives.

The three sets were combined into the query string C NEAR/2 (A NEAR/0 B). Within each set, the keywords (some of which use the wildcard character *) are connected with the Boolean operator OR, and the proximity operator NEAR/x requires that the matched keywords be separated by at most x words in the title or abstract.

This gives the following query string for WoS:

(Theor* OR Understanding* OR Concept* OR Philosoph* OR Criti* OR Value* OR Ethic* OR epistem* OR Manifest* OR Meaning* OR Idea* OR Premise* OR Discourse*) NEAR/2 ((Open* OR Citizen) NEAR/0 (Scien* OR Data OR Access OR Method* OR Research OR Humanities OR Scholar* OR Infrastructure*)).

We used the same string for Scopus; the only difference was that Scopus uses the proximity operator W instead of NEAR.
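As an illustration, the query string can be assembled programmatically from the three keyword sets. The sketch below is not part of the published workflow; the function names are hypothetical, and the field restriction to title and abstract (which each platform expresses with its own syntax) is deliberately left out.

```python
SET_A = ["Open*", "Citizen"]
SET_B = ["Scien*", "Data", "Access", "Method*", "Research", "Humanities",
         "Scholar*", "Infrastructure*"]
SET_C = ["Theor*", "Understanding*", "Concept*", "Philosoph*", "Criti*",
         "Value*", "Ethic*", "epistem*", "Manifest*", "Meaning*", "Idea*",
         "Premise*", "Discourse*"]


def or_group(keywords):
    """Join the keywords of one set with OR inside brackets."""
    return "(" + " OR ".join(keywords) + ")"


def build_query(proximity="NEAR"):
    """Build C <op>/2 (A <op>/0 B) for a given proximity operator name."""
    inner = f"({or_group(SET_A)} {proximity}/0 {or_group(SET_B)})"
    return f"{or_group(SET_C)} {proximity}/2 {inner}"


wos_query = build_query("NEAR")   # WoS variant
scopus_query = build_query("W")   # Scopus variant, using W/x
```

Generating both variants from the same sets keeps the keywords and their structure identical across platforms, so that only the operator name changes.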

Query in OpenAlex

The workaround that enabled a query of similar complexity for OpenAlex consisted of a combination of an API query and post-processing in Python.

Working through the API, we substituted the nested part of the string, A NEAR/0 B, with a group of combinations of keywords from sets A and B:

“Open Access” OR “Citizen Science” OR “Open Science” OR “Open Methods” OR “Open Research Methods” OR “Open Humanities” OR “Open Infrastructure” OR “Open Research Infrastructure” OR “Open Scholarship”

We could not do the same for the entire string C NEAR/2 (A NEAR/0 B), since this would yield too many combinations. Instead, the keywords from Set C were connected to this group through the Boolean operator AND, yielding the following string:

((Open Access OR Citizen Science OR Open Science OR Open Methods OR Open Research Methods OR Open Humanities OR Open Infrastructure OR Open Research Infrastructure OR Open Scholarship) AND (Theories OR Understandings OR Concepts OR Philosophies OR Critiques OR Values OR Epistemologies OR Manifestos OR Meanings OR Ideas OR Premises OR Discourses))

Since the substitution of the proximity operator NEAR/2 with the AND operator yielded too many false positives, we ran a Python script that emulated the proximity operator on the records retrieved from OpenAlex through the API. The script is available on GitHub (Rosiński, 2025).
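The general idea of emulating a proximity operator in post-processing can be sketched as follows. This is a minimal illustration written for this description, not the published script: the tokenisation, the treatment of word order (unordered, as with NEAR), and the function names are all assumptions.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(token, patterns):
    """True if the token matches any keyword; * acts as a wildcard."""
    return any(fnmatchcase(token, pattern.lower()) for pattern in patterns)


def near_match(text, set_a, set_b, set_c, gap=2):
    """Emulate C NEAR/gap (A NEAR/0 B) on a title or abstract.

    An A keyword and a B keyword must be adjacent (in either order), and a
    C keyword must occur with at most `gap` words between it and that pair.
    """
    tokens = tokenize(text)
    for i in range(len(tokens) - 1):
        first, second = tokens[i], tokens[i + 1]
        is_pair = (
            (matches(first, set_a) and matches(second, set_b))
            or (matches(first, set_b) and matches(second, set_a))
        )
        if not is_pair:
            continue
        # Tokens within `gap` intervening words before or after the pair.
        start = max(0, i - gap - 1)
        end = min(len(tokens), i + 2 + gap + 1)
        window = tokens[start:i] + tokens[i + 2:end]
        if any(matches(token, set_c) for token in window):
            return True
    return False
```

Applied to each retrieved record, a filter of this kind keeps only those titles or abstracts in which a Set C keyword actually occurs close to an open-science phrase, rather than anywhere in the text.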

Quality control

Relevance screening

Screening for relevance was carried out by Wnuk and Wciślik, who each assigned a score between 1 and 3 (from least to most relevant) to each record. The two scores were then summed. Records with an aggregate score of 5 or 6 were included, and records with an aggregate score of 2 or 3 were excluded. Among records with an aggregate score of 4, those that received 2 points from each reviewer were included, whereas those that received a score of 1 from one reviewer and 3 from the other were given a second round of screening, during which the two reviewers reached a consensus on the discrepant scores. In the course of the procedure, reviewers also added project-specific tags, such as thematic scope (Theory, Infrastructure, Practice) and target usage (textual versus network analysis); these are not included in the final dataset.
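The decision rule can be written out compactly. The sketch below only restates the rule described above; the function name and the returned labels are chosen for illustration.

```python
def screening_decision(score_1, score_2):
    """Return the outcome for one record given the two reviewers' scores (1-3)."""
    for score in (score_1, score_2):
        if score not in (1, 2, 3):
            raise ValueError("each score must be 1, 2, or 3")
    total = score_1 + score_2
    if total >= 5:
        return "include"
    if total <= 3:
        return "exclude"
    # Aggregate score of 4: either 2 + 2 or a 1 + 3 disagreement.
    if score_1 == score_2:
        return "include"
    return "second round"
```

For example, screening_decision(2, 2) gives "include", while screening_decision(1, 3) sends the record to the second round, where the reviewers settle the discrepancy by consensus.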

Deduplication and reconciliation

A substantial challenge in creating this dataset was the discrepancy in both data models and data curation standards between the three datasets. On the whole, the OpenAlex records represent the highest standard of data curation, since all named entities in the OpenAlex data model, including each item on the list of references, receive a unique identifier; this makes them easier to reuse, in particular in machine-assisted analysis (e.g., co-citation analyses). However, the abstracts that are downloadable through the OpenAlex API are abridged (they seem to exclude certain stop-words), whereas both Scopus and WoS reproduce them in full. These latter two databases also contain certain metadata fields that are absent in OpenAlex, for example, information about funding and author-generated keywords. Another discrepancy relevant to further reuse is that each of the three databases has its own system of data classification and its own system of encoding references. The OpenAlex encoding is structured: each item on the reference list of each record has a separate identifier. In both Scopus and WoS, references can only be retrieved as unstructured text; however, WoS references contain DOIs, which can be parsed out of the text field.
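Parsing DOIs out of an unstructured reference field can be approached with a regular expression, as in the sketch below. The pattern is a common heuristic rather than a complete DOI grammar, and the layout of the sample reference string is an assumption made for illustration, not the exact WoS export format.

```python
import re

# Heuristic pattern: "10.", a 4-9 digit registrant code, "/", then a suffix
# that runs until whitespace or a typical list separator.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s;,\]]+")


def extract_dois(reference_text):
    """Return the distinct DOIs found in a reference field, in order."""
    dois = []
    for match in DOI_PATTERN.findall(reference_text):
        doi = match.rstrip(".").lower()  # drop sentence-final dots, normalise case
        if doi not in dois:
            dois.append(doi)
    return dois


# Hypothetical reference field with placeholder authors and journals.
sample = ("AUTHOR A, 2020, JOURNAL X, DOI 10.5334/johd.511; "
          "AUTHOR B, 2019, JOURNAL Y, DOI 10.6018/analesdoc.378171.")
print(extract_dois(sample))
```

References without a DOI are simply skipped by this approach, so the result is a partial, not a complete, structured reference list.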

Having compared the three source datasets from the perspective of their usefulness for the SCIROS project, we employed the following method to reconcile and deduplicate the records:

Where an OpenAlex record existed (i.e., the record appeared in all three datasets, in OpenAlex and WoS, or in OpenAlex and Scopus), it was taken as the base record, while the abstract and the metadata on funding, publisher, and author keywords were added from WoS or Scopus (in that order of preference).
Where a record was duplicated only in WoS and Scopus, the WoS record was taken as the base record, without further data reconciliation.
In addition to the systematic reconciliation described above, all deduplicated records were screened for random inconsistencies (in DOIs, in particular).
All record identifiers were kept in separate columns so that the provenance of each composite record can be retraced.
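The reconciliation rules above can be sketched as a single function. This is an illustration under stated assumptions: records are represented as plain dictionaries, and the field names (abstract, funding, publisher, author_keywords, id) are hypothetical stand-ins for the columns of the actual spreadsheet.

```python
# Fields taken from WoS or Scopus when an OpenAlex base record exists.
FILL_FIELDS = ("abstract", "funding", "publisher", "author_keywords")


def composite_record(openalex=None, wos=None, scopus=None):
    """Build one composite record from the duplicates of a single work."""
    if openalex is not None:
        composite = dict(openalex)
        for field in FILL_FIELDS:
            # Prefer WoS, then Scopus; keep the OpenAlex value if both are empty.
            for donor in (wos, scopus):
                if donor is not None and donor.get(field):
                    composite[field] = donor[field]
                    break
    elif wos is not None:
        composite = dict(wos)      # WoS is the base, no further reconciliation
    elif scopus is not None:
        composite = dict(scopus)
    else:
        raise ValueError("at least one source record is required")
    # Keep every source identifier in its own column for provenance.
    for name, record in (("openalex", openalex), ("wos", wos), ("scopus", scopus)):
        composite[f"{name}_id"] = record.get("id", "") if record is not None else ""
    return composite
```

The manual screening for random inconsistencies described above is not covered by this sketch and would still be needed after such an automated pass.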

Copyright screening

The full texts of records ingested automatically into the Zotero library were screened for copyright concerns, because for some of the open access materials it was unclear whether third-party reuse is permitted (i.e., sharing via the public Zotero library and via Zenodo, as opposed to use within an individual Zotero account).

The following steps were undertaken:

We examined the records that met two conditions: (a) the access status is “closed” or blank according to our records (column S, “open_access_status”, in the spreadsheet), and (b) a full text had been ingested into the Zotero library.
We kept the full text whenever we could either (a) verify that the license permits public sharing or (b), where the license statement was missing, verify that reuse is not restricted by the repository hosting the full text and that there is no conflicting licensing claim (e.g., from the text’s publisher).
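The two steps can be expressed as simple predicates, as in the sketch below. Only the open_access_status column comes from the spreadsheet; the has_full_text flag and the arguments of keep_full_text are hypothetical, and the license checks themselves were manual judgements that code of this kind can record but not replace.

```python
def needs_copyright_check(record):
    """Step 1: closed or blank access status, with a full text in Zotero."""
    status = (record.get("open_access_status") or "").strip().lower()
    return status in ("", "closed") and bool(record.get("has_full_text"))


def keep_full_text(license_allows_sharing, repository_restricts_reuse=False,
                   conflicting_claim=False):
    """Step 2: decide whether an examined full text stays in the library.

    license_allows_sharing is True or False when a license statement exists,
    and None when the statement is missing.
    """
    if license_allows_sharing is True:
        return True
    if license_allows_sharing is None:
        return not repository_restricts_reuse and not conflicting_claim
    return False
```

Under these assumptions, a record with a blank status and an attached full text is flagged for examination, and its full text is kept only if one of the two verification routes succeeds.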

(3) Dataset Description

Repository name

Zenodo

Object name

SCIROS Bibliography on Open Science

Format names and versions

1. Research dataset

File name: SCIROS_dataset_open_science

File Format: CSV file

2. Zotero library

File name: SCIROS Zotero Library

File format: ZIP file containing a RIS file with bibliographic information and the full texts of those records that are available in open access.

Creation dates

From 2025-01-15 to 2025-12-15

Dataset creators

Piotr Wciślik^a: Conceptualisation, Methodology, Data curation

Magdalena Wnuk^b: Conceptualisation, Methodology, Data curation

Cezary Rosiński^c: Conceptualisation, Methodology, Data curation, Software

^abc Digital Humanities Centre at the Institute of Literary Research of the Polish Academy of Sciences (CHC IBL PAN), Warsaw, Poland

Language

Spreadsheet: English metadata.

Zotero library: metadata and full texts in multiple languages but predominantly in English.

GitHub script: README file in English.

Note:

Non-English records available through the three platforms have multilingual metadata, and can thus be queried using English keywords. Also, the English metadata of the non-English records are available for download through the three platforms. This is reflected in the SCIROS spreadsheet, which contains all metadata in English.

However, in the process of automated ingestion using DOI numbers (the “magic wand” functionality), Zotero makes its own choice for metadata language. For example, the item with DOI 10.6018/analesdoc.378171 has metadata in both English and Spanish.

We decided not to normalise the Zotero library because users working with metadata-driven analyses are able to use the spreadsheet version, and users working with text will need to handle multilingualism regardless (and multilingualism itself is an important value in Open Science!).

License

CC BY 4.0

Notice about Third Party Content:

This dataset contains third party content, namely the full texts of the articles. All third party content is open access; however, individual articles may carry licenses that differ from the CC BY license, which applies exclusively to the work done by the authors of this dataset. Please make sure that your reuse of third party content complies with the license indicated for each article.

Publication date

2026-01-10

(4) Reuse Potential

The spreadsheet can be further used in all kinds of bibliographic data-driven research, including network analysis and all types of scientometric studies, such as co-citation analysis. Since the dataset contains composite records, combining metadata from Scopus, WoS, and OpenAlex, the data (in particular the references) must be normalised to enable further reuse.

The Zotero library can be reused for bibliographic consultation for research and teaching, as well as for discourse analysis and machine-assisted corpus analysis using its full-text corpus.

In addition, the script provided in the GitHub repository enables reproducible retrieval and filtering of OpenAlex data using a query structure that is aligned with Scopus and Web of Science. It can be reused to regenerate the OpenAlex subset of the dataset, to update it over time by re-running the query, and to adapt it to new research domains by modifying the keyword sets. The implemented proximity-filtering mechanism allows researchers to approximate database-specific operators (such as NEAR) in environments where they are not supported, making the approach transferable to other APIs and datasets.

Acknowledgements

The authors would like to thank the two anonymous reviewers for their insightful comments and the JOHD editorial team for their support.

Author Contributions

Piotr Wciślik: Conceptualisation, Methodology, Data curation, Writing – original draft, Writing – review & editing.

Magdalena Wnuk: Conceptualisation, Methodology, Data curation, Writing – review & editing.

Cezary Rosiński: Conceptualisation, Methodology, Data curation, Software, Writing – review & editing.

Maciej Maryl: Writing – review & editing.