ODDPub – a Text-Mining Algorithm to Detect Data Sharing in Biomedical Publications

Open Access | Oct 2020

Figures & Tables

Figure 1

Inclusion and exclusion flowchart for training dataset 1.

Figure 2

Inclusion and exclusion flowchart for training dataset 2.

Figure 3

Inclusion and exclusion flowchart for the validation dataset.

Figure 4

Schema of the algorithm workflow and architecture.

Table 1

Different keyword categories used by ODDPub to detect Open Data and Open Code.

Combined Keyword Categories | Explanation
Open Data
Field-specific repositories | Checks if data deposition in field-specific database and accession number is mentioned
General-purpose repositories | Checks if data deposition in general-purpose database that uses no accession number is mentioned
Dataset | Checks if dataset that is given specific number (e.g. “dataset S1”) is mentioned
Supplemental table or data | Checks if a numbered file or table or raw data is mentioned together with specific file formats
Supplementary raw/full data with specific file format | Checks if raw/full data is mentioned together with specific file formats
Data availability statement | Checks if an accession number or a repository name is mentioned in the data availability section
Dataset on GitHub | Checks if data deposition on GitHub is mentioned
Data journals | Checks journal DOI for certain known data journals
Open Code
Source-code availability | Checks if availability of source code is mentioned
Supplementary Source-code | Checks if source code in the supplement is mentioned
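
The keyword categories in Table 1 can be illustrated with a small sentence-level matcher. The sketch below is not ODDPub's implementation: the pattern names, the repository names, and the regular expressions are hypothetical and heavily simplified, and only a few of the categories are covered. It only shows the general idea of combining keywords into categories and flagging a publication when any sentence matches.

```python
import re

# Hypothetical, simplified patterns for a few of the Table 1 categories.
# These are illustrative assumptions, not the keyword lists used by ODDPub.
CATEGORY_PATTERNS = {
    "field_specific_repository": re.compile(
        r"\b(?:GEO|Gene Expression Omnibus|ArrayExpress|Protein Data Bank)\b"
        r".*\baccession\b",
        re.IGNORECASE,
    ),
    "general_purpose_repository": re.compile(
        r"\b(?:deposited|available|uploaded)\b"
        r".*\b(?:figshare|dryad|zenodo|open science framework)\b",
        re.IGNORECASE,
    ),
    "numbered_dataset": re.compile(r"\bdata\s?set\s+S\d+\b", re.IGNORECASE),
    "github_dataset": re.compile(r"\bdata(?:set)?s?\b.*\bgithub\b", re.IGNORECASE),
    "source_code_availability": re.compile(
        r"\b(?:source code|analysis code|scripts?)\b.*\b(?:available|deposited)\b",
        re.IGNORECASE,
    ),
}

OPEN_DATA_CATEGORIES = {
    "field_specific_repository",
    "general_purpose_repository",
    "numbered_dataset",
    "github_dataset",
}
OPEN_CODE_CATEGORIES = {"source_code_availability"}


def detect_categories(sentence):
    """Return the set of category names whose pattern matches the sentence."""
    return {
        name
        for name, pattern in CATEGORY_PATTERNS.items()
        if pattern.search(sentence)
    }


def screen_publication(sentences):
    """Flag a publication if any sentence matches an Open Data / Open Code category."""
    open_data = False
    open_code = False
    for sentence in sentences:
        categories = detect_categories(sentence)
        open_data = open_data or bool(categories & OPEN_DATA_CATEGORIES)
        open_code = open_code or bool(categories & OPEN_CODE_CATEGORIES)
    return {"open_data": open_data, "open_code": open_code}
```

Working at the sentence level keeps each match local, so a repository name in one paragraph and the word "accession" in another do not combine into a hit. A realistic detector would need far broader keyword lists.
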
Table 2

Predictions of ODDPub for Open Data on the validation dataset in comparison to the manual screening.

Open Data | ODDPub: Yes | ODDPub: No
Human rater: Yes | 67 | 24
Human rater: No | 23 | 678
Table 3

Predictions of ODDPub for Open Code on the validation dataset in comparison to the manual screening.

Open Code | ODDPub: Yes | ODDPub: No
Human rater: Yes | 8 | 3
Human rater: No | 6 | 775
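
The confusion matrices in Tables 2 and 3 can be turned into standard screening metrics. The sketch below assumes the cells read 67, 24, 23, 678 for Open Data and 8, 3, 6, 775 for Open Code, a reading under which both tables sum to the same 792 publications. The class and variable names are illustrative, and the manual screening is treated as the gold standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # human rater: yes, ODDPub: yes
    fn: int  # human rater: yes, ODDPub: no
    fp: int  # human rater: no,  ODDPub: yes
    tn: int  # human rater: no,  ODDPub: no

    @property
    def total(self):
        return self.tp + self.fn + self.fp + self.tn

    @property
    def sensitivity(self):
        # Share of manually confirmed cases that the algorithm also detected.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self):
        # Share of manually rejected cases that the algorithm also rejected.
        return self.tn / (self.tn + self.fp)

    @property
    def ppv(self):
        # Share of algorithm detections confirmed by the manual screening.
        return self.tp / (self.tp + self.fp)


# Cell values as read from Tables 2 and 3 (assumed split of the flattened cells).
open_data = ConfusionMatrix(tp=67, fn=24, fp=23, tn=678)
open_code = ConfusionMatrix(tp=8, fn=3, fp=6, tn=775)

for label, cm in (("Open Data", open_data), ("Open Code", open_code)):
    print(
        f"{label}: n={cm.total}, sensitivity={cm.sensitivity:.3f}, "
        f"specificity={cm.specificity:.3f}, PPV={cm.ppv:.3f}"
    )
```

Because Open Code is rare in the sample, its positive predictive value is sensitive to a handful of false positives even though specificity is close to 1.
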
Figure 5

Venn diagram of the overlap between the detected Open Data publications for the four different detection methods on the validation dataset: Manual search, ODDPub, PubMed, and Web of Science. The manual search represents the gold standard. All but one of the 23 publications detected by ODDPub but not the manual search are false positive detections.

Table 4

Types of data sharing observed in the manually detected Open Data publications of the validation sample.

Category | Number of occurrences
Supplemental Data | 42
Field-specific repository | 40
General-purpose repository (including GitHub) | 14
Institutional repository | 0
Personal/project-specific website | 1
Data journal | 0
Table 5

Reasons for false positive cases detected by ODDPub in the validation sample. For three cases, two conditions applied, as part of the shared data was not raw data and the other part was shared with restrictions.

Category | Number of occurrences
Shared data not raw data | 9
Data sharing with restrictions | 4
Open Data reuse | 4
Only analysis code shared | 3
Detected sentence not related to data sharing | 3
Data available upon request | 1
Linked OSF repository was empty | 1
Case of Open Data missed by manual search | 1
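
As a quick consistency check, the counts in Table 5 can be reconciled with the 23 publications that the Figure 5 caption reports as detected by ODDPub but not by the manual search. The sketch assumes that each of the three cases with two conditions is counted exactly twice in the table. The dictionary keys are shortened labels chosen for illustration.

```python
# Occurrences per reason, as listed in Table 5.
false_positive_reasons = {
    "shared data not raw data": 9,
    "data sharing with restrictions": 4,
    "open data reuse": 4,
    "only analysis code shared": 3,
    "sentence not related to data sharing": 3,
    "data available upon request": 1,
    "linked OSF repository empty": 1,
    "open data missed by manual search": 1,
}

total_occurrences = sum(false_positive_reasons.values())

# Assumption: three cases fall into two categories each, so each adds one
# extra occurrence beyond the number of distinct publications.
cases_with_two_conditions = 3
distinct_cases = total_occurrences - cases_with_two_conditions

print(total_occurrences, distinct_cases)
```
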
Language: English
Submitted on: May 14, 2020 | Accepted on: Oct 19, 2020 | Published on: Oct 29, 2020
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2020 Nico Riedel, Miriam Kip, Evgeny Bobrov, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.