ODDPub – a Text-Mining Algorithm to Detect Data Sharing in Biomedical Publications

Nico Riedel; Miriam Kip; Evgeny Bobrov

doi:10.5334/dsj-2020-042

Figures & Tables

Figure 1

Inclusion and exclusion flowchart for training dataset 1.

Figure 2

Inclusion and exclusion flowchart for training dataset 2.

Figure 3

Inclusion and exclusion flowchart for the validation dataset.

Figure 4

Schema of the algorithm workflow and architecture.

Table 1

Different keyword categories used by ODDPub to detect Open Data and Open Code.

Combined Keyword Categories	Explanation
Open Data
Field-specific repositories	Checks whether data deposition in a field-specific database is mentioned together with an accession number
General-purpose repositories	Checks whether data deposition in a general-purpose database that does not use accession numbers is mentioned
Dataset	Checks whether a dataset with a specific number (e.g. “dataset S1”) is mentioned
Supplemental table or data	Checks if a numbered file or table or raw data is mentioned together with specific file formats
Supplementary raw/full data with specific file format	Checks if raw/full data is mentioned together with specific file formats
Data availability statement	Checks if an accession number or a repository name is mentioned in the data availability section
Dataset on GitHub	Checks if data deposition on GitHub is mentioned
Data journals	Checks journal DOI for certain known data journals
Open Code
Source-code availability	Checks if availability of source code is mentioned
Supplementary Source-code	Checks if source code in the supplement is mentioned
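
As a minimal sketch of how a combined keyword category such as "Field-specific repositories" could be checked, the following Python snippet flags a sentence only when a repository name and an accession-number-like pattern occur together. The repository names, the accession pattern, and the function name are illustrative assumptions, not the keyword lists or code that ODDPub actually uses.

```python
import re

# Hypothetical keyword list for illustration only; not ODDPub's actual list.
REPOSITORY_NAMES = ["gene expression omnibus", "arrayexpress", "protein data bank"]

# Assumed accession-number shapes: GEO series (e.g. GSE12345) and
# ArrayExpress experiments (e.g. E-MTAB-1234).
ACCESSION_PATTERN = re.compile(r"\b(GSE\d+|E-[A-Z]{4}-\d+)\b")


def mentions_field_specific_deposit(sentence: str) -> bool:
    """Return True if a repository name and an accession number co-occur."""
    lowered = sentence.lower()
    has_repository = any(name in lowered for name in REPOSITORY_NAMES)
    has_accession = ACCESSION_PATTERN.search(sentence) is not None
    return has_repository and has_accession
```

Requiring both conditions mirrors the "combined" nature of the category: a repository name alone may only indicate reuse of existing data, and an accession number alone gives no deposition context.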

Table 2

Predictions of ODDPub for Open Data on the validation dataset in comparison to the manual screening.

Open Data		ODDPub
		Yes	No
Human rater	Yes	67	24
	No	23	678

Table 3

Predictions of ODDPub for Open Code on the validation dataset in comparison to the manual screening.

Open Code		ODDPub
		Yes	No
Human rater	Yes	8	3
	No	6	775
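
The counts in Table 2 and Table 3 can be turned into sensitivity and specificity, treating the human rater as the reference. The snippet below is a small worked calculation; the function and variable names are chosen here for illustration only.

```python
from typing import Tuple


def sensitivity_specificity(tp: int, fn: int, fp: int, tn: int) -> Tuple[float, float]:
    """Compute sensitivity and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of true cases that were detected
    specificity = tn / (tn + fp)  # share of negative cases correctly rejected
    return sensitivity, specificity


# Table 2 (Open Data): human Yes -> 67 detected, 24 missed; human No -> 23 flagged, 678 not.
open_data = sensitivity_specificity(tp=67, fn=24, fp=23, tn=678)

# Table 3 (Open Code): human Yes -> 8 detected, 3 missed; human No -> 6 flagged, 775 not.
open_code = sensitivity_specificity(tp=8, fn=3, fp=6, tn=775)

print(f"Open Data: sensitivity {open_data[0]:.1%}, specificity {open_data[1]:.1%}")
print(f"Open Code: sensitivity {open_code[0]:.1%}, specificity {open_code[1]:.1%}")
# Open Data: sensitivity 73.6%, specificity 96.7%
# Open Code: sensitivity 72.7%, specificity 99.2%
```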

Figure 5

Venn diagram of the overlap between the detected Open Data publications for the four different detection methods on the validation dataset: Manual search, ODDPub, PubMed, and Web of Science. The manual search represents the gold standard. All but one of the 23 publications detected by ODDPub but not the manual search are false positive detections.

Table 4

Types of data sharing observed in the manually detected Open Data publications of the validation sample.

Category	Number of occurrences
Supplemental Data	42
Field-specific repository	40
General-purpose repository (including GitHub)	14
Institutional repository	0
Personal/project-specific website	1
Data journal	0

Table 5

Reasons for false positive cases detected by ODDPub in the validation sample. In three cases, two conditions applied: part of the shared data was not raw data, and the other part was shared with restrictions.

Category	Number of occurrences
Shared data not raw data	9
Data sharing with restrictions	4
Open Data reuse	4
Only analysis code shared	3
Detected sentence not related to data sharing	3
Data available upon request	1
Linked OSF repository was empty	1
Case of Open Data missed by manual search	1
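
The occurrence counts in Table 5 can be reconciled with the 23 publications that ODDPub flagged but the human rater did not (Table 2). The following Python check is only a bookkeeping sketch; the dictionary keys repeat the table's categories and the variable names are chosen here for illustration.

```python
# Occurrence counts copied from Table 5.
reason_counts = {
    "Shared data not raw data": 9,
    "Data sharing with restrictions": 4,
    "Open Data reuse": 4,
    "Only analysis code shared": 3,
    "Detected sentence not related to data sharing": 3,
    "Data available upon request": 1,
    "Linked OSF repository was empty": 1,
    "Case of Open Data missed by manual search": 1,
}

total_occurrences = sum(reason_counts.values())

# Per the caption, three publications were counted under two categories each.
cases_with_two_conditions = 3
distinct_publications = total_occurrences - cases_with_two_conditions

print(total_occurrences, distinct_publications)  # 26 23
```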
