Unrestricted Versus Regulated Open Data Governance: A Bibliometric Comparison of SARS-CoV-2 Nucleotide Sequence Databases

Open Access | May 2024

Figures & Tables

Table 1

Workflow for Dimensions.ai searches on publications referencing major SARS-CoV-2 data repositories.

Search query, by repository:
GISAID: ‘GISAID’ OR ‘EpiCoV’ OR ‘Global initiative on sharing all influenza data’
NCBI: ‘NCBI Virus’ OR ‘genebank’ OR ‘National Center for Biotechnology Information’ OR ‘Accession: PRJ’ OR ‘International Nucleotide Sequence Database Collaboration’
ENA: ‘The Covid-19 Data Portal’ OR ‘European Nucleotide Archive’ OR ‘ENA’ OR ‘Accession: PRJ’ OR ‘International Nucleotide Sequence Database Collaboration’
DDBJ: ‘DDBJ’ OR ‘DNA Data Bank of Japan’ OR ‘Accession: PRJ’ OR ‘International Nucleotide Sequence Database Collaboration’

COVID-19 related terms (applied to every repository query):
  • ‘sars-cov-2’
  • ‘covid’
  • ‘covid-19’
  • ‘coronavirus’
  • ‘HCov-19’

Additional filters (applied to every repository query):
  • Publication type must be a scientific article.
  • Articles between January 1st 2020 and January 1st 2023.
  • Articles can be from any discipline or methodology.
  • Remove duplicate DOIs.
  • Remove articles where country affiliation is not documented.
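
The additional filters in Table 1 can be illustrated with a small post-processing sketch. This is not the authors' pipeline: the record fields (`doi`, `type`, `date`, `countries`) and the function name `apply_filters` are hypothetical and do not reflect the Dimensions.ai export schema, and treating both date bounds as inclusive is an assumption.

```python
from datetime import date

# Hypothetical record layout; not the Dimensions.ai export schema.
START, END = date(2020, 1, 1), date(2023, 1, 1)

def apply_filters(records):
    """Apply the Table 1 filters in order: type, date window,
    duplicate DOIs (first occurrence wins), documented country."""
    seen, kept = set(), []
    for rec in records:
        if rec["type"] != "article":
            continue
        if not (START <= rec["date"] <= END):  # assumption: inclusive bounds
            continue
        doi = rec["doi"].lower()  # DOIs are case-insensitive
        if doi in seen:
            continue
        seen.add(doi)
        if not rec["countries"]:  # country affiliation not documented
            continue
        kept.append(rec)
    return kept

# Made-up records, for illustration only.
sample = [
    {"doi": "10.1/a", "type": "article", "date": date(2021, 5, 1), "countries": ["GB"]},
    {"doi": "10.1/A", "type": "article", "date": date(2021, 6, 1), "countries": ["GB"]},
    {"doi": "10.1/b", "type": "preprint", "date": date(2021, 5, 1), "countries": ["US"]},
    {"doi": "10.1/c", "type": "article", "date": date(2019, 12, 31), "countries": ["JP"]},
    {"doi": "10.1/d", "type": "article", "date": date(2022, 1, 1), "countries": []},
]
filtered = apply_filters(sample)
print([rec["doi"] for rec in filtered])
```

Only the first record survives: the second repeats its DOI, the third is not an article, the fourth predates the window and the fifth has no country affiliation.
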
Figure 1

Corpus similarity across major SARS-CoV-2 repositories (January 2020–January 2023). Each oval represents a repository’s publication corpus: GISAID (red), ENA (green), NCBI (blue) and DDBJ (purple). Each overlapping region is labelled with its number of unique publications and its percentage of the entire corpus.
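
The region counts in a diagram of this kind can be derived with set operations on publication identifiers. The sketch below is a minimal illustration, assuming each corpus is a set of DOIs; the DOI values and the helper name `region_share` are made up and are not taken from the article.

```python
# Hypothetical DOI sets per repository corpus; values are made up.
corpora = {
    "GISAID": {"10.1/a", "10.1/b", "10.1/c"},
    "ENA": {"10.1/b", "10.1/d"},
    "NCBI": {"10.1/b", "10.1/c", "10.1/e"},
    "DDBJ": {"10.1/e"},
}

all_dois = set().union(*corpora.values())

def region_share(members):
    """Publications found in exactly the repositories in `members`
    (and in no other), as a count and a percentage of the whole corpus."""
    inside = set.intersection(*(corpora[m] for m in members))
    outside = set().union(*(corpora[m] for m in corpora if m not in members))
    exclusive = inside - outside
    return len(exclusive), 100 * len(exclusive) / len(all_dois)

print(region_share(["GISAID"]))          # GISAID-only region
print(region_share(["GISAID", "NCBI"]))  # shared by GISAID and NCBI only
```

Subtracting the union of the other corpora keeps each publication in exactly one region, so the region percentages sum to 100.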

Figure 2

Temporal trends in scholarly metrics across major SARS-CoV-2 repositories. This line graph illustrates longitudinal variations in key bibliometric indicators: (1) Total Number of Publications (top left), (2) Average Citations per Publication (top right), (3) Average Altmetric Score (bottom left) and (4) Cumulative Citation Count (bottom right). Data points span from January 2020 to January 2023. The DNA Data Bank of Japan (DDBJ), National Center for Biotechnology Information (NCBI), European Nucleotide Archive (ENA), and Global Initiative on Sharing All Influenza Data (GISAID) are represented by red, purple, green and aqua blue lines, respectively.

Figure 3

Publication access types across major SARS-CoV-2 repositories. This treemap quantifies the distribution of open access (OA: Gold, Green, Bronze and Hybrid) and closed access publications across the DNA Data Bank of Japan (DDBJ), National Center for Biotechnology Information (NCBI), European Nucleotide Archive (ENA) and Global Initiative on Sharing All Influenza Data (GISAID). Each rectangle’s area is proportional to the number of publications in that category. Databases are delineated by white borders and labelled at their centres in italicised black text. Within each database, access types are labelled in white text.

Figure 4

Publisher landscape across major SARS-CoV-2 repositories. This treemap quantifies the distribution of publishers across the DNA Data Bank of Japan (DDBJ), National Center for Biotechnology Information (NCBI), European Nucleotide Archive (ENA) and Global Initiative on Sharing All Influenza Data (GISAID). Each rectangle’s area is proportional to the number of publications in that category. Databases are delineated by white borders and labelled at their centres in italicised black text. Within each database, publishers are labelled in white text.

Figure 5

Frequency distribution of viral variants mentioned in abstracts across major SARS-CoV-2 repositories. This bar chart represents the frequency of mentions for specific viral variants and lineages in the dataset’s abstracts. WHO labels (e.g., alpha, beta) and Pango lineages (e.g., b.1.1.7, b.1.351) are accounted for. Each bar corresponds to a distinct variant or lineage, ordered in descending frequency of mentions. The x-axis quantifies the number of abstracts mentioning each variant and the y-axis identifies the respective variants and lineages.
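
A count of this kind can be sketched as a per-abstract term match. The example below is a hedged illustration, not the authors' code: the term list is a small subset chosen for the example, each term is counted separately (whether the article merges WHO labels with their Pango equivalents is not stated here), and abstracts are lowercased before matching. Plain word matching will also catch non-variant uses of words such as "alpha", which a real analysis would need to handle.

```python
import re
from collections import Counter

# Illustrative subset of terms; the article's full list is not reproduced here.
TERMS = ["alpha", "beta", "b.1.1.7", "b.1.351"]

def count_variant_mentions(abstracts, terms=TERMS):
    """Count how many abstracts mention each term at least once."""
    patterns = {
        # Not preceded by a word character or dot; not followed by a word
        # character or by ".<digit>" (which would indicate a sub-lineage).
        t: re.compile(r"(?<![\w.])" + re.escape(t) + r"(?!\w|\.\d)")
        for t in terms
    }
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        for term, pattern in patterns.items():
            if pattern.search(lowered):
                counts[term] += 1
    return counts.most_common()  # descending frequency, as in the bar chart

# Made-up abstracts, for illustration only.
abstracts = [
    "The Alpha variant (B.1.1.7) spread rapidly.",
    "We sequenced B.1.351 and B.1.1.7 isolates.",
    "Beta and alpha lineages co-circulated.",
]
mentions = count_variant_mentions(abstracts)
print(mentions)
```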

Figure 6

Co-occurrence network of top MeSH terms across major SARS-CoV-2 repositories. This graph depicts a co-occurrence network of the top 15 Medical Subject Headings (MeSH) terms within the given dataset, represented as nodes. Edges between nodes are weighted by the frequency of co-occurring terms across publications, and edge thickness scales with weight; edges with weight below a threshold of 5 are excluded for clarity. Node size is dictated by the node’s degree and colour is mapped to betweenness centrality.
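
The construction described in this caption can be sketched without a graph library. The example below is an assumption-laden illustration: the function names and the sample term lists are hypothetical, a term is counted once per publication, and an edge is kept when its weight is at least 5. Colouring by betweenness centrality is left out to keep the sketch dependency-free.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(term_lists, top_n=15, min_weight=5):
    """Weighted co-occurrence edges among the top_n most frequent terms."""
    freq = Counter(t for terms in term_lists for t in set(terms))
    top = {t for t, _ in freq.most_common(top_n)}
    weights = Counter()
    for terms in term_lists:
        present = sorted(set(terms) & top)  # sorted so each pair has one key
        weights.update(combinations(present, 2))
    # Exclude edges whose weight is below the threshold.
    return {pair: w for pair, w in weights.items() if w >= min_weight}

def degrees(edges):
    """Unweighted degree per node: number of retained edges touching it."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

# Made-up MeSH term lists, one per publication.
papers = (
    [["Humans", "SARS-CoV-2", "COVID-19"]] * 6
    + [["Humans", "Genome, Viral"]] * 5
    + [["SARS-CoV-2", "Phylogeny"]] * 2
)
edges = cooccurrence_edges(papers)
node_degree = degrees(edges)
print(edges)
print(node_degree)
```

In this toy input the "Phylogeny" edge has weight 2 and is dropped, so "Phylogeny" does not appear in the degree table at all.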

Table 2

Number of papers per number of authors across major SARS-CoV-2 repositories. This table displays the cumulative number of papers per number of authors for each repository between 2020 and 2022.

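Number of papers, by number of authors:
Repo | Year | 1 | 2–5 | 5–10 | 10–20 | 20–50 | 50–100 | 100–250 | 250–500 | 500–1000 | 1000–5000
GISAID | 2020 | 64 | 609 | 548 | 511 | 275 | 25 | 4 | 0 | 3 | 1
GISAID | 2021 | 93 | 911 | 1062 | 1207 | 634 | 60 | 7 | 0 | 5 | 1
GISAID | 2022 | 95 | 916 | 1135 | 1254 | 672 | 45 | 12 | 0 | 2 | 1
ENA | 2020 | 72 | 420 | 304 | 178 | 104 | 18 | 9 | 1 | 1 | 0
ENA | 2021 | 275 | 1219 | 608 | 410 | 198 | 28 | 14 | 0 | 1 | 1
ENA | 2022 | 407 | 1347 | 875 | 525 | 180 | 30 | 12 | 4 | 0 | 1
NCBI | 2020 | 80 | 613 | 546 | 288 | 114 | 14 | 3 | 0 | 0 | 0
NCBI | 2021 | 126 | 1032 | 965 | 542 | 209 | 27 | 5 | 0 | 3 | 0
NCBI | 2022 | 113 | 1189 | 1258 | 718 | 210 | 21 | 4 | 0 | 0 | 0
DDBJ | 2020 | 15 | 49 | 36 | 16 | 3 | 0 | 0 | 0 | 0 | 0
DDBJ | 2021 | 10 | 22 | 19 | 17 | 1 | 0 | 0 | 0 | 0 | 0
DDBJ | 2022 | 6 | 32 | 23 | 23 | 7 | 0 | 1 | 0 | 0 | 0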
Figure 7

Distribution of single- and multi-region collaborations in scholarly publications by top 20 countries across major SARS-CoV-2 repositories. This stacked bar chart portrays the extent of single-region and multi-region collaborations in scholarly documents for the top 20 countries based on publication volume. The x-axis indicates the number of documents associated with each country, and the y-axis lists the countries in descending order of total documents. The colours in each bar segment represent the type of collaboration: single-region or multi-region.

Table 3

Income collaboration across major SARS-CoV-2 repositories. This table presents, for each repository, the share of collaborations between income groups (based on World Bank classifications).

Income collaboration | GISAID | NCBI | ENA | DDBJ
HI-HI | 7894 (41.3%) | 6340 (41.5%) | 10372 (47.7%) | 151 (64.3%)
HI-LI | 138 (0.72%) | 113 (0.74%) | 163 (0.8%) | 6 (2.6%)
HI-LMI | 2161 (11.3%) | 2161 (14%) | 1724 (8%) | 51 (21.7%)
HI-MIX | 2099 (11%) | 1754 (11.2%) | 3830 (24.7%) | 14 (6%)
HI-UMI | 6494 (34%) | 4521 (30.3%) | 5376 (26%) | 49 (20.9%)
LMI-LI | 4 (0.02%) | 19 (0.15%) | 21 (1%) | 2 (0.9%)
UMI-LI | 17 (0.09%) | 16 (0.1%) | 13 (0.06%) | 12 (5.1%)
UMI-LMI | 269 (1.41%) | 349 (2.28%) | 224 (1%) | 0 (0%)
UMI-MIX | 25 (0.13%) | 14 (0.1%) | 19 (0.09%) | 0 (0%)
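
As a worked check of how the shares in Table 3 relate to the counts, the sketch below recomputes the GISAID column. It assumes each percentage is the count divided by the column total; the denominator is not stated on this page, so that is an assumption. Under it, the GISAID percentages are reproduced up to rounding.

```python
# GISAID counts copied from Table 3.
gisaid_counts = {
    "HI-HI": 7894, "HI-LI": 138, "HI-LMI": 2161, "HI-MIX": 2099,
    "HI-UMI": 6494, "LMI-LI": 4, "UMI-LI": 17, "UMI-LMI": 269, "UMI-MIX": 25,
}

# Assumption: the share is count / column total, expressed as a percentage.
total = sum(gisaid_counts.values())
shares = {pair: 100 * n / total for pair, n in gisaid_counts.items()}

for pair, pct in shares.items():
    print(f"{pair}: {gisaid_counts[pair]} ({pct:.2f}%)")
```
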
Table 4

Comparative network statistics across major SARS-CoV-2 repositories. This table delineates the network attributes of various databases across distinct classifications: Authors, Country, Funder and Institution. Metrics included are node count, edge count, clustering coefficients and density scores.

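Database | Type | Clustering coefficient | Density | Mean degree | Mean weighted degree | Max community size | Mean community size | Median community size | Min community size | Total communities | Number of edges | Number of nodes
DDBJ | Authors | 0.96 | 0.01 | 16.13 | 17.54 | 101.00 | 8.75 | 7.00 | 2.00 | 246.00 | 17,359.00 | 2152.00
DDBJ | Country | 0.42 | 0.08 | 4.56 | 16.70 | 43.00 | 19.00 | 12.00 | 2.00 | 3.00 | 130.00 | 57.00
DDBJ | Funder | 0.77 | 0.68 | 8.15 | 44.15 | 13.00 | 13.00 | 13.00 | 13.00 | 1.00 | 53.00 | 13.00
DDBJ | Institution | 0.60 | 0.01 | 5.24 | 5.60 | 71.00 | 4.80 | 3.00 | 2.00 | 87.00 | 1095.00 | 418.00
ENA | Authors | 0.99 | 0.00 | 246.25 | 252.47 | 2987.00 | 11.28 | 6.00 | 2.00 | 6179.00 | 8,583,776.00 | 69,717.00
ENA | Country | 0.71 | 0.29 | 43.54 | 9220.42 | 153.00 | 153.00 | 153.00 | 153.00 | 1.00 | 3331.00 | 153.00
ENA | Funder | 0.93 | 0.92 | 14.71 | 1752.82 | 17.00 | 17.00 | 17.00 | 17.00 | 1.00 | 125.00 | 17.00
ENA | Institution | 0.90 | 0.03 | 221.45 | 242.59 | 6395.00 | 68.30 | 2.00 | 2.00 | 101.00 | 763,790.00 | 6898.00
GISAID | Authors | 0.91 | 0.00 | 158.61 | 207.15 | 3310.00 | 18.25 | 8.00 | 2.00 | 4679.00 | 6,771,757.00 | 85,387.00
GISAID | Country | 0.65 | 0.26 | 40.50 | 823.06 | 159.00 | 159.00 | 159.00 | 159.00 | 1.00 | 3220.00 | 159.00
GISAID | Funder | 0.93 | 0.91 | 13.63 | 2289.00 | 16.00 | 16.00 | 16.00 | 16.00 | 1.00 | 109.00 | 16.00
GISAID | Institution | 0.31 | 0.00 | 27.35 | 48.77 | 4447.00 | 48.24 | 3.00 | 2.00 | 125.00 | 82,446.00 | 6030.00
NCBI | Authors | 0.97 | 0.00 | 57.73 | 61.32 | 1088.00 | 10.59 | 7.00 | 2.00 | 7002.00 | 2,140,401.00 | 74,155.00
NCBI | Country | 0.56 | 0.19 | 27.31 | 741.80 | 148.00 | 148.00 | 148.00 | 148.00 | 1.00 | 2021.00 | 148.00
NCBI | Funder | 0.89 | 0.84 | 13.41 | 1448.12 | 17.00 | 17.00 | 17.00 | 17.00 | 1.00 | 114.00 | 17.00
NCBI | Institution | 0.30 | 0.00 | 24.98 | 32.79 | 5956.00 | 49.50 | 2.00 | 2.00 | 131.00 | 81,002.00 | 6485.00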
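
Two of the Table 4 metrics follow directly from the node and edge counts. The sketch below uses the standard formulas for an undirected simple graph, mean degree 2E/N and density 2E/(N(N-1)); treating the networks as undirected is an assumption, though it reproduces the DDBJ values shown in the table.

```python
def mean_degree(nodes, edges):
    """Average number of edges per node in an undirected graph."""
    return 2 * edges / nodes

def density(nodes, edges):
    """Share of possible undirected edges that are present."""
    return 2 * edges / (nodes * (nodes - 1))

# (nodes, edges) for two DDBJ networks, copied from Table 4.
ddbj = {"Authors": (2152, 17359), "Country": (57, 130)}

for kind, (n, e) in ddbj.items():
    print(kind, round(mean_degree(n, e), 2), round(density(n, e), 2))
```

The very low densities reported for the author networks follow from the N(N-1) denominator: even several million edges are a small fraction of the possible pairs among tens of thousands of authors.
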
Figure 8

Country collaborations across major SARS-CoV-2 repositories. A world map is overlaid with lines linking collaborating countries; line colour represents the weight of each collaboration. Countries are colour-coded by their total number of publications.

Figure 9

Funder group collaborations across major SARS-CoV-2 repositories. This graph depicts a collaboration network of funder groups within the given dataset, represented as nodes. Edges between nodes are weighted by the frequency of co-occurring funders across publications. Node size is dictated by the node’s weighted degree.

Figure 10

Institution group network across major SARS-CoV-2 repositories. This graph depicts a collaboration network of the institutions within the given dataset, represented as nodes. Edges between nodes are weighted by the frequency of collaborating institutions across publications. Node size is dictated by the node’s weighted degree, and colour is mapped to betweenness centrality, following a viridis scale.

Table 5

List of pros and cons for the strategies adopted by each repository based on results.

Categories (each with a PRO and a CON column): Bibliometric indicators; Keyword & variant distribution; Country collaboration; Reuse networks.

GISAID
  • Best performance in Altmetric and citation indicators.
  • Predominantly Open Access papers.
  • Reduced aggregate of publications compared to the entire INSDC.
  • Focus on therapies and treatments.
  • Largest number of mentioned variants for WHO labels and Pango lineages.
  • Highest weighted degree.
  • Highest single-region collaboration.
  • Small share of low-income collaboration.
  • High-income-only collaborations are the dominant collaboration type.
  • Biggest density of author and country collaboration.
  • Sparse institutional collaboration.

ENA
  • Best INSDC member in terms of impact factor metrics.
  • Primarily Open Access papers.
  • Features highest closed access.
  • Focus on animal and sex studies and wider methods like computational biology.
  • Fewer mentions of variants compared to GISAID and NCBI.
  • Highest multi-region collaboration share.
  • Highest high-income MIX collaboration.
  • Small share of low-income collaboration.
  • High-income-only collaborations are the dominant collaboration type.
  • Highest density of institutional and funder collaboration.
  • Greater author and country collaboration than NCBI.
  • Shorter paths in comparison to NCBI and GISAID.
  • Smaller share of authors compared to GISAID.

NCBI
  • Largest number of articles from the INSDC group.
  • Primarily Open Access.
  • Focus on animal and sex studies and wider methods like computational biology.
  • Fewer mentions of variants compared to GISAID.
  • Highest high-income/lower-middle-income collaboration.
  • Leading collaborators from the most types of income group.
  • Small share of low-income collaboration.
  • High-income-only collaborations are the dominant collaboration type.
  • Greater institution collaboration than GISAID.
  • Smaller average paths than GISAID.
  • Smallest author and country collaboration size relative to GISAID and ENA.

DDBJ
  • Spike of very influential citing papers.
  • Predominantly gold open access.
  • Lowest number of publications.
  • In general, lowest impact factor metrics.
  • Focus on internal studies on human population.
  • Smallest number of variants mentioned.
  • Lowest weighted degree and degree.
  • Small share of low-income collaboration.
  • High-income-only collaborations are the dominant collaboration type.

Language: English
Submitted on: May 18, 2023 | Accepted on: Mar 28, 2024 | Published on: May 13, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Nathanael Sheehan, Federico Botta, Sabina Leonelli, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.