
Figure 1
Workflow for the ingestion of citation data and bibliographic metadata into the OpenCitations datasets.

Figure 2
Flowchart describing the preliminary processing of citing bibliographic entities.

Figure 3
Flowchart describing the processing of cited bibliographic entities, their validation, and the production of metadata and citation tables.
Table 1
Sample of Meta input tables produced by oc_ds_converter, storing bibliographic entities’ metadata.
| ID | TITLE | AUTHOR | PUB_DATE | VENUE | VOLUME | ISSUE | PAGE | TYPE | PUBLISHER | EDITOR |
|---|---|---|---|---|---|---|---|---|---|---|
| DOI: 10.14825/kaseki.68.0_14 | 本邦産白亜紀アンモナイトデータベースおよび種多様性について | 利光, 誠一; 平野, 弘道; 松本, 崇; 高橋, 一晴 | 2000 | 化石 [issn:0022-9202 issn:2424-2632 jid:kaseki] | 68 | 0 | 14–16 | journal article | 日本古生物学会 | |
| DOI: 10.1126/science.235.4793.1156 | Chronology of fluctuating sea levels since the Triassic | | 1987 | Science | 235 | | 1156–1167 | | | |
Table 2
Sample of Index input tables, produced by oc_ds_converter, storing citation data.
| CITING | CITED |
|---|---|
| DOI: 10.14825/kaseki.68.0_14 | DOI: 10.1126/science.235.4793.1156 |

Figure 4
Language distribution in Meta bibliographic entities, calculated on Meta dump, version 5 (https://doi.org/10.6084/m9.figshare.21747461.v5). The analysis was performed on bibliographic entities with a declared title.

Figure 5
Bar charts illustrating the analysis of multilingualism within the input dataset, categorized by bibliographic metadata fields.
Table 3
Languages of the metadata in the original dataset and the linguistic information lost due to OCDM constraints. The total number of values provided for a field is the number of values provided in one language only, plus twice the number of values provided in two languages, plus, for values provided in more than two languages, the number of values multiplied by the number of languages in which they are provided. The information loss counts every language version beyond the first (one for each value provided in two languages, two for each value provided in three languages, and so on) and is reported both as a count and as a percentage of the total. The publisher’s name field is not included in the table because a loss there does not necessarily concern linguistic information: it may instead derive from multi-publisher values.
| FIELD | 1 LANGUAGE | 2 LANGUAGES | 3+ LANGUAGES | TOTAL VALUES PROVIDED | INFORMATION LOSS W.R.T. THE ORIGINAL DATASET |
|---|---|---|---|---|---|
| title citing | 5,701,285 | 1,641,895 | 39 (3 languages) | 8,985,192 | 1,641,973; 18.27% |
| title cited | 217,316 | 12,616 | 0 | 242,548 | 12,616; 5.2% |
| authors citing | 9,892,522 | 4,556,812 | 39 (3 languages) | 19,006,263 | 4,556,890; 23.98% |
| authors cited | 308,079 | 157,556 | 0 | 623,191 | 157,556; 25.28% |
| journal title citing | 1,137,368 | 2,658,678 | 21,213 (20,572 in 3 languages; 641 in 4 languages) | 6,519,004 | 2,701,745; 41.44% |
| journal title cited | 180,515 | 0 | 0 | 180,515 | 0 |
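
The totals and loss figures in Table 3 can be reproduced from the per-language counts. The following is a minimal sketch, not the authors' code: the function name `language_totals` and its parameters are hypothetical, and it assumes that only one language version of each value is retained, so a value provided in n languages loses n − 1 versions. Under that assumption the output matches the reported rows.

```python
def language_totals(one_lang, two_lang, multi_lang=None):
    """Return (total values provided, values lost, loss percentage).

    one_lang   -- number of values provided in a single language
    two_lang   -- number of values provided in exactly two languages
    multi_lang -- mapping {number_of_languages: count} for three or more languages
    """
    multi_lang = multi_lang or {}
    # Each value contributes one entry per language in which it is provided.
    total = one_lang + 2 * two_lang + sum(n * c for n, c in multi_lang.items())
    # Assumption: one language version is kept, so n - 1 versions are lost per value.
    lost = two_lang + sum((n - 1) * c for n, c in multi_lang.items())
    pct = round(100 * lost / total, 2) if total else 0.0
    return total, lost, pct


# Counts taken from the "title citing" and "journal title citing" rows.
title_citing = language_totals(5_701_285, 1_641_895, {3: 39})
journal_citing = language_totals(1_137_368, 2_658_678, {3: 20_572, 4: 641})
print(title_citing)    # (8985192, 1641973, 18.27)
print(journal_citing)  # (6519004, 2701745, 41.44)
```

Passing the three-plus-language counts as a mapping keeps the four-language journal titles separate from the three-language ones, which is needed because each group is weighted by its own number of languages.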

Figure 6
Language distribution in Meta bibliographic entities, calculated on Meta dump, version 6 (https://doi.org/10.6084/m9.figshare.21747461.v6). The analysis was performed on bibliographic entities with a declared title.
Table 4
Yellow cells represent the individual contribution of each collection to the OpenCitations Index, i.e., the number of citations derived uniquely from a given source. Pink cells represent the number of citations in the intersection between sources. The table is based on OpenCitations data as of its latest update (29 November 2023).
| | INDEX | CROSSREF | DATACITE | PUBMED | OPENAIRE | JALC |
|---|---|---|---|---|---|---|
| INDEX | 1,975,552,846 | 1,563,218,160 | 169,814,412 | 695,988,810 | 14,645,838 | 396,788 |
| Crossref | | 1,100,963,346 | 27,051 | 458,309,297 | 3,917,329 | 1,137 |
| DataCite | | | 169,663,255 | 9,623 | 114,483 | 0 |
| PubMed | | | | 237,208,867 | 9,711,789 | 125 |
| OpenAIRE | | | | | 1,067,712 | 0 |
| JaLC | | | | | | 395,526 |
