
Making the Complete OpenAIRE Citation Graph Easily Accessible Through Compact Data Representation

By: Joakim Skarding and Pavel Sanda
Open Access | Apr 2026

Figures & Tables

Table 1

Combined output of the validation scripts, reproduced from their automated output log.

| SCRIPT | METRIC | VALUE |
|---|---|---|
| full_id_coverage (pass) | raw_count | 205841448 |
| | parquet_count | 205841448 |
| | missing_count | 0 |
| | extra_count | 0 |
| distinct_constraints (pass) | publications total | 205841448 |
| | publications distinct | 205841448 |
| | publications duplicates | 0 |
| | citations total | 2184347684 |
| | citations distinct | 2184347684 |
| | citations duplicates | 0 |
| | citations null_rows | 0 |
| format_checks (pass) | nodeid gaps | 0 |
| | nodeid contiguous | true |

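
The metrics in Table 1 can be illustrated with a minimal sketch. This is not the authors' validation code: the function name `validate`, the toy IDs, and the exact metric definitions (for example, counting a gap whenever two consecutive sorted node IDs differ by more than one) are assumptions made for illustration only.

```python
def validate(raw_ids, parquet_ids, node_ids):
    """Compute coverage, duplicate, and contiguity metrics on small in-memory lists.

    Hypothetical helper: the metric names mirror Table 1, but the definitions
    used here are assumptions, not the published scripts.
    """
    raw_set, parquet_set = set(raw_ids), set(parquet_ids)
    sorted_nodes = sorted(node_ids)
    # A gap is any jump larger than 1 between consecutive sorted node IDs.
    gaps = sum(1 for a, b in zip(sorted_nodes, sorted_nodes[1:]) if b - a > 1)
    return {
        "raw_count": len(raw_ids),
        "parquet_count": len(parquet_ids),
        "missing_count": len(raw_set - parquet_set),  # in raw, absent from compact
        "extra_count": len(parquet_set - raw_set),    # in compact, absent from raw
        "duplicates": len(parquet_ids) - len(parquet_set),
        "nodeid_gaps": gaps,
        "nodeid_contiguous": gaps == 0,
    }


# Toy data (made up): identical ID sets and node IDs 0..2 pass every check.
report = validate(["a", "b", "c"], ["a", "b", "c"], [0, 1, 2])
print(report)
```

At the scale reported in Table 1 (about 206 million publications and 2.18 billion citations), plain Python sets would likely be too memory-hungry, so a real check would more plausibly run on columnar data; the sketch only makes the meaning of each metric concrete.
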
Table 2

Comparison of the storage size and memory requirements of the full OpenAIRE (OA) dataset (release 2025-12-01) and its compact versions. Memory usage is the number of GB each dataset occupied when loaded into a Pandas dataframe (The pandas development team, 2026). Because the citations are loaded as int32, the in-memory size of citations.tsv is much lower than its size on disk.

| DATASET | SIZE ON DISK (GB) | MEMORY USAGE (GB) |
|---|---|---|
| citations.tsv | 39 | 17 |
| publications.tsv | 6 | 5 |
| publications_large.tsv | 187 | 185 |
| citations.parquet | 8 | 17 |
| publications.parquet | 2 | 5 |
| publications_large.parquet | 68 | 185 |
| Full OA – edges | 1820 | NA |
| Full OA – nodes | 700 | NA |

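
The in-memory figure for the citation files can be sanity-checked with simple arithmetic. The sketch below assumes that each citation is stored as two int32 values (a source and a target node ID) and that the table reports decimal gigabytes; neither assumption is stated in the caption. The edge count is the "citations total" value from Table 1.

```python
N_CITATIONS = 2_184_347_684  # "citations total" from Table 1
INT32_BYTES = 4              # size of one int32 value
COLUMNS_PER_EDGE = 2         # assumption: source and target node ID

edge_bytes = N_CITATIONS * COLUMNS_PER_EDGE * INT32_BYTES
edge_gb = edge_bytes / 1e9   # assumption: decimal gigabytes

print(f"{edge_bytes:,} bytes = {edge_gb:.1f} GB")
```

Under these assumptions the estimate is about 17.5 GB, in line with the 17 GB reported for both citations.tsv and citations.parquet.
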
Table 3

Memory size of each column, with its Python type and a short description. Arrow is the in-memory size when the data is loaded with PyArrow; Pandas is the size when the data is loaded straight into a default Pandas dataframe. The Pandas Opt (optimized) column uses PyArrow as a backend and applies the "categorical" type to the container and language fields to further lower their footprint. The MAG IDs are Microsoft Academic Graph IDs.

| COLUMN | PYTHON TYPE | DESCRIPTION | ARROW | PANDAS | PANDAS OPT |
|---|---|---|---|---|---|
| nodeId | int32 | Node ID | 0.767 GB | 0.767 GB | 0.767 GB |
| openaireId | str | OpenAIRE unique ID | 9.585 GB | 19.746 GB | 9.586 GB |
| title | str | Paper title | 16.486 GB | 29.707 GB | 16.486 GB |
| authors | list[str] | List of authors | 11.037 GB | 23.005 GB | 11.038 GB |
| description | str | Main text/abstract | 131.248 GB | 193.583 GB | 131.248 GB |
| date | datetime | Publication date | 0.767 GB | 7.587 GB | 0.768 GB |
| container | str | Journal/conference name | 5.471 GB | 13.953 GB | 2.181 GB |
| citations | int | Citation count | 1.558 GB | 1.534 GB | 1.558 GB |
| language | str | Language | 1.394 GB | 11.555 GB | 0.197 GB |
| pid_dois | list[str] | DOI identifiers | 5.639 GB | 19.453 GB | 5.639 GB |
| pid_mag_ids | list[str] | MAG IDs | 2.004 GB | 12.748 GB | 2.005 GB |
| pid_pmids | list[str] | PubMed IDs | 1.202 GB | 7.948 GB | 1.202 GB |
| pid_handles | list[str] | Persistent handles | 1.149 GB | 6.134 GB | 1.149 GB |
| pid_pmcs | list[str] | PubMed Central IDs | 0.921 GB | 5.480 GB | 0.921 GB |
| pid_arxiv_ids | list[str] | ArXiv IDs | 0.885 GB | 4.856 GB | 0.886 GB |
| TOTAL | | | 190.113 GB | 358.055 GB | 185.631 GB |

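
The effect of the "categorical" type on a low-cardinality column such as language can be seen on a small scale. The sketch below uses made-up language codes and a plain Pandas Series; it leaves out the PyArrow backend described in the caption and is not the authors' loading code, so the absolute numbers are not comparable with Table 3.

```python
import pandas as pd

# Toy stand-in for the `language` column: many rows, few distinct values.
# The values are invented for illustration.
languages = pd.Series(["eng", "fra", "eng", "deu", "eng", "fra"] * 10_000,
                      name="language")

# Default-style storage: one Python string object per row.
as_object = languages.astype(object)

# Categorical storage: each distinct string is stored once, and every row
# holds only a small integer code pointing at it.
as_category = languages.astype("category")

object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The saving grows with the ratio of rows to distinct values, which is consistent with Table 3 showing a reduction only for the container and language columns, where categoricals were applied.
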
DOI: https://doi.org/10.5334/johd.520 | Journal eISSN: 2059-481X
Language: English
Page range: 63 - 63
Submitted on: Feb 13, 2026
Accepted on: Apr 10, 2026
Published on: Apr 30, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Joakim Skarding, Pavel Sanda, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.