Making the Complete OpenAIRE Citation Graph Easily Accessible Through Compact Data Representation

By: Joakim Skarding and Pavel Sanda
Open Access
|Apr 2026

1 Overview

Repository location

We use two repositories: Zenodo to share our data and a fixed version of the processing pipeline, and a Git repository on Codeberg to facilitate sharing and further development of the processing pipeline.

Data repository

https://zenodo.org/records/19207803

Code repository

https://codeberg.org/Zmeos/OpenAIRE-citation-extraction

Context

The OpenAIRE graph (Manghi et al., 2025; Rettberg & Schmidt, 2012) is a large knowledge graph storing several kinds of research data. In this work we focus on extracting the citations and publications from that dataset, and distilling the information to make it more accessible.

There are several other citation graphs available: OpenAlex (Priem et al., 2022), Open Research Knowledge Graph (Jaradeh et al., 2019), Crossref (Hendricks et al., 2020) and OpenCitations (Peroni & Shotton, 2020). These datasets differ in many ways; some are only citation graphs, while others, like OpenAlex and OpenAIRE, are research knowledge graphs that record much more than just citations and publications. The datasets also differ in their publication coverage (Culbert et al., 2025; Martín-Martín et al., 2021) and in how the works are indexed, e.g., how they determine taxonomy (Ciuciu-Kiss & Garijo, 2024b). Differences in publication coverage between OpenAIRE and other open datasets are not well explored: a subset of OpenAIRE has been compared to OpenAlex (Ciuciu-Kiss & Garijo, 2024a), but to the best of our knowledge, no coverage analysis on the scale of previous works such as Martín-Martín et al. (2021) or Culbert et al. (2025) has taken place. Comparing publication counts alone is not sufficient, since these datasets aggregate publications from multiple sources and a large node count might simply be due to poor de-duplication.

The OpenAIRE initiative offers several APIs, a search engine, cloud access through Google BigQuery, and a complete dataset dump (Manghi et al., 2025). The APIs and the search engine are accessible to any researcher, but they are limited in the scope of data that can be accessed efficiently. At the time of writing, a scan of the complete graph exceeds the free tier of BigQuery. If a researcher is interested in large portions of the graph, or in the graph as a whole, the dump is more suitable.

Processing the whole graph dump requires programming skills and large amounts of memory, storage and computational power, resources that are not readily accessible to many scholars in the humanities (Dederke et al., 2024). Accessing the raw data also requires a solid understanding of the OpenAIRE JSON schema. To fill this gap, we pre-process the dump, distill it to a more manageable size, and distribute it as simply structured plain-text and Parquet files, making the whole graph more accessible. Similar work has also been done for OpenAlex (Caetano Machado Lopes & Chacko, 2024).

2 Method

The OpenAIRE graph consists of many entities and many relations, stored across many files in the dump. To extract the citation network we only need the publication files (nodes) and the relation files (citations/edges). The publication files contain a lot of information about each publication, only some of which is relevant to most researchers. The relation files include many types of relations; for this dataset we are only interested in citations (the type “Cites”). The following list shows the generic steps we followed to extract the citation network and minimize its memory footprint. The list is organized by Python file and shows the file responsible for each step (the full code is available in the Codeberg repository).

Steps

  • 0. step0_download_data_and_extract.py

    Download the complete OpenAIRE graph dump (Manghi et al., 2025), including all publication and relation files. When calling the pipeline, this is an optional step.

  • 1. step1_extract_raw.py

    Partially extract the compressed dump files.

  • 2. step2_publications.py

    • (a) Filter records, obtaining publication-type only.

    • (b) Flatten nested JSON structures to obtain a tabular representation.

    • (c) Generate new, more memory-efficient (int32) nodeIDs, and build a hash table for translating between OpenAIRE IDs and our nodeIDs.

    • (d) For the TSV files, replace any tabs or newlines with whitespace.

    • (e) Export TSV and Parquet publication files.

  • 3. step3_citations.py

    • (a) Process relation files using PySpark (Zaharia et al., 2016), retaining only relations of type Cites and extracting source and target identifiers.

    • (b) Use the hash table produced by step2_publications.py to replace the OpenAIRE IDs, so that all citation relations reference the new identifiers.

    • (c) Export TSV and Parquet citation files.

  • 4. step4_distribution.py

    Compress the TSV files with xz.

This yields a distilled dataset in which the relations are stored efficiently as pairs of 32-bit integers.
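The core transformations in steps 2 and 3 can be illustrated on toy data. The following is a minimal sketch, not the pipeline's actual code: the record layout and field names (`id`, `title`, `container`, `source`, `target`, `relType`) are hypothetical simplifications of the OpenAIRE schema, the real pipeline uses PySpark for the relation files, and dropping edges whose endpoints are not publications is an assumption made here for illustration.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level (step 2b, simplified)."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def sanitize_for_tsv(value):
    """Replace tabs and newlines so a field cannot break a TSV row (step 2d)."""
    if value is None:
        return ""
    text = str(value)
    for ch in ("\t", "\r", "\n"):
        text = text.replace(ch, " ")
    return text


def assign_node_ids(openaire_ids):
    """Map each distinct ID to a consecutive integer starting at 0 (step 2c)."""
    mapping = {}
    for oid in openaire_ids:
        if oid not in mapping:
            mapping[oid] = len(mapping)
    return mapping


def translate_citations(relations, mapping):
    """Keep 'Cites' relations and rewrite both endpoints (steps 3a and 3b)."""
    edges = []
    for rel in relations:
        if rel["relType"] != "Cites":
            continue
        src = mapping.get(rel["source"])
        dst = mapping.get(rel["target"])
        # Assumption: edges with an endpoint outside the node table are dropped.
        if src is not None and dst is not None:
            edges.append((src, dst))
    return edges


# Toy records with hypothetical field names.
publications = [
    {"id": "pub::a", "title": "First\tpaper", "container": {"name": "J. Ex."}},
    {"id": "pub::b", "title": "Second\npaper", "container": {"name": "Proc. Ex."}},
]
relations = [
    {"source": "pub::a", "target": "pub::b", "relType": "Cites"},
    {"source": "pub::a", "target": "pub::b", "relType": "IsSupplementTo"},
]

rows = [{k: sanitize_for_tsv(v) for k, v in flatten(p).items()} for p in publications]
node_ids = assign_node_ids(p["id"] for p in publications)
edges = translate_citations(relations, node_ids)
```

Because the new IDs are consecutive integers, the mapping doubles as the translation table between OpenAIRE IDs and nodeIDs, and each edge shrinks from two long strings to two small integers.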

Quality control

Several validation scripts are run after processing to control for mistakes in the processing. The scripts check whether any entries from the original dump were lost and report any missing values.

  • full_id_coverage.py Verifies that every publication in the raw source data is present in the final output.

  • distinct_constraints.py Checks that there are no duplicate publications or duplicate citation edges in the output.

  • format_checks.py Verifies that publication node IDs form a complete, gap-free sequence starting from zero.

Additionally, the file run_all_validations.py runs all validation scripts, collects their results, and writes a consolidated summary report. The entire validation pipeline runs automatically after the processing pipeline completes. Output from the validation scripts is shown in Table 1.

Table 1

Combined validation script output. This is the automated output log of the validation scripts.

SCRIPT                        METRIC                    VALUE
full_id_coverage (pass)       raw_count                 205841448
                              parquet_count             205841448
                              missing_count             0
                              extra_count               0
distinct_constraints (pass)   publications total        205841448
                              publications distinct     205841448
                              publications duplicates   0
                              citations total           2184347684
                              citations distinct        2184347684
                              citations duplicates      0
                              citations null_rows       0
format_checks (pass)          nodeid gaps               0
                              nodeid contiguous         true
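The logic behind the three checks can be sketched on small in-memory collections. This is an illustrative sketch only: the function names are hypothetical, the returned keys merely mirror the metric names in Table 1, and at the scale of the real dataset such checks would need a distributed or out-of-core implementation rather than Python sets.

```python
def check_coverage(raw_ids, output_ids):
    """Compare the IDs in the source data with those in the output."""
    raw, out = set(raw_ids), set(output_ids)
    return {
        "raw_count": len(raw),
        "parquet_count": len(out),
        "missing_count": len(raw - out),  # in the source, absent from the output
        "extra_count": len(out - raw),    # in the output, absent from the source
    }


def check_distinct(items):
    """Count duplicates among publications or (source, target) edges."""
    items = list(items)
    distinct = len(set(items))
    return {"total": len(items), "distinct": distinct, "duplicates": len(items) - distinct}


def check_contiguous(node_ids):
    """Verify that node IDs form the gap-free sequence 0, 1, ..., n - 1."""
    ids = set(node_ids)
    if not ids:
        return {"gaps": 0, "contiguous": True}
    if min(ids) < 0:
        raise ValueError("node IDs must be non-negative")
    gaps = (max(ids) + 1) - len(ids)  # values in 0..max that never occur
    return {"gaps": gaps, "contiguous": gaps == 0}


coverage = check_coverage(["a", "b", "c"], ["a", "b", "c"])
duplicates = check_distinct([(0, 1), (0, 2), (0, 1)])
sequence_ok = check_contiguous([2, 0, 1])
sequence_gap = check_contiguous([0, 2, 3])
```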

Output

The processing pipeline takes the full OpenAIRE dump as input and transforms it into the TSV (tab-separated values) files and equivalent Parquet files described below. TSV is a variant of the more common CSV format. We chose TSV because several of the text fields in publications_large.tsv routinely contain commas, and our attempts at quoting them turned out to be error-prone. Parquet is a binary columnar format designed for high-performance data processing.

  • citations.tsv.xz and citations.parquet – a simple edge list of the graph (all the citations). Each row has a simple form of two connected nodeIDs, e.g.: 159486578 118392581

  • publications.tsv.xz and publications.parquet – all graph nodes (publications). These files contain only the nodeID and, when available, the DOI. Each row has a simple form, e.g.: 14209 10.3931/e-rara-45685

  • publications_large.tsv.xz and publications_large.parquet – the same number of nodes as the publications files, with additional fields such as title, authors and description. For a full overview of the fields, see Table 3.
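For readers who want to inspect the compressed edge list without loading it into memory, the file can be streamed using only the Python standard library. This is a sketch under two assumptions: that the TSV starts with a header row (suggested by the column names used in the Pandas example, but not stated explicitly), and that each data row holds exactly two integer fields. The function name `iter_edges` is hypothetical.

```python
import csv
import lzma


def iter_edges(path, has_header=True):
    """Yield (source, target) integer pairs from an xz-compressed TSV edge list.

    `has_header` is an assumption about the file layout; set it to False
    if the first line already contains an edge.
    """
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        if has_header:
            next(reader, None)  # skip the column names
        for row in reader:
            if len(row) != 2:
                continue  # ignore blank or malformed lines
            yield int(row[0]), int(row[1])


# Usage (path as distributed on Zenodo):
# for source, target in iter_edges("citations.tsv.xz"):
#     ...
```

Streaming trades speed for memory: only one decompressed row is held at a time, which is convenient for a first look at the data or for filtering the edge list down to a subset.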

The pid_* columns (included in publications_large, see Table 3) contain only the identifier schemes extracted from the pids field in the dump, which OpenAIRE populates with identifiers collected from authoritative sources. The instances field (a nested field of additional metadata, Table 5) also includes identifiers, but those were not collected. The exception is the primary_doi field in the publications file: if no DOI is found in the pids field, it is drawn from the identifiers field instead.
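The DOI fallback described above can be sketched as follows. This is a hedged illustration rather than the pipeline's code: the entry layout (`{"scheme": ..., "value": ...}`), the function names, and the rule that the first DOI found wins are all assumptions, and the fallback identifiers are passed in explicitly because their exact location in the record is not specified here.

```python
def extract_scheme(entries, scheme):
    """Return the values of all identifier entries matching `scheme`."""
    return [
        entry["value"]
        for entry in (entries or [])
        if (entry.get("scheme") or "").lower() == scheme and entry.get("value")
    ]


def primary_doi(pids, fallback_identifiers):
    """Prefer a DOI from `pids`; otherwise fall back to the other identifiers."""
    for entries in (pids, fallback_identifiers):
        dois = extract_scheme(entries, "doi")
        if dois:
            return dois[0]  # assumption: the first DOI found is used
    return None


# Toy records with hypothetical values.
from_pids = primary_doi(
    [{"scheme": "doi", "value": "10.1234/authoritative"}],
    [{"scheme": "doi", "value": "10.1234/instance"}],
)
from_fallback = primary_doi(
    [{"scheme": "pmid", "value": "12345"}],
    [{"scheme": "doi", "value": "10.1234/instance"}],
)
no_doi = primary_doi(None, [])
```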

Memory requirements

Below, we summarize the hardware requirements for the transformed dataset and for the original full OpenAIRE dataset dump. An overview of memory and disk requirements is shown in Table 2. To load citations.tsv.xz into memory as int32 using the Pandas library in Python, the following can be used:

import pandas as pd

# Read the compressed edge list, storing both ID columns as 32-bit integers.
df_refs = pd.read_csv(
    "citations.tsv.xz", sep="\t",
    dtype={"source_nodeId": "int32", "target_nodeId": "int32"}
)

Table 2

Comparison of storage size and memory requirements of the full OpenAIRE (OA) dataset (release 2025-12-01) and its compact versions. Memory usage corresponds to the amount of GB each dataset occupied when loaded into a Pandas dataframe (The pandas development team, 2026). Since the citations are loaded as int32, the memory size is much lower than the disk size.

DATASET                       SIZE ON DISK (GB)   MEMORY USAGE (GB)
citations.tsv                 39                  17
publications.tsv              6                   5
publications_large.tsv        187                 185
citations.parquet             8                   17
publications.parquet          2                   5
publications_large.parquet    68                  185
Full OA – edges               1820                NA
Full OA – nodes               700                 NA
The data formats (TSV and Parquet) are generic and can be loaded with any tool that supports them. When loading the Parquet files, int32 is used automatically.

import pandas as pd

df_cites = pd.read_parquet(
    "citations.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
)

The publications files benefit from being loaded using the PyArrow backend. This significantly reduces the memory usage of the doi field.

import pandas as pd

df_pubs = pd.read_parquet(
    "publications.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
)

The publications_large files are the most important to load efficiently, as roughly 170 GB of RAM can be saved (358 GB with default Pandas versus 186 GB optimized, see Table 3). Be aware that even efficient loading still requires ∼185 GB of RAM. For users interested in a subset of the provided columns, we recommend selecting the columns on load. The code following Table 3 shows this approach, which is also the one used for the “Pandas Opt” column in Table 3. In that example, the columns nodeId, title and pid_dois are selected; to load the full dataset, omit the columns argument.

Table 3

Memory size of columns with Python types and short descriptions. Arrow is the memory size loaded using PyArrow; Pandas is the size if the data is loaded straight into a default Pandas dataframe. The Pandas Optimized column uses PyArrow as a backend, and utilizes “categorical” on the container and language fields to further lower their footprint. The MAG IDs are Microsoft Academic Graph IDs.

COLUMN          PYTHON TYPE   DESCRIPTION               ARROW        PANDAS       PANDAS OPT
nodeId          int32         Node ID                   0.767 GB     0.767 GB     0.767 GB
openaireId      str           OpenAIRE unique ID        9.585 GB     19.746 GB    9.586 GB
title           str           Paper title               16.486 GB    29.707 GB    16.486 GB
authors         list[str]     List of authors           11.037 GB    23.005 GB    11.038 GB
description     str           Main text/abstract        131.248 GB   193.583 GB   131.248 GB
date            datetime      Publication date          0.767 GB     7.587 GB     0.768 GB
container       str           Journal/conference name   5.471 GB     13.953 GB    2.181 GB
citations       int           Citation count            1.558 GB     1.534 GB     1.558 GB
language        str           Language                  1.394 GB     11.555 GB    0.197 GB
pid_dois        list[str]     DOI identifiers           5.639 GB     19.453 GB    5.639 GB
pid_mag_ids     list[str]     MAG IDs                   2.004 GB     12.748 GB    2.005 GB
pid_pmids       list[str]     PubMed IDs                1.202 GB     7.948 GB     1.202 GB
pid_handles     list[str]     Persistent handles        1.149 GB     6.134 GB     1.149 GB
pid_pmcs        list[str]     PubMed Central IDs        0.921 GB     5.480 GB     0.921 GB
pid_arxiv_ids   list[str]     ArXiv IDs                 0.885 GB     4.856 GB     0.886 GB
TOTAL                                                   190.113 GB   358.055 GB   185.631 GB

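import pandas as pd

# Load only the columns needed; omit `columns` to load everything.
df_large = pd.read_parquet(
    "publications_large.parquet",
    columns=["nodeId", "title", "pid_dois"],
    engine="pyarrow",
    dtype_backend="pyarrow",
)

# Store the low-cardinality columns as categoricals, if they were loaded.
for col in ("language", "container"):
    if col in df_large.columns:
        df_large[col] = df_large[col].astype("category")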

3 Dataset Description

Repository name

Zenodo

Object name

Compact representation of the OpenAIRE citation graph

Format names and versions

TSV (Tab Separated Values) and Apache Parquet

Creation dates

Based on the 2025-12-01 OpenAIRE dump.

Dataset creators

Joakim Skarding wrote the pipeline that processed the dump; Pavel Sanda supervised the work. The OpenAIRE initiative, with which this project is unaffiliated, created the OpenAIRE dump.

Language

English

License

CC BY 4.0

Publication date

2026-02-12

4 Reuse Potential

Citation networks are routinely used in a wide range of scientific fields. These include, among others, research on historical trends in science (Drivas, 2024; Frank et al., 2019; González-Márquez et al., 2024; Kitajima & Okamura, 2025), the sociology of scientific knowledge (Carradore, 2022; Crothers et al., 2020), network science (Costa & Frigori, 2024; Xiao et al., 2025), and training sets for graph neural networks (Kipf, 2016; Leskovec & Sosič, 2016). Since the dataset is an evolving network, it can also be used to train temporal models, such as dynamic graph neural networks (Skarding et al., 2021).

Compared to working directly with the original OpenAIRE data dump, this dataset significantly reduces the time and effort required to begin analysis. The data is provided as flat TSV and Parquet files, removing the need to parse and process the raw JSON. The citation graph is already structured as an edge list with integer node IDs, the format expected by most graph libraries, meaning graph algorithms (e.g., community detection (Fortunato, 2010) and centrality measures (Bloch et al., 2023)) can be applied directly without further transformation. The reduced file size allows the full dataset to be downloaded and explored locally. Together, these properties make the dataset a convenient starting point for bibliometric studies, network analysis, and graph learning research.
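To illustrate why contiguous integer node IDs are convenient, the sketch below builds a compressed sparse row (CSR) adjacency structure from an edge list using only the standard library, then derives each node's reference list and citation count. It is a toy example with hypothetical function names; for the full graph, a dedicated graph library or array library would be the practical choice.

```python
from array import array


def build_csr(edges, n_nodes):
    """Build CSR arrays (offsets, targets) from (source, target) pairs."""
    offsets = array("q", [0] * (n_nodes + 1))
    for source, _ in edges:
        offsets[source + 1] += 1          # count outgoing edges per node
    for i in range(n_nodes):
        offsets[i + 1] += offsets[i]      # prefix sum gives row starts
    targets = array("i", [0] * len(edges))  # "i" is typically a 32-bit int
    cursor = array("q", offsets[:n_nodes])  # next free slot for each row
    for source, target in edges:
        targets[cursor[source]] = target
        cursor[source] += 1
    return offsets, targets


def references_of(node, offsets, targets):
    """Return the nodes cited by `node`."""
    return list(targets[offsets[node]:offsets[node + 1]])


def in_degrees(targets, n_nodes):
    """Count how often each node is cited."""
    counts = array("i", [0] * n_nodes)
    for target in targets:
        counts[target] += 1
    return counts


# Toy graph: node 0 cites 1 and 2, node 2 cites 1.
toy_edges = [(0, 1), (0, 2), (2, 1)]
offsets, targets = build_csr(toy_edges, n_nodes=3)
cited_by_zero = references_of(0, offsets, targets)
citation_counts = list(in_degrees(targets, 3))
```

Because the nodeIDs run from zero without gaps, they can index directly into these arrays; no lookup table between identifiers and array positions is needed.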

We provide the source code used to produce the data, allowing researchers to rerun the pipeline when new versions of the OpenAIRE graph are released, and to build their own customized distilled dataset with the fields they need.

Additional Files

The additional file for this article can be found as follows:

Supplementary file

Supplementary information. DOI: https://doi.org/10.5334/johd.520.s1

DOI: https://doi.org/10.5334/johd.520 | Journal eISSN: 2059-481X
Language: English
Page range: 63 - 63
Submitted on: Feb 13, 2026
Accepted on: Apr 10, 2026
Published on: Apr 30, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Joakim Skarding, Pavel Sanda, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.