Making the Complete OpenAIRE Citation Graph Easily Accessible Through Compact Data Representation

Joakim Skarding; Pavel Sanda

doi:10.5334/johd.520

1 Overview

Repository location

We use two repositories: Zenodo to share our data and a fixed version of the processing pipeline, and a Git repository on Codeberg to facilitate sharing and further development of the processing pipeline.

Data repository

https://zenodo.org/records/19207803

Code repository

https://codeberg.org/Zmeos/OpenAIRE-citation-extraction

Context

The OpenAIRE graph (Manghi et al., 2025; Rettberg & Schmidt, 2012) is a large knowledge graph storing several kinds of research data. In this work we focus on extracting the citations and publications from that dataset, and distilling the information to make it more accessible.

There are several other citation graphs available: OpenAlex (Priem et al., 2022), Open Research Knowledge Graph (Jaradeh et al., 2019), Crossref (Hendricks et al., 2020) and OpenCitations (Peroni & Shotton, 2020). These datasets differ in many ways; some are only citation graphs, while others, like OpenAlex and OpenAIRE, are research knowledge graphs that record much more than just citations and publications. The datasets also differ in their publication coverage (Culbert et al., 2025; Martín-Martín et al., 2021) and in how the works are indexed, e.g., how they determine taxonomy (Ciuciu-Kiss & Garijo, 2024b). Details of publication coverage between OpenAIRE and other open datasets are not well explored. A subset of OpenAIRE has been compared to OpenAlex (Ciuciu-Kiss & Garijo, 2024a), but to the best of our knowledge, no coverage analysis to the extent of previous works such as Martín-Martín et al. (2021) or Culbert et al. (2025) has taken place. It is not sufficient to compare publication counts, since these datasets aggregate publications from multiple sources and a large node count might simply be due to poor de-duplication.

The OpenAIRE initiative offers several APIs, a search engine, cloud access through Google BigQuery, and a complete dataset dump (Manghi et al., 2025). The API and search engine serve as accessible avenues for any researcher; however, they are limited in the scope of data that can be efficiently accessed. At the time of writing, a scan of the complete graph is outside the free tier of BigQuery. If a researcher is interested in large portions of the graph, or the graph as a whole, then the dump is more suitable.

Processing the whole graph dump requires programming skills and a large amount of memory, storage and computational power, resources not readily accessible to many scholars in the humanities (Dederke et al., 2024). Accessing the raw data also requires a solid understanding of the OpenAIRE JSON schema. To fill this gap, we pre-process the dump, distill it into a more manageable size, and distribute it as simply structured plain text and Parquet files, making the whole graph more accessible. Similar work has also been done for OpenAlex (Caetano Machado Lopes & Chacko, 2024).

2 Method

The OpenAIRE graph consists of many entities and relations, stored across many files in the dump. To extract the citation network, we only need the publication files (nodes) and the relation files (citations/edges). The publication files contain a lot of information about publications, only some of which is relevant to most researchers. The relation files include many types of relations; for this dataset we are only interested in citations (the type “Cites”). The following list shows the generic steps we followed to extract the citation network and minimize memory footprint. The list is organized by Python file and shows the file responsible for each step (the full code is available in the Codeberg repository).

Steps

0. step0_download_data_and_extract.py
Download the complete OpenAIRE graph dump (Manghi et al., 2025), including all publication and relation files. When calling the pipeline, this is an optional step.
1. step1_extract_raw.py
Partially extract the compressed dump files.
2. step2_publications.py
- (a) Filter records, obtaining publication-type only.
- (b) Flatten nested JSON structures to obtain a tabular representation.
- (c) Generate new, more memory-efficient (int32) nodeIDs, and build a hashtable for translating between OpenAIRE IDs and our nodeIDs.
- (d) For the TSV files, replace any tabs or newlines in text fields with spaces.
- (e) Export TSV and Parquet publication files.
3. step3_citations.py
- (a) Process relation files using PySpark (Zaharia et al., 2016), retaining only relations of type Cites and extracting source and target identifiers.
- (b) Use the hashtable produced by step2_publications.py to replace OpenAIRE IDs, so that all citation relations reference the new identifiers.
- (c) Export TSV and Parquet citation files.
4. step4_distribution.py
Compress the TSV files with xz.

This yields a distilled dataset in which the relations are stored efficiently as pairs of 32-bit integers.
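The identifier remapping in steps 2(c) and 3(b) can be illustrated with a minimal sketch. This is not the pipeline's code: the function names build_id_map and remap_edges and the example IDs are hypothetical, a plain Python dict stands in for the hashtable, and dropping edges with unknown endpoints is an assumption of this sketch rather than documented pipeline behaviour.

```python
from array import array


def build_id_map(openaire_ids):
    """Assign consecutive integers (0, 1, 2, ...) to IDs in first-seen order."""
    id_map = {}
    for oid in openaire_ids:
        if oid not in id_map:
            id_map[oid] = len(id_map)
    return id_map


def remap_edges(edges, id_map):
    """Translate (source, target) ID pairs into two compact integer arrays."""
    # array("i") stores signed C ints, which are 4 bytes on common platforms.
    sources, targets = array("i"), array("i")
    for src, dst in edges:
        # Assumption of this sketch: skip edges with an unknown endpoint.
        if src in id_map and dst in id_map:
            sources.append(id_map[src])
            targets.append(id_map[dst])
    return sources, targets


# Made-up stand-ins for OpenAIRE IDs; "id-x" is not a known publication.
publication_ids = ["id-a", "id-b", "id-c"]
cites = [("id-a", "id-b"), ("id-c", "id-a"), ("id-a", "id-x")]

id_map = build_id_map(publication_ids)
sources, targets = remap_edges(cites, id_map)
print(id_map)
print(list(zip(sources, targets)))
```

Because the new identifiers are assigned consecutively from zero, each citation needs only two small integers instead of two long ID strings.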

Quality control

Several validation scripts are run after processing to control for mistakes in the processing. The scripts check whether any entries from the original dump were lost and report any missing values.

full_id_coverage.py Verifies that every publication in the raw source data is present in the final output.
distinct_constraints.py Checks that there are no duplicate publications or duplicate citation edges in the output.
format_checks.py Verifies that publication node IDs form a complete, gap-free sequence starting from zero.

Additionally, the file run_all_validations.py runs all validation scripts, collects their results, and writes a consolidated summary report. The entire validation pipeline runs automatically after the processing pipeline completes. Output from the validation scripts is shown in Table 1.

Table 1

Combined validation script output. This is the automated output log of the validation scripts.

SCRIPT	METRIC	VALUE
full_id_coverage (pass)
	raw_count	205841448
	parquet_count	205841448
	missing_count	0
	extra_count	0
distinct_constraints (pass)
	publications total	205841448
	publications distinct	205841448
	publications duplicates	0
	citations total	2184347684
	citations distinct	2184347684
	citations duplicates	0
	citations null_rows	0
format_checks (pass)
	nodeid gaps	0
	nodeid contiguous	true
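The three kinds of checks reported in Table 1 can be sketched on toy data as follows. The function names are hypothetical and the logic is a simplified in-memory illustration, not the code of the validation scripts, which must handle far larger inputs.

```python
def coverage(raw_ids, output_ids):
    """Return (missing_count, extra_count) between raw and output ID sets."""
    raw, out = set(raw_ids), set(output_ids)
    return len(raw - out), len(out - raw)


def count_duplicates(rows):
    """Return how many rows repeat an earlier row."""
    rows = list(rows)
    return len(rows) - len(set(rows))


def is_contiguous(node_ids):
    """Return True if the IDs are exactly 0, 1, ..., n-1 (no gaps, no repeats)."""
    ids = sorted(node_ids)
    return ids == list(range(len(ids)))


# Toy inputs, made up for illustration.
missing, extra = coverage(["a", "b", "c"], ["a", "b", "c"])
edge_duplicates = count_duplicates([(0, 1), (2, 0), (0, 1)])
gap_free = is_contiguous([2, 0, 1])
print(missing, extra, edge_duplicates, gap_free)
```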

Output

The processing pipeline takes the full OpenAIRE dump as input and transforms it into the TSV (tab-separated values) and equivalent Parquet files described below. TSV is a variant of the more common CSV format. We chose TSV because several of the text fields in publications_large.tsv routinely contain commas, and our attempts at quoting them turned out to be error-prone. Parquet is a binary columnar format designed for high-performance data processing.

citations.tsv.xz and citations.parquet – a simple edge list of the graph (all the citations). Each row has a simple form of two connected nodeIDs, e.g.: 159486578 118392581
publications.tsv.xz and publications.parquet – all graph nodes (publications). These files contain only the nodeID and, when available, the DOI. Each row has a simple form, e.g.: 14209 10.3931/e-rara-45685
publications_large.tsv.xz and publications_large.parquet – includes the same number of nodes as the publication files, but includes additional fields, e.g. title, authors, description, etc. For a full overview of fields, see Table 3.

pipeline.tar.xz – the processing pipeline that transforms the full OpenAIRE dump into the dataset files above. Full details are documented in the included README.md. The source code is also available online at https://codeberg.org/Zmeos/OpenAIRE-citation-extraction.

The pid_* columns (included in publications_large, see Table 3) contain only the schemes extracted from the pids field in the dump, which OpenAIRE populates with identifiers collected from authoritative sources. The instances field (a nested field of additional metadata, Table 5) also includes identifiers, but those were not collected. The exception is the primary_doi field in the publications file, where the DOI is drawn from the identifiers field if it is not found in the pids field.
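As a rough illustration of how per-scheme pid_* columns can be derived from a nested pids field, consider the sketch below. The record layout (a list of objects with scheme and value keys), the scheme spellings, and the function name extract_pid_columns are assumptions made for this example; the OpenAIRE schema documentation and the pipeline source remain the authoritative references.

```python
# Assumed mapping from pid scheme to output column; spellings are hypothetical.
SCHEME_TO_COLUMN = {
    "doi": "pid_dois",
    "mag_id": "pid_mag_ids",
    "pmid": "pid_pmids",
    "handle": "pid_handles",
    "pmc": "pid_pmcs",
    "arXiv": "pid_arxiv_ids",
}


def extract_pid_columns(record):
    """Group the values in record["pids"] into one list per known scheme."""
    columns = {name: [] for name in SCHEME_TO_COLUMN.values()}
    for pid in record.get("pids") or []:
        column = SCHEME_TO_COLUMN.get(pid.get("scheme"))
        value = pid.get("value")
        # Unknown schemes and empty values are ignored in this sketch.
        if column is not None and value:
            columns[column].append(value)
    return columns


# Toy record; the pmid and isbn values are placeholders, not real identifiers.
record = {
    "pids": [
        {"scheme": "doi", "value": "10.3931/e-rara-45685"},
        {"scheme": "pmid", "value": "00000000"},
        {"scheme": "isbn", "value": "placeholder"},
    ]
}
pid_columns = extract_pid_columns(record)
print(pid_columns["pid_dois"], pid_columns["pid_pmids"])
```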

Memory requirements

Below, we summarize the hardware requirements for the transformed dataset and the original OpenAIRE full dataset dump. An overview of memory and disk requirements is shown in Table 2. To load the citations TSV file into memory as int32 using the Pandas library in Python, the following can be used:

    import pandas as pd

    # Read both ID columns as int32 instead of the default int64.
    df_refs = pd.read_csv(
        "citations.tsv.xz",
        sep="\t",
        dtype={"source_nodeId": "int32", "target_nodeId": "int32"},
    )
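When even the int32 representation does not fit in RAM, the compressed TSV can be streamed row by row using only the Python standard library. The sketch below is one possible approach, not part of the published pipeline; the function name iter_citations is hypothetical, and it assumes the file starts with a header row and has exactly two columns.

```python
import csv
import lzma


def iter_citations(path):
    """Yield (source, target) integer pairs from an xz-compressed TSV file."""
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row (assumed to be present)
        for source, target in reader:
            yield int(source), int(target)
```

For example, sum(1 for _, target in iter_citations("citations.tsv.xz") if target == 118392581) would count the citations pointing to one node while holding only a single row in memory at a time.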

Table 2

Comparison of storage size and memory requirements of the full OpenAIRE (OA) dataset (release 2025-12-01) and its compact versions. Memory usage is the size in GB that each dataset occupies when loaded into a Pandas dataframe (The pandas development team, 2026). Since the citations are loaded as int32, their memory size is much lower than the size of the TSV file on disk.

DATASET	SIZE ON DISK (GB)	MEMORY USAGE (GB)
citations.tsv	39	17
publications.tsv	6	5
publications_large.tsv	187	185
citations.parquet	8	17
publications.parquet	2	5
publications_large.parquet	68	185
Full OA – edges	1820	NA
Full OA – nodes	700	NA
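The memory figure for the citations can be sanity-checked with simple arithmetic from the counts in Table 1: every citation is stored as two 4-byte integers. The snippet below is a back-of-the-envelope estimate that ignores any dataframe overhead; it gives roughly 17.5 × 10^9 bytes for the edge list, which is in line with the 17 GB reported in Table 2.

```python
N_CITATIONS = 2_184_347_684   # "citations total" in Table 1
N_PUBLICATIONS = 205_841_448  # "publications total" in Table 1
INT32_BYTES = 4

# Two int32 columns (source and target) per citation.
edge_bytes = N_CITATIONS * 2 * INT32_BYTES
# One int32 nodeId per publication.
node_id_bytes = N_PUBLICATIONS * INT32_BYTES

print(f"edge list:     {edge_bytes / 1e9:.2f} GB ({edge_bytes / 2**30:.2f} GiB)")
print(f"nodeId column: {node_id_bytes / 1e9:.3f} GB ({node_id_bytes / 2**30:.3f} GiB)")

# The largest node ID must fit into a signed 32-bit integer.
fits_int32 = (N_PUBLICATIONS - 1) <= 2**31 - 1
print("node IDs fit in int32:", fits_int32)
```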

The data formats (TSV and Parquet) are generic and can be loaded using any tool that supports them. When loading the Parquet files, int32 will be used automatically.

    df_cites = pd.read_parquet(
        "citations.parquet",
        engine="pyarrow",
        dtype_backend="pyarrow",
    )

The publications files benefit from being loaded using the PyArrow backend, which significantly reduces the memory usage of the DOI field.

    df_pubs = pd.read_parquet(
        "publications.parquet",
        engine="pyarrow",
        dtype_backend="pyarrow",
    )

The publications_large files are the most important to load efficiently, as ∼200 GB of RAM can be saved. Be aware that even efficient loading still requires ∼185 GB of RAM. For users interested in a subset of the provided columns, we recommend selecting the columns on load. The code following Table 3 shows this efficient loading, which is the approach used for the “Pandas Opt” column in Table 3. The example selects the columns nodeId, title and pid_dois; to load the full dataset, omit the columns argument.

Table 3

Memory size of columns with Python types and short descriptions. Arrow is the memory size loaded using PyArrow; Pandas is the size if the data is loaded straight into a default Pandas dataframe. The Pandas Optimized column uses PyArrow as a backend, and utilizes “categorical” on the container and language fields to further lower their footprint. The MAG IDs are Microsoft Academic Graph IDs.

COLUMN	PYTHON TYPE	DESCRIPTION	ARROW	PANDAS	PANDAS OPT
nodeId	int32	Node ID	0.767 GB	0.767 GB	0.767 GB
openaireId	str	OpenAIRE unique ID	9.585 GB	19.746 GB	9.586 GB
title	str	Paper title	16.486 GB	29.707 GB	16.486 GB
authors	list[str]	List of authors	11.037 GB	23.005 GB	11.038 GB
description	str	Main text/abstract	131.248 GB	193.583 GB	131.248 GB
date	datetime	Publication date	0.767 GB	7.587 GB	0.768 GB
container	str	Journal/conference name	5.471 GB	13.953 GB	2.181 GB
citations	int	Citation count	1.558 GB	1.534 GB	1.558 GB
language	str	Language	1.394 GB	11.555 GB	0.197 GB
pid_dois	list[str]	DOI identifiers	5.639 GB	19.453 GB	5.639 GB
pid_mag_ids	list[str]	MAG IDs	2.004 GB	12.748 GB	2.005 GB
pid_pmids	list[str]	PubMed IDs	1.202 GB	7.948 GB	1.202 GB
pid_handles	list[str]	Persistent handles	1.149 GB	6.134 GB	1.149 GB
pid_pmcs	list[str]	PubMed Central IDs	0.921 GB	5.480 GB	0.921 GB
pid_arxiv_ids	list[str]	ArXiv IDs	0.885 GB	4.856 GB	0.886 GB
TOTAL			190.113 GB	358.055 GB	185.631 GB

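    df_large = pd.read_parquet(
        "publications_large.parquet",
        columns=["nodeId", "title", "pid_dois"],  # omit to load every column
        engine="pyarrow",
        dtype_backend="pyarrow",
    )

    # Store low-cardinality columns as categoricals, but only if they were
    # loaded; with the three columns selected above, this loop changes nothing.
    for col in ("language", "container"):
        if col in df_large.columns:
            df_large[col] = df_large[col].astype("category")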

3 Dataset Description

Repository name

Zenodo

Object name

Compact representation of the OpenAIRE citation graph

Format names and versions

TSV (Tab Separated Values) and Apache Parquet

Creation dates

Based on the 2025-12-01 OpenAIRE dump.

Dataset creators

Joakim Skarding wrote the pipeline that processed the dump, Pavel Sanda supervised the work. The OpenAIRE initiative, with which this project is unaffiliated, created the OpenAIRE dump.

Language

English

License

CC BY 4.0

Publication date

2026-02-12

4 Reuse Potential

Citation networks are routinely used in a wide range of scientific fields. This includes among others: research in historical trends of science (Drivas, 2024; Frank et al., 2019; González-Márquez et al., 2024; Kitajima & Okamura, 2025), sociology of scientific knowledge (Carradore, 2022; Crothers et al., 2020), network science (Costa & Frigori, 2024; Xiao et al., 2025), and as training sets for graph neural networks (Kipf, 2016; Leskovec & Sosič, 2016). Since the dataset is an evolving network, it can also be used to train temporal models, such as dynamic graph neural networks (Skarding et al., 2021).

Compared to working directly with the original OpenAIRE data dump, this dataset significantly reduces the time and effort required to begin analysis. The data is provided as flat TSV and Parquet files, removing the need to parse and process the raw JSON. The citation graph is already structured as an edge list with integer node IDs, the format expected by most graph libraries, meaning graph algorithms (e.g., community detection (Fortunato, 2010) and centrality measures (Bloch et al., 2023)) can be applied directly without further transformation. The reduced file size allows the full dataset to be downloaded and explored locally. Together, these properties make the dataset a convenient starting point for bibliometric studies, network analysis, and graph learning research.
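As a small illustration of why gap-free integer node IDs are convenient, the sketch below computes the in-degree (number of incoming citations) of every node using a plain list indexed by nodeID. The function name in_degrees and the toy edges are hypothetical; for the full graph an array library or a dedicated graph library would be the more practical choice.

```python
def in_degrees(n_nodes, edges):
    """Count incoming citations per node; node IDs must be 0 .. n_nodes - 1."""
    counts = [0] * n_nodes
    for _source, target in edges:
        counts[target] += 1
    return counts


# Toy edge list of (citing, cited) pairs, made up for illustration.
toy_edges = [(0, 1), (2, 1), (3, 0)]
toy_counts = in_degrees(4, toy_edges)
most_cited = max(range(len(toy_counts)), key=toy_counts.__getitem__)
print(toy_counts, most_cited)
```

Because the IDs start at zero and have no gaps, the node ID doubles as the list index and no lookup table is required.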

We provide the source code used to produce the data, allowing researchers to rerun the pipeline when new versions of the OpenAIRE graph are released, as well as to build their own customized distilled datasets with the fields they need.

Additional Files

The additional file for this article can be found as follows:

Supplementary file

Supplementary information. DOI: https://doi.org/10.5334/johd.520.s1
