Table 1
Combined output log of the automated validation scripts. The status of each script is given in parentheses after its name.
| SCRIPT | METRIC | VALUE |
|---|---|---|
| full_id_coverage (pass) | | |
| | raw_count | 205841448 |
| | parquet_count | 205841448 |
| | missing_count | 0 |
| | extra_count | 0 |
| distinct_constraints (pass) | | |
| | publications total | 205841448 |
| | publications distinct | 205841448 |
| | publications duplicates | 0 |
| | citations total | 2184347684 |
| | citations distinct | 2184347684 |
| | citations duplicates | 0 |
| | citations null_rows | 0 |
| format_checks (pass) | | |
| | nodeid gaps | 0 |
| | nodeid contiguous | true |
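
The checks reported in Table 1 can be sketched in a few lines of Python. This is only an illustration of what each metric plausibly measures, not the validation scripts themselves. The function names `id_coverage`, `distinct_check`, and `nodeid_gaps` are hypothetical, and the set-based approach assumes that all IDs fit in memory, which would need a more economical strategy at the scale of the full dataset.

```python
from typing import Hashable, Iterable


def id_coverage(raw_ids: Iterable[Hashable], parquet_ids: Iterable[Hashable]) -> dict:
    """Compare the IDs in the raw dump with the IDs in the converted file."""
    raw, converted = set(raw_ids), set(parquet_ids)
    return {
        "raw_count": len(raw),
        "parquet_count": len(converted),
        "missing_count": len(raw - converted),  # in raw, absent after conversion
        "extra_count": len(converted - raw),    # after conversion, absent in raw
    }


def distinct_check(rows: Iterable[Hashable]) -> dict:
    """Count total rows, distinct rows, and the surplus of duplicated rows."""
    rows = list(rows)
    distinct = len(set(rows))
    return {"total": len(rows), "distinct": distinct, "duplicates": len(rows) - distinct}


def nodeid_gaps(node_ids: Iterable[int]) -> dict:
    """Count integers missing between the smallest and largest node ID."""
    ids = set(node_ids)
    if not ids:
        return {"gaps": 0, "contiguous": True}
    gaps = (max(ids) - min(ids) + 1) - len(ids)
    return {"gaps": gaps, "contiguous": gaps == 0}
```

A passing run in the sense of Table 1 corresponds to zero missing, extra, and duplicate entries and a contiguous node ID range.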
Table 2
Comparison of storage size and memory requirements of the full OpenAIRE (OA) dataset (release 2025-12-01) and its compact versions. Memory usage is the size in GB of each dataset once loaded into a pandas DataFrame (The pandas development team, 2026). Because the citations are loaded as int32 values, citations.tsv occupies much less space in memory than on disk.
| DATASET | SIZE ON DISK (GB) | MEMORY USAGE (GB) |
|---|---|---|
| citations.tsv | 39 | 17 |
| publications.tsv | 6 | 5 |
| publications_large.tsv | 187 | 185 |
| citations.parquet | 8 | 17 |
| publications.parquet | 2 | 5 |
| publications_large.parquet | 68 | 185 |
| Full OA – edges | 1820 | NA |
| Full OA – nodes | 700 | NA |
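
The 17 GB reported for the citations in memory is consistent with a simple back-of-the-envelope estimate. The sketch below assumes that each citation is stored as one pair of int32 values (citing and cited node ID) with no index overhead. The two-column layout is an assumption, while the edge count is the one reported in Table 1.

```python
# Edge count reported by the validation scripts (Table 1).
N_CITATIONS = 2_184_347_684

# Assumption: one edge = two int32 node IDs (citing, cited), 4 bytes each.
COLUMNS_PER_EDGE = 2
BYTES_PER_INT32 = 4


def edge_list_bytes(n_edges: int) -> int:
    """Raw size of an int32 edge list, ignoring any container overhead."""
    return n_edges * COLUMNS_PER_EDGE * BYTES_PER_INT32


total_bytes = edge_list_bytes(N_CITATIONS)
print(f"{total_bytes:,} bytes")                 # 17,474,781,472 bytes
print(f"{total_bytes / 1e9:.2f} GB (decimal)")  # 17.47 GB (decimal)
```

Under these assumptions the estimate lands at about 17.5 GB, in line with the rounded value in the table.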
Table 3
Memory size of each column, with its Python type and a short description. Arrow is the memory size when the data is loaded using PyArrow; Pandas is the size when the data is loaded directly into a default pandas DataFrame. Pandas Opt (optimized) uses PyArrow as the backend and stores the container and language fields as categoricals to further reduce their footprint. The MAG IDs are Microsoft Academic Graph IDs.
| COLUMN | PYTHON TYPE | DESCRIPTION | ARROW | PANDAS | PANDAS OPT |
|---|---|---|---|---|---|
| nodeId | int32 | Node ID | 0.767 GB | 0.767 GB | 0.767 GB |
| openaireId | str | OpenAIRE unique ID | 9.585 GB | 19.746 GB | 9.586 GB |
| title | str | Paper title | 16.486 GB | 29.707 GB | 16.486 GB |
| authors | list[str] | List of authors | 11.037 GB | 23.005 GB | 11.038 GB |
| description | str | Main text/abstract | 131.248 GB | 193.583 GB | 131.248 GB |
| date | datetime | Publication date | 0.767 GB | 7.587 GB | 0.768 GB |
| container | str | Journal/conference name | 5.471 GB | 13.953 GB | 2.181 GB |
| citations | int | Citation count | 1.558 GB | 1.534 GB | 1.558 GB |
| language | str | Language | 1.394 GB | 11.555 GB | 0.197 GB |
| pid_dois | list[str] | DOI identifiers | 5.639 GB | 19.453 GB | 5.639 GB |
| pid_mag_ids | list[str] | MAG IDs | 2.004 GB | 12.748 GB | 2.005 GB |
| pid_pmids | list[str] | PubMed IDs | 1.202 GB | 7.948 GB | 1.202 GB |
| pid_handles | list[str] | Persistent handles | 1.149 GB | 6.134 GB | 1.149 GB |
| pid_pmcs | list[str] | PubMed Central IDs | 0.921 GB | 5.480 GB | 0.921 GB |
| pid_arxiv_ids | list[str] | ArXiv IDs | 0.885 GB | 4.856 GB | 0.886 GB |
| TOTAL | | | 190.113 GB | 358.055 GB | 185.631 GB |
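
The effect of the categorical encoding on low-cardinality columns such as language can be reproduced on synthetic data. The following is a minimal sketch, not the loading code used for the table. The helper `to_categorical` and the toy values are hypothetical, and only the column names container and language come from Table 3. On a real file, one would presumably first obtain a PyArrow-backed frame, for example with `pandas.read_parquet(..., dtype_backend="pyarrow")`, which is available in pandas 2.0 and later.

```python
import pandas as pd


def to_categorical(df: pd.DataFrame, columns=("container", "language")) -> pd.DataFrame:
    """Return a copy of df with the given columns stored as categoricals."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


# Toy data: many rows, few distinct values (hypothetical values).
n = 10_000
df = pd.DataFrame({
    "container": ["Journal A", "Journal B", "Conference C", "Journal A"] * (n // 4),
    "language": ["eng", "fra", "deu", "eng"] * (n // 4),
})
optimized = to_categorical(df)

# deep=True includes the memory held by the string values themselves.
before = df.memory_usage(deep=True).sum()
after = optimized.memory_usage(deep=True).sum()
print(f"default: {before:,} bytes, categorical: {after:,} bytes")
```

A categorical column stores each distinct string once and replaces the per-row values with small integer codes, which is why the saving grows with the number of rows per distinct value.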
