(1) Context and motivation
In digital infrastructures, multilingualism is often presented as a sign of inclusivity. Repositories and aggregators may describe themselves as “multilingual” when they host records in multiple languages or provide translation interfaces. However, these claims are constrained by distribution mechanisms that determine how languages are encoded, how visible they are, and under what conditions they circulate. In this sense, multilingualism as a policy does not necessarily guarantee multilingualism as a practice, a challenge compounded when digital infrastructures theoretically contain many languages but only one, predominantly English, functions as the default descriptive or procedural language. In this context, the aim of the present study is not to assess multilingual provision in absolute terms, but to benchmark how infrastructures distribute linguistic visibility in practice.
Linguistic asymmetry (Battaner & Spence, in preparation) refers to the unequal distribution of access, visibility, and descriptive completeness among languages as a structural property of digital infrastructures. It is a structural rather than a linguistic condition, shaped by how diversity is organised, normalised, and exposed. Although the concept can be applied to many digital systems, from search engines to Large Language Models, in this paper we will focus on the study of linguistic asymmetry in digital research infrastructures where asymmetry can be observed and measured through metadata practices. Here, we will treat linguistic asymmetry as a property of infrastructures to reveal how technical and institutional design decisions in/for these infrastructures consolidate a digital monolingualism (Spence, 2021) that can have potentially severe epistemic consequences.
The study therefore asks the following question: how can linguistic asymmetry be operationalised as a reproducible, metadata-based benchmark that supports comparative diagnosis of linguistic visibility, descriptive completeness, and access conditions across infrastructures, without treating heterogeneous systems as directly rankable? We address this question by introducing the Linguistic Asymmetry Index (LAI), a prototype benchmarking instrument computed entirely from publicly available metadata. The LAI combines five components that correspond to distinct mechanisms through which asymmetry becomes visible at the infrastructural level: language representation, English anchoring, metadata completeness disparity, institutional concentration, and access inequality.
Benchmarking linguistic inequality has tended to proceed either by assessing the “digital readiness” of languages, exemplified by the Digital Language Equality (DLE) metric (Gaspari et al., 2023), or by treating multilingualism as a repository-design requirement grounded in metadata and encoding practices (COAR Task Force, 2023; Dony et al., 2024). LAI shifts the unit of analysis from languages to research infrastructures, using open metadata to diagnose how schemas and aggregation logics allocate linguistic visibility. It also draws on metadata-quality measurement, such as the completeness analyses in Europeana (Stiller & Király, 2017; Candela et al., 2020), reframing completeness as a cross-language distributional issue. Informed by scholarship on quantification and classification (Bowker & Star, 1999; Espeland & Stevens, 2008; Beer, 2016), LAI is framed as a transparent diagnostic benchmark rather than a compliance score.
As we shall see, where digital infrastructures produce or transmit linguistic asymmetry it is not through deliberate policy, but through design choices that determine how information can be described and retrieved. Each layer of technical decision-making—from defining metadata fields to harmonising vocabularies and automating translation—sets limits on which languages can be made visible and to what extent (and, consequently, which digital objects are made visible and how). As infrastructures expand and become interconnected, these decisions accumulate and consolidate into structural patterns: multilingualism may then persist at the content level, but asymmetry takes hold at the level of description and access and, in this way, calls into question the linguistic equity of such digital systems.
Within European research policy, for example, multilingualism has been promoted as a key ethical, heritage, cultural, and technical commitment through inclusion and openness. However, infrastructures may comply with these commitments in principle,1 while inadvertently reproducing inequality in practice. Language is often treated as an attribute field to be completed, rather than a dimension that shapes governance. The resulting gap between principle and application defines the space in which linguistic asymmetry operates, between the rhetoric and material effects of research infrastructures; the LAI aims to address precisely this gap.
The consequences of linguistic asymmetry extend beyond the technical sphere. Infrastructures do not merely store or distribute data but organise the epistemic conditions in which knowledge is produced and retrieved, especially today in the AI/Retrieval Augmented Generation (RAG) systems era. When certain languages remain systematically underrepresented digitally (for example in relation to their actual presence in source materials), the humanities and social sciences inherit a distorted picture of their own discourse, as some languages function as gateways to visibility, while others become epistemically peripheral. The LAI prototype therefore functions as a tool for diagnosing epistemic injustice (Fricker, 2007; Dotson, 2014) by revealing how linguistic hierarchies are materialised in the systems that increasingly mediate digital scholarship.
As noted already, equity cannot be achieved simply by adding more languages, but rather by redistributing visibility among them, and so here we use the lens of asymmetry to explore possible parameters for defining equitable multilingual infrastructure. In its current prototypical form, LAI is still an experimental instrument, but we contend that it establishes the foundation for an evidence-based framework which may provide a helpful complement to qualitative critiques of language asymmetry.
As a benchmarking instrument, the LAI enables infrastructures, curators, and researchers to observe how linguistic equity is configured within and across systems, using shared metrics applied to heterogeneous contexts. Rather than producing rankings or compliance scores, it supports comparative diagnosis and longitudinal monitoring together with methodological transparency. In this sense, the LAI is intended to make linguistic asymmetry empirically traceable and open to discussion as a dimension of infrastructure governance.
The following sections outline how the LAI translates this proposal into empirical analysis. Section 2 summarises the repository structure supporting reproducibility. Section 3 describes the methodological design of the index and its five components. Section 4 applies the LAI to four major European infrastructures (CLARIN ERIC, Europeana, EUDAT/B2FIND, and OpenAIRE),2 while Section 5 discusses the broader implications of benchmarking linguistic asymmetry within digital research environments.
(2) Record description
Repository location
Repository name
Zenodo
Object name
Linguistic Asymmetry Index (LAI) Benchmarking Multilingual Research Infrastructures. Dataset and Reproducible Workflow (v.1.0)
Format names and versions
Python (3.x), JSON (1.0), Markdown (1.0), TXT (1.0)
Creation dates
2025-10-17 to 2025-10-30
Record creators
Elena Battaner (Universidad Rey Juan Carlos, Madrid, Spain) and Paul Spence (King’s College London, London, United Kingdom)
Language
English, with multilingual metadata fields (ISO 639-1/3 codes)
License
Creative Commons Attribution 4.0 International (CC BY 4.0)
Publication date
2025-11-13
The LAI Benchmarking Repository (Battaner & Spence, 2025) contains the harvesting and preprocessing scripts, infrastructure-specific configuration files, and the derived outputs generated for each infrastructure (component scores, composite LAI scores, and summary reports) examined in this paper. It is designed to support reproduction of the computation from the documented parameters while acknowledging that harvested metadata are a snapshot and may change over time.
All derived outputs were generated from representative samples harvested using documented parameters that reflect the constraints and affordances of each infrastructure’s access mechanisms and metadata model. Sampling, filtering, and deduplication procedures therefore vary in their concrete implementation across cases but are explicitly recorded through configuration files and provenance logs. Metadata were normalized to ensure consistent field naming, language identifiers (ISO 639-1/3), and provider labels, while missing values were retained to reflect each infrastructure’s descriptive practices and field-usage patterns.
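As a rough illustration of this normalisation step, the following sketch aligns field names, maps language identifiers towards ISO 639-1, and retains missing values as nulls rather than dropping records. The field names and the small code table are hypothetical, not the repository’s actual mapping:

```python
# Illustrative subset of an ISO 639-2/639-1 crosswalk (hypothetical, partial).
ISO_639_MAP = {"eng": "en", "english": "en", "deu": "de", "ger": "de", "fra": "fr", "fre": "fr"}

def normalise_record(raw):
    """Align field names, map language identifiers to ISO 639-1 where known,
    and keep missing values as None rather than discarding the record."""
    lang = (raw.get("language") or raw.get("dc:language") or "").strip().lower()
    return {
        "language": ISO_639_MAP.get(lang, lang or None),  # unknown codes pass through as-is
        "title": raw.get("title") or None,
        "provider": (raw.get("provider") or "").strip() or None,
    }

print(normalise_record({"dc:language": "ENG", "title": "A corpus", "provider": " CLARIN-X "}))
```

Retaining nulls, rather than imputing or discarding them, is what later allows field-usage patterns to be read as evidence of each infrastructure’s descriptive practice.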
The deposit provides the empirical basis for the analyses presented in this paper through the deposited derived outputs and the workflow that produces them. Reproducibility is ensured through the combination of infrastructure-specific configuration, shared computational scripts, and recorded parameters, rather than through a single uniform data structure or the redistribution of complete harvested metadata tables. In this context, “dataset” refers to these case-specific derived outputs and the configuration required to regenerate the analyses. The record in Zenodo is intended both to support the replication of the LAI computation and to enable wider reuse for exploratory or comparative analyses of linguistic representation, metadata completeness, institutional concentration, or access conditions within and across research infrastructures.
(3) Method
The LAI was conceived to measure linguistic asymmetry as an infrastructural property rather than a linguistic one. While existing frameworks such as FAIR (Wilkinson et al., 2016), CoreTrustSeal, or META-NET’s Language White Papers (Rehm & Uszkoreit, 2012) assess technical quality or language-technology readiness, none measures the extent to which languages are structurally represented in digital infrastructures. The LAI maps the concept of linguistic asymmetry onto a reproducible benchmark grounded in open metadata, which can complement other critical-qualitative approaches.
(3.1) Concept and rationale
The LAI attempts to assess the degree to which languages are proportionally represented, consistently described, and equally accessible within metadata ecosystems. It does not analyse linguistic content or corpus size, but focuses on metadata architectures and governance arrangements that shape linguistic visibility in digital infrastructures. Its premise is that linguistic asymmetry arises from design choices in schemas, aggregation policies, and/or institutional hierarchies that condition what can be made visible within them.
The LAI follows a composite benchmarking logic3 commonly used to examine multidimensional phenomena such as inequality or sustainability, where measurement serves not as an endpoint but as an instrument for diagnosis and accountability (Espeland & Stevens, 2008). Linguistic asymmetry is treated here as similarly composite, involving representation, linguistic anchoring, descriptive completeness, institutional concentration, and access parity. These dimensions are operationalised through five weighted components whose combined score expresses the overall degree of linguistic asymmetry.
The LAI is therefore not theory-agnostic; each component corresponds to a mechanism through which linguistic asymmetry becomes observable at the infrastructural level: representation captures distributive imbalance; anchor bias reflects mediation and centralisation; completeness addresses descriptive capacity; institutional concentration reflects curatorial authority; and access inequality signals differential conditions of use. These components do not exhaust the phenomenon of linguistic bias nor claim to capture linguistic value or cultural significance; rather, they delimit a set of empirically tractable dimensions that can be observed consistently across heterogeneous infrastructures through metadata practices.
(3.2) Components and weighting logic
Each of the five components aims to isolate a mechanism by which linguistic asymmetry becomes measurable. Selection criteria were conceptual coherence, empirical observability through standard metadata fields, and applicability across heterogeneous infrastructures:
Language Representation Asymmetry (LRA) measures the imbalance of resources across languages. The Gini coefficient and Herfindahl-Hirschman index (HHI) are standard measures of uneven distribution, here used to describe how metadata are concentrated across languages or providers.
English Anchor Bias4 (EAB) (or Parallel Resource Bias (PRB) for corpora)5 captures the extent to which English functions as a systemic pivot, expressed as the ratio of English records to the mean of non-English records.
Metadata Completeness Disparity (MCD) measures differences in descriptive richness through the coefficient of variation of non-null core descriptive fields such as title, description, and keywords (or their mapped equivalents). Because infrastructures expose descriptive information through different serialisations and layers (for example, metadata embedded in TEI headers versus catalogue-level records), the analysis relies on crosswalk-style mappings that extract and normalise these elements into comparable fields before computing disparity (Trippel, 2025).
Institutional Concentration Index (ICI) measures the extent to which language-specific subsets of metadata are dominated by a small number of providers, operationalised through provider shares and HHI.
Access Inequality Index (AII) calculates the Gini coefficient of open versus restricted access proportions per language.
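The distributional measures underpinning LRA, ICI, and AII can be sketched in a few lines of Python. This is a minimal illustration with hypothetical per-language record counts, not the repository’s actual implementation:

```python
from collections import Counter

def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfect equality, ->1 = maximal inequality)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula over the ascending-ordered cumulative distribution.
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n

def hhi(counts):
    """Herfindahl-Hirschman index of shares on a 0-1 scale (1 = full concentration)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) ** 2 for c in counts)

# Hypothetical per-language counts harvested from metadata records.
langs = Counter({"en": 800, "de": 120, "fr": 50, "es": 20, "it": 10})
print(round(gini(langs.values()), 3), round(hhi(langs.values()), 3))
```

The same two functions serve three components: Gini over language shares (LRA), HHI over provider shares within a language (ICI), and Gini over per-language open-access proportions (AII).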
The weighting we chose balances representational, curatorial, and governance factors as follows (see Table 1): LRA 0.35; EAB/PRB 0.20; MCD 0.20; ICI 0.15; AII 0.10. LRA receives the largest share because disproportionate language presence forms the structural basis of asymmetry; EAB and MCD capture centralisation and quality variation; and ICI and AII describe contextual inequalities influencing those distributions.
Table 1
LAI components and weights.
| CODE | COMPONENT | DEFINITION | INDICATOR/METRIC | WEIGHT |
|---|---|---|---|---|
| LRA | Language Representation Asymmetry | Distributional inequality of resources per language | Gini/HHI of language shares | 0.35 |
| EAB | English Anchor Bias | Relative dominance of English as the metadata or interface language | Ratio of English records to mean of non-English records | 0.20 |
| MCD | Metadata Completeness Disparity | Variance in field completeness across languages | Coefficient of variation of non-null core descriptive fields (title, description, keywords, license, provider) | 0.20 |
| ICI | Institutional Concentration Index | Degree of provider clustering per language | Share of total records controlled by top three providers/HHI of providers per language | 0.15 |
| AII | Access Inequality Index | Variation in openness or license distribution across languages | Gini coefficient of open vs. restricted access proportions | 0.10 |
The composite LAI is the weighted sum of normalised component values on a 0–1 scale. Values are normalised so that 0 indicates full symmetry and 1 maximal asymmetry, following the convention of inequality metrics where higher values denote greater imbalance. The scheme privileges interpretability: each component remains independently reproducible and can be recalibrated as new data appear. What must remain constant is the proportional relation between representation (LRA), mediation (EAB), completeness (MCD), concentration (ICI), and access (AII).
Component scores indicate where asymmetry is observable in metadata and access signals; they do not, on their own, identify causal mechanisms, which must be inferred through infrastructure-specific knowledge and, where possible, complementary qualitative analysis.
(3.3) Workflow
The workflow below operationalises the research question by specifying which metadata signals are treated as evidence of representational, descriptive, institutional, and access asymmetries, and by documenting how these signals are normalised and combined into LAI component scores.
The computation of the LAI follows a shared analytical logic applied across all infrastructures, while allowing for infrastructure-specific instantiations dictated by differences in metadata models, access mechanisms, and available fields. All analyses are based exclusively on publicly available metadata obtained from official APIs and Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) endpoints.
For each infrastructure, metadata records are harvested using documented parameters stored in the Zenodo repository, including query constraints, date ranges, and endpoint-specific filters. Because infrastructures expose different metadata schemas and levels of granularity, preprocessing steps such as filtering, deduplication, and language selection necessarily vary in their concrete implementations. These variations reflect infrastructural design choices rather than analytical discretion and are fully documented through configuration files and provenance logs.
Once harvested, metadata are normalised by aligning field names (language, title, description, provider, licence, access rights), mapping language identifiers to ISO 639-1/3, standardising provider labels, and removing duplicates while retaining missing values. Each LAI component is then computed independently using established quantitative measures: distribution and concentration (Gini coefficient and HHI), EAB (ratio of English to non-English records), completeness metrics (coefficient of variation across non-null fields), and access parity (Gini coefficient of open versus restricted records). Component values are normalised to the [0–1] interval and combined with fixed weights to generate composite LAI scores.
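The two remaining component metrics can be sketched as follows. The squashing of the EAB ratio into [0, 1] and the clipping of the coefficient of variation are illustrative normalisation choices, not necessarily those used in the deposited scripts:

```python
import statistics

def english_anchor_bias(n_english, non_english_counts):
    """Ratio of English records to the mean of non-English counts,
    squashed to [0, 1] as r / (r + 1). The squashing is an illustrative
    normalisation choice, not necessarily the repository's."""
    mean_other = statistics.mean(non_english_counts) if non_english_counts else 0
    if mean_other == 0:
        return 1.0 if n_english > 0 else 0.0
    r = n_english / mean_other
    return r / (r + 1)

def completeness_disparity(per_language_completeness):
    """Coefficient of variation (stdev / mean) of per-language completeness
    rates for core descriptive fields, clipped to [0, 1]."""
    vals = list(per_language_completeness.values())
    m = statistics.mean(vals)
    if m == 0:
        return 0.0
    return min(statistics.pstdev(vals) / m, 1.0)

# Hypothetical inputs: English record count, counts for other languages, and the
# per-language share of non-null core fields (title, description, keywords).
print(round(english_anchor_bias(800, [120, 50, 20, 10]), 3))
print(round(completeness_disparity({"en": 0.92, "de": 0.81, "fr": 0.63}), 3))
```

A usage note: because both functions return values in [0, 1] oriented so that higher means more asymmetric, their outputs can feed directly into the weighted composite without further rescaling.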
Reproducibility is ensured at the level of analytical logic, metrics, and weighting scheme rather than through identical preprocessing operations across infrastructures. The workflow is therefore designed to benchmark structural configurations of linguistic asymmetry across heterogeneous systems, not to enforce uniform corpus construction. Comparability is thus grounded not in the homogeneity of the infrastructures’ scope, content, or communities, but in the application of a shared analytical lens that measures how linguistic asymmetry is structurally configured through metadata.
(4) Results and discussion
To address the research question empirically, we applied the LAI to four major European infrastructures representing distinct modes of digital knowledge organisation: CLARIN (CLARIN ERIC, 2012), Europeana (Europeana Foundation, 2008), EUDAT/B2FIND (EUDAT Consortium, 2011), and OpenAIRE (OpenAIRE AMKE, 2018). Together they cover linguistic, cultural-heritage, scientific, and scholarly communication digital environments. The objective was not to provide definitive comparisons but to test the LAI’s diagnostic stability across contrasting architectures and governance models, and to evaluate its analytical potential as a benchmarking tool. All computations used the same five-component model (LRA 0.35, EAB/PRB 0.20, MCD 0.20, ICI 0.15, AII 0.10) and a shared analytical workflow. Metadata were harvested directly from official public endpoints between 17 and 19 October 2025, using only open, verifiable fields.
While a uniform weighting scheme limits methodological noise, variations in composite scores reflect structural differences through the combined action of distinct components. Each case below therefore exemplifies an empirical configuration of asymmetry within its own infrastructural logic.
(4.1) CLARIN ERIC
The CLARIN ERIC research infrastructure6 provides a distributed environment for linguistic data and tools, organised through 47 national and institutional centres connected via the Virtual Language Observatory (VLO) and OAI-PMH endpoints. In CLARIN, discoverability is mediated through Component Metadata Infrastructure (CMDI)-based harvesting: the Virtual Language Observatory aggregates CMDI records from distributed repositories, providing a single search layer over heterogeneous local descriptions (Trippel, 2025). The sample analysed comprised 1,450,774 metadata records, of which 590,774 unique resources were retained after deduplication by persistent identifiers and metadata fingerprinting. The multilingual subset contained 36,790 resources covering 487 languages.
The composite LAI of 0.552 (see Table 2) indicates moderate-high asymmetry. LRA (0.64) is the main contributor: English appears in 78.7% of multilingual resources, and the five most frequent languages (English, German, French, Spanish, Italian) account for over 60% of language occurrences. PRB (0.787) confirms English’s centrality: roughly eight of ten parallel corpora include English as a pivot. MCD is moderate (0.34); field completeness is generally high but drops by about 30 points for less represented languages. ICI (0.528) shows that three centres (Czechia/CZ, Slovenia/SI, Poland/PL) produce over half of the multilingual records. Access Inequality (0.241) indicates a 17.9% gap in open-access availability between languages.
Table 2
Composite LAI for CLARIN.
| COMPONENT | DESCRIPTION | SCORE | WEIGHT | WEIGHTED VALUE |
|---|---|---|---|---|
| LRA | Language Representation Asymmetry | 0.638 | 0.35 | 0.223 |
| PRB | Parallel Resource Bias (English anchor bias) | 0.787 | 0.20 | 0.157 |
| MCD | Metadata Completeness Disparity | 0.341 | 0.20 | 0.0682 |
| ICI | Institutional Concentration Index | 0.528 | 0.15 | 0.0792 |
| AII | Access Inequality Index | 0.241 | 0.10 | 0.0241 |
| Total LAI | | | | 0.552 |
CLARIN has mature metadata governance and established FAIR-oriented practice; the observed asymmetry is consistent with known patterns of disciplinary focus and legacy resource creation within the CLARIN federation.
(4.2) Europeana
The Europeana Data Portal7 aggregates cultural-heritage records from national and thematic providers via the Europeana Data Model (EDM). A sample of 1,000 records retrieved on 19 October 2025 included 15 language codes, with 14% missing dc:language values.
The composite score (0.31) for Europeana (see Table 3) corresponds to a configuration of asymmetry largely shaped by institutional segmentation. In this case, LRA (0.45) reflects a skew toward Swedish records (≈85%) caused by one large provider. EAB is negligible (0.008) as English appears in ≈1% of entries. Metadata completeness is uniform (≈80%) and shows no variation across languages. Institutional Concentration (1.0) is absolute: each language is represented by a single provider. Access Inequality is absent (0.0).
Table 3
Composite LAI for Europeana.
| COMPONENT | DESCRIPTION | SCORE | WEIGHT | WEIGHTED CONTRIBUTION |
|---|---|---|---|---|
| LRA | Language Representation Asymmetry | 0.4522 | 0.35 | 0.158 |
| EAB | English Anchor Bias | 0.0079 | 0.20 | 0.001 |
| MCD | Metadata Completeness Disparity | 0.0002 | 0.20 | 0.000 |
| ICI | Institutional Concentration Index | 1.0000 | 0.15 | 0.150 |
| AII | Access Inequality Index | 0.0000 | 0.10 | 0.000 |
| Total LAI | | | | 0.309 |
Europeana’s metadata architecture is characterised by controlled vocabularies and standardised fields that support linguistic consistency. However, the distribution of data by provider suggests a degree of institutional segmentation. The relatively low asymmetry values of the sample should therefore be interpreted as reflecting technical uniformity rather than cross-lingual integration.
(4.3) EUDAT/B2FIND
EUDAT’s B2FIND8 catalogue aggregates scientific datasets from European repositories using a harmonised DataCite/DCAT-AP model. Language metadata are optional. The sample of 15,000 records (≈1% of the catalogue) was retrieved from the public API on 19 October 2025.
The composite score (0.53) for EUDAT/B2FIND (see Table 4) indicates moderate-high asymmetry. Ninety-five per cent of records lack language tags; of the remainder, English covers ≈54%. The high Gini (0.82) and HHI (4471) values confirm high concentration. EAB (0.68) adds further imbalance: English records outnumber others 14 to 1. MCD is virtually nil because completeness is uniformly minimal. Institutional Concentration (0.80) reveals that three providers (Blue-Cloud, DANS, Askeladden) account for 96% of resources. Access Inequality (0.40) shows that none of the sampled records explicitly indicate open access.
Table 4
Composite LAI for EUDAT/B2FIND.
| COMPONENT | DESCRIPTION | SCORE | WEIGHT | WEIGHTED CONTRIBUTION |
|---|---|---|---|---|
| LRA | Language Representation Asymmetry | 0.6703 | 0.35 | 0.234 |
| EAB | English Anchor Bias | 0.6775 | 0.20 | 0.135 |
| MCD | Metadata Completeness Disparity | 0.0002 | 0.20 | 0.000 |
| ICI | Institutional Concentration Index | 0.8049 | 0.15 | 0.120 |
| AII | Access Inequality Index | 0.4000 | 0.10 | 0.040 |
| Total LAI | | | | 0.531 |
The EUDAT/B2FIND results correspond to a pattern that can be described, analytically, as technocratic asymmetry, where efficiency and standardisation appear to be associated with reduced linguistic specificity at the level of metadata description. Because the DataCite schema treats language fields as optional, many repositories omit them, which results in uniform but linguistically limited metadata.
(4.4) OpenAIRE Graph
The OpenAIRE Research Graph9 aggregates metadata for publications and datasets from institutional repositories and data archives across Europe. A sample of 5,130 records retrieved on 18 October 2025 covered 20 language codes with complete descriptive and access metadata.
The composite LAI of 0.44 for the OpenAIRE Graph (see Table 5) shows moderate asymmetry. English accounts for 68% of records; the next four languages (German, French, Spanish, Italian) together represent less than 20%. MCD (0.06) is low; field completeness ranges from 59% to 85%. Institutional Concentration (0.45) reveals a limited set of dominant providers (the top three account for ≈45% of outputs). Access Inequality (0.11) shows modest variation in open-access rates by language.
Table 5
Composite LAI for OpenAIRE Graph.
| COMPONENT | DESCRIPTION | SCORE | WEIGHT | WEIGHTED CONTRIBUTION |
|---|---|---|---|---|
| LRA | Language Representation Asymmetry | 0.630 | 0.35 | 0.221 |
| EAB | English Anchor Bias | 0.654 | 0.20 | 0.131 |
| MCD | Metadata Completeness Disparity | 0.060 | 0.20 | 0.0120 |
| ICI | Institutional Concentration Index | 0.446 | 0.15 | 0.0669 |
| AII | Access Inequality Index | 0.108 | 0.10 | 0.0108 |
| Total LAI | | | | 0.441 |
The asymmetry observed in OpenAIRE can be read as systemic rather than architectural. It reflects broader trends in global scholarly communication, where English often functions as the main language of record. Technical governance does not explicitly encode linguistic hierarchies, but the aggregated corpus is not neutral in its linguistic distribution. The LAI here registers how external linguistic economies shape infrastructural outputs even under FAIR-compliant conditions.
(4.5) Cross-infrastructural synthesis
The application of the proposed LAI across metadata samples from four heterogeneous infrastructures suggests its potential for capturing structural patterns of linguistic asymmetry, independently of disciplinary scope or data architecture.10 Although the infrastructures analysed operate under distinct missions and technical frameworks, each produces a measurable configuration of asymmetry that offers potential insights into its internal logic of data/metadata production, aggregation, and governance. While the LAI is still an experimental prototype, the results demonstrate its diagnostic versatility: rather than yielding comparative rankings, the index reveals how different infrastructural designs enact distinct regimes of linguistic asymmetry according to different component metrics.
The LAI highlights that linguistic asymmetry may take multiple infrastructural forms: disciplinary, when epistemic norms consolidate around English (CLARIN); institutional, when multilingualism depends on fragmented national providers (Europeana); technocratic, when standardisation erases linguistic difference (EUDAT/B2FIND); or systemic, when publication and indexing reproduce global monolingualism (OpenAIRE). Taken together, these configurations suggest that linguistic asymmetry is not incidental but recurrent within digital infrastructures. Each LAI score should therefore be understood as a situated diagnostic rather than a comparative evaluation.
Methodologically, this cross-infrastructural exercise has demonstrated the potential value of the LAI in capturing linguistic asymmetry. Its reproducibility and interpretive stability rest on the use of identical weights, normalisation, and workflow across all analyses, allowing linguistic asymmetry to be treated as an infrastructural variable measurable through open metadata. The diversity of asymmetry types, whether disciplinary, institutional, technocratic, or systemic, illustrates that such imbalance takes different forms depending on where authority resides: in expertise, federation, standardisation, or publishing. These configurations do not derive from individual component scores but from the combined pattern formed by all five dimensions.
Conceptually, these findings help to situate multilingualism within debates about open science governance. Infrastructures may comply with FAIR data principles and still be linguistically asymmetric, as openness alone does not guarantee equity. Benchmarking linguistic representation therefore complements existing audits of accessibility and interoperability by addressing visibility and inclusion.
By transforming benchmarking from a performance metric into a diagnostic of equity, the proposed LAI aims to provide an empirical foundation to accompany qualitative reflections on how infrastructures distribute linguistic visibility. Making asymmetry measurable turns it into an accountable dimension of governance: a property that can be monitored, critiqued, discussed, and eventually mitigated through policy.
(5) Implications/Applications
This discussion interprets the component profiles as infrastructure-specific configurations of linguistic visibility rather than as performance rankings,12 and identifies which elements of design and governance plausibly condition the observed asymmetries.
The results suggest that linguistic asymmetry is recurrent within digital infrastructures, as each case exhibits a particular configuration shaped by its technical and institutional context. CLARIN appears to concentrate descriptive capacity in a few dominant languages and centres; Europeana suggests how institutional silos may limit cross-lingual integration; EUDAT/B2FIND indicates that interoperability can reduce linguistic specificity; and OpenAIRE reflects language dynamics in global publishing systems.11
This diagnostic perspective invites a reconsideration of benchmarking as a practice of observation12 rather than validation. First, the proposed LAI does not determine whether an infrastructure succeeds or not, but rather examines how its design may distribute linguistic (under)representation. Such observation is necessary if openness and inclusion are to be understood not only as ethical commitments but as structural conditions. Second, it contributes to understanding that measurement and interpretation can coexist because quantification here is not an endpoint but a form of evidence to support critical analysis.
The LAI also has limitations that must be acknowledged. It relies on samples of publicly available metadata, which represent only part of each infrastructure’s holdings. These samples, drawn from specific API or OAI-PMH endpoints, are not always exhaustive—nor always easily accessible—and may omit records that are temporarily inaccessible or inconsistently indexed. Moreover, the analysis corresponds to a specific temporal snapshot, a harvest conducted in October 2025, and as infrastructures are continuously updated, their linguistic composition and metadata quality evolve. The results should therefore be read as indicative rather than definitive, representing a diagnostic snapshot of each infrastructure at a given time. The five components simplify complex processes; they offer an operational approximation to multilingual dynamics rather than a comprehensive account. The LAI captures a surface of representation: what infrastructures make visible at a given moment, not the totality of their linguistic practices.
Despite these constraints, the index provides a shared vocabulary for describing infrastructural linguistic asymmetry, enabling multilingualism to be analysed in terms of proportionality, visibility, and practice. Its reproducibility also makes it suitable for longitudinal observation, as repeated computation could reveal how policy or governance changes affect asymmetry over time. The same analytical logic may be transferable to other contexts, such as large knowledge graphs, digital archives, or AI training datasets, where similar mechanisms of concentration or imbalance can occur.
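To make the reproducibility claim concrete, the sketch below shows one way the five components could be combined into a single index for repeated, longitudinal computation. The component names follow the paper, but the equal weighting, the 0–1 normalisation, the function names, and the example values are all illustrative assumptions rather than the published LAI specification.

```python
# Hypothetical sketch of an LAI-style computation. Component names follow
# the paper; the equal weights, 0-1 scale, and example values are
# illustrative assumptions, not the published formula.
from dataclasses import dataclass

@dataclass
class LAIComponents:
    language_representation: float      # spread of records across languages
    english_anchoring: float            # English Anchor Bias (EAB)
    completeness_disparity: float       # metadata completeness gap by language
    institutional_concentration: float  # concentration of providers/centres
    access_inequality: float            # disparity in access conditions

def lai_score(c: LAIComponents, weights=None) -> float:
    """Weighted mean of the five components, each assumed normalised to [0, 1]."""
    values = [
        c.language_representation,
        c.english_anchoring,
        c.completeness_disparity,
        c.institutional_concentration,
        c.access_inequality,
    ]
    if weights is None:
        weights = [0.2] * 5  # equal weighting: an illustrative choice only
    assert all(0.0 <= v <= 1.0 for v in values), "components must be normalised"
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Example with fictitious values for a single harvest snapshot; comparing
# scores from successive harvests would support longitudinal observation.
snapshot = LAIComponents(0.62, 0.71, 0.45, 0.58, 0.33)
print(round(lai_score(snapshot), 3))
```

Because each component is computed from publicly available metadata, re-running such a procedure on successive harvests would yield the kind of monitoring baseline the text describes.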
Ultimately, the LAI aims to demonstrate that linguistic equity can be examined empirically without reducing it to a technical metric. Measuring asymmetry exposes how infrastructures condition linguistic visibility, and it prompts reflection on the effect of design choices on representation. The goal is diagnostic, not prescriptive: used as a benchmark, the LAI establishes a baseline profile of linguistic asymmetry and supports monitoring the effects of interventions such as language-tagging mandates, provider diversification, and/or access-metadata harmonisation. Future work should refine both the conceptual model and the empirical procedure, expanding the quantitative base, integrating qualitative policy analysis, and testing the index across other linguistic and geographical contexts. In this sense, the LAI remains an open framework, conceived as an evolving method for analysing how infrastructures shape the linguistic dimension of digital scholarship.
Notes
[1] See Europeana Foundation, Europeana DSI-4 Multilingual Strategy (2020), available at https://pro.europeana.eu/post/europeana-dsi-4-multilingual-strategy (last accessed: 2026-01-23); cf. OpenAIRE Research Graph Documentation, available at https://graph.openaire.eu (last accessed: 2026-01-23), where multilingual integration occurs through aggregated metadata but without a formal policy.
[2] When referring to OpenAIRE in this study, the term designates the OpenAIRE Research Graph, the central aggregation infrastructure operated by OpenAIRE AMKE. The Graph integrates metadata from national repositories, data archives, and publication databases into a unified open knowledge graph.
[3] For a complementary approach, see Akindotuni (2025), which measures linguistic inequality through technology-oriented indices describing resource coverage and tool availability: the Resource Parity Index (RPI), the Linguistic Coverage Score (LCS), and the Tool Ecosystem Completeness Score (TECS). While the LAI focuses on infrastructural equity, these metrics address language-technology readiness (see also di Buono et al., 2022).
[4] In this study, English (ISO 639-3: eng) is used as the candidate pivot language for EAB/PRB due to its dominant role in many digital research environments. Low EAB scores suggest that English does not function as a pivot. PRB can be applied in analogous ways using other pivot languages, as needed.
[5] English Anchor Bias (EAB) designates the extent to which English functions as a systemic pivot language in metadata or documentation, shaping interoperability and visibility across infrastructures. In corpus contexts, this phenomenon is termed Parallel Resource Bias (PRB), referring to the recurring pairing of English with other languages in multilingual datasets.
[6] https://www.clarin.eu/ (last accessed: 2026-01-29).
[7] https://www.europeana.eu/en (last accessed: 2026-01-29).
[8] https://eudat.eu/service-catalogue/b2find (last accessed: 2026-01-29).
[9] https://explore.openaire.eu/ (last accessed: 2026-01-29).
[10] A comparative table of LAI values is omitted by design. Although all infrastructures were analysed with the same components and weights, the aim is to test the LAI’s benchmarking potential rather than to rank them. Each score reflects a specific configuration of asymmetry shaped by infrastructural design and metadata policy, not a measure of relative performance.
[11] While the analysis links each infrastructure to a characteristic form of asymmetry, this does not reflect a one-to-one correspondence between LAI components and cases. The typology is descriptive, indicating patterns observed across metrics rather than the results of specific variables.
[12] The cultures of measurement have been extensively examined within the sociology of quantification, which shows how indicators acquire normative authority within institutions (Espeland & Stevens, 2008).
Competing interests
The authors have no competing interests to declare.
Author Contributions
Elena Battaner: Conceptualization, Investigation, Methodology, Writing – original draft, Writing – review & editing.
Paul Spence: Conceptualization, Investigation, Methodology, Writing – original draft, Writing – review & editing.
