
Modelling Temporal Public Transport Systems in Wikidata: A Workflow for Integrating Historical Infrastructure Data

Open Access | Feb 2026


1. Context and Motivation

1.1 The Challenge of Historical Infrastructure in Knowledge Bases

Historical transport systems resist standard data models because they embody multiple intersecting temporalities: changing names across political regimes, evolving line configurations, discontinuous operation periods, and varying service patterns. Historical sources present particular challenges for linked open data representation, requiring careful consideration of how to maintain immutability while representing change over time, and how to document provenance for entities that undergo transformation (Meroño-Peñuela & Hoekstra, 2014). The abstraction process requires numerous decisions about feature selection, data formats, and similar matters, which each project makes in alignment with its research question. Representing critical discourses, including uncertain information, competing hypotheses, and temporally evolving data, adds further complexity to knowledge representation in collaborative platforms (Di Pasquale et al., 2023). This variability and messiness make integrating historical data into Linked Open Data (LOD) complex and rarely straightforward.

This paper addresses these tensions by proposing a systematic workflow for integrating historical humanities research data with Wikidata that explicitly foregrounds temporal complexity, geographic uncertainty, and provenance documentation. The workflow emphasises systematic investigation of existing modelling practices before contribution, recognising that Wikidata’s data quality and completeness vary significantly by domain and must be assessed empirically (Färber et al., 2017; Zhao, 2023). Using transcribed Berlin transport timetables (Fahrplanbücher) spanning 1946–1989, we demonstrate how computational investigation of Wikidata’s current modelling practices can inform upload strategies that both respect community standards and advance representational capacities for historical phenomena.

1.2 Wikidata as Infrastructure for Historical Research Data

Founded in 2012 as a Wikimedia project, Wikidata has evolved into a multilingual, collaboratively edited knowledge base containing over 100 million items structured through properties and qualifiers (Lemus-Rojas & Pintscher, 2018). For research projects, Wikidata offers several benefits: community maintenance extending beyond individual project lifespans, integration with broader linked open data networks, and multilingual support facilitating international collaboration (Thornton et al., 2018).

Yet Wikidata’s coverage of historical domains remains uneven. Research on knowledge graphs demonstrates this pattern: Gottschalk and Demidova (2018, p. 2) found that in Wikidata, only 33% of events include temporal information and merely 11.7% provide spatial context. Data quality assessments reveal systematic gaps in completeness, defined as “the extent to which data are of sufficient breadth, depth, and scope for the task at hand”, and in timeliness, particularly for historical phenomena (Färber et al., 2017, p. 10). While some areas benefit from systematic data integration, historical infrastructure networks receive fragmentary attention focused on contemporary configurations, according to our own investigations. Critically, data absence can be misleading: when researchers query incomplete collections without awareness of the limitations, missing data may be misinterpreted as a lack of historical existence.

While Wikidata research has grown substantially since 2012, use cases remain concentrated in biomedical and linguistic domains, with humanities applications relatively scarce (Farda-Sarbas & Müller-Birn, 2019, p. 8). A systematic review found that digital humanities projects predominantly consume Wikidata rather than publish to it, with 45 of 50 reviewed projects extracting data while fewer engaged in contribution (Zhao, 2023, p. 869). This consumption-focused pattern persists despite Wikidata’s design as collaborative infrastructure where community contributions extend project lifespans beyond individual institutional commitments (Thornton et al., 2018). The asymmetry limits Wikidata’s capacity as historical research infrastructure. It is with this in mind that we choose to upload our data on Berlin’s transport history to Wikidata and suggest a workflow for other researchers to follow.

1.3 Berlin’s Divided Transport as Critical Case

Berlin’s transport history during division (1945–1989) provides a rich case study for examining temporal modelling challenges. The city’s partition created duplicate administrative structures, separate timetable publications, and differential investment patterns that profoundly shaped infrastructure development. This historical complexity intersects with a broader challenge. Formal data standards for public transport emerged only in the 1990s–2000s (Transmodel, GTFS), with no standardised formats existing for our study period (Keller et al., 2014). While contemporary transport data benefits from these standards and linked data publication efforts (Plu & Scharffe, 2012), historical transport networks remain largely disconnected from semantic web infrastructure such as Wikidata.

The complexities of this division amplify challenges common to any historical infrastructure project: discontinuous operation periods, renamed entities requiring temporal qualification, relocated stations raising identity questions, and service characteristics varying across time periods. Rather than treating Berlin’s division as exceptional, we suggest it is paradigmatic of issues faced by humanities research projects dealing with changes over time and space. Historical infrastructure systems require robust temporal modelling to represent events, relationships, and changing configurations—elements that existing knowledge graphs inadequately capture (Gottschalk & Demidova, 2018, p. 1). Representational inadequacies become visible in this context and illuminate issues present in other attempts to model infrastructure temporality.

Our research question thus extends beyond Cold War transport to address fundamental methodological concerns: What workflows enable humanities researchers to integrate historically complex datasets with Wikidata while preserving temporal nuance, documenting provenance, representing uncertainty, and remaining maintainable by the broader community? Answering this requires bridging traditional ontology design approaches, which impose tightly controlled schemas, with Wikibase’s flexible, community-driven model where semantics emerge bottom-up from usage (Shimizu et al., 2022). The following sections present a three-phase methodology developed in response to this question, analyse findings from computational investigation of current Wikidata modelling practices, and propose strategies applicable beyond transport history to any historical infrastructure domain.

1.4 Paper Contributions

This paper makes three contributions to digital humanities research infrastructure. First, we present a comprehensive historical dataset of Berlin’s public transport network (1946–1989, described in Section 2), published openly on Zenodo to enable research on Cold War urban infrastructure, transport accessibility evolution, and comparative urban development independent of Wikidata integration. Second, we demonstrate how systematic computational investigation of existing Wikidata modelling practices—examining how transport infrastructure is currently represented, analysing coverage of relevant properties and their qualifiers (particularly temporal qualifiers and source references)—can reveal gaps and inconsistencies that inform upload strategies. Third, we propose a three-phase workflow (documentation, investigation, planning) for preparing historical research data for Wikidata contribution that addresses temporal complexity, geographic uncertainty, and provenance documentation while maintaining compatibility with community standards. This workflow is transferable to other historical infrastructure domains considering linked open data integration.

2. Dataset Description

This section describes the Berlin Public Transport Network 1946–1989 dataset (Zenodo DOI: 10.5281/zenodo.16352848), which forms the foundation for both this methodological study and an ongoing doctoral project investigating public transport accessibility changes in divided Berlin during the Cold War.

2.1 Research Context

The dataset emerges from doctoral research investigating how Berlin’s political division shaped public transport accessibility patterns from 1946 to 1989. The project’s central question asks how accessibility to essential urban services via public transport diverged between East and West Berlin, and what these patterns reveal about the relationship between political systems and infrastructure development.

While archival sources exist for both sectors, no unified digital representation spans the division period. This dataset enables computational accessibility analysis using network science methods and comparative infrastructure analysis revealing differential investment patterns, and it establishes research data infrastructure for future Cold War urban history studies.

2.2 Integration Strategy with Wikidata

During active research (current status: one year into doctoral project), data resides in Neo4j as a property graph, enabling complex network queries about route efficiency and accessibility. However, long-term sustainability requires migration to Wikidata. Upon project completion (anticipated 3–4 years), the dataset will be uploaded to Wikidata as linked open data. This dual-database strategy acknowledges that optimal research infrastructure differs from optimal preservation infrastructure while planning explicit bridges between them.

2.3 Dataset Specifications

Repository location: https://doi.org/10.5281/zenodo.16352848

Repository name: Zenodo

Object name: Berlin Public Transport Network 1946–1989: Transcribed Timetable Data from Divided Berlin

Format names and versions: CSV (UTF-8); three relational tables (stations.csv, lines.csv, line_stops.csv)

Creation dates: 2024-03-01 to 2025-10-26 (ongoing expansion planned)

Dataset creators: Noah Baumann (https://orcid.org/0009-0004-6368-3061), Humboldt-Universität zu Berlin

Language: German (station names, variable names); English (documentation)

License: CC BY-SA

Publication date: 2025-10-26 (Version 1.0.1)

2.4 Dataset Contents

The dataset currently comprises 19758 station records, 2099 line records, and 28982 connections from systematic transcription of 36 Fahrplanbücher (timetable books) held in the BVG-Archive and Wilhelm-Grimm-Zentrum.

Temporal coverage: 13 snapshots from 1946–1989 (1946, 1951, 1956, 1960, 1961, 1964, 1967, 1971, 1976, 1980, 1982, 1984, 1989). Sampling ensures at least one data point per five-year period while emphasising historically significant moments. From 1955 onward, separate Fahrplanbücher reflect administrative division.

Geographic coverage: All four main transport modalities: U-Bahn (underground), S-Bahn (urban rail), Straßenbahn (tram), and bus are included.

Data structure: The dataset uses a relational model with three CSV tables that capture different aspects of the transport network:

  • stations.csv represents physical infrastructure—the stations where passengers could board the respective transport mode. Each record includes: stop_id (unique identifier), stop_name, typ (modality: U-Bahn, S-Bahn, Straßenbahn, or bus), standort (WGS84 coordinates), in_linien (which lines served this station), and wikidata_id (for reconciliation). Individual physical locations are deliberately left ungrouped across time and space, preserving the flexibility required by different research contexts.

  • lines.csv represents service offerings—the routes passengers could take between stations. Each record captures a line in a specific year with: line_id, jahr, linien_name, typ, dauer (end-to-end journey time in minutes), ost_west (administrative zone), frequenz (service frequency at 7:30am weekdays in minutes), länge (route length in km), start, and ziel (terminal stations). Line renumbering or route changes appear as distinct records.

  • line_stops.csv links stations to lines, establishing the sequence of stations along each route through line_id (foreign key to lines.csv), stop_order (sequential position), and stop_id (foreign key to stations.csv).

This structure enables network topology reconstruction: which stations connected to which others, how many transfers were required between locations, and how route configurations evolved over time.
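As a minimal sketch of this topology reconstruction, the three tables can be joined by their foreign keys and turned into a graph. The rows below are illustrative, not actual dataset records, and networkx is an assumed tooling choice for the sketch (the project itself uses Neo4j for such queries):

```python
import pandas as pd
import networkx as nx

# Illustrative rows mimicking the published schema (not real dataset records).
line_stops = pd.DataFrame({
    "line_id": ["l1", "l1", "l1"],      # foreign key to lines.csv
    "stop_order": [1, 2, 3],            # sequential position along the route
    "stop_id": ["s1", "s2", "s3"],      # foreign key to stations.csv
})

def build_network(line_stops: pd.DataFrame) -> nx.Graph:
    """Connect consecutive stops along each line into an undirected graph."""
    g = nx.Graph()
    for line_id, grp in line_stops.sort_values("stop_order").groupby("line_id"):
        stops = grp["stop_id"].tolist()
        for a, b in zip(stops, stops[1:]):
            g.add_edge(a, b, line_id=line_id)
    return g

g = build_network(line_stops)
# Hops between the two terminal stations of this toy route:
print(nx.shortest_path_length(g, "s1", "s3"))  # 2
```

On the full dataset, the same join pattern supports the transfer and centrality questions described above, with one graph per snapshot year.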

2.5 Methodology and Normalisations

Manual transcription addressed complex historical layouts. Key normalisations: station naming added cross-street information where needed (“X Ecke Y” format); U-Bahn and S-Bahn comprehensively recorded while trams/buses show only “important” stops (criteria varying by year/zone); S-Bahn data supplemented from Reichsbahn sources; geolocation via hierarchical methodology (Wikidata reconciliation, historical maps, cross-street analysis).

2.6 FAIR Data Principles

This dataset addresses FAIR principles for research data management. Findability is supported through persistent DOI assignment on Zenodo (10.5281/zenodo.16352848), comprehensive metadata, and inclusion of Wikidata identifiers enabling cross-platform discovery. Accessibility is maintained through open publication under CC-BY-SA license on Zenodo guaranteeing long-term preservation. Full accessibility, however, faces constraints: the original Fahrplanbücher in the BVG-Archive lack systematic catalogue numbers, and while transcription involved photographing archival pages, these images are not released currently. Interoperability is supported by standard CSV format with UTF-8 encoding, WGS84 coordinate system, and explicit field definitions enabling integration with GIS tools, network analysis software, and other geo-temporal datasets. Reusability is facilitated through detailed methodology documentation in the accompanying README, explicit description of normalisations and data quality considerations, and clear provenance chains to archival sources.

2.7 Usage

The dataset enables researchers to examine Berlin’s transport history at scale. The temporal coverage supports “distant reading” of infrastructure evolution—identifying system-wide changes, centrality shifts, and accessibility patterns—that can be combined with qualitative sources like resident surveys, private correspondence, and policy debates to understand how Berliners experienced and contested these transformations.

Current applications of this approach being developed include examining how East Berlin’s new social housing developments in Marzahn and Hellersdorf during the 1970s–1980s were integrated into the transport network. Network analysis reveals which existing districts became accessible from these peripheral developments, while contemporary resident surveys provide evidence of satisfaction or frustration with transport provision. In West Berlin, the dataset supports analysis of the 1960s–1970s tram network removal by comparing capacity across Bezirke and Ortsteile before and after the transition to buses and U-Bahn expansion. Letters to the BVG and Berlin government requesting new routes or protesting service reductions provide qualitative context for understanding which neighbourhoods gained or lost access through this infrastructure transformation.

These usage examples are currently being developed but many more are possible. Full methodology documentation accompanies the dataset at Zenodo. Upon doctoral research completion, comprehensive Wikidata upload will make this data available as linked open data, following the workflow detailed in subsequent sections.

2.8 Known Limitations

Several limitations constrain the dataset’s scope and precision. Station coverage is selective for bus and tram networks: Fahrplanbücher listed only “important” stops, with selection criteria varying between years and administrative zones. U-Bahn and S-Bahn stations are comprehensively recorded, but complete bus/tram stop inventories are not available from the historical sources consulted.

Frequency data represents a single temporal slice—7:30am on weekdays—rather than comprehensive service schedules. This “snapshot” approach, where data captures network state at specific moments rather than continuous monitoring, enables temporal comparison across the 13 sampled years but cannot reveal hourly, daily, or seasonal service variations within those years.

The dataset documents planned service configurations as published in official timetables, excluding temporary diversions, construction-related route changes, or service disruptions. Geographic precision varies by modality: while U-Bahn and S-Bahn coordinates derive from Wikidata reconciliation with high accuracy, bus and tram stop coordinates are often estimated from street intersection analysis and may deviate from exact historical platform locations.

Finally, the 13 temporal snapshots (1946–1989) provide representative coverage with at least one year per five-year period, but gaps remain. Network evolution between snapshot years cannot be traced, and short-lived route modifications or experimental services operating only between sampled years would not appear in the dataset.

3. Method

This section presents a three-phase workflow for integrating historical research datasets with Wikidata, developed through iterative application to Berlin public transport data. Phases 1 and 2 document implemented methodology during active research (2024–2025), while Phase 3 presents planned upload strategies informed by the investigation findings. This structure reflects the reality that methodological planning must precede large-scale contribution to collaborative knowledge bases. Each phase addresses distinct challenges while maintaining connectivity across the research lifecycle, from active data collection through exploratory analysis to preservation-oriented publication. The methodology emphasises transparency in decision-making, computational validation where possible, and explicit documentation of limitations and uncertainty.

3.1 Phase 1: Data Documentation and Preliminary Reconciliation

Integrating Wikidata with historical research projects through entity reconciliation can provide immediate benefits for data enrichment and reveal modelling challenges that inform both database design and eventual upload strategies. For our project, Wikidata was a beneficial LOD platform due to its complete coverage of the S-Bahn and U-Bahn transport modes, importantly including the coordinate location for each station.

3.1.1 Documentation as Research Tool

The modelling process for historical projects involves extensive decisions leading from historical source to data model. Documentation is essential both for future data reuse and as a record of decisions that can be reviewed and amended during the project lifecycle. We therefore created a detailed README document concurrent with data transcription rather than as post-hoc documentation. This README serves multiple functions. It explains methodological decisions made during data collection (e.g., when to treat slightly relocated bus stops as identical vs. distinct entities), documents normalisations applied to historical sources (e.g., standardising street name suffixes), and makes data limitations explicit (e.g., acknowledging that bus/tram station coverage follows historical timetables’ “important stations” criteria rather than comprehensive enumeration).

The resulting README provides comprehensive documentation of source materials, temporal coverage decisions, geolocation methodology, and scope limitations, making data derivation transparent for future users. This documentation strategy attempts to anticipate future users’ questions about data derivation and to make those answers immediately accessible.

3.1.2 Early Reconciliation with OpenRefine

While transcription was ongoing, we performed iterative reconciliation using OpenRefine’s Wikidata reconciliation service. This process involved loading station names into OpenRefine, reconciling against Wikidata entities, and manually verifying match quality. Reconciliation results were stored in the dataset itself via a wikidata_id column, creating a persistent mapping between our research database and Wikidata’s identifier space.

Reconciliation outcomes across 19758 station records:

  • U-Bahn: 100% matched (1459 records)

  • S-Bahn: 100% matched (1767 records)

  • Tram: 36.3% matched (1404 of 3869 records)

  • Bus: 0.1% matched (12 of 12413 records)

These figures revealed Wikidata’s coverage gaps quantitatively and informed our understanding of which station types would require creation versus enhancement during upload. Importantly, unmatched stations were not assumed to be absent from Wikidata; rather, each required manual verification to prevent duplicate creation. Explicit recording of reconciliation decisions (i.e. why certain matching types were accepted or rejected) proved valuable during later upload planning when edge cases required consistent handling. A key benefit is that we recorded which stations exist in Wikidata and which do not. Upon project completion, we will be able to fill identified gaps with stations that have proper temporal identification and manually georeferenced coordinate locations.
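Coverage figures of this kind can be computed directly from the wikidata_id column that stores the reconciliation results. A minimal sketch, using hypothetical rows rather than the published data:

```python
import pandas as pd

# Hypothetical slice of stations.csv; wikidata_id is empty when unreconciled.
stations = pd.DataFrame({
    "stop_id": ["u1", "u2", "t1", "t2", "b1"],
    "typ": ["U-Bahn", "U-Bahn", "Straßenbahn", "Straßenbahn", "Bus"],
    "wikidata_id": ["Q99", "Q100", "Q101", None, None],
})

# Per-modality counts of total vs. reconciled records.
coverage = (
    stations.assign(matched=stations["wikidata_id"].notna())
    .groupby("typ")["matched"]
    .agg(total="size", matched="sum")
)
coverage["coverage_pct"] = 100 * coverage["matched"] / coverage["total"]
print(coverage)
```

Run over the real tables, this style of aggregation yields the per-modality percentages reported above.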

3.2 Phase 2: Computational Investigation of Wikidata’s Current State

Before designing an upload strategy, we systematically investigated how Berlin public transport is currently modelled in Wikidata. This investigation combines close reading of individual entity structures with distant reading via SPARQL queries.

3.2.1 Close Reading: Manual Entity Inspection

We began with detailed examination of exemplar entities representing different station types and temporal situations. This manual investigation documented:

  • Classification patterns: How stations use ‘instance of’ (P31) for both generic types (e.g., metro station Q928830) and local types (e.g., Berlin U-Bahn station Q110977120)

  • Property usage: Which properties are consistently populated (coordinates P625, named after P138) versus sporadically present (state of use P5817)

  • Temporal modelling approaches: How existing editors represent line changes, station renaming, and service modifications

  • Reference practices: Presence/absence of citations, types of sources referenced

  • Administrative geography: How entities handle East Berlin (Q56037) versus West Berlin (Q56036) versus unified Berlin (Q64)

This close reading identified exemplar cases that informed distant reading query design. Classification patterns warrant particular scrutiny because prior research has documented large-scale conceptual disarray in Wikidata’s multi-level taxonomies, with 87.5% of classes in multi-level structures flagged for ambiguous or inconsistent use of instantiation and subclassing relations (Dadalto et al., 2024, p. 2). Our investigation therefore prioritised examining whether transport infrastructure exhibited similar classification problems. For instance, Brandenburger Tor station (Q3643649) used temporal qualifiers on connecting service (P1192) relationships, establishing this as existing practice, while Jannowitzbrücke (Q1647893) lacked such qualifiers despite its ghost station history.

3.2.2 Distant Reading: SPARQL-Based Systematic Analysis

We developed a Python-based investigation framework using SPARQLWrapper to query Wikidata systematically across all Berlin transport entities. The code structure implemented reusable functions for querying Wikidata, parsing results into pandas DataFrames, and visualising patterns. The complete investigation framework, including Python code using SPARQLWrapper and Jupyter notebooks for all analyses, is publicly available at https://scm.cms.hu-berlin.de/baumanoa/johd_wikidata_paper (accessed 30.11.2025). This approach follows established methodologies for extracting empirical patterns from knowledge graphs through systematic property analysis and domain-specific sub-schema identification (Baroncini et al., 2022; Carriero et al., 2024). Key analyses included:

  • Coverage analysis: Querying all stations of each type in Wikidata, comparing with dataset coverage

  • Temporal qualifier analysis: Checking for presence/absence of ‘start time’ (P580) and ‘end time’ (P582) qualifiers on time-varying relationships

  • Reference quality analysis: Determining what proportion of statements included citations and what sources were used

Wikidata uses reification to attach qualifiers to statements, “turning a property into an instance, so that additional assertions may be attached to the property” (Shimizu et al., 2022, p. 2). This enables temporal scoping, for example, a station’s connection to a line from 1968–2011 uses temporal qualifiers to constrain the validity of the base statement. Our queries systematically checked whether this mechanism was applied to transport relationships.
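One such check can be sketched as follows. The SPARQL string uses Wikidata’s standard p:/ps:/pq: reification prefixes and the entity IDs discussed in this paper; the helper function and mock result rows are illustrative, not output from the actual investigation framework:

```python
# Does a 'connecting line' statement carry a 'start time' (P580) qualifier?
# p:P81 reaches the statement node; pq:P580 reads the qualifier, if present.
QUERY = """
SELECT ?station ?line ?start WHERE {
  ?station wdt:P31 wd:Q110977120 .      # instance of Berlin U-Bahn station
  ?station p:P81 ?stmt .                # connecting line statement node
  ?stmt ps:P81 ?line .
  OPTIONAL { ?stmt pq:P580 ?start . }   # start-time qualifier, if any
}
"""

def qualifier_coverage(bindings: list) -> float:
    """Share of result rows (as SPARQL JSON bindings) that bound ?start."""
    if not bindings:
        return 0.0
    with_start = sum(1 for b in bindings if "start" in b)
    return 100 * with_start / len(bindings)

# Mock rows shaped like SPARQLWrapper's JSON bindings: 1 of 24 qualified.
mock = [{"station": "...", "line": "..."} for _ in range(23)]
mock.append({"station": "...", "line": "...", "start": "1961"})
print(round(qualifier_coverage(mock), 1))  # 4.2
```

Running the query itself against the Wikidata endpoint (e.g. via SPARQLWrapper) and feeding the bindings to such a helper is the pattern behind the percentages reported in Section 4.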

This computational investigation generated quantitative evidence for modelling inconsistencies that manual inspection suggested but could not verify at scale. Results are detailed in Section 4.

3.2.3 Integrating Close and Distant Reading

The investigation methodology deliberately alternated between scales. Close reading identified patterns to investigate systematically; distant reading revealed their distribution and consistency; anomalies in distant reading prompted targeted close reading of specific entities. This scalar mixing addresses a fundamental challenge in large knowledge base analysis: neither purely qualitative nor purely quantitative approaches adequately capture modelling complexity.

3.3 Phase 3: Modelling-Aware Upload Planning

Our upload strategy must balance three competing concerns: historical accuracy and nuance, compatibility with Wikidata’s existing practices and community expectations, and long-term maintainability by editors unfamiliar with the originating research project.

3.3.1 Temporal Modelling Strategy Selection

Based on investigation findings (detailed in Section 4), we developed decision heuristics for representing temporal properties:

For discrete operational periods (e.g., ghost stations):

  • Use ‘connecting service’ (P1192) with ‘start time’ (P580) and ‘end time’ (P582) qualifiers

  • Add ‘significant event’ (P793) linking to Berlin Wall construction (Q108529701) for context

  • Use ‘state of use’ (P5817) with temporal qualifiers: “in use” (Q55654238) or “abandoned” (Q63065035)

This approach leverages Wikidata’s support for representing information with “weaker logical status” through ranked statements and temporal qualifiers (Di Pasquale et al., 2024). However, our investigation (detailed in Section 4.2) revealed that these mechanisms remain underutilised in existing transport infrastructure representation.

For continuously varying properties (e.g., service frequency):

  • If limited variation (2–3 levels): use qualifiers with ‘point in time’ (P585)

  • If complex variation (hourly schedules): link to external time-series dataset via ‘described by source’ (P1343)

  • Document decision rationale in item talk page

For name changes:

  • Maintain single item with current name as primary label

  • Add historical names via ‘official name’ (P1448) with temporal qualifiers

  • Include ‘named after’ (P138) relationships when relevant

Frequency and capacity modelling challenges:

Service frequency and vehicle capacity present distinct modelling challenges currently unaddressed in Wikidata’s transport representation. Investigation revealed that even well-documented systems like London Underground lack frequency measures, though some lines include vehicle normally used (P3438) with temporal qualifiers (e.g., Victoria line Q203030 shows London Underground 1967 Stock operational 1968–2011, then 2009 Stock from 2009 onward). However, these and all identified public transport vehicles lack capacity statements despite maximum capacity (P1083) being an allowable property for vehicles.

For the service frequency of a transport line, no established pattern exists; event interval (P2257) offers one approach but requires substantial abstraction from timetable data. Our strategy uses event interval (P2257) with a point in time (P585) for snapshot measurements (e.g., “20-minute intervals” qualified with “Monday, 31 March 1971 at 7:30 AM”), consistent with how frequency information was extracted from Fahrplanbücher. This sacrifices comprehensiveness for verifiability and aligns with timetable sources’ structure. Future work might explore more granular temporal modelling, but we prioritise implementability over theoretical completeness.
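Deriving the snapshot value for such a P2257 statement from a transcribed timetable page reduces to averaging the gaps between successive departures around the sampling time. A small sketch with invented departure times (not actual Fahrplanbuch data):

```python
# Average headway, in minutes, across a run of transcribed departure times.
def headway_minutes(times: list) -> float:
    """times: 'HH:MM' strings in chronological order."""
    mins = [int(t[:2]) * 60 + int(t[3:]) for t in times]
    gaps = [b - a for a, b in zip(mins, mins[1:])]
    return sum(gaps) / len(gaps)

# Invented departures bracketing the 7:30am weekday sampling point.
departures = ["07:10", "07:30", "07:50", "08:10"]
print(headway_minutes(departures))  # 20.0
```

The resulting value would become the P2257 quantity, with the sampling moment recorded as the P585 qualifier.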

3.3.2 Geographic Uncertainty Documentation

Investigation revealed that existing coordinates lack derivation documentation. Our strategy for new stations georeferenced during the project therefore includes the following documentation:

Coordinate precision qualifiers:

  • Georeferenced historical map: 0.01 (hectometre)

  • Street intersection estimation: 0.1 (kilometre)

Provenance documentation:

  • Use ‘determination method or standard’ (P459) on coordinate statements

  • Values: “historical map digitisation”/“street intersection”

3.3.3 Reference Architecture Design

We developed an ideal design for the statements derived from Fahrplanbücher during the project. The statements should receive the following standardised references:

claim: [property-value pair]

      reference:

            - stated in (P248): [Wikidata item for dataset “Berlin Public Transport Network 1946-1989”]

            - reference URL (P854): [Zenodo DOI]

            - publication date (P577): [dataset publication date]

            - retrieved (P813): [date of query/upload]

            - page(s) (P304): [specific Fahrplanbuch identifier, e.g., “BVG Fahrplanbuch 1956”]

This structure follows Wikidata’s recommendations for authoritative sources, which prioritise primary sources with clear provenance chains over user-generated or self-published materials (Piscopo et al., 2017). The reference architecture enables verification while acknowledging that operational timetables occupy an ambiguous status between primary sources (for service patterns) and secondary sources (for infrastructure existence). By citing the archival location and catalogue codes of original Fahrplanbücher, we establish the dataset’s relationship to verifiable historical documents rather than relying on circular Wikimedia references.
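As plain data, one statement’s reference block under this design could be held as follows. The property IDs are those listed above; the values are illustrative placeholders, and the dict mirrors the pattern rather than the Wikidata API’s exact snak serialisation:

```python
# Hedged sketch: the standardised reference block for one uploaded claim,
# expressed as a plain mapping from property ID to value (placeholders only).
reference = {
    "P248": "Q<dataset item>",                           # stated in
    "P854": "https://doi.org/10.5281/zenodo.16352848",   # reference URL
    "P577": "2025-10-26",                                # publication date
    "P813": "2025-11-30",                                # retrieved
    "P304": "BVG Fahrplanbuch 1956",                     # source identifier
}
print(sorted(reference))
```

A batch-upload script would expand each such mapping into the qualifier/reference format expected by whichever upload tool is used (e.g. QuickStatements or the Wikibase API).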

4. Results and Discussion

This section presents findings from our computational investigation of Berlin public transport representation in Wikidata, demonstrating how systematic analysis reveals tensions between Wikidata’s design and historical research requirements. Our analysis draws on data quality frameworks assessing linked data across dimensions including completeness, consistency, and timeliness (Färber et al., 2017), applying these specifically to historical infrastructure.

4.1 Coverage Disparities Across Transport Modalities

Analysis of reconciliation results revealed significant coverage gaps that varied by transport modality (Table 1). U-Bahn and S-Bahn achieved complete coverage (100% of dataset stations had Wikidata IDs), reflecting these systems’ persistent physical infrastructure and documented histories. Conversely, trams demonstrated 36.3% coverage with a complete East Berlin bias1 (see Figure 1). Bus stops showed near absence at 0.1% coverage despite comprising the largest network by station count.2

Table 1

Wikidata Coverage by Modality.

| Modality | Dataset records [2] | With Wikidata ID | Coverage % | Unique stations in Wikidata |
| --- | --- | --- | --- | --- |
| S-Bahn | 1767 | 1767 | 100.0% | 579 |
| U-Bahn | 1459 | 1459 | 100.0% | 482 |
| Tram | 3869 | 1404 | 36.3% | 923 |
| Bus | 12413 | 12 | 0.1% | 47 |
Figure 1

Berlin Tram Stations.

This pattern illustrates an “evolutionary bias”: Wikidata captures durable infrastructure comprehensively while ephemeral systems remain fragmentary, potentially misleading researchers about historical transport composition. This coverage bias reflects broader patterns in Wikidata, where well-known and contemporary entities receive more attention than historical and local phenomena (Farda-Sarbas & Müller-Birn, 2019). The absence of West Berlin trams creates a coverage gap that inverts historical reality, as trams were abundant in West Berlin before their complete removal by 1967.
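The coverage percentages in Table 1 follow directly from the reconciliation counts. A minimal sketch of the computation, using the record counts reported in Table 1:

```python
# Recompute the Table 1 coverage percentages from the raw reconciliation
# counts: (dataset records, records with a Wikidata ID) per modality.
counts = {
    "S-Bahn": (1767, 1767),
    "U-Bahn": (1459, 1459),
    "Tram":   (3869, 1404),
    "Bus":    (12413, 12),
}

coverage = {
    modality: round(matched / total * 100, 1)
    for modality, (total, matched) in counts.items()
}
print(coverage)  # Tram -> 36.3, Bus -> 0.1
```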

4.2 Temporal Modelling Inadequacy

Beyond coverage quantity, systematic inadequacy emerged in representing temporal complexity. Analysis of temporal qualifiers across properties revealed that even well-covered modalities lack temporal documentation on time-varying relationships (Table 2).

Table 2

Temporal Qualifier Usage on Key Properties.

| Property | Modality | Total statements | With start time (P580) | Coverage % |
| --- | --- | --- | --- | --- |
| connecting line (P81) | U-Bahn | 602 | 25 | 4.2% |
| state of use (P5817) | U-Bahn | 498 | 0 | 0.0% |
| connecting line (P81) | S-Bahn | 255 | 0 | 0.0% |
| replaces (P1365) | U-Bahn | 6 | 0 | 0.0% |

Only 4.2% of U-Bahn connecting line statements include temporal qualifiers indicating when those connections held. For S-Bahn, no connecting line statements include temporal bounds. This present-bias collapses historical network evolution into atemporal snapshots, making temporal queries unreliable. These findings align with broader patterns observed in Wikidata, where only 33% of events include temporal information despite the platform’s technical capacity to represent temporal scoping through qualifiers (Gottschalk & Demidova, 2018).
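Qualifier coverage of this kind can be measured with a SPARQL pattern of the following shape. The query below is a simplified sketch: the linkage via transport network (P16) and the network QID are placeholders, not the project's actual query.

```python
# Sketch of the qualifier-coverage measurement behind Table 2: count
# connecting line (P81) statements on network stations with and without a
# start time (P580) qualifier. Q000000 is a placeholder network QID.
QUERY = """
SELECT (COUNT(?stmt) AS ?total)
       (COUNT(?start) AS ?withStart)
WHERE {
  ?station wdt:P16 wd:Q000000 ;        # station in the (placeholder) network
           p:P81 ?stmt .               # connecting line statement node
  OPTIONAL { ?stmt pq:P580 ?start . }  # start time qualifier, if present
}
"""

def qualifier_coverage(total: int, with_start: int) -> float:
    """Share of statements carrying a start time qualifier, in percent."""
    return round(with_start / total * 100, 1) if total else 0.0

# Counts reported in Table 2 for U-Bahn connecting line statements:
print(qualifier_coverage(602, 25))  # 4.2
```

The `p:`/`pq:` prefixes address the statement node and its qualifiers directly, which is what makes this audit possible at all; `wdt:` alone would hide the qualifier layer.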

Ghost stations exemplify this challenge most acutely. Analysis of 41 entities classified as ghost stations (Q168565) found 90.2% lacked any temporal qualifiers marking their interruption periods (Table 3). Known Berlin ghost stations (Nordbahnhof, Jannowitzbrücke) that ceased operations 1961–1989 appear in queries without indication of their inoperability, potentially misleading researchers about network topology during division.

Table 3

Ghost Station Temporal Qualification.

| Temporal coverage | Count | Percentage |
| --- | --- | --- |
| With start time (P580) | 4 | 9.8% |
| With end time (P582) | 2 | 4.9% |
| With both qualifiers | 2 | 4.9% |
| No temporal qualifiers | 37 | 90.2% |

A concrete example illustrates the practical consequences. Querying “Which S-Bahn lines served Nordbahnhof station in 1975?” returns the same results as “Which S-Bahn lines serve Nordbahnhof station today?” This occurs because connecting line statements lack temporal qualifiers that would differentiate between three operational periods: pre-1961 network, 1961–1989 closure, and post-1989 operation. A researcher unfamiliar with Berlin’s division history would receive a technically accurate but historically misleading answer, as the station lay within East Berlin but served no passengers during the division period. This information exists in Wikidata (the station has a ghost station (Q56274561) classification) but remains disconnected from the operational relationships, exemplifying how temporal information captured in one property may not propagate to related statements.
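This failure mode can be made concrete: when start time and end time qualifiers are missing, a date-scoped lookup cannot distinguish historical states. The sketch below uses hypothetical statement data loosely modelled on the Nordbahnhof case; line names and dates are illustrative, not taken from Wikidata.

```python
from datetime import date

# Each tuple: (line, start time P580, end time P582); None = no qualifier.
# Hypothetical connecting-line statements: without qualifiers, every line
# appears valid at every date.
unqualified = [("S1", None, None), ("S2", None, None)]
qualified = [
    ("S1", date(1936, 1, 1), date(1961, 8, 13)),  # pre-division service
    ("S1", date(1990, 9, 1), None),               # post-reopening service
]

def lines_at(statements, when):
    """Lines whose temporal bounds cover `when`; missing bounds count as open."""
    return sorted({
        line for line, start, end in statements
        if (start is None or start <= when) and (end is None or end >= when)
    })

print(lines_at(unqualified, date(1975, 1, 1)))  # ['S1', 'S2'] -- misleading
print(lines_at(qualified, date(1975, 1, 1)))    # [] -- closure is visible
```

The unqualified data answers the 1975 query exactly as it answers the present-day query, which is precisely the ambiguity described above.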

Wikidata’s optional temporal qualifiers create an uneven landscape where some relationships have precise bounds while others imply permanency. This challenge echoes fundamental questions about representing historical sources in linked data. Entities that undergo discontinuous existence (opening, closing, reopening) resist straightforward RDF representation because their identity persists across gaps in their physical or operational continuity (Meroño-Peñuela & Hoekstra, 2014).

4.3 Provenance and Data Quality Issues

Beyond the temporal modelling inadequacies detailed in Section 4.2, our computational investigation revealed systemic data quality issues related to property completeness, semantic ambiguity, and provenance. The property coverage heatmap (Figure 2) provides a quantitative summary of these challenges, illustrating how Wikidata’s “evolutionary bias” extends beyond entity coverage to entity completeness.

Figure 2

Modality & Property Heatmap.

The heatmap demonstrates that heavy rail modalities (U-Bahn and S-Bahn) show near-100% coverage for foundational properties like coordinate location (P625), administrative territorial entity (P131), and transport network (P16). The tram and bus systems lack this completeness: date of official opening (P1619), for instance, is recorded for only 1.1% of tram stops and 4.3% of bus stops. Likewise, properties essential for network analysis, such as connecting line (P81) and fare zone (P3610), are virtually absent for trams and buses. This pattern indicates that while an entity for a stop may exist, it often lacks the core temporal and relational data required for historical analysis.

Perhaps more problematic than these overt gaps is the semantic and geographic ambiguity hidden behind high coverage scores. The administrative territorial entity (P131) property, for example, shows 100% coverage for all four modalities, yet our qualitative analysis found its application to be inconsistent, with statements arbitrarily linking to different hierarchical levels (43% city-level, 31% borough-level, 18% locality-level). Critically, no stations systematically use temporal qualifiers to mark their location in East or West Berlin, preventing reliable queries such as “find all stations operational in East Berlin in 1975”.

These problems of completeness and ambiguity are underpinned by a systemic failure in provenance, which poses a significant barrier to scholarly use. Our analysis revealed that 73% of infrastructure measurements (such as line length) lack any references at all. Where references do exist, they often create circular verification chains within the Wikimedia ecosystem: 31% of analysed references relied solely on imported from Wikimedia project (P143), pointing to a Wikipedia edition that may itself lack citations. This circularity prevents tracing claims to archival or primary sources, undermining scholarly verification requirements.

The challenges of time-sensitive data, such as patronage figures, exemplify this convergence of issues. We found 42% of ridership figures lacked point in time (P585) qualifiers, rendering the statistics meaningless without temporal context. Of those that did possess a temporal qualifier, only 34% included a source reference, making the claim unverifiable.
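Audits of this kind are straightforward to automate. A sketch that flags time-sensitive statements missing a point in time (P585) qualifier or a reference; the statement records below are hypothetical illustrations, not real Wikidata data:

```python
# Flag time-sensitive statements (e.g. ridership figures) that lack a
# point in time (P585) qualifier or any reference. Records are invented
# for illustration.
statements = [
    {"value": 120_000, "point_in_time": "1975", "references": ["BVG report"]},
    {"value": 95_000,  "point_in_time": None,   "references": []},
    {"value": 110_000, "point_in_time": "1980", "references": []},
]

def audit(stmts):
    """Split statements into those missing temporal context and those
    that are dated but unreferenced (and hence unverifiable)."""
    missing_time = [s for s in stmts if s["point_in_time"] is None]
    unreferenced = [s for s in stmts
                    if s["point_in_time"] and not s["references"]]
    return missing_time, unreferenced

missing_time, unreferenced = audit(statements)
print(len(missing_time), len(unreferenced))  # 1 1
```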

5. Implications and Applications

The investigation documented in Sections 3 and 4 reveals tensions between Wikidata’s design and the requirements of historical infrastructure research. This section examines platform limitations requiring community attention, strategies for managing research complexity alongside preservation needs, and the transferability of this workflow to other historical domains.

5.1 Platform Limitations for Historical Representation

Our investigation identified two specific inadequacies in Wikidata’s current property architecture for historical infrastructure.

First, service frequency remains unrepresentable in any systematic way. Our Fahrplanbücher transcription captured frequency data like “20-minute intervals at 7:30am on weekdays in 1967,” but no existing property adequately models this. While event interval (P2257) with point in time (P585) qualifiers offers a partial solution, it requires multiple statements for each temporal condition (morning/evening, weekday/weekend), creating unsustainable maintenance burdens.
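The maintenance burden can be quantified by enumerating the statements such modelling would require for a single line. The qualifier names and the grid of conditions below are hypothetical, chosen only to show how the statement count multiplies:

```python
from itertools import product

# One observation ("20-minute intervals at 7:30am on weekdays in 1967")
# generalises to a grid of temporal conditions, each needing its own
# statement with qualifiers. Years, day types, and period labels are
# illustrative; the qualifier names are not real Wikidata properties.
years = [1956, 1967, 1978, 1989]
day_types = ["weekday", "Saturday", "Sunday"]
periods = ["morning peak", "midday", "evening"]

statements = [
    {"property": "event interval (P2257)",
     "qualifiers": {"point in time (P585)": y,
                    "day type": d, "time of day": p}}
    for y, d, p in product(years, day_types, periods)
]
print(len(statements))  # 36 statements for a single line
```

Even this modest grid yields dozens of statements per line, which is the unsustainable maintenance burden referred to above.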

Second, operational discontinuity remains awkward to represent and nearly impossible to query reliably. As Section 4.2 demonstrated, ghost stations illustrate this problem. A station operational 1936–1961, closed 1961–1989, and then reopened requires constructing this sequence through multiple property–qualifier combinations whose relationships remain implicit. Current best practice involves separate statements for each service relationship with temporal bounds, plus state of use (P5817) statements indicating “in use” and “abandoned” with corresponding date ranges. While technically possible, this makes temporal queries fragile: SPARQL queries must reconstruct operational status from scattered evidence rather than interrogating it directly.
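The reconstruction a query must perform can be sketched in a few lines. The state records below follow the example dates above (operational 1936–1961, closed 1961–1989, reopened) and are otherwise hypothetical:

```python
# Reconstruct a station's operational status at a given year from
# state of use (P5817)-style statements with start/end bounds, mirroring
# what a SPARQL query must piece together from scattered qualifiers.
states = [
    ("in use",    1936, 1961),
    ("abandoned", 1961, 1989),
    ("in use",    1989, None),   # open-ended: still operating
]

def status_at(states, year):
    """Return the state whose [start, end) interval covers `year`."""
    for state, start, end in states:
        if start <= year and (end is None or year < end):
            return state
    return "unknown"

print(status_at(states, 1975))  # 'abandoned'
print(status_at(states, 2000))  # 'in use'
```

Note that the correctness of this reconstruction depends entirely on every state statement carrying its bounds; with the 90.2% of ghost stations lacking temporal qualifiers, no such sequence can be recovered.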

These problems extend beyond transport infrastructure. Any historical entity with cyclical states faces identical modelling challenges. Domain-specific modelling patterns reveal systematic variations in how temporal complexity is handled, suggesting that community guidance rather than platform constraints determines representation quality (Baroncini et al., 2022). We do not propose specific property designs here but flag these limitations as barriers to systematic historical data integration.

5.2 Dual Infrastructure Strategy

The workflow presented in this paper deliberately separates research infrastructure from preservation infrastructure. During active research, our data resides in Neo4j (built from the CSV tables published on Zenodo), enabling complex network analysis queries that would prove cumbersome or impossible via SPARQL. Attempting to use Wikidata as primary research infrastructure would create several problems. Each data correction would require editing public entities, potentially disrupting community maintenance work. Complex graph algorithms run far more efficiently in purpose-built graph databases than via SPARQL federation. Experimental modelling requires flexibility that collaborative editing environments necessarily constrain.

However, research databases face their own limitations. Neo4j instances depend on institutional server infrastructure vulnerable to funding interruptions and technical obsolescence. Custom schemas resist integration with other datasets. These weaknesses mirror Wikidata’s strengths of community maintenance, linked open data integration, and preservation commitment.

Our strategy therefore treats database choice as temporal rather than exclusive. Neo4j serves active research while Wikidata serves preservation and dissemination. The upload occurs as a single batch operation upon project completion, transforming research data into linked open data without forcing research itself into Wikidata’s constraints. This requires maintaining identifier mappings and documenting modelling decisions but preserves analytical flexibility during research while ensuring long-term accessibility afterward.
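Maintaining the identifier mapping mentioned above can be as simple as a versioned lookup table kept alongside the dataset. A minimal sketch with hypothetical internal IDs and placeholder QIDs:

```python
import csv
import io

# Hypothetical mapping between internal database IDs and Wikidata QIDs,
# stored as a versioned CSV so the final batch upload and any later
# round-trips remain reproducible. IDs and QIDs are placeholders.
MAPPING_CSV = """internal_id,wikidata_qid,matched_by
station_0001,Q99999901,label+coordinates
station_0002,Q99999902,label+coordinates
station_0003,,unmatched
"""

mapping = {
    row["internal_id"]: row["wikidata_qid"] or None
    for row in csv.DictReader(io.StringIO(MAPPING_CSV))
}
matched = sum(1 for qid in mapping.values() if qid)
print(matched, len(mapping))  # 2 3
```

Keeping the match method in a third column documents each reconciliation decision, which supports the provenance requirements discussed in Section 4.3.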

5.3 Workflow Transferability

While this paper focuses on Berlin public transport, the three-phase methodology applies to historical infrastructure domains more broadly. The “evolutionary bias” we documented likely characterises other domains.

Beyond infrastructure, the investigation methodology adapts to other historical domains. Projects working with historical organisations, events, or cultural practices could apply the same scalar reading approach. Close examination of exemplar entities identifies modelling patterns. Systematic SPARQL queries measure how consistently those patterns appear across the domain. This investigation reveals coverage gaps, identifies modelling decisions requiring documentation, surfaces data quality issues, and exposes representational limitations.

The workflow’s most transferable element may be its analytical stance. By investigating before uploading and treating integration as negotiation between research requirements and platform affordances, researchers increase the likelihood that their contributions prove sustainable. This acknowledges that contributing to collaborative knowledge bases differs from publishing traditional research outputs.

5.4 Conclusion

The workflow presented offers a practical model for digital humanities projects. Research databases provide analytical flexibility during active work, while Wikidata ensures long-term preservation afterward. The investigation methods and modelling decisions documented here provide templates for other historical infrastructure projects considering similar integration work.

The Berlin transport dataset will be uploaded to Wikidata upon completion of the doctoral project. Until then, it remains available on Zenodo for researchers interested in Cold War infrastructure or historical transport networks.

Data Accessibility Statement

The dataset described in this paper is openly available at Zenodo: https://doi.org/10.5281/zenodo.16352848. The investigation code is available at: https://scm.cms.hu-berlin.de/baumanoa/johd_wikidata_paper.

Notes

[1] All Berlin tram stations in Wikidata are either in East Berlin or are new stations in West Berlin created when the tram network expanded westward after 2000; trams had been completely removed from West Berlin by 1967.

[2] This figure counts all station entries across timetable snapshots; the number of unique stations is lower but hard to define precisely, particularly for the bus network.

Acknowledgements

The author acknowledges the BVG-Archiv and the Wilhelm-Grimm-Zentrum (Jacob-und-Wilhelm-Grimm-Zentrum der Humboldt-Universität zu Berlin) for providing access to the historical Fahrplanbücher that form the empirical basis of this research.

The initial data publication on Zenodo was supported by the NFDI4Memory FAIR Data Fellowship, supervised by the Deutsches Museum and funded by NFDI4Memory, which provided essential support for developing research data management practices and exploring data publication strategies. The fellowship’s emphasis on open science principles directly informed the workflow presented here.

Competing Interests

The author has no competing interests to declare.

Author Contributions


Noah J. Kim-Baumann: Conceptualisation, Data Curation, Formal Analysis, Investigation, Methodology, Software, Visualisation, Writing – Original Draft, Writing – Review & Editing.

DOI: https://doi.org/10.5334/johd.458 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 4, 2025
|
Accepted on: Dec 20, 2025
|
Published on: Feb 26, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Noah J. Kim-Baumann, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.