
Building Cultural Heritage Data Infrastructures with Wikidata: The Case of the Congruence Engine Data Register

Open Access | Feb 2026


(1) Context and motivation

This article discusses the development and implementation of a prototype Data Register for Congruence Engine (CE), one of five “Discovery Projects” funded by the Arts and Humanities Research Council under the “Towards a National Collection” programme.1 Led by the Science Museum Group, CE ran for just over three years, from November 2021 to January 2025. The aim of the project was to connect industrial history collections held in different museums and archives across the United Kingdom (UK) using the latest digital techniques. The project ultimately proposed developing “a minimum technical passive provision to serve as a foundational infrastructure for drawing together a cultural heritage national collection” (Boon et al., 2025, p. 61). Such an infrastructure would include a simple data register recording the whereabouts of museum, library and archive collections, heritage data and other historical materials, and the linkages between them.

(2) Background

(2.1) The rationale behind the CE Data Register: a prototype for a minimum community-driven infrastructure

While it was beyond the scope of the project to develop a national-level data register, the prototyping of an accessible, scalable register for the data collected by the CE team and its partners was a major objective. The main purpose of the CE Data Register is to record and describe distributed datasets and to facilitate comprehensive search, discovery, and linking of those datasets.

The CE Data Register, described in detail below, is a collaboratively designed and curated record of collections and cultural heritage data related to the UK’s industrial past (see Figure 1). It is intended to help people interested in industrial history to find, use, edit, and contribute to this type of cultural heritage. But it is also intended to be a model for others to copy, develop, and extend. Consequently, it was developed as a lightweight, exploratory proof of concept rather than a fully-fledged production resource, while remaining scalable and sustainable beyond the funding period that enabled its creation.

Figure 1

Data Register Landing page https://congruence-engine.wikibase.cloud/wiki/Main_Page [last accessed: 26 February 2026].

A key concept and organising principle for CE was that of the “social machine”, which has been explored in depth by Shadbolt et al. (2019). Social machines exhibit many different characteristics, but of most significance for us was the idea that “social machines have the potential to empower, via a democratic spirit of cooperation and respect. They bring people together, using technology to enable cooperation and communication at scale” (Shadbolt et al., 2019, pp. 4–6). The CE Data Register was an example of this in practice, drawing on lightweight, open infrastructure made available by the Wikimedia Foundation, which is itself responsible for one of the most successful examples of a social machine—Wikipedia. Through a socially-infused infrastructure built on Wikidata and Wikibase, users and communities—ranging from cultural heritage professionals in institutions to researchers, students, educators, and the general public—can contribute knowledge and work by enriching and linking datasets. In this way, infrastructure development becomes a genuinely democratic and social process.

(2.2) A Bradford-inspired experiment

The initial idea for the CE Data Register emerged from close collaboration with community groups in Bradford, which was an important locus of participatory research for the project. In conversation with community groups in and around the city, a shared interest emerged, not just in sharing information about the data held or created by CE, but in raising awareness of other sources of relevant data, wherever they might be held. Once a mechanism to do this was in place, there was further potential to allow data descriptions to be added by anyone, facilitating community “stewardship” of the information.

The first step in achieving this aim was to commission research that would: conduct an audit to identify and map relevant data and sources for the industrial history of Bradford; develop a draft specification for a public register; and recommend an appropriate form of knowledge organisation.2 Bradford was initially used as a proof of concept, but the approach was gradually scaled up to encompass the broader scope of the CE project, addressing the UK’s industrial heritage.

(3) Method

(3.1) Contextualising the problem: challenges in humanities and cultural heritage data infrastructures

Humanities and cultural heritage data infrastructures continue to grapple with challenges stemming from the inherent heterogeneity of their materials. This diversity appears across multiple dimensions. The coexistence of varied data formats—textual, visual, audio, video, tabular, and multimodal—often within a single collection, is compounded by differing digitisation standards: many datasets stem from early digitisation or legacy research projects, while others are generated through contemporary initiatives, creating difficulties in standardisation, integration, and long-term preservation. Metadata variation further fragments the landscape, as collections rely on descriptive schemas shaped by disciplinary traditions, institutional priorities, and evolving best practices. Adding to this complexity is a multi-faceted rights regime, with materials ranging from being openly licensed to tightly copyright restricted—or indeed of unknown status.

Together, these variations form a fragmented and uneven ecosystem that complicates the development of cohesive, sustainable, and equitable data infrastructures across the humanities and cultural heritage. Efforts to build shared registries or discovery platforms must recognise not only the diversity of data formats and structures but also differences in digitisation practice, institutional context, and access rights. The goal is not to enforce uniformity but to design interoperable frameworks that can connect, contextualise, and make sense of heterogeneous resources, enabling scholars and professionals to locate, retrieve, analyse, and reuse data responsibly.

Given this, we decided from the outset to ground our infrastructure repository/data register design in four principal requirements:

  1. Federated architecture. The system must connect disparate datasets without forcing uniformity or centralised control, respecting institutional and project autonomy while enabling discovery.

  2. Accessibility. Both data entry and querying must be achievable by users without extensive technical expertise. The system should support simple data entry interfaces and, ideally, automated ingestion (via APIs or scripts) to encourage broad adoption.

  3. Extensibility and reusability. The data model must scale from small individual projects to large institutional or national datasets, supporting data reuse through standardised identifiers, linked data principles, and integration with external platforms.

  4. Sustainability. The infrastructure must be open-source, community-supported, and interoperable to ensure long-term viability beyond individual project or funding cycles.

(3.2) Wikidata, humanities, and cultural heritage data infrastructures

Originally developed in 2012 by the Wikimedia Foundation, Wikidata was envisioned as a cornerstone of the linked open web—a freely accessible, multilingual database designed to serve as a central source for structured data such as dates of birth, geographic coordinates, and name authority records. Its creation aimed to mitigate the effects of “cultural diglossia”, empowering smaller communities to achieve greater visibility and impact within the Wikimedia ecosystem (Hinojo, 2015). Functioning as a truly multilingual and collaborative knowledge base, Wikidata uses language-independent identifiers that link to names in hundreds of languages and connect to external authority records. As Poulter (2017) notes, it has become “the ultimate authority file; a modern Rosetta Stone connecting identifiers from institutions’ authority files, scholarly databases, and other catalogues (Hinojo 2015)”.

Wikidata’s open structure and flexible data model allow users to perform complex, dynamic queries, making it not only a cross-institutional, multilingual mega-aggregator but also a powerful foundation for open, interoperable knowledge infrastructures. Over time, this vast, collaboratively curated dataset has evolved into a global cultural and semantic hub, connecting knowledge across languages, disciplines, and institutions. It enables new forms of discovery by revealing connections and relationships that were previously invisible, transforming how we understand and navigate cultural and scholarly data.

Although Wikidata was envisioned as a natural home for cultural heritage collections (Fauconnier et al., 2015; Poulter, 2017), its uptake across the Galleries, Libraries, Archives, and Museums (GLAM) sector has remained sporadic and largely experimental. Many institutions have engaged with it through short-term projects or pilot integrations as part of edit-a-thons, but few have embedded Wikidata into their long-term digital infrastructure strategies or collection management workflows. This limited adoption is often due to institutional barriers such as a lack of technical expertise, uncertainty, and a well-documented risk-averse attitude towards data ownership and governance (Wallace, 2022), as well as concerns about aligning institutional metadata standards with Wikidata’s open, community-driven structure.

A notable exception is the National Library of Wales, which has been a pioneer in integrating Wikidata into its digital ecosystem. By appointing the first Wikipedian in Residence, the Library strategically embedded open knowledge practices into its operations, using Wikidata as a tool for enhancing collections’ metadata via the Semantic Name Authority Repository Cymru (SNARC) service, a customised data pipeline designed to bridge open, crowdsourced knowledge from Wikidata with the Library’s authoritative metadata systems. SNARC effectively serves as a mediation layer between open and institutional data environments: it simplifies the complexity of Wikidata’s structure, allowing the Library to adapt data models to its own standards, while maintaining high levels of data quality and provenance control (Evans, 2024). The initiative not only aligned the Library’s collections with global linked data networks but also demonstrated a replicable model for other GLAM institutions seeking to engage meaningfully with open knowledge infrastructures.

(3.3) Wikibase as semantic infrastructure

Wikibase emerged relatively quickly as a candidate platform for the Data Register, designed as it is to connect “the knowledge of people and organizations” and deliver “the infrastructure for the participation of humans and machines in the world of linked open data”.3

Wikibase offers an open-source, flexible infrastructure built to support collaborative knowledge creation and participation in the Linked Open Data (LOD) ecosystem. Originally developed for Wikidata.org and maintained by Wikimedia Deutschland, Wikibase is designed to handle structured data with a highly adaptable and flexible model that accommodates both human- and machine-readable formats. Its core data model comprises items (entities), properties (relationships), and statements (assertions with qualifiers and references). This Resource Description Framework (RDF)-based structure represents rich, structured, and interconnected data aligned with semantic web principles. Wikibase shines in collaborative environments where multiple contributors are involved in creating and editing data. Its community-driven nature ensures a growing ecosystem of support, making it an ideal tool for institutions and projects seeking a transparent, open, and sustainable approach to managing complex, interlinked data.
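To make the item/property/statement model concrete, the sketch below represents a single register item in a simplified form of the JSON shape Wikibase uses (statements grouped by property, each with a main value, qualifiers, and references), and flattens it into triples. All identifiers (Q42, P10, P12, Q7) are hypothetical, not the Register’s actual IDs.

```python
# Simplified sketch of Wikibase's statement model. Identifiers are
# hypothetical placeholders, not real Congruence Engine Data Register IDs.

dataset_item = {
    "id": "Q42",  # hypothetical item ID for a registered dataset
    "labels": {"en": "Bradford textile mills employment records"},
    "claims": {
        # property ID -> list of statements about this item
        "P10": [  # hypothetical "held by" property
            {
                "mainsnak": {"property": "P10", "datavalue": "Q7"},  # hypothetical organisation item
                "qualifiers": {"P12": ["1890-1960"]},                # hypothetical "coverage" qualifier
                "references": [{"P854": "https://example.org/finding-aid"}],
            }
        ]
    },
}

def triples(item):
    """Flatten statements into simple (subject, property, value) triples."""
    return [
        (item["id"], prop, st["mainsnak"]["datavalue"])
        for prop, statements in item["claims"].items()
        for st in statements
    ]

print(triples(dataset_item))  # → [('Q42', 'P10', 'Q7')]
```

The qualifiers and references attached to each statement are what distinguish this model from a flat record: every assertion can carry its own context and provenance.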

To simplify setup and management, we opted for Wikibase.Cloud,4 a hosted version of Wikibase operated by the non-profit Wikimedia Deutschland. This platform is designed to “make it easier than ever to create, connect, and grow a linked open database”.5 It offers a fully functional instance of Wikibase that can be deployed within minutes of sign-up, with hosting and maintenance handled externally, removing the need for us to manage infrastructure or backend systems. Wikibase.Cloud is also free to use, funded through donations to Wikipedia, and is General Data Protection Regulation (GDPR)-compliant, making it a good fit for public and academic institutions. Though still in open beta, it is under active development and already supports many of the features we rely on, such as the Query Service, QuickStatements and integration with OpenRefine.

For the CE Data Register, Wikibase.Cloud offered an optimal balance between functionality and simplicity. We chose Wikibase for its granular, extensible data structure—essential for modelling the complexity of our heritage collections—and its capability for linking data across sources. A core strength of Wikibase is its ability to support interconnected knowledge representation, making it ideal for linking data points and uncovering meaningful relationships within heritage collections. Its capacity to track provenance and attribution through built-in version history was especially important for our scholarly and heritage-focused work. As an open, standards-based tool with strong community support, it offers long-term viability without vendor lock-in, relying instead on community-driven governance for sustainability.

Finally, we chose Wikibase.Cloud for its ease of setup, defined focus, and ability to model data beyond the constraints of Wikidata’s schema. While it comes with some limitations—such as restricted customisation options, no local file uploads, a steep learning curve, and data export only via the Query Service—we found that its advantages far outweighed the drawbacks for our current project needs.

(3.4) Data modelling with Wikidata: from silos to semantic networks

Based on the four principal design requirements outlined above, we developed a data model that captures metadata rather than datasets themselves, defining the minimum information needed to identify, contextualise, and locate datasets across distributed repositories. In practical terms, each dataset is represented as a distinct entry (“item”) that can be described through multiple characteristics (“statements”, each consisting of a “property” and a “value”): what subjects and topics it covers, where it can be accessed, which organisation or person holds it, its copyright and licensing status, associated projects, and a datasheet where available. It is important to reiterate that the Data Register is primarily a proof-of-concept infrastructure focused on capturing and linking metadata to support dataset discovery and reuse, rather than hosting or storing datasets themselves. As such, there is no single repository of datasets within the Register;6 however, the datasheets for the sample datasets used in the prototype are available via the Science Museum Group Research Repository.7
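The “minimum metadata” idea can be sketched as a simple validation check: an entry is acceptable if it carries just enough information to identify, contextualise, and locate a dataset. The field names below are illustrative inventions for this sketch, not the Register’s actual properties.

```python
# Sketch of a minimum-metadata check. Field names are hypothetical
# illustrations, not the actual Data Register property set.

REQUIRED = {"label", "subject", "access_url", "holding_organisation"}
OPTIONAL = {"licence", "associated_project", "datasheet_url"}

def validate_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry is acceptable."""
    problems = [f"missing required field: {f}" for f in sorted(REQUIRED - entry.keys())]
    problems += [f"unknown field: {f}" for f in sorted(entry.keys() - REQUIRED - OPTIONAL)]
    return problems

entry = {
    "label": "Bradford wool trade photographs",
    "subject": "textile industry",
    "access_url": "https://example.org/collection/123",
    "holding_organisation": "Example Archive",
}
print(validate_entry(entry))  # → []
```

In the Register itself this role is played by human-readable guidance rather than hard validation, in line with the pedagogical approach described below: the schema defines requirement levels, but entry is not blocked.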

The data structure and schema evolved through three iterative cycles of implementation, testing, and refinement. Each iteration involved practical assessment and reflection with users, ensuring the structure reflected both user needs and technical constraints (see Figures 2 & 3). Our evaluation focused on four dimensions aligned with our core design requirements: ease of data entry and curation (supporting accessibility); data quality and stability (ensuring sustainability); efficiency of data management workflows (enabling scalability); and capacity for reuse and interoperability through SPARQL querying and API exposure (facilitating extensibility).

Figure 2

Initial data mapping exercise.

Figure 3

Final data schema.

The iterative process revealed that supporting accessibility and quality required pedagogical as well as technical approaches. Rather than enforcing rigid validation rules that might create barriers to entry or impede users during data input, we developed user guides alongside an at-a-glance reference table detailing data type specifications, requirement levels, and entry guidelines, ensuring that accessibility did not come at the cost of standardisation while supporting consistent data creation across diverse user groups.

User feedback during the iterative cycles highlighted a particular challenge: the semantic data model presented a steep learning curve for many contributors who were more accustomed to linear or relational data structures. In response, we adopted Wikidata’s “described at URL” property as a key feature of our data model. Conceptually similar to a typical “website URL” field, it allows users to add basic links without needing to think about semantic connections. As users gain experience with linked data, the same property can accommodate richer contextual links, such as references to Wikidata entities. This design choice gave users the flexibility to elaborate on dataset characteristics in ways that felt intuitive to them, without requiring strict adherence to predefined relationships. By accommodating both rich semantic links and simple URL references, the model bridged different mental models of data organisation while maintaining the underlying linked data architecture; the data model itself became a learning tool, enabling users to adopt semantic web practices gradually, at their own pace. Through this process, the model also matured from a simple relational structure into a semantic schema expressing nuanced relationships between datasets, institutions, and projects. The emphasis shifted from representing discrete collections to mapping connections within a federated research environment.
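Batch data entry of this kind can use QuickStatements, which the platform supports (see Section 3.3). The sketch below generates QuickStatements v1 rows (tab-separated) that create an item and attach “described at URL” statements. The property number is an assumption: a local Wikibase assigns its own IDs, even where a property mirrors Wikidata’s “described at URL” (P973).

```python
# Sketch: generating QuickStatements v1 rows to batch-create a register item
# with "described at URL" links. The P-number is a hypothetical local ID.

DESCRIBED_AT_URL = "P973"  # assumption: local equivalent of Wikidata's P973

def quickstatements_rows(label: str, urls: list) -> list:
    rows = ["CREATE", f'LAST\tLen\t"{label}"']  # Len = English label
    rows += [f'LAST\t{DESCRIBED_AT_URL}\t"{u}"' for u in urls]
    return rows

for row in quickstatements_rows(
    "Bradford mill census returns",                  # hypothetical dataset
    ["https://www.wikidata.org/wiki/Q123456"],       # hypothetical related entity
):
    print(row)
```

Because the same property accepts both plain web links and links to Wikidata entities, the batch workflow mirrors the gradual path described above: contributors can start with simple URLs and move to semantic references later.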

This experience underscores that data modelling in humanities infrastructures is not a one-time technical exercise but an ongoing, collaborative process. By involving users throughout the development cycles, we gained critical insights that shaped both the technical architecture and the pedagogical scaffolding around it. The challenges users articulated, from navigating semantic relationships to managing metadata requirements, revealed not just obstacles to overcome but opportunities to design more thoughtful, adaptive systems. Ultimately, the iterative, user-centred approach allowed us to build infrastructure that balances semantic coherence with practical accessibility, creating a foundation that can evolve alongside its community of users.

Figure 4 describes the data structure (resulting from the data modelling exercise described above) and guidelines for the CE Data Register.

Figure 4

Data schema and guidelines https://congruence-engine.wikibase.cloud/wiki/User_Guide#Data_structure_for_the_Congruence_Engine_Data_Register [last accessed: 26 February 2026].

Finally, our implementation of the “described at URL” property connects items to a wider context (often Wikidata entries), situating the Data Register within a broader ecosystem of linked data. A user exploring a dataset might follow links to related resources, source materials, or organisations that would otherwise remain siloed and undiscoverable. This interconnectedness is made possible because Wikibase supports explicit linking to external data resources and persistent identifiers, offering a radically different approach to structuring, linking, and querying data. Unlike traditional cultural heritage catalogues, which typically store information as flat records, Wikibase models data as interconnected entities, enabling richer relationships, more flexible queries, and integration within a wider linked data ecosystem.

The SPARQL endpoint, accessible from the landing page, makes the entire data model queryable, meaning discovery is shaped by the data itself rather than by pre-built search interfaces.8 Users can also query against information that is not stored in the Data Register itself through federated queries such as:

  • “find datasets held by organisations located in a particular city or region” (where location is drawn from Wikidata)

  • “find datasets that can be used freely” (querying Wikidata’s structured licence properties rather than matching a specific licence name)
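The first of these federated queries might be sketched as follows (constructing, not executing, the SPARQL). The local prefix layout and property IDs (P10 “held by”, P20 link-to-Wikidata) are assumptions for illustration; the Wikidata property P131 (“located in the administrative territorial entity”) and the public query endpoint are real.

```python
# Sketch of a federated SPARQL query: find datasets whose holding
# organisation is located in a given place, with location data drawn live
# from Wikidata. Local prefixes and P10/P20 are hypothetical.

def datasets_in_city(city_qid: str) -> str:
    return f"""
PREFIX lwdt: <https://congruence-engine.wikibase.cloud/prop/direct/>
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>

SELECT ?dataset ?org WHERE {{
  ?dataset lwdt:P10 ?org .        # hypothetical local "held by" property
  ?org lwdt:P20 ?wikidataOrg .    # hypothetical link to the org's Wikidata item
  SERVICE <https://query.wikidata.org/sparql> {{
    ?wikidataOrg wdt:P131* wd:{city_qid} .  # located in the given area
  }}
}}"""

print(datasets_in_city("Q12345"))  # Q12345: hypothetical QID of the target city
```

The pattern is the point: the Register stores only the link to the organisation’s Wikidata item, and the SERVICE clause pulls in geographic facts that would be impractical to maintain locally.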

This approach keeps the Data Register minimal and maintainable while enabling queries that draw on the richness of the wider linked data ecosystem. This linked structure supports a mode of discovery that conventional websites cannot easily replicate. Rather than browsing a static list or relying on keyword search alone, users can traverse connections and uncover relationships. The data becomes a network rather than a catalogue.

The example queries above are specific to the Data Register’s data model, but they illustrate a broader principle. The value lies not in the particular queries but in the approach: linking to external structured data sources, particularly other Wikibase instances including Wikidata, and thereby gaining access to properties that would be impractical to maintain locally. If two independent data registers both link their datasets to the same Wikidata identifier for an organisation, queries could surface connections between them. In each case, the project benefits from querying against rich, externally-maintained data without needing to replicate or update that information itself.

This is a key consideration in linked data modelling: thinking carefully about where to draw the boundary between what is stored locally and what is referenced externally. The “described at URL” property in our data model reflects this thinking—keeping the Data Register minimal and maintainable while enabling discovery that extends far beyond its own contents.

(4) Results and discussion

(4.1) Opening up to the community: edit-a-thon and user guide

Given that the CE Register is intended to be a community-driven infrastructure, we prioritised engaging the community from the outset to ensure that it would effectively meet their needs. To this end, we organised a hybrid edit-a-thon, primarily involving CE Research Fellows. This event was designed to actively engage the community, allowing participants to contribute to the development of the CE Data Register while providing valuable feedback on its usability and functionality at a crucial point in its development. The goals of this event were multi-faceted, with each contributing to the broader mission of enhancing the CE Data Register. First and foremost, we aimed to assess the data model and to identify any gaps or areas for improvement. Next, we sought to test the data entry process through various use case scenarios to identify barriers faced by non-expert users. We also aimed to augment the data pool to refine the development of the web app, adding 17 new entries to the register in the process. Finally, the event underscored the need for clear and accessible instructions on the schema and data entry process, which led to a user guide.

An accessible user guide9 was developed to support the community using the Data Register. It includes context around the foundational technologies of Wikibase and Wikidata, as well as the principles of linked open data. In addition, it offers detailed guidance on how to effectively use the CE Data Register, including adding items and searching. By clearly outlining systems and processes around data entry and search, the user guide increases accessibility and inclusivity, making it easier for all users to engage with the Data Register, regardless of their technical expertise. The user guide further improves overall usability and ease of use, making the platform more intuitive and straightforward for non-expert users, while the clear guidelines help to ensure that practices remain consistent, which is vital for maintaining data quality over time, especially for a community-driven project. Finally, the well-documented processes in the user guide aid in the onboarding of new users and support the scalability of the register in the future.

(5) Implications/Applications

(5.1) Data Register web app

While Wikibase provides a robust infrastructure for structured data storage, linking, and querying, its reliance on semantic web technologies presents significant barriers to entry. Users unfamiliar with the RDF conceptual model face a steep learning curve when using Wikibase. Additionally, querying the data requires knowledge of SPARQL, a specialised query language that differs fundamentally from conventional search interfaces. These technical prerequisites can discourage or prevent users from interacting with the CE Data Register.

A web application was therefore developed to explore these challenges, experiment with ways to address them, and support the discovery and exploration of registered datasets without prior knowledge of linked data concepts or query languages (see Figure 5).10 The web application offers two modes of search: full-text and vector. The full-text search presents a text-based interface similar to conventional search engines, supporting keywords and modifiers mapped to the WikibaseCirrusSearch functionality; users gain an immediate means to discover content, refining queries and learning the data model through use rather than upfront instruction, with room to engage more deeply with the underlying structure as their confidence grows. Beyond usability, the web application serves as an experimental environment for exploring data reuse through existing web technologies. Rather than creating a closed, bespoke system, the application leverages pre-existing frameworks and the Wikibase API, demonstrating how the data, once modelled semantically, can be integrated and repurposed within broader digital ecosystems.
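The full-text mode can be backed by the standard MediaWiki search API, which WikibaseCirrusSearch answers on a Wikibase.Cloud instance. A minimal sketch of the request construction (the request is built here, not sent; the `haswbstatement` value shown is a hypothetical example of a CirrusSearch modifier):

```python
# Sketch: building a MediaWiki full-text search request
# (action=query&list=search) against the Wikibase.Cloud instance.

from urllib.parse import urlencode

API = "https://congruence-engine.wikibase.cloud/w/api.php"

def search_url(term: str, limit: int = 10) -> str:
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,   # accepts CirrusSearch keywords and modifiers
        "srlimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"

# Plain keyword search, and a search using a statement modifier
# (the P10=Q7 statement is a hypothetical example):
print(search_url("textile"))
print(search_url("textile haswbstatement:P10=Q7"))
```

Mapping interface fields onto `srsearch` modifiers in this way is what lets users learn the data model incrementally: the same box accepts plain keywords and, later, structured filters.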

Figure 5

CE Data Register Web App https://congruence-engine.github.io/data-register-app/ [last accessed: 26 February 2026].

A further motivation was investigating the affordances of language models in data discovery. Specifically, we compared traditional full-text search methods with vector-based semantic search, exploring how these differing approaches interpret and retrieve information from a relatively small, well-defined collection. The Data Register thus provided a controlled testbed for this comparison, offering insight into how linguistic similarity, conceptual proximity, and structured metadata jointly shape information retrieval. This experimental component makes the application useful as both a functional tool and a research instrument: an exercise in bridging semantic infrastructure and user-centred information access.

(5.2) Archiving and data preservation

While Wikimedia Deutschland’s commitment to Wikibase.Cloud promises long-term platform availability, data preservation remains a consideration for research projects. Humanities infrastructures often outlive individual project cycles, and continuity cannot depend solely on third-party hosting. Archiving is essential to guarantee that a Wikibase instance’s information and structure remain accessible, citable, and recoverable even if the live service changes, migrates, or is discontinued.

(5.2.1) Static archiving with Snowman

To safeguard the project’s Wikibase instance, an archival snapshot was created using Snowman,11 an open-source tool developed by the Glaciers in Archives project.12 Snowman is a static site generator for SPARQL backends. It can be used to create HTML websites from Wikibase data, producing a static snapshot that is preserved and presented in a browsable format without requiring a live connection to the original database. The snapshot serves as a web archive, ensuring that even if the live Wikibase instance becomes inactive or modified, a representative version of its structure and data remains permanently viewable.13

However, the static nature of this archive imposes several inherent limitations. First, it is temporally constrained, capturing only a single moment in time. Once archived, it does not incorporate later edits, deletions or additions to the live Wikibase, effectively freezing the data in its recorded state. Second, the archive is functionally static, consisting of rendered HTML pages with data embedded directly into the markup. This means it is disconnected from the live database and cannot support dynamic operations such as searches, filtering or complex SPARQL and API queries. While limited search functionality could be recreated using JavaScript, doing so would require continuous maintenance as frameworks and libraries evolve or become deprecated. Finally, although the use of HTML ensures long-term accessibility through any standard web browser, it limits interoperability and machine readability. The data is not readily amenable to reuse, integration, or computational analysis.

Despite these constraints, the Snowman archive provides a valuable research data documentation tool, offering a durable record of the system’s contents, layout, and metadata context. It functions as a digital snapshot, preserving a Wikibase instance as a historical artefact of the research data and infrastructure.

(5.2.2) Complementary preservation strategies

Given the archival limitations of static HTML, complementary approaches may be needed to support data reusability and interoperability. We experimented with several export-based strategies to preserve underlying data in structured formats suitable for re-ingestion and analysis:

  • JSON exports. Wikibase’s API allows extraction of full item data in JSON format, capturing statements, references, and qualifiers. JSON is machine-readable and can be re-imported into another Wikibase instance or parsed programmatically.

  • CSV exports. Tabular representations of data (item-property-value triples) can be exported using the Query Service or reconciliation tools such as OpenRefine. CSVs are widely compatible with tools like Excel, Google Sheets, and data visualisation software.

  • XML exports. For institutions using XML-based metadata workflows, RDF or schema-based XML exports can preserve both data content and semantic structure.

  • Turtle RDF dumps. Wikibase supports RDF dumps in Turtle syntax, which can be ingested into triple stores or queried via SPARQL endpoints elsewhere.

  • OpenRefine project archives. When data has been curated or reconciled in OpenRefine, exporting the entire project ensures that transformation steps and reconciliation histories are also preserved.
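For the JSON and RDF routes above, the relevant endpoints can be sketched as follows: Wikibase’s `wbgetentities` API action returns full item JSON (claims, qualifiers, references), and per-entity Turtle is served at `Special:EntityData/<ID>.ttl`. Only request construction is shown here; the entity IDs are placeholders.

```python
# Sketch: building export request URLs for a Wikibase instance.
# wbgetentities and Special:EntityData are standard Wikibase endpoints;
# the Q-ids below are placeholders.

from urllib.parse import urlencode

BASE = "https://congruence-engine.wikibase.cloud"

def entity_json_url(ids: list) -> str:
    """URL returning full item JSON for one or more entities."""
    params = {"action": "wbgetentities", "ids": "|".join(ids), "format": "json"}
    return f"{BASE}/w/api.php?{urlencode(params)}"

def entity_turtle_url(entity_id: str) -> str:
    """URL returning a single entity as Turtle RDF."""
    return f"{BASE}/wiki/Special:EntityData/{entity_id}.ttl"

print(entity_json_url(["Q1", "Q2"]))
print(entity_turtle_url("Q1"))
```

Scripting over these endpoints is enough to produce the structured exports described above on a schedule, complementing the one-off Snowman snapshot.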

Each approach serves different preservation goals: Snowman provides human-readable archival access, while structured exports (JSON, CSV, XML, RDF dumps) ensure machine-readable continuity and potential for data reuse. Combining both ensures the Wikibase instance remains both historically visible and technically recoverable.

(6) Conclusion: Towards a minimum infrastructure for a UK collection

The CE Data Register was developed as a prototype within our broader vision for a minimum viable infrastructure—offering a lightweight yet resilient framework designed to support a distributed, community-led UK collection. Rather than being anchored to a single institution or time-limited funded project (though this prototype has its origins in one), the infrastructure is envisioned as a shared, living resource, open and accessible to individuals, community groups and institutions alike. It builds on existing, trusted digital infrastructure that is already supported by engaged user communities, fostering sustainability and shared ownership from the outset.

Importantly, this prototype avoids presupposing or imposing full standardisation across all data. Instead, it offers a lightweight framework that defines the minimum metadata required to make diverse and distributed collections discoverable. This approach aligns with similar strategies adopted by initiatives such as the Museum Data Service14 and the Archaeological Data Service,15 and increasingly represents best practice in the field. This prototype promotes strong documentation practices—particularly the use of datasheets for digital cultural heritage datasets—to enhance transparency, accountability, and reusability (Alkemade et al., 2023). Crucially, the infrastructure is built to interoperate with other elements of the broader digital research ecosystem, including data services and platforms. In doing so, it offers a practical, scalable pathway towards more inclusive, connected, and sustainable cultural heritage research, helping to surface and link the UK’s rich and distributed collections.

When developing the CE Data Register, we learned valuable lessons that shaped our work. We began with a simple proof of concept, focusing on a streamlined data model with a minimal set of features to ensure user-friendliness. By prioritising scalability and easy data entry, we created a design that remained adaptable and sustainable. This simplicity helped the project to stay aligned with its community-driven nature, fostering ongoing contributions and improvements.

Equally important was the decision to engage early with the Wikimedia community as well as with our potential users. Reaching out for feedback and support from the start allowed us to ensure the model met real community needs. It also made data entry more intuitive and ensured that there were accessible resources to support users in making the most of the Data Register. This early involvement was a game changer, solidifying the project’s relevance and usability.

Working in this way surfaced a number of key concerns and considerations for future research. First, it highlighted the importance of user training and support. For users accustomed to spreadsheets and relational databases, working with Wikibase requires a fundamental shift in how they conceptualise and interact with data. Instead of thinking in rows and columns, users must learn to think in terms of interconnected entities and relationships—a graph-based mental model where connections between data points are as significant as the data itself. This means grappling with unfamiliar concepts such as RDF triples, properties, and qualifiers, while adapting established workflows for data entry, querying, and maintenance. Users need comprehensive training and ongoing support to make this transition successfully. Without accessible documentation, practical examples, and responsive help resources, even motivated users may struggle to apply semantic principles, leading to frustration, inconsistent data practices, and reluctance to fully engage with the system. Support of this kind may not always be possible, however, particularly once funding for a project has come to an end. Consequently, as with the web application described above, an alternative data entry interface with integrated guidance and interactive validation can be vital in lowering the initial barrier and helping users ease into the semantic paradigm through “doing”. Overall, encouraging sustained contribution from research communities requires addressing the learning curve through documentation, training, thoughtful user interfaces, and a demonstrable value proposition.

A second set of considerations concerns documentation, metadata, interoperability, and openness—and the trade-offs and compromises that may be necessary. With regard to metadata standards, and consequently interoperability, ensuring compatibility with established standards such as Dublin Core and Text Encoding Initiative (TEI) will support ongoing mapping and reconciliation efforts, for example with the Museum Data Service. Integration with other data register Wikibase instances, however, seems more challenging. As a small-scale experiment, it is difficult to see how federation with existing national and international Wikibase instances would currently work. The experience of the Museum Data Service also suggests that automated metadata harvesting for institutional collections is far more complex than it sounds, and potentially reputationally damaging without some degree of manual curation (Museum Data Service, 2024). The extent to which automation (programmatic and/or AI) may be useful here remains to be seen and would be a useful area for further research.

We allowed, indeed encouraged, the incorporation of datasheets for digital cultural heritage datasets into the CE Data Register data model to support more complex querying, but there is a risk that this can make data entry and upkeep cumbersome. This is mitigated to an extent by not mandating the inclusion of a datasheet for each dataset recorded, while still emphasising their enormous value for data documentation and reuse. Finally, throughout the project, we were conscious of the need to balance openness with care and sensitivity. For genuinely community-based resources to be developed and sustained, it is vital to accommodate more granular intellectual property concerns, to take account of a range of ethical considerations (which may vary within datasets), and to respect differing institutional policies regarding data sharing. There is no “one size fits all” approach to navigating such a complex landscape, but a lightweight, community-led solution stands a strong chance of balancing difference and standardisation successfully. Humanities data, after all, is so valuable precisely because of its heterogeneity and messiness.

Notes

[1] Towards a National Collection https://www.nationalcollection.org.uk/ [last accessed: 26 February 2025].

[2] This research was undertaken by Dr Beatrice Cannelli.

[3] Wikibase https://wikiba.se/ [last accessed: 26 February 2025].

[4] https://www.wikibase.cloud [last accessed: 26 February 2025].

[6] A single entry point for all datasets currently represented in the CE Data Register can be found at https://congruence-engine.wikibase.cloud/wiki/Special:AllPages?from=&to=&namespace=120 [last accessed: 15 January 2026].

[7] For the CE datasheets see https://sciencemuseumgroup.iro.bl.uk/collections/a88a6f8b-0825-4fcc-a839-fd502f0d23a4?locale=en [last accessed: 15 January 2026].

[8] For SPARQL Query Examples please see https://congruence-engine.wikibase.cloud/wiki/SPARQL_Query_Examples [last accessed: 15 January 2026].

[9] The user guide is available as a standalone living document at https://docs.google.com/document/d/1FPO0bb1PJQOTpZqHIZnMjlISO6oUydyQ2U8wb_nBfOg/edit?tab=t.0#heading=h.yi65r5ibs3cq and via the Data Register wiki at https://congruence-engine.wikibase.cloud/wiki/User_Guide [both last accessed: 26 February 2026].

[10] The CE Data Register Web App can be accessed at https://congruence-engine.github.io/data-register-app/ and for its full documentation see https://github.com/congruence-engine/data-register-app [last accessed: 15 January 2026].

[11] https://github.com/glaciers-in-archives/snowman [last accessed: 15 January 2026].

[12] https://github.com/glaciers-in-archives [last accessed: 15 January 2026].

[13] The static CE Data Register Archive can be accessed at https://congruence-engine.github.io/data-register-archive/ and for full documentation see https://github.com/congruence-engine/data-register-archive [both last accessed: 15 January 2026].

[14] Museum Data Service https://museumdata.uk/ [last accessed: 29 October 2025].

[15] Archaeology Data Service https://archaeologydataservice.ac.uk/ [last accessed: 29 October 2025].

Acknowledgements

The authors would like to thank Arran Rees and Felix Needham-Simpson for their collaboration on the initial development of the Data Register as part of the CE project, contributing to the conceptualisation of the data model and to both the conceptualisation and development of the web app, respectively. We also extend our thanks to the CE project team for their valuable feedback and support throughout this work.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Anna-Maria Sichani: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Writing – original draft, Writing – review & editing.

Kunika Kono: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing.

Jane Winters: Conceptualization, Funding acquisition, Investigation, Methodology, Writing – original draft, Writing – review & editing.

DOI: https://doi.org/10.5334/johd.448 | Journal eISSN: 2059-481X
Language: English
Submitted on: Oct 29, 2025
|
Accepted on: Jan 18, 2026
|
Published on: Feb 26, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Anna-Maria Sichani, Kunika Kono, Jane Winters, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.