Decentralised Semantics: A Semantic Engine User Perspective

Carly M. Huitema; Paul Knowles; Philippe Page; A. Michelle Edwards

doi:10.5334/dsj-2024-042


Introduction

Contextual information is essential for understanding and effectively using research data. Data context is typically described in metadata associated with the data. The benefit of context aligns closely with the Findable, Accessible, Interoperable, and Reusable (FAIR) data principles, which were developed to guide improvements in research data management (Wilkinson et al., 2016). Ensuring data is FAIR, especially in a machine-readable format, improves the context of data and ultimately increases the value of data.

Agri-food Data Canada (ADC) was established at the University of Guelph in Canada to facilitate integrating agricultural research data into a broader societal and environmental context. One aim of ADC is to support cross-domain research questions that improve agricultural production while minimizing environmental impact. Recognizing the importance of comprehensive contextual information for data, ADC has created the Semantic Engine—a suite of tools to help researchers write better, machine-actionable, contextual metadata with less effort.

One of the major challenges in agri-food research lies in the heterogeneity of research data and the decentralized nature of both research and data collection efforts. To address these challenges, the Semantic Engine is centered on metadata creation tools that are platform-independent web applications. These tools are designed to bring machine-actionable, portable, contextual data documentation to a wider audience, thus promoting consistency, reproducibility, and usability in research data. This approach aims to empower researchers across diverse domains, with a focus on the agri-food sector, enabling them to better collaborate and share valuable insights while improving the usability of their data.

Data Schema Documentation with Overlays Capture Architecture

Data schemas are contextual data documents that contain information to help users understand the structure and content of datasets. A data schema of tabular data will reference the column names (attributes) and provide the dataset user with information such as relevant units, rich descriptions of the attributes, and data formatting rules. Going further, schemas can contribute to internationalization if they contain descriptions and labels in multiple languages. Schemas can support interoperability when they connect attributes to specific ontological terms. All these different features of a schema are useful for different types of dataset users in different locales.

The Semantic Engine uses the Overlays Capture Architecture (OCA) standard (Knowles, 2022) for schema documentation. Overlays Capture Architecture is a global, open, extensible standard for writing machine-actionable data schemas and is hosted at the non-profit Human Colossus Foundation in Switzerland. Membership is open to anyone who wants to contribute to the development of the schema language standard.

There are multiple existing machine-readable formats for documenting schemas, including JSON Schema (Wright et al., 2022), XML Schema Definition (Gao et al., 2012), JSON-LD (Sporny et al., 2020), and LinkML (Mungall et al., 2023). All of these formats are rich and expressive, each emphasizing the use cases it has been optimized for, and much of what OCA documents is compatible with them. Unique to OCA is its optimization for decentralized collaboration, realized through its overlay architecture and the embedded use of self-addressing identifiers, which enable greater schema decentralization and research reproducibility.

Overlays Capture Architecture does not prescribe exactly what details should be included when describing data; this flexibility is one of its advantages. Researchers can describe their data to the degree that they find beneficial, and continued use of OCA provides a roadmap for improvement as they add detail to their machine-readable documentation. In contrast, more prescriptive methods of data documentation offer the benefit of structure: documentation can follow expert recommendations and conform to the standard of a specific field of study, but with the trade-off of less flexibility.

Overlays in OCA Support Decentralization and Community Schema Improvements

Overlays Capture Architecture schemas are optimized for decentralized collaboration because they are multilayered, built upon a capture base, and collected in a single schema bundle. The capture base contains four minimum pieces of information about the dataset: attribute names (also known as variables), attribute data types, notation about the sensitivity of attributes, and a general schema classification. Restricting the capture base to this minimal information makes schemas more flexible. Multiple schemas can be built with the same capture base but with regionally important differences, such as native language support or data collection in different units. These regional differences are expressed by additional schema features (the overlays of Overlays Capture Architecture). Overlays of a schema bundle reference the capture base of the schema using identifiers. A schema bundle can have overlays added, removed, or modified depending on schema usage without influencing the basic data structure of the capture base. This means that two separate schemas with measurements in metric and imperial systems can have their specific units documented via different overlays but maintain the same data structure by sharing the same capture base. The identity and content of the capture base (and all overlays) can be verified by using self-addressing identifiers, which are embedded into the OCA architecture (Figure 1).

Figure 1: An overlay schema architecture. Information about a dataset is described in a schema which details features of attributes. Each one of these features can be documented as a separate overlay in an overlay architecture schema model (OCA). Practically, each one of these overlays is written in JSON, and the entire structure is bound together when each overlay references the specific SAID of the schema capture base.
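The layering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the OCA serialization: the field names (`attributes`, `flagged_attributes`, `attribute_units`, and so on), the example attributes, and the helper `simple_digest` are hypothetical, and the digest is a plain SHA-256 over sorted JSON standing in for a real SAID.

```python
import hashlib
import json


def simple_digest(obj: dict) -> str:
    """Stand-in for a SAID: SHA-256 over canonical JSON.

    Assumption: this is NOT the real SAID derivation used by OCA;
    it only illustrates the idea of a content-derived identifier.
    """
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Capture base: only the minimal structural information.
# All field names and values here are illustrative.
capture_base = {
    "classification": "agri-food",
    "attributes": {"animal_id": "Text", "weight": "Numeric"},
    "flagged_attributes": ["animal_id"],  # attributes marked as sensitive
}
base_id = simple_digest(capture_base)


def make_overlay(kind: str, content: dict) -> dict:
    """Create an overlay that points at the capture base by its digest."""
    return {"type": kind, "capture_base": base_id, **content}


# Two unit overlays describe the same structure in different unit systems.
metric_units = make_overlay("unit", {"attribute_units": {"weight": "kg"}})
imperial_units = make_overlay("unit", {"attribute_units": {"weight": "lb"}})
french_labels = make_overlay(
    "label",
    {"language": "fr", "attribute_labels": {"weight": "Poids"}},
)

# Each bundle combines the shared capture base with its own overlays.
metric_bundle = {"capture_base": capture_base,
                 "overlays": [metric_units, french_labels]}
imperial_bundle = {"capture_base": capture_base,
                   "overlays": [imperial_units]}
```

Both bundles differ only in their overlays; the capture base, and therefore its identifier, is shared, which is what lets regional variants stay structurally aligned.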

The multilayered architecture of capture base and overlays described by OCA supports multiple contributors working on a schema. In the appropriate governance framework, there can be separate owners controlling specific overlays based on expertise and need. For example, a schema may have unit specialists or language specialists contributing overlays to a shared schema standard. Overlays can be shared and combined with existing schemas, and the resulting schema bundle, referenced by its identifier, can be shared, cited, and reused.

Self-Addressing Identifiers in OCA Support Decentralization and Reproducibility

A key technology embedded within OCA is self-addressing identifiers (SAIDs). A SAID is a cryptographic digest of a digital object (such as the overlays and capture base in an OCA schema) and is produced by a hashing function. These SAIDs are embedded into, and referenced between, components of a schema, which itself is given a SAID. For example, each overlay lists the capture base digest (SAID) it references. By comparing the calculated digest of the schema or overlay with the referenced SAID, researchers can verify that the schema or schema layer is authentic and unchanged. Thus, SAIDs can be used to compare and verify different schemas, and different overlays within schemas. In the example above, where there are two variations of a schema (a metric and an imperial version), you can confirm that they have an identical capture base structure by verifying that the SAIDs of the capture bases are the same.
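The comparison of a calculated digest against a referenced one can be illustrated as follows. This is a simplified sketch: the helper names and field names are hypothetical, and SHA-256 over sorted JSON is used as a stand-in for the actual SAID derivation, which is defined by its own specification.

```python
import hashlib
import json


def simple_digest(obj: dict) -> str:
    """Content-derived identifier (assumption: simplified, not a real SAID)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def overlay_matches_base(overlay: dict, capture_base: dict) -> bool:
    """True when the digest the overlay references equals the recomputed one."""
    return overlay["capture_base"] == simple_digest(capture_base)


# Hypothetical capture base and an overlay written against it.
capture_base = {"attributes": {"weight": "Numeric"}, "flagged_attributes": []}
overlay = {
    "type": "unit",
    "capture_base": simple_digest(capture_base),
    "attribute_units": {"weight": "kg"},
}

# A copy whose structure was altered after the overlay was written.
altered_base = {"attributes": {"weight": "Text"}, "flagged_attributes": []}

print(overlay_matches_base(overlay, capture_base))  # True
print(overlay_matches_base(overlay, altered_base))  # False
```

Because the identifier is derived from the content, any change to the capture base yields a different digest, so the overlay's reference no longer matches.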

The use of SAIDs is an important tool for a decentralized data ecosystem such as in agri-food research. A design goal of decentralization is that the identifier of the digital artifact (such as a dataset or schema) is used to establish authenticity rather than the location of the artifact. Using a SAID reference in a decentralized data ecosystem means a user can discover the schema in any location, and know they have found the unadulterated, authentic schema.

In a centralized data ecosystem, the authenticity of the schema is established through the authority of location. In contrast, SAIDs support authenticity by cryptographic authority and do not depend on a specific administration to maintain the authenticity of the schema.

The use of content-addressable identifiers such as SAIDs is important for reproducibility. When using digital artifacts that have been referenced in a research workflow, it can be challenging to verify that the cited resource is the same as when originally used. This challenge grows as years pass and the authenticity of the original digital artifact becomes more difficult to verify. By including SAIDs when referencing digital artifacts such as a schema, researchers can be confident they are reproducing the original workflows with the original authentic digital artifacts. The SAID becomes a checksum, or method of verification, to validate artifact authenticity. When the SAID published in the workflow matches the calculated SAID of the found artifact (e.g., a schema), the workflow can be verified, contributing to better research reproducibility and ultimately to more FAIR research practices.
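A reproducibility check of this kind might look like the sketch below, which compares an identifier recorded at publication time with the digest of a copy found later, wherever that copy is hosted. The file names and helper functions are hypothetical, and a SHA-256 checksum of the file bytes is used as an assumption in place of a true SAID calculation.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file bytes (assumption: a stand-in for a SAID)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def is_cited_artifact(path: Path, published_digest: str) -> bool:
    """Check a found copy against the identifier cited in the workflow."""
    return file_digest(path) == published_digest


with tempfile.TemporaryDirectory() as tmp:
    # The schema as it existed when the workflow was published.
    original = Path(tmp) / "schema.json"
    original.write_text('{"attributes": {"weight": "Numeric"}}',
                        encoding="utf-8")
    published = file_digest(original)  # identifier recorded in the paper

    # A byte-identical copy found years later in another location.
    mirror = Path(tmp) / "mirror_copy.json"
    mirror.write_bytes(original.read_bytes())

    # A copy that was edited after publication.
    edited = Path(tmp) / "edited_copy.json"
    edited.write_text('{"attributes": {"weight": "Text"}}', encoding="utf-8")

    mirror_ok = is_cited_artifact(mirror, published)
    edited_ok = is_cited_artifact(edited, published)

print(mirror_ok, edited_ok)  # True False
```

The check depends only on the content of the found file, not on where it was retrieved from, which mirrors the location-independent authenticity described above.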

Semantic Engine, Schemas, and the University of Guelph

Agri-food Data Canada created the Semantic Engine to assist researchers writing machine-actionable OCA schemas for their data. The Semantic Engine web application is available to all users without restriction at https://semanticengine.org, and the source code is open and freely available on the associated GitHub account https://github.com/agrifooddatacanada/OCA_Composer (ADC, 2024). The Semantic Engine has been created with significant user testing by researchers from the University of Guelph to ensure that it is useful and easy to use for the target audience.

Examples of schemas written using OCA can be found in research data repositories (e.g., Borealis: https://doi.org/10.5683/SP3/GHC9Q9). Overlays Capture Architecture is also being tested in other applications, such as the presentation of verifiable credentials. The flexibility of OCA allows a credential containing verifiable information to be presented in multiple languages while enabling cryptographic assurance, via digital signatures, that the credential has not been tampered with. Applications of credential presentation using OCA are currently running in Canada (verifiable credentials in the Energy and Mines Digital Trust of British Columbia) and are under evaluation in Switzerland.

The EU Horizon project ‘NextGen’ (EU Horizon, 2024) integrates genomic and multimodal data in the domain of cardiovascular personalized medicine. In this context, the Overlays Capture Architecture not only underlies the semantic harmonization required between the different types of data; because of its layered architecture, it also supports the distributed governance model necessary for multi-jurisdiction data exchange.

On semanticengine.org, researchers can create a schema based on their existing dataset, or they can create a completely new schema step by step using the web application. At the end of the creation process, a user downloads the completed schema bundle in JSON format, as well as the same information in an archive-friendly and human-readable plain text format. Users can also upload existing JSON schemas to the Semantic Engine and view or edit them. Finally, the Semantic Engine supports the creation of a Data Entry Excel based on a schema. This template is a preformatted Excel sheet with automatically generated documentation and column names ready for data entry. It enables easy, schema-aligned data entry by researchers, making it easier to generate more FAIR datasets.

Funding Information

This research was undertaken thanks in part to funding from the Canada First Research Excellence Fund (CFREF-2015-00004).

Competing Interests

The authors have no competing interests to declare.

Author Contributions

P.K. conceived the semantic architecture of Overlays Capture Architecture and authored the specification, and P.P. was an editor of the specification. C.M.H. conceived the Semantic Engine, and C.M.H. and A.M.E. designed and directed the implementation of the Semantic Engine. C.M.H. wrote instructional guidance materials and designed the Data Entry Excel output of the Semantic Engine. C.M.H. wrote the manuscript with all authors providing critical feedback that shaped the manuscript.