Integrating Premodern Manuscript Metadata into Wikidata: A Case Study in Ontology Design and Linked Data Reuse


(1) Context and Motivation

(1.1) Linked Open Data in Premodern Manuscript Studies

The past decade has seen growing interest in the potential of Linked Open Data (LOD) to support discovery, integration, and reuse of metadata in the humanities, particularly metadata describing manuscripts (Bauer, Bleier, & Sonnberger, 2025). For premodern manuscript studies, where data is typically institutionally siloed, inconsistently structured, and highly variable in descriptive granularity, these possibilities are especially compelling. Libraries, archives, and digital humanities projects have begun contributing manuscript metadata to Wikidata—an openly editable knowledge base built on linked data principles—thereby expanding access to information that is often buried in institutional catalogs or embedded in human-readable but machine-inaccessible forms.

Digital Scriptorium (DS; https://digital-scriptorium.org/) is a national consortium of institutional members who contribute data describing their premodern manuscript holdings to the DS Catalog, a union catalog of premodern manuscripts held in North American collections. The DS Catalog is built in Wikibase and operates in many ways on the same data principles and organizational structure as Wikidata. DS takes in datasets of premodern manuscript metadata from institutional members, enriches (Zeng, 2019b) and reconciles them, and uploads them into the DS Catalog, with the goals of increasing the discoverability of these collections and the potential reuse of their data (Koho, et al., 2023). In its current organizational mission and vision, Digital Scriptorium was conceived in part as a way to begin addressing issues of machine-actionability and data reuse.

Integrating manuscript metadata into Wikidata, however, is not straightforward. Wikidata was not designed with manuscripts in mind, and its flexible but general-purpose schema presents modeling challenges for representing the complexity of premodern manuscript metadata (Groß & Pellizzari di San Girolamo, 2025). Descriptive elements like artistic attribution, ambiguous production dates, or multilingual titles in original script require more nuanced representation than current property infrastructure often allows. These challenges are compounded by the lack of a universal metadata standard for manuscripts and by the wide variety of cataloging practices across institutions and individuals.

To address these issues, the DS team undertook a crosswalking and ontology design project grounded in the existing WikiProject Manuscripts Data Model (WPM DM; https://www.wikidata.org/wiki/Wikidata:WikiProject_Manuscripts/Data_Model). Building on that framework, we sought to test and expand the data model to better support DS’s aggregated metadata and its migration into Wikidata. Our work involved the identification of key modeling gaps, the evaluation of existing Wikidata properties, the suggestion of new relationships where needed, and the creation of structured workflows for the transformation, reconciliation, and upload of records.

In this article, we present our methodology and reflections on this process, with a focus on how ontology design (understood here as the critical articulation of what properties are needed to represent relationships, what relationships matter, and how knowledge is represented [Atamanchuk and Atamanchuk, 2023]) can enhance the value and usability of manuscript metadata. By expanding the WPM DM and developing a scalable, interoperable workflow for crosswalking DS metadata into Wikidata, we demonstrate how thoughtful ontology design can support smart data reuse and discovery across platforms, institutions, and research communities.

(1.2) Challenges of Integrating Manuscript Metadata into Wikidata

Metadata about manuscripts is increasingly being shared, aggregated, and repurposed in digital environments. Yet without consistent data models, shared vocabularies, or structural alignment across platforms and institutions, much of this information remains fragmented and difficult to use or reuse. The problem is not just one of access but of interoperability (Zeng, 2019a): the ability of disparate systems and standards to exchange information and meaningfully integrate it. Interoperability can be achieved at multiple levels (Chan & Zeng, 2006; Zeng & Chan, 2006): at the schema level (mapping descriptive elements between standards), at the record level (transforming records between formats), and at the repository level (aggregating metadata from heterogeneous sources with differing vocabularies and encoding standards). Efforts to achieve interoperability (Zeng, 2019a) are typically undertaken via mapping, which encompasses strategies for defining relationships between entities and concepts represented in different systems or datasets, such as the development of metadata crosswalks. A crosswalk (Woodley, 2016) is a visual and textual tool used for translating one metadata standard to another. Crosswalks often take the form of a chart or table that is the result of a mapping process between metadata elements from one (source) schema to another (target) schema. Crosswalks enable interoperability between metadata schemas and vocabularies by matching semantically equivalent or similar elements or values. They may also articulate the similarities and differences between the structure and semantics of standards, which influence the degree of success in mapping between them.

For premodern manuscripts—understood broadly as handwritten, text-based materials produced across a variety of world traditions before roughly 1800, including books—descriptions are often shaped by local practices and unstructured prose, making interoperability especially challenging to achieve. Most institutional metadata for manuscripts has evolved within bibliographic or MARC-based frameworks that were not designed to accommodate the rich, often ambiguous historical information that manuscripts contain. Moreover, cataloging practices vary significantly from one collection to another, and no universally adopted standard for manuscript description exists. Descriptive Cataloging for Ancient, Medieval, Renaissance, and Early Modern Manuscripts (hereafter AMREMM [Pass, 2003]) forms an extensive guide to the actual practice of original manuscript cataloging, assisting the cataloger in identifying relevant descriptive information, but the guide does not serve as a comprehensive content standard and is in need of updating for the digital age. Additionally, AMREMM was not designed with structured data in mind, instead supporting the more prosaic, unstructured form of manuscript cataloging that inhibits machine-actionability and data reuse.

In recent years, institutions across the cultural heritage sector—including initiatives from the National Library of Wales (Evans, 2023), the Bibliothèque nationale de France, and the Bodleian Libraries at Oxford University—have experimented with using Wikidata as a platform for premodern manuscript description, though data models vary between institutions. Wikidata’s infrastructure offers several advantages: it is flexible, open, multilingual, and widely linked. Most importantly, it allows for semantic relationships to be created between and among items, facilitating connections across collections, regions, and disciplines. But integrating manuscript metadata into Wikidata also requires critical attention to ontology development—that is, to the design of the conceptual models and property structures used to represent manuscripts and their metadata.

(1.3) Motivation for Revising and Extending the WikiProject Manuscripts Data Model

To help guide manuscript-related contributions to Wikidata, a community of users launched the WikiProject Manuscripts (WPM), developing a recommended data model for representing essential manuscript metadata elements such as material, place of production, scribe, language, and intellectual content. This model draws on earlier work informing the Wikibase Biblissima Portal and provides pragmatic guidance on how to apply Wikidata properties to manuscript records (Morlock, et al., 2025). As the DS team began preparing to contribute its metadata to Wikidata, however, it became clear that the WPM Data Model (WPM DM), while a solid foundation, was not sufficient to accommodate the full range of descriptive information present in the DS Catalog. In particular, the WPM DM lacked support for nuanced modeling of provenance chains, complex artistic attributions, ambiguous or approximate dating, multilingual titles, and other features essential to rich manuscript description.

This gap led us (McCandless, formerly DS Manuscript Data Curation Graduate Fellow (2023–2025); Coladangelo, DS Project and Data Manager) to undertake a structured review of the WPM DM with the aim of testing its adequacy for DS data, identifying areas of misalignment, and designing an expanded ontology better suited to support data transformation, integration, and reuse. Our motivation was both practical and conceptual. On the one hand, we needed a reliable framework to support the batch uploading of DS metadata into Wikidata. On the other, we sought to use this process as a case study in how domain expertise—especially in manuscript studies—can inform ontology design and make digital infrastructure more responsive to the needs of humanities research.

Further motivations for this project are the limited reusability of the unstructured data common in manuscript description and the lack of a standard metadata schema. By contributing manuscript metadata to Wikidata according to a clear data model, we increase its reuse potential by enabling the creation of datasets. Until now, handlist and dataset creation in manuscript studies has largely relied on individual scholars or groups of scholars manually compiling lists of manuscripts from disparate sources; with manuscript data in Wikidata, scholars can instead pull down large-scale datasets about manuscripts via SPARQL query services, saving considerable time. Steinova (2020) provides an exemplary discussion of the manual processes required of scholars in putting together handlists of specific types of manuscripts (in her case, Carolingian manuscripts of Isidore of Seville’s Etymologiae) and the immense labor and time this process requires.
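
As a rough illustration of what such a pull-down might look like—a sketch only, not a query from the DS workflow; the Etymologiae label filter and result limit are placeholders—the Wikidata Query Service can return a handlist of manuscript items linked to a work via “exemplar of”:

    # Illustrative handlist query against the Wikidata Query Service.
    # Q87167 = "manuscript"; P31 = instance of; P279 = subclass of; P1574 = exemplar of.
    SELECT ?ms ?msLabel ?workTitle WHERE {
      ?ms wdt:P31/wdt:P279* wd:Q87167 ;   # items typed as (a subclass of) manuscript
          wdt:P1574 ?work .               # linked to a work-level item they exemplify
      ?work rdfs:label ?workTitle .
      FILTER(LANG(?workTitle) = "en" && CONTAINS(LCASE(?workTitle), "etymologiae"))
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 100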

Another goal of the DS-to-Wikidata project is to facilitate the constant and cyclical improvement of Wikidata, DS, and institutional metadata describing manuscripts. Wikidata provides a place for scholarly discourse and contributions of new information about the manuscripts described therein—which can in turn, upon review by institutional stewards, enrich and add to institutional descriptions (Coladangelo and McCandless, 2024b). Updated institutional descriptions can then be reflected in platforms such as the DS Catalog when institutional source data is periodically updated. The reuse potential of this data, therefore, speaks to the cyclical nature of iterative scholarship, crowdsourcing efforts, and integration of related datasets continually informing and improving one another.

This project thus sits at the intersection of data modeling, cultural heritage curation, and digital humanities infrastructure development. It also aligns with broader movements to make LOD more inclusive and interoperable, particularly for underrepresented materials such as non-Western or non-codex manuscripts. By expanding the WPM DM and designing structured workflows for transforming DS records into Wikidata items, we aimed to demonstrate how thoughtful data modeling can not only facilitate interoperability, but also enrich the potential for new kinds of research, collaboration, and discovery.

(2) Dataset description

Repository information

All DS Catalog data is stored in a cloud Wikibase instance at catalog.digital-scriptorium.org. Datasets used in this project were generated through the use of the DS Wikibase SPARQL query service using a query template (https://github.com/DigitalScriptorium/ds-data/blob/20bacd98eda2e6e267f86696dfd36a5b831e1835/ds-to-wikidata/sparql-query-template) and stored on GitHub (https://github.com/DigitalScriptorium/ds-data/tree/20bacd98eda2e6e267f86696dfd36a5b831e1835/ds-to-wikidata/datasets/inputs). The most recent release of this dataset, generated 2025-10-24, can be found at https://doi.org/10.5281/zenodo.17435362.

Format names and versions

Original datasets from the DS Catalog Wikibase, generated through the DS Wikibase SPARQL service, were stored on GitHub and released on Zenodo (as described above) as “2025.10.24-ds-data-repo” (filename: ds-data-1.0.0.zip). CSV files were used in OpenRefine (Delpeuch et al., 2025) to upload data to Wikidata.

Creation dates

2024-11-05 to 2025-08-08.

Dataset creators

Data in the DS Catalog is derived from metadata records contributed by member institutions and processed by the DS team. DS Catalog Project and Data Manager L.P. Coladangelo, DS Manuscript Data Curation Graduate Fellow Rose A. McCandless (Ohio State University), and DS Graduate Assistant Justin Blair (Pellissippi State Community College Libraries) were responsible for the reconciliation and enrichment of data uploaded to the DS Catalog Wikibase, which served as the basis of this project. Coladangelo was responsible for generating input CSVs for Wikidata via the DS Wikibase SPARQL Query Service. Coladangelo, McCandless, and Blair created data dictionaries used in matching DS Authorities to Wikidata items. McCandless was responsible for WPM-to-DS mapping and OpenRefine schema creation in consultation with Coladangelo. McCandless created Wikidata items and uploaded DS data to Wikidata via OpenRefine. Additional information from DS records was added to Wikidata by DS Graduate Volunteer Gaby Stephenson (San Jose State University). Coladangelo queried Wikidata via SPARQL to confirm the creation of Wikidata items and to generate output CSVs matching DS-represented manuscripts to their equivalent Wikidata items.

Language

Data values in DS Catalog data and authorities are in English. Multilingual data may also occur in institutionally supplied metadata values.

License

DS Catalog data is made available to the public under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

(3) Method

(3.1) Developing the DS-to-Wikidata Crosswalk

A central component of this project was the development of a metadata crosswalk between the DS Data Model and the WPM DM, which we used as the foundation for representing manuscript records in Wikidata. The goal of this project was to establish consistent mappings between DS metadata properties and corresponding Wikidata properties as recommended by WPM, while also identifying areas where the WPM DM required modification or extension to support the full descriptive richness of DS records and the diverse manuscripts they describe.

We adopted a schema-level mapping approach, treating the DS Catalog as the source schema and the WPM as the target schema. To organize the mapping process, we grouped DS metadata properties using the broad descriptive categories derived from the WPM specification, including:

  • Core identity statements (e.g., “instance of” manuscript)

  • Material and production metadata (e.g., “made from material,” “location of creation,” “inception”)

  • Content-based metadata (e.g., “language of work or name,” “exemplar of”)

  • Creator and contributor roles (e.g., scribe, illustrator, calligrapher)

  • Provenance roles (e.g., “owned by”)

  • Housing and catalog references (e.g., “held by,” “catalog code”)

We then compared each DS property within these categories to its closest conceptual equivalent in the WPM DM and identified the corresponding Wikidata property, when available. In some cases, mappings were direct (e.g., DS’s “material in authority file” mapped to Wikidata’s P186 “made from material”). In other cases, mappings required more interpretive work, due to semantic or structural mismatches between schemas.

While some elements mapped cleanly in a one-to-one fashion, many required more complex alignment strategies. We encountered several key patterns:

  • One-to-one: DS P16 “instance of” to Wikidata P31 “instance of”

  • One-to-many (direct): DS’s “physical description” fields often needed to be parsed into multiple Wikidata properties (e.g., “height,” “width,” “number of pages”)

  • One-to-many (qualifier properties): DS properties that function better as qualifiers in Wikidata, where they are top-level properties in the DS Catalog (e.g., “author” as a qualifier for “exemplar of”)

  • Many-to-one: Several DS properties and qualifiers (e.g., “production date as recorded,” “earliest date,” “latest date”) had to be merged or qualified under a single Wikidata property

To support reusability, we documented the full crosswalk in a structured spreadsheet. This documentation served as both a reference during manual data entry and the basis for later automation using OpenRefine.
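
To make the qualifier-based pattern listed above concrete, the following sketch reads such statements back from Wikidata. It is illustrative only—it assumes that DS “author” values surface as “author” (P50) qualifiers beneath “exemplar of” (P1574) statements, as described above, and is not taken from the project’s crosswalk documentation:

    SELECT ?ms ?workLabel ?authorLabel WHERE {
      ?ms wdt:P31/wdt:P279* wd:Q87167 ;   # manuscript items
          p:P1574 ?statement .            # exemplar of (full statement node)
      ?statement ps:P1574 ?work ;         # the work-level value
                 pq:P50 ?author .         # author carried as a qualifier
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 50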

(3.2) Modeling Complex Metadata

Artistic attribution poses modeling problems due to the structure of existing Wikidata properties. The existing WPM model (and Wikidata properties generally) assumes the presence of a clearly identified/named “illustrator” (P110). Many manuscripts, however, are attributed more broadly—e.g., to the “Ghent-Bruges school” or “follower of the Master of the Dresden Hours.” Although our initial intention was to supplement P110 with additional properties such as “school of” (P1780), these additional properties can only function as qualifiers to P110, rather than as discrete properties themselves. Qualifiers are property-value statements that function beneath discrete properties, clarifying or adding depth (Help:Qualifiers, 2025). This is problematic because, in many cases, there is insufficient evidence to justify creating a Wikidata item for an unnamed illustrator. It would be ideal, then, if properties such as P1780 or “workshop of” (P1774) were able to function as standalone properties.
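
Under the current structure, such an attribution can only be read back by walking the “illustrator” statement node for its qualifier. A minimal sketch of that pattern (illustrative only, not a recommendation for how anonymous attributions should be modeled):

    SELECT ?ms ?illustratorLabel ?schoolLabel WHERE {
      ?ms wdt:P31/wdt:P279* wd:Q87167 ;   # manuscript items
          p:P110 ?statement .             # illustrator (full statement node)
      ?statement ps:P110 ?illustrator ;   # the illustrator value
                 pq:P1780 ?school .       # "school of" attached only as a qualifier
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 50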

Provenance modeling has also proved somewhat complex. “Owned by” (P127) is a straightforward and useful property for conveying former ownership of a manuscript. The qualifiers “beforehand owned by” (P11811) and “afterward owned by” (P11812) can provide additional information about the chain of ownership, and “start time” (P580) and “end time” (P582), also used as qualifiers, can represent dates of ownership. But while ownership is supported in Wikidata, sale, transfer, and auction events (all of which are key to the reconstruction and understanding of the manuscript trade, among other topics) lack clear modeling pathways. For example, there is currently no standardized way or dedicated property to represent Sotheby’s as a selling agent, distinct from an owner. We thus identified the need for new or repurposed properties to distinguish between different types of provenance events (e.g., “sold by”). Similarly, modeling “ownership” does not distinguish between possession, stewardship, and curatorial housing—distinctions that matter in both legal and interpretive contexts. In these cases, we made practical decisions to use existing properties with appropriate qualifiers and references, but such decisions rely on community conventions that are not always documented or consistent.
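
A sketch of how the qualifier-based ownership modeling described above can be queried—illustrative only; it assumes the P127/P580/P582 usage just outlined and is not a query taken from the project’s workflow:

    SELECT ?ms ?ownerLabel ?start ?end WHERE {
      ?ms wdt:P31/wdt:P279* wd:Q87167 ;        # manuscript items
          p:P127 ?ownership .                  # owned by (full statement node)
      ?ownership ps:P127 ?owner .              # a former or current owner
      OPTIONAL { ?ownership pq:P580 ?start . } # start time of ownership
      OPTIONAL { ?ownership pq:P582 ?end . }   # end time of ownership
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    ORDER BY ?ms ?start
    LIMIT 100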

Where no suitable Wikidata properties currently exist, we have documented needs for future property proposals and continue to engage in discussion with the Wikidata and WPM communities. Our recommendations include:

  • A property for “selling agent” (to distinguish booksellers and auction houses from collectors or institutional owners)

  • Expanded modeling for attribution statements (e.g., school, follower, workshop)

  • More granular treatment of manuscript structure and physical features, such as binding types and styles, layout features (e.g., number of columns), and foliation systems

These proposed enhancements reflect not just technical gaps but a broader need for ontological alignment between bibliographic standards and linked data environments. By explicitly modeling the interpretive, historical, and contextual dimensions of manuscript description, we aim to make this data not only machine-actionable, but also meaningful for scholarship.

(3.3) Ontology Refinement and Expansion

Following the development of the crosswalk and preliminary refinement of the ontology (which continued throughout all stages of the process), the next stage of the project involved preparing and reconciling actual metadata from the DS Catalog for upload into Wikidata. This phase required addressing challenges at the levels of record transformation, authority alignment, and tool-based workflow development. It also provided the foundation for large-scale data ingestion, enabling multiple institutional datasets to be incorporated into Wikidata with consistency and transparency.

(3.4) Workflow Development and Reconciliation

Much of the DS metadata includes controlled or semi-controlled values, such as material types, geographic places, script styles, and centuries, that could be aligned with existing Wikidata items. To facilitate reconciliation, we developed several data dictionaries linking DS Authority terms (which are themselves based on controlled vocabularies including the Getty Thesaurus of Geographic Names, the Art and Architecture Thesaurus, etc.) to corresponding Wikidata QIDs. To perform this reconciliation, we used OpenRefine, an open-source tool widely used for data cleaning, reconciliation, and Wikidata integration. The reconciliation process involved both automated matching using OpenRefine’s built-in reconciliation engine and manual review, especially where terms were ambiguous or multiple candidates existed. In cases where no appropriate Wikidata item could be found, we created new items manually (with full references and descriptions) for those terms that have reliable and authoritative source material (such as identifiers in recognized controlled vocabularies like the Library of Congress Authorities or the Virtual International Authority File), and left unsupported terms as string values.
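
The data dictionaries themselves are maintained as lookup tables; as a rough illustration of the kind of label-to-QID lookup they encode—a sketch using the Wikidata Query Service’s entity-search API rather than the OpenRefine reconciliation engine the project actually used, with “parchment” as an arbitrary example term—candidate Wikidata items for a DS Authority term can be retrieved as follows:

    SELECT ?item ?itemLabel ?itemDescription WHERE {
      SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:endpoint "www.wikidata.org" .
        bd:serviceParam wikibase:api "EntitySearch" .
        bd:serviceParam mwapi:search "parchment" .   # a DS Authority label to reconcile
        bd:serviceParam mwapi:language "en" .
        ?item wikibase:apiOutputItem mwapi:item .    # candidate QIDs returned by the API
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 5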

Once reconciliation was complete, we created a Wikidata schema in OpenRefine to guide the transformation of each DS record into a structured set of Wikidata statements. The schema drew directly on the expanded WPM model and the crosswalk we had developed. The schema was designed to include qualifiers and references wherever possible, supporting both data provenance and disambiguation and pointing the user back to the DS Catalog and institutional metadata. This schema served as a template that could be applied across diverse institutional datasets with only minor modifications, making the process both scalable and repeatable.

With the schema and reconciled data in place, we used OpenRefine’s Wikidata extension to perform batch uploads of manuscript items. Early uploads were tested using a small initial sample to confirm property formatting, reference syntax, and proper linkage. We also manually reviewed resulting items on Wikidata to ensure data quality and consistency with community conventions. Uploads have included metadata describing manuscripts in the collections of:

  • Vassar College, Poughkeepsie, NY: A set of Latin and vernacular manuscripts with well-structured cataloging, useful for testing multilingual titles and script modeling

  • Rutgers University, New Brunswick, NJ: A dataset with many Hebrew manuscripts represented, requiring consideration of multilingual titles

  • The Newberry Library, Chicago, IL: 700+ records in a single dataset that needed significant script reconciliation and had to be batched and uploaded in sets of 100 new Wikidata items at a time.

Each of these uploads contributed not only to the public availability of manuscript metadata, but also to the refinement of our modeling approach and the identification of edge cases needing future support.

(3.5) Ontological Considerations for Complex Data

The process of mapping DS properties to the WPM DM surfaced several important issues relating to granularity and flexibility, the absence of key relationships, and the need for qualifiers and references. While the WPM DM provides a robust baseline for representing standard manuscript metadata in Wikidata, it was not designed to accommodate the full descriptive complexity, semantic nuance, or cataloging diversity reflected in the DS Catalog. Our team therefore undertook a systematic expansion and refinement of the WPM model, guided by the principles of ontology design and grounded in manuscript studies expertise.

Our approach to ontology design emphasized the meaning and function of metadata elements, not just their technical mapping. We asked not only “which Wikidata property should be used?” but also “what kind of information is this element conveying, and how should it be understood within the broader context of manuscript studies?” This led us to consider: the historical and cultural specificity of manuscript description, especially across diverse world traditions; the importance of qualifiers and references for capturing uncertainty, provenance, and source attribution; the need for semantic precision in distinguishing between related but distinct concepts (e.g., “scribe” vs. “author,” “ownership” vs. “sale”); and the multiplicity of valid descriptions, including multi-century date ranges and competing scholarly interpretations.

One of the primary complexities in the representation of manuscripts in metadata schemas of any kind, but especially those designed for bibliographic description such as MARC, is the distinction between the manuscript as a physical object and the text that is included therein (see Cashion, 2016). A manuscript, being written by hand, is by nature a unique object. Every instance of a manuscript has a unique set of characteristics, both physical and textual—and because of scribal errors, imperfect exemplaria, the ravages of time, etc., the text included in a manuscript is never exactly identical to the same text copied in another manuscript (Coladangelo and McCandless, 2024a). For example, the text of St. Augustine’s Confessions in one manuscript will always be slightly different from the Confessions as copied in another manuscript. It is best practice, therefore, to see the manuscript as a physical object that serves as a carrier of text, rather than as a text or book in the way we might see modern printed material (Cashion, 2016). For this reason, both the WPM Data Model and our crosswalk utilize “exemplar of” (P1574) to represent the textual contents of a manuscript—the Wikidata item itself represents the physical manuscript object, rather than the text contained within.

Through hands-on modeling and manual data entry into Wikidata, we identified several key areas where the WPM DM could be expanded to enable richer, more faithful representations of manuscripts that are still structured and interoperable. To improve discovery and better represent non-Western materials, we included fields for titles “in original script” (DS P13), mapping these to Wikidata using “object named as” (P1932) as a qualifier beneath “exemplar of.” When a suitable work-level Wikidata item did not exist, we defaulted to “text” (Q2344602) as the value under “exemplar of,” adding the title as it appears in the manuscript records as a string under the qualifier “object named as.” This works as a strategy for both Latinate and non-Latinate language materials whose textual contents do not appear in Wikidata already (or are too complex or unclear to merit a dedicated Wikidata item).
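
The following sketch shows how this default pattern can be read back from Wikidata. It is illustrative only and simply assumes the modeling just described: “exemplar of” pointing to the generic item “text” (Q2344602), with the recorded title carried by an “object named as” (P1932) qualifier:

    SELECT ?ms ?titleAsRecorded WHERE {
      ?ms wdt:P31/wdt:P279* wd:Q87167 ;        # manuscript items
          p:P1574 ?statement .                 # exemplar of (full statement node)
      ?statement ps:P1574 wd:Q2344602 ;        # generic work-level value: "text"
                 pq:P1932 ?titleAsRecorded .   # title as recorded, possibly in original script
    }
    LIMIT 100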

(4) Results and Discussion

(4.1) Overview of Results

The integration of DS metadata into Wikidata using an expanded version of the WPM DM yielded valuable results across three primary areas: (1) confirmation of WPM’s baseline viability, (2) identification of critical modeling gaps in the WPM DM and Wikidata itself, and (3) development of scalable workflows that enhance future contributions to Wikidata. These results demonstrate not only the potential of Linked Open Data for manuscript studies, but also the practical and conceptual challenges of modeling humanities data at scale.

Our analysis confirmed that the WPM model provides a solid foundation for representing core manuscript metadata in Wikidata. Most DS properties aligned—at least partially—with WPM’s recommended fields, and manual testing confirmed the reliability of the crosswalk and schema before scaling up via OpenRefine. The project also highlighted significant limitations in Wikidata’s current property infrastructure, especially for modeling the complexity of manuscript description. Key issues included: ambiguous or multiple production dates; attribution to anonymous or collective artists; representation of sales and provenance; title complexity and language variation; and physical description and structure. These limitations reveal that while Wikidata is technically capable of representing many manuscript features, significant curatorial and modeling decisions are required to ensure semantic accuracy and fidelity to the source data.

One of the most concrete outcomes of this work is the development of a repeatable, scalable workflow for transforming DS Catalog metadata into Wikidata statements. This workflow includes a fully documented schema-level crosswalk between DS and Wikidata properties; reconciled data dictionaries for materials, places, centuries, scripts, names, and titles, linking DS authorities to Wikidata QIDs; and an OpenRefine schema that can be applied across institutional datasets, allowing batch uploads of records with consistent formatting and referencing. As of August 2025, this crosswalk and workflow have been used to import more than 3,500 structured, semantically rich Wikidata items for manuscript records from 27 institutions, including Saint Louis University, Vassar College, Ohio State University, and the Newberry Library.

(4.2) Challenges in Modeling Humanities Data

The process of transforming manuscript metadata from the DS Catalog into structured Wikidata elements surfaced a number of challenges that are not unique to this project but reflect deeper tensions at the intersection of humanities data and linked open data platforms. In particular, the modeling of manuscript data required careful negotiation of uncertainty and competing or complex metadata—areas where general-purpose data models often fall short. The strategies we developed in response underscore the need for flexible, domain-sensitive ontology design in Wikidata, especially when working with complex, historically-rooted humanities materials.

Premodern manuscripts and their description resist simplification. Even in well-documented Western European traditions, cataloging a manuscript involves balancing physical description, historical attribution, intellectual content, and codicological structure, often with varying levels of clarity. These challenges are amplified when metadata has been compiled over time by different institutions and individuals using heterogeneous standards. For example, a single DS record may include an approximate production date expressed as a century, an attributed but anonymous artist described only by style and/or association, titles in Latin and Hebrew, and a binding note embedded in a semi-structured field.

Wikidata’s structured ontology offers expressive tools, but it also imposes constraints—most notably, a preference for point-specific data, item-based linking, and the avoidance of interpretive claims not supported by references. Representing the full semantic nuance of manuscript cataloging within this system required not only technical mapping but also thoughtful epistemological translation: deciding how much uncertainty to encode, how to treat unstructured or ambiguous notes, and when to create new items to support referential clarity.

(4.3) Representing Ambiguity and Uncertainty

Manuscript metadata is often marked by interpretive ambiguity: scribes are unnamed, titles are descriptive rather than formal, and dating is approximate. Wikidata’s ontology does provide mechanisms for encoding uncertainty (e.g., through date ranges or qualifiers), but these require careful application to avoid distorting the source data. For instance, while “inception” (P571) can hold a single date (a single year, century, or millennium), it cannot represent a range without additional qualifiers like “earliest date” and “latest date.” Moreover, “inception” expects a single value, but many manuscripts are dated to more than one century, for example, “ca. 1375–1425.” When more than one value is added under “inception” (such as, in this case, “14. century” and “15. century”), Wikidata asks that one value be “preferred” over another—this would, however, be inappropriate, as the manuscript is as likely to date from the fourteenth century as from the fifteenth. Similarly, artistic attributions such as “follower of the Master of the Dresden Hours” cannot be reduced to a single creator without losing important contextual nuance.
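
As an illustration of the qualifier-based date modeling described above—a sketch only, which assumes that the “earliest date” and “latest date” qualifiers correspond to P1319 and P1326 (the text above names but does not number them)—manuscripts whose inception statements carry an explicit range can be retrieved as follows:

    SELECT ?ms ?inception ?earliest ?latest WHERE {
      ?ms wdt:P31/wdt:P279* wd:Q87167 ;   # manuscript items
          p:P571 ?statement .             # inception (full statement node)
      ?statement ps:P571 ?inception ;     # the recorded date or century
                 pq:P1319 ?earliest ;     # earliest date qualifier
                 pq:P1326 ?latest .       # latest date qualifier
    }
    LIMIT 100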

In our modeling, we prioritized approaches that preserve ambiguity (and may push against the Wikidata property structure to an extent) rather than resolve it prematurely. We relied on qualified statements, such as using “school of” (P1780) or “object named as” (P1932) to qualify properties whose values alone would diminish the nuance of the original metadata, and we included references with the property “stated in” (P248) to ensure that interpretations are adequately sourced and attributed to the original cataloging institution. This approach mirrors the interpretive openness of manuscript studies itself, where scholarly consensus may, and indeed often does, shift over time.

(4.4) Mapping Challenges Reflective of Broader Modeling Needs

Another core challenge lies in representing titles, names, and terms in multiple languages and scripts. Manuscripts are often titled differently across institutions with varied cataloging practices. One institution, for example, may choose to use the title provided in a manuscript or its prior description. Another may instead choose to assign titles based on genre rather than specific text. Original titles may appear in Latin, Arabic, Hebrew, or other languages, in diverse scripts. Wikidata’s infrastructure supports multilingual labeling through language codes and monolingual text fields, but does not easily accommodate nuanced script information or transliteration metadata.

To address this, we incorporated fields such as “title in original script” (DS P13) into our schema, representing them in Wikidata as qualified statements. When linking a manuscript to a work-level item via “exemplar of” (P1574), we used “object named as” (P1932) to record variant titles and original-script forms, referenced back to the DS Catalog and/or institutional record (depending on the source of the metadata). This strategy maintained the multilingual richness of the source data and multilingual queryability while allowing users to discover and compare records across linguistic boundaries.

Still, this solution required workarounds and compromises. There is currently no standardized property for recording transliteration schemes, and the transliteration of non-Roman scripts, in particular Arabic scripts, is highly variable. This presents major discovery issues: given differing transliteration systems for letters, sounds, and diacritical marks and Unicode characters, the diversity of possible transliterations for any given non-Roman-alphabet title makes querying almost impossible. By including the original-script version of the title, we provide some basis for querying titles of texts included in manuscripts (though multilingual querying presents its own host of issues). These issues point to a broader need for more granular modeling of multilingual expression, particularly for cataloging materials outside of the Latin-alphabet Western European tradition.

Wikidata’s ontology is intentionally broad, designed to accommodate everything from literary works and celestial bodies to taxonomic species and sports statistics. But this generality can obscure the cultural specificity of certain knowledge domains. Manuscript description, shaped by centuries of scholarly practice, carries with it assumptions, terms, and categories that may not align neatly with Wikidata’s schema—not to mention the many historic biases in cataloging and scholarship which may inhibit culturally accurate description in the first place.

(5) Implications/Applications

This project demonstrates how thoughtful ontology design and domain expertise can make humanities metadata more interoperable, reusable, and meaningful. By expanding the WikiProject Manuscripts Data Model and developing scalable workflows for transforming DS metadata into Wikidata, we have shown how structured, expert-informed data modeling can support smart data reuse and discovery across platforms, institutions, and research communities.

The workflows and schema developed here have already been applied successfully across multiple institutional datasets and can be reused or adapted by other projects working with manuscript metadata or similar cultural heritage materials. They provide a concrete framework for transforming, reconciling, and uploading data in a consistent, transparent, and interoperable way, especially for projects that store data in a Wikibase environment.

Our experience suggests that ontology design for the humanities must be iterative, community-informed, and sensitive to disciplinary knowledge. It also points to the need for developing shared modeling practices and proposing new properties where warranted, particularly for domains like manuscript studies that sit at the intersection of textual, artistic, and material culture.

This project has demonstrated the need for subject-specific, expert-driven ontologies grounded both in the subject itself and in an understanding of the data principles at play. By explicitly modeling the interpretive, historical, and contextual dimensions of manuscript description, we aim to make this data not only machine-actionable but also meaningful for scholarship, contributing to a richer and more inclusive digital infrastructure for the humanities.

Data Accessibility Statement

Data used in this project is available on GitHub (https://github.com/DigitalScriptorium/ds-data/tree/20bacd98eda2e6e267f86696dfd36a5b831e1835/ds-to-wikidata/datasets/inputs) and is made available to the public under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Notes

[1] https://www.wikidata.org/wiki/Property:P186. All Wikidata property URLs are formatted as above, with “Property:P#” indicating the specific property number.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Rose A. McCandless

  • Conceptualization

  • Data curation

  • Investigation

  • Methodology

  • Project administration

  • Validation

  • Visualization

  • Writing – original draft

  • Writing – review & editing

L.P. Coladangelo

  • Conceptualization

  • Data curation

  • Formal analysis

  • Funding acquisition

  • Investigation

  • Methodology

  • Project administration

  • Resources

  • Software

  • Supervision

  • Visualization

  • Writing – review & editing

DOI: https://doi.org/10.5334/johd.431 | Journal eISSN: 2059-481X
Language: English
Submitted on: Oct 25, 2025
Accepted on: Nov 27, 2025
Published on: Dec 11, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Rose A. McCandless, L. P. Coladangelo, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.