(1) Introduction
Rare book and manuscript (RBM) digital catalogs benefit distinctly from the semantic data networking affordances of Linked Open Data (LOD). Digital special collections and similar online archival repositories have deployed controlled vocabularies and descriptive metadata standards to make collections with complex material, linguistic, generic, and subject qualities searchable within digital content management systems and to link items to one another based on existing or bespoke properties (Jett et al., 2017). Linked Open Data offers the potential to establish relationships between an institution and the digital records and items within its digital special collection (Daquino, 2024). Networking special collections objects and records through linked descriptive and organizational metadata allows those objects to establish clear relationships to the authority record, to other objects within the specific catalog knowledge base, and to other objects and associated qualifiers included within the wider linked data knowledge base. LOD technology has been highly impactful over the past two decades and continues to allow for innovations within some digital archival and special collections catalogs (Sanderson, 2024).
Heritage practitioners working with RBM materials and records are aware of the diversity of these materials, as well as the unique qualities and characteristics of individual copies of a manuscript. RBM materials may have distinctions in their composition in a specific binding, contain wax residue or spilled ink, represent the marginal writings of a prior owner, or have other features which distinguish them. For RBM materials more than any other in heritage collections, records and descriptions represent a specific copy of a manuscript document or rare book. This reality is a draw for researchers interfacing with those materials in a collection’s context and a matter of consideration for integrating those materials into a given collection. Koho et al. (2023) state that, “Unlike printed books, every manuscript is a unique production and requires item-specific description elements related to both the textual contents—author, title, language—and the physical properties–date and place of production, script, materials, dimensions for both the item and page layout, decoration, folio count, binding, and so on”. Descriptive complexities and the highly individual nature of RBM materials have been a draw for practitioners to implement LOD schemas for their collections. LOD has been lauded for its ability to establish clear relationships across materials in each collection and has even been used to regularize metadata in cross-institutional collections.
In the following sections, I discuss a LOD research and data visualization project which explored the objects and larger metadata structure of the Digital Scriptorium (DS) digital manuscript special collection. I took part in research with the DS manuscript union catalog as an outside research fellow after the release of their updated LOD catalog, rather than as an integrated project team member with significant back-end access. Because of this limitation, this article focuses mainly on the outputs of the project as a research use case rather than exploring the data model beyond the processes for transforming and ingesting institutional metadata as LOD into the catalog. This research and its associated outputs are primarily analysis-oriented and demonstrate the potential utility of LOD, and of data visualization methods applied to LOD, for supporting researchers of diverse manuscript collections in linked data environments. This project also offers a potential use case for linked manuscript data and presents analytical outputs that directly support research with pre-modern and early modern manuscripts, their digital surrogates, and manuscript cultures represented in the DS catalog.
1.1 Background
Heritage practitioners working in special collections contexts may integrate institutional holdings data into existing linked knowledge bases like Wikidata or create their own bespoke knowledge base using Wikibase’s open-source linked data platform. Researchers working with such collections have leveraged LOD collections to better compare materials across a corpus (Stork et al., 2018; Miller et al., 2025). In this section, I provide a brief background to the literature surrounding current institutional applications and scholarly work with LOD-integrated heritage collections and some of the drawbacks when transforming a collection into LOD.
LOD projects in RBM collecting institutions have already been implemented in a variety of ways to support internal organization and authority as well as research using their holdings. More than a decade ago, the semantic web catalog project Short Title Catalogue, Netherlands (STCN) presented a significant application of LOD and semantic web architecture to provide researchers with a way of searching linked manuscript documents within a single catalog. The STCN team present an optimistic view of their LOD approach, stating that “the use of LOD technologies has the potential to transform both the quantitative and qualitative aspects of querying a DH dataset” (Beek et al., 2014). Another LOD-based manuscript catalog, the Mapping Manuscript Migrations (MMM) project, “intends to demonstrate new approaches to this kind of provenance research, based on the transformation of existing data sources into a Linked Open Data environment” (Burrows et al., 2018). This project has since been released and functions as a live LOD environment for medieval and Renaissance manuscripts and related objects and records, presenting a kind of prototype for the work underpinning subsequent manuscript catalog projects such as the Digital Scriptorium catalog discussed further in this paper.
Implementation of LOD authorities and schema in an RBM context does not offer a total solution to organizational and search issues within a manuscript catalog. LOD implementation comes with its own set of drawbacks which can complicate its utility as a search tool. Some of the challenges and drawbacks of implementing LOD in a special collections context are explored in detail in a 2020 OCLC position paper. Among a variety of issues discussed, the authors point out that “Because of the rare or unique—and often local or regional nature of special collections—they need to represent many people, families, and corporations that are not in authority files” (Blake et al., 2020). Researchers from OCLC further explore complications with LOD implementation in a special collections context, such as the need to restructure traditional archival authorities to be functional in a LOD environment or the significant difficulties in implementing LOD for multi-lingual collections.
Other researchers working with the semantic web point out issues with sustainability, the need for persistent URIs to make the application of LOD functional, and the potential complexities that arise when concepts are represented by multiple URIs with various origins (Schmidt et al., 2022). Despite the significant potential of LOD for RBM special collections and its largely positive reception by heritage practitioners and digital humanists, the integration of catalog records into the semantic web is not a cure-all for persistent problems with archival data architecture or high-level search within a manuscript archive. Practitioners and researchers alike should maintain an awareness of the potential costs, difficulties, and drawbacks of transforming special collections into LOD.
(2) Data Contextualization
The Digital Scriptorium (DS) is a union manuscript catalog which connects a variety of institutional data and materials within a bespoke knowledge base built on the Wikibase platform. The DS manuscript catalog brings together archival records, institutional metadata, and digital surrogates from over 50 institutional repositories across North America, and connects those data, resources, and records through the platform’s Linked Open Data capabilities.
A recent and significant update to the DS catalog’s data model, data curation, and ingest processes, titled DS 2.0 by the project team, revised multiple core data elements and processes into a LOD implementation, improving the sustainability of the catalog as more institutional data are added. DS 2.0 is a full rebuild of the original DS catalog as LOD and implements a new authority structure where data from member institutions are linked to the DS catalog’s authority record (Koho et al., 2023, p. 11).
There were several driving principles for the DS 2.0 adoption of LOD and transformation of their authority structure and how it relates to external institutional records. The principles set forth for the DS 2.0 LOD implementation reflect the catalog’s primary purpose as a tool to empower researchers to discover, use, and reuse pre-modern and early modern object data, records, and digital surrogates. DS 2.0’s core principles included adopting a minimal standards approach for metadata (requiring only location and institutional data to create a record), allowing member institutions to be the sole curators of their data, and leaving those data unaltered once received. DS 2.0 also enables image embedding through use of the International Image Interoperability Framework (IIIF), which allows for image description and lightweight embedding of high-fidelity images rather than hosting digitized surrogates through the catalog infrastructure. Implementing LOD allowed the team to enrich their collection’s metadata while connecting those metadata to an external authority. This extensive work to restructure the catalog as LOD aimed to enhance the scope, sustainability, and research potential of the DS catalog.
2.1 Digital Scriptorium’s Data Model
The DS data model is built on Wikibase, open-source software which powers LOD catalogs and connects resources across Wikimedia. DS features an online catalog with traditional database search functionality (a legacy from the pre-LOD catalog). The LOD portion of the new catalog can be explored using the SPARQL query language. The LOD query options empower researchers of premodern and early modern history and literature to engage the collection on both granular and broader levels. To revise and improve the overall research capabilities of this cross-institutional union catalog, the Digital Scriptorium team launched DS 2.0, a project running from July 2020 to March 2023 to transform the collection into LOD, improve collection sustainability by updating its minimum standards, and expand or enhance metadata properties to streamline research.
The DS catalog aims to help collecting institutions and researchers of pre-modern and early modern manuscripts to benefit from the networked quality of LOD. By connecting objects representing a variety of institutional metadata structures and regularizing them within a single structured authority, the DS manuscript database makes the cross-institutional catalog data discoverable as larger dataset outputs. The driving purpose of DS 2.0 has allowed the DS community to contribute to LOD and semantic network research and to add research potential to the study of premodern and early modern manuscripts as well as wider digital humanistic research (Koho et al., 2023, p. 2). As the DS catalog develops and expands its holdings, it becomes crucial to examine what impact changes to the data model and infrastructure have had on a significant collection of manuscripts.
Koho et al. (2023), members of the DS 2.0 project team, explain that the “primary responsibility” of the DS 2.0 project was to “reconcile and harmonize member data with external and in-house authorities”. Restructuring DS authorities and transforming catalog metadata into LOD improved discoverability and the mechanisms through which the DS catalog is structured internally and how the catalog connects to external authorities and metadata from partner institutions. According to Coladangelo and McCandless (2024), “DS takes in records from disparate heterogeneous sources, and extracts, aggregates, harmonizes, enriches, transforms, and republishes records in its union catalog. In addition to providing increased discoverability and datasets for computational research, this metadata is now available as linked data, supporting new re-uses”.
DS 2.0 transforms collections metadata from over 50 academic and cultural institutions into LOD within its Wikibase knowledge base under a unique authority structure. The data transformation and harmonization process relies on a more minimalistic approach to external metadata and recontextualizes institutional metadata within the linked environment. This process takes member data and converts those data through an agnostic transition spreadsheet. From there, the data are transformed to fit within the DS 2.0 data model and are made available through the DS search infrastructure, both on the LOD end and on the more traditional manuscript catalog end.
(3) Tools and Approaches
The following sections focus on research I conducted as a LIS Education and Data Science Integrated Network Group (LEADING) research fellow with the support of the Digital Scriptorium team at the University of Pennsylvania Schoenberg Institute for Manuscript Studies. This research included creating or revising SPARQL queries for the DS catalog to directly support manuscript research activity. Subsequent work included visualizing the dataset outputs of those SPARQL queries using Tableau Public, a web platform for creating and sharing interactive data visualizations. As this project includes roughly a dozen distinct SPARQL queries and Tableau visualizations, I highlight several here to discuss the data. The SPARQL queries, query outputs, and visualizations produced from this work are available in full via GitHub and/or Tableau Public.
3.1 Researching the DS Manuscript Catalog with SPARQL
At the time of writing, this research has produced 12 new or revised SPARQL queries that rely on the DS LOD knowledge base and eight complete visualizations of data from the Digital Scriptorium catalog. These visualizations reflect portions of the updated DS 2.0 data model and focus on various data properties relevant to pre-modern and early modern researchers.
The initial investigation into the DS catalog involved applying research-oriented SPARQL queries to create custom datasets based on DS’ metadata authority. These queries were built to facilitate this research or were further developed from pre-existing queries publicly available through the DS GitHub to offer workable models for using SPARQL to explore the catalog data at scale. These queries are useful for researchers and are used internally by the Digital Scriptorium team. The datasets produced by these queries also serve as the research data for this project, rather than relying on the DS data model itself. As an external Research Fellow rather than an integrated team member, my access to the backend data infrastructure was limited. Because of this, my research primarily focused on the end-user interaction points with the LOD knowledge base and catalog records, effectively building a workable process for extracting novel datasets through SPARQL to demonstrate high-level or granular data for visualization and reuse.
Creating SPARQL queries enabled the visualization and annotation aspects of this work, discussed in detail in section 3.3. To conduct research within the DS catalog, this work leveraged datasets generated from 12 distinct research-oriented SPARQL queries of a range of data fields and types, available via the project GitHub. The query methodology included constructing targeted queries for a variety of DS LOD properties, including genre, provenance, material, location of origin, role, and others. Approaching SPARQL queries from multiple perspectives, sometimes overlapping in how the query was constructed, enabled the collection and extraction of novel datasets containing qualitative relationships of items within the catalog generated through the LOD knowledge base and metadata authority underpinning DS.
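As a minimal sketch of the end-user interaction point described above, the following Python snippet prepares a query for submission to a SPARQL service over HTTP. The endpoint URL is a placeholder assumption, not the DS service address, and the snippet only builds the request rather than sending it. The 'query' parameter and the JSON results media type follow the general SPARQL protocol conventions.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder assumption: replace with the actual SPARQL service URL of the catalog.
ENDPOINT = "https://example.org/sparql"


def build_sparql_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Prepare (but do not send) a SPARQL GET request that asks for JSON results."""
    # The query text is URL-encoded and passed in the 'query' parameter.
    url = f"{endpoint}?{urlencode({'query': query})}"
    # The Accept header requests the SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request("SELECT ?item WHERE { ?item ?p ?o } LIMIT 5")
```

A researcher could pass the prepared request to `urllib.request.urlopen` and parse the response with `json.load` to obtain a dataset for further analysis.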
SPARQL queries are useful as a site for research beyond their structured discovery capabilities. Many of the queries produced within this work are customizable for subsequent use by researchers looking to search within the manuscript catalog. This customization is enabled through distinct clauses in the query, such as applying FILTER clauses to limit qualifiers populated by the query, as shown in Figure 1.

Figure 1
SPARQL Query of Earliest and Latest Date with Filter.
In the query shown in Figure 1, the FILTER function is set to the year 1000 for the ‘Early Date’ element of the query, designated by the ‘?earlyDate’ variable. In this case, setting the earliest date to 1000 returns 862 of the 27,158 results returned when the FILTER is disabled. This is highlighted in Figure 2 below. A researcher can change the number in the FILTER clause to search for only those manuscripts in the collection with an earliest date of any given year. As with all SPARQL queries with a filter, the FILTER can also be adjusted to include a range of values, or the ‘=’ symbol can be changed to a ‘<’ or ‘>’ to include results below or above a target date. This query can also be altered to include a FILTER on the ‘?lateDate’ variable to complement or replace the one on ‘?earlyDate’.

Figure 2
Early Date Filter set to = 1000.
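The customization described for this query can be sketched as a small Python helper that regenerates the query text for a chosen year and comparison operator. This is an illustrative reconstruction rather than the query shown in Figure 1: the property namespace is a placeholder assumption, the triple pattern is simplified, and the date is assumed to be comparable as a plain year number. Only the property identifier P37 and the variable names come from the discussion in this section.

```python
# Placeholder assumption: the real property namespace of the catalog is not given here.
PROPERTY_PREFIX = "https://example.org/prop/direct/"

# Comparison operators a researcher might substitute in the FILTER clause.
ALLOWED_OPERATORS = {"=", "<", ">", "<=", ">="}


def build_early_date_query(year: int, operator: str = "=") -> str:
    """Return SPARQL text that filters records by their earliest date."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"Unsupported comparison operator: {operator!r}")
    # BIND ties the property (P37, 'earliest date') to a readable variable name,
    # and FILTER restricts the results to the requested year or range.
    return f"""PREFIX wdt: <{PROPERTY_PREFIX}>
SELECT ?item ?earlyDate WHERE {{
  BIND(wdt:P37 AS ?hasEarliestDate)
  ?item ?hasEarliestDate ?earlyDate .
  FILTER(?earlyDate {operator} {int(year)})
}}"""


exact_query = build_early_date_query(1000)          # earliest date equal to 1000
before_query = build_early_date_query(1200, "<")    # earliest date before 1200
```

Keeping the operator in a small allow-list means that a change to the FILTER cannot accidentally produce malformed query text.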
Other SPARQL queries were modified or updated from existing SPARQL queries in the DS GitHub to include new BIND statements, which are statements in SPARQL that assign a value to a variable and here connect variables to the DS 2.0 properties (e.g., ‘?hasEarliestDate’ shown above binds to property P37, which paths to objects assigned an ‘earliest date’ qualifier). Other queries were created and organized to include additional RDF triple statements to search for a variety of properties represented in the DS catalog or specific search parameters, such as searching only items in French or only those from Italy. Figure 3 includes a query searching for the catalog identifier and the name string of the institution where the record originated.

Figure 3
DS Catalog ID and Institution Query.
The output from this query is not a dense dataset (n = 28,293). These outputs are useful, however, to inform researchers about the name of an item with a unique DSID (specific identifier in the DS catalog), its institutional affiliation, and its institutional identifier. The query is also fairly agnostic to variations of institutional identifiers and is not impeded by the differences in identifiers between, for instance, records from the University of Kansas and those from the Free Library of Philadelphia, both of which appear in the query output shown in Figure 4.

Figure 4
Data Output from Institutional ID Query.
A range of these SPARQL queries is available both through this research project’s GitHub and through the Digital Scriptorium GitHub. These queries range across a variety of topics and engage the collection from differing and intersecting elements of the digital catalog. At the current stage of this research, the primary outputs are the queries and visualizations as shown below, since the datasets produced by those queries are updated as the catalog’s holdings are expanded. Dataset outputs from the queries run to facilitate the initial visualizations are included in the GitHub to represent the findings of this work in multiple forms.
3.2 Linked Open Data Visualizations
The research-oriented SPARQL queries produced datasets representing a variety of materials in the DS catalog from different perspectives. Exporting those data outputs as JSON files enabled the visualization of the query results using the Tableau data visualization platform. The main purpose of visualizing these queries is to portray a set of relationships between individual items or across categories within the wider manuscript catalog. Researchers in the broader LOD discourse have discussed the use of visualizing linked data through a variety of means, both for those who ingest LOD and those who leverage it for research (McBrien & Poulovassilis, 2019; Helmich et al., 2017).
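One way to prepare such an export for a visualization tool is to flatten it into a table. The sketch below assumes the export follows the standard SPARQL JSON results layout, with a 'head' listing the variables and a 'results' object holding the 'bindings'. The sample records and variable names are hypothetical and are not drawn from the DS catalog.

```python
import csv
import io

# Hypothetical sample in the SPARQL JSON results layout (not actual DS data).
SAMPLE = {
    "head": {"vars": ["item", "genreLabel"]},
    "results": {
        "bindings": [
            {
                "item": {"type": "uri", "value": "https://example.org/entity/Q1"},
                "genreLabel": {"type": "literal", "value": "Manuscripts"},
            },
            # A binding may omit a variable when the query left it unbound.
            {"item": {"type": "uri", "value": "https://example.org/entity/Q2"}},
        ]
    },
}


def bindings_to_rows(results: dict) -> tuple[list[str], list[dict[str, str]]]:
    """Flatten SPARQL JSON results into column names and one dict per row."""
    columns = results["head"]["vars"]
    rows = [
        {col: binding.get(col, {}).get("value", "") for col in columns}
        for binding in results["results"]["bindings"]
    ]
    return columns, rows


def rows_to_csv(columns: list[str], rows: list[dict[str, str]]) -> str:
    """Serialize the flattened rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


columns, rows = bindings_to_rows(SAMPLE)
csv_text = rows_to_csv(columns, rows)
```

Filling unbound variables with empty strings keeps every row the same width, which tabular visualization tools generally expect.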
3.3 DS LOD Visualizations
Visualizing DS data emphasizes the intersecting aspects of the catalog and its properties. Drawing on previous research with visualizing special collections using LOD, this work is intended to explore the structure and discovery results of generalized SPARQL queries to represent the structure of a digital catalog from a variety of collections-level perspectives (Shabani et al., 2018). These perspectives include documentary subject, genre, authorship, holding institution, material elements, and more. Visualizing these elements via Tableau shows the capabilities of LOD as a means of representing manuscript special collections materials and digital surrogates within a complex and structured union catalog of RBM materials. Some visualizations for this research project are showcased below in this paper, though they are not interactive like the Tableau instances. The rest of the visualizations not covered here are available via Tableau Public and in this research project’s GitHub repository.
Figure 5 offers a cursory look at the DS catalog and the various genres of items within the holdings. This visualization is based on the output of a simple SPARQL query which drew the various qualifiers through the Genre as Recorded (P18) property. Due to this simplicity, modelling these results as a table rather than a more complex visualization enables exploring the output of genre name and number of items directly. The image of the visualization represented in Figure 5 (n = 52,302) also shows only the most populous genres, though the more complete list of genres and their counts in the query is available in the full visualization. With this simple visualization, it is important to note that the resulting numbers in the second column are not duplicated across similar categories (e.g., materials in ‘Manuscripts, Sanskrit’ are not necessarily a part of ‘Manuscripts’). Some records may be represented in multiple categories. The category of ‘Manuscripts’ is by far the largest, especially considering its repetition with more specific qualifiers such as ‘Manuscripts, Medieval’ or ‘Manuscripts, Arabic—19th century’. This repetition of more granular qualifiers related to ‘Manuscripts’ is expected in such a diverse manuscript union catalog. How those various qualifiers are applied and constructed within the catalog, however, is a point of inquiry into how the current LOD catalog is constructed and how the qualifiers under the relevant properties are distributed.

Figure 5
DS Manuscripts by Genre.
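The counting behavior noted for this table can be illustrated with a short Python sketch. The record identifiers and pairs below are hypothetical and only mirror the kinds of genre strings discussed above. Grouping is by exact genre string, so a more specific qualifier is tallied separately from ‘Manuscripts’, while a record carrying two genre strings contributes to both tallies.

```python
from collections import Counter

# Hypothetical (record identifier, genre string) pairs, not actual DS records.
PAIRS = [
    ("DS1", "Manuscripts"),
    ("DS1", "Manuscripts, Medieval"),   # the same record under a second genre
    ("DS2", "Manuscripts, Sanskrit"),
    ("DS3", "Manuscripts"),
]


def count_by_genre(pairs: list[tuple[str, str]]) -> Counter:
    """Tally rows per exact genre string, with no roll-up into broader genres."""
    return Counter(genre for _, genre in pairs)


genre_counts = count_by_genre(PAIRS)

# Rows are counted once per genre string, so the total number of rows can
# exceed the number of distinct records.
total_rows = sum(genre_counts.values())
distinct_records = len({record_id for record_id, _ in PAIRS})
```

In this toy input, four rows describe three records, which is one reason an n value for a genre table need not equal the number of catalog records.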
Additional visualizations of queries demonstrate the size and diversity of the DS Catalog. In Figure 6 (n = 52,439), the genres captured in Figure 5 are shown with additional parameters to reveal their institutional contexts as well. The ‘institutionLabel’ value is presented in a packed bubble graph along with a checkbox filter shown on the right-hand side of the visualization which allows researchers to limit which institutions are displayed.

Figure 6
Genre by Holding Institution.
Visualizing the locations represented by the DS Catalog is another facet of this work. Visualizing these data enables a better look into the composition of that catalog and its representation of places. The DS Catalog captures various location data for some of its resources, including cities, regions, and nations. Figure 7 (n = 18,774) represents the diversity of location data in the DS Catalog and includes the number of records in those categories.

Figure 7
Places Represented in the DS Record.
Other visualizations derived from research-oriented SPARQL queries developed during this project represent other aspects of the catalog. While some of these visualizations are simple in form, such as Figure 8 (n = 27,466), they still reveal some significant aspects of how the LOD are connected semantically and how they operate across the catalog. Figure 9 (n = 17,501), another simple visualization, captures some of the data which informed the initial interest of this project in understanding how an object’s materiality is handled across the catalog. These data, as visualized below, indicate a significant representation of paper and parchment materials for manuscript objects, with a presence in the catalog magnitudes greater than that of any other material composition.

Figure 8
Number of Records by Role.

Figure 9
Material Type in DS Catalog.
These visualizations give a partial or complete look at the data, including trends, facets of the data model, material or genre qualities captured, and other elements of the collection relevant for archival research within the digital manuscript collection. The potential for innovative data reuse through LOD research with institutional special collections connects the work of linked data special collections to emerging digital humanities scholarship (Marsh et al., 2024). Larger sets of manuscript linked data offer reuse potential in comparing materials across corpora, investigating RBM descriptive practices at scale, or modelling the unique elements of institutional RBM materials in broad and granular ways through LOD querying.
(4) Conclusion and Future Work
An initial motivation for this research was to find alternative ways of representing a larger dataset of targeted manuscript data by leveraging the LOD infrastructure of DS. The research approach of this project centers on the significance of the DS catalog as a cross-institutional collection. While this proved, in some ways, to be a more limited initial effort than anticipated, the data outputs from this research nonetheless pose a distinct inquiry into the results of transforming, harmonizing, and integrating metadata from various institutions and institutional types. Since material data are often especially relevant to scholars of premodern and early modern history, literature, philosophy, and art, connecting various descriptive data through LOD can support that scholarship (Hyvönen et al., 2021). LOD implementation in an RBM special collections context is not without significant risk, cost, and potential for compromising data within a given knowledge base. Complexities are expanded in a multi-institutional context such as the DS catalog, as explored in the above research into that collection and its organization.
Providing researchers with SPARQL queries as a starting point for humanistic research can enable them to navigate a complex cross-institutional union catalog like the Digital Scriptorium and empower a focus on aspects of these collections pertinent to their research. Additionally, visualization of these queries serves a dual purpose of modelling the research capabilities of SPARQL and Wikibase and demonstrating the structure and focus of the catalog from a higher level. While initial reception of the DS 2.0 LOD integration has been positive, a larger look at how these changes impacted researcher interaction and retention has not been undertaken at the time of writing. Further work from this project will likely include research into possible changes in user behavior and retention post-release of DS 2.0.
Though the work of this project does not investigate the full data infrastructure of the DS catalog or the DS 2.0 LOD data model in full, the current state of the work offers insights into the structure and utility of LOD knowledge bases for manuscript collections and maps a process for potential catalog-wide analysis and eventual reuse cases for linked manuscript data. The complex semantic relationships in the DS catalog are discoverable through the SPARQL query service connecting the data of this catalog to the wider Wikibase knowledge base. The records of this catalog are also discoverable through more normative search infrastructure (indexed materials searchable via a dedicated search bar with optional filters) where users can access individual records and interact with digital surrogates embedded in the record via IIIF. As DS 2.0 structures the catalog metadata as LOD, the linking of object records in this large manuscript repository allows manuscript data and metadata to connect to internal and external knowledge bases to support manuscript research.
The DS catalog represents an innovation in digital pre-modern and early modern manuscript catalogs, LOD digital collections projects, and broader digital special collections or union catalogs. It serves as a complex point of inquiry for traditional and digital humanities work as one of a small set of LOD manuscript initiatives. DS introduces institutional metadata from partner institutions into the semantic web via Wikibase and organizes those diverse materials and institutional schema into a singular union catalog of manuscript metadata connected by the LOD infrastructure under a single, unified authority.
Future work into the more complex and relational layers of DS data, as well as the potential for expanding the data visualization outputs of this project, can expand the value of the union catalog and offer future opportunities for the work to impact both digital special collections and digital humanities work. Additional work building on this project will investigate end-user behaviors and experiences, which could not be investigated within the duration of this initial research.
Notes
Acknowledgements
This research would not be possible without the opportunity, resources, and mentoring provided through The LIS Education and Data Science Integrated Network Group (LEADING) Fellowship. I am grateful for the research and educational opportunities provided by all the LEADING program staff, including those who taught various data science lessons to Fellows and mentored us throughout the Fellowship process. In particular, I would like to thank Jane Greenberg for her work in organizing the program and her mentorship during my time in the program. Her experience and perspective were vital to my success as a LEADING Fellow. I would also like to thank the mentorship team from the Digital Scriptorium project from the University of Pennsylvania Schoenberg Institute for Manuscript Studies: Lynn Ransom, Doug Emery, and L. P. Coladangelo, for the education, support, and encouragement in the development of this work. Their guidance on the construction and iteration of this research allowed me to turn my initial research questions into tangible results. In addition, I would like to thank my co-Fellow on the DS project, Jade Snelling, for her excellent work during the research fellowship mapping the DS data model and the invaluable knowledge her research provided the team and future researchers.
Competing Interests
Maisie Jones served as a volunteer copy editor for the Journal of Open Humanities Data from August 2021 to July 2023. The author has no competing interests to declare.
Author Contributions
Maisie Jones: Conceptualization, Formal Analysis, Methodology, Validation, Visualization, Writing – Original Draft, Writing – Review and Editing.
