(1) Context and Motivation
Literary studies, and especially comparative literary studies, routinely examine relations among texts, whether in the form of quotations, adaptations, recurring characters, or more general references to other works (see, for example, Kristeva, 1967; Genette, 1982; Pfister, 1985). Although central to the discipline, such relations often resist straightforward datafication because they are usually more interpretive than purely descriptive. A machine-readable data model capturing these relations should therefore preserve interpretive transparency while providing the structure needed for computational analysis and interoperability.
Semantic technologies seem to be well-suited to this challenge (see, for instance, Allemang et al., 2020). First, they share a structural affinity with intertextual relations: both take the form of networks and are realized as graphs (Tennis, 2004; Ivanovic, 2017; Oberreither, 2018, p. 136). Second, they can represent even highly complex relationships that could be difficult to model with, for example, the Guidelines for Electronic Text Encoding and Interchange (TEI) (Ciotti 2016). Furthermore, formal ontologies are explicit representations of a domain, even though they always come with specific perspectives. In this use case, for example, adopting CIDOC CRM entails an event-centric perspective on entities such as persons, places, and times (CIDOC CRM SIG, 1995 ff.). Employing LRMoo means committing to the Work–Expression–Manifestation–Item (WEMI) framework for bibliographic entities (IFLA LRMoo Working Group & CIDOC CRM Special Interest Group, 2020 ff.). Using INTRO signals a textual perspective within literary studies and the distinction between abstract features and their concrete actualizations in texts (Oberreither, 2018, 2023; Oberreither & Untner, 2018 ff.).
Wikidata, by contrast, is a general-purpose graph not restricted to the humanities or literary studies (Wikimedia, 2012 ff.). Its attraction for literary research is nonetheless clear: it provides open identifiers for millions of entities, including numerous authors and works, broad adoption of external identifiers, open community governance, and a public SPARQL endpoint (see, for example, Hube et al., 2017; Blakesley, 2018, 2020, 2022a, 2022b; Börner & Kopf 2018; Fischer & Jäschke 2022; Fischer et al. 2023; Zhao, 2023; Illmer et al. 2024; Lubin & Fischer 2024; Rohe et al. 2025; Untner, 2025a). Yet it also poses challenges: properties and items may be used inconsistently across communities, statement granularity varies widely, qualifiers essential for detailed analysis are often absent, labels shift across languages, and categories relevant to literary studies, such as genres or distinctions between motifs and topics, are only partially stabilized (for data quality in Wikidata, see, for instance, Shenoy et al., 2022; Santos et al., 2024). The use case presented here accepts Wikidata as it is, while implicitly anticipating a future in which many of these issues have been mitigated. In the meantime, it achieves controlled data integration by harvesting only selected properties and applying strict type constraints. This transforms a heterogeneous, cross-domain dataset into a domain-aware graph tailored to literary and especially comparative literary studies, grounding Wikidata’s data in humanities modeling conventions that render it more coherent and interoperable.
While the present use case focuses on Wikidata, the architecture is not bound to it. Since the inputs in the scripts are modular, any SPARQL endpoint or structured source can be harvested with minor code adaptations. The same transformation steps model the data according to the CIDOC CRM/LRMoo/INTRO framework, and the alignment stage then adds identifiers and mappings. In other words, the workflow is model-driven rather than source-dependent. This is crucial for long-term sustainability: as additional Linked Data sources become available, they can be integrated with minimal redesign.
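To make the source-independence concrete, the following minimal sketch separates query construction from endpoint access; the function names and query shape are illustrative and do not reproduce the actual wiki2crm API.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_values_query(qids, prop):
    """Build a SPARQL query restricted to the QIDs supplied in the CSV input.

    The VALUES clause is what keeps harvesting bounded to the listed scope;
    the property ID and result shape are illustrative.
    """
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/> "
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/> "
        f"SELECT ?s ?o WHERE {{ VALUES ?s {{ {values} }} ?s wdt:{prop} ?o . }}"
    )

def harvest(endpoint, query):
    """POST the query to any SPARQL endpoint and return the raw response.

    Swapping the `endpoint` argument is the only change needed to point the
    same pipeline at another Linked Data source.
    """
    data = urlencode({"query": query, "format": "json"}).encode()
    req = Request(endpoint, data=data, headers={"User-Agent": "wiki2crm-sketch"})
    with urlopen(req) as resp:
        return resp.read()
```

The downstream transformation steps then operate on the query results regardless of which endpoint produced them.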
All Python scripts are openly available on GitHub (Untner, 2025b) and as the Python package wiki2crm (Untner, 2025c). The default base URI in the scripts points to https://sappho-digital.com/, a placeholder reflecting the workflow’s origin in Sappho Digital, a dissertation project documenting the German-language reception of Sappho as Linked Data using the same data model (Untner, 2024–[2027]). Importantly, nothing in the code or model requires this domain: adopters should replace the base URI with their own namespace. The overall aim of the workflow is reusability: supplying new CSV files and a different base URI generates a graph for another corpus, modeled in the same way.
(2) Dataset Description
Repository Location
Repository Name
Zenodo
Object Name
wikidata-to-cidoc-crm_2025-11-28.zip
Format Names and Versions
The repository mainly contains Python source code and distribution files (.py, .whl, .tar.gz), along with documentation and configuration materials (.md, .pdf, .toml, .txt), RDF and data examples (.ttl, .csv), and several illustrative images (.png, and since version 2.0 also .svg).
Creation Dates
2025-04-06 to 2025-11-28
Dataset Creator
Laura Untner is the only creator. She received feedback on the relations module from Bernhard Oberreither and on the Python workflow from Lisa Poggel. Peter Andorfer suggested the Python package and the SHACL shapes.
Language
English
License
Creative Commons Attribution 4.0 International
Publication Date
2025-11-28
(3) Data Model
The data model is layered to maintain a clear distinction between description and interpretation. This section progresses from the more familiar CIDOC CRM and LRMoo layers to the specificities of INTRO, which define the project’s literary studies dimension.
To ensure that the data produced by the three modules remains consistent with the underlying data model, each module is associated with a set of SHACL shapes. These shapes do not add further modelling complexity; they simply encode the essential structural expectations of the CIDOC CRM, LRMoo, and INTRO layers. A detailed explanation of the shapes themselves is not possible here, but it should be noted that they serve as a formalized checkpoint in the workflow: after the Python scripts create the triples for authors, works, and relations, each output is automatically validated against its corresponding shapes, ensuring that the implementation remains aligned with the model’s design.
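The kind of structural expectation the SHACL shapes encode can be sketched in plain Python over (subject, predicate, object) tuples; the real workflow validates Turtle output with a SHACL engine, and the specific constraint shown here (every actualization shown by an expression must actualize exactly one feature) is only one illustrative example.

```python
def check_actualizations(triples):
    """Illustrative structural check: every INT2_ActualizationOfFeature that
    an expression shows via R18 must actualize exactly one feature via R17.
    Returns the list of violating actualization nodes (empty = conformant).
    """
    shown = {o for s, p, o in triples if p == "R18_showsActualization"}
    violations = []
    for act in shown:
        features = [o for s, p, o in triples
                    if s == act and p == "R17_actualizesFeature"]
        if len(features) != 1:
            violations.append(act)
    return violations

conformant = [
    (":expr1", "R18_showsActualization", ":act1"),
    (":act1", "R17_actualizesFeature", ":feat1"),
]
broken = [(":expr1", "R18_showsActualization", ":act2")]  # R17 link missing
```

Running such checks after each module, as the workflow does with its SHACL shapes, catches structural drift before the outputs are merged.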
(3.1) CIDOC CRM for Authors
CIDOC CRM (more precisely, its Erlangen OWL serialisation, known as the Erlangen Conceptual Reference Model (ECRM) (University of Erlangen-Nuremberg, 2010 ff.), which is aligned with the official CIDOC CRM) is employed to model information about authors (see Figure 1). Using an event-centric model allows literary texts to be situated historically and geographically. This enables queries such as: “Which texts with motif X were created by authors born before 1900?” or “Which persons referenced in works A and B were contemporaries of Z?” The CIDOC CRM layer anchors interpretation in historical data while ensuring that interpretation does not distort description.

Figure 1
Model for biographical data (simplified).
Persons are modeled as E21_Person nodes linked to instances of E42_Identifier (QID) and E82_Actor_Appellation (name), life events (E67_Birth, E69_Death) with E52_Time-Span and E53_Place, as well as E36_Visual_Item/E38_Image (image). Gender is represented as E55_Type. Inverse properties are also included, for example, places use P7i_witnessed to connect to events, significantly improving navigability and enabling bidirectional querying without relying on OWL reasoning. All assertions are supplemented with prov:wasDerivedFrom links to the relevant source QIDs, and mappings to Wikidata items are added via owl:sameAs wherever possible.
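The Figure-1 pattern can be sketched as plain triples; the URI paths (person/, identifier/, etc.) are assumptions for illustration, the real module emits Turtle with ECRM terms, inverse properties, and per-assertion provenance.

```python
def person_triples(qid, name, base="https://sappho-digital.com/"):
    """Simplified sketch of the biographical pattern for one person.
    Only a subset of the model (identifier, appellation, birth event,
    sameAs, provenance) is shown; URI minting is illustrative."""
    person = f"{base}person/{qid}"
    ident = f"{base}identifier/{qid}"
    appellation = f"{base}appellation/{qid}"
    birth = f"{base}birth/{qid}"
    wd = f"http://www.wikidata.org/entity/{qid}"
    return [
        (person, "rdf:type", "ecrm:E21_Person"),
        (person, "ecrm:P1_is_identified_by", ident),
        (ident, "rdf:type", "ecrm:E42_Identifier"),
        (ident, "rdfs:label", qid),
        (person, "ecrm:P131_is_identified_by", appellation),
        (appellation, "rdf:type", "ecrm:E82_Actor_Appellation"),
        (appellation, "rdfs:label", name),
        (person, "ecrm:P98i_was_born", birth),
        (birth, "rdf:type", "ecrm:E67_Birth"),
        (person, "owl:sameAs", wd),
        (person, "prov:wasDerivedFrom", wd),  # simplified: real output
                                              # attaches provenance per assertion
    ]
```

Because URIs are minted deterministically from QIDs, re-running the module produces the same nodes, which underpins the workflow's idempotency.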
(3.2) LRMoo for Works
LRMoo, in combination with some CIDOC CRM classes, provides the bibliographic backbone on which the modeling of (inter)textual phenomena rests (see Figure 2). The model adheres to the WEMI framework, in which abstract works are represented as F1_Work, created by F27_Work_Creation; their more concrete expressions as F2_Expression, produced by F28_Expression_Creation; printed manifestations as F3_Manifestation, resulting from F30_Manifestation_Creation; and individual items as F5_Item, created through F32_Item_Production_Event.

Figure 2
Model for bibliographical data (simplified).
In this data model, F2_Expression instances occupy the central role. This reflects the fact that in Wikidata, “literary work” very often refers neither to a manifestation (F3_Manifestation) nor to an abstract concept (F1_Work), but rather to an expression – less abstract than a work yet capable of being instantiated in multiple manifestations. Accordingly, F1_Work instances are linked only to authors, while F2_Expression instances are additionally connected to E42_Identifier, E55_Type (genre), and E73_Information_Object (digital surrogates). Titles are linked to both F2_Expression and F3_Manifestation via E35_Title, whose content is an E62_String. Publishers of F3_Manifestation_Creation events are represented as E40_Legal_Body, accompanied by E53_Place and E52_Time-Span. E52_Time-Span is also included for F28_Expression_Creation.
As in the authors module, equivalence axioms connect CRM to ECRM, as well as LRMoo to FRBRoo (Functional Requirements for Bibliographic Records object-oriented; International Working Group on FRBR and CIDOC CRM Harmonisation, 2006–2017) and its Erlangen OWL serialisation, EFRBRoo (University of Erlangen-Nuremberg, 2012 ff.). Further, mappings to Wikidata items are created wherever possible. Provenance is consistently captured through prov:wasDerivedFrom.
Two design principles follow from this approach. First, the model is expression-centric: topics, motifs, references, citations, and other (inter)textual phenomena are attached to F2_Expression instances. Second, it prioritises event contiguity: creation events serve as entry points for temporal and spatial analysis, supporting queries such as “Which expressions actualizing motif X were published in Vienna between 1800 and 1830?” or “Which recurring characters are associated with expressions from Paris and Berlin in the 20th century?”
(3.3) INTRO for (Inter)textual Phenomena
INTRO (the Intertextual, Interpictorial and Intermedial Relations Ontology), built on CIDOC CRM and LRMoo, forms the analytical core of the data model (see Figure 3). It conceptualises intertextuality as a two-step process: (i) identifying lexical intertexts such as citations and paraphrases as well as semantic features such as topics, motifs, plots, characters, and references – or, more precisely, their actualizations in specific expressions; (ii) constructing intertextual relations among expressions that share these feature actualizations and/or text passages. Every assertion is explicitly marked as an interpretation, acknowledging the interpretive conditions of literary analysis.

Figure 3
Model for (inter)textual data (simplified).
INTRO is designed to capture a wide range of (inter)textual phenomena. In the present model, the focus lies on:
Topics (INT_Topic): abstract subject matters such as “gender roles” or “suicide” (Gfrereis, 1999, p. 208).
Motifs (INT_Motif): more concrete and smaller-scale elements such as “rose” or “sea” (Gfrereis, 1999, p. 130).
Plots (INT_Plot): structural story patterns, e.g., “a person sells their soul to the devil” (Gfrereis, 1999, p. 195).
Characters (INT_Character): fictional agents, potentially based on real persons (E21_Person) (Jannidis 2004, pp. 151 ff.).
References (INT_Reference): to persons (E21_Person), places (E53_Place), and works (F2_Expression).
Text Passages (INT21_TextPassage): lexical intertexts like citations and paraphrases that can be identified as part of an expression.
Particular attention should be drawn to INTRO’s distinction between features and actualizations. In INTRO, texts do not “possess” features directly; instead, the model introduces an intermediate node, INT2_ActualizationOfFeature, which mediates between an expression and the feature it realises. Formally, this is expressed through the following chain of triples: F2_Expression → R18_showsActualization → INT2_ActualizationOfFeature; INT2_ActualizationOfFeature → R17_actualizesFeature → INT4_Feature. In this structure, the actualization – not the feature itself – is what belongs to the expression. This design allows a single feature to be realized in different ways across multiple expressions without forcing the graph to commit to a single, unified interpretation of that feature. A feature such as a plot, for instance, may be actualized affirmatively in one expression, and parodically in another. Each realization is represented as its own concrete instance of INT2_ActualizationOfFeature, linked back to a common abstract instance of INT4_Feature. Thus, INTRO is able to model not only that a feature appears in a text, but also the various ways in which it materialises across multiple texts.
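The R18/R17 chain described above can be sketched as triples; prefixes and local names are illustrative, and the hypothetical feature URI stands in for any abstract plot, motif, or topic node.

```python
def actualization_chain(expression, feature, actualization):
    """F2_Expression -R18-> INT2_ActualizationOfFeature -R17-> INT4_Feature,
    exactly the chain described in the text."""
    return [
        (expression, "intro:R18_showsActualization", actualization),
        (actualization, "rdf:type", "intro:INT2_ActualizationOfFeature"),
        (actualization, "intro:R17_actualizesFeature", feature),
    ]

# One abstract feature, realized separately (e.g., affirmatively vs.
# parodically) in two expressions via two distinct actualization nodes:
graph = (
    actualization_chain(":expressionA", ":plot_1", ":actualization_A")
    + actualization_chain(":expressionB", ":plot_1", ":actualization_B")
)
```

The two actualization nodes share one R17 target, which is precisely what lets the graph record divergent realizations of the same feature.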
The relations layer reifies the relation itself as a node, INT31_IntertextualRelation, rather than attaching a direct property such as :related between expressions. In practice, the model represents a relation through a set of explicit triples, for example: INT31_IntertextualRelation → R24_hasRelatedEntity → F2_Expression A; INT31_IntertextualRelation → R24_hasRelatedEntity → F2_Expression B. In this pattern, the INT31_IntertextualRelation instance stands at the center, and each participating expression is connected to it via R24_hasRelatedEntity. This approach has several advantages: multiple entities like expressions and features can be associated with a single relation, substantiating claims of similarity; relations can be easily typed and qualified; and each can be explicitly identified as interpretation. In this way, the graph asserts that the relation is a scholarly claim, not a factual statement.
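The reified pattern can be sketched as follows; only R24_hasRelatedEntity, named in the text, is used, and further typing or qualification of the relation node is omitted.

```python
def relation_triples(relation, expressions):
    """Reified relation: an INT31 node at the center, with each participating
    expression attached via R24_hasRelatedEntity. Each additional participant
    costs only one triple, which is why reification scales to n-ary relations,
    unlike a direct expression-to-expression property."""
    triples = [(relation, "rdf:type", "intro:INT31_IntertextualRelation")]
    for expr in expressions:
        triples.append((relation, "intro:R24_hasRelatedEntity", expr))
    return triples
```

Because the relation is itself a node, further statements (its type, its supporting evidence, its interpretation) can be attached to it directly.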
The ontological role of INT_Interpretation is to safeguard interpretive plurality. All textual features and relations in the model can be tied to an interpretation, effectively stating: this relation is asserted, not a fact. As in the other modules, all features are also linked to Wikidata QIDs wherever possible, and all assertions carry prov:wasDerivedFrom references. In this implementation, the sources for assertions are Wikidata QIDs, though they could equally point to scholarly publications or other references. Because interpretations are entities themselves, the graph can represent multiple interpretations of the same relation – concordant, divergent, or differently justified – without collapsing them into a single authoritative account. For literary studies, such pluralism is indispensable, since competing readings are not anomalies but constitutive of the field.
(3.4) Mappings and Alignments
Although the data model is based on standards like CIDOC CRM, LRMoo, and INTRO, mapping and alignment remain essential to maximize interoperability. This part of the workflow focuses only on the classes and properties used in this data model, and especially on those central to the relations module.
In addition to the mappings between ECRM and CIDOC CRM, and between LRMoo, FRBRoo, and EFRBRoo (using OWL equivalence statements), the workflow incorporates external identifiers from the Virtual International Authority File (VIAF), the German National Library’s Integrated Authority File (GND), GeoNames, DBpedia, Schema.org, and Goodreads. It also establishes alignments – mainly using SKOS (Simple Knowledge Organization System) relations and OWL property chain axioms – with ontologies relevant to literary studies and the modeling of textual relations, including the Bibliographic Ontology (BIBO; D’Arcus & Giasson, 2008 ff.), the FRBR-aligned Bibliographic Ontology (FaBiO) as well as the Citation Typing Ontology (CiTO; both Peroni & Shotton, 2012), the Drama Corpora (DraCor) Ontology (Börner, n.d.), the GOLEM Ontology for Narrative and Fiction (Pianzola et al., 2024), the MiMoText Ontology (Hinzmann et al., 2024a, b), the OntoPoetry Ontology (POSTDATA, n.d.), and the Intertextuality Ontology (Horstmann et al., 2022 ff.). Further mappings are provided to more general ontologies such as the DCMI Metadata Terms (DCMI Usage Board, 2002 ff.), the Document Components Ontology (DoCo; Constantin et al., 2016), and the Friend of a Friend (FOAF) ontology (Brickley & Miller, 2004 ff.). A full table documenting all class and property alignments is available in the corresponding GitHub repository.
(4) Workflow
The workflow is implemented as five Python modules, which may run independently or as a pipeline (see Figure 4). Its design is deliberately modular to facilitate reuse and maintenance, and it is idempotent: re-running a module on the same inputs does not compromise data integrity. In the current implementation, the output is serialized in Turtle.

Figure 4
Workflow overview.
Inputs consist of CSV files listing QIDs for authors or works, depending on the module. The authors module expects author QIDs; the works and relations modules expect work QIDs. No automated checks are performed to confirm whether the QIDs actually represent authors or works. This flexibility allows, for instance, the inclusion of referenced persons alongside authors. Ultimately, the scope depends on the QIDs provided in the CSV files, which must be compiled beforehand. The CSVs also act as explicit scope delimiters: relations are computed only among the listed IDs – the scripts do not crawl the entirety of Wikidata.
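Reading the scope list can be sketched as below; the column name is an assumption here, so adopters should check the exemplary inputs in the repository for the expected header.

```python
import csv
import io

def read_qids(csv_text, column="qid"):
    """Read the QID scope list from CSV text. Deduplicates while preserving
    order, since the list both defines the harvest scope and bounds the
    pairwise comparisons in the relations module. Column name is illustrative.
    """
    seen, qids = set(), []
    for row in csv.DictReader(io.StringIO(csv_text)):
        qid = row[column].strip()
        if qid and qid not in seen:
            seen.add(qid)
            qids.append(qid)
    return qids
```

Note that nothing here verifies that a QID denotes an author or a work; as stated above, that flexibility is intentional.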
It is worth noting that identifying suitable properties and items for information integration was not always straightforward, given the scale and complexity of Wikidata’s data model. Completeness cannot be guaranteed.
The workflow produces the following outputs:
authors.ttl (CIDOC CRM layer),
works.ttl (LRMoo layer),
relations.ttl (INTRO layer),
merged.ttl (consolidated graph), and
an enriched graph generated during the mapping and alignment process.
At present, the code employs https://sappho-digital.com/ as a placeholder base URI; adopters are advised to replace this with their own namespace prior to publication.
All scripts and SHACL shapes, as well as exemplary inputs and outputs, are available in the stated GitHub repository and in the Python package wiki2crm. The package bundles all five modules together, allowing users to run the workflow either from the command line or as a Python library.
(4.1) Authors Module
The authors module constructs a CIDOC CRM representation of each listed person and their biographical context, following the model outlined above. It queries labels (names), P21 (gender), P569 and P570 (birth and death dates), P19 and P20 (birth and death places), and P18 (images).
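One plausible way to fetch these properties for a single QID is with OPTIONAL blocks, so that missing values do not suppress the row; the actual module's query may be shaped differently.

```python
# Property IDs taken from the text; variable names are illustrative.
AUTHOR_PROPS = {
    "P21": "gender", "P569": "birth_date", "P570": "death_date",
    "P19": "birth_place", "P20": "death_place", "P18": "image",
}

def author_query(qid):
    """Assemble a SPARQL query for one author QID, with each property
    wrapped in OPTIONAL so sparse records still return partial data."""
    optionals = " ".join(
        f"OPTIONAL {{ wd:{qid} wdt:{pid} ?{name} . }}"
        for pid, name in AUTHOR_PROPS.items()
    )
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/> "
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/> "
        f"SELECT * WHERE {{ {optionals} }}"
    )
```

Tolerating absent values matters in practice, since qualifiers and statements relevant to literary studies are often missing in Wikidata.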
For each QID, the logic proceeds as follows: An E21_Person node is created and linked to an E42_Identifier containing the QID. An E82_Actor_Appellation is added with the label. If dates or places are present, E67_Birth and/or E69_Death events are created, anchored to an E52_Time-Span and associated with E53_Place nodes. Gender is represented as E55_Type, linked via type-of-type to a “Wikidata Gender” node. Where images exist, E36_Visual_Item and E38_Image are added, with rdfs:seeAlso pointers to Wikimedia Commons.
The result is a Turtle file containing biographical information as an owl:Ontology. Mappings from CIDOC CRM to ECRM are provided through OWL equivalence statements. Inverse properties are consistently included; every assertion carries prov:wasDerivedFrom, and owl:sameAs statements are added wherever appropriate.
(4.2) Works Module
The works module constructs an LRMoo WEMI chain for each listed work, enriched with essential contextual information. It queries P1476 (title), P50 (author), P136 (genre), P571 or P2754 (creation date), P577 (publication date), P291 (publication place), P123 (publisher), P98 (editor), P953 (digital copy), P1433 or P361 (works that include this work), and labels (to populate titles when P1476 is missing).
For each QID, the logic proceeds as follows: An F1_Work is created together with an F27_Work_Creation, linked to the respective E21_Person author node(s). An F2_Expression is then minted and connected to the work, with an E35_Title containing an E62_String attached to it. The QID itself is represented as an E42_Identifier, while genre is modeled as E55_Type, and digital surrogates as E73_Information_Object. If P571 or P2754 values are present, an E52_Time-Span is added to the F28_Expression_Creation. The chain continues with F3_Manifestation and an F30_Manifestation_Creation, which includes E53_Place and E52_Time-Span nodes and links publishers as E40_Legal_Body. Information on editors and additional titles may also be incorporated. Finally, an F5_Item and F32_Item_Production_Event are created, without further detail.
The output is a Turtle file containing WEMI chains for the specified work QIDs, represented as an owl:Ontology. Equivalence mappings between LRMoo, FRBRoo, and EFRBRoo are added using OWL statements. As in the authors module, inverse properties are consistently included, every assertion carries prov:wasDerivedFrom, and owl:sameAs links are created wherever appropriate.
(4.3) Relations Module
The relations module constructs the interpretive layer of the data model by deriving (inter)textual data from Wikidata and modeling it using INTRO. Querying is more complex here than in the other modules, since properties and items may be used very inconsistently, and categories relevant to literary studies are only partially stabilized in Wikidata.
(4.3.1) (Inter)Textual Data in Wikidata
As noted above, the workflow takes Wikidata as it is but anticipates a future in which many of its limitations are resolved. To clarify the present availability of relevant data, some query results are summarized below. All were constrained to Q7725634 (literary work) as subject – and as object, in the case of work references, text passages, and more general intertextual relations. Subclasses and subproperties were included in the Python scripts, but not in the illustrative query results.1 Because pairwise comparisons among expressions scale quadratically, the script restricts runtime costs by limiting scope to the input list. Performance nonetheless decreases with larger datasets.
Topics (INT_Topic): Queried via P921 (main subject) with Q26256810 (topic; see Table 1). Since P180 (depicts) returned only three results, it was not integrated.
Table 1
Topics query results.
| QUERY | RESULTS |
|---|---|
| Q7725634 (literary work) – P921 (main subject) – Q26256810 (topic) | 317 results |
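A query approximating the figures in Table 1 might look as follows; this is a simplified sketch that, unlike the Python scripts, omits the subclass and subproperty expansion (e.g., via wdt:P279* paths).

```sparql
# Count literary works whose main subject is typed as a topic
# (simplified; the actual scripts also include subclasses/subproperties).
SELECT (COUNT(DISTINCT ?work) AS ?count) WHERE {
  ?work wdt:P31 wd:Q7725634 ;      # instance of: literary work
        wdt:P921 ?subject .        # main subject
  ?subject wdt:P31 wd:Q26256810 .  # instance of: topic
}
```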
Motifs (INT_Motif): Queried via P6962 (narrative motif; see Table 2). Other approaches using P180 with Q1697305 (narrative motif) or Q68614425 (motif) returned negligible results and were not included.
Table 2
Motifs query results.
| QUERY | RESULTS |
|---|---|
| Q7725634 (literary work) – P6962 (narrative motif) – ?obj | 2847 results |
Plots (INT_Plot): Queried analogously to topics, using P921 with Q42109240 (stoff). This category produced the fewest results (see Table 3).
Table 3
Plots query results.
| QUERY | RESULTS |
|---|---|
| Q7725634 (literary work) – P921 (main subject) – Q42109240 (stoff) | 19 results |
Characters (INT_Character): Queried via P674 (characters) or, alternatively, via P180 and P921 if the item was an instance of Q3658341 (literary character) or Q15632617 (fictional human; see Table 4). Characters could also be linked to real persons (E21_Person) when the object item was an instance of Q5 (human). Other potential queries (e.g., with Q97498056, Q122192387, or Q115537581) produced negligible results and were excluded.
Table 4
Characters query results.
| QUERY | RESULTS |
|---|---|
| Q7725634 (literary work) – P674 (characters) – ?obj | 18092 results |
| Q7725634 (literary work) – P180 (depicts) – Q3658341 (literary character) | 15 results |
| Q7725634 (literary work) – P921 (main subject) – Q3658341 (literary character) | 149 results |
| Q7725634 (literary work) – P180 (depicts) – Q15632617 (fictional human) | 11 results |
| Q7725634 (literary work) – P921 (main subject) – Q15632617 (fictional human) | 102 results |
References (INT_Reference): Person references are queried via P180, P921, and P527 (has part) with Q5 (see Table 5).
Table 5
Person references query results.
| QUERY | RESULTS |
|---|---|
| Q7725634 (literary work) – P180 (depicts) – Q5 (human) | 718 results |
| Q7725634 (literary work) – P921 (main subject) – Q5 (human) | 4849 results |
| Q7725634 (literary work) – P527 (has part) – Q5 (human) | 38 results |
Place references are queried via P921 with Q2221906 (geographic location; see Table 6). P180 and P527 returned negligible results.
Table 6
Place references query results.
| QUERY | RESULTS |
|---|---|
| Q7725634 (literary work) – P921 (main subject) – Q2221906 (geographic location) | 44 results |
Work references are queried via P921, constrained to QIDs in the CSV input (here: Q7725634 on both sides; see Table 7). P361 (part of) and P527 were excluded because they are mainly used to state that a text is part of another text. P1299 (depicted by) and P180 were also excluded for negligible results.
Table 7
Work references query results.
| QUERY | RESULTS |
|---|---|
| Q7725634 (literary work) – P921 (main subject) – Q7725634 (literary work) | 263 results |
Text Passages (INT21_TextPassage): Citations and paraphrases are queried via P2860 (cites work) and P6166 (quotes work), constrained to the provided QIDs (here: Q7725634 on both sides; see Table 8).
Table 8
Text passages query results.
| QUERY | RESULTS |
|---|---|
| Q7725634 (literary work) – P2860 (cites work) – Q7725634 (literary work) | 208 results |
| Q7725634 (literary work) – P6166 (quotes work) – Q7725634 (literary work) | 40 results |
Intertextual Relationships (INT31_IntertextualRelation): More general intertextual relations are queried via P4969 (derivative work), P144 (based on), P5059 (modified version of), and P941 (inspired by), constrained to the CSV-listed QIDs (here: Q7725634 on both sides; see Table 9). P2512 (has spin-off) returned no results and was excluded.
Table 9
Intertextual relationships query results.
| QUERY | RESULTS |
|---|---|
| Q7725634 (literary work) – P4969 (derivative work) – Q7725634 (literary work) | 1054 results |
| Q7725634 (literary work) – P144 (based on) – Q7725634 (literary work) | 1993 results |
| Q7725634 (literary work) – P5059 (modified version of) – Q7725634 (literary work) | 16 results |
| Q7725634 (literary work) – P941 (inspired by) – Q7725634 (literary work) | 234 results |
In summary, the query results illustrate both the potential and the limitations of Wikidata for modeling (inter)textual phenomena. Certain categories, such as motifs (P6962), characters (P674), person references via P921, and general intertextual relations via P4969 and P144, are well populated and thus the most promising for large-scale analysis. Others, such as topics, place references, work references, citations, and paraphrases, or some character/person queries, yield intermediate but usable quantities. A few categories, notably plots, person references via P527, and intertextual relations via P5059, remain too sparsely represented to support substantial analysis. Overall, Wikidata provides a heterogeneous yet promising foundation: some (inter)textual features are already sufficiently populated for broader research, while others will require further community curation before enabling comprehensive analysis.
(4.3.2) Construction Logic
For each QID, the workflow ensures that an F2_Expression instance exists (from the works module). Detected features are then modeled as follows: the relevant feature node is either created or reused, and an INT2_ActualizationOfFeature is added to link the feature to the expression it realises, always connected to an INT_Interpretation. Citations and paraphrases trigger the creation of INT21_TextPassage nodes representing the reused text passage. When two or more expressions share a feature or text passage, a similarity-based INT31_IntertextualRelation is asserted, listing the shared elements as evidence and linking the relation itself to an interpretation node.
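The shared-feature step can be sketched as follows; the input mapping and the relation URI minting are illustrative, and the real module additionally attaches interpretation nodes and provenance.

```python
from collections import defaultdict
from itertools import combinations

def infer_relations(actualized_features):
    """Given {expression URI: set of feature URIs} (as detected from
    Wikidata), assert one similarity-based relation per expression pair
    that shares at least one feature, recording the shared features as
    the evidence substantiating the relation."""
    by_feature = defaultdict(set)
    for expr, feats in actualized_features.items():
        for f in feats:
            by_feature[f].add(expr)
    shared = defaultdict(set)
    for feat, exprs in by_feature.items():
        for a, b in combinations(sorted(exprs), 2):
            shared[(a, b)].add(feat)
    return {f":relation_{i}": (pair, feats)
            for i, (pair, feats) in enumerate(sorted(shared.items()))}
```

Note the quadratic pairwise comparison, which is why, as stated above, scope is limited to the CSV input list.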
The output is a Turtle file containing an owl:Ontology with intertextual information and interpretive statements that overlay – but do not alter – the WEMI layer, thereby preserving the distinction between description and interpretation. As in the other modules, equivalence mappings are provided through OWL statements, inverse properties are consistently included, owl:sameAs links wherever appropriate, and every assertion is accompanied by prov:wasDerivedFrom. At present, provenance always refers back to Wikidata entities, though future iterations may also incorporate references to scholarly publications or other sources.
(4.4) Merge Module
The merge module produces a single, coherent RDF graph – formally, an owl:Ontology – by combining the outputs of the preceding modules. Its central tasks include deduplicating shared nodes (for example, persons who appear both as authors and as references) and verifying that all expression URIs referenced in relations are present. The result is a consolidated Turtle file containing all triples generated by the previous modules.
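Both tasks can be sketched over plain triple sets; in RDF, graph union behaves analogously, and deduplication works because URIs are minted deterministically from QIDs across modules.

```python
def merge_graphs(*graphs):
    """Set union over (s, p, o) tuples: identical triples emitted by several
    modules (e.g., for a person who is both author and reference) collapse
    automatically."""
    merged = set()
    for g in graphs:
        merged.update(g)
    return merged

def missing_expressions(relation_triples, expression_uris):
    """Sanity check sketch: return expression URIs referenced by relations
    (via R24_hasRelatedEntity) that are absent from the merged graph."""
    referenced = {o for s, p, o in relation_triples
                  if p == "intro:R24_hasRelatedEntity"}
    return referenced - set(expression_uris)
```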
(4.5) Map and Align Module
In addition to the constant mappings between ECRM and CIDOC CRM, and between LRMoo, FRBRoo, and EFRBRoo, the map-and-align module enriches the merged graph with external identifiers and mappings. For persons, works, places, and features, it can incorporate identifiers from GND (P227), VIAF (P214), GeoNames (P1566), Goodreads (P8383), DBpedia, and Schema.org (both via P1709 and regular expressions).
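For the regular-expression route, a plausible sketch is shown below; the assumption that English Wikipedia sitelinks map to DBpedia resource URIs by rewriting host and path is illustrative, and the module's actual patterns may differ.

```python
import re

# Assumed convention for illustration: en.wikipedia.org/wiki/<Title>
# corresponds to dbpedia.org/resource/<Title>.
WIKIPEDIA_RE = re.compile(r"^https?://en\.wikipedia\.org/wiki/(.+)$")

def dbpedia_uri(sitelink):
    """Derive a candidate DBpedia URI from an English Wikipedia sitelink,
    or return None when the link does not match the expected pattern."""
    m = WIKIPEDIA_RE.match(sitelink)
    return f"http://dbpedia.org/resource/{m.group(1)}" if m else None
```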
As stated above, the module also asserts mappings for classes and properties to other ontologies relevant to literary studies and the modeling of textual relations, including BIBO, FaBiO, and CiTO, the DraCor, GOLEM, and MiMoText Ontologies, the OntoPoetry and the Intertextuality Ontologies, as well as to more general ontologies such as DCMI Metadata Terms, DoCo, and FOAF.
The output is an enriched Turtle file that integrates external identifiers and ontology alignments, allowing the dataset to circulate across multiple infrastructures without binding it to a single ontology stack.
(5) Practical Considerations and Future Scenarios
As stated above, Wikidata’s strengths – among them, its wide range of freely available data and its community-driven structure – are matched by its challenges: properties are often polysemous across domains, levels of granularity are inconsistent, class typing is uneven, labels vary across languages, and information relevant to literary studies is absent or only partially stabilized. The project presented here does not attempt to “fix” Wikidata; instead, it adapts to these conditions by applying several constraints to the information retrieval processes and re-modeling the data. Even when upstream data is sparse, the workflow can therefore provide tangible benefits, above all because it allows freely available, heterogeneous data to be modeled in a way that is more informed by humanities research.
In practice, the workflow results in a clean, basic person graph built on CIDOC CRM; WEMI chains based on LRMoo that anchor interpretive analyses and integrate with existing GLAM infrastructures; and an interpretive layer with INTRO (drawing on both CIDOC CRM and LRMoo) that models textual features and intertextual relationships with explicit interpretation statements that keep claims auditable. Once the mappings and alignments are applied, the result is a dataset that can circulate across multiple ecosystems – especially within the Digital Humanities.
The specific Wikidata queries discussed certainly contribute to surveying the current state of data available for literary studies, but they are not what makes the workflow distinctive – those queries could just as well be executed directly on Wikidata. What is distinctive is the domain-aware modeling of the retrieved data and the diverse mappings and alignments, which not only provide ontological context but also make the resulting dataset more compatible with other Digital Humanities projects. Wikidata, after all, has no built-in understanding of the overlaps between the aligned ontologies, nor of the domain-specific modeling conventions used in literary studies. The workflow, however, takes precisely these aspects into account. This makes it possible, for example, to rapidly extract subsets of Wikidata that are structurally comparable to the data produced for the Sappho Digital project. So, even with partial data, the workflow offers a reusable, time-saving approach especially for digital comparative literary studies – one that establishes a Linked Data layer much more closely informed by humanities methods and standards than what Wikidata alone can provide.
The broader vision anticipates a future in which Wikidata contains richer information relevant to literary studies. As communities contribute more statements, the relations module will detect additional features and relations. Crucially, the workflow is not limited to Wikidata: the same architecture can be applied to any SPARQL endpoint or other structured dataset offering comparable statements. In this sense, Wikidata functions as a use case rather than a constraint. As domain-specific SPARQL endpoints emerge, the workflow can pivot to them seamlessly while retaining the CIDOC CRM/LRMoo/INTRO data model and alignment layer. The more enrichment that occurs upstream, the more powerful and precise the resulting graph becomes – without substantial redesign of the workflow.
From a practical perspective, adopters are advised to proceed iteratively. CSVs of author and work QIDs should be curated carefully, keeping in mind that the relations module compares only within the supplied work list. Scope can then be expanded gradually, with the expectation that (inter)textual features missing from Wikidata may result in thin INTRO layers. To address this, users may consider contributing statements back to Wikidata to improve future iterations, or enriching graphs manually with other workflows or external sources.
(6) Conclusion
This paper has presented a modular, layered data model and a Python workflow that transforms Wikidata statements into a domain-aware graph specifically designed for digital comparative literary studies. Its main contributions are the data model, the identification of relevant Wikidata properties and items for information retrieval, and the alignments with ontologies pertinent to literary studies and the modeling of textual relations.
Conceptually, the workflow is guided by the principle of separating description from interpretation. CIDOC CRM and LRMoo are used for representing persons, places, times, and bibliographic entities, while INTRO captures (inter)textual phenomena as interpretations, distinguishing features, and their actualizations, as well as text passages and relations. Ethically, transparency is paramount: every assertion is traceable to its sources, and competing interpretations in the INTRO layer can coexist rather than being collapsed into a single authoritative account.
In the end, working with Wikidata is productive yet challenging. This project addresses these challenges through type constraints, property triangulation, and explicit provenance statements. At the same time, it is designed with the future in mind: as data in Wikidata relevant to literary studies expands, and as other domain-specific Linked Data sets emerge, the same architecture can support denser graphs without significant redesign. The placeholder namespace reflects the model’s origins in the Sappho Digital project but not its limits: the data model as well as the workflow should be generalizable and adaptable to other projects in digital comparative literary studies. To this end, the scripts are also distributed on GitHub and as the Python package wiki2crm.
Finally, the broader invitation is collaborative. If scholarly communities contribute statements about (inter)textual phenomena to shared repositories, workflows such as the one presented here can transform scattered and fragmentary knowledge into structured, queryable infrastructures for literary research – without sacrificing the interpretive plurality that is essential to the field.
Notes
[1] All queries were run on 30 September 2025. The query structure was as follows:
SELECT ?item ?itemLabel ?obj ?objLabel
WHERE {
  ?item wdt:{Property} ?obj .   # add property
  ?item wdt:P31 wd:Q7725634 .   # query only literary works as subjects
  ?obj wdt:P31 wd:{Object} .    # add object (if necessary for the query)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}
Acknowledgements
Special thanks to Bernhard Oberreither for feedback on the relations module, Lisa Poggel for feedback on the Python workflow, and Peter Andorfer for his valuable input.
Competing Interests
The author has no competing interests to declare.
Author Contributions
Laura Untner: Conceptualization, Data curation, Methodology, Software, Validation, Writing – original draft, Writing – review & editing.
