Gazetteers of Latin Authors and Works for Chronological Modelling in the LiLa Knowledge Base

Matteo Pellegrini; Francesco Mambrini; Giovanni Moretti; Marco Passarotti

doi:10.5334/johd.558

1 Overview

Repository location

https://zenodo.org/records/20038372

Context

This dataset was produced in the context of the work on the LiLa (Linking Latin) knowledge base (Passarotti et al., 2020),¹ whose aim is to enhance interoperability between the resources and tools available for the Latin language. This is achieved by leveraging best practices of the Linguistic Linked Open Data (LLOD) paradigm (Cimiano et al., 2020), following W3C standards such as RDF to represent information and SPARQL to retrieve it.

Nowadays, the LiLa knowledge base connects resources pertaining to different epochs, ranging from Early Latin – e.g., the Opera Latina corpus by LASLA (Fantoli et al., 2022) – to Neo-Latin – e.g., the encyclicals by the late Pope Francis (Alagni et al., 2025). However, a diachronic exploration of resources in LiLa was hindered by the lack of an explicit coding of chronological information: to extract texts pertaining to a specific epoch, one needed to explicitly restrict queries by manually identifying authors or works of that epoch. Some chronological information could be obtained from other data sources that are linked to LiLa: for instance, many documents are connected to their author in Wikidata, and information on birth and death dates is provided there. However, this is not enough to allow for a systematic restriction to specific time spans: not all authors of texts in LiLa are in Wikidata, and for some texts, the author is unknown. In this paper, we present the work that led to the creation of gazetteers for authors and works documented in LiLa, to enhance the potential for a systematic diachronic investigation of the resources linked to LiLa.

2 Method

Steps

The first step of our procedure consists in the identification of the most appropriate existing vocabularies and ontologies and their extension with new classes and properties.

We start with a short description of the already existing vocabularies and ontologies that we reuse. Those include general-purpose Semantic Web vocabularies, such as rdf, rdfs² and skos.³ The dcterms⁴ vocabulary allows for the structured coding of metadata of several kinds, among which authorship is the most relevant in this context. The CIDOC crm⁵ was created with the main objective of providing a controlled vocabulary for the documentation of cultural heritage institutions, but it is arguably general enough to be applied to any kind of intangible heritage. In this work, we use it to define events (for works) and the people involved in them (for authors), and to connect them with each other. The Library Reference Model (lrmoo)⁶ is an extension of the CIDOC crm that provides additional classes and properties for the treatment of bibliographic data. It identifies different degrees of concreteness in the real-world correlates of bibliographic items, ranging from abstract works (e.g., Agatha Christie’s ‘Murder on the Orient Express’), to their expressions (e.g., its original English text), manifestations (e.g., a specific edition), up to concrete embodiments (e.g., a physical copy).

We now turn to a description of how the information is modelled using those vocabularies in our gazetteers. Regarding the gazetteer of authors (Figure 1), each item is assigned to it through the property dcterms:isPartOf and defined as an instance of the class crm:E39_Actor. It is then assigned one or more human-readable labels, among which the preferred one is given via the properties skos:prefLabel and rdfs:label, and the others via the property skos:altLabel. Documents in textual resources published as LLOD in the LiLa knowledge base are linked to their authors via the property dcterms:creator. Links are provided to the CTS URN corresponding to each author using the property dcterms:identifier, and to the corresponding items in Wikidata⁷ using the property skos:exactMatch. In the gazetteer of works (Figure 2), each item is defined as an instance of the class lrmoo:F1_Work and assigned an rdfs:label. Links are provided to the CTS URN corresponding to each work using the property dcterms:identifier, and to the corresponding items in Wikidata and in the Digital Latin Library⁸ (where available) using the property skos:exactMatch. Furthermore, we define creation events – instances of lrmoo:F27_Work_Creation – and link them to works via the property lrmoo:R16_created and to authors via the property crm:P14_carried_out_by.
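As an illustration of the modelling just described, the following Python sketch prints a Turtle fragment for one author, one work and the creation event connecting them. It is a minimal sketch and not the dataset's actual serialization: every example.org IRI and every label is a hypothetical placeholder, the lrmoo namespace IRI is an assumption, and the dcterms:identifier and skos:exactMatch links are left out because their exact form in the gazetteers is not shown in this paper.

```python
# Namespace IRIs for the reused vocabularies; the lrmoo IRI is an assumption.
PREFIXES = """\
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix crm:     <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix lrmoo:   <http://iflastandards.info/ns/lrm/lrmoo/> .
"""


def author_and_work_turtle(gazetteer, author, work, event, author_label, work_label):
    """Return Turtle linking an author, a work and the work's creation event."""
    return PREFIXES + f"""
<{author}> a crm:E39_Actor ;
    dcterms:isPartOf <{gazetteer}> ;
    skos:prefLabel "{author_label}" ;
    rdfs:label "{author_label}" .

<{work}> a lrmoo:F1_Work ;
    rdfs:label "{work_label}" .

<{event}> a lrmoo:F27_Work_Creation ;
    lrmoo:R16_created <{work}> ;
    crm:P14_carried_out_by <{author}> .
"""


if __name__ == "__main__":
    # All IRIs and labels below are hypothetical placeholders.
    print(author_and_work_turtle(
        gazetteer="https://example.org/gazetteer/authors",
        author="https://example.org/author/1",
        work="https://example.org/work/1",
        event="https://example.org/work/1/creation",
        author_label="Example Author",
        work_label="Example Work",
    ))
```

The creation event is the only node that touches both the work and the author, which is why a query going from a work to its author in this model has to pass through lrmoo:F27_Work_Creation.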

We finally get to the modelling of chronological information. Pellegrini et al. (2025) have focussed on this aspect in the context of DynaMorphPro, a lexical resource that provides information on the dating of borrowings, conversions and class-shifts into Latin and Italian. Their model is rich and complex, allowing for the definition of time spans pertaining to items of different kinds. In this work, we opt for a much more lightweight model that allows for easier queries. We introduce two new datatype properties, lila:startDate and lila:endDate, that have in their range the xsd:dateTimeStamp corresponding to the beginning and the end of the time spans to which an item pertains, respectively. Their domain is unrestricted, so that they can be used to express chronological information on different items as needed. To guarantee interoperability with established standards, they are defined as subproperties of dcterms:date.
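The two properties could be declared along the following lines. This is a sketch under assumptions: the lila namespace IRI and the use of owl:DatatypeProperty are not taken from the text above, and no rdfs:domain is stated, which reflects the unrestricted domain.

```python
# Sketch of the property declarations; the lila namespace IRI is an assumption.
HEADER = """\
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix lila:    <http://lila-erc.eu/ontologies/lila/> .
"""


def declare_date_properties(names=("startDate", "endDate")):
    """Return Turtle declaring each name as a datatype subproperty of dcterms:date."""
    blocks = [
        f"lila:{name} a owl:DatatypeProperty ;\n"
        f"    rdfs:subPropertyOf dcterms:date ;\n"
        f"    rdfs:range xsd:dateTimeStamp ."
        for name in names
    ]
    return HEADER + "\n" + "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(declare_date_properties())
```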

The next step is the extraction of chronological information. Many of the authors and works represented in the LiLa knowledge base were already linked to the corresponding named individuals in Wikidata.⁹ For authors, dating of some kind is commonly provided in Wikidata, and we leverage this to systematically extract values to be used as objects of the properties lila:startDate and lila:endDate. In some cases, the dates of birth and/or death are stated explicitly, through the Wikidata properties wd:P569 “date of birth” and wd:P570 “date of death”. In other cases, more generic information is provided through the Wikidata property wd:P1317 “floruit”, expressing the “date when the person was known to be active or alive, when birth or death not documented”,¹⁰ or through other properties;¹¹ in such cases, we take the dates of the beginning and end of the century indicated in Wikidata as lila:startDate and lila:endDate. Regarding authors for whom there is no link to Wikidata,¹² we have to resort to other sources. In many cases, these are authors of Late Antiquity represented in DigilibLT.¹³ Therefore, we manually recovered their dating from the information provided on the web pages of authors in that project, following the same strategy as for information from Wikidata concerning different degrees of granularity (i.e., exact dates vs. generic information on the century).¹⁴
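The fallback from a century to a pair of boundary dates can be sketched as follows. The conventions used here are assumptions and not necessarily those of the dataset: the Nth century CE is taken to run from year (N-1)*100+1 to year N*100, the boundaries are placed at the first and last second of those years in UTC, and BCE years are written in the astronomical numbering of XSD 1.1, where 0000 stands for 1 BCE. The function names are hypothetical.

```python
def century_years(century, bce=False):
    """Return (first_year, last_year) of a century as historical years.

    BCE years are negative and there is no year zero, so the 1st century
    BCE is (-100, -1) and the 1st century CE is (1, 100).
    """
    if century < 1:
        raise ValueError("century must be a positive integer")
    if bce:
        return -(century * 100), -((century - 1) * 100 + 1)
    return (century - 1) * 100 + 1, century * 100


def to_timestamp(year, end=False):
    """Format a historical year as an xsd:dateTimeStamp lexical form (UTC)."""
    if year == 0:
        raise ValueError("historical years have no year zero")
    # XSD 1.1 numbers years astronomically: 1 BCE is 0000, 2 BCE is -0001.
    astro = year if year > 0 else year + 1
    sign = "-" if astro < 0 else ""
    suffix = "12-31T23:59:59Z" if end else "01-01T00:00:00Z"
    return f"{sign}{abs(astro):04d}-{suffix}"


def century_span(century, bce=False):
    """Return the (start, end) timestamps covering a whole century."""
    first, last = century_years(century, bce)
    return to_timestamp(first), to_timestamp(last, end=True)


if __name__ == "__main__":
    print(century_span(2))            # 2nd century CE
    print(century_span(1, bce=True))  # 1st century BCE
```

Under these assumptions, an author known only as belonging to the 2nd century CE would receive a start date in the year 101 and an end date in the year 200.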

3 Dataset Description

Repository name

Zenodo

Object name

CIRCSE/LiLa-Gazetteers-v1.1.zip

Format names and versions

.ttl, .md

Creation dates

2025-09-01–2026-05-05

Dataset creators

Matteo Pellegrini (University of Surrey), Francesco Mambrini, Giovanni Moretti, Marco Passarotti (Università Cattolica del Sacro Cuore)

Language

Latin, English (metalanguage)

License

CC BY-SA 4.0 International

Publication date

2026-05-05

4 Reuse Potential

The main motivation behind the creation of this dataset is its potential to be reused by other researchers to perform chronologically restricted queries. Such a possibility can be helpful for scholars working in diverse fields, ranging from computational linguistics to literary studies. In Figure 3, we show an example of a SPARQL query that extracts texts by authors whose lila:endDate falls within the 2nd century CE and counts the number of tokens available for each of them. A sample of the results is given in Figure 4.¹⁵

Figure 3. SPARQL query that extracts works by authors of the 2nd Century CE.

Figure 4. Sample of the results of the query of Figure 3.
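For readers without access to the figures, a query of this kind can be sketched as follows. This is not the query of Figure 3: it omits the token count, because the properties linking tokens to documents are not described in this paper, and it assumes the lila namespace IRI as well as an endpoint that compares xsd:dateTimeStamp literals directly. The endpoint address in the usage example is a hypothetical placeholder.

```python
from urllib.parse import urlencode

# Doubled braces are literal braces in str.format templates.
QUERY_TEMPLATE = """\
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
PREFIX lila:    <http://lila-erc.eu/ontologies/lila/>

SELECT ?author ?doc ?end
WHERE {{
  ?doc    dcterms:creator ?author .
  ?author lila:endDate    ?end .
  FILTER (?end >= "{start}"^^xsd:dateTimeStamp &&
          ?end <= "{end}"^^xsd:dateTimeStamp)
}}
ORDER BY ?end
"""


def build_query(start, end):
    """Return a SPARQL query for documents whose author's end date is in [start, end]."""
    return QUERY_TEMPLATE.format(start=start, end=end)


def request_url(endpoint, query):
    """Return a SPARQL Protocol GET URL carrying the query as a parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    query = build_query("0101-01-01T00:00:00Z", "0200-12-31T23:59:59Z")
    print(query)
    # Hypothetical endpoint address, used only to show the URL shape.
    print(request_url("https://example.org/sparql", query))
```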

Having this information easily retrievable in a machine-readable format allows users to make such chronological queries more systematic. As an example, we show in Figure 5 a generalisation of the query of Figure 3 that groups works by the century of the lila:endDate of their author. The results of this query can be exploited to evaluate the diachronic coverage of the LiLa project (as shown in Figure 6). It can be observed that some epochs (e.g., Late Antiquity) are currently very well represented, while in others there are gaps (e.g., some centuries in the Middle Ages). This points to the direction in which future projects should go to make the coverage more comprehensive and balanced.

Figure 5. SPARQL query that groups works by century.

Figure 6. Diachronic coverage of LiLa based on the results of the query of Figure 5.
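The grouping by century can be sketched in the same spirit. The query below is an assumption-laden illustration rather than the query of Figure 5: it relies on the endpoint accepting xsd:dateTimeStamp values in YEAR(), it assumes the lila namespace IRI, and it is only meaningful for CE dates. The Python helpers mirror the BIND expression so that the same tally can be computed on a list of end years obtained by other means.

```python
from collections import Counter

CENTURY_QUERY = """\
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX lila:    <http://lila-erc.eu/ontologies/lila/>

SELECT ?century (COUNT(DISTINCT ?doc) AS ?documents)
WHERE {
  ?doc    dcterms:creator ?author .
  ?author lila:endDate    ?end .
  BIND (CEIL(YEAR(?end) / 100) AS ?century)
}
GROUP BY ?century
ORDER BY ?century
"""


def century_of(year):
    """Century of a CE year, mirroring CEIL(YEAR(?end) / 100)."""
    if year < 1:
        raise ValueError("only CE years are handled in this sketch")
    return (year + 99) // 100


def coverage(end_years):
    """Count items per century from an iterable of CE end years."""
    return dict(Counter(century_of(year) for year in end_years))


if __name__ == "__main__":
    # Hypothetical end years, used only to show the shape of the tally.
    print(coverage([150, 180, 401]))
```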

More generally, a crucial feature of the model that we propose is its potential extensibility. Currently, we only provide information on the dating of authors, not of works, because for authors such information could be easily retrieved automatically, while this was more difficult for works. However, if and when such information is retrieved, it will be possible to code it on works in the same fashion.

Furthermore, the model we propose, being couched in the framework of the CIDOC crm and its lrmoo extension for bibliographic information, has the potential to allow for the coding of more structured information at different degrees of abstractness. Currently, we only introduce works at the most abstract level – lrmoo:F1_Work. However, it is possible to add corresponding instances of more concrete classes to express information of other kinds, for instance lrmoo:F2_Expression to provide metadata on a specific edition of some text. The process of annotation of that text can in turn be modelled by introducing an annotation activity, defined as an instance of crm:E13_Attribute_Assignment and linked to the annotators performing it through the property crm:P14_carried_out_by and to the annotated text through the property crm:P140_assigned_attribute_to (Figure 7).

Figure 7. Extension of the model of the gazetteers of works with expressions and manifestations.
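A possible shape for such an extension is sketched below in Turtle generated from Python. It is an illustration under assumptions rather than part of the published dataset: the property lrmoo:R3_is_realised_in, used to connect the work to its expression, is not mentioned in the text above, the lrmoo namespace IRI is assumed, and all example.org IRIs are hypothetical placeholders.

```python
# The lrmoo namespace IRI is an assumption.
PREFIXES = """\
@prefix crm:   <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix lrmoo: <http://iflastandards.info/ns/lrm/lrmoo/> .
"""


def annotated_expression_turtle(work, expression, activity, annotators):
    """Return Turtle for a work, one expression of it, and an annotation activity."""
    if not annotators:
        raise ValueError("at least one annotator is required")
    # Several annotators become a comma-separated Turtle object list.
    carried_out = " ,\n        ".join(f"<{a}>" for a in annotators)
    return PREFIXES + f"""
<{work}> a lrmoo:F1_Work ;
    lrmoo:R3_is_realised_in <{expression}> .

<{expression}> a lrmoo:F2_Expression .

<{activity}> a crm:E13_Attribute_Assignment ;
    crm:P14_carried_out_by {carried_out} ;
    crm:P140_assigned_attribute_to <{expression}> .
"""


if __name__ == "__main__":
    # All IRIs below are hypothetical placeholders.
    print(annotated_expression_turtle(
        work="https://example.org/work/1",
        expression="https://example.org/work/1/expression/1",
        activity="https://example.org/annotation/1",
        annotators=["https://example.org/person/1", "https://example.org/person/2"],
    ))
```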

Notes

[1] https://lila-erc.eu/about/.

[2] https://www.w3.org/TR/rdf-schema/.

[3] https://www.w3.org/TR/skos-reference/.

[4] https://www.dublincore.org/specifications/dublin-core/dcmi-terms/.

[5] https://cidoc-crm.org/.

[6] https://cidoc-crm.org/lrmoo.

[7] https://www.wikidata.org/wiki/Wikidata:Main_Page.

[8] https://digitallatin.org/.

[9] See Berti (2025) for another effort towards linking data pertaining to (Ancient Greek and) Latin authors to Wikidata.

[10] https://www.wikidata.org/wiki/Property:P1317.

[11] See https://www.wikidata.org/wiki/Help:Dates for more information on the treatment of dating in Wikidata.

[12] Out of the 173 authors in LiLa, 136 are linked to Wikidata, 126 of which display some dating information in that source.

[13] https://lila-erc.eu/data/corpora/digiliblt/id/corpus.

[14] In the future, this information can potentially be used to enrich Wikidata itself with these authors, but to do this it would be necessary to retrieve additional (meta)data that could not be extracted in the course of this work.

[15] SPARQL queries can be launched on the LiLa endpoint at https://lila-erc.eu/sparql/.

Author Contributions

Matteo Pellegrini: Conceptualization, Data Curation, Writing (original draft), Writing (review & editing).

Francesco Mambrini: Conceptualization, Data Curation, Writing (review & editing).

Giovanni Moretti: Conceptualization, Software, Writing (review & editing).

Marco Passarotti: Conceptualization, Supervision, Writing (review & editing).
