Wikidata and LiLa for Latin: Enabling Interoperability and Access to Inflected Forms and Corpus Attestations

David Lindemann; Matteo Pellegrini; Francesco Mambrini; Marco Passarotti

doi:10.5334/johd.464

(1) Context and motivation

(1.1) Context: Wikidata and LiLa

Wikidata (Vrandečić & Krötzsch, 2014; Erxleben et al., 2014) is a knowledge base of structured data coded according to the Resource Description Framework (RDF; cf. Lassila and Swick, 1998), following the principles of the Linked Open Data paradigm (LOD; cf., with a focus on linguistic data, Cimiano et al., 2020). Each item in Wikidata has its own Uniform Resource Identifier (URI), and information about it is stored as RDF statements that connect the item to a value using a defined property. Several classes and properties – also with dedicated URIs – are defined to allow for the representation of the variegated information recorded in the original data.

Among other things, rich information is provided on lexemes using the OntoLex-Lemon vocabulary (McCrae et al., 2017), by now a de facto standard for the modelling of lexical information as RDF LOD. Following this standard, on Wikibase, lexemes are assigned the class ontolex:LexicalEntry,¹ and are linked to their senses (instances of ontolex:LexicalSense) through the property ontolex:sense, and to their forms (instances of ontolex:Form) through the property ontolex:lexicalForm (cf. Lindemann, 2025 for a more detailed discussion on the implementation of OntoLex-Lemon in Wikibase).

Furthermore, in line with the spirit of LOD, Wikidata items can also feature links to identifiers of the corresponding items in external resources. For instance, for Latin lexemes, the URI of the corresponding lemma in the knowledge base of the Linking Latin (LiLa) project (Passarotti et al., 2020) is given. From there, many other pieces of information can be retrieved on that lexeme, since LiLa acts as a hub to make lexical and textual resources on Latin available as RDF LOD, and hence interoperable. To achieve interoperability between resources available for Latin, a lemma bank was created that currently² includes 230,402 forms that can be used as lemmas for Latin words, and both tokens of textual resources and entries of lexical resources are linked to the corresponding lemma there. To date, the lemma bank contains links to a few digital dictionaries (such as the Lewis and Short Latin-English dictionary, cf. Mambrini et al., 2021a; the Lexicala Latin-French dictionary, cf. De Paoli et al., 2025; Velez’s Latin-Portuguese dictionary, cf. Dezotti et al., 2024), other lexical resources (such as a Wordnet, cf. Mambrini et al., 2021b; a sentiment lexicon, cf. Sprugnoli et al., 2021; a word-formation lexicon, cf. Pellegrini et al., 2022), as well as corpora pertaining to different epochs and genres (such as the LASLA corpus for the pre- and post-classical period, cf. Fantoli et al., 2022; the CompHistSem corpus for the late and medieval period, cf. Pedonese et al., 2023; the CLaSSES collection of inscriptions, cf. De Felice et al., 2023). To date, almost 12M corpus tokens are linked to the LiLa knowledge base.

Thanks to the presence of links between Wikidata lexemes and LiLa lemmas (Lindemann et al., 2023),³ the two projects already stand out as a virtuous example of the integration of data from different sources guaranteed by the RDF technology and the LOD paradigm. In this paper, we discuss an enhancement of this integration, concerning inflected forms of lexemes and their attestation in corpora.

(1.2) Motivation: other forms and their occurrence in texts

Inflectional lexicons are lexical resources that list (potentially all) inflected forms that are available for each lexeme, and code the cell that forms occupy in a lexeme’s paradigm – i.e., the morphosyntactic properties that they convey. Such resources are increasingly being developed, because of their potential use for both theoretical and applied linguistics. On the one hand, they can be used to train and test Natural Language Processing (NLP) systems addressing specific tasks: for instance, the Unimorph project (Kirov et al., 2016; Batsuren et al., 2022) scraped Wiktionary data to obtain lexicons consisting of form-lemma-cell triples in tabular format for a wide range of languages (to date, 169); those lexicons have been used to train and test systems for the tasks of morphological inflection (predicting inflected forms from lemmas), reinflection (predicting inflected forms from each other), analysis and other shared tasks proposed in various editions of the Special Interest Group on Morphology and Phonology of the Association for Computational Linguistics (SIGMORPHON, cf. Cotterell et al., 2016; Nicolai et al., 2024). On the other hand, inflectional lexicons have been widely used in the literature on theoretical morphology, to perform systematic quantitative analyses of predictability in implicative relations between forms (see, e.g., Pellegrini, 2023 on Latin; Herce, 2025 on Spanish; Beniamine, 2018 on a small but typologically diverse sample of languages). The Paralex project (Beniamine et al., 2023), an effort towards the standardisation of inflectional lexicons more complex than the ones released in the Unimorph project, allows for the creation of lexicons that include phonological transcriptions that are crucial for the quantitative analyses mentioned above. Besides a recommended standard format, Paralex features an OWL (Web Ontology Language) ontology that introduces classes and properties that can be used to release those lexicons as RDF LOD (cf. Pellegrini et al., 2025).

As hinted above, for many lexemes, Wikidata lists sets of inflected forms, and the morphosyntactic properties they convey, thus essentially providing users with inflectional data in RDF for many languages, including Latin. However, no information is provided on the actual usage of Latin forms, as reflected in corpus attestations. Indeed, especially in large paradigms, like the ones of Latin verbs, it will often be the case that only a few forms for each lexeme are actually found in texts, while many others never appear (Bonami & Beniamine, 2016). This information can be obtained from LiLa, looking at the attestations of such forms in the corpora linked to the lemma bank that feature a fine-grained annotation of morphological features. This is the topic that we aim to address in this paper: we discuss our efforts to map corpus tokens linked to the LiLa lemma bank to forms of lexemes expressed in RDF in the same format as in Wikidata, to allow for a distinction between forms that are possible, but never used in LiLa corpora, and forms that are attested in texts.

(1.3) Content of this work

Rather than working on Wikidata Latin lexemes directly, we create our own Wikibase instance.⁴ This allows us to identify potential issues and test different options to solve them. Furthermore, it makes it possible to showcase these different options in Wikidata community discussions about the data model to adopt for Latin lexemes, and about how to overcome some limitations of the Wikidata lexicon, such as the following ones. On the one hand, there are gaps in the coverage of Wikidata with respect to tokens in LiLa corpora: missing lexemes, such as praeexisto ‘pre-exist’; lexemes for which at present no forms are listed, such as lauo⁵ ‘wash’; specific forms that are systematically not listed, such as contracted perfects, e.g., prf.act.inf praecipitasse for praecipito⁶ ‘precipitate’; and other less systematic gaps. On the other hand, Wikidata forms are sometimes underspecified for some of the inflectional properties that they convey, such as gender. For instance, for the perfect participle of the verb candido⁷ ‘make white’, the form candidatis is only coded as being the dative plural, with no information on gender, since that form can convey masculine, feminine or neuter gender. However, tokens matching that form in texts can be specified for gender, according to the gender of the controller noun: the adjective form will appear annotated as masculine if agreeing with a masculine noun, feminine if agreeing with a feminine noun, neuter if agreeing with a neuter noun. Some Latin corpora linked to LiLa – e.g., the Index Thomisticus Treebank (ITT; Passarotti, 2019) – accordingly tag the token as either masculine, feminine or neuter. If such tokens were linked to the underspecified form recorded in Wikidata, there would be a loss of information that would make it impossible to check, for instance, the frequency of that form occurring as masculine, feminine or neuter. Also, a direct link added to the token pointing to the form URI would lose granularity in that regard. In general, a more granular approach, representing orthographically identical but morphologically ambiguous forms as separate entities, makes querying for specific forms more straightforward.

To overcome these limitations, rather than using Wikidata forms, we exploit another inflectional resource linked to LiLa, PrinParLat (Pellegrini et al., 2025), to generate full paradigms including fully specified inflected forms, and link those forms to fully specified annotations of tokens in corpora. Based on that data, we could for example propose to represent all attested forms in Wikidata, while morphologically possible but unattested forms would remain unlisted.

Since at the time of writing (November 2025) PrinParLat only included verbs, we had to work solely on verb paradigms in the following experiments. As reference corpus, we use the Index Thomisticus Treebank, as it is the only corpus linked to LiLa (Mambrini et al., 2022) whose annotation of morphosyntactic properties is fine-grained enough to distinguish, for example, masculine, feminine and neuter forms with identical orthographical representation.

(2) Dataset description

Repository location

https://doi.org/10.5281/zenodo.17553591

Repository name

Zenodo

Object name

Lilamorph Wikibase: Latin Inflected Forms and Corpus Attestations

Format names and versions

CSV, Python 3.12

Creation dates

01-09-2025 to 07-11-2025

Dataset creators

David Lindemann (programming), Matteo Pellegrini, Francesco Mambrini and Marco Passarotti (source data)

Language

Metalanguage: English. Object language: Latin

License

GNU General Public License v3.0⁸

Publication date

07-11-2025

(3) Experiments

(3.1) Generating forms: PrinParLatInfLexi

The point of departure for the generation of forms for this work is PrinParLat (Pellegrini et al., 2025).⁹ It is a Paralex-compliant inflectional lexicon, also released as RDF LOD and linked to the LiLa knowledge base. It lists Principal Parts of Latin verbs – i.e., sets of inflected forms of a lexical unit from which its full paradigm can be inferred (see Stump & Finkel, 2013). The cells that are used as principal parts in the resource are prs.act.inf, fut.act.ind.3sg (or the corresponding morphologically passive forms for deponents), prf.act.ind.1sg (3sg for impersonals), prf.pass.ptcp.nom.n.sg, and fut.act.ptcp.nom.n.sg.

The forms of PrinParLat are assigned not only to lexemes, identified on semantic grounds (i.e., forms that share the same lexical meaning), but also to flexemes (Fradin & Kerleroux, 2003), identified on formal grounds. For instance, the verb lavo ‘wash’ can be inflected according to either the 1st or the 3rd conjugation, and its principal parts are accordingly assigned to different flexemes,¹⁰ as shown in Table 1.

Table 1

Lexemes, flexemes, and principal parts for the verb lavo ‘wash’.

LEXEME	FLEXEME (CONJ.)	prs.act.inf	fut.act.ind.3sg	prf.act.ind.1sg	prf.pass.ptcp.nom.n.sg	fut.act.ptcp.nom.n.sg
‘wash’	lavo (1st)	lauare	lauabit	lauaui	lauatum	lauturum
‘wash’	lavo (3rd)	lauere	lauet	laui	lautum	lauturum

Thanks to this structure, it is possible to capture the systematic relationship between variants that share some formal feature – be it the conjugation, as in this example, or the stem variant on which the form is built, e.g., abalien- vs. abalen- for the verb ‘separate’ (prs.act.inf abalienare/abalenare, fut.act.ind.3sg abalienabit/abalenabit, etc.). Furthermore, in this way, each flexeme can be assigned a unique descriptor of its inflectional behaviour (e.g., the conjugation), even if the lexeme is compatible with different conjugations – as happens for lavo in Table 1. This allows each PrinParLat flexeme to be linked to its own corresponding lemma in the LiLa Knowledge Base – e.g., in this case, the PrinParLat flexeme pertaining to the 1st conjugation is linked to the LiLa lemma 110084,¹¹ which is itself stated to be a 1st conjugation verb, while the PrinParLat flexeme pertaining to the 3rd conjugation is linked to the LiLa lemma 110085,¹² which is itself stated to be a 3rd conjugation verb.

Indeed, each flexeme of PrinParLat features tags for both its coarse-grained conjugation, following the classification of traditional grammars and as coded on the corresponding LiLa lemma through the property lila:inflectionType, and a fine-grained inflection micro-class (Dressler et al., 2008), grouping together only lexemes that display the same inflectional behaviour in all forms (see Pellegrini et al., 2025 for further details). For instance, the verbs laudo ‘praise’, amo ‘love’ and cubo ‘lie’ are all traditionally assigned to the same conjugation, the first, with prs.act.inf -are (amare, laudare, cubare) and the same set of inflectional patterns in the present system (e.g., prs.act.ind.3sg amat, laudat, cubat). However, only laudo and amo are also assigned to the same micro-class, since they also share the inflectional patterns of the perfect (e.g., prf.act.inf laudauisse, amauisse), perfect participle and supine (laudatum, amatum), while cubo is assigned to a different micro-class that displays other inflectional patterns in those cells (see prf.act.inf cubuisse, sup cubitum).

In this context, we exploit the information provided by PrinParLat – namely, principal parts of flexemes and their conjugation and micro-class – to generate the full paradigms of the Latin verbs recorded in PrinParLat, including all adjectival and nominal forms (participles, gerund(ive)s, supines). This is achieved by means of rules, whose bases are specific principal parts of flexemes of specific conjugations, and whose outputs are other forms produced through replacements coded by regular expressions. Table 2 shows a sample of such rules, and their output when applied to forms of the verb lavo shown above in Table 1.

Table 2

Rules to generate forms of Latin verbs.

RULE	INFLECTIONTYPE	BASE FORM	OUTPUT FORM	REPLACEMENT	OUTPUT (lavo)
1	v1r	prs.act.inf	prs.act.ind.3sg	are$ → at	lauat
2	v3r	prs.act.inf	prs.act.ind.3sg	ere$ → it	lauit
3	v1r,v3r	prf.act.ind.1sg	prf.act.ind.3sg	i$ → it	lauauit, lauit
4	v1r,v3r	prf.pass.ptcp.nom.n.sg	prf.pass.ptcp.nom.f.sg	um$ → a	lauata, lauta
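The rule format of Table 2 can be illustrated with a minimal Python sketch. It assumes that each rule is a regular-expression replacement applied to one principal part of a flexeme with a matching inflection type; the names `Rule`, `RULES` and `generate` are hypothetical and are not taken from the released scripts.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    inflection_types: frozenset  # inflection types the rule applies to
    base_cell: str               # principal part used as the base
    output_cell: str             # paradigm cell that is generated
    pattern: str                 # regular expression matched on the base form
    replacement: str             # string substituted for the match


# The four sample rules of Table 2.
RULES = [
    Rule(frozenset({"v1r"}), "prs.act.inf", "prs.act.ind.3sg", r"are$", "at"),
    Rule(frozenset({"v3r"}), "prs.act.inf", "prs.act.ind.3sg", r"ere$", "it"),
    Rule(frozenset({"v1r", "v3r"}), "prf.act.ind.1sg", "prf.act.ind.3sg", r"i$", "it"),
    Rule(frozenset({"v1r", "v3r"}), "prf.pass.ptcp.nom.n.sg",
         "prf.pass.ptcp.nom.f.sg", r"um$", "a"),
]


def generate(inflection_type, principal_parts):
    """Apply every applicable rule to the principal parts of one flexeme."""
    generated = {}
    for rule in RULES:
        base = principal_parts.get(rule.base_cell)
        if base is None or inflection_type not in rule.inflection_types:
            continue
        form, replaced = re.subn(rule.pattern, rule.replacement, base)
        if replaced:  # keep the output only if the pattern actually matched
            generated[rule.output_cell] = form
    return generated


# Principal parts of the two flexemes of lavo, as listed in Table 1.
lavo_1st = generate("v1r", {"prs.act.inf": "lauare",
                            "prf.act.ind.1sg": "lauaui",
                            "prf.pass.ptcp.nom.n.sg": "lauatum"})
lavo_3rd = generate("v3r", {"prs.act.inf": "lauere",
                            "prf.act.ind.1sg": "laui",
                            "prf.pass.ptcp.nom.n.sg": "lautum"})
```

With these inputs, the sketch reproduces the outputs of the last column of Table 2 for both flexemes.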

The outcome of the application of this process is a new resource providing information on the content of the 254 paradigm cells available for the 11K flexemes included in PrinParLat – amounting to a total of more than 2.5M cells. We call this PrinParLat-based inflectional lexicon PrinParLatInfLexi and we release it on Zenodo in the Paralex community.¹³

(3.2) Uploading PrinParLatInfLexi forms to the Wikibase

We then use the data of PrinParLatInfLexi to populate a Wikibase instance that we created. In the RDF version of PrinParLat linked to LiLa, it was necessary to have flexemes as the anchor point: as we have seen in Section 3.1, flexemes can be assigned a unique tagging of inflectional behaviour, and they can consequently also be unambiguously linked to lemmas of the appropriate inflection type, unlike lexemes, which can display forms inflected according to different conjugations. However, for the purposes of this work (the corpus token linking) we decided to use lexemes as the main lexical entries, to avoid an undesired duplication of forms that would give rise to widespread ambiguity in the linking of tokens to forms, and would consequently also impact the ability to retrieve accurate form counts.

Such duplication can happen because there are cases in which some forms of a lexeme are compatible with more than one flexeme. For instance, the verb ‘clean’ displays both forms inflected according to the 2nd conjugation (prs.act.inf tergēre, fut.act.ind.3sg tergēbit) and forms inflected according to the 3rd conjugation (prs.act.inf tergere, fut.act.ind.3sg terget), which are accordingly assigned to different flexemes. However, forms like prf.act.ind.1sg tersi, prf.pass.ptcp.nom.n.sg tersum and fut.act.ptcp.nom.n.sg tersurum cannot be assigned to one conjugation specifically: the morphological processes yielding these forms – namely, -s- suffixation for the formation of the perfect stem ters- and of the so-called “third stem” (Aronoff, 1993) ters- – are found both with verbs of the 2nd conjugation (e.g., mulgeo ‘milk’, with prf.act.ind.1sg mulsi and prf.pass.ptcp.nom.n.sg mulsum) and with verbs of the 3rd conjugation (e.g., spargo ‘scatter’, with prf.act.ind.1sg sparsi and prf.pass.ptcp.nom.n.sg sparsum), and can thus be considered to be compatible with both identified flexemes, as Table 3 illustrates.

In a first experiment, we uploaded one lexical entry for each flexeme into our Wikibase instance and generated full paradigms for those entries. As a result, all forms generated from the prf.act.ind.1sg, prf.pass.ptcp.nom.n.sg and fut.act.ptcp.nom.n.sg are generated twice: such duplication also affects, e.g., prf.act.ind.2sg tersisti, prf.pass.ptcp.nom.f.sg tersa, fut.act.ptcp.nom.f.sg tersura, and so on. This would create redundancy and ambiguity in the linking of tokens to forms, as any occurrence would have to be linked to both generated forms, with no possibility for disambiguation. To avoid this issue, in a second experiment we pivoted on lexemes instead: we thus generated different forms for the same cell only when those are segmentally different (like tergēre vs. tergere and tergēbit vs. terget), while there is only one form for the prf.act.ind.1sg tersi and prf.pass.ptcp.nom.n.sg tersum (and all forms generated from such principal parts).

Table 3

Lexemes, flexemes, and principal parts for the verb TERG(E)O ‘clean’.

LEXEME	FLEXEME (CONJ.)	prs.act.inf	fut.act.ind.3sg	prf.act.ind.1sg	prf.pass.ptcp.nom.n.sg	fut.act.ptcp.nom.n.sg
‘clean’	tergeo (2nd)	tergēre	tergēbit	tersi	tersum	tersurum
‘clean’	tergo (3rd)	tergere	terget	tersi	tersum	tersurum
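The difference between pivoting on flexemes and pivoting on lexemes can be sketched in a few lines of Python, using the principal parts of Table 3 as toy data. The function name `pivot_on_lexeme` and the data layout are assumptions made for illustration only; they do not reproduce the actual upload scripts.

```python
from collections import defaultdict

# Principal parts of the two flexemes of 'clean', as listed in Table 3.
FLEXEMES = {
    "tergeo (2nd)": {
        "prs.act.inf": "tergēre",
        "fut.act.ind.3sg": "tergēbit",
        "prf.act.ind.1sg": "tersi",
        "prf.pass.ptcp.nom.n.sg": "tersum",
        "fut.act.ptcp.nom.n.sg": "tersurum",
    },
    "tergo (3rd)": {
        "prs.act.inf": "tergere",
        "fut.act.ind.3sg": "terget",
        "prf.act.ind.1sg": "tersi",
        "prf.pass.ptcp.nom.n.sg": "tersum",
        "fut.act.ptcp.nom.n.sg": "tersurum",
    },
}


def pivot_on_lexeme(flexemes):
    """Merge flexeme paradigms into one lexeme-level paradigm.

    Returns {cell: {form: [flexemes the form is compatible with]}}, so that
    segmentally identical forms of the same cell are stored only once.
    """
    merged = defaultdict(dict)
    for flexeme, cells in flexemes.items():
        for cell, form in cells.items():
            merged[cell].setdefault(form, []).append(flexeme)
    return dict(merged)


lexeme = pivot_on_lexeme(FLEXEMES)

# Flexeme pivot: every form is stored once per flexeme.
n_flexeme_forms = sum(len(cells) for cells in FLEXEMES.values())
# Lexeme pivot: shared forms such as tersi are stored once.
n_lexeme_forms = sum(len(forms) for forms in lexeme.values())
```

In this toy example the flexeme pivot stores ten forms and the lexeme pivot seven, while each shared form keeps a record of both flexemes it is compatible with.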

The following figures show how lexical entries are represented in our Wikibase instance¹⁴ and illustrate the difference between the two approaches. Figure 1 shows the entry for the flexeme tergo, with prf.act.ind tersī and prs.act.inf tergere only (see Figure 2). Figure 3 shows the entry for the flexeme tergeo, with another prf.act.ind tersī with its own URI and prs.act.inf tergēre only (see Figure 4). Figure 5 shows the lexical entry for the lexeme tergo, listing both tergere and tergēre (see Figure 6), as they are segmentally different, but with only one form when the shape is the same as in tersī. In Figure 5, the lexeme is linked to its flexemes and to the corresponding lemma in LiLa through dedicated properties. If more than one LiLa lemma is compatible with a lexeme, the links to those are qualified by the flexeme they correspond to, to ease disambiguation. A list of forms with their grammatical features is then provided, and each individual form is also linked to the flexemes it is compatible with (Figure 6). Furthermore, the form and its segmentation as provided in PrinParLatInfLexi are given. Lastly, there is a link to a Wikibase item that represents the coding of the paradigm cell that the forms fill. Such a reification is recommended in the Paralex ontology, where paradigm cells are assigned to individuals of a dedicated class paralex:Cell. In PrinParLatInfLexi (like in PrinParLat), the paradigm cell is expressed through a custom Paralex-compliant coding that uses the labels common in traditional descriptions of Latin, and their abbreviations defined in the Leipzig Glossing Rules¹⁵ – e.g., “prf” for perfect tense-aspect.

Figure 1

The flexeme tergo in our Wikibase instance.¹⁶

Figure 2

A form of the flexeme tergo in our Wikibase instance.

Figure 3

The flexeme tergeo in our Wikibase instance.¹⁷

Figure 4

A form of the flexeme tergeo in our Wikibase instance.

Figure 5

The lexeme tergo in our Wikibase instance.¹⁸

Figure 6

A form of the lexeme tergo in our Wikibase instance.

More generally, as we have seen, the Paralex ontology provides classes and properties intended for the release of Paralex data as RDF LOD. Since the resource we start from, PrinParLatInfLexi, is constructed as a Paralex lexicon, we guarantee interoperability with other Paralex lexicons released as RDF LOD by aligning the classes and properties that we introduce in our Wikibase instance with those of the Paralex ontology, on the one hand, and of Wikidata, on the other. This alignment is established through dedicated properties, as shown in Figure 7 for the Paralex orth form property, linked to paralex:orth_form and wikidata:P7243. This enables us to export a version of the datasets in Wikibase using the aligned Wikidata or Paralex OWL properties, respectively, for a seamless integration into these environments.

Figure 7

Mapping from the Paralex orth form property to Wikidata and Paralex properties.¹⁹

(3.3) Uploading ITTB tokens to the Wikibase

The other pieces of information that we need to fulfil the aim of this work are corpus tokens. To that end, we use the Index Thomisticus Treebank (ITTB). It is a corpus of the work of Thomas Aquinas featuring the fine-grained morphological tagging that we need for this work. Since PrinParLatInfLexi only includes verbs, we only extract verbal tokens of the ITTB. We then generate an item in our Wikibase instance for each verbal token, as shown in Figure 8, preserving the links to the IDs of the token in the LiLa Knowledge Base and of the lemma to which the token is linked through dedicated properties. We also record the fine-grained morphological annotation provided there using the tagset of the Universal Dependencies (UD) project (Petrov et al., 2012) and expressed as RDF LOD using the Web Annotation Ontology (Sanderson et al., 2013). In our Wikibase instance, we introduce a property that points to the different feature-value pairs in the UD tagset (e.g., “Tense#Past”).

Figure 8

The token lavatur in our Wikibase instance.²⁴

(3.4) Mapping ITTB tokens to PrinParLatInfLexi forms

The final step is linking the tokens of the ITTB to the forms of PrinParLatInfLexi.

For an ITTB token to be linked to a PrinParLatInfLexi form, the following requirements must be met.

The token and the form must be segmentally identical. Since the ITTB tokens do not code vowel length, unlike the forms of PrinParLatInfLexi, where vowel length is coded at least on endings, we use a normalised version of the latter representation from which vowel length is removed.

The token and the form must link to the same LiLa lemma. Only for tokens for which no direct match could be found among the PrinParLatInfLexi forms linked to the same LiLa lemma did we also check whether a match could be found with lemmas that are stated to be variants in the lemma bank, through the property lila:lemmaVariant. For instance, the ITTB token epulemur²⁰ is linked to the deponent lemma epulor.²¹ While no PrinParLat(InfLexi) entry is linked to that lemma, there is a matching entry linked to non-deponent epulo,²² which in the LiLa lemma bank is stated to be a variant of epulor; we thus link the token to the corresponding form of that entry (epulēmur).²³

The fine-grained morphological tags of the token and of the form must be compatible. Since the two resources use different annotation schemes for morphological properties – a custom Paralex-compliant coding using the abbreviations of the Leipzig Glossing Rules in PrinParLatInfLexi, the UD tagset in the ITTB – we needed to explicitly map the values of the two annotations to one another, as shown in Table 4.²⁵ Note that in a couple of cases, a single tag in the PrinParLatInfLexi set corresponds to two values of the ITTB one. This is because tags like ‘future perfect’, common in traditional Latin descriptions, cannot be considered as expressing a value of a single feature, but rather distinct values of distinct features of tense and aspect.
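The first two requirements can be sketched as follows. The sketch assumes that vowel length is written with macrons (as in epulēmur) and can be removed by stripping combining marks; the lemma identifiers and the two dictionaries are hypothetical stand-ins for the LiLa lemma URIs and the lila:lemmaVariant links, and epulāmur is added only as an illustrative non-matching form.

```python
import unicodedata


def strip_vowel_length(form):
    """Drop combining marks, so that 'epulēmur' becomes 'epulemur'."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical identifiers standing in for LiLa lemma URIs.
FORMS_BY_LEMMA = {"lemma:epulo": ["epulēmur", "epulāmur"]}
LEMMA_VARIANTS = {"lemma:epulor": {"lemma:epulo"}}


def match_forms(token, lemma):
    """Return forms identical to the token once vowel length is removed.

    Forms linked to the token's own lemma are tried first; lemma variants
    are consulted only if that direct lookup yields nothing.
    """
    def hits(lemma_id):
        return [form for form in FORMS_BY_LEMMA.get(lemma_id, [])
                if strip_vowel_length(form) == token]

    found = hits(lemma)
    if not found:
        for variant in sorted(LEMMA_VARIANTS.get(lemma, ())):
            found.extend(hits(variant))
    return found


matches = match_forms("epulemur", "lemma:epulor")
```

Here the token epulemur, lemmatised under epulor, is matched to the form epulēmur of the variant lemma epulo, mirroring the example given in the second requirement.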

Table 4

Mapping between the PrinParLatInfLexi and ITTB tagsets for morphological feature values.

FEATURE	VALUE	PrinParLatInfLexi (LEIPZIG GLOSSING RULES)	ITTB (UNIVERSAL DEPENDENCIES)
case	ablative	abl	Case#Abl
case	accusative	acc	Case#Acc
case	dative	dat	Case#Dat
case	genitive	gen	Case#Gen
case	nominative	nom	Case#Nom
case	vocative	voc	Case#Voc
gender	feminine	f	Gender#Fem
gender	masculine	m	Gender#Masc
gender	neuter	n	Gender#Neut
mood	imperative	imp	Mood#Imp
mood	indicative	ind	Mood#Ind
mood	subjunctive	sbjv	Mood#Sub
number	plural	pl	Number#Plur
number	singular	sg	Number#Sing
person	first	1	Person#1
person	second	2	Person#2
person	third	3	Person#3
tense-aspect		fut	Tense#Fut
tense-aspect		fprf	Tense#Fut, Aspect#Perf
tense-aspect		iprf	Tense#Imp
tense-aspect		prf	Tense#Past, Aspect#Perf
tense-aspect		pprf	Tense#Pqp
tense-aspect		prs	Tense#Pres
verb form		gdv	VerbForm#Gdv
verb form		ger	VerbForm#Ger
verb form		inf	VerbForm#Inf
verb form		ptcp	VerbForm#Part
verb form		sup	VerbForm#Sup
voice		act	Voice#Act
voice		pass	Voice#Pass
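The mapping of Table 4 lends itself to a simple lookup. The sketch below converts a PrinParLatInfLexi cell label into UD feature-value pairs and checks it against a token annotation. The notion of compatibility used here (no conflicting value for any feature present on both sides) is an assumption made for illustration, as are the function names and the sample annotation of the token, which includes a VerbForm#Fin value that is not part of Table 4.

```python
# Table 4 as a lookup from Leipzig-style tags to UD feature-value pairs.
LEIPZIG_TO_UD = {
    "abl": {"Case": "Abl"}, "acc": {"Case": "Acc"}, "dat": {"Case": "Dat"},
    "gen": {"Case": "Gen"}, "nom": {"Case": "Nom"}, "voc": {"Case": "Voc"},
    "f": {"Gender": "Fem"}, "m": {"Gender": "Masc"}, "n": {"Gender": "Neut"},
    "imp": {"Mood": "Imp"}, "ind": {"Mood": "Ind"}, "sbjv": {"Mood": "Sub"},
    "pl": {"Number": "Plur"}, "sg": {"Number": "Sing"},
    "1": {"Person": "1"}, "2": {"Person": "2"}, "3": {"Person": "3"},
    "fut": {"Tense": "Fut"},
    "fprf": {"Tense": "Fut", "Aspect": "Perf"},
    "iprf": {"Tense": "Imp"},
    "prf": {"Tense": "Past", "Aspect": "Perf"},
    "pprf": {"Tense": "Pqp"},
    "prs": {"Tense": "Pres"},
    "gdv": {"VerbForm": "Gdv"}, "ger": {"VerbForm": "Ger"},
    "inf": {"VerbForm": "Inf"}, "ptcp": {"VerbForm": "Part"},
    "sup": {"VerbForm": "Sup"},
    "act": {"Voice": "Act"}, "pass": {"Voice": "Pass"},
}


def cell_to_ud(cell):
    """Translate a cell label such as 'prf.act.ind.3sg' into UD features."""
    features = {}
    for tag in cell.split("."):
        if tag[0].isdigit():  # '3sg' fuses person and number
            features.update(LEIPZIG_TO_UD[tag[0]])
            tag = tag[1:]
        if tag:
            features.update(LEIPZIG_TO_UD[tag])
    return features


def parse_ud(tags):
    """Turn ['Tense#Pres', ...] into {'Tense': 'Pres', ...}."""
    return dict(tag.split("#", 1) for tag in tags)


def compatible(form_features, token_features):
    """Assumed criterion: no feature has different values on the two sides."""
    return all(token_features.get(name, value) == value
               for name, value in form_features.items())


# Illustrative annotation of a prs.pass.ind.3sg token such as lavatur.
token = parse_ud(["Mood#Ind", "Number#Sing", "Person#3",
                  "Tense#Pres", "VerbForm#Fin", "Voice#Pass"])
passive_ok = compatible(cell_to_ud("prs.pass.ind.3sg"), token)
active_ok = compatible(cell_to_ud("prs.act.ind.3sg"), token)
```

Under this criterion the token is compatible with the passive cell but not with the active one, since the two differ in the value of Voice.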

The link from the corpus token to a matching ontolex:Form is represented on our Wikibase by the property P21. Each of those links is referenced by a link to the exact version of the matching algorithm script used for producing the matches, i.e., the general matching algorithm, and the one asking for matches through lemma variants, respectively.²⁶

(4) Outcome and discussion

As the outcome of the process described in the previous section, in our Wikibase instance we have on the one hand a collection of the 8,018 lexemes of PrinParLat, each of them supplied with a full list of generated forms from PrinParLatInfLexi (up to 1,812 per lexeme, depending on the number of form variants available for the same cells); on the other hand, a collection of 71,195 verbal tokens extracted from the ITTB. Table 5 summarises the outcome of the process of linking tokens to forms.

Table 5

Outcome of the process of linking ITTB tokens to PrinParLatInfLexi forms.

LINKED TOKENS (1:1 LINKS)	LINKED TOKENS (1:N LINKS)	LINKED TOKENS (ALL)	UNLINKED TOKENS	TOTAL
67,740 (95.1%)	969 (1.4%)	68,709 (96.5%)	2,486 (3.5%)	71,195
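The percentages of Table 5 follow directly from the token counts, as the short check below shows. Only the counts are taken from the table; the variable and function names are illustrative.

```python
# Token counts reported in Table 5.
one_to_one = 67_740
one_to_many = 969
unlinked = 2_486

linked = one_to_one + one_to_many
total = linked + unlinked


def share(count):
    """Percentage of all verbal tokens, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {
    "1:1": share(one_to_one),
    "1:N": share(one_to_many),
    "linked": share(linked),
    "unlinked": share(unlinked),
}
```

The linked tokens sum to 68,709 and the grand total to 71,195, matching the figures given in the table.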

The overall coverage of the generated forms with respect to ITTB tokens is very high, above 95%. This result is even more remarkable if we consider that the data source from which our forms are generated, PrinParLat, is intended to document systematically only words that are found in Classical Latin dictionaries (cf. Pellegrini et al., 2025), and thus does not cover Medieval Latin words systematically, while our tokens come precisely from a corpus of Medieval Latin, consisting of works of Thomas Aquinas (13th century). Indeed, many of the tokens that cannot be linked are occurrences of verbs that are only attested from the Middle Ages on, and are thus not found in Classical Latin dictionaries, like conditiono ‘to condition’. If we exclude such cases of tokens that are not mapped to any form because there is no lexical entry in PrinParLatInfLexi that is linked to the corresponding lemma, only 1,769 unlinked tokens remain. These can be due to forms that are not represented in PrinParLatInfLexi, like comparatives and superlatives of participles (e.g., nom.m/f.sg convenientior ‘more convenient’, nom.n.sg convenientissimum ‘very convenient’), adverbs derived from participles (e.g., consequenter ‘consequently’ from consequor), or simply spelling variants (e.g., prs.act.inf abiicere instead of standard abicere ‘throw away’). In other cases, the reason lies in different choices in the annotation, e.g., present participles of deponents that are tagged as passive in the ITTB vs. active in PrinParLatInfLexi.

In most cases, ITTB tokens are unambiguously linked to only one form in PrinParLatInfLexi. There are cases of tokens that can be linked to different forms, but ultimately, all of these ambiguities are due to issues that are not related to our procedure to link tokens to forms. On the one hand, this can happen because of ambiguity in the linking of ITTB tokens to LiLa lemmas in the original resource: for instance, the lemmatization of the token deserere²⁷ is not disambiguated between the two available options, namely desero ‘sow’²⁸ vs. desero ‘abandon’.²⁹ As a consequence, that token is compatible with forms of both corresponding lexemes in PrinParLatInfLexi. This ambiguity in the linking of tokens to LiLa lemmas explains about two thirds of the ambiguous token-to-form links in our data (628 out of 969). On the other hand, as we have seen, tokens in the ITTB are not coded for vowel length, while forms of PrinParLatInfLexi are, and there are forms that only differ in vowel length. The most systematic – and hence quantitatively most impactful – example is the prf.act.ind.3pl cell, which can be formed by attaching to the perfect stem either -erunt or -ērunt. While these variants differ in their frequencies and contexts of occurrence, both are systematically given as possible variants for all lexemes in PrinParLatInfLexi, which aims at the maximum possible coverage. Consequently, a token like dedicaverunt³⁰ can be linked to two different PrinParLatInfLexi forms, dedicaverunt³¹ and dedicavērunt,³² without the possibility of disambiguation (except possibly for poetry, where the length can sometimes be inferred from metre). This unavoidable ambiguity due to lack of information on vowel length in corpora accounts for about one third of the cases of ITTB tokens that are linked to more than one form in PrinParLatInfLexi (343 out of 969).
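How a missing vowel-length distinction turns into a 1:N link can be illustrated with a small sketch. It assumes that vowel length is marked with macrons and that the candidate forms have already been restricted to the right lemma and to compatible morphological tags; the candidate lists and the function names are illustrative and are not taken from the dataset.

```python
import unicodedata
from collections import Counter


def strip_vowel_length(form):
    """Drop combining marks, so that 'dedicavērunt' becomes 'dedicaverunt'."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def classify(token, candidate_forms):
    """Label a token as '1:1', '1:N' or 'unlinked' and return its matches."""
    matches = [form for form in candidate_forms
               if strip_vowel_length(form) == token]
    if not matches:
        return "unlinked", matches
    return ("1:1" if len(matches) == 1 else "1:N"), matches


# Illustrative candidates: both prf.act.ind.3pl variants are generated,
# and a lemma without any lexical entry yields no candidates at all.
CANDIDATES = {
    "dedicaverunt": ["dedicaverunt", "dedicavērunt"],
    "tersi": ["tersī"],
    "conditionat": [],
}

outcome = {token: classify(token, forms)[0]
           for token, forms in CANDIDATES.items()}
tally = Counter(outcome.values())
```

Because normalisation maps both dedicaverunt and dedicavērunt to the same string, the token receives two links, which is the configuration counted as 1:N in Table 5.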

To summarise, the coverage of ITTB tokens with our generated forms is very high, and almost no case of 1:0 and 1:N links is due to actual gaps in our generated forms or bugs in the linking process. Consequently, we consider our work to be a successful example of enrichment and integration of data for Latin.

The writing operations to Wikibase described in Section 3 have been carried out with our own Python scripts, using the wikibaseintegrator library³³ for the construction of JSON objects, as required by the MediaWiki API, for Wikibase entities³⁴ of type lexeme (for PrinParLat lexemes and flexemes) and of type item (for corpus tokens).
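To give an idea of the kind of JSON object involved, the sketch below builds a lexeme entity by hand with the Python standard library instead of wikibaseintegrator. The key names follow our understanding of the Wikibase lexeme JSON serialisation and should be checked against its documentation; all Q-identifiers are hypothetical placeholders, not items of the actual instance, and `build_lexeme` is an illustrative name.

```python
import json

# Hypothetical placeholder identifiers, NOT items of the real instance.
LATIN = "Q1"        # language item
VERB = "Q2"         # lexical category item
PERFECT = "Q3"      # grammatical feature items
INDICATIVE = "Q4"


def build_lexeme(lemma, forms):
    """Build a lexeme entity as a plain dict (assumed serialisation).

    `forms` is a list of (representation, grammatical feature ids) pairs.
    """
    return {
        "type": "lexeme",
        "lemmas": {"la": {"language": "la", "value": lemma}},
        "language": LATIN,
        "lexicalCategory": VERB,
        "forms": [
            {
                "add": "",  # assumed marker for a newly created form
                "representations": {"la": {"language": "la", "value": rep}},
                "grammaticalFeatures": list(features),
                "claims": [],
            }
            for rep, features in forms
        ],
    }


entity = build_lexeme("tergo", [("tersī", [PERFECT, INDICATIVE])])
payload = json.dumps(entity, ensure_ascii=False)
```

The resulting string is what would be sent as entity data in an API edit request; statements on forms, such as links to flexemes or paradigm cells, would go into the `claims` slot.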

(5) Applications and perspectives for future work

Recently, we have aligned Wikidata Latin lexemes to the LiLa lemma bank and have also created new Wikidata lexeme entities.³⁵ With the different form lexicons built on our Wikibase instance, we are now in a position to contribute to a discussion in the Wikidata community, comparing different options for the representation of inflected forms. We would like to highlight corpus token linking as a central use case for Wikibase forms, which entails adopting the data model that best serves that application, namely a separate listing of orthographically identical but morphologically ambiguous forms.

Since we chose Wikibase as the platform for the experiments presented here, all datasets are now ready for intervention by human or algorithmic users, who could mark ambiguous links (from token to form, or from token to LiLa lemma) as “preferred” or “deprecated”, so that the ambiguity is resolved. For such a task, the software provides the necessary means for querying the content and assigning rank values to ambiguous statements. These edits would be recorded in the users’ edit history, for revision and evaluation statistics.³⁶
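The effect of such rank values can be sketched as follows, assuming the usual Wikibase convention that statements carry one of the ranks "preferred", "normal" or "deprecated", and that the best-ranked statements are the preferred ones if any exist and the normal ones otherwise. The statement dictionaries and form identifiers are hypothetical.

```python
def best_links(statements):
    """Return the best-ranked statements; deprecated ones never qualify."""
    preferred = [s for s in statements if s["rank"] == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s["rank"] == "normal"]


# Hypothetical token-to-form links for an ambiguous token (1:N link).
before = [
    {"form": "L1-F1", "representation": "dedicaverunt", "rank": "normal"},
    {"form": "L1-F2", "representation": "dedicavērunt", "rank": "normal"},
]

# The same links after an editor has resolved the ambiguity.
after = [
    {"form": "L1-F1", "representation": "dedicaverunt", "rank": "deprecated"},
    {"form": "L1-F2", "representation": "dedicavērunt", "rank": "preferred"},
]

unresolved = [s["form"] for s in best_links(before)]
resolved = [s["form"] for s in best_links(after)]
```

Before the edit both links are returned; afterwards only the preferred one remains, while the deprecated link stays in the data together with its edit history.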

The data that results from this work, and the procedure outlined to obtain it, are also of interest for theoretical phonology and morphology, which can benefit from the possibility of having accurate frequency counts of wordforms. More specifically, this can be useful for quantitative studies on predictability in morphology, such as the ones cited in Section 1. Such studies normally use inflectional lexicons listing all forms of lexemes, regardless of their attestation. However, there is a growing awareness – see, e.g., Boyé and Schalchli (2019) – that a more ecological setting for the task of predicting forms should be adopted, e.g., using attested forms as predictors, and non-attested forms as the ones to be predicted, or at least taking into account information on token frequency in some way. This work provides data that can be exploited for that purpose, and it outlines a procedure that can be applied to obtain more data on this aspect.
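A minimal sketch of how token-to-form links translate into frequency counts and into a split between attested and unattested forms is given below. The paradigm excerpt and the links are toy data with invented counts, and the links are assumed to be unambiguous.

```python
from collections import Counter

# Toy excerpt of a lexeme's paradigm as (cell, form) pairs.
paradigm = [
    ("prs.act.inf", "tergere"),
    ("prs.act.inf", "tergēre"),
    ("prf.act.ind.1sg", "tersi"),
    ("prf.pass.ptcp.nom.n.sg", "tersum"),
]

# One entry per linked corpus token (invented, not real ITTB counts).
token_links = [
    ("prf.act.ind.1sg", "tersi"),
    ("prs.act.inf", "tergere"),
    ("prf.act.ind.1sg", "tersi"),
]

frequency = Counter(token_links)

# Attested forms could serve as predictors, unattested ones as targets.
attested = [form for form in paradigm if frequency[form] > 0]
unattested = [form for form in paradigm if frequency[form] == 0]
```

The counter gives the token frequency of each form, and the two lists provide the partition that an attestation-aware predictability study would start from.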

Indeed, in the future, we plan to extend the application of this procedure to other lexical categories that display different inflected forms in Latin – namely, nouns and adjectives – as well as to other corpora that feature the necessary fine-grained morphological tagging – such as the LASLA corpus. Furthermore, we envisage such an effort to be undertaken also for other languages.

Appendices

Appendix

We provide here a list of the namespaces of the compact URIs used in this paper.

lila: <http://lila-erc.eu/ontologies/lila/>

ontolex: <http://www.w3.org/ns/lemon/ontolex#>

paralex: <https://www.paralex-standard.org/paralex_ontology.xml#>

wikidata: <http://www.wikidata.org/entity/>
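For readers who want to resolve the compact URIs programmatically, the following Python sketch expands a compact URI into a full one using the four namespaces listed above. The `expand` function is a hypothetical helper written for this illustration, not part of any of the resources described.

```python
# Namespace prefixes as listed in this appendix.
NAMESPACES = {
    "lila": "http://lila-erc.eu/ontologies/lila/",
    "ontolex": "http://www.w3.org/ns/lemon/ontolex#",
    "paralex": "https://www.paralex-standard.org/paralex_ontology.xml#",
    "wikidata": "http://www.wikidata.org/entity/",
}


def expand(curie):
    """Expand a compact URI such as 'ontolex:Form' into a full URI."""
    prefix, sep, local_name = curie.partition(":")
    if not sep or prefix not in NAMESPACES:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    return NAMESPACES[prefix] + local_name


full_uri = expand("ontolex:LexicalEntry")
```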

Notes

[1] See the Appendix for a list of the namespaces of compact URIs used in this paper.

[2] As of November 3, 2025. The same holds for the other counts shown in this paper.

[3] See https://www.wikidata.org/wiki/Property_talk:P11033.

[4] See https://lilamorph.wikibase.cloud. The instance is hosted by the Wikibase Cloud hosting service provided by Wikimedia Deutschland, see https://wikibase.cloud.

[5] See https://www.wikidata.org/wiki/Lexeme:L1056162. Last accessed (2025/11/15).

[6] See https://www.wikidata.org/wiki/Lexeme:L281899. Last accessed (2025/11/15).

[7] See https://www.wikidata.org/wiki/Lexeme:L262071. Last accessed (2025/11/15).

[8] The ITTB is released under a CC BY-NC-SA 4.0 license. There is no single license common to all the resources interlinked in LiLa; the Lemma Bank, which is what is relevant here, is released under CC BY-SA 4.0. The code is released under the GNU General Public License v3.0, which is consistent with the “as open as possible, as closed as necessary” approach pursued in LiLa and with the licenses chosen for the third-party sources mentioned here (ITTB, LiLa).

[9] More specifically, the version used for this work is v2.0.0, DOI: https://doi.org/10.5281/zenodo.17815898

[10] See http://lila-erc.eu/data/lexicalResources/prinparlat/id/flexeme_l0416_6480 for the flexeme of the 1st conjugation and http://lila-erc.eu/lodview/data/lexicalResources/prinparlat/id/flexeme_l0416_6479 for the flexeme of the 3rd conjugation. Both websites were last accessed 2025/11/15.

[11] See http://lila-erc.eu/data/id/lemma/110084. Last accessed (2025/11/15).

[12] See http://lila-erc.eu/data/id/lemma/110085. Last accessed (2025/11/15).

[13] DOI: 10.5281/zenodo.17819184. Another inflected lexicon of Latin verbs and nouns that we created in the past for other purposes is already available, LatInfLexi (Pellegrini & Passarotti, 2018). There is a remarkable amount of overlap between the two resources, so in the future we plan to merge them into a single reference resource for inflected forms in Latin. However, this cannot be done yet because there are pieces of information that are provided in only one of the two (e.g., information on flexemes in PrinParLatInfLexi only, information on vowel length in LatInfLexi only).

[14] Access all collections at https://lilamorph.wikibase.cloud/wiki/Main_Page. Last accessed (2025/11/15).

[15] https://www.eva.mpg.de/lingua/resources/glossing-rules.php. Last accessed (2025/11/15).

[16] See https://lilamorph.wikibase.cloud/wiki/Lexeme:L10047. Last accessed (2025/11/15).

[17] See https://lilamorph.wikibase.cloud/wiki/Lexeme:L10048. Last accessed (2025/11/15).

[18] See https://lilamorph.wikibase.cloud/wiki/Lexeme:L34277. Last accessed (2025/11/15).

[19] See https://lilamorph.wikibase.cloud/wiki/Property:P11. Last accessed (2025/11/15).

[20] See https://lilamorph.wikibase.cloud/wiki/Item:Q25520. Last accessed (2025/11/15).

[21] See http://lila-erc.eu/data/id/lemma/101349. Last accessed (2025/11/15).

[22] See http://lila-erc.eu/data/id/lemma/101350. Last accessed (2025/11/15).

[23] See https://lilamorph.wikibase.cloud/wiki/Lexeme:L29533#F152. Last accessed (2025/11/15).

[24] See https://lilamorph.wikibase.cloud/wiki/Item:Q29683. Last accessed (2025/11/15).

[25] See https://lilamorph.wikibase.cloud/wiki/Main_Page#PrinParLat_morphological_cell_descriptors_(Leipzig_abbreviations) for the mapping of Leipzig abbreviations to UDP features and to Wikidata entities, and, through Wikidata, to entities of the LexInfo ontology. Last accessed (2025/11/15).

[26] We are using Wikibase references, and as value for the reference link, the ID of the Github commit of the version used in the matching procedure, see, e.g., https://lilamorph.wikibase.cloud/wiki/Item:Q49188. Last accessed (2025/11/15).

[27] See https://lilamorph.wikibase.cloud/wiki/Item:Q11916. Last accessed (2025/11/15).

[28] See http://lila-erc.eu/data/id/lemma/98883. Last accessed (2025/11/15).

[29] See http://lila-erc.eu/data/id/lemma/98882. Last accessed (2025/11/15).

[30] See https://lilamorph.wikibase.cloud/wiki/Item:Q11272. Last accessed (2025/11/15).

[31] See https://lilamorph.wikibase.cloud/wiki/Lexeme:L28639#F172. Last accessed (2025/11/15).

[32] See https://lilamorph.wikibase.cloud/wiki/Lexeme:L28639#F174. Last accessed (2025/11/15).

[33] See https://wikibaseintegrator.readthedocs.io. Last accessed (2025/11/15).

[34] See https://www.mediawiki.org/wiki/Wikibase/DataModel. Last accessed (2025/11/15).

[35] See https://www.wikidata.org/wiki/Property_talk:P11033. Last accessed (2025/11/15).

[36] For example, at https://lilamorph.wikibase.cloud/wiki/Item:Q15797#P16 (Last accessed 2025/11/15), once it has been decided whether the token “dico” should be linked to the first conjugation verb or the third conjugation verb (based, for instance, on the inflection paradigm of unambiguous forms of the same group of lemmas appearing in the same text), the user would mark the correct link as “preferred rank” or, alternatively, the other one as “deprecated rank”, so that the matching algorithm would not propose the same ambiguous token-to-form links again in the next round.

Acknowledgements

This work was carried out within the context of the “LiLa: Linking Latin” project, which has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme – Grant Agreement No. 769994. This work has also been supported by the DLTB research group (Basque Government, IT1534-2).

Competing interests

The authors have no competing interests to declare.

Author Contributions

David Lindemann: Conceptualization, Methodology, Software, Visualization, Writing – review & editing

Matteo Pellegrini: Conceptualization, Formal analysis, Methodology, Resources, Writing – original draft, Writing – review & editing

Francesco Mambrini: Conceptualization, Methodology, Writing – review & editing

Marco Passarotti: Conceptualization, Methodology, Writing – review & editing