Extending CLDF — Towards a Type System for Cross-Linguistic Data

Robert Forkel; Johann-Mattis List

doi:10.5334/johd.517

(1) Context and motivation

Over the last decade, the reuse of cross-linguistic data, i.e. data from or about many languages such as typological surveys or collections of interlinear glossed text, has changed considerably. Eclectic reuse, e.g. hand-picking examples from collections of glossed text to illustrate phenomena in research articles, has given way to computational methods investigating patterns in the enormous diversity of the world’s languages (Kortmann, 2021). New methods, with particular requirements regarding their input, have in turn shaped practices of data collection. As an example, consider how Haynie et al. (2023) created the Grambank database as a successor to the World Atlas of Language Structures by Haspelmath et al. (2005), with denser coverage for individual languages and fewer information gaps.

This interplay of data and methods resembles the way data types and operations interact in programming languages. In programming this interaction has often been formalized in the framework of a type system (Cardelli & Wegner, 1985; Weirich, 2014). A type system is a set of rules guiding the interpretation of data in computer programs. Compilers or interpreters, i.e. components of the computing stack that make programs executable, can enforce these rules to prevent a class of runtime errors caused by mismatches in how values are interpreted in different parts of a program (Cardelli, 2004).

Typical examples of data types in programming languages are integers and floating-point numbers (an approximation of real numbers). Since floating-point results are often only approximations of the “real” result of a computation, comparing them for strict equality is not a valid way to determine whether two results are equal. If the programming language can distinguish between different types of numbers, it can choose appropriate operations or reject invalid input, e.g. select an appropriate comparison method for floats or reject floats as input for naive equality comparison.
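As a minimal illustration in Python (not taken from any of the packages discussed in this paper), the following sketch shows why strict equality is unreliable for floats and how knowledge of the types allows a program to select a suitable comparison. The function name `results_equal` and the chosen tolerance are assumptions made for this example.

```python
import math
from typing import Union

Number = Union[int, float]


def results_equal(a: Number, b: Number) -> bool:
    """Compare two results with an operation that suits their types."""
    if isinstance(a, float) or isinstance(b, float):
        # Floats are approximations, so compare within a relative tolerance.
        return math.isclose(a, b, rel_tol=1e-9)
    # Integers are exact, so strict equality is a valid operation.
    return a == b


# Strict equality fails because 0.1 + 0.2 is only approximately 0.3.
naive = (0.1 + 0.2 == 0.3)
# The type-aware comparison treats the two values as equal.
tolerant = results_equal(0.1 + 0.2, 0.3)
```

Here `naive` evaluates to `False`, while `tolerant` evaluates to `True`.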

Since type systems provide a very effective way to specify and enforce contracts between data and operations, they also make a certain class of unit tests unnecessary and thus help reduce the amount of code. This effectiveness has also led most programming languages to support user-defined data types, i.e. a way for programmers to define their own data types, typically composed of multiple “atomic” types. This composability is one way in which a type system can implement expressiveness and modularity, two properties listed by Weirich (2014) as desirable.

We argue that the analogy between data and methods in science and type systems in programming is useful as a paradigm for a reuse-focused dataset design approach. With the Cross-Linguistic Data Formats (CLDF) standard (Forkel et al., 2018), an extensible toolbox to specify cross-linguistic data types is available. Examples of a data management approach based on data types and CLDF are presented in Section 4.

(2) Background

(2.1) Cross-linguistic data types

In computer programming, data types can be seen as common groupings of data values – represented by variables in code – that share a common set of operations that can be applied to them, as well as a common representation inside individual programming languages (Parnas et al., 1976; Hudson & Ishizu, 2017, pp. 45–55).

For cross-linguistic data, a shared or common “representation” is often the most transparent criterion to establish a type, at least at the current stage, where data collection often predates any computational reuse. An example of such a data type is the word list: a list of triples consisting of a word (or lexical unit), a meaning description and a language the word belongs to. Note that the word list is already a complex data type composed of atomic “word” triples. Such word lists have been collected for centuries (Pallas, 1787/1789).¹
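The composition of a word list from atomic triples can be sketched as a user-defined type in Python. This is only an illustration of the idea and not the data model of CLDF or LingPy; the names `Word`, `Wordlist` and `forms_for_meaning` as well as the sample entries are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Word:
    """One atomic entry: a form, its meaning and the language it belongs to."""
    form: str
    meaning: str
    language: str


# A word list is a complex type composed of atomic Word entries.
Wordlist = List[Word]


def forms_for_meaning(wordlist: Wordlist, meaning: str) -> List[str]:
    """Return all forms in the word list that express the given meaning."""
    return [word.form for word in wordlist if word.meaning == meaning]


# Illustrative entries only, not taken from any historical source.
sample: Wordlist = [
    Word("mother", "mother", "English"),
    Word("Mutter", "mother", "German"),
    Word("Hand", "hand", "German"),
]
mother_forms = forms_for_meaning(sample, "mother")
```

Naming the fields makes the intent of a function such as `forms_for_meaning` visible in its signature, which a generic list of anonymous triples would not.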

A prominent method that uses word lists as input is the comparative method (Meillet [1925], 1954) to group words from different, but genetically related languages into cognate sets and reconstruct proto forms, i.e. words of proto languages (see Durie (1996) for a popular description of the major workflow of the method). In programs that compute cognate sets, such as LingPy’s (List & Forkel, 2023) LexStat algorithm (List, 2014), it is useful to describe the input as word lists rather than as generic “lists of triples”, because this makes the program more informative and semantically transparent. Figure 1 shows an excerpt from the wordlist collection of Pallas for the concept “mother” translated into several different languages (Pallas, 1789, 2:10).

Figure 1: Words for “mother” sampled across different languages in Pallas (1789: 10).

Another example of a cross-linguistic data type that has been collected for a long time is the phoneme inventory, i.e. the set of phonemes occurring in a language (Moran & McCloy, 2019). International Phonetic Alphabet (IPA) charts present an early method to analyse phoneme inventories by equipping phonemes with features and plugging them into tabular arrangements for visual inspection. Figure 2 provides a consonant chart from the digitized online version of the Tableaux phonétiques des patois suisses romands (TPPSR) by Gauchat et al. (1925) for the variety Chevroux (see Geisler et al. (2021) for details on the digital version).²

Figure 2: Pulmonic consonants of the variety Chevroux in the TPPSR.

A third example of a data type that is used for cross-linguistic comparison is interlinear glossed text (IGT), i.e. lines of text in an object language and its translation that are annotated with glosses between the lines (Lehmann, 2004; List et al., 2022). A concrete specification of IGT, in particular for examples in typology, is provided by the Leipzig Glossing Rules (Comrie et al., 2015). An example conforming to the Leipzig Glossing Rules could be characterized as a triple of lines, where each line has the same number of “words”. Since morpheme alignment is an optional property of IGT, an IGT example could also have an additional property “conformance level”, which informs consumers whether to expect morpheme-aligned lines or not.
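A “conformance level” of this kind could be computed roughly as follows. The sketch assumes that morphemes are separated by hyphens in both the object-language line and the gloss line; the class `IGTExample`, the three level names and the German sample are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IGTExample:
    """An IGT example with an analyzed line, a gloss line and a translation."""
    analyzed_words: List[str]  # object-language words, morphemes split by "-"
    glosses: List[str]         # one gloss per object-language word
    translation: str           # free translation

    def conformance_level(self) -> str:
        """Report which kind of alignment consumers can expect."""
        if len(self.analyzed_words) != len(self.glosses):
            return "unaligned"
        for word, gloss in zip(self.analyzed_words, self.glosses):
            # Morpheme alignment needs equally many hyphens in word and gloss.
            if word.count("-") != gloss.count("-"):
                return "word-aligned"
        return "morpheme-aligned"


# Illustrative example, constructed for this sketch.
segmented = IGTExample(
    ["Die", "Kind-er", "spiel-en"],
    ["the", "child-PL", "play-3PL"],
    "The children are playing.",
)
unsegmented = IGTExample(
    ["Die", "Kinder", "spielen"],
    ["the", "child-PL", "play-3PL"],
    "The children are playing.",
)
```

With these inputs, `segmented.conformance_level()` yields `"morpheme-aligned"` and `unsegmented.conformance_level()` yields `"word-aligned"`.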

(2.2) Data types in CLDF

The CLDF standard provides a framework to formalize cross-linguistic data sufficiently to serve as machine-readable representations of typed data. To do so, CLDF piggybacks on a lower-level type system, namely the CSVW recommendation for tabular data (Tennison, 2016).³ CSVW assigns atomic data types to table cells – such as integer numbers, strings or dates – and can describe composed data types in the form of rows of tables with well-specified cells. Additionally, CSVW supports relations between tables using foreign key specifications. Thus, a set of tabular data files described by CSVW metadata can be regarded as equivalent to a serialization of the data in a relational database.
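To make the role of CSVW more concrete, the following sketch builds a small CSVW metadata description as a Python dictionary and lists the relations it declares. The file names `languages.csv` and `forms.csv`, the column names and the helper `foreign_keys` are assumptions chosen for illustration; the property names (`tables`, `tableSchema`, `columns`, `datatype`, `primaryKey`, `foreignKeys`) follow the CSVW metadata vocabulary.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical CSVW description of two related tables.
metadata: Dict = {
    "@context": "http://www.w3.org/ns/csvw",
    "tables": [
        {
            "url": "languages.csv",
            "tableSchema": {
                "columns": [
                    {"name": "ID", "datatype": "string"},
                    {"name": "Name", "datatype": "string"},
                    {"name": "Latitude", "datatype": "decimal"},
                ],
                "primaryKey": "ID",
            },
        },
        {
            "url": "forms.csv",
            "tableSchema": {
                "columns": [
                    {"name": "ID", "datatype": "string"},
                    {"name": "Language_ID", "datatype": "string"},
                    {"name": "Form", "datatype": "string"},
                ],
                "primaryKey": "ID",
                # Each form row points to one row of the language table.
                "foreignKeys": [
                    {
                        "columnReference": "Language_ID",
                        "reference": {
                            "resource": "languages.csv",
                            "columnReference": "ID",
                        },
                    }
                ],
            },
        },
    ],
}


def foreign_keys(md: Dict) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (table, column, referenced table, referenced column) tuples."""
    for table in md["tables"]:
        for fk in table["tableSchema"].get("foreignKeys", []):
            ref = fk["reference"]
            yield (table["url"], fk["columnReference"],
                   ref["resource"], ref["columnReference"])


relations = list(foreign_keys(metadata))
```

The single relation found here corresponds to what a foreign key constraint would express in a relational database.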

CLDF then attaches semantics at three levels of the data model: cells, tables and groups of tables. So, for example, CSVW suffices to define a data type composed as a pair of real numbers. But an additional layer of semantics – such as CLDF – is necessary to fix the interpretation of the pair, for example as a vector in a two-dimensional vector space, as a complex number or as geographic coordinates.
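The difference between structure and interpretation can be sketched in Python. The same pair of numbers is given three interpretations, each with its own valid operations and constraints; the classes `Vector2D` and `Coordinate` are hypothetical and not part of CLDF or CSVW.

```python
import math
from dataclasses import dataclass
from typing import Tuple

# Structural type only: a pair of real numbers without any interpretation.
Pair = Tuple[float, float]


@dataclass(frozen=True)
class Vector2D:
    """Interpretation as a vector: length is a meaningful operation."""
    x: float
    y: float

    def length(self) -> float:
        return math.hypot(self.x, self.y)


@dataclass(frozen=True)
class Coordinate:
    """Interpretation as a geographic coordinate: value ranges are constrained."""
    latitude: float
    longitude: float

    def __post_init__(self) -> None:
        if not -90 <= self.latitude <= 90:
            raise ValueError("latitude out of range")
        if not -180 <= self.longitude <= 180:
            raise ValueError("longitude out of range")


pair: Pair = (3.0, 4.0)
as_vector = Vector2D(*pair)        # supports length()
as_complex = complex(*pair)        # supports complex multiplication
as_coordinate = Coordinate(*pair)  # validated against coordinate ranges
```

A pair such as `(100.0, 0.0)` would be a valid vector and a valid complex number, but `Coordinate(100.0, 0.0)` raises a `ValueError`, because the semantic layer adds a rule that the structural type alone cannot express.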

CLDF data types for cells are called properties in CLDF lingo, table-level data types are called components and table-group-level types are called modules. Arguably the most important property in CLDF is glottocode, i.e. a string representing a language identifier in the Glottolog catalog (Forkel & Hammarström, 2022). An example of a component is the LanguageTable, each row of which lists well-specified metadata for a certain language, e.g. its glottocode or a point coordinate locating the language in geographic space. A Wordlist is an example of a CLDF module, requiring the presence of a FormTable component in a conformant dataset.

The rules associated with data types in CLDF are specified in the standard and enforced when running the cldf validate command of CLDF’s reference implementation – the pycldf package – on a dataset. Doing so will make sure that glottocodes have the correct format, FormTable rows have non-empty Form cells, and Wordlist datasets contain a non-empty FormTable.
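The three checks named above can be mimicked in a few lines of plain Python. The following is a self-contained sketch for illustration only and does not reproduce the implementation in pycldf; the function `validate_wordlist`, the error messages and the regular expression (four alphanumeric characters followed by four digits) are assumptions.

```python
import re
from typing import Dict, List

# Assumed glottocode shape: four alphanumeric characters plus four digits.
GLOTTOCODE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def validate_wordlist(languages: List[Dict[str, str]],
                      forms: List[Dict[str, str]]) -> List[str]:
    """Return a list of rule violations; an empty list means the data passed."""
    errors = []
    # Module-level rule: a Wordlist needs a non-empty FormTable.
    if not forms:
        errors.append("FormTable is empty")
    # Property-level rule: glottocodes must have the expected format.
    for row in languages:
        code = row.get("Glottocode", "")
        if code and not GLOTTOCODE.fullmatch(code):
            errors.append(f"invalid glottocode: {code!r}")
    # Component-level rule: each FormTable row needs a non-empty Form.
    for row in forms:
        if not row.get("Form"):
            errors.append(f"empty Form in row {row.get('ID')}")
    return errors


# Hypothetical rows with one deliberate error per table.
languages = [
    {"ID": "deu", "Glottocode": "stan1295"},
    {"ID": "eng", "Glottocode": "English"},
]
forms = [
    {"ID": "1", "Language_ID": "deu", "Form": "Mutter"},
    {"ID": "2", "Language_ID": "eng", "Form": ""},
]
errors = validate_wordlist(languages, forms)
```

Each rule corresponds to one of the three levels at which CLDF attaches semantics: properties, components and modules.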

(2.3) Extending CLDF

CLDF is designed to be extensible, and new properties, components and modules can be added with each new release of the specification (Forkel, 2023). Since the underlying CSVW specification has no limitations regarding the number of tables in a package or the number of columns in a table – and CLDF simply ignores non-CLDF columns or tables – future extensions can be easily implemented and tested by adding custom columns or tables to datasets.

Once such ad-hoc, custom additions “stabilize”, i.e. prove to be useful and are used as model for several datasets, they are set on track for standardization. This means that the next version of CLDF may include “official” properties, components or modules which correspond to the ad-hoc additions.

This process of stepwise conventionalization is very pragmatic but requires a somewhat coherent CLDF community to work. For example, it must be possible to identify cases where different datasets incorporate ad-hoc models for the same kind of data. It also requires data publication using suitable licenses, in particular licenses that allow derived works, because once the ad-hoc model becomes part of the CLDF specification, the “pioneer” datasets need to be converted to using the now specified CLDF constructs (a process that might be called retro-standardization; Geisler et al., 2021). Licenses without ND clauses make sure this conversion does not need to be done by the original authors.

(2.4) Aggregating CLDF data

CLDF is designed to allow for easy aggregation of data from multiple datasets – thereby encouraging a culture of smaller, focused datasets. Shared provenance of the data in datasets is conveyed transparently by linking the data to sources, i.e. bibliographic records of publications providing the source of the data. For example, Walker and Ribeiro (2011) computed a language phylogeny for the Arawakan language family based on lexical data and tied the phylogeny to geographic speaker areas conveyed as a map. The phylogeny has been standardized as a Phlorest phylogeny (Walker & Ribeiro, 2025), the lexical data as a Lexibank dataset (Walker & Ribeiro, 2024), and the map as a CLDF dataset of speaker areas in the Glottography project (The Glottography Consortium, 2025). Aggregation of the data is made possible via transparent data types and consistent language identifiers used across sets. The shared provenance of the data can be inferred from the DOI of the source publication referenced by all three datasets.
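Aggregation via consistent language identifiers can be sketched as a simple join. The three dictionaries below stand in for extracts of a phylogeny, a lexical dataset and a dataset of speaker areas; all values, including the glottocode-like keys, are fictitious, and the function `aggregate` is hypothetical.

```python
from typing import Dict, List, Set

# Fictitious extracts of three datasets that share language identifiers.
tree_taxa: Set[str] = {"aaaa1111", "bbbb2222", "cccc3333"}
lexicon: Dict[str, List[str]] = {
    "aaaa1111": ["form1", "form2"],
    "bbbb2222": ["form3"],
}
areas_km2: Dict[str, float] = {"aaaa1111": 1200.0, "cccc3333": 80.5}


def aggregate(taxa: Set[str],
              forms: Dict[str, List[str]],
              areas: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Combine data for languages that occur in all three datasets."""
    shared = taxa & set(forms) & set(areas)
    return {
        code: {"n_forms": len(forms[code]), "area_km2": areas[code]}
        for code in sorted(shared)
    }


combined = aggregate(tree_taxa, lexicon, areas_km2)
```

Only the language present in all three extracts ends up in `combined`, which shows why identifiers that are consistent across datasets are a precondition for aggregation.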

(3) A paradigm for curating cross-linguistic data based on data types

Over the last five years we have seen projects evolve around datasets of a certain data type. In these projects the data type defines the commonality between datasets, i.e. the same analysis methods can be applied to all datasets curated in the project. Datasets are kept distinct – rather than aggregating all data into a single set – in order to convey provenance of the data and thus to allow for addition of new datasets based on new “raw” data independently of other datasets. So, curating collections of datasets of the same type allows for reuse (and development) of type-specific curation tools. Limiting datasets to data with the same provenance makes the source transparent.

(3.1) Dataset design

The data design workflow pioneered by these projects proceeds as follows: (1) A useful data type is identified, e.g. by implementing an analysis method for a particular instance of the data type and then validating this method by running it on similar data. Once a data type is identified and an ad-hoc data model defined, this model can be used to inform the implementation of curation tools – which may ultimately become part of the CLDF reference implementation. Such a tool can be used to curate multiple datasets consistently. (2) More datasets are collected, e.g. for different language families. (3) The data model might be revised based on experience with more data. Discussion of the data model can happen in the form of issue threads at CLDF’s GitHub repository⁴ – making sure the CLDF maintainers are involved. (4) Once the model stabilizes, it is included in the CLDF standard. Methods development and data collection can then proceed independently, based on a stable specification, and the project infrastructure only serves as an (optional) publication outlet.

(3.2) Data curation

Curation of collections of datasets of the same type comprises the following tasks: (1) Identifying suitable data in source publications, (2) extracting the raw data, (3) converting the raw data to a CLDF dataset conforming to the quality standards of the collection, and (4) creating versioned releases of the datasets, published in a suitable repository. This curation scenario is exactly the use case for which the cldfbench package has been developed (Forkel & List, 2020): The three default directories in a cldfbench-curated dataset partition the data into “raw” data – as it appears in the source publication, “etc” – i.e. data added by the curators (essentially acting as editors) and “cldf” data intended for reuse. In addition, cldfbench creates suitable metadata to allow simple uploading of released versions of datasets to the Zenodo repository.

Organizing dataset curation around data types also makes it easier to address the retro-standardization issue explained above. The community curating datasets of a particular type is in the best position to push standardization in CLDF and retro-standardize their datasets once standardization in CLDF has happened.

(4) Examples

(4.1) Example 1: Phlorest

Language phylogenies have been the output of analysis methods in Historical and Diversity Linguistics for a long time (Dunn, 2015). More recently – in particular with the advent of Bayesian methods – they have been used as input for analysis, e.g. serving as prior or constraint in Bayesian phylogenetics or for the compilation of “super trees” (Jäger, 2018). The latter reuse scenario requires consistent formatting and metadata across constituent phylogenies and thus contributed to the specification of a data model for language phylogenies. Such phylogenies have been collected and curated in the Phlorest project (Forkel & Greenhill, 2023).

Phlorest phylogenies are typically extracted from NEXUS files provided as supplements to publications. The phlorest package⁵ is used to make this extraction simple and reproducible and to make sure comparable phylogenies are created, for example by sampling posterior distributions appropriately. This package also provides functionality to format the most important information of a phylogeny as a human-readable README to serve as the landing page of the dataset.⁶

As of now, Phlorest has published 29 phylogenies⁷ providing genealogical information about languages from more than 20 families. The curation workflow is described in Figure 3.

The Phlorest project exemplifies the data curation workflow described here: Once a data type is identified, dedicated curation workflows can be supported with tools such as Python packages. Publication of datasets can then be modeled as “release” in GitHub’s standard workflow and the integration with Zenodo makes sure such releases are actually pushed to a longterm archive for scientific content.

While data collection often precedes methods development in linguistics, this is not always the case for data types already known in other branches of science, such as phylogenies. In such cases, adaptation of methods from other disciplines may be a core driver of “data typing” and the data types may turn into common terminology, making cross-disciplinary development of methods easier.⁸

(4.2) Example 2: Glottography

Another type of cross-linguistic data that has been collected and curated following the paradigm outlined above are datasets of “speaker areas” (aka “language maps” or “language atlases”). With phylogeography tools such as BEAST being able to handle geographic data in the form of polygons rather than just point locations, machine-readable data on the geographic distribution of languages becomes a necessity. The demand for such data was addressed by the Glottography project (see Ranacher et al., 2026).⁹

While a basic property to associate languages with geographic objects was already present in CLDF 1.2, the Glottography project enhanced the CLDF data model by adding two kinds of aggregated areas: areas aggregated at Glottolog language level, i.e. merging all dialect-level areas, and areas aggregated at family level.

Suggestions for new datasets are collected as issues of a GitHub repository.¹⁰ Curation is supported by a dedicated software package,¹¹ and released datasets are published in a dedicated Zenodo community.¹²

With some datasets providing speaker areas for hundreds or even thousands of languages (Ranacher et al., 2025), the collection of datasets curated by Glottography now covers more than 57 percent of the languages of the world (as tallied by Glottolog 5.2).

(4.3) Outlook: Etymological dictionaries

The comparative method mentioned above needs more input than just word lists. It crucially depends on assumptions about the genealogy of languages to inform the reconstruction of proto-forms. Thus, a data type that allows application of the comparative method would be composed of word lists and a language phylogeny. While already quite complex, data of this type has also already been collected for some time as exemplified by etymological (or comparative) dictionaries such as the Austronesian Comparative Dictionary (Blust & Trussel, 2013), PMED (Kaufman & Justeson, 2003) or The lexicon of Proto Oceanic (Ross et al., 1998). Specifying a suitable CLDF variant for such data is also already underway (see Figure 4, Smith et al., 2025).

Complex data in etymological dictionaries can be constructed from simpler data types which are already standardized in CLDF. In addition to the tabular data known from cognate-coded wordlists, the reconstruction tree adds direction and order to the reconstructed protoforms.

Since this data type is considerably complex, it can be expected that the available datasets will exhibit inconsistencies. Thus, a first operation for this data type should be a consistency check – possibly resulting in data annotation as for interlinear glossed text.
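One conceivable consistency check of this kind is sketched below: every reflex should belong to a language that occurs in the phylogeny, and the proto-language it is linked to should be an ancestor of that language. The tree, the language names and the functions `ancestors` and `check_reflexes` are hypothetical, and the sketch makes no claim about the checks that a future CLDF specification for etymological dictionaries would define.

```python
from typing import Dict, List, Optional, Tuple

# Child -> parent links of a hypothetical language phylogeny.
PARENT: Dict[str, Optional[str]] = {
    "Proto-AB": None,
    "A": "Proto-AB",
    "B": "Proto-AB",
}


def ancestors(language: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return all ancestors of a language, nearest ancestor first."""
    result = []
    node = parent.get(language)
    while node is not None:
        result.append(node)
        node = parent.get(node)
    return result


def check_reflexes(reflexes: List[Tuple[str, str]],
                   parent: Dict[str, Optional[str]]) -> List[str]:
    """Check (reflex language, proto-language) pairs against the phylogeny."""
    problems = []
    for language, proto in reflexes:
        if language not in parent:
            problems.append(f"{language}: not in phylogeny")
        elif proto not in ancestors(language, parent):
            problems.append(f"{language}: {proto} is not an ancestor")
    return problems


problems = check_reflexes(
    [("A", "Proto-AB"), ("C", "Proto-AB"), ("B", "A")],
    PARENT,
)
```

The result lists the two inconsistent pairs, which could then be recorded as annotations on the data rather than silently corrected.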

(5) Conclusion

The above examples show how the data curation paradigm outlined here has been successfully implemented. Similar approaches are increasingly common, such as the IMTVault project (Nordhoff & Krämer, 2022; Nordhoff & Krämer, 2025),¹³ providing a web interface for CLDF data of the IGT type, or the Open Text Collections project (Nordhoff et al., 2024),¹⁴ which also invites submissions of previously unpublished data. Thus, thinking about cross-linguistic data in terms of data types seems to create a suitable framework to organize data creation, curation and reuse.

We have also shown that the CLDF standard¹⁵ provides a suitable toolbox to specify data types. Several of these have been part of CLDF since version 1.0, with examples of conforming open datasets listed on the initiative’s website.¹⁶

While data creators may not have to think about data types right away, doing so may still be beneficial for identifying suitable publication outlets and, in general, for maximizing potential reuse. Last but not least, editors of academic journals may play an important role in a data ecosystem based on data types, by identifying suitable additions to type-specific collections in supplemental data and ideally directing authors appropriately.

Notes

[1] See also https://pallas.ivdnt.org/, last accessed 27.03.2026.

[2] Taken from https://tppsr.clld.org/languages/1, last accessed 27.03.2026.

[3] https://csvw.org, last accessed 27.03.2026.

[4] https://github.com/cldf/cldf, last accessed 27.03.2026.

[5] https://github.com/phlorest/phlorest, last accessed 27.03.2026.

[6] See for example https://github.com/phlorest/walker_and_ribeiro2011/blob/main/README.md, last accessed 27.03.2026.

[7] https://zenodo.org/communities/phlorest, last accessed 27.03.2026.

[8] https://github.com/tochsner/phylodata/blob/main/PRINCIPLES.md, last accessed 27.03.2026.

[9] https://github.com/glottography, last accessed 27.03.2026.

[10] https://github.com/Glottography/.github/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22new%20dataset%22, last accessed 27.03.2026.

[11] https://github.com/Glottography/pyglottography, last accessed 27.03.2026.

[12] https://zenodo.org/communities/glottography, last accessed 27.03.2026.

[13] https://imtvault.org/, last accessed 27.03.2026.

[14] https://opentextcollections.github.io/, last accessed 27.03.2026.

[15] https://doi.org/10.5281/zenodo.597075, last accessed 27.03.2026.

[16] https://cldf.clld.org, last accessed 27.03.2026.

[17] https://doi.org/10.3030/101044282, last accessed 27.03.2026.

Acknowledgements

We thank all scholars who have helped the creation and further development of CLDF by interacting with the data, testing the tools, or providing their own data in formats compatible with CLDF.

Funding Statement

Johann-Mattis List was supported by the ERC Consolidator Grant Productive Signs (Grant No. 101044282).¹⁷ Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency (nor any other funding agencies involved). Neither the European Union nor the granting authority can be held responsible for them.

Author Contributions

Robert Forkel (Conceptualization; Writing – original draft; Writing – review & editing).

Johann-Mattis List (Conceptualization; Writing – review & editing).