Extending CLDF — Towards a Type System for Cross-Linguistic Data

Robert Forkel; Johann-Mattis List

doi:10.5334/johd.517

Journal of Open Humanities Data

Volume 12 (2026): Issue 1

Open Access

Apr 2026

Abstract

We argue that, in order to maximize reusability of cross-linguistic data, it is useful to think about it in terms of a type system. Type systems are enforceable rules guiding the interpretation of data in computer programs. Thus, they link data values to valid operations which can be performed on them. Clearly, the reusability of research data is determined largely by the availability of suitable analysis methods. A clear idea of cross-linguistic data types will enable development of analysis methods as well as a mechanism to match valid data with appropriate operations. The Cross-Linguistic Data Formats (CLDF) initiative provides a toolkit to model such cross-linguistic data types, and in recent years we have seen a paradigm (and an associated process) arise of how new types can be added to CLDF through stepwise conventionalization. Additionally, data types provide a useful selection criterion to group datasets for unified curation. Thus, a type system for cross-linguistic data will provide actionable metadata to guide data curation and inform data reuse.
