A Global Lexical Database (GLED) for Computational Historical Linguistics

By: Tiago Tresoldi  
Open Access | Feb 2023

(1) Overview

Repository location

https://doi.org/10.5281/zenodo.7368116

Context

The Global Lexical Database (GLED) is a resource for computational historical linguistics encompassing a dataset of basic vocabulary for most known natural languages, with accompanying information on machine-detected cognates and phonological alignments, along with per-family and global phylogenetic resources. The latest release holds 262,859 entries for 6,572 doculects (documented language varieties, see Nordhoff & Hammarström, 2011) in 344 families (Figure 1) and is available under the CC-BY licence. The database’s key component, a lexical dataset ultimately derived from the word lists of the Automated Similarity Judgement Program (ASJP), carries lemmas for between 30 and 40 comparative concepts for each doculect, all rendered with a broad phonetic transcription. The average concept coverage per doculect is 90.3%, and the average mutual pairwise coverage between doculects is 82.2%. Table 1 details the distribution of concept counts across doculects, and Table 2 lists the concepts along with their coverage.

Table 1

Number of doculects per number of concepts expressed in absolute and relative terms. Note that the number of entries for a doculect will be higher than the number of concepts in the case of synonyms.

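| NUMBER OF CONCEPTS | DOCULECTS | PERCENTAGE OF DOCULECTS |
|---|---|---|
| 30 | 330 | 5.0 |
| 31 | 306 | 4.7 |
| 32 | 361 | 5.5 |
| 33 | 401 | 6.1 |
| 34 | 595 | 9.1 |
| 35 | 627 | 9.5 |
| 36 | 786 | 12.0 |
| 37 | 605 | 9.2 |
| 38 | 627 | 9.5 |
| 39 | 736 | 11.2 |
| 40 | 1198 | 18.2 |
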
Figure 1

Location of the doculects included in the dataset, using information from Hammarström et al. (2022); colours are automatically assigned to differentiate language families.

Table 2

Absolute and relative doculect coverage per concept, along with the Concepticon mapping for each concept.

| CONCEPT GLOSS | DOCULECTS (RATIO) | CONCEPTICON NAME/ID |
|---|---|---|
| 1pl | 5265 (0.801) | WE/1212 |
| 1sg | 5379 (0.818) | I/1209 |
| 2sg | 5231 (0.795) | THOU/1215 |
| blood | 6426 (0.977) | BLOOD/946 |
| bone | 6351 (0.966) | BONE/1394 |
| breast | 5957 (0.906) | BREAST/1402 |
| come | 6130 (0.932) | COME/1446 |
| die | 6125 (0.931) | DIE/1494 |
| dog | 6430 (0.978) | DOG/2009 |
| drink | 6058 (0.921) | DRINK/1401 |
| ear | 6475 (0.985) | EAR/1247 |
| eye | 6494 (0.988) | EYE/1248 |
| fire | 6417 (0.976) | FIRE/221 |
| fish | 6226 (0.947) | FISH/227 |
| full | 4190 (0.637) | FULL/1429 |
| hand | 5693 (0.866) | HAND/1277 |
| hear | 5898 (0.897) | HEAR/1408 |
| horn | 4317 (0.656) | HORN (ANATOMY)/1393 |
| knee | 5357 (0.815) | KNEE/1371 |
| leaf | 6077 (0.924) | LEAF/628 |
| liver | 5454 (0.829) | LIVER/1224 |
| louse | 5711 (0.868) | LOUSE/1392 |
| mountain | 5321 (0.809) | MOUNTAIN/639 |
| name | 6042 (0.919) | NAME/1405 |
| new | 5711 (0.868) | NEW/1231 |
| night | 6289 (0.956) | NIGHT/1233 |
| nose | 6404 (0.974) | NOSE/1221 |
| one | 6296 (0.958) | ONE/1493 |
| path | 6151 (0.935) | PATH/2252 |
| person | 5552 (0.844) | PERSON/683 |
| see | 6104 (0.928) | SEE/1409 |
| skin | 6182 (0.940) | SKIN/763 |
| star | 6220 (0.946) | STAR/1430 |
| stone | 6290 (0.957) | STONE/857 |
| sun | 5877 (0.894) | SUN/1343 |
| tongue | 6430 (0.978) | TONGUE/1205 |
| tooth | 6399 (0.973) | TOOTH/1380 |
| tree | 5850 (0.890) | TREE/906 |
| two | 6285 (0.956) | TWO/1498 |
| water | 6413 (0.975) | WATER/948 |
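
The coverage figures reported in this section can be illustrated with a small sketch. It assumes that the concept coverage of a doculect is the share of the concept list it attests, and that the mutual coverage of a pair is the share of concepts attested in both doculects. These definitions, the toy inventory, and the function names are assumptions made for illustration; they are not taken from the GLED pipeline.

```python
from itertools import combinations

# Toy inventory (hypothetical): the concepts each doculect attests,
# drawn from a four-concept list instead of the 40 used in GLED.
N_CONCEPTS = 4
attested = {
    "lang_a": {"blood", "bone", "dog", "eye"},
    "lang_b": {"blood", "dog", "eye"},
    "lang_c": {"bone", "dog"},
}


def concept_coverage(concepts, n_concepts=N_CONCEPTS):
    """Share of the concept list attested in one doculect."""
    return len(concepts) / n_concepts


def mutual_coverage(a, b, n_concepts=N_CONCEPTS):
    """Share of the concept list attested in both doculects of a pair."""
    return len(a & b) / n_concepts


# Average over doculects, then average over all unordered pairs.
avg_coverage = sum(concept_coverage(c) for c in attested.values()) / len(attested)
pairs = list(combinations(attested, 2))
avg_mutual = sum(
    mutual_coverage(attested[x], attested[y]) for x, y in pairs
) / len(pairs)

print(f"average concept coverage: {avg_coverage:.3f}")
print(f"average mutual coverage:  {avg_mutual:.3f}")
```

Mutual coverage can never exceed the coverage of either doculect in a pair, so under these definitions the pairwise average is expected to be lower than the per-doculect average.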

The collection is less accurate than alternative global resources (e.g., List et al., 2022a) and family or areal resources (e.g., Matisoff, 2008), which merge different sources, offer broader concept coverage, and are manually curated for linguistic and data quality. Such alternatives should be favoured when they encompass all the languages an investigation needs. Nonetheless, GLED constitutes a reliable and convenient source for probing language relationships, prototyping studies, and bootstrapping phylolinguistic analyses (Greenhill et al., 2020). It is likewise designed to support the development of new methods for tasks in computational historical linguistics, including phonological alignment, cognate detection, and sound correspondence inference (List et al., 2018). Finally, the language distances built into the database can be used for adjusted language sampling, as illustrated in Section 4.

(2) Method

The dataset provided by Jäger (2018), derived from ASJP (Brown et al., 2008), was used as the lexical source, excluding doculects that did not fit the design (such as artificial languages, reconstructions, and duplicates). The original transcription system, “ASJPcode”, was mapped to a broad transcription consistent with CLTS/BIPA (Anderson et al., 2018) through an orthographic profile (Moran & Cysouw, 2018). This profile was based on the one produced by the author for including ASJP in the Lexibank project. Mapping decisions followed the non-exhaustive examples of phonological mapping and tokenization given in the original ASJP paper and the phonemic transcriptions of the ASJP word lists provided by other datasets.
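
The way an orthographic profile converts a transcription can be sketched as a greedy longest-match segmentation. The profile below is a toy table written for this example: its entries, the `tokenize` function, and the `?` marker for unmapped characters are assumptions, and the profile actually used for GLED is larger and may map these symbols differently.

```python
# Toy profile (illustrative only): ASJPcode grapheme -> broad IPA segment.
TOY_PROFILE = {
    "kw~": "kʷ",  # multi-character grapheme, must win over shorter matches
    "N": "ŋ",
    "S": "ʃ",
    "7": "ʔ",
    "3": "ə",
    "a": "a",
    "i": "i",
    "k": "k",
}


def tokenize(form, profile, unknown="?"):
    """Segment `form` by always taking the longest grapheme in `profile`."""
    max_len = max(len(key) for key in profile)
    segments = []
    i = 0
    while i < len(form):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in profile:
                segments.append(profile[chunk])
                i += size
                break
        else:
            # No entry matches: flag the character for manual review.
            segments.append(unknown)
            i += 1
    return segments


print(" ".join(tokenize("kw~aSi", TOY_PROFILE)))
```

Flagging unmapped characters instead of dropping them keeps gaps in the profile visible, which matters when the mapping is built from non-exhaustive examples.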

Per-family automatic cognate attribution was performed with LexStat (List, 2012) for small and medium families (i.e., fewer than 18,000 items) and the SVM technique (Jäger, 2018) for large ones. Phonological alignments of the ensuing cognate sets were compiled with LingPy (List & Forkel, 2021). Finally, the data was organized in a single tabular resource; entries were sorted, in order, by family, concept, language, and form (Table 3).

Table 3

A modified snippet from the lexical dataset, showing the most critical columns for a subset of Tupian words for the concept “dog”. The data includes a unique language name, a Glottocode (when available), the family name, a concept gloss derived from the Concepticon catalog, the phonological transcription of the word, the phonological alignment of the word in its cognate set (with hyphens indicating gaps), and a cognate set index.

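| LANGUAGE | CODE | FAMILY | CONCEPT | FORM | ALIGNMENT | COGSET |
|---|---|---|---|---|---|---|
| Aché | ache1246 | Tupian | DOG | bɐegi | b ɐ e g i | 16 |
| Amundava | amun1246 | Tupian | DOG | ɲɐɲwɐrɐ | ɲ ɐ ɲ w - ɐ r ɐ | 17 |
| Avá Canoeiro | avac1239 | Tupian | DOG | jɐwɐrɐ | j ɐ - w - ɐ r ɐ | 17 |
| Paraguayan Guaraní | para1311 | Tupian | DOG | dʒɐgwɐ | dʒ ɐ g w - ɐ - - | 17 |
| Kaiwá | kaiw1246 | Tupian | DOG | jɐgwɐ | j ɐ g w - ɐ - - | 17 |
| Eastern Bolivian Guaraní | east2555 | Tupian | DOG | jeimbɐ | j e - i m b ɐ | 19 |
| Tapieté | tapi1253 | Tupian | DOG | ɲɐʔəmbɐ | ɲ ɐ ʔ ə m b ɐ | 19 |
| Cinta Larga | cint1239 | Tupian | DOG | ɐwəli | ɐ w ə l i | 20 |
| Gavião Do Jiparaná | gavi1246 | Tupian | DOG | ɐvələ | ɐ v ə l ə | 20 |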
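
The alignment column of Table 3 can be read as follows: segments are separated by spaces, hyphens mark gaps, and all members of a cognate set share the same number of columns. The sketch below uses rows copied from Table 3; the function names are hypothetical and the code is an illustration of the format, not part of the GLED pipeline.

```python
from collections import defaultdict

# (language, alignment, cognate set) rows copied from Table 3.
ROWS = [
    ("Amundava", "ɲ ɐ ɲ w - ɐ r ɐ", 17),
    ("Avá Canoeiro", "j ɐ - w - ɐ r ɐ", 17),
    ("Paraguayan Guaraní", "dʒ ɐ g w - ɐ - -", 17),
    ("Kaiwá", "j ɐ g w - ɐ - -", 17),
    ("Eastern Bolivian Guaraní", "j e - i m b ɐ", 19),
    ("Tapieté", "ɲ ɐ ʔ ə m b ɐ", 19),
]


def alignment_columns(rows):
    """Group alignments by cognate set and transpose them into columns."""
    by_cogset = defaultdict(list)
    for _language, alignment, cogset in rows:
        by_cogset[cogset].append(alignment.split())
    columns = {}
    for cogset, aligned in by_cogset.items():
        # Every row of a multiple alignment must have the same width.
        if len({len(tokens) for tokens in aligned}) != 1:
            raise ValueError(f"cognate set {cogset} has ragged alignments")
        columns[cogset] = list(zip(*aligned))
    return columns


def form_from_alignment(alignment):
    """Recover the unaligned form by dropping gap symbols."""
    return "".join(token for token in alignment.split() if token != "-")


columns = alignment_columns(ROWS)
for cogset, cols in columns.items():
    print(cogset, len(cols), "columns; third column:", cols[2])
```

Each column lists the segments that the alignment treats as corresponding across doculects, which is the input that sound correspondence inference typically works from.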

Per-family distance matrices based on the proportion of shared cognates were obtained from this dataset (Figure 2), and unrooted trees were constructed with the Neighbor-Joining method (Saitou & Nei, 1987). Models for inferring phylogenetic trees were produced with a patched version of BEASTling (Maurits et al., 2017) and monophyletically constrained using Glottolog 4.6 (Hammarström et al., 2022). Bayesian MCMC analyses were carried out with BEAST2 (Bouckaert et al., 2019), and summary Maximum Clade Credibility (MCC) trees were obtained with TreeAnnotator (Heled & Bouckaert, 2013). Finally, custom scripts were employed to normalize distances and join these trees, along with the language isolates, into a single unrooted tree (Figure 3). It must be stressed that this global tree is in no way proposed as support for “Proto-Human” hypotheses; it is merely a convenient resource for measuring language distance.
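
A distance based on the proportion of shared cognates can be sketched as below. The exact formula used for GLED is not given here, so the sketch assumes one common choice: one minus the share of mutually attested concepts whose forms fall in the same cognate set. It also assumes a single form per concept, ignoring synonyms. The toy data and function names are hypothetical.

```python
from itertools import combinations

# Toy data (hypothetical): doculect -> {concept: cognate set ID}.
COGNATES = {
    "A": {"dog": 1, "eye": 4, "fire": 7},
    "B": {"dog": 1, "eye": 4, "fire": 8},
    "C": {"dog": 2, "eye": 5},
}


def cognate_distance(a, b):
    """1 - (concepts with the same cognate set / concepts attested in both)."""
    shared_concepts = a.keys() & b.keys()
    if not shared_concepts:
        # Assumption: no comparable concept counts as maximal distance.
        return 1.0
    matches = sum(1 for concept in shared_concepts if a[concept] == b[concept])
    return 1.0 - matches / len(shared_concepts)


def distance_matrix(data):
    """Symmetric matrix of pairwise distances as nested dictionaries."""
    names = sorted(data)
    matrix = {name: {name: 0.0} for name in names}
    for x, y in combinations(names, 2):
        matrix[x][y] = matrix[y][x] = cognate_distance(data[x], data[y])
    return matrix


matrix = distance_matrix(COGNATES)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in sorted(row.items())})
```

A matrix of this kind is the input expected by distance-based tree methods such as Neighbor-Joining.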

Figure 2

A neighbour-net for the Tupian languages in the dataset, plotted with SplitsTree v4 (Huson & Bryant, 2006).

Figure 3

The “global” language tree from the combined Bayesian MCMC phylogenetic inferences, plotted with iTOL (Letunic & Bork, 2021).

The complete pipeline is accessible via the public GitHub repository at https://github.com/tresoldi/gled and takes approximately three days to run on a typical laptop (i5 processor, 8GB RAM, Fedora Linux 37). It will expedite planned forthcoming releases that aggregate sources for languages missing in ASJP, such as recently documented isolates, and that employ alternative methods for computational tasks, such as new methods of cognate detection.

(3) Dataset Description

Object name

gled

Format names and versions

The dataset has the following components:

  • A TSV file (“gled.tsv”) with columns for (a) unique entry ID, (b) language ID (as provided in ASJP), (c) language name (provided by Glottolog, ASJP, or the author), (d) Glottocode when available, (e) Glottolog name when available, (f) family name, (g) concept gloss, (h) Concepticon ID (List et al., 2022b), (i) ASJP original form, (j) reconstructed form, (k) broad IPA transcription, (l) alignment, (m) cognate set ID, and (n) cognate set ID as an integer

  • A YAML file (“gled.resource.yaml”) with the metadata as per the FrictionlessData project

  • NEXUS files (“nexus/*.nex”) for families with more than one language

  • Distance Matrices (“phylo/*.dst”) for families with more than one language, based on the percentage of shared cognates

  • NJ trees in Newick notation (“phylo/*.tree”) for families with more than one language, based on the corresponding distance matrix

  • Bayesian MCMC per-family (“trees/*.tree”) and global (“trees/global.tree”) trees in Newick notation
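
As a minimal sketch, the TSV component can be read with Python's standard `csv` module. The column headers used below (`ID`, `LANGUAGE`, `FAMILY`, `CONCEPT`) are placeholders chosen for this example; the actual header names should be checked in “gled.tsv” or “gled.resource.yaml”. The in-memory sample stands in for the real file.

```python
import csv
import io
from collections import Counter


def count_entries_per_family(handle, family_column="FAMILY"):
    """Count rows per family in a tab-separated file with a header row.

    `family_column` is a hypothetical header name; adjust it to the file.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[family_column] for row in reader)


# Stand-in for: open("gled.tsv", encoding="utf-8", newline="")
sample = io.StringIO(
    "ID\tLANGUAGE\tFAMILY\tCONCEPT\n"
    "1\tAché\tTupian\tDOG\n"
    "2\tKaiwá\tTupian\tDOG\n"
    "3\tSwedish\tIndo-European\tDOG\n"
)

counts = count_entries_per_family(sample)
for family, n_entries in counts.most_common():
    print(family, n_entries)
```

Opening the real file with `encoding="utf-8"` matters because the transcriptions contain IPA characters outside ASCII.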

Language

English

Licence

CC-BY-4.0

Publication date

2022-11-27

(4) Reuse Potential

Provided that its limitations in scale and rigour, which are inherited from ASJP and examined in Brown et al. (2008) and Jäger (2018), are taken into account, the dataset provides many opportunities for reuse in empirical historical linguistics focused on lexical and phonetic data. Furthermore, as the doculects are linked to Glottolog, it is viable to integrate the data with other global-level resources, such as the World Loanword Database (Haspelmath & Tadmor, 2009), the World Atlas of Language Structures (Haspelmath et al., 2005), and Phoible (Moran & McCloy, 2019).

The distance matrices and phylogenetic trees offer a convenient starting point for comparing the results of different and more advanced analyses, notably for under-studied and under-resourced language families for which no distance matrix or phylogenetic tree with branch lengths is available. Table 4 illustrates such distances, showing values from the trees inferred without (NJ) and with (B) a molecular clock. Such distances can be used to perform weighted random sampling at global, family, and sub-family levels, addressing issues such as sample bias and autocorrelation in cross-linguistic analyses.

Table 4

Distance between Swedish (swed1254) and other languages, as computed using the Neighbour Joining trees (NJ, from zero to infinity), the Bayesian trees (B, from zero to 4.0), and the normalized Bayesian trees (NB, from zero to 1.0).

| LANGUAGE (GLOTTOCODE) | NJ | B | NB |
|---|---|---|---|
| Norwegian Bokmål (norw1259) | 0.21 | 0.11 | 0.02 |
| Danish (dani1285) | 0.24 | 0.02 | 0.01 |
| Dutch (dutc1256) | 0.41 | 1.40 | 0.35 |
| English (stan1293) | 0.42 | 1.40 | 0.35 |
| Italian (ital1282) | 0.84 | 1.60 | 0.40 |
| Hindi (hind1269) | 0.90 | 1.95 | 0.48 |
| Hittite (hitt1242) | 0.90 | 1.97 | 0.49 |
| Basque (basq1248) | | 4.00 | 1.00 |
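
One way to use such distances for weighted random sampling is sketched below. It draws languages without replacement, with each draw proportional to the normalized Bayesian (NB) distance from Swedish, so that close relatives are rarely picked together with the reference language. The weighting scheme and the function name are assumptions made for this example; only the NB values come from Table 4.

```python
import random

# Normalized Bayesian (NB) distances from Swedish, copied from Table 4.
NB_FROM_SWEDISH = {
    "norw1259": 0.02,
    "dani1285": 0.01,
    "dutc1256": 0.35,
    "stan1293": 0.35,
    "ital1282": 0.40,
    "hind1269": 0.48,
    "hitt1242": 0.49,
    "basq1248": 1.00,
}


def weighted_sample(weights, k, rng):
    """Draw up to k distinct keys; each draw is proportional to its weight."""
    remaining = dict(weights)
    chosen = []
    for _ in range(min(k, len(remaining))):
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        chosen.append(pick)
        # Remove the pick so that it cannot be drawn twice.
        del remaining[pick]
    return chosen


# A seeded generator makes the draw reproducible.
sample = weighted_sample(NB_FROM_SWEDISH, k=3, rng=random.Random(42))
print(sample)
```

A fuller scheme would update the weights after each draw using the distances to every language already selected, which requires the complete distance matrix instead of a single row.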

Funding Statement

The database was developed in the “Cultural Evolution of Texts” project, with funding from the Riksbankens Jubileumsfond (grant agreement ID: MXM19-1087:1).

Competing Interests

The author has no competing interests to declare.

Author Contributions

Tiago Tresoldi: conceptualization, data curation, methodology, project administration, software, visualization, writing – original draft, writing – review & editing.

DOI: https://doi.org/10.5334/johd.96 | Journal eISSN: 2059-481X
Language: English
Published on: Feb 2, 2023
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2023 Tiago Tresoldi, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.