
Figure 1
An overview of the workflow.

Figure 2
The geographic distribution of the Hmong-Mien languages selected for our sample.
Table 1
A minimal example for four words in four Germanic languages, given in our minimal tabular format. The column VALUE (which is not required) provides the orthographical form of each word [20, 21].
| ID | DOCULECT | CONCEPT | VALUE | TOKENS |
|---|---|---|---|---|
| 1 | English | house | house | h aʊ s |
| 2 | German | house | Haus | h au s |
| 3 | Dutch | house | huis | h ʊɪ s |
| 4 | Swedish | house | hus | h ʉː s |
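
As a minimal sketch of how such a wordlist could be read programmatically, the snippet below parses the rows of Table 1 from tab-separated text using only the Python standard library. The tab-separated serialization and the helper name `read_wordlist` are assumptions made for illustration; they are not prescribed by the table itself.

```python
import csv
import io

# Table 1 serialized as tab-separated text (assumed serialization).
RAW = (
    "ID\tDOCULECT\tCONCEPT\tVALUE\tTOKENS\n"
    "1\tEnglish\thouse\thouse\th aʊ s\n"
    "2\tGerman\thouse\tHaus\th au s\n"
    "3\tDutch\thouse\thuis\th ʊɪ s\n"
    "4\tSwedish\thouse\thus\th ʉː s\n"
)


def read_wordlist(text):
    """Map each integer ID to its row; TOKENS is split into a list of segments."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row["TOKENS"] = row["TOKENS"].split()  # segments are space-separated
        rows[int(row.pop("ID"))] = row
    return rows


wordlist = read_wordlist(RAW)
print(wordlist[2]["DOCULECT"], wordlist[2]["TOKENS"])
```

Keeping TOKENS as a list of segments rather than a plain string means that later steps can treat each sound as one unit, even when it is written with several characters.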

Figure 3
An example to illustrate the usage of orthography profiles to tokenize the phonetic transcriptions.
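
The idea behind profile-based tokenization can be sketched as a greedy longest-match segmentation. The code below is an illustration under stated assumptions, not the implementation used in the workflow: the function name `tokenize`, the toy `PROFILE`, and the convention of wrapping unknown characters in angle brackets are all hypothetical.

```python
def tokenize(form, profile):
    """Segment `form` greedily, preferring the longest grapheme in `profile`."""
    longest = max(len(grapheme) for grapheme in profile)
    tokens, i = [], 0
    while i < len(form):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in profile:
                tokens.append(chunk)
                i += size
                break
        else:
            # No grapheme matched: flag the character so the profile can be extended.
            tokens.append("<" + form[i] + ">")
            i += 1
    return tokens


# Hypothetical toy profile covering two of the forms in Table 1.
PROFILE = {"h", "a", "u", "s", "au", "ʉː"}

print(" ".join(tokenize("haus", PROFILE)))  # h au s
print(" ".join(tokenize("hʉːs", PROFILE)))
```

Flagging unmatched characters instead of silently dropping them makes gaps in the profile visible, so that the profile can be refined iteratively.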

Table 2
The transformation from raw to machine-readable data. As illustrated in Table 1, the VALUE column displays the raw form. The tokenized forms are added to the TOKENS column.

Figure 4
A comparison of full cognate sets (COGID) and partial cognate sets (COGIDS). While none of the four words is entirely cognate with each other, they all share a common element. Note that the IDs for full cognates and partial cognates are independent of each other. For better visibility, we have marked the partial cognates shared among all language varieties in red font.

Figure 5
The alignment of ‘sun’ (cognate ID 1) among four Hmong-Mien languages, with segments colored according to their basic sound classes. The table on the left shows the cognate identifiers for cognate morphemes, as discussed in Figure 4. The table on the right shows how the cognate morphemes with identifier 1 (basic meaning ‘sun’) are aligned.

Figure 6
Illustration of the template-based alignment procedure. a) Representing prosodic structure reflecting syllable templates for each morpheme in the data. b) Aligning tokenized transcriptions to templates, and deleting empty slots.
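
The two steps in the caption can be sketched as follows. This is a simplified illustration, not the procedure's actual code: the five-slot template (initial, medial, nucleus, coda, tone), the function name `align_to_template`, and the two toy morphemes are assumptions, and the forms are invented for demonstration rather than taken from the data.

```python
# Assumed syllable template: initial, medial, nucleus, coda, tone.
TEMPLATE = ["i", "m", "n", "c", "t"]


def align_to_template(morphemes, template=TEMPLATE, gap="-"):
    """Place each morpheme's segments into template slots, then drop empty slots.

    `morphemes` is a list of (tokens, structure) pairs, where `structure`
    names the template slot filled by each token.
    """
    rows = []
    for tokens, structure in morphemes:
        slots = dict(zip(structure, tokens))
        rows.append([slots.get(position, gap) for position in template])
    # Keep only the columns that are filled in at least one row.
    keep = [j for j in range(len(template)) if any(row[j] != gap for row in rows)]
    return [[row[j] for j in keep] for row in rows]


# Invented toy morphemes (not attested forms).
toy = [
    (["n", "o", "ŋ", "¹"], ["i", "n", "c", "t"]),
    (["n", "a", "¹"], ["i", "n", "t"]),
]
for row in align_to_template(toy):
    print(" ".join(row))
```

Because every segment is assigned to a fixed slot, no pairwise comparison of sounds is needed in this sketch; the alignment follows directly from the prosodic structure.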

Table 3
Examples of compound words in Hmong-Mien languages. The column MORPHEMES uses morpheme glosses [31] in order to indicate which of the words are cognate inside the same language. The form for ‘net’ in the table serves to show that ‘bed’ and ‘net’ are not colexified, and that instead ‘fishnet’ is an analogical compound word.

Table 4
Two glosses from [8], ‘son’ and ‘daughter’, shown as an example of the difference between cognates inside meaning slots and cognates across meaning slots.

Figure 7
A comparison of the alignments for morphemes meaning ‘son’ and ‘daughter’, illustrating how cross-semantic cognates can be identified. Cognate sets in which the forms in the languages are identical are clustered together and assigned a unique cross-semantic cognate identifier (CROSSID). Those that are not compatible, such as cognate sets 2 and 1 in our example, are left separate.
Table 5
An example of correspondence sets in the classical literature, following Ratliff [11, p. 75]. Reconstructed forms for Proto-Hmong-Mien are preceded by an asterisk.
| Gloss [Proto-Hmong-Mien] | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| blood [*ntshjamX] | ɕhaŋ³ | ȵtɕhi³ | ɳtʂha³ | ntsua³ᵇ | nʔtshenᴮ | θi³ | ȵe³ | ɕam³ | saːm³ | san³ | dzjɛm³ |
| head louse [*ntshjeiX] | ɕhu³ | ȵtɕhi³ | ɳtsau³ᵇ | ntsɔ³ᵇ | nʔtshuᴮ | – | tɕhi³ | ɕeib³ | tθei³ | – | dzɛi³ |
| to fear/be afraid [*ntshjeX] | ɕhi¹ | – | ɳtʂai⁵ | ntse⁵ᵇ | nʔtsheC | ɳtʃei¹ | ȵɛ⁵ | dʑa⁵ | ȡa⁵’ | ȡa⁵ | dzjɛ⁵ |
| clear [*ntshjiəŋ] | ɕhi¹ | – | ɳtʂia¹ | ntsæin¹ᵇ | nʔtsheA | – | nɪ̃¹ | dzaŋ¹ | – | – | – |
Table 6
A summary of the results of the sound correspondence pattern inference algorithm applied to our data. The numbers give the count of sound correspondence patterns detected at each position in the syllable.
| Position | ‘Regular’ Patterns | Singletons |
|---|---|---|
| Initial | 165 | 106 |
| Medial | 45 | 23 |
| Nucleus | 213 | 57 |
| Coda | 66 | 13 |
| Tone | 164 | 29 |
| Total | 653 | 228 |

Table 7
Cells shaded in blue indicate the initial consonants belonging to a common correspondence pattern, with missing reflexes indicated by a Ø.
