Computer-Assisted Language Comparison: State of the Art

Mei-Shin Wu; Nathanael E. Schweikhard; Timotheus A. Bodt; Nathan W. Hill; Johann-Mattis List

doi:10.5334/johd.12

Figures & Tables

The geographic distribution of the Hmong-Mien languages selected for our sample.

Table 1

A minimal example of four words in four Germanic languages, given in our minimal tabular format. The optional VALUE column provides the orthographic form of each word [20, 21].

ID	DOCULECT	CONCEPT	VALUE	TOKENS
1	English	house	house	h aʊ s
2	German	house	Haus	h au s
3	Dutch	house	huis	h ʊɪ s
4	Swedish	house	hus	h ʉː s
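
The format in Table 1 can be read with the Python standard library alone. The following is a minimal sketch rather than the authors' tooling: the helper name `read_wordlist` is hypothetical, and it assumes a tab-separated table with a header row and space-separated segments in the TOKENS column.

```python
import csv
import io

# Table 1, reproduced as tab-separated text (header row first).
TABLE_1 = (
    "ID\tDOCULECT\tCONCEPT\tVALUE\tTOKENS\n"
    "1\tEnglish\thouse\thouse\th aʊ s\n"
    "2\tGerman\thouse\tHaus\th au s\n"
    "3\tDutch\thouse\thuis\th ʊɪ s\n"
    "4\tSwedish\thouse\thus\th ʉː s\n"
)

def read_wordlist(text):
    """Parse the tabular format into a dict keyed by integer ID (hypothetical helper)."""
    wordlist = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        entry = dict(row)
        # TOKENS holds space-separated segments; store them as a list.
        entry["TOKENS"] = row["TOKENS"].split()
        wordlist[int(entry.pop("ID"))] = entry
    return wordlist

wordlist = read_wordlist(TABLE_1)
for idx, entry in wordlist.items():
    print(idx, entry["DOCULECT"], entry["TOKENS"])
```

Keying the entries by ID keeps every later annotation (cognate identifiers, alignments) addressable by the same row identifier.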

An example illustrating the use of orthography profiles to tokenize phonetic transcriptions.

The transformation from raw to machine-readable data. As illustrated in Table 1, the VALUE column displays the raw form. The tokenized forms are added to the TOKENS column.
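
The step from VALUE to TOKENS can be sketched as a greedy longest-match segmentation against a list of known graphemes. This is a simplified illustration under stated assumptions: the function `tokenize`, the profile contents, and the sample form are all hypothetical and not taken from the article's data, and a real orthography profile may also map each grapheme to a replacement value, which this sketch omits.

```python
def tokenize(form, graphemes):
    """Segment `form` greedily, always taking the longest matching grapheme."""
    # Longest graphemes first, so "ntsh" wins over "n"; skip empty entries.
    by_length = sorted((g for g in graphemes if g), key=len, reverse=True)
    tokens, i = [], 0
    while i < len(form):
        for g in by_length:
            if form.startswith(g, i):
                tokens.append(g)
                i += len(g)
                break
        else:
            # Character not covered by the profile: flag it instead of dropping it.
            tokens.append("<" + form[i] + ">")
            i += 1
    return tokens

# Hypothetical profile and form, invented for illustration.
PROFILE = ["ntsh", "n", "t", "s", "h", "a", "ua", "u", "³"]

print(" ".join(tokenize("ntshua³", PROFILE)))
```

Flagging uncovered characters makes gaps in the profile visible, so the profile can be extended before the data are used further.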

The comparison of full cognate sets (COGID) and partial cognate sets (COGIDS). While none of the four words is fully cognate with any of the others, they all share a common element. Note that the IDs for full cognates and partial cognates are independent of each other. For better visibility, we have marked the partial cognates shared among all language varieties in red font.
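
The distinction between the two annotations can be made concrete with a small sketch. The data below are hypothetical toy values chosen to mirror the situation described in the caption (no two words share a COGID, yet one morpheme is shared by all); the helper names are assumptions as well.

```python
# Hypothetical toy data: COGID is one ID per word, COGIDS one ID per morpheme.
words = {
    "A": {"COGID": 1, "COGIDS": [1, 2]},
    "B": {"COGID": 2, "COGIDS": [1, 3]},
    "C": {"COGID": 3, "COGIDS": [4, 1]},
    "D": {"COGID": 4, "COGIDS": [1]},
}

def shared_partial_ids(words):
    """Return the partial cognate IDs that occur in every word."""
    sets = [set(w["COGIDS"]) for w in words.values()]
    return set.intersection(*sets) if sets else set()

def fully_cognate(a, b):
    """Two words are fully cognate only if their COGID values match."""
    return a["COGID"] == b["COGID"]

print(shared_partial_ids(words))
print(fully_cognate(words["A"], words["B"]))
```

Because the two ID spaces are independent, the value 1 in COGIDS has no relation to the value 1 in COGID.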

The alignment of ‘sun’ (cognate ID 1) across four Hmong-Mien languages, with segments colored according to their basic sound classes. The table on the left shows the cognate identifiers for cognate morphemes, as discussed in Figure 4. The table on the right shows how the cognate morphemes with identifier 1 (basic meaning ‘sun’) are aligned.

Illustration of the template-based alignment procedure. a) Representing prosodic structure reflecting syllable templates for each morpheme in the data. b) Aligning tokenized transcriptions to templates, and deleting empty slots.
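
The two steps named in the caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the single-letter slot codes in `TEMPLATE` are an assumed encoding of the positions initial, medial, nucleus, coda, and tone, the morphemes are invented, and each slot is assumed to be filled at most once per morpheme.

```python
TEMPLATE = "imnct"  # initial, medial, nucleus, coda, tone (codes are an assumption)
GAP = "-"

def to_template(tokens, structure):
    """Place each token in the slot named by its structure code; pad empty slots."""
    if len(tokens) != len(structure):
        raise ValueError("tokens and structure must have equal length")
    if len(set(structure)) != len(structure) or set(structure) - set(TEMPLATE):
        raise ValueError("structure codes must be unique and part of the template")
    slots = dict(zip(structure, tokens))
    return [slots.get(code, GAP) for code in TEMPLATE]

def align(morphemes):
    """Align (tokens, structure) pairs, then drop slots that are empty in every row."""
    rows = [to_template(tokens, structure) for tokens, structure in morphemes]
    keep = [i for i in range(len(TEMPLATE)) if any(row[i] != GAP for row in rows)]
    return [[row[i] for i in keep] for row in rows]

# Hypothetical morphemes, invented for illustration.
morphemes = [
    (["n", "o", "ŋ", "¹"], "inct"),
    (["n", "a", "¹"], "int"),
    (["a", "ŋ", "²"], "nct"),
]
for row in align(morphemes):
    print(" ".join(row))
```

Since every morpheme is mapped onto the same fixed template, no pairwise sequence comparison is needed; the alignment follows directly from the prosodic annotation.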

Examples of compound words in Hmong-Mien languages. The column MORPHEMES uses morpheme glosses [31] to indicate which of the words are cognate within the same language. The form for ‘net’ in the table serves to show that ‘bed’ and ‘net’ are not colexified, and that ‘fishnet’ is instead an analogical compound word.

Two glosses from [8], ‘son’ and ‘daughter’, shown as an example of the difference between cognates within a meaning slot and cognates across meaning slots.

Comparison of the alignments for morphemes meaning ‘son’ and ‘daughter’, illustrating how cross-semantic cognates can be identified. Cognate sets in which the forms in the languages are identical are clustered together and assigned a unique cross-semantic cognate identifier (CROSSID). Sets that are not compatible, such as cognate sets 2 and 1 in our example, are left separate.
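
The clustering idea in this caption can be sketched with a simple greedy procedure. This is a simplified stand-in under stated assumptions, not the algorithm used in the article: the functions `compatible` and `assign_crossids`, the language labels, and the forms are all hypothetical, and two sets are treated as compatible when they share at least one language and have identical forms in every shared language.

```python
def compatible(a, b):
    """True if the sets share a language and agree on every shared language."""
    shared = a.keys() & b.keys()
    return bool(shared) and all(a[lang] == b[lang] for lang in shared)

def assign_crossids(cognate_sets):
    """Greedily group compatible cognate sets; returns {set key: CROSSID}."""
    clusters = []  # one merged language -> form mapping per CROSSID
    crossids = {}
    for key, forms in cognate_sets.items():
        for crossid, merged in enumerate(clusters, start=1):
            if compatible(merged, forms):
                merged.update(forms)
                crossids[key] = crossid
                break
        else:
            # No compatible cluster found: open a new one.
            clusters.append(dict(forms))
            crossids[key] = len(clusters)
    return crossids

# Hypothetical data: keys are (concept, COGID), values map language -> form.
cognate_sets = {
    ("son", 1): {"L1": "tu", "L2": "to"},
    ("daughter", 1): {"L1": "tu", "L2": "to"},
    ("daughter", 2): {"L1": "mpa", "L2": "me"},
}
print(assign_crossids(cognate_sets))
```

A greedy pass depends on the order of the input; it is used here only to keep the example short.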

Table 5

An example of correspondence sets in the classical literature, following Ratliff [11, p. 75]. Reconstructed forms for Proto-Hmong-Mien are preceded by an asterisk.

	1	2	3	4	5	6	7	8	9	10	11
blood [*ntshjamX]	ɕhaŋ³	ȵtɕhi³	ɳtʂha³	ntsua³ᵇ	nʔtshenᴮ	θi³	ȵe³	ɕam³	saːm³	san³	dzjɛm³
head louse [*ntshjeiX]	ɕhu³	ȵtɕhi³	ɳtsau³ᵇ	ntsɔ³ᵇ	nʔtshuᴮ	–	tɕhi³	ɕeib³	tθei³	–	dzɛi³
to fear/be afraid [*ntshjeX]	ɕhi¹	–	ɳtʂai⁵	ntse⁵ᵇ	nʔtshe^C	ɳtʃei¹	ȵɛ⁵	dʑa⁵	ȡa⁵’	ȡa⁵	dzjɛ⁵
clear [*ntshjiəŋ]	ɕhi¹	–	ɳtʂia¹	ntsæin¹ᵇ	nʔtshe^A	–	nɪ̃¹	dzaŋ¹	–	–	–

Table 6

A summary of the results of the sound correspondence pattern inference algorithm applied to our data. The numbers in each row give the count of sound correspondence patterns detected at that position in the syllable.

Position	‘Regular’ Patterns	Singletons
Initial	165	106
Medials	45	23
Nucleus	213	57
Coda	66	13
Tone	164	29
Total	653	228

Cells shaded in blue indicate the initial consonants belonging to a common correspondence pattern, with missing reflexes indicated by a Ø.
