| (1) | mỗi | group | phải | có | a different focus |
|---|---|---|---|---|---|
| | each | | must | have | |

"Each group must have a different focus." (CanVEC, L. Nguyen & Bryant, 2020)
Table 1
The language tags used in the data, their meanings, and some examples.
| TAG | MEANING | EXAMPLE |
|---|---|---|
| en | English | I, the, songs, listening |
| hi | Hindi | Apna (mine), ladki (girl), peeti (drinks) |
| univ | Universal | #, !, @abc, #happy |
| mixed | Mixed | Dedh-litre (1.5 litre) |
| acro | Acronym | IITB, USA |
| ne | Named Entity | Europe, Paris |
| undef | Undefined | M |
Table 2
The distribution of language tags across the three datasets and overall.
| TAG | | | | OVERALL |
|---|---|---|---|---|
| en | 13,214 | 3,732 | 363 | 17,309 |
| hi | 2,857 | 9,779 | 2,539 | 15,175 |
| univ | 3,628 | 3,354 | 281 | 7,263 |
| mixed | 7 | 1 | 0 | 8 |
| acro | 251 | 32 | 0 | 283 |
| ne | 656 | 413 | 35 | 1,104 |
| undef | 2 | 0 | 0 | 2 |
| Total | 20,615 | 17,311 | 3,218 | 41,144 |
Table 3
Final distribution of language tags after preprocessing, with the mixed, acro, ne and undef tags merged into univ.
| TAG | | | | OVERALL |
|---|---|---|---|---|
| en | 13,214 | 3,732 | 363 | 17,309 |
| hi | 2,857 | 9,779 | 2,539 | 15,175 |
| univ | 4,544 | 3,800 | 316 | 8,660 |
| Total | 20,615 | 17,311 | 3,218 | 41,144 |
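
Comparing Table 2 with Table 3, the univ counts in Table 3 equal the summed counts of univ, mixed, acro, ne and undef in Table 2, which indicates that the low-frequency tags were folded into univ during preprocessing. The sketch below illustrates one way such a collapse could be implemented; the function and data structures are illustrative, not the exact preprocessing code.

```python
from collections import Counter

# Tags retained after preprocessing (Table 3); all other tags are
# assumed to be collapsed into "univ" (illustrative, not the actual code).
KEEP_TAGS = {"en", "hi", "univ"}

def collapse_rare_tags(tagged_tokens):
    """Map low-frequency tags (mixed, acro, ne, undef) to 'univ'."""
    return [(tok, tag if tag in KEEP_TAGS else "univ")
            for tok, tag in tagged_tokens]

# Example using tokens from Table 1: the 'ne' and 'acro' tags become 'univ'.
sample = [("Paris", "ne"), ("IITB", "acro"), ("ladki", "hi"), ("songs", "en")]
print(Counter(tag for _, tag in collapse_rare_tags(sample)))
# Counter({'univ': 2, 'hi': 1, 'en': 1})
```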
Table 4
The 20 most frequent language-ambiguous tokens and their frequencies.
| TOKEN | FREQ. | TOKEN | FREQ. |
|---|---|---|---|
| to | 556 | this | 134 |
| I | 496 | my | 126 |
| a | 357 | for | 126 |
| of | 258 | aur | 122 |
| in | 236 | h | 111 |
| you | 212 | it | 108 |
| is | 185 | have | 104 |
| me | 184 | on | 100 |
| accha | 152 | or | 91 |
| ho | 145 | hi | 88 |

Figure 1
Language tagging performance as a function of manual disambiguation list size.
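
Table 4 and Figure 1 concern tokens that a purely dictionary-based tagger cannot resolve because they occur in more than one word list (or are otherwise ambiguous). The sketch below illustrates how a manual disambiguation list can override plain lexicon lookup; the word lists, the manual entries and the fallback behaviour are hypothetical placeholders rather than the actual resources used.

```python
# Hypothetical word lists; the actual system would use much larger lexicons.
ENGLISH_WORDS = {"i", "the", "songs", "listening", "to", "me", "is", "have"}
HINDI_WORDS = {"apna", "ladki", "accha", "ho", "aur", "to", "me", "hai"}

# Manual disambiguation list: ambiguous tokens mapped to a fixed tag.
# The entries and their tags here are illustrative placeholders only.
MANUAL_TAGS = {"to": "hi", "me": "hi"}

def tag_token(token):
    """Dictionary lookup with a manual override for ambiguous tokens."""
    t = token.lower()
    if t in MANUAL_TAGS:          # manual disambiguation takes precedence
        return MANUAL_TAGS[t]
    in_en, in_hi = t in ENGLISH_WORDS, t in HINDI_WORDS
    if in_en and not in_hi:
        return "en"
    if in_hi and not in_en:
        return "hi"
    # Placeholder fallback for tokens in both or neither list; the real
    # system may handle these cases differently.
    return "univ"

print([tag_token(w) for w in ["listening", "to", "accha", "songs", "#happy"]])
# ['en', 'hi', 'hi', 'en', 'univ']
```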
Table 5
Precision, Recall and F1 scores for each language tag in each corpus.
| TAG | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|
| en | 93.34 | 98.35 | 95.78 | 70.26 | 81.32 | 75.39 |
| hi | 89.04 | 85.61 | 87.30 | 90.72 | 82.08 | 86.19 |
| univ | 97.36 | 84.51 | 90.48 | 80.35 | 87.61 | 83.82 |

| TAG | P | R | F1 | P (OVERALL) | R (OVERALL) | F1 (OVERALL) |
|---|---|---|---|---|---|---|
| en | 39.52 | 80.99 | 53.12 | 85.98 | 94.32 | 89.95 |
| hi | 96.65 | 78.30 | 86.51 | 91.28 | 82.12 | 86.45 |
| univ | 59.71 | 78.80 | 67.94 | 87.23 | 85.66 | 86.44 |
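
The per-tag scores in Table 5 follow the standard precision, recall and F1 definitions computed from per-tag true positives, false positives and false negatives. The sketch below shows this computation for reference; the function name and toy data are illustrative.

```python
from collections import Counter

def per_tag_prf(gold_tags, pred_tags):
    """Per-tag precision, recall and F1 (in %), as reported in Table 5."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold_tags, pred_tags):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted p, but gold was g
            fn[g] += 1   # gold g was missed
    scores = {}
    for tag in set(gold_tags) | set(pred_tags):
        prec = 100 * tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        rec = 100 * tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[tag] = (prec, rec, f1)
    return scores

# Toy example with the three final tags from Table 3.
gold = ["en", "hi", "hi", "univ", "en"]
pred = ["en", "hi", "en", "univ", "en"]
print(per_tag_prf(gold, pred))
```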
Table 6
Confusion matrices for correct (COR) and incorrect (INC) labels in each dataset and overall.
| PRED \ GOLD | COR | INC | PRED \ GOLD | COR | INC |
|---|---|---|---|---|---|
| COR | 466 | 6 | COR | 425 | 25 |
| INC | 24 | 4 | INC | 35 | 15 |

| PRED \ GOLD | COR | INC | PRED \ GOLD (OVERALL) | COR | INC |
|---|---|---|---|---|---|
| COR | 403 | 49 | COR | 1294 | 80 |
| INC | 41 | 7 | INC | 100 | 26 |
Table 7
The five different types of classification errors with examples.
| CODE | MEANING | EXAMPLES (AND INCORRECT PREDICTED TAG) | | |
|---|---|---|---|---|
| A | Tokenisation/Orthography | ∧LøVĕ∧ (hi) | -*Subha (en) | 2014–15)ka (en) |
| B | Named entity | Tanzeel (en) | Amir (hi) | chennai (en) |
| C | Token in both word lists | he (en) | to (en) | are (en) |
| D | Token in neither word list | Achhi (en) | Namaskar (en) | tiket (hi) |
| E | Token in incorrect word list | Mt (en) | thy (en) | pre (hi) |
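
Error types C, D and E in Table 7 are defined by which word list a mis-tagged token occurs in, so they can be checked mechanically once an error has been identified. The sketch below illustrates that check using the same hypothetical word lists as the earlier sketches; it is not the authors' analysis script, and types A and B (tokenisation/orthography and named entities) are left to manual inspection.

```python
# Hypothetical word lists, as in the earlier sketches.
ENGLISH_WORDS = {"he", "to", "are", "thy", "the"}
HINDI_WORDS = {"to", "ho", "accha"}

def lexicon_error_type(token, gold_tag, pred_tag):
    """Classify a mis-tagged token into error type C, D or E from Table 7."""
    if gold_tag == pred_tag:
        return None                     # not an error
    t = token.lower()
    in_en, in_hi = t in ENGLISH_WORDS, t in HINDI_WORDS
    if in_en and in_hi:
        return "C"                      # token in both word lists
    if not in_en and not in_hi:
        return "D"                      # token in neither word list
    if (pred_tag == "en" and in_en) or (pred_tag == "hi" and in_hi):
        return "E"                      # token only in the incorrect word list
    return None                         # falls under another error type (A/B)

print(lexicon_error_type("to", "hi", "en"))        # C: 'to' is in both lists
print(lexicon_error_type("Namaskar", "hi", "en"))  # D: in neither list
print(lexicon_error_type("thy", "hi", "en"))       # E: only in the English list
```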
