Automatic Language Identification in Code-Switched Hindi-English Social Media Text

Open Access
Jun 2021

Figures & Tables

(1) mỗi group phải a different focus
    each must have
    “Each group must have a different focus.”
    (CanVEC, L. Nguyen & Bryant, 2020)

Table 1

The different language tags in the data, their meaning and some examples.

TAG    MEANING       EXAMPLE
en     English       I, the, songs, listening
hi     Hindi         Apna (mine), ladki (girl), peeti (drinks)
univ   Universal     #, !, @abc, #happy
mixed  Mixed         Dedh-litre (1.5 litre)
acro   Acronym       IITB, USA
ne     Named Entity  Europe, Paris
undef  Undefined     M

Table 2

The distribution of language tags across datasets and overall.

TAG    FACEBOOK  TWITTER  WHATSAPP  OVERALL
en       13,214    3,732       363   17,309
hi        2,857    9,779     2,539   15,175
univ      3,628    3,354       281    7,263
mixed         7        1         0        8
acro        251       32         0      283
ne          656      413        35    1,104
undef         2        0         0        2
Total    20,615   17,311     3,218   41,144

Table 3

Final distribution of language tags after preprocessing.

TAG    FACEBOOK  TWITTER  WHATSAPP  OVERALL
en       13,214    3,732       363   17,309
hi        2,857    9,779     2,539   15,175
univ      4,544    3,800       316    8,660
Total    20,615   17,311     3,218   41,144

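The counts in Tables 2 and 3 are consistent with the four minor tags (mixed, acro, ne, undef) having been folded into univ during preprocessing. The Python sketch below checks that arithmetic only. The merge rule is an inference from the published numbers, not a description of the authors' pipeline, and the names `TABLE2`, `MINOR_TAGS` and `merge_minor_tags` are hypothetical.

```python
# Token counts per tag from Table 2, ordered as
# (Facebook, Twitter, WhatsApp, Overall).
TABLE2 = {
    "en": (13214, 3732, 363, 17309),
    "hi": (2857, 9779, 2539, 15175),
    "univ": (3628, 3354, 281, 7263),
    "mixed": (7, 1, 0, 8),
    "acro": (251, 32, 0, 283),
    "ne": (656, 413, 35, 1104),
    "undef": (2, 0, 0, 2),
}

# Assumption: these are the tags absorbed into "univ".
MINOR_TAGS = ("mixed", "acro", "ne", "undef")


def merge_minor_tags(counts):
    """Return a three-tag table with the minor tags added to 'univ'."""
    merged = {"en": counts["en"], "hi": counts["hi"]}
    folded = [counts["univ"]] + [counts[tag] for tag in MINOR_TAGS]
    # Sum column-wise so each dataset keeps its own total.
    merged["univ"] = tuple(sum(column) for column in zip(*folded))
    return merged


TABLE3 = merge_minor_tags(TABLE2)
print(TABLE3["univ"])  # (4544, 3800, 316, 8660), the univ row of Table 3
```

Because tokens are only relabelled, the per-dataset totals are unchanged between the two tables.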
Table 4

The top 20 most frequent ambiguous-language tokens and their frequency.

TOKEN  FREQ.    TOKEN  FREQ.
to       556    this     134
I        496    my       126
a        357    for      126
of       258    aur      122
in       236    h        111
you      212    it       108
is       185    have     104
me       184    on       100
accha    152    or        91
ho       145    hi        88

Figure 1

Language tagging performance as a function of manual disambiguation list size.

Table 5

Precision, Recall and F1 scores for each language tag in each corpus.

       FACEBOOK               TWITTER
TAG    P      R      F1       P      R      F1
en     93.34  98.35  95.78    70.26  81.32  75.39
hi     89.04  85.61  87.30    90.72  82.08  86.19
univ   97.36  84.51  90.48    80.35  87.61  83.82

       WHATSAPP               OVERALL
TAG    P      R      F1       P      R      F1
en     39.52  80.99  53.12    85.98  94.32  89.95
hi     96.65  78.30  86.51    91.28  82.12  86.45
univ   59.71  78.80  67.94    87.23  85.66  86.44

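As a reading aid for Table 5, F1 is the harmonic mean of precision (P) and recall (R). The short Python sketch below recomputes F1 for the en rows from the rounded P and R values in the table. It assumes the standard definition; the helper `f1_score` and the dictionary `EN_ROWS` are illustrative names, not part of the authors' code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) for the 'en' tag, copied from Table 5.
EN_ROWS = {
    "Facebook": (93.34, 98.35, 95.78),
    "Twitter": (70.26, 81.32, 75.39),
    "WhatsApp": (39.52, 80.99, 53.12),
    "Overall": (85.98, 94.32, 89.95),
}

for dataset, (p, r, reported) in EN_ROWS.items():
    recomputed = f1_score(p, r)
    # Small differences are expected because P and R are already rounded.
    print(f"{dataset}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```

The recomputed values match the reported ones to within rounding, which also shows why the low WhatsApp precision for en pulls its F1 down despite a recall of about 81.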
Table 6

Confusion matrices for correct (COR) and incorrect (INC) labels in each dataset.

FACEBOOK                 TWITTER
        GOLD                     GOLD
PRED    COR    INC       PRED    COR    INC
COR     466      6       COR     425     25
INC      24      4       INC      35     15

WHATSAPP                 OVERALL
        GOLD                     GOLD
PRED    COR    INC       PRED    COR    INC
COR     403     49       COR   1,294     80
INC      41      7       INC     100     26

Table 7

The five different types of classification errors with examples.

CODE  MEANING                       EXAMPLES (AND INCORRECT PREDICTED TAG)
A     Tokenisation/Orthography      LøVĕ (hi)      -*Subha (en)     2014–15)ka (en)
B     Named entity                  Tanzeel (en)   Amir (hi)        chennai (en)
C     Token in both word lists      he (en)        to (en)          are (en)
D     Token in neither word list    Achhi (en)     Namaskar (en)    tiket (hi)
E     Token in incorrect word list  Mt (en)        thy (en)         pre (hi)

Table 8

The error type distribution between datasets.

CODE   FACEBOOK  TWITTER  WHATSAPP  OVERALL
A             3       20         1       24
B             5       16         7       28
C             4        4        21       29
D            12        8        12       32
E             4        2         7       13
Total                                   126

Table 9

Distribution of error types based on the target gold standard.

                                    TARGET
CODE  TYPE                          ENGLISH  HINDI  UNIVERSAL  UNDEFINED  OVERALL
A     Tokenisation/Orthography            1     18          2          3       24
B     Named entity                        0      0         28          0       28
C     Token in both word lists            1     28          0          0       29
D     Token in neither word list          7     21          4          0       32
E     Token in incorrect word list        2      9          2          0       13
Total                                    11     76         35          3      126

DOI: https://doi.org/10.5334/johd.44 | Journal eISSN: 2059-481X
Language: English
Published on: Jun 25, 2021
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2021 Li Nguyen, Christopher Bryant, Sana Kidwai, Theresa Biberauer, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.