Automatic Language Identification in Code-Switched Hindi-English Social Media Text

Li Nguyen; Christopher Bryant; Sana Kidwai; Theresa Biberauer

doi:10.5334/johd.44

References

1. Aguilar, G., & Solorio, T. (2020, July). From English to code-switching: Transfer learning with strong morphological clues. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8033–8044). Online: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/2020.acl-main.716. DOI: 10.18653/v1/2020.acl-main.716
2. Ahn, E., Jimenez, C., Tsvetkov, Y., & Black, A. W. (2020, January). What code-switching strategies are effective in dialogue systems? In Proceedings of the Society for Computation in Linguistics 2020 (pp. 254–264). New York, USA: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/2020.scil-1.32
3. Anastasopoulos, A., & Neubig, G. (2020, July). Should all cross-lingual embeddings speak English? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8658–8679). Online: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/2020.acl-main.766. DOI: 10.18653/v1/2020.acl-main.766
4. Attia, M., Samih, Y., Elkahky, A., Mubarak, H., Abdelali, A., & Darwish, K. (2019, August). POS tagging for improving code-switching identification in Arabic. In Proceedings of the Fourth Arabic Natural Language Processing Workshop (pp. 18–29). Florence, Italy: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W19-4603. DOI: 10.18653/v1/W19-4603
5. Bali, K., Sharma, J., Choudhury, M., & Vyas, Y. (2014, October). “I am borrowing ya mixing?” An analysis of English-Hindi code mixing in Facebook. In Proceedings of the First Workshop on Computational Approaches to Code Switching (pp. 116–126). Doha, Qatar: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W14-3914. DOI: 10.3115/v1/W14-3914
6. Barman, U., Das, A., Wagner, J., & Foster, J. (2014, October). Code mixing: A challenge for language identification in the language of social media. In Proceedings of the First Workshop on Computational Approaches to Code Switching (pp. 13–23). Doha, Qatar: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W14-3902. DOI: 10.3115/v1/W14-3902
7. Bullock, B., Guzmán, W., Serigos, J., Sharath, V., & Toribio, A. J. (2018, July). Predicting the presence of a Matrix Language in code-switching. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching (pp. 68–75). Melbourne, Australia: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W18-3208. DOI: 10.18653/v1/W18-3208
8. Çetinoğlu, Ö., Schulz, S., & Vu, N. T. (2016, November). Challenges of computational processing of code-switching. In Proceedings of the Second Workshop on Computational Approaches to Code Switching (pp. 1–11). Austin, USA: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W16-5801. DOI: 10.18653/v1/W16-5801
9. Chan, J. Y. C., Ching, P. C., & Lee, T. (2005). Development of a Cantonese-English code-mixing speech corpus. In Proceedings of the Ninth European Conference on Speech Communication and Technology – Interspeech’05 (pp. 1533–1536). Lisbon, Portugal. Retrieved from https://www.isca-speech.org/archive/archive_papers/interspeech_2005/i05_1533.pdf
10. Choudhury, M., Chittaranjan, G., Gupta, P., & Das, A. (2014). Overview of FIRE 2014 Track on Transliterated Search (Tech. Rep.). Retrieved from https://www.isical.ac.in/~fire/working-notes/2014/MSR/2014-trainslit_search-track_over.pdf
11. Dey, A., & Fung, P. (2014, May). A Hindi-English code-switching corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland: European Language Resources Association (ELRA). Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/922_Paper.pdf
12. Elfardy, H., Al-Badrashiny, M., & Diab, M. (2013). Codeswitch point detection in Arabic. In E. Métais, F. Meziane, M. Saraee, V. Sugumaran & S. Vadera (Eds.), Natural Language Processing and Information Systems (pp. 412–416). Berlin, Germany: Springer. DOI: 10.1007/978-3-642-38824-8_51
13. Eskander, R., Al-Badrashiny, M., Habash, N., & Rambow, O. (2014, October). Foreign words and the automatic processing of Arabic social media text written in Roman script. In Proceedings of the First Workshop on Computational Approaches to Code Switching (pp. 1–12). Doha, Qatar: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W14-3901. DOI: 10.3115/v1/W14-3901
14. Garrido-Muñoz, I., Montejo-Ráez, A., Martínez-Santiago, F., & Ureña-López, L. A. (2021). A survey on bias in deep NLP. Applied Sciences, 11(7), 3184. Retrieved from https://www.mdpi.com/2076-3417/11/7/3184. DOI: 10.3390/app11073184
15. Grosjean, F., & Li, P. (2013). The psycholinguistics of bilingualism. Chichester, UK: Wiley-Blackwell.
16. Gupta, K., Choudhury, M., & Bali, K. (2012, May). Mining Hindi-English transliteration pairs from online Hindi lyrics. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12) (pp. 2459–2465). Istanbul, Turkey: European Language Resources Association (ELRA). Retrieved from http://www.lrec-conf.org/proceedings/lrec2012/pdf/365_Paper.pdf
17. Jamatia, A., Das, A., & Gambäck, B. (2019). Deep learning-based language identification in English-Hindi-Bengali code-mixed social media corpora. Journal of Intelligent Systems, 28(3), 399–408. DOI: 10.1515/jisys-2017-0440
18. Jamatia, A., Gambäck, B., & Das, A. (2015, September). Part-of-speech tagging for code-mixed English-Hindi Twitter and Facebook chat messages. In Proceedings of the International Conference Recent Advances in Natural Language Processing (pp. 239–248). Hissar, Bulgaria: INCOMA Ltd. Retrieved from https://www.aclweb.org/anthology/R15-1033
19. Kaur, J., & Singh, J. (2015). Toward normalizing Romanized Gurumukhi text from social media. Indian Journal of Science and Technology, 8(27), 1–6. DOI: 10.17485/ijst/2015/v8i27/81666
20. Lyu, D-C., Tien-Ping, T., Eng, C., & Haizhou, L. (2015). Mandarin–English codeswitching speech corpus in South-East Asia: SEAME. Language Resources and Evaluation, 49, 1986–1989. DOI: 10.1007/s10579-015-9303-x
21. Mager, M., Çetinoğlu, Ö., & Kann, K. (2019, June). Subword-level language identification for intraword code-switching. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 2005–2011). Minneapolis, USA: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/N19-1201. DOI: 10.18653/v1/N19-1201
22. Mave, D., Maharjan, S., & Solorio, T. (2018, July). Language identification and analysis of code-switched social media text. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching (pp. 51–61). Melbourne, Australia: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W18-3206. DOI: 10.18653/v1/W18-3206
23. Mhaiskar, R. (2015). Romanagari an alternative for modern media writings. Bulletin of the Deccan College Research Institute, 75, 195–202. Retrieved from http://www.jstor.org/stable/26264736
24. Molina, G., AlGhamdi, F., Ghoneim, M., Hawwari, A., Rey-Villamizar, N., Diab, M., & Solorio, T. (2016, November). Overview for the second shared task on language identification in code-switched data. In Proceedings of the Second Workshop on Computational Approaches to Code Switching (pp. 40–49). Austin, USA: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W16-5805. DOI: 10.18653/v1/W16-5805
25. Nguyen, D., & Doğruöz, A. S. (2013, October). Word level language identification in online multilingual communication. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 857–862). Seattle, USA: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/D13-1084
26. Nguyen, L., & Bryant, C. (2020, May). CanVEC – the Canberra Vietnamese-English code-switching natural speech corpus. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC’20) (pp. 4121–4129). Marseille, France: European Language Resources Association. Retrieved from https://www.aclweb.org/anthology/2020.lrec-1.507
27. Roy, R. S., Choudhury, M., Majumder, P., & Agarwal, K. (2013, December). Overview of the FIRE 2013 track on transliterated search. In FIRE’12 & ’13: Post-Proceedings of the Fourth and Fifth Workshops of the Forum for Information Retrieval Evaluation (pp. 1–7). New York, USA: Association for Computing Machinery. DOI: 10.1145/2701336.2701636
28. Sasaki, Y. (2007). The truth of the F-measure (Tech. Rep.). Manchester, UK: University of Manchester. Retrieved from https://www.cs.odu.edu/~mukka/cs795sum09dm/Lecturenotes/Day3/F-measure-YS-26Oct07.pdf
29. Shen, H. P., Wu, C. H., Yang, Y. T., & Hsu, C. S. (2011, October). CECOS: A Chinese-English code-switching speech database. In 2011 International Conference on Speech Database and Assessments, Oriental COCOSDA 2011 – Proceedings (pp. 120–123). Hsinchu City, Taiwan. DOI: 10.1109/ICSDA.2011.6085992
30. Si, A. (2011). A diachronic investigation of Hindi–English code-switching, using Bollywood film scripts. International Journal of Bilingualism, 15(4), 388–407. DOI: 10.1177/1367006910379300
31. Solorio, T., & Liu, Y. (2008, October). Part-of-speech tagging for English-Spanish code-switched text. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (pp. 1051–1060). Honolulu, Hawaii: Association for Computational Linguistics. Retrieved from http://aclweb.org/anthology/D08-1110. DOI: 10.3115/1613715.1613852
32. Soto, V., & Hirschberg, J. (2018, July). Joint part-of-speech and language ID tagging for code-switched data. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching (pp. 1–10). Melbourne, Australia: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W18-3201. DOI: 10.18653/v1/W18-3201
33. Sowmya, V. B., Choudhury, M., Bali, K., Dasgupta, T., & Basu, A. (2010, May). Resource creation for training and testing of transliteration systems for Indian languages. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10). Valletta, Malta: European Language Resources Association (ELRA). Retrieved from http://www.lrec-conf.org/proceedings/lrec2010/pdf/182_Paper.pdf
34. Virga, P., & Khudanpur, S. (2003). Transliteration of proper names in cross-lingual information retrieval. In Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-Language Named Entity Recognition – Volume 15 (pp. 57–64). Sapporo, Japan: Association for Computational Linguistics. DOI: 10.3115/1119384.1119392
35. Voss, C., Tratz, S., Laoudi, J., & Briesch, D. (2014, May). Finding Romanized Arabic dialect in code-mixed Tweets. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14) (pp. 2249–2253). Reykjavik, Iceland: European Language Resources Association (ELRA). Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/1116_Paper.pdf
36. Xia, M. X. (2016, November). Codeswitching language identification using subword information enriched word vectors. In Proceedings of the Second Workshop on Computational Approaches to Code Switching (pp. 132–136). Austin, USA: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W16-5818. DOI: 10.18653/v1/W16-5818
