Linguistic Annotation of Translated Chinese Texts: Coordinating Theory, Algorithms and Data

Kirill I. Semenov; Armine K. Titizian; Aleksandra O. Piskunova; Yulia O. Korotkova; Alena D. Tsvetkova; Elena A. Volf; Alexandra S. Konovalova; Yulia N. Kuznetsova

doi:10.2478/jazcas-2021-0054


Journal of Linguistics/Jazykovedný časopis

Volume 72 (2021): Issue 2 (December 2021)


Open Access


Abstract

The article addresses problems of linguistic annotation of the Chinese texts in Ruzhcorp, the Russian-Chinese Parallel Corpus of the RNC, and ways to solve them. Particular attention is paid to the processing of Russian loanwords. On the one hand, we present a theoretical comparison of widespread standards of Chinese text processing. On the other hand, we describe our experiments in three fields (word segmentation, grapheme-to-phoneme conversion, and PoS-tagging) on corpus data that contains many transliterations and loanwords. As a result, we propose a preprocessing pipeline for the Chinese texts, which will be implemented in Ruzhcorp.
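To make the three annotation layers concrete, the following is a minimal toy sketch in Python of how segmentation, transcription, and PoS-tagging might be chained for a sentence containing a transliterated loanword. It is not the pipeline proposed in the article: the lexicon, its pinyin readings, the tags, and the function names `segment` and `annotate` are all hypothetical, and dictionary-based forward maximum matching stands in for the segmenters actually evaluated by the authors.

```python
# Toy illustration only: the lexicon, readings, and tags below are hypothetical
# and are not taken from the article or from Ruzhcorp.
LEXICON = {
    "莫斯科": ("mò sī kē", "PROPN"),  # "Moscow", a transliterated loanword
    "我": ("wǒ", "PRON"),
    "去": ("qù", "VERB"),
    "了": ("le", "PART"),
}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text):
    """Forward maximum matching: take the longest lexicon entry at each position.

    A single character is emitted as a fallback when nothing longer matches.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in LEXICON:
                words.append(candidate)
                i += size
                break
    return words


def annotate(text):
    """Return (word, pinyin, pos) triples; unknown words get placeholder values."""
    return [(word, *LEXICON.get(word, ("?", "X"))) for word in segment(text)]


if __name__ == "__main__":
    for row in annotate("我去了莫斯科"):
        print(row)
```

The sketch shows why loanwords are a sensitive case: if the transliteration 莫斯科 were missing from the lexicon, the segmenter would fall back to three single characters, and both the transcription and the tag would then be assigned per character rather than per word.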



DOI: https://doi.org/10.2478/jazcas-2021-0054 | Journal eISSN: 1338-4287 | Journal ISSN: 0021-5597


Language: English

Page range: 590 - 602

Published on: Dec 30, 2021

Published by: Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics

In partnership with: Paradigm Publishing Services

Publication frequency: 3 issues per year

Keywords: Mandarin, Russian, parallel corpus, Chinese word segmentation (CWS), grapheme-to-phoneme conversion (G2P), PoS-tagging, code-switching detection

Related subjects: Linguistics and semiotics; Theoretical frameworks and disciplines; Linguistics, other

© 2021 Kirill I. Semenov, Armine K. Titizian, Aleksandra O. Piskunova, Yulia O. Korotkova, Alena D. Tsvetkova, Elena A. Volf, Alexandra S. Konovalova, Yulia N. Kuznetsova, published by Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
