Abstract

The article addresses the problems of linguistic annotation of the Chinese texts in Ruzhcorp, the Russian-Chinese Parallel Corpus of the RNC, and ways to solve them. Particular attention is paid to the processing of Russian loanwords. On the one hand, we present a theoretical comparison of widespread standards of Chinese text processing. On the other hand, we describe our experiments in three areas (word segmentation, grapheme-to-phoneme conversion, and PoS tagging) on corpus data that contains many transliterations and loanwords. As a result, we propose a preprocessing pipeline for the Chinese texts that will be implemented in Ruzhcorp.

DOI: https://doi.org/10.2478/jazcas-2021-0054 | Journal eISSN: 1338-4287 | Journal ISSN: 0021-5597
Language: English
Page range: 590–602
Published on: Dec 30, 2021
Published by: Slovak Academy of Sciences, Mathematical Institute
In partnership with: Paradigm Publishing Services
Publication frequency: 2 issues per year

© 2021 Kirill I. Semenov, Armine K. Titizian, Aleksandra O. Piskunova, Yulia O. Korotkova, Alena D. Tsvetkova, Elena A. Volf, Alexandra S. Konovalova, Yulia N. Kuznetsova, published by Slovak Academy of Sciences, Mathematical Institute
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.