Identifying Errors in Russian Web Corpora

Maria Khokhlova

doi:10.2478/jazcas-2022-0021

Journal of Linguistics/Jazykovedný časopis

Volume 72 (2021): Issue 4 (December 2021)

By: Maria Khokhlova

Open Access

Aug 2021

Abstract

The rapid growth of the Web produces large amounts of text and inevitably affects its quality. Frequent errors can distort results, especially when the texts are used for research or in language teaching and learning. Hence, existing corpora built from web texts need to be examined and cleaned of such “noisy” fragments. In this study, we address the problem of errors by analyzing the Araneum Russicum Maximum corpus. The most prominent errors are encoding errors, incorrect font types, and segments written in other languages. These phenomena lead to incorrect morphological analysis and lemmatization, distorted frequencies, and lexical units that cannot be found and therefore cannot be displayed to corpus users. The paper describes the error types and outlines possible ways to eliminate them.
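
As an illustration of the error types named in the abstract, the following minimal Python sketch flags three simple symptoms: tokens that mix Cyrillic and Latin letters (a common result of look-alike character substitution), text containing the Unicode replacement character (one possible trace of an encoding failure), and segments with a low share of Cyrillic letters (a rough signal of foreign-language material). This is not the procedure used in the paper; the function names and heuristics are assumptions chosen for illustration only.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Return 'cyrillic', 'latin', or 'other' for a single character."""
    if not ch.isalpha():
        return "other"
    # Unicode character names start with the script name for these blocks.
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "cyrillic"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def is_mixed_script(token: str) -> bool:
    """True if the token contains both Cyrillic and Latin letters."""
    scripts = {script_of(ch) for ch in token}
    return {"cyrillic", "latin"} <= scripts


def has_encoding_debris(text: str) -> bool:
    """True if the text contains U+FFFD, the Unicode replacement character.

    This catches only one symptom of encoding errors (hypothetical heuristic).
    """
    return "\ufffd" in text


def cyrillic_ratio(text: str) -> float:
    """Share of Cyrillic letters among all letters; 0.0 if there are none."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(script_of(ch) == "cyrillic" for ch in letters)
    return cyrillic / len(letters)
```

A token such as "cлово" typed with a Latin "c" would be flagged by is_mixed_script, while its correctly spelled all-Cyrillic form would not. Any threshold applied to cyrillic_ratio would need to be tuned on real corpus data.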


DOI: https://doi.org/10.2478/jazcas-2022-0021 | Journal eISSN: 1338-4287 | Journal ISSN: 0021-5597

Language: English

Page range: 977 - 985

Published on: Aug 17, 2021

Published by: Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics

In partnership with: Paradigm Publishing Services

Publication frequency: 3 issues per year

Keywords: corpora, web texts, errors, typos, orthography, typography, Russian language

Related subjects: Linguistics and semiotics, Theoretical frameworks and disciplines, Linguistics, other

© 2021 Maria Khokhlova, published by Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
