
Evaluation of Word Embedding Models in Latvian NLP Tasks Based on Publicly Available Corpora

Open Access | Dec 2021

Abstract

Nowadays, natural language processing (NLP) increasingly relies on pre-trained word embeddings for use in various tasks. However, little research has been devoted to Latvian – a language that is much more morphologically complex than English. In this study, several experiments were carried out on three NLP tasks using four different methods of creating word embeddings: word2vec, fastText, Structured Skip-Gram and ngram2vec. The obtained results can serve as a baseline for future NLP research on the Latvian language. The main conclusions are the following: First, in the part-of-speech tagging task, using a training corpus 46 times smaller than in a previous study, the accuracy was 91.4 % (versus 98.3 % in the previous study). Second, fastText demonstrated the best overall effectiveness. Third, the best results for all methods were observed for embeddings with a dimension size of 200. Finally, word lemmatization generally did not improve results.
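For illustration, the following is a minimal sketch (not the authors' actual pipeline) of how 200-dimensional skip-gram word2vec and fastText embeddings could be trained on a tokenised Latvian corpus with the gensim library; the corpus file name and hyperparameters other than the dimension size are hypothetical placeholders.

```python
# Minimal sketch: training 200-dimensional skip-gram embeddings with gensim.
# "latvian_corpus.txt" is a hypothetical file with one pre-tokenised sentence per line.
from gensim.models import Word2Vec, FastText
from gensim.models.word2vec import LineSentence

corpus = LineSentence("latvian_corpus.txt")

# Skip-gram word2vec with the dimension size the study found to work best (200).
w2v = Word2Vec(corpus, vector_size=200, sg=1, window=5, min_count=5, workers=4)

# fastText additionally uses character n-grams, which can help with
# morphologically rich languages such as Latvian.
ft = FastText(corpus, vector_size=200, sg=1, window=5, min_count=5, workers=4)

w2v.wv.save("word2vec_lv_200.kv")
ft.wv.save("fasttext_lv_200.kv")

# fastText can also produce vectors for out-of-vocabulary inflected forms:
print(ft.wv.most_similar("valoda", topn=5))  # "valoda" = "language" in Latvian
```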

DOI: https://doi.org/10.2478/acss-2021-0016 | Journal eISSN: 2255-8691 | Journal ISSN: 2255-8683
Language: English
Page range: 132 - 138
Published on: Dec 30, 2021
Published by: Riga Technical University
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2021 Rolands Laucis, Gints Jēkabsons, published by Riga Technical University
This work is licensed under the Creative Commons Attribution 4.0 License.