Slovak Language Models for Basic Preprocessing Tasks in Python

Open Access | Dec 2023

DOI: https://doi.org/10.2478/jazcas-2023-0049 | Journal eISSN: 1338-4287 | Journal ISSN: 0021-5597
Language: English
Page range: 323–332
Published on: Dec 25, 2023
Published by: Slovak Academy of Sciences, Mathematical Institute
In partnership with: Paradigm Publishing Services
Publication frequency: 2 issues per year

© 2023 Daniel Hládek, Maroš Harahus, Ján Staš, Matúš Pleva, published by Slovak Academy of Sciences, Mathematical Institute
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.