Application of NLP Technologies to Low-Resource Croatian Dialects

Maja Polanec; Marina Bagić Babac

doi:10.2478/crdj-2025-0008

Croatian Regional Development Journal

Volume 6 (2025): Issue 2 (March 2025)

Open Access

Abstract

Natural language processing (NLP) systems tend to perform worse on texts written in low-resource dialects than on texts in the standard language. Dependency parsing is an essential component of NLP systems, so improving it could enhance overall system performance. This paper compares the performance of Slovenian and Croatian parsers on dependency parsing of the Kajkavian dialect, with the aim of assessing the Slovenian parser's potential for parsing Kajkavian. A dependency parsing dataset was created from parallel translations of the book „Mali kraljević“. Using this dataset, labels were projected from the parsed standard Croatian text onto the Kajkavian text to obtain the data needed to calculate the unlabeled and labeled attachment scores (UAS and LAS) for the two parsers, both of which were implemented with the open-source spaCy library. The Croatian parser achieved a UAS of 0.47 and an LAS of 0.30, lower than the Slovenian parser's 0.52 and 0.34. The results indicate that the Slovenian parser is more accurate on the Kajkavian dialect; drawing a general conclusion, however, would require a larger dataset.
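For readers unfamiliar with the two metrics, the following is a minimal Python sketch of how UAS and LAS are conventionally computed; it is not the authors' evaluation code. UAS is the share of tokens whose predicted head matches the reference head, and LAS additionally requires the dependency label to match. The function name `attachment_scores`, the `(head, label)` token representation, and the toy sentence are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Each token is represented as (head_index, dependency_label).
# A head index of 0 denotes the root. This representation is an assumption
# made for this sketch, not a format taken from the paper.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], predicted: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) for two token-aligned sequences of (head, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # UAS counts tokens whose head is correct, regardless of the label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS counts tokens whose head and label are both correct.
    label_hits = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return head_hits / n, label_hits / n


# Hypothetical four-token sentence: reference annotation vs. parser output.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "amod")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # prints "UAS=0.75 LAS=0.50"
```

In this toy case three of four heads are correct (UAS 0.75), but only two tokens have both the correct head and the correct label (LAS 0.50), which is why LAS can never exceed UAS.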


DOI: https://doi.org/10.2478/crdj-2025-0008 | Journal eISSN: 2718-4978

Language: English

Page range: 13–23

Submitted on: Jun 14, 2024

Accepted on: Jan 4, 2025

Published on: Apr 26, 2026

Published by: Međimurje University of Applied Sciences in Čakovec

In partnership with: Paradigm Publishing Services

Publication frequency: 2 issues per year

Keywords: Natural Language Processing (NLP), low-resource dialect, Croatian language, dependency parser

Related subjects: Computer sciences, Programming and languages, Business and economics, Business management, Management, organization, corporate governance, Engineering, Mechanical engineering, Fundamentals of mechanical engineering, Life sciences, Ecology

© 2026 Maja Polanec, Marina Bagić Babac, published by Međimurje University of Applied Sciences in Čakovec
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.