
Training of Large Language Model Mistral on Slovak Language Data

Open Access | Jan 2026

Abstract

This study investigates the adaptation of the Mistral 7B large language model for the Slovak language, addressing the limited availability of high-quality open-source models for low-resource languages. While commercial models like GPT-4 and Claude exhibit strong Slovak proficiency, their proprietary nature restricts transparency and customization. To overcome this, we fine-tuned the open-weight Mistral 7B model using the Araneum Slovacum VII Maximum corpus (5.3 billion tokens), creating a specialized Slovak variant, Mistral-SK-7b. The training, conducted on the Leonardo supercomputer (10,000 GPU hours), yielded significant improvements: the fine-tuned model generates grammatically correct and contextually coherent Slovak text, eliminating the errors (code-switching, repetition loops, and lexical interference from other languages) observed in the original Mistral-7B-v0.1. The resulting model, released under the Apache 2.0 license, provides a publicly accessible resource for Slovak NLP applications while preserving the base model’s multilingual capabilities. Our work demonstrates the feasibility of adapting state-of-the-art LLMs for linguistically underrepresented languages and underscores the role of open models in promoting digital language preservation.
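As a quick illustration of how the released Apache 2.0 model can be used, the sketch below loads it with the Hugging Face Transformers library and samples Slovak text from it. The repository id "slovak-nlp/mistral-sk-7b" is an assumption for illustration; the paper's release page gives the authoritative identifier. Everything else is standard Transformers causal-LM usage.

```python
# Minimal sketch: text generation with the fine-tuned Slovak Mistral model.
# NOTE: the repo id below is assumed, not confirmed by the paper's text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "slovak-nlp/mistral-sk-7b"  # hypothetical Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 7B weights fit on one modern GPU in bf16
    device_map="auto",
)

# Slovak prompt: "Slovak is a West Slavic language, which ..."
prompt = "Slovenčina je západoslovanský jazyk, ktorým"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The model is a plain causal LM (continued pretraining, not chat-tuned),
# so it is prompted as a text completer rather than with a chat template.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the abstract notes that the base model's multilingual capabilities are preserved, the same snippet should also complete English or Czech prompts, with the fine-tuning mainly improving grammatical correctness and coherence in Slovak.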

DOI: https://doi.org/10.2478/jazcas-2025-0037 | Journal eISSN: 1338-4287 | Journal ISSN: 0021-5597
Language: English
Page range: 433 - 451
Published on: Jan 18, 2026
Publication frequency: 2 issues per year

© 2026 Peter Bednár, Marek Dobeš, Radovan Garabík, published by Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.