Training of the Large Language Model Mistral on Slovak Language Data
Abstract
This study investigates the adaptation of the Mistral 7B large language model to the Slovak language, addressing the limited availability of high-quality open-source models for low-resource languages. While commercial models such as GPT-4 and Claude exhibit strong Slovak proficiency, their proprietary nature restricts transparency and customization. To overcome this, we fine-tuned the open-weight Mistral 7B model on the Araneum Slovacum VII Maximum corpus (5.3 billion tokens), creating a specialized Slovak variant, Mistral-SK-7b. The training, conducted on the Leonardo supercomputer (10,000 GPU hours), yielded significant improvements: the fine-tuned model generates grammatically correct and contextually coherent Slovak text, eliminating the errors observed in the original Mistral-7B-v0.1 (code-switching, repetition loops, and lexical interference from other languages). The resulting model, released under the Apache 2.0 license, provides a publicly accessible resource for Slovak NLP applications while preserving the base model’s multilingual capabilities. Our work demonstrates the feasibility of adapting state-of-the-art LLMs to linguistically underrepresented languages and underscores the role of open models in promoting digital language preservation.
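To illustrate how the released model could be used, the following is a minimal sketch of loading it for Slovak text generation with the Hugging Face transformers library. The repository id slovak-nlp/mistral-sk-7b, the example prompt, and the generation settings are illustrative assumptions, not details given in the abstract.

# Minimal sketch: load the fine-tuned Slovak model and generate text.
# The Hub repository id below is an assumed placeholder for Mistral-SK-7b.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "slovak-nlp/mistral-sk-7b"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
model.to("cuda" if torch.cuda.is_available() else "cpu")

# A Slovak prompt ("The Slovak language belongs among..."); the model
# continues it, sampling with nucleus (top-p) decoding.
prompt = "Slovenský jazyk patrí medzi"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))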
© 2026 Peter Bednár, Marek Dobeš, Radovan Garabík, published by Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.