Abstract
This study investigates the adaptation of the Mistral 7B large language model to Slovak, addressing the limited availability of high-quality open-source models for low-resource languages. While commercial models such as GPT-4 and Claude exhibit strong Slovak proficiency, their proprietary nature restricts transparency and customization. To address this, we fine-tuned the open-weight Mistral 7B model on the Araneum Slovacum VII Maximum corpus (5.3 billion tokens), creating a specialized Slovak variant, Mistral-SK-7b. The training, conducted on the Leonardo supercomputer (10,000 GPU hours), yielded significant improvements: the fine-tuned model generates grammatically correct and contextually coherent Slovak text, eliminating the code-switching, repetition loops, and lexical interference from other languages observed in the original Mistral-7B-v0.1. The resulting model, released under the Apache 2.0 license, provides a publicly accessible resource for Slovak NLP applications while preserving the base model’s multilingual capabilities. Our work demonstrates the feasibility of adapting state-of-the-art LLMs to linguistically underrepresented languages and underscores the role of open models in digital language preservation.