
Training of Large Language Model Mistral on Slovak Language Data

Open Access | Jan 2026

Abstract

This study investigates the adaptation of the Mistral 7B large language model for the Slovak language, addressing the limited availability of high-quality open-source models for low-resource languages. While commercial models like GPT-4 and Claude exhibit strong Slovak proficiency, their proprietary nature restricts transparency and customization. To overcome this, we fine-tuned the open-weight Mistral 7B model using the Araneum Slovacum VII Maximum corpus (5.3 billion tokens), creating a specialized Slovak variant, Mistral-SK-7b. The training, conducted on the Leonardo supercomputer (10,000 GPU hours), yielded significant improvements: the fine-tuned model generates grammatically correct and contextually coherent Slovak text, eliminating the errors (code-switching, repetition loops, and lexical interference from other languages) observed in the original Mistral-7B-v0.1. The resulting model, released under the Apache 2.0 license, provides a publicly accessible resource for Slovak NLP applications while preserving the base model’s multilingual capabilities. Our work demonstrates the feasibility of adapting state-of-the-art LLMs for linguistically underrepresented languages and underscores the role of open models in promoting digital language preservation.
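As a quick illustration of how the released Apache 2.0 model can be used, the sketch below loads it with the Hugging Face Transformers library and samples Slovak text from it. The repository id "slovak-nlp/mistral-sk-7b" is an assumption for illustration; the paper's release page gives the authoritative identifier. Everything else is standard Transformers causal-LM usage.

```python
# Minimal sketch: text generation with the fine-tuned Slovak Mistral model.
# NOTE: the repo id below is assumed, not confirmed by the paper's text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "slovak-nlp/mistral-sk-7b"  # hypothetical Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 7B weights fit on one modern GPU in bf16
    device_map="auto",
)

# Slovak prompt: "Slovak is a West Slavic language, which ..."
prompt = "Slovenčina je západoslovanský jazyk, ktorým"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The model is a plain causal LM (continued pretraining, not chat-tuned),
# so it is prompted as a text completer rather than with a chat template.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the abstract notes that the base model's multilingual capabilities are preserved, the same snippet should also complete English or Czech prompts, with the fine-tuning mainly improving grammatical correctness and coherence in Slovak.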

DOI: https://doi.org/10.2478/jazcas-2025-0037 | Journal eISSN: 1338-4287 | Journal ISSN: 0021-5597
Language: English
Page range: 433 - 451
Published on: Jan 18, 2026
Publication frequency: 2 issues per year

© 2026 Peter Bednár, Marek Dobeš, Radovan Garabík, published by Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.