Training of Large Language Model Mistral on Slovak Language Data

Open Access | Jan 2026

References

  1. Anthropic (2023): Claude. Available at: https://www.anthropic.com/. [accessed 2025-07-24]
2. BENKO, Vladimír (2024): The Aranea Corpora Family: Ten+ Years of Processing Web-Crawled Data. In: E. Nöth – A. Horák – P. Sojka (Eds.): Text, Speech, and Dialogue. TSD 2024. Lecture Notes in Artificial Intelligence, Vol. 15048. Heidelberg: Springer, pp. 55–70. DOI: https://doi.org/10.1007/978-3-031-70563-2_5.
  3. BENDER, Emily M. – GEBRU, Timnit – MCMILLAN-MAJOR, Angelina – SHMITCHELL, Shmargaret (2021): On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, New York: Association for Computing Machinery, pp. 610–623.
  4. BLASI, Damian – ANASTASOPOULOS, Antonios – NEUBIG, Graham (2022): Systematic Inequalities in Language Technology Performance across the World’s Languages. In: S. Muresan – P. Nakov – A. Villavicencio (Eds.): Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin: Association for Computational Linguistics, pp. 5486–5505.
  5. BOMMASANI, Rishi – HUDSON, Drew A. – CARD, Dallas – DURMUS, Esin – SRINIVASAN, Krishnan et al. (2021): On the Opportunities and Risks of Foundation Models. In: arXiv preprint arXiv:2108.07258.
6. BROWN, Tom B. – MANN, Benjamin – RYDER, Nick – SUBBIAH, Melanie – KAPLAN, Jared D. – DHARIWAL, Prafulla – ... – AMODEI, Dario (2020): Language Models Are Few-Shot Learners. In: Advances in Neural Information Processing Systems, Vol. 33, pp. 1877–1901.
  7. CHEN, Tianqi – XU, Bing – ZHANG, Chiyuan – GUESTRIN, Carlos (2016): Training Deep Nets with Sublinear Memory Cost. In: arXiv preprint arXiv:1604.06174.
8. COSTA-JUSSÀ, Marta R. – CROSS, James – ÇELEBI, Onur – ELBAYAD, Maha – HEAFIELD, Kenneth – HEFFERNAN, Kevin – ... – NLLB Team (2022): No Language Left Behind: Scaling Human-Centered Machine Translation. In: arXiv preprint arXiv:2207.04672.
9. DARĢIS, Roberts – BĀRZDINS, Guntis – SKADIŅA, Inguna – SAULĪTE, Baiba – GRŪZĪTIS, Normunds (2024): Evaluating Open-Source LLMs in Low-Resource Languages: Insights from Latvian High School Exams. In: M. Hämäläinen – E. Öhman – S. Miyagawa – K. Alnajjar – Y. Bizzoni (Eds.): Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, Miami: Association for Computational Linguistics, pp. 289–293.
  10. DEAN, Jeffrey – CORRADO, Greg – MONGA, Rajat – CHEN, Kai – DEVIN, Matthieu – MAO, Mark, et al. (2012): Large Scale Distributed Deep Networks. In: Advances in Neural Information Processing Systems, Vol. 25.
11. DEVLIN, Jacob – CHANG, Ming-Wei – LEE, Kenton – TOUTANOVA, Kristina (2019): BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long and Short Papers), pp. 4171–4186.
  12. GARABÍK, Radovan (2025): Webový korpus slovenčiny ARANEUM + HPLT + FineWeb2. In: Kultúra slova, Vol. 59, No. 5, pp. 292–296.
  13. Google (2024): Gemini. Available at: https://gemini.google.com/. [accessed 2025-07-24]
  14. Google (2025): Gemma 3. Available at: https://huggingface.co/google/gemma-3-4b-it/. [accessed 2025-07-24]
  15. HPC Cineca (2023): Leonardo HPC System. Available at: https://leonardo-supercomputer.cineca.eu/. [accessed 2025-07-24]
  16. HOWARD, Jeremy – RUDER, Sebastian (2018): Universal Language Model Fine-tuning for Text Classification. In: arXiv preprint arXiv:1801.06146.
17. LIN, Chin-Yew – OCH, Franz Josef (2004): Looking for a Few Good Metrics: ROUGE and Its Evaluation. In: NTCIR Workshop, pp. 1–8.
  18. Meta (2025): Llama 3. Available at: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/. [accessed 2025-07-24]
  19. Mistral AI (2023): Mistral 7B v0.1. Available at: https://huggingface.co/mistralai/Mistral-7B-v0.1/. [accessed 2025-07-24]
20. OpenAI (2023): GPT-4. Available at: https://platform.openai.com/docs/models/overview. [accessed 2025-07-24]
21. PAPINENI, Kishore – ROUKOS, Salim – WARD, Todd – ZHU, Wei-Jing (2002): BLEU: A Method for Automatic Evaluation of Machine Translation. In: P. Isabelle – E. Charniak – D. Lin (Eds.): Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania: Association for Computational Linguistics, pp. 311–318.
  22. RADFORD, Alec – WU, Jeffrey – CHILD, Rewon – LUAN, David – AMODEI, Dario – SUTSKEVER, Ilya (2019): Language Models are Unsupervised Multitask Learners. In: OpenAI Blog. Available at: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. [accessed 2025-07-24]
23. RAJPURKAR, Pranav – ZHANG, Jian – LOPYREV, Konstantin – LIANG, Percy (2016): SQuAD: 100,000+ Questions for Machine Comprehension of Text. In: arXiv preprint arXiv:1606.05250.
  24. RUDER, Sebastian – PETERS, Matthew E. – SWAYAMDIPTA, Swabha – WOLF, Thomas (2021): Transfer Learning in Natural Language Processing. In: K. Toutanova – A. Rumshisky – L. Zettlemoyer et al. (Eds.): Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pp. 765–780.
  25. TOUVRON, Hugo – LAVRIL, Thibaut – IZACARD, Gautier – MARTINET, Xavier – LACHAUX, Marie-Anne – LACROIX, Timothée et al. (2023): LLaMA: Open and Efficient Foundation Language Models. In: arXiv preprint arXiv:2302.13971.
  26. VASWANI, Ashish – SHAZEER, Noam – PARMAR, Niki – USZKOREIT, Jakob – JONES, Llion – GOMEZ, Aidan N. – KAISER, Lukasz – POLOSUKHIN, Illia (2017): Attention is All you Need. In: Advances in Neural Information Processing Systems, Vol. 30, pp. 5998–6008.
  27. Webový korpus slovenčiny ARANEUM + HPLT + FineWeb (2025). Available at: https://www.juls.savba.sk/webskahfcorp.html. [accessed 2025-07-24]
DOI: https://doi.org/10.2478/jazcas-2025-0037 | Journal eISSN: 1338-4287 | Journal ISSN: 0021-5597
Language: English
Page range: 433 - 451
Published on: Jan 18, 2026
In partnership with: Paradigm Publishing Services
Publication frequency: 2 issues per year

© 2026 Peter Bednár, Marek Dobeš, Radovan Garabík, published by Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.