Tailored Fine-tuning for Comma Insertion in Czech

Jakub Machura; Hana Žižková; Patrik Stano; Tereza Vrabcová; Dana Hlaváčková; Ondřej Trnovec

doi:10.2478/jazcas-2025-0024

Journal of Linguistics/Jazykovedný časopis

Volume 76 (2025): Issue 1 (June 2025)

Open Access

Nov 2025

Conneau, A. et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale. In: D. Jurafsky et al. (eds.): Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 8440–8451.
Kovář, V. et al. (2016). Evaluation and improvements in punctuation detection for Czech. In: P. Sojka et al. (eds.): Text, Speech, and Dialogue. Springer International Publishing, pp. 287–294.
Křen, M. et al. (2020). SYN2020: A representative corpus of written Czech. Institute of the Czech National Corpus, Faculty of Arts, Charles University, Prague. Accessible at: http://www.korpus.cz.
Křen, M. et al. (2021). SYN v9: large corpus of written Czech. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. Accessible at: http://hdl.handle.net/11234/1-4635.
Kumar, P. et al. (2023). Transformer-Based Models for Named Entity Recognition: A Comparative Study. 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 2023, pp. 1–5. Accessible at: https://doi.org/10.1109/ICCCNT56998.2023.10308039.
Machálek, T. (2020). KonText: Advanced and Flexible Corpus Query Interface. In Proceedings of LREC 2020, pp. 7005–7010.
Machura, J. et al. (2022). Automatic Grammar Correction of Commas in Czech Written Texts: Comparative Study. In: P. Sojka et al. (eds.): Text, Speech, and Dialogue. TSD 2022. Lecture Notes in Computer Science, Vol. 13502. Springer. Accessible at: https://doi.org/10.1007/978-3-031-16270-1_10.
Machura, J. et al. (2023). Is it possible to re-educate RoBERTa? Expert-driven machine learning for punctuation correction. In Slovko (October 18 – 20, 2023) Bratislava. Accessible at: https://dx.doi.org/10.2478/jazcas-2023-0052.
Omelianchuk, K. et al. (2020). GECToR – Grammatical Error Correction: Tag, Not Rewrite. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, Seattle, WA, USA → Online. Association for Computational Linguistics, pp. 163–170.
OpenAI. (2024). ChatGPT-4o. Accessible at: https://chat.openai.com.
Straka, M. et al. (2021). RobeCzech: Czech RoBERTa, a Monolingual Contextualized Language Representation Model. In: K. Ekštein et al. (eds.): Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, Vol. 12848. Springer, Cham. Accessible at: https://doi.org/10.1007/978-3-030-83527-9_17.
Suchomel, V. (2018). csTenTen17, a Recent Czech Web Corpus. In Twelfth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, pp. 111–123.
Švec, J. et al. (2021). Transformer-based automatic punctuation prediction and word casing reconstruction of the ASR output. In: Ekštein, K. et al. (eds.): Text, Speech, and Dialogue, Springer International Publishing, pp. 86–94.
Xue, L. et al. (2021). mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483–498, Online. Association for Computational Linguistics.

DOI: https://doi.org/10.2478/jazcas-2025-0024 | Journal eISSN: 1338-4287 | Journal ISSN: 0021-5597

Language: English

Page range: 268–278

Published on: Nov 27, 2025

Published by: Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics

In partnership with: Paradigm Publishing Services

Publication frequency: 3 issues per year

Keywords: comma, Czech, Fine-tuning, Large Language Model (LLM)

Related subjects: Linguistics and semiotics; Theoretical frameworks and disciplines; Linguistics, other

© 2025 Jakub Machura, Hana Žižková, Patrik Stano, Tereza Vrabcová, Dana Hlaváčková, Ondřej Trnovec, published by Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
