Slovak Question Answering Dataset Based on the Machine Translation of the Squad V2.0

Ján Staš; Daniel Hládek; Tomáš Koctúr

doi:10.2478/jazcas-2023-0054



Journal of Linguistics/Jazykovedný časopis

Volume 74 (2023): Issue 1 (June 2023)


Open Access | Dec 2023

Abstract

This paper describes the process of building SQuAD-sk, the first large-scale machine-translated question answering dataset for the Slovak language. The dataset was automatically translated from the original English SQuAD v2.0 using the Marian neural machine translation framework together with the Helsinki-NLP Opus English-Slovak model. Moreover, we propose an effective approach for approximately locating the translated answer in the translated paragraph, based on measuring their similarity using word vectors. In this way, we recovered more than 92% of the translated questions and answers from the original English dataset. We then used this machine-translated dataset to train a Slovak question answering system by fine-tuning monolingual and multilingual BERT-based language models. The fine-tuned mBERT model achieved an exact match (EM) score of 69.48% and an F1 score of 78.87%, results comparable to those reported for recently published machine-translated SQuAD datasets for other European languages.
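The approximate answer search mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify which word vectors, window sizes, or acceptance threshold were used, so everything below is an assumption made for illustration: character-trigram count vectors stand in for real subword-aware word vectors, and the names `trigram_vector`, `span_vector`, `find_answer_span` and the `threshold` value are hypothetical. The idea is to slide a window over the translated paragraph and keep the span whose vector is most similar to the vector of the translated answer.

```python
import math
import re
from collections import Counter


def trigram_vector(word):
    """Toy stand-in for a subword-aware word vector: character trigram counts."""
    padded = f"<{word.lower()}>"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def span_vector(words):
    """Sum the vectors of all words in a span."""
    total = Counter()
    for word in words:
        total.update(trigram_vector(word))
    return total


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def find_answer_span(paragraph, answer, threshold=0.5):
    """Return (start_char, matched_text, score) for the best window, or None.

    Windows of n-1, n and n+1 words are tried, where n is the answer length
    in words. The threshold is an arbitrary illustrative value.
    """
    tokens = list(re.finditer(r"\w+", paragraph))
    answer_words = re.findall(r"\w+", answer)
    if not tokens or not answer_words:
        return None
    target = span_vector(answer_words)
    n = len(answer_words)
    best = None
    for size in range(max(1, n - 1), n + 2):
        for i in range(len(tokens) - size + 1):
            window = tokens[i:i + size]
            score = cosine(target, span_vector(m.group() for m in window))
            if best is None or score > best[2]:
                start, end = window[0].start(), window[-1].end()
                best = (start, paragraph[start:end], score)
    if best is None or best[2] < threshold:
        return None
    return best


# The translated answer ("rok 1919") is inflected differently from the
# paragraph ("roku 1919"), so an exact string search would fail.
paragraph = "Univerzita Komenského bola založená v roku 1919 v Bratislave."
print(find_answer_span(paragraph, "rok 1919"))
```

The returned character offset plays the role of the `answer_start` field in the SQuAD format. Question-answer pairs for which no window reaches the threshold would be discarded, which is one plausible reason why only part of the original dataset survives translation.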

References

Abadani, N., Mozafari, J., Fatemi, A., Nematbakhsh, M., and Kazemi, A. (2021). ParSQuAD: Persian question answering dataset based on machine translation of SQuAD 2.0. International Journal of Web Research, 4(1), pages 34–46.
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. Trans. of the ACL, Vol. 5, Cambridge, MA, pages 135–146. Accessible at: https://aclanthology.org/Q17-1010.pdf.
Carrino, C. P., Costa-Jussa, M. R., and Fonollosa, J. A. R. (2020). Automatic Spanish translation of the SQuAD dataset for multilingual question answering. In Proc. of LREC, Marseille, France, pages 5515–5523. Accessible at: https://arxiv.org/abs/1912.05200.
Cattan, O., Servan, C., and Rosset, S. (2021). On the usability of transformers-based models for a French question-answering task. In Proc. of RANLP, Varna, Bulgaria, pages 244–255. Accessible at: https://hal.archives-ouvertes.fr/hal-03336060/.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In Proc. of ACL, Online, pages 8440–8451. Accessible at: https://aclanthology.org/2020.acl-main.747.pdf.
Croce, D., Zelenanska, A., and Basili, R. (2018). Neural learning for question answering in Italian. In C. Ghidini – B. Magnini – A. Passerini – P. Traverso (eds): Advances in Artificial Intelligence, LNAI vol. 11298, Springer, Cham, pages 389–402. Accessible at: https://doi.org/10.1007/978-3-030-03840-3_29.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL, Minneapolis, Minnesota, pages 4171–4186. Accessible at: https://aclanthology.org/N19-1423/.
Gemirter, C. B., and Goularas, D. (2021). A Turkish question answering system based on deep learning neural networks. Journal of Intelligent Systems: Theory and Applications, 4(2), pages 65–75. Accessible at: https://dergipark.org.tr/tr/download/article-file/1361881.
Gupta, D., Ekbal, A., and Bhattacharyya, P. (2019). A deep neural network framework for English Hindi question answering. ACM TALLIP, 19(2), Article No. 25, pages 1–22.
Hládek, D., Staš, J., Juhár, J., and Koctúr, T. (2023). Slovak dataset for multilingual question answering. IEEE Access, Vol. 11, pages 32869–32881. Accessible at: https://ieeexplore.ieee.org/document/10082887.
Honnibal, M., Montani, I., Van Landeghem, S., and Boyd, A. (2020). spaCy: Industrial-strength natural language processing in Python. Accessible at: https://doi.org/10.5281/zenodo.1212303.
Junczys-Dowmunt, M., Grundkiewicz, R., Dwojak, T., Hoang, H., Heafield, K., Neckermann, T., Seide, F., Germann, U., Aji, A. F., Bogoychev, N., Martins, A. F. T., and Birch, A. (2018). Marian: Fast neural machine translation in C++. In Proc. of ACL, Melbourne, Australia, pages 116–121. Accessible at: https://aclanthology.org/P18-4020.pdf.
Lee, K., Yoon, K., Park, S., and Hwang, S. W. (2018). Semi-supervised training data generation for multilingual question answering. In Proc. of LREC, Miyazaki, Japan, pages 2758–2762. Accessible at: https://aclanthology.org/L18-1437.
Macková, K., and Straka, M. (2020). Reading comprehension in Czech via machine translation and cross-lingual transfer. In Proc. of TSD, Brno, Czech Republic, pages 171–179. Accessible at: https://arxiv.org/abs/2007.01667.
Mayeesha, T. T., Sarwar, A. Md., and Rahman, R. M. (2021). Deep learning based question answering in Bengali. Journal of Information and Telecommunication, 5(2), pages 145–178. Accessible at: https://doi.org/10.1080/24751839.2020.1833136.
Mozannar, H., El Hajal, K., Maamary, E., and Hajj, H. M. (2019). Neural Arabic question answering. In Proc. of WANLP, Florence, Italy, pages 108–118. Accessible at: https://arxiv.org/abs/1906.05394v1.
Pikuliak, M., Grivalský, Š., Konôpka, M., Blšták, M., Tamajka, M., Bachratý, V., Šimko, M., Balážik, P., Trnka, M., and Uhlárik, F. (2022). SlovakBERT: Slovak masked language model. In Proc. of EMNLP, Abu Dhabi, United Arab Emirates, pages 7156–7168. Accessible at: https://aclanthology.org/2022.findings-emnlp.530.pdf.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. In Proc. of EMNLP, Austin, Texas, pages 2383–2392. Accessible at: https://aclanthology.org/D16-1264.pdf.
Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don’t know: Unanswerable questions for SQuAD. In Proc. of ACL, Melbourne, Australia, pages 784–789. Accessible at: https://aclanthology.org/P18-2124.pdf.
Tiedemann, J., and Thottingal, S. (2020). OPUS-MT – Building open translation services for the World. In Proc. of EAMT, Lisboa, Portugal, pages 479–480. Accessible at: https://aclanthology.org/2020.eamt-1.61.pdf.
Tiutiunnyk, S., and Dyomkin, V. (2019). Context-based question-answering system for the Ukrainian language. In Proc. of MS-AMLV, Lviv, Ukraine, pages 81–88. Accessible at: https://ceur-ws.org/Vol-2566/MS-AMLV-2019-paper17-p081.pdf.


DOI: https://doi.org/10.2478/jazcas-2023-0054 | Journal eISSN: 1338-4287 | Journal ISSN: 0021-5597


Language: English

Page range: 381 - 390

Published on: Dec 25, 2023

Published by: Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics

In partnership with: Paradigm Publishing Services

Publication frequency: 3 issues per year

Keywords: language modeling, machine reading comprehension, machine translation, natural language processing, question answering

Related subjects: Linguistics and semiotics; Theoretical frameworks and disciplines; Linguistics, other

© 2023 Ján Staš, Daniel Hládek, Tomáš Koctúr, published by Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
