References
- Blain, F., Zerva, C., Rei, R., Guerreiro, N. M., Kanojia, D., de Souza, J. G. C., Silva, B., Vaz, T., Jingxuan, Y., Azadi, F., Orăsan, C., & Martins, A. (2023). Findings of the WMT 2023 Shared Task on Quality Estimation. In P. Koehn, B. Haddow, T. Kocmi, & C. Monz (Eds.), Proceedings of the Eighth Conference on Machine Translation (WMT 2023) (pp. 629–653). 10.18653/v1/2023.wmt-1.52
- Callison-Burch, C., Osborne, M., & Koehn, P. (2006). Re-evaluating the role of BLEU in machine translation research. In Proceedings of EACL 2006 (pp. 249–256). https://aclanthology.org/E06-1032 (last accessed 1 November 2025).
- Cho, S. (2022). Politeness Strategies in Korean. In S. Cho & J. Whitman (Eds.), The Cambridge Handbook of Korean Linguistics (pp. 105–132). Cambridge University Press. 10.1017/9781108292351.006
- Fomicheva, M., Specia, L., & Aletras, N. (2022). Translation Error Detection as Rationale Extraction. In Findings of ACL 2022 (pp. 4148–4159). 10.18653/v1/2022.findings-acl.327
- Forcada, M. L. (2017). Making sense of neural machine translation. Translation Spaces, 6(2), 291–309. 10.1075/ts.6.2.06for
- Freitag, M., Mathur, N., Ma, Q., Fomicheva, M., Avramidis, E., Albrecht, J., … & Bojar, O. (2023). Results of WMT23 metrics shared task: Automatic MT evaluation in the age of large language models. In P. Koehn, B. Haddow, T. Kocmi, & C. Monz (Eds.), Proceedings of the Eighth Conference on Machine Translation (pp. 624–661). https://aclanthology.org/2023.wmt-1.51.pdf (last accessed 1 November 2025).
- Graham, Y., Baldwin, T., & Mathur, N. (2015). Accurate evaluation of segment-level machine translation metrics. In R. Mihalcea, J. Chai, & A. Sarkar (Eds.), Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1183–1191). 10.3115/v1/N15-1124
- Kocmi, T., & Federmann, C. (2023). Large language models are state-of-the-art evaluators of translation quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation (pp. 193–203). https://aclanthology.org/2023.eamt-1.19/ (last accessed 1 November 2025).
- Liu, Y., Iter, D., Xu, Y., Xu, R., & Zhu, C. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 2511–2522). 10.18653/v1/2023.emnlp-main.153
- Matusov, E. (2019). The Challenges of Using Neural Machine Translation for Literature. In J. Hadley, M. Popović, H. Afli, & A. Way (Eds.), Proceedings of the Qualities of Literary Machine Translation (pp. 10–19). https://aclanthology.org/W19-7302.pdf (last accessed 1 November 2025).
- Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). 10.3115/1073083.1073135
- Rei, R., Stewart, C., Farinha, A. C., & Lavie, A. (2020). COMET: A Neural Framework for MT Evaluation. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 2685–2702). 10.18653/v1/2020.emnlp-main.213
- Reiter, E. (2018). A structured review of the validity of BLEU. Computational Linguistics, 44(3), 393–401. 10.1162/COLI_a_00322
- Sellam, T., Das, D., & Parikh, A. P. (2020). BLEURT: Learning Robust Metrics for Text Generation. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7881–7892). 10.18653/v1/2020.acl-main.704
- Sohn, H. (1999). The Korean language. Cambridge University Press.
- Tkachenko, M., Malyuk, M., Holmanyuk, A., & Liubimov, N. (2020–2025). Label Studio: Data labeling software [Computer software]. HumanSignal. https://github.com/HumanSignal/label-studio
- Toral, A., & Way, A. (2018). What level of quality can neural machine translation attain on literary text? In J. Moorkens, S. Castilho, F. Gaspari, & S. Doherty (Eds.), Translation Quality Assessment: From Principles to Practice (pp. 263–287). New York: Springer. 10.1007/978-3-319-91241-7_12
- Wang, P., Li, L., Chen, L., Cai, Z., Zhu, D., Lin, B., Cao, Y., Kong, L., Liu, Q., Liu, T., & Sui, Z. (2024). Large Language Models are Not Fair Evaluators. In L. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (pp. 9440–9450). 10.18653/v1/2024.acl-long.511
