
Evaluating English-Korean Literary Machine Translations: A Dataset Featuring the RULER and VERSE Annotation Methods

Open Access | Nov 2025

References

  1. Blain, F., Zerva, C., Rei, R., Guerreiro, N. M., Kanojia, D., de Souza, J. G. C., Silva, B., Vaz, T., Jingxuan, Y., Azadi, F., Orăsan, C., & Martins, A. (2023). Findings of the WMT 2023 Shared Task on Quality Estimation. In P. Koehn, B. Haddow, T. Kocmi, & C. Monz (Eds.), Proceedings of the Eighth Conference on Machine Translation (WMT 2023) (pp. 629–653). 10.18653/v1/2023.wmt-1.52
  2. Callison-Burch, C., Osborne, M., & Koehn, P. (2006). Re-evaluating the role of BLEU in machine translation research. In Proceedings of EACL 2006 (pp. 249–256). https://aclanthology.org/E06-1032. Last accessed 1 November 2025.
  3. Cho, S. (2022). Politeness Strategies in Korean. In S. Cho & J. Whitman (Eds.), The Cambridge Handbook of Korean Linguistics (pp. 105–132). Cambridge University Press. 10.1017/9781108292351.006
  4. Fomicheva, M., Specia, L., & Aletras, N. (2022). Translation Error Detection as Rationale Extraction. In Findings of ACL 2022 (pp. 4148–4159). 10.18653/v1/2022.findings-acl.327
  5. Forcada, M. L. (2017). Making sense of neural machine translation. Translation Spaces, 6(2), 291–309. 10.1075/ts.6.2.06for
  6. Freitag, M., Mathur, N., Ma, Q., Fomicheva, M., Avramidis, E., Albrecht, J., … & Bojar, O. (2023). Results of WMT23 metrics shared task: Automatic MT evaluation in the age of large language models. In P. Koehn, B. Haddow, T. Kocmi, & C. Monz (Eds.), Proceedings of the Eighth Conference on Machine Translation (pp. 624–661). https://aclanthology.org/2023.wmt-1.51.pdf. Last accessed 1 November 2025.
  7. Graham, Y., Baldwin, T., & Mathur, N. (2015). Accurate evaluation of segment-level machine translation metrics. In R. Mihalcea, J. Chai, & A. Sarkar (Eds.), Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1183–1191). 10.3115/v1/N15-1124
  8. Kocmi, T., & Federmann, C. (2023). Large language models are state-of-the-art evaluators of translation quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation (pp. 193–203). https://aclanthology.org/2023.eamt-1.19/. Last accessed 1 November 2025.
  9. Liu, Y., Iter, D., Xu, Y., Xu, R., & Zhu, C. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 2511–2522). 10.18653/v1/2023.emnlp-main.153
  10. Matusov, E. (2019). The Challenges of Using Neural Machine Translation for Literature. In J. Hadley, M. Popović, H. Afli, & A. Way (Eds.), Proceedings of the Qualities of Literary Machine Translation (pp. 10–19). https://aclanthology.org/W19-7302.pdf. Last accessed 1 November 2025.
  11. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). 10.3115/1073083.1073135
  12. Rei, R., Stewart, C., Farinha, A. C., & Lavie, A. (2020). COMET: A Neural Framework for MT Evaluation. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 2685–2702). 10.18653/v1/2020.emnlp-main.213
  13. Reiter, E. (2018). A structured review of the validity of BLEU. Computational Linguistics, 44(3), 393–401. 10.1162/COLI_a_00322
  14. Sellam, T., Das, D., & Parikh, A. P. (2020). BLEURT: Learning Robust Metrics for Text Generation. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7881–7892). 10.18653/v1/2020.acl-main.704
  15. Sohn, H. (1999). The Korean language. Cambridge University Press.
  16. Tkachenko, M., Malyuk, M., Holmanyuk, A., & Liubimov, N. (2020–2025). Label Studio: Data labeling software [Computer software]. HumanSignal. https://github.com/HumanSignal/label-studio
  17. Toral, A., & Way, A. (2018). What level of quality can neural machine translation attain on literary text? In J. Moorkens, S. Castilho, F. Gaspari, & S. Doherty (Eds.), Translation Quality Assessment: From Principles to Practice (pp. 263–287). New York: Springer. 10.1007/978-3-319-91241-7_12
  18. Wang, P., Li, L., Chen, L., Cai, Z., Zhu, D., Lin, B., Cao, Y., Kong, L., Liu, Q., Liu, T., & Sui, Z. (2024). Large Language Models are Not Fair Evaluators. In L. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (pp. 9440–9450). 10.18653/v1/2024.acl-long.511
DOI: https://doi.org/10.5334/johd.393 | Journal eISSN: 2059-481X
Language: English
Submitted on: Sep 13, 2025
Accepted on: Oct 20, 2025
Published on: Nov 26, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Sheikh Shafayat, Dongkeun Yoon, Jiwoo Choi, Woori Jang, Seohyon Jung, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.