(1) Overview
Repository location
Context
Finding effective ways to evaluate the quality of machine translation has been a central concern for researchers across translation studies, literary studies, and machine learning. Machine translation evaluation has traditionally depended on automatic metrics, of which BLEU has been the most widely adopted (Papineni et al., 2002), despite its considerable limitations in literary contexts (Reiter, 2018; Callison-Burch et al., 2006). Newer metrics such as COMET and BLEURT have improved substantially by moving beyond n-gram overlap to embedding-based approaches, but they still struggle to capture the subtle qualitative differences critical for literary translation, such as semantic nuance, discourse coherence, and stylistic fidelity (Fomicheva et al., 2022; Rei et al., 2020; Sellam et al., 2020; Toral & Way, 2018; Graham et al., 2015). Our research responds to this pressing need for more sophisticated, interpretable evaluation methods that can address the complexities of literary translation.
Literary translation poses especially difficult challenges for machine translation systems, requiring careful attention to tone, style, and cultural resonance—qualities that have made literature the “last frontier” of machine translation (Toral & Way, 2018). Even advanced neural machine translation systems struggle to handle literary works that mobilize figurative language and idiomatic expressions (Matusov, 2019; Forcada, 2017). The English–Korean language pair compounds these challenges because of characteristics of Korean such as agglutinative morphology, free word order, and an intricate honorific system (Sohn, 1999; Cho, 2022). Because most existing evaluation methods overlook or struggle to identify these culturally embedded features—especially the complex use of honorifics—reliable assessment of literary translation quality for this particular pair has remained out of reach.
The rapid development of large language models has presented new possibilities. Research has shown that systems such as GPT-4 can be effective evaluators when used with frameworks like MQM and story-specific evaluation prompts (Freitag et al., 2023; Kocmi & Federmann, 2023; Liu et al., 2023). But they have clear limitations, too: their performance varies significantly with how prompts are written, they often prioritize factual accuracy over stylistic quality, and they exhibit order effects (Wang et al., 2024). These issues call for hybrid approaches that combine automated efficiency with human interpretive skill. In this context, our dataset design is based on the premise that human-in-the-loop evaluation offers a promising way forward by bringing together human insight and the scalability of computational methods (Blain et al., 2023). The lack of comparative datasets and research pipelines that capture both human and AI perspectives has slowed progress toward more robust, culturally aware evaluation methods; we therefore present a two-step framework that balances interpretive human input with automated, widely adaptable evaluation.
Our dataset was created as part of our broader research investigating the need for, and feasibility of, culturally sensitive evaluation of literary machine translation (https://arxiv.org/abs/2412.01340, last accessed 28 October 2025). Supporting materials and analysis code are available in the accompanying repository (https://github.com/sheikhshafayat/ruler-verse), a snapshot of which is archived at https://doi.org/10.5281/zenodo.17354947. While the accompanying work describes the core methodologies of the proposed automated literary evaluation framework, this work’s primary concern is the human evaluation of literary machine translations. We focus on the understudied area of English–Korean literary machine translation, with special attention to often-overlooked features such as honorific usage and cultural adaptation.
To address the gaps outlined above, our approach comprises two novel methodologies for evaluating machine translations of literature: RULER provides structured, rubric-based assessment, while VERSE verifies translations against evaluation questions generated by a Large Language Model (LLM) and answered by another LLM. The final translation scores are aggregated across paragraphs.
However, to measure current LLMs’ ability to evaluate literary translation, we first conduct an extensive expert human evaluation of machine translations, which we later use to benchmark the LLMs.
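As a minimal illustration of how paragraph-level scores can be rolled up into a translation-level score, the sketch below averages invented RULER ratings. The aggregation actually used in the accompanying paper may differ; the category keys and values here are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical per-paragraph RULER scores (1-5) for one translated story.
# Category names follow the rubric described in Section (2); the values are invented.
paragraph_scores = [
    {"honorifics": 4, "lexical_choice": 5, "syntax": 4, "content": 5},
    {"honorifics": 3, "lexical_choice": 4, "syntax": 4, "content": 4},
    {"honorifics": 5, "lexical_choice": 4, "syntax": 5, "content": 5},
]

# One simple aggregation: average each category across paragraphs,
# then average the category means into a single translation-level score.
category_means = {
    cat: mean(p[cat] for p in paragraph_scores) for cat in paragraph_scores[0]
}
translation_score = mean(category_means.values())

print(category_means)     # per-category averages across paragraphs
print(translation_score)  # overall translation-level score
```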
(2) Dataset Collection Method
We first collected 15 short stories originally authored in English, representing a wide range of genres and styles (e.g., gothic fiction, modernist literature, realist fiction, speculative fiction). Each story had an existing Korean translation produced by literary translators with professional experience (three of the stories had two different translations).
In this work, we release the human-annotation dataset, which was randomly sampled from a subset of 10 of the 15 stories. We reserve the remaining five stories so that they can later serve as an uncontaminated test set; they are not used in this work. We randomly selected 20 paragraph–machine-translation pairs from each story (200 total) and paired each with three LLM-generated, story-specific literary evaluation questions (600 total). Note that some paragraphs in our dataset have multiple corresponding translations; as a result, the unique identifier is the combination of a paragraph and its translation, rather than the paragraph text alone.
Three literary experts with extensive experience in both English and Korean literature, all native speakers of Korean, evaluated the translations using two distinct criteria: 1) a rubric-based 1–5 evaluation of each translation across the four categories defined by our RULER system, and 2) a 1–3 evaluation in response to LLM-generated questions focusing on specific traits of each literary work (VERSE).
All annotations were done on the Label Studio platform (Tkachenko et al., 2020–2025) hosted on HuggingFace (see https://labelstud.io/, last accessed 28 October 2025).
Dataset Construction Steps
Initial study and rubric development: we first conducted a small-scale qualitative evaluation of machine-translated literary texts (five selected short stories) produced by various large and small language models as well as commercial translation services. This analysis revealed four recurrent error categories in English-to-Korean literary translation — Lexical Choice, Proper Use of Honorifics in Dialogues, Syntax and Grammar, and Content Accuracy — and motivated the creation of a detailed, fine-grained rubric (RULER) for systematic human evaluation, partly inspired by the Multidimensional Quality Metrics (MQM) framework. Recognizing that eliminating common errors alone is insufficient for high-quality literary translation, we also designed a second evaluation step (VERSE) to capture nuanced, story-specific literary qualities. In VERSE, an LLM takes in the original English paragraph and generates a list of questions that a translation must satisfy to be considered successful. These LLM-generated questions target aspects such as cultural context, characterization, and stylistic resonance, and are judged by another LLM (a simplified sketch of this workflow is given below). The human annotation is conducted to test whether LLMs are capable of performing these same judgments, thus providing a scalable way to evaluate literature.
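The following sketch shows one way the two LLM roles in VERSE (question generator and judge) could be wired together. It is an illustration under stated assumptions, not the released pipeline: the prompts and the naive answer parsing are invented, the generator model name follows the one reported below for question generation, the judge model is a placeholder, and the OpenAI chat-completions client is used only as an example backend.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_verse_questions(source_paragraph: str, n_questions: int = 3) -> list[str]:
    """Ask one LLM to propose story-specific evaluation questions (illustrative prompt)."""
    prompt = (
        "You are a literary translation expert. Read the English paragraph below and write "
        f"{n_questions} questions that a successful Korean translation must satisfy, covering "
        "aspects such as cultural context, characterization, and stylistic resonance.\n\n"
        f"Paragraph:\n{source_paragraph}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip("-0123456789. ").strip() for line in lines if line.strip()][:n_questions]


def judge_translation(source: str, translation: str, question: str) -> int:
    """Ask a second LLM to rate the translation on a 1-3 scale for one question."""
    prompt = (
        "Given the English source and its Korean machine translation, answer the evaluation "
        "question with a single integer from 1 (not satisfied) to 3 (fully satisfied).\n\n"
        f"Source:\n{source}\n\nTranslation:\n{translation}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
    )
    # Naive parsing: take the first character of the reply as the score.
    return int(response.choices[0].message.content.strip()[0])
```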
Annotation:
Pilot annotation: one round of annotation on a small sample was conducted by the three expert annotators to develop annotation guidelines and to identify potential user-interface issues on Label Studio. These samples were not used in the final annotation. The pilot annotation and initial experiments also showed that a 1–3 scale works better for LLM graders in the VERSE step.
Main annotation: after the pilot annotation, detailed guidelines were created by the three experts, and the main annotation was conducted over the span of several weeks. The dataset released here comes from this round of annotation.
Annotation interface: annotations were conducted using a custom web-based interface built on Label Studio, presenting the original English text, the machine translation, and a reference human translation for context. For RULER, the interface displayed the four rubric criteria, allowing annotators to assign independent 1–5 scores for each. For VERSE, the interface additionally showed three LLM-generated, story-specific evaluation questions, and annotators rated the machine translation on a 1–3 scale based solely on the aspect targeted by the question. RULER and VERSE annotations were presented as separate tasks to annotators (i.e., RULER and VERSE tasks were not shown on the interface at the same time). Annotation interface templates are available in the codebase.
Dataset quality: the inter-annotator agreement in terms of Kendall’s τ and Krippendorff’s α across RULER and VERSE categories was consistently high. For the RULER categories — Honorifics (τ = .70, α = .74), Lexical Choices (τ = .73, α = .80), Syntax (τ = .66, α = .75), and Content (τ = .73, α = .80) — annotators showed strong concordance. The VERSE category also exhibited substantial agreement with τ = .65 and α = .70. These values indicate high reliability of the annotations across both Step 1 (RULER) and Step 2 (VERSE) evaluations.
We also asked the annotators to reannotate a small subset of 40 paragraphs four weeks after the main annotation and found that intra-annotator agreement was high (α = 0.7).
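For readers who want to reproduce this kind of reliability analysis on the released annotations, the sketch below shows how Kendall’s τ and Krippendorff’s α are commonly computed in Python. It uses the scipy and krippendorff packages with invented rating vectors; it is not the exact script behind the reported figures.

```python
import numpy as np
from scipy.stats import kendalltau
import krippendorff  # pip install krippendorff

# Invented example: three annotators' 1-5 ratings for ten items in one RULER category.
ratings = np.array([
    [4, 5, 3, 4, 2, 5, 4, 3, 5, 4],  # annotator A
    [4, 4, 3, 4, 2, 5, 5, 3, 5, 4],  # annotator B
    [5, 5, 3, 3, 2, 4, 4, 3, 5, 4],  # annotator C
])

# Pairwise Kendall's tau, averaged over annotator pairs.
pairs = [(0, 1), (0, 2), (1, 2)]
taus = [kendalltau(ratings[i], ratings[j]).correlation for i, j in pairs]
print("mean Kendall's tau:", np.mean(taus))

# Krippendorff's alpha, treating the ratings as ordinal data.
print("Krippendorff's alpha:",
      krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal"))
```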
Translation generation parameters: to create a diverse set of translations, we translated the source stories using multiple LLMs and Google Translate. All LLMs were sampled with temperature = 1.0, top-p = 1.0, and top-k = 1.0. The names of the generation models are provided in the released dataset. To ensure diversity in translation style and quality, we employed a variety of prompting strategies during generation, including zero-shot, few-shot, sentence-level, paragraph-level, and summary-augmented prompting. While the specific prompting variants are not indicated in the dataset release, the resulting outputs collectively reflect the variation introduced by these approaches. The Google Translation API was used with its default English-to-Korean settings. From the generated pool, 200 translations were sampled for inclusion in the human annotation. All VERSE questions were generated by the gpt-4o-2024-05-13 model (their quality was assessed by human inspection).
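As a rough illustration of such a generation call, the sketch below issues a zero-shot, paragraph-level translation request with the sampling parameters listed above. The prompt wording and model identifier are assumptions, and top_k is commented out because it is not exposed by the OpenAI chat-completions API used here as the example backend (providers such as Anthropic or Google accept it through their own clients).

```python
from openai import OpenAI

client = OpenAI()


def translate_paragraph(paragraph: str, model: str = "gpt-4o-2024-05-13") -> str:
    """Zero-shot, paragraph-level English-to-Korean translation (illustrative prompt)."""
    response = client.chat.completions.create(
        model=model,
        temperature=1.0,
        top_p=1.0,
        # top_k=1.0,  # only available on some providers' APIs, not OpenAI chat completions
        messages=[
            {
                "role": "user",
                "content": (
                    "Translate the following literary paragraph into Korean, preserving "
                    "tone, style, and honorific nuance:\n\n" + paragraph
                ),
            }
        ],
    )
    return response.choices[0].message.content.strip()
```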
(3) Dataset Description
Repository name
Zenodo
Object name
‘dataset.tar.gz’ containing:
‘step1-ruler.csv’
‘step2-verse.csv’
‘story-metadata.csv’
‘README.md’
Format names and versions
CSV
Creation dates
2024-08-01 – 2024-09-30
Dataset creators
Seohyon Jung (KAIST): Annotation and supervision
Woori Jang (KAIST): Annotation (graduate student)
Jiwoo Choi (KAIST): Annotation (graduate student)
All three were also authors of the paper.
Language
English, Korean (English to Korean single direction)
License
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Publication date
2025-08-08
(4) Reuse Potential
Our dataset can contribute to advancing research across three interconnected domains: a) computational evaluation of literary machine translation, b) literary translation studies, and c) human-AI interaction and annotation as a humanistic research method.
Developing evaluation methods for literary machine translation
Literary translation requires more than technical accuracy. Current automatic evaluation metrics for machine translation insufficiently capture the nuanced stylistic, cultural, and narrative features central to literary translation quality. We have developed two complementary evaluation frameworks—RULER (rubric-based ratings) and VERSE (verification-based questions)—that address the unique challenges of Korean literary translation by leveraging LLMs’ capacity to assess stylistic, tonal, and thematic variance beyond simple computational measures.
Our dataset opens up creative possibilities for researchers to benchmark different machine translation systems on dimensions that matter most for literature, including stylistic fidelity, cultural adaptation, and narrative expressiveness. The parallel structure of our framework also allows for systematic analysis of how different evaluation approaches correlate or diverge. Thus, the dual-framework methodology provides a replicable template for creating similar resources across other language pairs and literary genres.
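For instance, once the archive is extracted, the released CSV files can be loaded to examine how the two evaluation approaches relate. The sketch below uses pandas; the column names (paragraph_id, translation_id, the four RULER categories, and verse_score) are assumptions that should be checked against the README.md shipped with the dataset.

```python
import pandas as pd

# File names follow the released archive; column names below are assumptions
# and should be verified against README.md in the dataset.
ruler = pd.read_csv("step1-ruler.csv")
verse = pd.read_csv("step2-verse.csv")

# Average the four RULER categories per paragraph-translation pair (hypothetical columns).
ruler_cols = ["honorifics", "lexical_choice", "syntax", "content"]
ruler["ruler_mean"] = ruler[ruler_cols].mean(axis=1)

# Average the three VERSE question scores per pair, then join the two frames.
verse_mean = (verse.groupby(["paragraph_id", "translation_id"])["verse_score"]
                   .mean().rename("verse_mean").reset_index())
merged = ruler.merge(verse_mean, on=["paragraph_id", "translation_id"])

# Rank correlation between the rubric-based and verification-based scores.
print(merged[["ruler_mean", "verse_mean"]].corr(method="kendall"))
```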
Expanding the scope of literary translation studies and digital humanities
While originally created for computational research in Natural Language Processing (NLP), the dataset offers significant potential as a resource for literary translation studies and digital humanities. The annotated dataset includes 200 paragraph-aligned segments drawn from a collection of fifteen English short stories (ten of which appear in this release) spanning varying historical periods, genres, and styles, enabling comparative analysis of how MT systems negotiate literary features including tone, figurative language, and irony.
By focusing on English-to-Korean translation, a relatively understudied language pair, this dataset facilitates investigation into linguistically and culturally specific challenges in literary translation. The proper use of honorifics and idiomatic expressions, as well as historical and cultural contexts, can be examined across different MT systems, including LLMs such as Claude and ChatGPT as well as Google Translate [1]. The annotations by native Korean speakers with formal literary training enable studies of interpretive variance and reader reception, central concerns in literary translation studies that have traditionally relied on theoretical frameworks.
Advancing research in human-AI interaction and annotation science
Literature, being both highly subjective and culturally high-stakes, makes an ideal testbed for examining human interaction with AI-generated outputs. The dataset’s design enables methodological developments in examining patterns of reliability and inter-annotator agreement across different annotation types and machine translation styles.
The incorporation of LLM-generated questions as evaluation prompts in VERSE demonstrates one model of human-AI collaboration and of how such collaboration can be assessed. Researchers can adapt our two-step framework to other languages by customizing the RULER rubric for language-specific requirements (a hypothetical configuration sketch follows below). Our hybrid approach, combining well-crafted, structured rubrics with adaptive questioning, can reduce dependence on cognitively demanding human annotation without sacrificing the depth and quality of the evaluation. A well-designed annotation system for literary translation studies encompasses two main elements: first, training protocols tuned to the cultural and linguistic characteristics of each language, and second, quality-control mechanisms tailored to subjective domains such as literature.
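One lightweight way to make such customization explicit is to represent the rubric as a configuration object. The sketch below is a hypothetical illustration: the category descriptions are paraphrased, and the substitution suggested in the final comment is only an example, not part of the released rubric.

```python
from dataclasses import dataclass, field


@dataclass
class RubricCategory:
    name: str
    description: str
    scale: tuple[int, int] = (1, 5)


@dataclass
class Rubric:
    language_pair: str
    categories: list[RubricCategory] = field(default_factory=list)


# The four English-to-Korean RULER categories (descriptions paraphrased for illustration).
ruler_en_ko = Rubric(
    language_pair="en-ko",
    categories=[
        RubricCategory("Lexical Choice", "Word choices fit the literary register and context."),
        RubricCategory("Honorifics in Dialogues", "Speech levels match character relationships."),
        RubricCategory("Syntax and Grammar", "Sentences are grammatical and natural in Korean."),
        RubricCategory("Content Accuracy", "Meaning of the source is preserved."),
    ],
)

# Adapting the framework to another language pair would mean swapping the language-specific
# categories, e.g. replacing honorifics with a category for that language's register system.
```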
This dataset ultimately calls for more academic attention toward developing evaluation frameworks that honor both computational efficiency and humanistic depth. We demonstrate how AI tools can be designed to complement rather than replace human cultural expertise, highlighting the need for a more nuanced understanding of where and how AI should be integrated into humanistic scholarship.
Notes
[1] See Claude: https://claude.ai/; OpenAI ChatGPT: https://chatgpt.com/; Google Translate: https://translate.google.com/. All websites last accessed on 28 October 2025.
Acknowledgements
The authors would like to acknowledge Professor Alice Oh (KAIST) for her guidance and feedback throughout this process, particularly in the accompanying research project. Alice Oh is listed as one of the authors of the companion paper describing the broader methodology.
Competing Interests
The authors have no competing interests to declare.
Author Contributions
Sheikh Shafayat, Dongkeun Yoon: software, methodology, visualization, writing.
Jiwoo Choi, Woori Jang, Seohyon Jung: conceptualization, funding acquisition, formal analysis, methodology, writing.
