Annotated Mystery Narratives Data Set
Open Access | Dec 2025


1 Overview

Repository location

South African Centre for Digital Language Resources: https://hdl.handle.net/20.500.12185/693.

Context

To understand storytelling, we analyze existing stories and their narrative structures. Narrative structure is often analyzed through temporal aspects such as order, duration, and frequency, as defined by Genette et al. (1980), which provide insights into how events are sequenced, paced, and repeated within a story. For example, flashbacks and flash-forwards disrupt linear timelines to heighten suspense or provide context (Eisenberg & Finlayson, 2021; Kearns, 2020). Such structural techniques pose unique challenges in literary text because of their complex temporal shifts and ambiguous boundaries compared to other narrative domains like news or games (Piper et al., 2021).

Universal properties include narrator or scene changes, but certain properties may be genre-dependent. For example, in the mystery or whodunit genre, the thoughts of the detective, a confession, or a reveal are genre-dependent properties that provide more detailed insight into aspects of the genre.

Computational efforts, like the SANTA project, annotate narrative levels and scenes in shorter texts (Reiter et al., 2019). Full-text analysis, as in this data set, captures complete story arcs and genre-specific structures, building on prior work to support computational analysis and literary studies of general and genre-specific narrative properties.

This data set is produced as part of the author’s PhD research. A subset of the data set was used by Heyns & Van Zaanen (2024).

2 Method

The collection and annotation of the material in the data set consist of a number of steps: selection and collection, clean-up, import and annotation, and packaging, which we describe in more detail here.

Steps

First, we manually selected ten narratives from Project Gutenberg.[1] Project Gutenberg provides digital versions of books that are freely usable. We downloaded plain text (UTF-8) versions of the books listed in Table 1. Note that some narratives come from the same downloaded text file.

Table 1

Information on the narratives in the data set.

| TEXT | AUTHOR | # SENTENCES | # WORDS |
|---|---|---|---|
| The adventures of the Italian nobleman [a] | Agatha Christie | 324 | 3839 |
| The jewel robbery at the grand metropolitan [a] | Agatha Christie | 564 | 5101 |
| The mystery of Hunter’s Lodge [a] | Agatha Christie | 408 | 4618 |
| A case of identity [b] | Sir Arthur Conan Doyle | 451 | 7028 |
| A scandal in Bohemia [b] | Sir Arthur Conan Doyle | 686 | 8628 |
| The red-headed league [b] | Sir Arthur Conan Doyle | 632 | 9273 |
| A question of passports [c] | Emmuska Orczy | 257 | 3767 |
| Story of the young man in holy orders [d] | Robert Louis Stevenson | 296 | 5623 |
| The chronicles of Addington Peace [d] | Bertram Fletcher Robinson | 542 | 6295 |
| The mystery of the steel disk [d] | Broughton Brandenburg | 488 | 7356 |
| Total | | 4648 | 61,528 |

Second, texts were cleaned by manually removing non-narrative elements (colophon, title, table of contents, foreword). Chapter headings, divisions, and section breaks were retained for annotation utility. Only the narrative and its layout structure were kept. See Table 1 for information on the size of each text.[2]

Third, cleaned texts were imported into CATMA,[3] a tool for manual annotation and analysis. Following guidelines in Heyns and Van Zaanen (2024), 18 tag types (listed in Table 2) were created to annotate narrative structure. Brief tag descriptions are provided here; the full schema with definitions and examples appears in Heyns and Van Zaanen (2024).

Table 2

Information on the tags assigned in the data set.

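| TAG SET | TAG |
|---|---|
| Perspective | Focalization (2) |
| Perspective | Voice (22) (homodiegetic/heterodiegetic) |
| Segment | Ellipsis (0) |
| Segment | Non-scene (Summary (26)/Description (3)) |
| Segment | Scene (Time, Place, Characters) (115) |
| Anachronisms | Analepsis (11) |
| Anachronisms | Prolepsis (0) |
| Diegetic level | Extradiegetic (69) |
| Diegetic level | Intradiegetic (124) |
| Diegetic level | Metadiegetic (20) |
| Acts | Aftermath (9) |
| Acts | Confession (1) |
| Acts | Confrontation (4) |
| Acts | Introduction (10) |
| Acts | Investigation (10) |
| Acts | Reveal (10) |
| Misc | Clue (42) |
| Misc | Detective thought (33) |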

Perspective comprises focalization (viewpoint/narrator knowledge (Wirén & Ek, 2021)) and voice (narrator-text relationship and presence (Ketschik et al., 2021; Wirén & Ek, 2021)).

Segment types include scene (consistent time/place/characters (Gius et al., 2019b)), non-scene (absent time/place/characters), and ellipsis (implied elapsed time).

Anachronisms disrupt the primary sequence (Kearns, 2021): analepsis (past shift) and prolepsis (future visions/foreshadowing).

Diegetic level is the narrator’s position relative to the story (Genette et al., 1980): extradiegetic (narrator outside), intradiegetic (characters/events within), metadiegetic (nested secondary story).

Acts mark progression via shifts in focus/perspective/theme (Heyns & van Zaanen, 2021). The mystery structure consists of introduction, investigation, confrontation, confession, reveal, and aftermath (not all are required). Additional tags mark when a clue is given/recognized or when the detective is thinking.

Narratives were manually tagged with 511 total annotations (details in Table 2).
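The per-tag counts in Table 2 can be cross-checked against the reported total. The following is a minimal sketch, not part of the released data set: the counts are transcribed by hand from Table 2, the non-scene tag is split into its two subtypes as the table reports them, and the function names are illustrative.

```python
# Tag counts transcribed by hand from Table 2, grouped by tag set.
# "Non-scene" is split into the two subtypes the table reports separately.
TAG_COUNTS = {
    "Perspective": {"Focalization": 2, "Voice": 22},
    "Segment": {
        "Ellipsis": 0,
        "Non-scene (Summary)": 26,
        "Non-scene (Description)": 3,
        "Scene": 115,
    },
    "Anachronisms": {"Analepsis": 11, "Prolepsis": 0},
    "Diegetic level": {"Extradiegetic": 69, "Intradiegetic": 124, "Metadiegetic": 20},
    "Acts": {
        "Aftermath": 9,
        "Confession": 1,
        "Confrontation": 4,
        "Introduction": 10,
        "Investigation": 10,
        "Reveal": 10,
    },
    "Misc": {"Clue": 42, "Detective thought": 33},
}


def totals_per_tag_set(counts):
    """Sum the annotation counts within each tag set."""
    return {tag_set: sum(tags.values()) for tag_set, tags in counts.items()}


def grand_total(counts):
    """Sum the annotation counts over all tag sets."""
    return sum(totals_per_tag_set(counts).values())


if __name__ == "__main__":
    for tag_set, total in totals_per_tag_set(TAG_COUNTS).items():
        print(f"{tag_set}: {total}")
    print("Total annotations:", grand_total(TAG_COUNTS))
```

Summing the transcribed counts reproduces the 511 annotations stated above, with the diegetic level tags accounting for the largest share.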

Finally, annotated narratives were stored in TEI XML and CSV formats, with CSV generated from TEI XML via the teitocsv package.[4] Plain UTF-8 texts, annotated CATMA files (TEI XML and CSV), and a UTF-8 markdown readme describing the data set were packaged into a zip file on the South African Centre for Digital Language Resources (SADiLaR) repository. In addition to the markdown readme file, the zip file contains the following directories:

  • original: This directory contains the original texts of the ten mystery short stories as downloaded from Project Gutenberg. The names of these files correspond to the UTF-8 filenames downloaded from the Project Gutenberg website as indicated in Table 1.

  • txt: This directory contains the cleaned texts.

  • xml: This directory contains the annotated versions of texts in TEI XML format, generated using CATMA.

  • csv: This directory contains the CSV files of the texts that are generated from the CATMA TEI XML files in the xml directory.
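After unpacking the zip file, the directory layout above can be checked with a short script. This is a sketch under stated assumptions: the four directory names come from the list above, but the file extensions and the idea that a cleaned text, its TEI XML file, and its CSV file share the same base name are assumptions that should be verified against the readme. The function names are illustrative.

```python
from pathlib import Path

# Directory names as listed in the data set description. The extensions are
# assumptions based on those descriptions, not confirmed file names.
EXPECTED_DIRS = {"original": ".txt", "txt": ".txt", "xml": ".xml", "csv": ".csv"}


def inventory(root):
    """Map each expected directory to the sorted base names of its files."""
    root = Path(root)
    result = {}
    for name, extension in EXPECTED_DIRS.items():
        folder = root / name
        if folder.is_dir():
            result[name] = sorted(
                p.stem for p in folder.iterdir() if p.suffix.lower() == extension
            )
        else:
            result[name] = []
    return result


def missing_counterparts(inv):
    """List cleaned texts that lack an XML or CSV file with the same base name.

    Assumes (hypothetically) that base names match across directories.
    """
    missing = {}
    for stem in inv["txt"]:
        absent = [kind for kind in ("xml", "csv") if stem not in inv[kind]]
        if absent:
            missing[stem] = absent
    return missing
```

Calling `missing_counterparts(inventory("path/to/unzipped"))` on a complete copy should return an empty dictionary if the naming assumption holds; a non-empty result may simply mean the files are named differently.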

Sampling strategy

Stories were selected for their adherence to the whodunit genre, which offers rigid, comparable structures, natural information gain through clues and revelations, and support for genre-specific annotation (Heyns & Van Zaanen, 2024). Only works published between 1891 and 1924 were included to capture the genre’s formative consolidation after Arthur Conan Doyle’s establishment of structured conventions, logical deduction, and key archetypes, while excluding earlier psychological works by Poe and Collins that predate the puzzle-solving structure that defined this later period. Capping the date range at 1924 ensures public domain access via sources like Project Gutenberg. Texts were limited to 2,000–10,000 words to capture complete structures while keeping manual annotation manageable, unlike the SANTA Project’s snippet-based approach,[5][6] which cannot reveal full narrative arcs.
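The length criterion can be checked directly against the word counts in Table 1. The sketch below transcribes those counts by hand; the constant and function names are illustrative and not part of the data set.

```python
# Word counts transcribed by hand from Table 1.
WORD_COUNTS = {
    "The adventures of the Italian nobleman": 3839,
    "The jewel robbery at the grand metropolitan": 5101,
    "The mystery of Hunter's Lodge": 4618,
    "A case of identity": 7028,
    "A scandal in Bohemia": 8628,
    "The red-headed league": 9273,
    "A question of passports": 3767,
    "Story of the young man in holy orders": 5623,
    "The chronicles of Addington Peace": 6295,
    "The mystery of the steel disk": 7356,
}

# Length limits stated in the sampling strategy.
MIN_WORDS, MAX_WORDS = 2_000, 10_000


def out_of_range(counts, low=MIN_WORDS, high=MAX_WORDS):
    """Return the texts whose word count falls outside [low, high]."""
    return {title: n for title, n in counts.items() if not low <= n <= high}


if __name__ == "__main__":
    print("Shortest:", min(WORD_COUNTS.values()))
    print("Longest:", max(WORD_COUNTS.values()))
    print("Outside limits:", out_of_range(WORD_COUNTS))
```

All ten texts fall inside the stated limits, and the counts sum to the 61,528 words reported in Table 1.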

To streamline cleaning, collections with uniform formatting were prioritized: The Project Gutenberg eBooks of Great Short Stories (Volume 1), Poirot Investigates, The Adventures of Sherlock Holmes, and The League of the Scarlet Pimpernel. Multiple stories featuring the same detective (Poirot, Holmes) were selected for annotation consistency, with remaining stories chosen for diverse detective styles. The limited scope, due to manual annotation costs, will be expanded in future work.

Quality control

The annotation was performed by the first author, who has an in-depth understanding of each tag’s requirements due to their role in developing the guidelines presented in Heyns and Van Zaanen (2024). Quality control was systematic and iterative, following a multi-phase process to ensure consistency. Annotation proceeded sequentially: scenes and non-scenes were annotated first, followed by anachronic structures, diegetic levels, perspective, and clues, concluding with the acts. This was done to ensure consistent annotation throughout the texts in the data set.

Annotation of the texts across the different diegetic levels (extradiegetic, intradiegetic, and metadiegetic) proved challenging, especially concerning detailed decisions like handling letters or embedded analepses. For example, character interjections during an analepsis complicated decisions regarding whether each interjection required a new tag or should be included within the existing analepsis tag.

After a small subset of texts was annotated with narrative levels, the annotator encountered a case that did not fit the initial tagging rules relating to the distinction between intradiegetic and metadiegetic. The guidelines were refined, and all four previously annotated texts were re-annotated from the start, ensuring alignment with the correct definitions.

Additionally, all texts underwent a complete re-reading to verify tag consistency, particularly for complex structures such as embedded analepses and character interjections.

3 Data set description

Repository name

The data set is uploaded to the repository of the South African Centre for Digital Language Resources (SADiLaR), which can be found here: https://hdl.handle.net/20.500.12185/693.

Object name

Annotated Mystery Short Stories Data Set

Format names and versions

The texts are stored in plain text UTF-8 format. The annotations are stored in TEI XML format and CSV format (with semi-colons as separators) for easy further processing.
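Because the CSV files use semi-colons rather than commas, the delimiter has to be set explicitly when reading them. The sketch below uses Python's standard `csv` module. The column names and rows in `SAMPLE` are invented for illustration only; the actual header produced by the conversion should be read from the files themselves.

```python
import csv
import io


def read_annotation_rows(handle):
    """Read a semi-colon separated file and return (header, rows)."""
    reader = csv.DictReader(handle, delimiter=";")
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def read_annotation_file(path):
    """Open a UTF-8 CSV file from the csv directory and parse it."""
    with open(path, newline="", encoding="utf-8") as handle:
        return read_annotation_rows(handle)


# Hypothetical sample: these column names and rows are NOT taken from the
# data set; they only illustrate the delimiter handling.
SAMPLE = (
    "tag;text\n"
    "Clue;A torn glove lay on the sill.\n"
    'Reveal;"The key; at last, was found."\n'
)

if __name__ == "__main__":
    header, rows = read_annotation_rows(io.StringIO(SAMPLE))
    print(header)
    for row in rows:
        print(row)
```

Quoted fields may themselves contain semi-colons, as in the second sample row, which is one reason to prefer the `csv` module over splitting lines by hand.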

Creation dates

Start date: 2024-08-13; End date: 2025-04-01.

Data set creators

The texts in the data set were downloaded and annotated by the first author of this article: Nuette Heyns, North-West University, Potchefstroom, South Africa.

Language

The texts contained in the data set are all written in English.

License

The data set is made available under a Creative Commons Attribution-ShareAlike 4.0 International license.

Publication date

2025-08-29

4 Reuse potential

In computational narratology, this data set aids in refining algorithms for automatic narrative segmentation, plot structure analysis, and genre classification. The detailed annotations of elements like scene boundaries, diegetic levels, acts, focalization, voice, and genre-specific tags such as detective thoughts and reveals offer a structured resource for training and evaluating models.

As noted in Heyns & Van Zaanen (2024), few comparable full-text, genre-specific annotated data sets exist, enhancing its value. Scene and non-scene tags may be used to train models to detect boundaries in unseen texts, akin to SANTA project approaches (Reiter et al., 2019), while act and clue tags support plot arc extraction for classification, extending schemas like Eisenberg & Finlayson (2021) for temporal disruptions. The annotations build on SANTA frameworks for narrative levels and segmentation (Reiter et al., 2019).
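When scene boundaries predicted by a model are compared with the annotated ones, a simple starting point is exact-match precision, recall, and F1. The sketch below is one possible scoring scheme, not the evaluation used by the authors or by the SANTA project. It assumes that boundaries are represented as sentence indices at which a new scene starts, which is a hypothetical representation chosen for illustration.

```python
def boundary_scores(gold, predicted):
    """Exact-match precision, recall, and F1 for two sets of boundaries.

    Boundaries are assumed (hypothetically) to be sentence indices at which
    a new scene begins; a prediction counts only if it matches exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Invented indices for illustration, not taken from the data set.
    gold_boundaries = [12, 40, 77]
    predicted_boundaries = [12, 41, 77, 90]
    p, r, f = boundary_scores(gold_boundaries, predicted_boundaries)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Exact matching is strict: a prediction one sentence off counts as both a miss and a false alarm. Window-tolerant segmentation metrics are a common alternative when near misses should receive partial credit.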

Literary studies, linguistics, and digital humanities scholars may use it for genre studies, examining how mystery short stories use techniques like temporal disruptions for suspense (Eisenberg & Finlayson, 2021). It enables cross-genre comparisons, modeling annotations to contrast whodunit structures with forms like modernist novels (Kearns, 2020).

The data set also supports narrative generation and summarization, where plot structure and temporal flow are key (Droog Hayes et al., 2018). Clue and reveal patterns could guide generative models for suspenseful sequences or summarization prioritizing acts and diegetic shifts, building on SANTA-inspired automated story understanding (Gius et al., 2019a). Educationally, it illustrates mystery narrative techniques and contrasts with other genres.

Limitations include its single-genre focus, which restricts applicability to other types of fiction such as fantasy or historical fiction; its reliance on one annotator’s subjective interpretations, which still need validation; and its short-story scope, which limits how well it captures the complexity of longer narratives and leaves scaling challenges unaddressed.

Notes

[3] Sentence and word counts were computed using an online text analyzer: https://www.online-utility.org/text/analyzer.jsp.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Nuette Heyns: Conceptualization, Data curation, Formal analysis, Methodology, Resources, Validation, Writing – original draft, Writing – review and editing; Menno van Zaanen: Conceptualization, Methodology, Supervision, Writing – original draft, Writing – review and editing.

DOI: https://doi.org/10.5334/johd.391 | Journal eISSN: 2059-481X
Language: English
Submitted on: Sep 12, 2025
Accepted on: Nov 3, 2025
Published on: Dec 11, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Nuette Heyns, Menno van Zaanen, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.