
A Classification Benchmark Based on the Literary Theme Ontology

Open Access | Feb 2026

1 Introduction

In this research, we use human-annotated themes in TV series episodes to automatically classify the themes in TV episode summaries and subtitles. The goal is to introduce a new benchmark dataset of summaries and subtitles of TV series episodes that can be used for theme classification. Additionally, we present baseline results for the theme classification task, using traditional bag-of-words classification methods, such as logistic regression and support vector machines (SVM), as well as several small open-weight large language models (LLMs), as the latter are promising for classification of cultural data when annotations for fine-tuning are not available (Bamman et al., 2024).

The thematic annotations used in this work are drawn from the Literary Theme Ontology (LTO) v2025.04 release, which comprises 2,977 hierarchically organized literary themes and 4,218 thematically annotated stories (Onsjö & Sheridan, 2025a; Sheridan et al., 2019). This taxonomy is used to identify themes in different stories in the form of works of, for example, literature, films, and TV episodes. Using LTO thematic annotations, a new benchmark dataset consisting of themes occurring in 644 TV episode summaries and 956 TV series episode subtitles was created.

Themes can be seen as higher-order labels for categories of story elements, and they can relate stories to each other (Brinker, 1993; Rimmon-Kenan, 1995). As themes are an abstract categorization, we have chosen to work with the literary themes lowest in the LTO hierarchy. The more general parent themes can easily be retrieved via the Python package totolo (Onsjö & Sheridan, 2025b). Thus, this benchmark dataset of TV series episodes can be used not only for research on the specific themes used in this classification task, but also for research on their parent themes.

Our goal is to offer a resource for the automatic classification of themes. In the theme classification task, we use episode summaries and subtitles to predict the ten most frequent themes in the TV episodes. Since multiple themes can occur per episode, this is a multi-label classification task, which is hard to perform due to the limited data available for each theme. Additionally, themes in TV series are often not explicitly expressed in the summaries or subtitles. For example, a theme can also be derived from prior knowledge about the specific characters discussed: when two characters are married, any episode discussing their relationship can be linked to the theme husband and wife. However, the subtitles or the summaries do not necessarily mention that the couple is married, as it can be assumed that someone watching an episode or reading the summary has seen previous episodes of the TV series and is already aware of the characters’ married status. These implicit themes increase the complexity of the multi-label classification task. This new dataset of TV series episode summaries and subtitles has high potential for humanities research on literary themes, as it can be extended using the further available LTO annotations as a starting point.
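The multi-label setup can be made concrete by encoding each episode's themes as a binary indicator vector over the theme inventory. A minimal sketch with invented episode data and a three-theme inventory:

```python
# Convert per-episode theme lists into binary indicator vectors for
# multi-label classification. The themes and episodes are illustrative.
THEMES = ["husband and wife", "friendship", "time travel"]

def to_indicator(episode_themes, themes=THEMES):
    """Return a 0/1 vector marking which themes an episode carries."""
    present = set(episode_themes)
    return [1 if t in present else 0 for t in themes]

episodes = [
    ["husband and wife", "friendship"],  # an episode with two themes
    ["time travel"],
]
y = [to_indicator(e) for e in episodes]
# y == [[1, 1, 0], [0, 0, 1]]
```

Each theme thus becomes one binary prediction target, which is why scarce examples per theme make the task hard.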

2 Related work

The classification of fiction by theme and subject has been a persistent challenge in library and information science (LIS), knowledge organization, literary studies, and folklore studies. As this benchmark focuses on the textual representation of themes, this section focuses on text-based work.

Early 20th-century LIS scholars, such as E.A. Baker and William Borden, recognized the need for thematic classification to improve reader access, but traditional systems like Dewey Decimal and Library of Congress have historically focused on form, genre, and language rather than thematic content (Baker & Shepherd, 1987). This disciplinary emphasis on formal attributes has limited the ability of readers and researchers to discover fiction based on thematic resonance. Annelise Pejtersen’s work has sought to address these limitations by identifying subject matter, setting, author’s intention, and accessibility as critical dimensions for indexing fiction, advocating for systems that accommodate both denotative and connotative elements (Mark Pejtersen & Austin, 1983). Building on this, Jarmo Saarti introduced the Kaunokki thesaurus, which integrates facets such as events, actors, and spaces to better capture the complexity of fictional themes (Saarti, 2019). These contributions reflect a broader disciplinary shift toward user-centered and multidimensional approaches to classification. A notable advancement comes from the Integrative Levels Classification, which organizes knowledge around phenomena, enabling fiction to be indexed by its thematic content—such as “censorship” or “bravery” (Almeida & Gnoli, 2021). This phenomenon-based approach facilitates hybrid retrieval systems that bridge fiction and nonfiction.

The study of themes in literary studies has evolved through a dynamic interplay of interpretive, theoretical, and interdisciplinary approaches. Traditionally, thematic criticism focused on identifying and analyzing the central ideas or “aboutness” of literary works, often emphasizing the signified—the conceptual or ideological content—over formal or linguistic structures. Despite some skepticism, particularly during the rise of formalist and poststructuralist methodologies, thematic analysis has persisted and even flourished within politically engaged and interdisciplinary frameworks, such as feminist, ethnic, and cultural studies, where themes often serve as lenses for examining broader social and historical contexts (Sollors, 1993). The resurgence of thematic criticism in the late 20th century is evident in collections like Bremond et al. (1995), which seeks to theorize thematics by distinguishing it from related concepts like motif, subject matter, and concept, while also exploring its connections to narratology and structuralist analysis. Similarly, Louwerse and Van Peer (2008) expand the scope of thematic inquiry by integrating cognitive, computational, and cross-disciplinary perspectives, demonstrating how themes function not only in literature but also in discourse, media, and cultural artifacts.

In folklore studies, motif classification has long been a foundational method for analyzing narrative structures and thematic elements (Propp, 1968). The Aarne-Thompson-Uther Index categorizes folktales based on recurring motifs, such as “the hero’s journey” or “the wicked stepmother”, providing a systematic framework for comparative analysis across cultures and historical periods (Uther, 2004). Motifs are distinct, recurring narrative elements in folklore, such as “troll under the bridge” (Yarlott & Finlayson, 2016). Motifs and themes are thus both narrative elements that occur throughout a tale or TV series above the word level (Hagedorn & Darányi, 2022). As motifs in myths are recurring narrative elements throughout texts over time, motifs can be automatically detected using word similarity measures across texts with similar motifs. A zero-shot approach to motif classification consists in classifying motifs based on the highest cosine similarity between sets of texts and corresponding motifs (Matveeva & Malykh, 2022). Another successful approach to cluster motifs is labeled Latent Dirichlet Allocation (LDA) (Karsdorp & van den Bosch, 2013). Lastly, folktales with the same story type, defined as stories with similar recurring plot, motifs and themes, can be successfully classified using a ranking task (Nguyen et al., 2013). This line of research has only worked with written text; an open question is how well the automatic classification of motifs and themes can be performed on audiovisual material like films and TV series. In particular, it is underexplored whether relying only on textual data (e.g., subtitles, scripts, and summaries) related to multimodal artifacts can lead to good results.

The LTO focuses mostly on the thematic classification of films and TV series episodes. Onsjö and Sheridan (2020) show how theme enrichment analysis can be applied to the comparative study of audiovisual content, using the Star Trek franchise as a case study. They statistically identify themes that are significantly over-represented in specific subsets of episodes relative to a broader background set, revealing distinct thematic patterns across different Star Trek series. The analysis not only highlights what makes each series unique but also illustrates how thematic trends evolve over time, offering a quantitative, data-driven approach to comparing narrative content in visual media. Despite the usefulness of thematic annotation for the comparative analysis of fiction, the LTO is one of the few structured thematic resources available and its annotations only cover a limited set of works. It would be extremely useful to implement a system for the automatic identification of these themes. To this end, we present an annotated dataset that can be used to train computational models.

3 The Literary Theme Ontology and thematic annotation

The LTO comprises a hierarchically arranged collection of literary themes together with a trove of thematically annotated works of fiction. In the LTO, a literary theme is broadly conceived as either a topic addressed by a work of fiction or an opinion expressed by the work about that topic (Sheridan et al., 2019). Themes may often be adequately captured at an appropriate level of generality by a single word or short phrase, as in such stock examples as courage, coming of age, and coping with the passage of time. These examples are value-neutral abstractions. Themes may equally take the form of value judgments, such as ignorance is bliss or be wary of strangers. Themes may also represent more complex ideas, for which the short name serves only as a guide and must be supplemented by a fuller definition to be properly understood (e.g. capitalism). The LTO theme inventory and hierarchical structure draw inspiration from prior research across multiple domains, including reference works that curate lists of themes (Armstrong & Armstrong, 2001; McClinton-Temple, 2010; Seigneuret, 1988), online resources that tend to group themes into categories (Encyclopedia of Science Fiction contributors, 2025; TV Trope contributors, 2025; Wikipedia contributors, 2025), and formal theme/motif taxonomies (Bartalesi & Meghini, 2017; Khan et al., 2016).

LTO themes are arranged in a hierarchy, and themes therein can be closely related to each other. The upper levels of the LTO hierarchy are loosely based on a traditional classification system proposed by literary critic William Henry Hudson (Hudson, 1913; Sheridan et al., 2019). At the highest level, the three domain themes the human world, the natural world, and alternate reality descend from the generic root theme literary thematic entity. Each of these domain themes branches into numerous sub-themes. For example, the human world subdivides into personal human experience (i.e., themes concerning individuals, their inner lives, their interactions with others, and their engagement with the surrounding world), human idea about life (i.e., mainly common aphorisms about how to lead a good life), human nature (i.e., claims about inherent aspects of humanity), and society (i.e., themes addressing issues arising from people living together in societies).

According to the LTO guidelines (Onsjö & Sheridan, 2025c), a theme is considered “central” to a story when it is either treated more-or-less continuously throughout the main narrative or plays an important role in the story’s resolution. A theme is deemed “peripheral” when it is featured only briefly. In LTO story annotations, peripheral themes are labeled minor, while central themes are designated either major or choice. Choice themes reflect what the thematic annotator regards as truly significant within the story, and are therefore inherently more subjective than the central–peripheral distinction itself. In brief, the annotators watched the TV series episodes, recorded themes with motivations, and came to consensus by comparing their findings (see Appendix A.1 for an expanded discussion). For a comprehensive treatment of the annotation process see the LTO wiki (Onsjö & Sheridan, 2025c).

To illustrate the differences between choice, major, and minor themes, we briefly discuss the LTO annotation of the classic fairy tale Little Red Riding Hood (Onsjö & Sheridan, 2025c), which is part of the LTO wiki.1 The choice themes are beware of strangers, as this theme captures what is generally considered to be the moral of the story, and appearances can be deceiving, as this theme could also be argued to be a moral of the story. The major themes are human childhood and wicked character vs. virtuous character, as these themes occur throughout the majority of the fairy tale. The minor themes are coping with a loved one being gravely ill, grandmother and granddaughter, and mother and daughter. Each theme is motivated; for example, the motivation for human childhood is “Little Red Riding Hood innocently trusted the Wolf, and engaged in childish diversions such as chasing butterflies”, and the motivation for coping with a loved one being gravely ill is “The mother was concerned because the grandmother had recently been ‘very ill’”.

4 Dataset description

We derived two different datasets starting from the list of annotated LTO stories: one consisting of summaries collected from Fandom (Fandom, 2025), and one consisting of subtitles. The subtitles were obtained via TVSubs.net (2025) and via VLsub 0.11.1, an extension of the media player VLC that searches for the requested subtitles on Open Subtitles (2025). Due to the varying availability of summaries and subtitles, the datasets do not contain the same annotated TV series episodes (see Tables 1 and 2). At the moment of publication, the LTO contained 2,070 annotated TV episodes, of which 1,906 had available subtitles and 1,345 had available summaries. Therefore, the two datasets also contain different themes.

Table 1

The summary dataset, showing the number of episodes, the shortest episode summary length, the longest episode summary length, and the average episode summary length per episode per TV series.

| TV SERIES | # EPISODES | MIN WORD LENGTH | MAX WORD LENGTH | MEAN WORD LENGTH |
|---|---|---|---|---|
| Babylon 5 | 45 | 1,068 | 2,878 | 1,767.6 |
| Black Mirror | 12 | 578 | 3,837 | 1,581.7 |
| Futurama | 81 | 255 | 2,117 | 716.3 |
| Game of Thrones | 2 | 2,518 | 4,860 | 3,689.5 |
| Guillermo del Toro’s Cabinet of Curiosities | 5 | 1,111 | 3,621 | 2,308.6 |
| Red Dwarf | 39 | 571 | 2,220 | 952.1 |
| Sherlock | 2 | 526 | 813 | 669.5 |
| Star Trek: Deep Space Nine | 109 | 1,253 | 7,490 | 2,638.7 |
| Star Trek: Enterprise | 43 | 1,122 | 6,458 | 2,148.6 |
| Star Trek: The Animated Series | 5 | 712 | 1,517 | 941.2 |
| Star Trek: The Next Generation | 66 | 1,160 | 6,910 | 2,440.7 |
| Star Trek: The Original Series | 38 | 841 | 3,765 | 1,853.4 |
| Star Trek: Voyager | 59 | 1,265 | 6,182 | 2,577.3 |
| Tales from the Crypt (1989) | 31 | 95 | 1,031 | 572.1 |
| Tales from the Loop (2020) | 4 | 2,577 | 3,322 | 2,958 |
| The Twilight Zone Franchise | 103 | 154 | 1,093 | 373 |
| Total | 644 | 95 | 7,490 | 1,630.8 |
Table 2

The subtitles dataset, showing the number of episodes, the shortest subtitle length of an episode, the longest subtitle length of an episode, and the average subtitle length per episode, per TV series.

| TV SERIES | # EPISODES | MIN WORD LENGTH | MAX WORD LENGTH | MEAN WORD LENGTH |
|---|---|---|---|---|
| Alfred Hitchcock Presents | 185 | 969 | 3,584 | 2,527.6 |
| Amazing Stories (1985) | 15 | 1,089 | 2,402 | 1,608.3 |
| Amazing Stories (2020) | 4 | 3,649 | 4,892 | 4,300 |
| Babylon 5 | 43 | 3,197 | 5,514 | 4,400.1 |
| Black Mirror | 13 | 2,563 | 7,185 | 4,680.1 |
| Brideshead Revisited (1981) | 10 | 4,012 | 9,604 | 5,087.4 |
| Futurama | 58 | 1,770 | 9,402 | 2,659.4 |
| Game of Thrones | 2 | 3,460 | 5,416 | 4,438 |
| Guillermo del Toro’s Cabinet of Curiosities | 5 | 1,964 | 4,825 | 3,511 |
| I Claudius | 9 | 4,921 | 5,943 | 5,300.8 |
| Piece of Cake (1988) | 1 | 4,608 | 4,608 | 4,608 |
| Red Dwarf | 7 | 2,680 | 3,539 | 3,222 |
| Sherlock | 10 | 1,460 | 10,252 | 8,400.1 |
| Star Trek: Deep Space Nine | 98 | 3,073 | 4,970 | 4,080.6 |
| Star Trek: Enterprise | 24 | 2,931 | 4,878 | 3,851.8 |
| Star Trek: The Animated Series | 2 | 1,925 | 2,353 | 2,139 |
| Star Trek: The Next Generation | 56 | 2,756 | 9,622 | 4,079.2 |
| Star Trek: The Original Series | 30 | 3,132 | 5,681 | 4,340.1 |
| Star Trek: Voyager | 48 | 3,241 | 5,667 | 4,692.4 |
| Tales from the Crypt (1989) | 74 | 392 | 3,514 | 2,101.1 |
| Tales from the Loop (2020) | 3 | 677 | 1,954 | 1,440.3 |
| Tales of the Unexpected | 77 | 1,499 | 4,023 | 2,441.1 |
| The Alfred Hitchcock Hour | 84 | 2,942 | 7,408 | 4,925.4 |
| The Twilight Zone Franchise | 98 | 486 | 5,421 | 2,397.2 |
| Total | 956 | 392 | 10,252 | 3,374.8 |

The summaries and subtitles were collected between April 2024 and June 2025. LTO themes have been annotated continuously since 2017, and annotation is ongoing. The benchmark was created using LTO version v2025.04, published on April 14, 2025.

The LTO annotations were the basis for the datasets, with summaries and subtitles collected for the most common themes among the annotated TV episodes. We extracted the major and choice themes in TV series episodes from the LTO dataset and matched them with the available summaries and subtitles. We chose not to consider the minor themes, as they are not consistently present in an episode and are therefore unlikely to be included in summaries or mentioned in the subtitles. In order to train a theme classifier, a sufficient number of examples is required for each theme. We therefore included only the ten most frequent major themes in each dataset. In this section, we introduce the two datasets and the themes that occur in them.
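The selection procedure just described can be sketched as follows; the `annotations` records and field values are hypothetical stand-ins for the LTO release data:

```python
from collections import Counter

# Hypothetical annotation records: (episode_id, theme, weight).
annotations = [
    ("ep1", "husband and wife", "major"),
    ("ep1", "friendship", "choice"),
    ("ep1", "grandmother and granddaughter", "minor"),
    ("ep2", "husband and wife", "major"),
    ("ep2", "time travel", "minor"),
]

# Keep only central themes (major/choice); minor themes are dropped.
central = [(ep, th) for ep, th, w in annotations if w in {"major", "choice"}]

# Restrict to the N most frequent themes (N=10 in the paper; 2 here).
top = {th for th, _ in Counter(th for _, th in central).most_common(2)}
labels = {}
for ep, th in central:
    if th in top:
        labels.setdefault(ep, []).append(th)
# labels == {"ep1": ["husband and wife", "friendship"], "ep2": ["husband and wife"]}
```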

Repository location https://doi.org/10.5281/zenodo.17611141

Repository name Zenodo

Object name The LTO classification benchmark

Format names and versions CSV

Creation dates 2024-08-04 – 2025-06-19

Dataset creators Noa Visser Solissa (University of Groningen) was responsible for conceptualization, data curation, formal analysis, investigation, and methodology; Paul Sheridan (University of Prince Edward Island) was responsible for conceptualization, data curation, and validation; and Mikael Onsjö (Independent Researcher) was responsible for data curation.

Language English

License Theme annotations and TV episode summaries: Creative Commons Attribution-Share Alike License 3.0 (Unported) (CC BY-SA); TV episode subtitles: CC-BY

Publication date 2025-11-14

4.1 Summaries

The summary dataset consists of 644 TV episodes from 16 TV series (see Table 1) that first aired between 1959 and 2023. The majority of the dataset consists of Star Trek franchise episodes (320 episodes), followed by the Twilight Zone franchise (103 episodes). The summaries are in English and average 9,745 characters, which is about 1,500–2,000 words; 56% of the summaries are below average length.

As the summaries are fan-made, their length and layout differ strongly per fandom. For example, the Star Trek franchise summarizes episodes per act, whereas the shorter episodes of the Twilight Zone franchise have summaries of about 100 words. This can also be seen in Table 1: the shortest summary in the dataset is 551 characters (94 words) and the longest is 46,027 characters (7,345 words). Thus, 48% of the dataset consists of extensive per-act Star Trek summaries, and 16% of Twilight Zone summaries ranging from 155 to 1,066 words. We preferred Fandom (2025) as a source over, for example, Wikipedia, because its summaries are longer, contain more plot details, and are available for each episode. As with any online source of collaborative content creation, Fandom can contain biases due to the specific interests of its editors, but it is the most thorough source available in English.

4.2 Subtitles

The English subtitles dataset contains 956 TV episodes from 24 different TV series that first aired between 1955 and 2023 (see Table 2). The subtitles were obtained as .srt files and converted to .txt files.2 As in the summary dataset, the majority of the subtitle dataset consists of Star Trek franchise episodes (258 episodes). However, the second-largest TV series in the dataset is Alfred Hitchcock Presents. The subtitle length is on average 9,754 characters; 54% of the subtitles are below this average.
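The .srt-to-.txt conversion amounts to stripping cue indices, timestamps, and markup while keeping the spoken lines. A minimal sketch for standard SubRip files (the sample cue is invented, and the actual conversion script may differ):

```python
import re

def srt_to_text(srt: str) -> str:
    """Strip SubRip cue indices, timestamps, and basic tags; keep dialogue."""
    lines = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit():
            continue  # blank line or cue index
        if re.match(r"\d{2}:\d{2}:\d{2},\d{3} --> ", line):
            continue  # timestamp line
        lines.append(re.sub(r"</?[a-z]+>", "", line))  # drop <i> etc.
    return " ".join(lines)

sample = "1\n00:00:01,000 --> 00:00:03,000\n<i>Space,</i> the final frontier.\n"
print(srt_to_text(sample))  # "Space, the final frontier."
```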

4.3 Themes

Seven themes occur in both datasets: father and son, friendship, greed for riches, husband and wife, infatuation, romantic love, and the desire for vengeance (see Table 3). In both datasets, the most frequent theme is husband and wife. Unique to the summary dataset are the themes human vs. captivity, humanoid robot, and time travel, whereas in the subtitle dataset we find the themes extramarital affair, murder, and spouse murder. The LTO provides precise definitions for its themes (Sheridan et al., 2019). The following theme definitions are used in the two datasets:

  • father and son: The relationship between a father and his son is featured.

  • friendship: The friendship between two characters is featured.

  • greed for riches: A character exhibits an inordinate desire for wealth such as money, luxuries, and the like.

  • husband and wife: The relationship between husband and wife is featured.

  • infatuation: An intense but (typically) short-lived passion that may peter out or settle into more enduring romantic love.

  • romantic love: Featured is that peculiar sort of love between people so often associated with sexual attraction.

  • the desire for vengeance: A character seeks retribution over a perceived injury or wrong.

  • human vs. captivity: A struggle between captive and captor is featured.

  • humanoid robot: An automaton that resembles a human being is featured.

  • time travel: Traveling between past and future points in time is featured.

  • extramarital affair: A character who is married engages in a sexual encounter or relationship outside the marriage, and deals with the consequences.

  • murder: The crime of unlawful and intentional homicide is featured.

  • spouse murder: One spouse in a married couple seeks to bring about the death of their partner.

Table 3

The 10 most frequent themes in the available summaries and subtitles. The total number of theme occurrences is higher than the total number of episodes in each dataset, as one episode can contain multiple themes.

(A) SUMMARIES

| THEME | # OCCURRENCES |
|---|---|
| father and son | 73 |
| friendship | 101 |
| greed for riches | 62 |
| human vs. captivity | 63 |
| humanoid robot | 75 |
| husband and wife | 113 |
| infatuation | 104 |
| romantic love | 88 |
| the desire for vengeance | 81 |
| time travel | 63 |
| Total | 832 |

(B) SUBTITLES

| THEME | # OCCURRENCES |
|---|---|
| extramarital affair | 96 |
| father and son | 98 |
| friendship | 130 |
| greed for riches | 108 |
| husband and wife | 348 |
| infatuation | 140 |
| murder | 166 |
| romantic love | 132 |
| spouse murder | 127 |
| the desire for vengeance | 155 |
| Total | 1,500 |

In the summary dataset, two themes stand in a parent–child relation in the LTO hierarchy, as infatuation is a child of romantic love. A similar connection exists in the subtitle dataset: spouse murder is a child of the theme murder of a lover, which does not occur in the datasets, but which is in turn a child of murder, which does occur in the subtitle dataset. Thus, murder is the grandparent of the theme spouse murder.

The summary dataset contains 832 theme occurrences across the 644 summaries, as each episode can have multiple themes. The subtitle dataset contains 1,500 theme occurrences across the 956 subtitles. The themes are not evenly distributed across the TV series. For example, in the summary dataset, 72 of the 75 humanoid robot theme occurrences appear in the TV series Futurama and Red Dwarf (for the theme distribution across TV series, see Appendix A.2).

5 Method

To estimate the difficulty of theme classification using these two datasets, we tested four different classification methods. For replicability and reproducibility purposes, we focused on simple but established pipelines and on local, open-weight LLMs. Both datasets were split into a train (80%) and test (20%) set. The train and test sets were created randomly, but with a theme distribution comparable to that of the full subtitle and summary datasets. The bag-of-words, FastText, and SetFit classifiers were all trained on the complete train set, whereas the LLM classification uses a zero-shot and a few-shot approach. The code for each classification method can be found at https://github.com/theme-ontology/lto-classification-benchmark.
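The split can be sketched as a seeded random 80/20 partition, after which the per-theme distributions of the two sets are compared (a simplified illustration, not the exact procedure used to create the benchmark splits):

```python
import random
from collections import Counter

def split(items, test_frac=0.2, seed=42):
    """Seeded random split; items are (text, theme_list) pairs."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def theme_dist(items):
    """Relative frequency of each theme within one split."""
    counts = Counter(t for _, themes in items for t in themes)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# Invented toy data: 100 texts, alternating between two themes.
data = [(f"text {i}", ["theme a" if i % 2 else "theme b"]) for i in range(100)]
train, test = split(data)
# Inspect theme_dist(train) vs. theme_dist(test); reshuffle with a new
# seed if the distributions diverge too much.
```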

Bag-of-Words

We performed bag-of-words classification on the ten themes using logistic regression and SVM (Pedregosa et al., 2011). We tested both models with different parameters for the vector size and n-gram range: unigrams and bigrams, with vocabularies of 1,000, 5,000, and 10,000 n-grams. In Section 6, we report only the best-performing logistic regression and SVM models.
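A pipeline of this kind can be sketched with scikit-learn, with a one-vs-rest wrapper handling the multi-label setting (the toy texts and labels are invented; the parameters mirror one of the tested configurations, and the exact training setup may differ):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Unigram+bigram bag-of-words (capped at 10,000 features) feeding a
# one-vs-rest linear SVM: one binary classifier per theme.
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), max_features=10_000),
    OneVsRestClassifier(LinearSVC()),
)

texts = [
    "the couple argued all night",
    "they travelled back in time",
    "husband and wife made peace",
    "the machine sent her to the past",
]
# Binary indicator matrix; columns: [husband and wife, time travel].
y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
clf.fit(texts, y)
pred = clf.predict(["a quarrel between husband and wife"])  # shape (1, 2)
```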

FastText

We also implemented logistic regression and SVM classification using word embeddings created with FastText (Bojanowski et al., 2017), as these embeddings exploit subword information.

SetFit

Since the dataset contains limited examples per theme, we tested SetFit (Tunstall et al., 2022) with logistic regression for the theme classification. SetFit is designed to learn from few examples using contrastive pairs and can be used with three sampling strategies: oversampling, undersampling, and unique. In the oversampling strategy, an equal number of positive and negative training pairs are sampled, with the minority pair type sampled up to match the majority pair type. In the undersampling strategy, an equal number of positive and negative training pairs are also sampled, but, contrary to oversampling, the majority pair type is undersampled to match the minority pair type. Lastly, in the unique sampling strategy, all possible pairs are sampled exactly once.
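The oversampling strategy can be illustrated in isolation: positive pairs share a label, negative pairs do not, and the minority pair type is resampled until the counts match (a simplified sketch of the sampling idea, not SetFit's internals):

```python
import random
from itertools import combinations

def contrastive_pairs(examples, seed=0):
    """Build label-based pairs; oversample the minority type to balance."""
    pos, neg = [], []
    for (t1, l1), (t2, l2) in combinations(examples, 2):
        (pos if l1 == l2 else neg).append((t1, t2))
    rng = random.Random(seed)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    while len(minority) < len(majority):
        minority.append(rng.choice(minority))  # resample with replacement
    return pos, neg

examples = [("text a", "love"), ("text b", "love"), ("text c", "murder")]
pos, neg = contrastive_pairs(examples)
# pos and neg now have equal size (here, 2 each)
```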

LLM

We tested three open-weight LLMs: Mistral 7B-instruct (Mistral AI, 2023), gemma3:12b-it-qat (Kamath et al., 2025), and llama3.1:8b-instruct-q8_0 (Grattafiori et al., 2024). All three are instruct models, fine-tuned to follow an instruction to complete a task. The LLMs were tested on the complete test set, with the complete summaries and subtitles used as input. We did not use excerpts or shortened input texts, as the themes might not be equally strongly present throughout an entire episode. We used only small models that can be run locally, even with limited computational resources. We tried a zero-shot and a few-shot approach for the LLMs. The prompts can be seen in Figures 3 to 6 in the appendix (see Appendix A). In the zero-shot approach, the prompt contains the instruction for the theme classification, the ten themes with the theme definitions used in the LTO, and one example of the desired output. For the few-shot prompt, the same prompt was used, but extended with an example input text and an example output text. For the few-shot summary classification, two example summaries from the train set were included in the prompt. For the subtitle few-shot prompt, only one train-set subtitle example input was included, as multiple examples could not be included due to prompt length constraints.
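Assembling a zero-shot prompt of this shape can be sketched as follows (the wording and the two-theme subset are illustrative; the exact prompts appear in Figures 3 to 6 of the appendix):

```python
# Illustrative zero-shot prompt builder; definitions follow the LTO style.
THEME_DEFS = {
    "husband and wife": "The relationship between husband and wife is featured.",
    "time travel": "Traveling between past and future points in time is featured.",
}

def build_prompt(text: str) -> str:
    """Combine the instruction, theme definitions, and episode text."""
    defs = "\n".join(f"- {t}: {d}" for t, d in THEME_DEFS.items())
    return (
        "Identify which of the following themes occur in the episode text. "
        "Answer with a list of theme names, e.g. ['husband and wife'].\n\n"
        f"Themes:\n{defs}\n\nEpisode text:\n{text}\n\nThemes present:"
    )

prompt = build_prompt("They travelled back to 1955 to save the marriage.")
```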

6 Results

As can be seen in Table 4, the best-performing classification model on both datasets is the bag-of-words SVM with bigrams: with 10,000 features for the summaries and 5,000 features for the subtitles. As can be seen in Tables 5 and 6, the theme time travel has the highest F1 score for the summary dataset, and husband and wife the highest for the subtitle dataset. In both cases, these themes also have the highest precision. In both datasets, the theme with the highest recall has a low precision, resulting in a low F1 score.

Table 4

The results for the summary and subtitle datasets. All models were tested on the complete test set and evaluated using the macro precision, recall, and F1 scores. For both datasets, the highest F1 score is highlighted in bold.

| MODEL | SUMMARY PRECISION | SUMMARY RECALL | SUMMARY F1 | SUBTITLES PRECISION | SUBTITLES RECALL | SUBTITLES F1 |
|---|---|---|---|---|---|---|
| LogReg bigrams 1,000 | 0.33 | 0.74 | 0.44 | 0.31 | 0.78 | 0.42 |
| LogReg bigrams 5,000 | 0.34 | 0.75 | 0.45 | 0.31 | 0.78 | 0.43 |
| SVM bigrams 5,000 | 0.38 | 0.77 | 0.50 | 0.32 | 0.79 | **0.44** |
| SVM bigrams 10,000 | 0.39 | 0.77 | **0.51** | 0.32 | 0.74 | 0.44 |
| FastText LogReg | 0.23 | 0.60 | 0.33 | 0.22 | 0.66 | 0.33 |
| FastText SVM | 0.24 | 0.63 | 0.34 | 0.24 | 0.63 | 0.34 |
| SetFit Undersampling | 0.43 | 0.22 | 0.28 | 0.18 | 0.24 | 0.20 |
| SetFit Unique | 0.50 | 0.28 | 0.34 | 0.24 | 0.23 | 0.21 |
| SetFit Oversampling | 0.44 | 0.30 | 0.34 | 0.17 | 0.17 | 0.16 |
| *LLM: zero-shot* | | | | | | |
| Mistral 7B instruct | 0.37 | 0.48 | 0.36 | 0.33 | 0.19 | 0.20 |
| Gemma3:12b-it-qat | 0.37 | 0.64 | 0.42 | 0.30 | 0.25 | 0.21 |
| llama3.1:8b-instruct-q8_0 | 0.32 | 0.52 | 0.38 | 0.31 | 0.21 | 0.22 |
| *LLM: few-shot* | | | | | | |
| Mistral 7B instruct | 0.31 | 0.42 | 0.31 | 0.31 | 0.13 | 0.16 |
| Gemma3:12b-it-qat | 0.38 | 0.51 | 0.40 | 0.33 | 0.14 | 0.15 |
| llama3.1:8b-instruct-q8_0 | 0.37 | 0.45 | 0.37 | 0.33 | 0.16 | 0.15 |

Table 5

The results for the highest-performing model on the summary dataset (SVM, 10,000 features). Results are shown per theme.

| THEME | PRECISION | RECALL | F1 | SUPPORT |
|---|---|---|---|---|
| husband and wife | 0.51 | 0.77 | 0.62 | 26 |
| infatuation | 0.38 | 0.67 | 0.48 | 18 |
| friendship | 0.22 | 0.95 | 0.36 | 20 |
| romantic love | 0.18 | 0.64 | 0.28 | 14 |
| the desire for vengeance | 0.29 | 0.63 | 0.40 | 16 |
| humanoid robot | 0.58 | 0.94 | 0.71 | 16 |
| father and son | 0.40 | 0.92 | 0.56 | 13 |
| human vs. captivity | 0.32 | 0.73 | 0.44 | 11 |
| time travel | 0.82 | 0.75 | 0.78 | 12 |
| greed for riches | 0.33 | 0.80 | 0.47 | 15 |
| macro avg | 0.40 | 0.78 | 0.51 | 161 |

Table 6

The results for the highest-performing model on the subtitle dataset (SVM, 5,000 features). Results are shown per theme.

| THEME | PRECISION | RECALL | F1 | SUPPORT |
|---|---|---|---|---|
| husband and wife | 0.66 | 0.85 | 0.74 | 65 |
| murder | 0.34 | 0.84 | 0.49 | 32 |
| the desire for vengeance | 0.22 | 0.56 | 0.31 | 32 |
| infatuation | 0.39 | 0.61 | 0.48 | 33 |
| romantic love | 0.26 | 0.89 | 0.40 | 26 |
| friendship | 0.19 | 0.81 | 0.30 | 21 |
| spouse murder | 0.28 | 0.96 | 0.43 | 24 |
| greed for riches | 0.27 | 0.72 | 0.39 | 25 |
| father and son | 0.32 | 0.88 | 0.47 | 17 |
| extramarital affair | 0.28 | 0.78 | 0.41 | 18 |
| macro avg | 0.32 | 0.79 | 0.44 | 293 |

Overall, the bag-of-words logistic regression, SVM, and FastText classifiers show a pattern of high average recall but low average precision on both datasets, meaning that there are many false positives among their classifications. In the summary dataset, the SetFit models have high precision and low recall, meaning that the models are accurate in predicting the different classes but miss many true positives. For the subtitle dataset, the SetFit models have both low precision and low recall, indicating that the models are not successful at predicting the ten themes.

For the LLMs, on the summary dataset the gemma3:12b model approaches the results of the bag-of-words models. In the zero-shot setting, all three models predominantly give output in the correct format, as shown in the prompt (see Table 9). Thus, the zero-shot models for the summaries are predominantly successful at following the instructions in the prompt. However, the few-shot models are less successful at predicting themes in the requested output format, despite having two example input texts and two example outputs. Thus, including examples in the prompt does not result in more predictions in the correct output format.

This same pattern is seen for the subtitle dataset. In general, it seems to be much more difficult for the LLMs to follow the instructions when identifying themes in subtitles. Due to these wrongly formatted outputs, the zero-shot and few-shot LLM results have the lowest scores out of all the models used.
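Such format failures can be detected automatically by checking whether a reply parses as a list of known theme names (a minimal sketch, assuming the prompt requests a Python-style list; the theme subset shown is illustrative):

```python
import ast

VALID_THEMES = {"husband and wife", "murder", "time travel"}  # sample subset

def parse_output(reply: str):
    """Return the predicted theme list, or None if the format is wrong."""
    try:
        value = ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError):
        return None  # reply is not a parseable Python literal
    if not isinstance(value, list) or not all(isinstance(t, str) for t in value):
        return None  # parseable, but not a list of strings
    return [t for t in value if t in VALID_THEMES]

assert parse_output("['murder', 'time travel']") == ["murder", "time travel"]
assert parse_output("The themes are murder and revenge.") is None
```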

The best LLM results can be seen in Tables 7 and 8. Time travel has the highest F1 score for the summaries, and murder for the subtitles. The pattern of high precision resulting in a high F1 score is not seen in the best LLM results. The best LLM results do show that some themes are particularly hard to detect, resulting in very low F1 scores, such as human vs. captivity for the summaries and the desire for vengeance in both datasets.

Table 7

The results for the highest-performing LLM on the summary dataset (Gemma3:12b-it-qat, zero-shot). Results are shown per theme.

| THEME | PRECISION | RECALL | F1 | SUPPORT |
|---|---|---|---|---|
| husband and wife | 0.59 | 0.62 | 0.60 | 26 |
| infatuation | 0.18 | 0.78 | 0.29 | 18 |
| friendship | 0.22 | 0.85 | 0.35 | 20 |
| romantic love | 0.19 | 0.71 | 0.30 | 14 |
| the desire for vengeance | 0.20 | 0.50 | 0.29 | 16 |
| humanoid robot | 0.57 | 0.50 | 0.53 | 16 |
| father and son | 0.75 | 0.46 | 0.57 | 13 |
| human vs. captivity | 0.18 | 0.18 | 0.18 | 11 |
| time travel | 0.53 | 0.83 | 0.65 | 12 |
| greed for riches | 0.26 | 0.93 | 0.41 | 15 |
| macro avg | 0.37 | 0.64 | 0.42 | 161 |

Table 8

The results for the highest-performing LLM on the subtitle dataset (llama3.1:8b-instruct-q8_0, zero-shot). Results are shown per theme.

THEME                     PRECISION  RECALL  F1    SUPPORT
husband and wife          0.67       0.19    0.29  65
murder                    0.44       0.72    0.55  32
the desire for vengeance  0.25       0.03    0.06  32
infatuation               0.15       0.09    0.11  33
romantic love             0.27       0.12    0.16  26
friendship                0.18       0.14    0.16  21
spouse murder             0.29       0.08    0.13  24
greed for riches          0.39       0.20    0.26  25
father and son            0.33       0.24    0.28  17
extramarital affair       0.14       0.33    0.20  18
macro avg                 0.31       0.21    0.22  293
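The per-theme scores and the macro average in Tables 7 and 8 are standard multi-label classification metrics. As a sketch of how they could be computed with scikit-learn (the theme subset and the gold/predicted label sets below are illustrative toy data, not the benchmark itself):

```python
from sklearn.metrics import precision_recall_fscore_support
from sklearn.preprocessing import MultiLabelBinarizer

themes = ["friendship", "murder", "time travel"]  # illustrative subset
mlb = MultiLabelBinarizer(classes=themes)

# Hypothetical gold annotations and model predictions per episode.
y_true = mlb.fit_transform([["murder"], ["friendship", "murder"], ["time travel"]])
y_pred = mlb.transform([["murder"], ["murder"], ["friendship", "time travel"]])

# Per-theme precision, recall, F1, and support, then the macro average.
p, r, f1, support = precision_recall_fscore_support(y_true, y_pred, zero_division=0)
for theme, pi, ri, fi, si in zip(themes, p, r, f1, support):
    print(f"{theme}: P={pi:.2f} R={ri:.2f} F1={fi:.2f} support={si}")
print(f"macro avg F1: {f1.mean():.2f}")
```

The macro average weights every theme equally regardless of support, which explains why a few hard themes (e.g. the desire for vengeance in Table 8) can pull the macro F1 down sharply.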

Table 9 shows the percentage of predictions made by the LLMs that are not a list of strings. When the test set is split into examples below and above the average length of the respective subtitles or summaries, it turns out that almost all wrongly formatted predictions occur when the input text is above the average length. 55.2% of the subtitle test set and 62% of the summary test set are below average length. The LLMs are thus successful at following the prompt instructions when the input text is below average length.

Table 9

The percentage of wrong output per model for both the summary and subtitle datasets. A prediction counts as wrong output when the returned message is in a different format than the requested list of themes. The first column gives the percentage of wrong output over the whole test set; the second column, over the input texts shorter than the average summary or subtitle length in the dataset; and the last column, over the input texts above the average length.

MODEL                       WRONG OUTPUT  WRONG OUTPUT BELOW AVG  WRONG OUTPUT ABOVE AVG
Summary
LLM: zero-shot
Mistral 7B instruct         14.0%         0.0%                    14.0%
Gemma3:12b-it-qat           10.1%         0.0%                    10.1%
llama3.1:8b-instruct-q8_0   10.1%         0.0%                    10.1%
LLM: few-shot
Mistral 7B instruct         20.9%         0.0%                    20.9%
Gemma3:12b-it-qat           16.3%         0.8%                    15.5%
llama3.1:8b-instruct-q8_0   17.8%         0.0%                    17.8%
Subtitle
LLM: zero-shot
Mistral 7B instruct         48.4%         0.0%                    48.4%
Gemma3:12b-it-qat           48.4%         0.0%                    48.4%
llama3.1:8b-instruct-q8_0   47.9%         0.0%                    47.9%
LLM: few-shot
Mistral 7B instruct         62.5%         0.0%                    62.5%
Gemma3:12b-it-qat           54.2%         0.0%                    54.2%
llama3.1:8b-instruct-q8_0   52.1%         0.0%                    52.1%
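The below/above-average split underlying Table 9 can be reproduced with a simple length threshold. The sketch below assumes whitespace tokens as the length unit, which is an assumption and may differ from the measure used in the paper:

```python
def wrong_output_rates(texts, is_wrong, avg_len=None):
    """Split test items at the average text length (measured here in
    whitespace tokens, an assumed unit) and report the wrong-output
    percentage overall, below average, and at or above average."""
    lengths = [len(t.split()) for t in texts]
    if avg_len is None:
        avg_len = sum(lengths) / len(lengths)
    below = [w for l, w in zip(lengths, is_wrong) if l < avg_len]
    above = [w for l, w in zip(lengths, is_wrong) if l >= avg_len]

    def rate(flags):
        return 100 * sum(flags) / len(flags) if flags else 0.0

    return rate(list(is_wrong)), rate(below), rate(above)
```

Applied to the benchmark, this kind of split is what reveals that nearly all formatting failures fall in the above-average-length half of each test set.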

6.1 Case study

To gain a deeper understanding of the difficulty of the task and of the models’ predicted themes, we compare the results for two sample input texts from the test set. Since the theme classifications on summaries obtain the best results, we analyze the summaries and predicted classifications more closely by comparing the SVM classifier predictions to those of the different LLMs. Figures 1 and 2 show two shorter summaries from the summary test set. For the first summary (see Figure 1), the human-annotated theme is humanoid robot. The word “robot” is mentioned five times, and different parts of the robot, such as the head, torso, and arms, are discussed, so the theme is rather explicitly present in the summary. Humanoid robot has above-average F1 scores across the different models, implying that this theme is comparatively easy to predict (see Tables 5 and 7).

Figure 1

A shortened version of the summary input text of the episode Bendin’ in the Wind of Futurama, which is the 13th episode of season 3.

Figure 2

A shortened version of the summary input text of the episode The Cyber House Rules of Futurama, which is the 9th episode of season 3.

The best-performing SVM classifier predicts the themes humanoid robot as well as friendship. The zero-shot LLMs gemma3:12b and llama3.1:8b make the same two predictions, whereas mistral:7B predicts humanoid robot and greed for riches. For the few-shot LLMs, gemma3:12b and llama3.1:8b correctly predict humanoid robot without any other themes, whereas mistral:7B again predicts humanoid robot and greed for riches. The prediction greed for riches could be explained by the occurrence of the terms “money” and “money-making tool”. Furthermore, greed for riches is actually annotated as a minor theme of this episode in the LTO: the theme is present in the episode, but not as a major theme, which is our focus. The wrongly predicted theme friendship cannot be as explicitly connected to word usage in the text, but presumably the summary contains words that also occur in train-set summaries annotated with friendship, given that the SVM classifier and the zero-shot gemma3:12b and llama3.1:8b models all predict this theme.

In Figure 2, the human-annotated theme is infatuation. This theme has below average results across different models (see Tables 5, 6, 7, 8). The following motivation for the theme is given in the LTO: “Leela and Adlai Atkins” (Onsjö & Sheridan, 2025a). Thus, the relationship between the two characters implies infatuation. In the summary, it is mentioned that Leela had a crush on Adlai Atkins in the past, and that they get into a relationship later in the episode. It is not mentioned that they are infatuated with each other, and based on the summary, romantic love could also be a suitable theme. The SVM classifier predicts the theme friendship and humanoid robot, of which the former misclassification can be explained by the usage of the word “friends” in the summary. In the LTO, the theme humanoid robot does occur in the episode, but as a minor theme (Onsjö & Sheridan, 2025a).

For this summary, the LLMs predict multiple themes. The zero-shot gemma3:12b correctly predicts infatuation, but also predicts friendship and greed for riches. Llama3.1:8b predicts infatuation as well as friendship, husband and wife, humanoid robot, and father and son. The zero-shot mistral:7B fails to predict infatuation, and instead predicts romantic love, father and son, greed for riches, and human vs. captivity. For the few-shot predictions, gemma3:12b predicts infatuation, romantic love, and friendship; llama3.1:8b predicts friendship and father and son; and mistral:7B predicts friendship and greed for riches. Father and son is also listed as a minor theme of this episode in the LTO.

As romantic love is the parent of infatuation in the hierarchy of the LTO, it would be logical if the classification models had picked the theme romantic love over infatuation, in particular since the summary explicitly mentions Leela and Adlai Atkins being in a relationship. However, the models seem to be able to distinguish between these two themes in this example, as two models predict infatuation without predicting romantic love and only one model predicts romantic love, without also predicting infatuation.
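One way to account for such parent-child relations in evaluation would be a hierarchy-aware match that accepts a predicted parent theme as a near miss. The sketch below hard-codes the two parent relations mentioned in this article; in practice the relations could be retrieved from the LTO (e.g. via the totolo package), and both the data structure and the function name here are our own illustration, not an existing API:

```python
# Parent relations mentioned in the text; a real implementation could
# retrieve the full hierarchy from the LTO instead of hard-coding it.
PARENT = {
    "infatuation": "romantic love",
    "spouse murder": "murder",
}

def matches_up_to_parent(predicted: str, gold: str) -> bool:
    """True if the prediction equals the gold theme or its direct parent."""
    return predicted == gold or predicted == PARENT.get(gold)
```

Under such a relaxed match, mistral:7B's romantic love prediction for the infatuation example would count as a near miss rather than a plain error.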

Thus, this analysis indicates that the predicted themes can be related to specific word usage in the summaries. This also leads to wrongly predicted minor themes, as these themes are explicitly mentioned in the summaries. However, implicit major themes such as infatuation can still be correctly identified by the classifiers. Lastly, despite the theme infatuation being a child of the theme romantic love in the hierarchy of the LTO, the models seem to be able to distinguish between these themes.

7 Discussion

We find that bag-of-words classification is most successful in classifying themes in the summaries and subtitles of TV series episodes. The best results are obtained for the summary dataset, despite the train set being smaller than for the subtitle dataset.

The results for the LLMs also show that prompt length has a strong influence on the predictions made. All models perform better with the zero-shot prompt than with the few-shot prompt, in which example inputs and outputs are given. Additionally, the LLMs are more accurate in using the requested output format when the input text is below average length: with one exception, the models make no formatting mistakes when the input text is below the average length for the respective dataset. These results indicate that the LLMs are more likely to hallucinate answers and output formats when the prompt and input texts are longer. Since including examples makes the prompt longer, this also explains why the LLMs perform better with the zero-shot prompts than with the few-shot prompts.

7.1 Reuse potential

Future research could use this benchmark to investigate the influence of shortening or segmenting the summaries and subtitles, comparing the LLM results using shorter prompts. This is particularly interesting as the majority of the summary dataset consists of summaries of the Star Trek franchise. As these summaries are rather extensive and discuss the episode act by act, condensing them into shorter, broader summaries could strongly affect the performance of the different models on the classification task. An interesting approach to shortening the input text would be Retell (Lucy et al., 2025), a technique in which language models are asked to describe, paraphrase, or summarize a given paragraph, after which LDA topic modeling (Blei et al., 2003) is used to abstract topics from the language model output. Their results show that this yields more accurate topic abstractions than applying LDA topic modeling to the original passages. Lastly, this research focuses on small LLMs that can be run on standard hardware; the effect of prompt length may differ for larger LLMs. A limitation of this dataset is that it contains a higher proportion of science fiction episodes than of other genres. Since the focus of this benchmark is literary theme classification rather than genre classification, a more genre-homogeneous dataset may reduce the risk of classifiers learning genre-specific patterns instead of thematic features; to further minimize genre-related noise, the dataset could also be restricted to science fiction TV series only. Alternatively, the dataset could be used to extend research on literary themes beyond literary fiction or prestigious movies.

The LTO includes motivations for the annotated themes. These motivations could provide additional information to the LLMs if they can be included in the prompts without degrading performance through the longer prompt length. The dataset could also be tested on larger LLMs, to investigate whether they are better at following the instructions provided in the prompts. Additionally, classification scores could be improved by choosing themes higher up in the LTO hierarchy, or by excluding directly related theme pairs, such as infatuation and romantic love or spouse murder and murder, from the list of possible themes. Furthermore, distinguishing between the major and minor themes of an episode can be difficult; more clearly defining the difference between major and minor themes could therefore lead to better results. This could be done by providing negative examples of minor themes when training classifiers, as well as by including the definition of a major theme in the prompt given to an LLM. Another approach would be to classify both major and minor themes, so that the models do not need to distinguish between the two.
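A prompt that includes theme motivations, as suggested above, could look like the following sketch; the prompt wording, function name, and motivation text here are illustrative, not the prompts used in this study:

```python
def build_prompt(text: str, themes: dict[str, str]) -> str:
    """Assemble a zero-shot classification prompt that lists each candidate
    theme together with a (hypothetical) LTO motivation for it."""
    theme_lines = "\n".join(
        f"- {name}: {motivation}" for name, motivation in themes.items()
    )
    return (
        "Identify the major themes in the episode text below.\n"
        f"Candidate themes:\n{theme_lines}\n"
        'Answer with a JSON list of theme names, e.g. ["murder"].\n\n'
        f"Episode text:\n{text}"
    )
```

A design trade-off is visible even in this sketch: every motivation line lengthens the prompt, and the Table 9 results suggest longer inputs increase the risk of wrongly formatted output.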

Additional File

The additional file for this article can be found as follows:

Appendix

Appendix A.1 to A.2. DOI: https://doi.org/10.5334/johd.480.s1

Notes

[2] In the conversion, the subtitles were converted to lowercase strings of text. As subtitles contain relatively few names, we judged that lowercasing would not have a strong influence on the theme classification.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Noa Visser Solissa: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Writing – original draft, Paul Sheridan: Conceptualization, Data curation, Validation, Writing – review & editing, Mikael Onsjö: Data curation, Writing – review & editing, Andreas van Cranenburgh: Conceptualization, Supervision, Writing – review & editing, Federico Pianzola: Conceptualization, Supervision, Writing – review & editing.

DOI: https://doi.org/10.5334/johd.480 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 14, 2025 | Accepted on: Jan 19, 2026 | Published on: Feb 18, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Noa Visser Solissa, Paul Sheridan, Mikael Onsjö, Andreas van Cranenburgh, Federico Pianzola, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.