A Classification Benchmark Based on the Literary Theme Ontology

Noa Visser Solissa; Paul Sheridan; Mikael Onsjö; Andreas van Cranenburgh; Federico Pianzola

doi:10.5334/johd.480

Figures & Tables

Table 1

The summary dataset, showing for each TV series the number of episodes and the shortest, longest, and mean episode summary length in words.

TV SERIES	# EPISODES	MIN LENGTH (WORDS)	MAX LENGTH (WORDS)	MEAN LENGTH (WORDS)
Babylon 5	45	1,068	2,878	1,767.6
Black Mirror	12	578	3,837	1,581.7
Futurama	81	255	2,117	716.3
Game of Thrones	2	2,518	4,860	3,689.5
Guillermo del Toro’s Cabinet of Curiosities	5	1,111	3,621	2,308.6
Red Dwarf	39	571	2,220	952.1
Sherlock	2	526	813	669.5
Star Trek: Deep Space Nine	109	1,253	7,490	2,638.7
Star Trek: Enterprise	43	1,122	6,458	2,148.6
Star Trek: The Animated Series	5	712	1,517	941.2
Star Trek: The Next Generation	66	1,160	6,910	2,440.7
Star Trek: The Original Series	38	841	3,765	1,853.4
Star Trek: Voyager	59	1,265	6,182	2,577.3
Tales from the Crypt (1989)	31	95	1,031	572.1
Tales from the Loop (2020)	4	2,577	3,322	2,958
The Twilight Zone Franchise	103	154	1,093	373
Total	644	95	7,490	1,630.8
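The length statistics in a table of this kind can be computed in a few lines of Python. The following is only a sketch: it assumes that length means the number of whitespace-separated tokens, which the table does not state, and the `summaries` mapping with its placeholder texts is hypothetical.

```python
from statistics import mean


def length_stats(texts):
    """Return (number of texts, min, max, mean) of the word counts."""
    # Assumption: a "word" is a whitespace-separated token.
    lengths = [len(text.split()) for text in texts]
    return len(lengths), min(lengths), max(lengths), round(mean(lengths), 1)


# Hypothetical placeholder data: series name -> list of episode summaries.
summaries = {
    "Series A": ["one two three", "one two three four five"],
    "Series B": ["a single short summary text"],
}

# One row per series, plus a "Total" row computed over all episodes.
per_series = {name: length_stats(texts) for name, texts in summaries.items()}
overall = length_stats([t for texts in summaries.values() for t in texts])
```

Note that the mean in the total row is taken over all episodes, so it is generally not the mean of the per-series means.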

Table 2

The subtitles dataset, showing for each TV series the number of episodes and the shortest, longest, and mean episode subtitle length in words.

TV SERIES	# EPISODES	MIN LENGTH (WORDS)	MAX LENGTH (WORDS)	MEAN LENGTH (WORDS)
Alfred Hitchcock Presents	185	969	3,584	2,527.6
Amazing Stories (1985)	15	1,089	2,402	1,608.3
Amazing Stories (2020)	4	3,649	4,892	4,300
Babylon 5	43	3,197	5,514	4,400.1
Black Mirror	13	2,563	7,185	4,680.1
Brideshead Revisited (1981)	10	4,012	9,604	5,087.4
Futurama	58	1,770	9,402	2,659.4
Game of Thrones	2	3,460	5,416	4,438
Guillermo del Toro’s Cabinet of Curiosities	5	1,964	4,825	3,511
I Claudius	9	4,921	5,943	5,300.8
Piece of Cake (1988)	1	4,608	4,608	4,608
Red Dwarf	7	2,680	3,539	3,222
Sherlock	10	1,460	10,252	8,400.1
Star Trek: Deep Space Nine	98	3,073	4,970	4,080.6
Star Trek: Enterprise	24	2,931	4,878	3,851.8
Star Trek: The Animated Series	2	1,925	2,353	2,139
Star Trek: The Next Generation	56	2,756	9,622	4,079.2
Star Trek: The Original Series	30	3,132	5,681	4,340.1
Star Trek: Voyager	48	3,241	5,667	4,692.4
Tales from the Crypt (1989)	74	392	3,514	2,101.1
Tales from the Loop (2020)	3	677	1,954	1,440.3
Tales of the Unexpected	77	1,499	4,023	2,441.1
The Alfred Hitchcock Hour	84	2,942	7,408	4,925.4
The Twilight Zone Franchise	98	486	5,421	2,397.2
Total	956	392	10,252	3,374.8

Table 3

The 10 most frequent themes in the available summaries and subtitles. The total number of theme occurrences is higher than the total number of episodes in each dataset, as one episode can contain multiple themes.

(A) SUMMARIES
THEME	# OCCURRENCES
father and son	73
friendship	101
greed for riches	62
human vs. captivity	63
humanoid robot	75
husband and wife	113
infatuation	104
romantic love	88
the desire for vengeance	81
time travel	63
Total	832
(B) SUBTITLES
THEME	# OCCURRENCES
extramarital affair	96
father and son	98
friendship	130
greed for riches	108
husband and wife	348
infatuation	140
murder	166
romantic love	132
spouse murder	127
the desire for vengeance	155
Total	1,500
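Because the task is multi-label, occurrences are counted per theme rather than per episode. The sketch below illustrates the counting with a hypothetical `episode_themes` mapping; the episode identifiers and their annotations are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical annotations: episode id -> set of themes assigned to it.
episode_themes = {
    "ep1": {"friendship", "time travel"},
    "ep2": {"friendship"},
    "ep3": {"husband and wife", "infatuation", "friendship"},
}

# Count every (episode, theme) pair once.
theme_counts = Counter(
    theme for themes in episode_themes.values() for theme in themes
)
total_occurrences = sum(theme_counts.values())

# The ten most frequent themes, as (theme, count) pairs.
top_themes = theme_counts.most_common(10)
```

Here three episodes yield six theme occurrences, which is why the totals in the table can exceed the episode counts.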

Table 4

The results for the summary and subtitle datasets. All models were tested on the complete test set and evaluated using macro-averaged precision, recall, and F₁ scores. For both datasets, the highest F₁ score is highlighted in bold.

MODEL	SUMMARY PRECISION	SUMMARY RECALL	SUMMARY F₁	SUBTITLES PRECISION	SUBTITLES RECALL	SUBTITLES F₁
LogReg bigrams 1,000	0.33	0.74	0.44	0.31	0.78	0.42
LogReg bigrams 5,000	0.34	0.75	0.45	0.31	0.78	0.43
SVM bigrams 5,000	0.38	0.77	0.50	0.32	0.79	0.44
SVM bigrams 10,000	0.39	0.77	0.51	0.32	0.74	0.44
FastText LogReg	0.23	0.60	0.33	0.22	0.66	0.33
FastText SVM	0.24	0.63	0.34	0.24	0.63	0.34
SetFit Undersampling	0.43	0.22	0.28	0.18	0.24	0.20
SetFit Unique	0.50	0.28	0.34	0.24	0.23	0.21
SetFit Oversampling	0.44	0.30	0.34	0.17	0.17	0.16
LLM: zero-shot
Mistral 7B instruct	0.37	0.48	0.36	0.33	0.19	0.20
Gemma3:12b-it-qat	0.37	0.64	0.42	0.30	0.25	0.21
llama3.1:8b-instruct-q8_0	0.32	0.52	0.38	0.31	0.21	0.22
LLM: few-shot
Mistral 7B instruct	0.31	0.42	0.31	0.31	0.13	0.16
Gemma3:12b-it-qat	0.38	0.51	0.40	0.33	0.14	0.15
llama3.1:8b-instruct-q8_0	0.37	0.45	0.37	0.33	0.16	0.15

Table 5

The results for the highest-performing model on the summary dataset (SVM, 10,000 features). Results are shown per theme.

THEME	PRECISION	RECALL	F₁	SUPPORT
husband and wife	0.51	0.77	0.62	26
infatuation	0.38	0.67	0.48	18
friendship	0.22	0.95	0.36	20
romantic love	0.18	0.64	0.28	14
the desire for vengeance	0.29	0.63	0.40	16
humanoid robot	0.58	0.94	0.71	16
father and son	0.40	0.92	0.56	13
human vs. captivity	0.32	0.73	0.44	11
time travel	0.82	0.75	0.78	12
greed for riches	0.33	0.80	0.47	15
macro avg	0.40	0.78	0.51	161
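The macro avg row can be checked against the per-theme rows. The sketch below assumes the usual definition of macro-averaging, the unweighted mean of the per-theme scores; the values are copied from the table above, and the helper name `macro_average` is chosen for illustration.

```python
# Per-theme (precision, recall, F1) values copied from the table above.
per_theme = {
    "husband and wife": (0.51, 0.77, 0.62),
    "infatuation": (0.38, 0.67, 0.48),
    "friendship": (0.22, 0.95, 0.36),
    "romantic love": (0.18, 0.64, 0.28),
    "the desire for vengeance": (0.29, 0.63, 0.40),
    "humanoid robot": (0.58, 0.94, 0.71),
    "father and son": (0.40, 0.92, 0.56),
    "human vs. captivity": (0.32, 0.73, 0.44),
    "time travel": (0.82, 0.75, 0.78),
    "greed for riches": (0.33, 0.80, 0.47),
}


def macro_average(scores):
    """Unweighted mean of each metric over all themes."""
    columns = zip(*scores.values())
    return tuple(sum(col) / len(scores) for col in columns)


macro_p, macro_r, macro_f1 = macro_average(per_theme)

# For comparison only: the harmonic mean of the macro precision and recall,
# which is NOT how the macro F1 is obtained under this definition.
f1_of_averages = 2 * macro_p * macro_r / (macro_p + macro_r)

print(f"{macro_p:.2f} {macro_r:.2f} {macro_f1:.2f}")  # prints "0.40 0.78 0.51"
```

Under this definition the macro F₁ is the mean of the per-theme F₁ scores. Combining the macro precision and recall directly would give about 0.53 rather than 0.51, so the row should not be read as an F₁ computed from the two averages beside it.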

Table 6

The results for the highest-performing model on the subtitle dataset (SVM, 5,000 features). Results are shown per theme.

THEME	PRECISION	RECALL	F₁	SUPPORT
husband and wife	0.66	0.85	0.74	65
murder	0.34	0.84	0.49	32
the desire for vengeance	0.22	0.56	0.31	32
infatuation	0.39	0.61	0.48	33
romantic love	0.26	0.89	0.40	26
friendship	0.19	0.81	0.30	21
spouse murder	0.28	0.96	0.43	24
greed for riches	0.27	0.72	0.39	25
father and son	0.32	0.88	0.47	17
extramarital affair	0.28	0.78	0.41	18
macro avg	0.32	0.79	0.44	293

Table 7

The results for the highest-performing LLM on the summary dataset (Gemma3:12b-it-qat, zero-shot). Results are shown per theme.

THEME	PRECISION	RECALL	F₁	SUPPORT
husband and wife	0.59	0.62	0.60	26
infatuation	0.18	0.78	0.29	18
friendship	0.22	0.85	0.35	20
romantic love	0.19	0.71	0.30	14
the desire for vengeance	0.20	0.50	0.29	16
humanoid robot	0.57	0.50	0.53	16
father and son	0.75	0.46	0.57	13
human vs. captivity	0.18	0.18	0.18	11
time travel	0.53	0.83	0.65	12
greed for riches	0.26	0.93	0.41	15
macro avg	0.37	0.64	0.42	161

Table 8

The results for the highest-performing LLM on the subtitle dataset (llama3.1:8b-instruct-q8_0, zero-shot). Results are shown per theme.

THEME	PRECISION	RECALL	F₁	SUPPORT
husband and wife	0.67	0.19	0.29	65
murder	0.44	0.72	0.55	32
the desire for vengeance	0.25	0.03	0.06	32
infatuation	0.15	0.09	0.11	33
romantic love	0.27	0.12	0.16	26
friendship	0.18	0.14	0.16	21
spouse murder	0.29	0.08	0.13	24
greed for riches	0.39	0.20	0.26	25
father and son	0.33	0.24	0.28	17
extramarital affair	0.14	0.33	0.20	18
macro avg	0.31	0.21	0.22	293

Table 9

The percentage of wrong output per model for both the summary and subtitle datasets. A prediction is counted as wrong output when the model's response is in a different format than the requested list of themes. The first column gives the percentage of wrong output over the whole test set, the second column gives the percentage of wrong output for input texts shorter than the average length of the summaries or subtitles in the dataset, and the last column gives the percentage of wrong output for input texts longer than the average length.

MODEL	WRONG OUTPUT	WRONG OUTPUT BELOW AVG	WRONG OUTPUT ABOVE AVG
Summary
LLM: zero-shot
Mistral 7B instruct	14.0%	0.0%	14.0%
Gemma3:12b-it-qat	10.1%	0.0%	10.1%
llama3.1:8b-instruct-q8_0	10.1%	0.0%	10.1%
LLM: few-shot
Mistral 7B instruct	20.9%	0.0%	20.9%
Gemma3:12b-it-qat	16.3%	0.8%	15.5%
llama3.1:8b-instruct-q8_0	17.8%	0.0%	17.8%
Subtitle
LLM: zero-shot
Mistral 7B instruct	48.4%	0.0%	48.4%
Gemma3:12b-it-qat	48.4%	0.0%	48.4%
llama3.1:8b-instruct-q8_0	47.9%	0.0%	47.9%
LLM: few-shot
Mistral 7B instruct	62.5%	0.0%	62.5%
Gemma3:12b-it-qat	54.2%	0.0%	54.2%
llama3.1:8b-instruct-q8_0	52.1%	0.0%	52.1%
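The bookkeeping behind such a table can be sketched as follows. Several points are assumptions rather than details given in the caption: the requested format is taken to be a JSON list of strings, the average length is computed over the evaluated records themselves, and all three percentages use the whole test set as denominator (a reading suggested by the fact that the last two columns sum to the first in every row). The function names and the `records` data are hypothetical.

```python
import json


def parse_theme_list(response):
    """Return the list of themes, or None if the response is malformed.

    Assumption: a well-formed response is a JSON list of strings.
    """
    try:
        parsed = json.loads(response)
    except (json.JSONDecodeError, TypeError):
        return None
    if isinstance(parsed, list) and all(isinstance(x, str) for x in parsed):
        return parsed
    return None


def wrong_output_rates(records):
    """records: list of (input word count, model response) pairs.

    Returns (all, below average, above average) wrong-output percentages,
    each relative to the whole set of records.
    """
    total = len(records)
    avg_length = sum(n for n, _ in records) / total
    wrong = [n for n, resp in records if parse_theme_list(resp) is None]
    below = sum(1 for n in wrong if n < avg_length)
    above = len(wrong) - below

    def pct(count):
        return round(100 * count / total, 1)

    return pct(len(wrong)), pct(below), pct(above)


# Hypothetical model responses for four inputs of different lengths.
records = [
    (100, '["friendship"]'),
    (200, '["time travel", "friendship"]'),
    (900, "The themes of this episode are friendship and love."),
    (800, "[]"),
]
rates = wrong_output_rates(records)
```

In this toy example only the free-text response to the 900-word input fails to parse, so the result is 25.0% wrong output overall, 0.0% below the average length, and 25.0% above it.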

A shortened version of the summary input text for the Futurama episode Bendin’ in the Wind (season 3, episode 13).

A shortened version of the summary input text for the Futurama episode The Cyber House Rules (season 3, episode 9).