Towards Leitmotif Activity Detection in Opera Recordings

Michael Krause; Meinard Müller; Christof Weiß

doi:10.5334/tismir.116

Figures & Tables

Figure 1

Illustration of a leitmotif (here the *Ring* motif `L-Ri`) and its manifestations as (a) leitmotif occurrences in the score, (b) leitmotif instances in several recorded performances (audio), and (c) continuous leitmotif activity output by a detection system.

Figure 2

Structure of Richard Wagner’s *Ring* cycle and overview of 16 recorded performances, see also Zalkow et al. (2017a). Measure positions have been annotated manually for the topmost three performances (`P-Ka`, `P-Ba`, and `P-Ha`), which also constitute the test set in our performance split. The three middle performances (`P-Sa`, `P-So`, and `P-We`) constitute the validation set. All other performances are used for training.

Table 1

Overview of the 20 leitmotifs used in this study (the first ten of these motifs were previously used in Krause et al. (2020)). Score examples shown are adapted from Wagner (2013). Lengths are given as means and standard deviations over all annotated occurrences (in measures) or instances (in seconds) from all performances given in Figure 2. Counts and lengths differ from Krause et al. (2020), because we allow for concurrent motif activity in this study.

| Name (English translation) | ID | # Occurrences | Length (measures) | Length (seconds) |
|---|---|---|---|---|
| Nibelungen (Nibelungs) | `L-Ni` | 562 | 0.95 ± 0.24 | 1.72 ± 0.50 |
| Ring (Ring) | `L-Ri` | 297 | 1.50 ± 0.66 | 3.77 ± 2.46 |
| Nibelungenhass (Nibelungs’ hate) | `L-NH` | 252 | 0.96 ± 0.17 | 3.22 ± 1.20 |
| Mime (Mime) | `L-Mi` | 243 | 0.83 ± 0.25 | 0.84 ± 0.20 |
| Ritt (Ride) | `L-RT` | 228 | 0.66 ± 0.17 | 1.26 ± 0.38 |
| Waldweben (Forest murmurs) | `L-Wa` | 228 | 1.10 ± 0.30 | 2.65 ± 0.73 |
| Waberlohe (Swirling blaze) | `L-WL` | 194 | 1.21 ± 0.39 | 4.59 ± 1.70 |
| Horn (Horn) | `L-Ho` | 195 | 1.30 ± 1.02 | 2.34 ± 1.51 |
| Geschwisterliebe (Siblings’ love) | `L-Ge` | 158 | 1.32 ± 0.84 | 3.13 ± 2.65 |
| Schwert (Sword) | `L-Sc` | 148 | 1.88 ± 0.63 | 3.73 ± 1.99 |
| Jugendkraft (Youthful vigor) | `L-Ju` | 146 | 1.23 ± 0.57 | 0.96 ± 0.38 |
| Walhall-b (Valhalla-b) | `L-WH` | 143 | 1.10 ± 0.47 | 3.53 ± 2.14 |
| Riesen (Giants) | `L-RS` | 136 | 0.95 ± 0.39 | 2.83 ± 1.96 |
| Feuerzauber (Magic fire) | `L-Fe` | 112 | 1.18 ± 0.40 | 3.57 ± 1.09 |
| Schicksal (Fate) | `L-SK` | 94 | 2.02 ± 0.47 | 8.11 ± 2.64 |
| Unmuth (Upset) | `L-Un` | 92 | 1.87 ± 0.70 | 5.85 ± 3.21 |
| Liebe (Love) | `L-Li` | 89 | 1.78 ± 0.51 | 5.54 ± 2.47 |
| Siegfried (Siegfried) | `L-Si` | 86 | 2.88 ± 1.60 | 8.03 ± 5.46 |
| Mannen (Men) | `L-Ma` | 83 | 1.15 ± 0.50 | 1.37 ± 0.70 |
| Vertrag (Contract) | `L-Ve` | 83 | 2.29 ± 0.65 | 5.72 ± 2.12 |

Figure 3

Illustration of our ground truth occurrence annotations. Measures 112 to 390 from the first act of *Siegfried* are shown. For instance, `L-Ni` is active around measure 150, whereas `L-SK` is never active throughout this excerpt.
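
Occurrence annotations of this kind can be read as a binary activity matrix with one column per motif, in which several motifs may be active at once. The following is a minimal sketch under stated assumptions: the function `activity_matrix`, the half-measure grid, and the occurrence positions are all hypothetical and chosen for illustration; they are not the annotations shown in the figure.

```python
def activity_matrix(occurrences, motif_ids, start, end, step):
    """Rasterize (motif, start, end) occurrences onto a regular grid.

    Returns a list of rows (one per grid position in [start, end)),
    each row holding one 0/1 entry per motif in `motif_ids`.
    """
    n_rows = int(round((end - start) / step))
    column = {motif: i for i, motif in enumerate(motif_ids)}
    matrix = [[0] * len(motif_ids) for _ in range(n_rows)]
    for motif, occ_start, occ_end in occurrences:
        for row in range(n_rows):
            position = start + row * step
            # An occurrence covers the half-open interval [occ_start, occ_end).
            if occ_start <= position < occ_end:
                matrix[row][column[motif]] = 1
    return matrix


# Toy example with made-up measure positions (not real annotations).
motifs = ["L-Ni", "L-Ri", "L-SK"]
toy_occurrences = [("L-Ni", 1.0, 2.0), ("L-Ri", 1.5, 3.0)]
toy_matrix = activity_matrix(toy_occurrences, motifs, start=0.0, end=4.0, step=0.5)
for row in toy_matrix:
    print(row)
```

In the toy output, the row at measure 1.5 has two active entries, which mirrors the concurrent motif activity allowed in this study, while the `L-SK` column stays zero throughout.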

Table 2

Network architecture used for our RNN-based leitmotif activity detection system (adapted from Krause et al. (2020)).

| Layer | Output Shape | Parameters |
|---|---|---|
| Input | (431, 84) | |
| LSTM | (431, 128) | 109 056 |
| LSTM | (431, 128) | 131 584 |
| LSTM | (431, 128) | 131 584 |
| Batch normalization | (431, 128) | 512 |
| Dense (per frame) | (431, 21) | 2 709 |
| Output: Sigmoid | (431, 21) | |
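
The parameter counts in Table 2 can be reproduced with the usual closed-form expressions. This is a sketch, assuming standard LSTM layers (four gates with biases), batch normalization counted with four values per channel (scale, offset, and the two running statistics), and a dense layer with bias; the helper names are illustrative and do not come from the original implementation.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def batch_norm_params(channels):
    # Scale, offset, running mean, and running variance per channel.
    return 4 * channels


def dense_params(input_dim, units):
    # Weight matrix plus bias vector.
    return input_dim * units + units


rnn_layers = [
    ("LSTM", lstm_params(84, 128)),
    ("LSTM", lstm_params(128, 128)),
    ("LSTM", lstm_params(128, 128)),
    ("Batch normalization", batch_norm_params(128)),
    ("Dense (per frame)", dense_params(128, 21)),
]
rnn_total = sum(count for _, count in rnn_layers)

for name, count in rnn_layers:
    print(f"{name:<20s} {count:>8d}")
print(f"{'Total':<20s} {rnn_total:>8d}")
```

Under these assumptions, the per-layer values agree with the table (109 056, 131 584, 131 584, 512, and 2 709).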

Table 3

Network architecture used for our CNN-based leitmotif activity detection system (inspired by Schlüter and Lehner (2018)). Note that all operations have stride one in time and pitch, except for MaxPool2D, which has stride three in the pitch direction. Dilation rates in time increase after each max-pooling operation.

| Layer | Kernel size | Strides | Dilations | Output Shape | Parameters |
|---|---|---|---|---|---|
| Input | | | | (431, 84) | |
| Expand | | | | (431, 84, 1) | |
| Conv2D | (3, 3) | (1, 1) | (1, 1) | (431, 84, 128) | 1 152 |
| Batch normalization | | | | (431, 84, 128) | 512 |
| Conv2D | (3, 3) | (1, 1) | (1, 1) | (431, 84, 64) | 73 728 |
| Batch normalization | | | | (431, 84, 64) | 256 |
| MaxPool2D | (3, 3) | (1, 3) | (1, 1) | (431, 29, 64) | |
| Conv2D | (3, 3) | (1, 1) | (3, 1) | (431, 29, 128) | 73 728 |
| Batch normalization | | | | (431, 29, 128) | 512 |
| Conv2D | (3, 3) | (1, 1) | (3, 1) | (431, 29, 64) | 73 728 |
| Batch normalization | | | | (431, 29, 64) | 256 |
| MaxPool2D | (3, 3) | (1, 3) | (3, 1) | (431, 10, 64) | |
| Conv2D | (3, 3) | (1, 1) | (9, 1) | (431, 10, 128) | 73 728 |
| Batch normalization | | | | (431, 10, 128) | 512 |
| Conv2D | (3, 3) | (1, 1) | (9, 1) | (431, 10, 64) | 73 728 |
| Batch normalization | | | | (431, 10, 64) | 256 |
| MaxPool2D | (3, 3) | (1, 3) | (9, 1) | (431, 4, 64) | |
| Conv2D | (1, 4) | (1, 1) | (1, 1) | (431, 1, 64) | 16 384 |
| Batch normalization | | | | (431, 1, 64) | 256 |
| Squeeze | | | | (431, 64) | |
| Conv1D | (3) | (1) | (27) | (431, 128) | 24 576 |
| Batch normalization | | | | (431, 128) | 512 |
| Conv1D | (3) | (1) | (27) | (431, 64) | 24 576 |
| Batch normalization | | | | (431, 64) | 256 |
| MaxPool1D | (3) | (1) | (27) | (431, 64) | |
| Dense (per frame) | | | | (431, 21) | 1 365 |
| Output: Sigmoid | | | | (431, 21) | |
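
Two properties of the architecture in Table 3 can be checked by simple arithmetic: the convolution parameter counts and the temporal context seen by each output frame. The sketch below assumes that kernel, stride, and dilation tuples are ordered (time, pitch), that convolutions carry no bias (the listed counts equal kernel size × input channels × output channels), and that a kernel of size k with time dilation d widens the receptive field by (k − 1)·d frames at stride one. The layer list is transcribed from the table; the function names are illustrative.

```python
def conv_params(kernel, in_channels, out_channels):
    # Weights only: product of kernel dimensions times channel counts.
    size = 1
    for k in kernel:
        size *= k
    return size * in_channels * out_channels


# (name, kernel, time dilation, input channels, output channels).
# Pooling layers have no parameters, so their channels are None.
CNN_LAYERS = [
    ("Conv2D", (3, 3), 1, 1, 128),
    ("Conv2D", (3, 3), 1, 128, 64),
    ("MaxPool2D", (3, 3), 1, None, None),
    ("Conv2D", (3, 3), 3, 64, 128),
    ("Conv2D", (3, 3), 3, 128, 64),
    ("MaxPool2D", (3, 3), 3, None, None),
    ("Conv2D", (3, 3), 9, 64, 128),
    ("Conv2D", (3, 3), 9, 128, 64),
    ("MaxPool2D", (3, 3), 9, None, None),
    ("Conv2D", (1, 4), 1, 64, 64),
    ("Conv1D", (3,), 27, 64, 128),
    ("Conv1D", (3,), 27, 128, 64),
    ("MaxPool1D", (3,), 27, None, None),
]

conv_counts = [
    conv_params(kernel, c_in, c_out)
    for _, kernel, _, c_in, c_out in CNN_LAYERS
    if c_in is not None
]

# All layers have stride one in time, so the contributions simply add up.
receptive_field = 1 + sum(
    (kernel[0] - 1) * dilation for _, kernel, dilation, _, _ in CNN_LAYERS
)

print(conv_counts)
print(receptive_field)
```

Under these assumptions, each output frame depends on 241 of the 431 input frames, which illustrates how the growing dilation rates enlarge the temporal context without any downsampling in time.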

Table 4

Results (precision P, recall R, and F-measure F) for our deep learning-based leitmotif activity detection systems on the test set.

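| Leitmotif | RNN P | RNN R | RNN F | CNN P | CNN R | CNN F |
|---|---|---|---|---|---|---|
| `L-Ni` | 0.87 | 0.76 | 0.81 | 0.85 | 0.79 | 0.82 |
| `L-Ri` | 0.80 | 0.73 | 0.76 | 0.82 | 0.76 | 0.79 |
| `L-NH` | 0.89 | 0.78 | 0.83 | 0.91 | 0.82 | 0.86 |
| `L-Mi` | 0.86 | 0.86 | 0.86 | 0.87 | 0.79 | 0.83 |
| `L-RT` | 0.85 | 0.86 | 0.85 | 0.80 | 0.83 | 0.82 |
| `L-Wa` | 0.94 | 0.90 | 0.92 | 0.93 | 0.95 | 0.94 |
| `L-WL` | 0.86 | 0.85 | 0.85 | 0.83 | 0.85 | 0.84 |
| `L-Ho` | 0.80 | 0.76 | 0.78 | 0.82 | 0.80 | 0.81 |
| `L-Ge` | 0.89 | 0.81 | 0.85 | 0.85 | 0.81 | 0.83 |
| `L-Sc` | 0.74 | 0.72 | 0.73 | 0.83 | 0.72 | 0.77 |
| `L-Ju` | 0.82 | 0.68 | 0.74 | 0.87 | 0.78 | 0.82 |
| `L-WH` | 0.79 | 0.77 | 0.78 | 0.78 | 0.76 | 0.77 |
| `L-RS` | 0.87 | 0.84 | 0.86 | 0.86 | 0.81 | 0.84 |
| `L-Fe` | 0.87 | 0.88 | 0.88 | 0.93 | 0.86 | 0.89 |
| `L-SK` | 0.75 | 0.72 | 0.74 | 0.81 | 0.75 | 0.78 |
| `L-Un` | 0.79 | 0.75 | 0.77 | 0.84 | 0.81 | 0.83 |
| `L-Li` | 0.89 | 0.81 | 0.85 | 0.82 | 0.84 | 0.83 |
| `L-Si` | 0.78 | 0.75 | 0.76 | 0.83 | 0.80 | 0.81 |
| `L-Ma` | 0.79 | 0.81 | 0.80 | 0.87 | 0.79 | 0.83 |
| `L-Ve` | 0.84 | 0.73 | 0.78 | 0.83 | 0.83 | 0.83 |
| Class mean | 0.83 | 0.79 | 0.81 | 0.85 | 0.81 | 0.83 |
| Matrix mean | 0.83 | 0.78 | 0.80 | 0.85 | 0.80 | 0.82 |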
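
The two summary rows of Table 4 correspond to two different ways of averaging frame-wise scores. The captions do not define them, so the sketch below rests on an assumption: "Class mean" is taken to be the average of the per-motif measures, and "Matrix mean" the measures computed once over all entries of the binary frame × class matrix. The function names and the toy data are illustrative only.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-measure from raw counts (0.0 if undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return precision, recall, f_measure


def evaluate(truth, pred):
    """truth, pred: lists of frames, each a list with one 0/1 entry per class."""
    n_classes = len(truth[0])
    counts = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(truth, pred) if t[c] and p[c])
        fp = sum(1 for t, p in zip(truth, pred) if not t[c] and p[c])
        fn = sum(1 for t, p in zip(truth, pred) if t[c] and not p[c])
        counts.append((tp, fp, fn))
    per_class = [prf(*c) for c in counts]
    # Assumed "Class mean": average the per-class measures.
    class_mean = tuple(sum(m[i] for m in per_class) / n_classes for i in range(3))
    # Assumed "Matrix mean": pool the counts over all classes, then compute once.
    matrix_mean = prf(*(sum(c[i] for c in counts) for i in range(3)))
    return per_class, class_mean, matrix_mean


# Toy data: four frames, two classes (not taken from the paper).
toy_truth = [[1, 0], [1, 0], [1, 1], [0, 0]]
toy_pred = [[1, 0], [0, 0], [1, 0], [0, 1]]
per_class, class_mean, matrix_mean = evaluate(toy_truth, toy_pred)
print(per_class, class_mean, matrix_mean)
```

With pooled counts, frequent motifs weigh more heavily than rare ones, which is one plausible reason why the two rows differ slightly.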

Figure 4

Illustration of results for our RNN-based leitmotif activity detection system (shown for measures 112 to 390 from the first act of *Siegfried* in `P-Ba`).

Figure 5

Results for our RNN-based leitmotif activity detection system on measures 117 to 123.5 of the first act of *Siegfried* in `P-Ba` (see also Figure 3 and Figure 4; outputs of the CNN-based model are similar). A prominent instance of `L-Sc` is played in the higher registers, accompanied by a low-frequency tremolo. The model input is shown in the upper row. The respective output activations for the `L-Sc` class are plotted underneath in red (solid line). The dashed blue line corresponds to the ground truth annotations for `L-Sc`. The input is given to the network (a) unchanged, (b) slowed down to 175% of the original length, (c) with a pitch shift of eleven semitones, (d) with motif frames replaced by noise, and (e) with motif frames shuffled along the time axis.

Figure 6

Results for our (a) RNN-based and (b) CNN-based leitmotif activity detection systems on the test set under tempo changes. The CQT input is stretched in time (using bilinear resampling) by the given percentage.
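
A time stretch of this kind can be sketched in a few lines. Because only the time axis changes length, bilinear resampling reduces here to linear interpolation between neighboring frames. The sketch assumes that the percentage denotes the new length relative to the original (as in "slowed down to 175%") and that first and last frames are kept aligned; both are assumptions, and `stretch_time` is an illustrative name, not the original implementation.

```python
def stretch_time(frames, percent):
    """Resample a list of frames (each a list of bin values) along time.

    `percent` is the target length relative to the input, e.g. 175 yields
    an output with 1.75 times as many frames.
    """
    n_in = len(frames)
    n_out = max(1, round(n_in * percent / 100))
    if n_in == 1 or n_out == 1:
        return [list(frames[0]) for _ in range(n_out)]
    stretched = []
    for j in range(n_out):
        # Fractional input position for output frame j (endpoints aligned).
        position = j * (n_in - 1) / (n_out - 1)
        lower = int(position)
        upper = min(lower + 1, n_in - 1)
        weight = position - lower
        stretched.append(
            [(1 - weight) * a + weight * b for a, b in zip(frames[lower], frames[upper])]
        )
    return stretched


# Toy input: three frames with a single bin each.
toy_frames = [[0.0], [1.0], [2.0]]
toy_stretched = stretch_time(toy_frames, 200)
print(len(toy_stretched))  # 6 frames
```

The ground truth annotations have to be stretched by the same factor so that input and target frames stay aligned.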

Figure 7

Results for our (a) RNN-based and (b) CNN-based leitmotif activity detection systems on the test set under pitch shifts. The CQT input has been shifted (using nearest-neighbor padding) on the pitch axis by the given number of semitones (corresponding to CQT bins).
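
Since one semitone corresponds to one CQT bin, such a pitch shift amounts to moving every frame along the bin axis and filling the vacated bins. The following sketch interprets nearest-neighbor padding as repeating the nearest edge value and assumes bins are ordered from low to high; `shift_pitch` is an illustrative name.

```python
def shift_pitch(frames, semitones):
    """Shift each frame by `semitones` bins; positive values shift upwards."""
    n_bins = len(frames[0])
    shifted = []
    for frame in frames:
        # Output bin b reads input bin b - semitones, clamped to the valid
        # range so that vacated bins repeat the nearest edge value.
        shifted.append(
            [frame[min(max(b - semitones, 0), n_bins - 1)] for b in range(n_bins)]
        )
    return shifted


toy_frame = [[1, 2, 3, 4]]
print(shift_pitch(toy_frame, 1))   # [[1, 1, 2, 3]]
print(shift_pitch(toy_frame, -2))  # [[3, 4, 4, 4]]
```

Content shifted beyond the highest or lowest bin is lost, so large shifts such as eleven semitones discard a noticeable part of the 84-bin input.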

Figure 8

Results for our (a) RNN-based and (b) CNN-based leitmotif activity detection systems on the test set when (1) replacing leitmotif frames by noise or (2) shuffling them along the time axis. The modifications are applied to the first, middle, or last third of each leitmotif instance (Start, Middle, End), to no frames (Unchanged), or to all leitmotif frames (All).
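
The two perturbations can be sketched as follows for a single leitmotif instance. The caption does not specify the noise distribution or how thirds are rounded, so uniform noise in [0, 1) and integer division are assumptions here; `third_bounds` and `perturb` are illustrative names.

```python
import random


def third_bounds(start, end, part):
    """Frame range [lo, hi) of one part of an instance spanning [start, end)."""
    if part == "All":
        return start, end
    length = end - start
    cuts = [start, start + length // 3, start + 2 * length // 3, end]
    index = {"Start": 0, "Middle": 1, "End": 2}[part]
    return cuts[index], cuts[index + 1]


def perturb(frames, start, end, part, mode, seed=0):
    """Return a copy of `frames` with part of one instance perturbed.

    part: "Unchanged", "Start", "Middle", "End", or "All"
    mode: "noise" (replace frames) or "shuffle" (permute frames in time)
    """
    rng = random.Random(seed)
    perturbed = [list(frame) for frame in frames]
    if part == "Unchanged":
        return perturbed
    lo, hi = third_bounds(start, end, part)
    if mode == "noise":
        for t in range(lo, hi):
            # Assumed noise model: independent uniform values per bin.
            perturbed[t] = [rng.random() for _ in perturbed[t]]
    elif mode == "shuffle":
        segment = perturbed[lo:hi]
        rng.shuffle(segment)
        perturbed[lo:hi] = segment
    else:
        raise ValueError(f"unknown mode: {mode}")
    return perturbed


# Toy input: nine single-bin frames, with an instance covering frames 3 to 8.
toy_input = [[float(t)] for t in range(9)]
print(perturb(toy_input, 3, 9, "Middle", "shuffle"))
```

Shuffling keeps the spectral content of the affected frames but destroys their temporal order, whereas noise removes the content altogether, so comparing the two indicates how much a model relies on temporal structure.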
