Towards Leitmotif Activity Detection in Opera Recordings
Open Access | Nov 2021

Figures & Tables

Figure 1

Illustration of a leitmotif (here the Ring motif L-Ri) and its manifestations as (a) leitmotif occurrences in the score, (b) leitmotif instances in several recorded performances (audio), (c) continuous leitmotif activity output by a detection system.

Figure 2

Structure of Richard Wagner’s Ring cycle and overview of 16 recorded performances (see also Zalkow et al. (2017a)). Measure positions have been annotated manually for the topmost three performances (P-Ka, P-Ba, and P-Ha), which also constitute the test set in our performance split. The three middle performances (P-Sa, P-So, and P-We) constitute the validation set. All other performances are used for training.

Table 1

Overview of the 20 leitmotifs used in this study (the first ten of these motifs were previously used in Krause et al. (2020)). Score examples shown are adapted from Wagner (2013). Lengths are given as means and standard deviations over all annotated occurrences (in measures) or instances (in seconds) from all performances given in Figure 2. Counts and lengths differ from Krause et al. (2020), because we allow for concurrent motif activity in this study.

| Name (English translation) | ID | Score | # Occurrences | Length (measures) | Length (seconds) |
|---|---|---|---|---|---|
| Nibelungen (Nibelungs) | L-Ni | tismir-4-1-116-g9.png | 562 | 0.95 ± 0.24 | 1.72 ± 0.50 |
| Ring (Ring) | L-Ri | tismir-4-1-116-g10.png | 297 | 1.50 ± 0.66 | 3.77 ± 2.46 |
| Nibelungenhass (Nibelungs’ hate) | L-NH | tismir-4-1-116-g11.png | 252 | 0.96 ± 0.17 | 3.22 ± 1.20 |
| Mime (Mime) | L-Mi | tismir-4-1-116-g12.png | 243 | 0.83 ± 0.25 | 0.84 ± 0.20 |
| Ritt (Ride) | L-RT | tismir-4-1-116-g13.png | 228 | 0.66 ± 0.17 | 1.26 ± 0.38 |
| Waldweben (Forest murmurs) | L-Wa | tismir-4-1-116-g14.png | 228 | 1.10 ± 0.30 | 2.65 ± 0.73 |
| Waberlohe (Swirling blaze) | L-WL | tismir-4-1-116-g15.png | 194 | 1.21 ± 0.39 | 4.59 ± 1.70 |
| Horn (Horn) | L-Ho | tismir-4-1-116-g16.png | 195 | 1.30 ± 1.02 | 2.34 ± 1.51 |
| Geschwisterliebe (Siblings’ love) | L-Ge | tismir-4-1-116-g17.png | 158 | 1.32 ± 0.84 | 3.13 ± 2.65 |
| Schwert (Sword) | L-Sc | tismir-4-1-116-g18.png | 148 | 1.88 ± 0.63 | 3.73 ± 1.99 |
| Jugendkraft (Youthful vigor) | L-Ju | tismir-4-1-116-g19.png | 146 | 1.23 ± 0.57 | 0.96 ± 0.38 |
| Walhall-b (Valhalla-b) | L-WH | tismir-4-1-116-g20.png | 143 | 1.10 ± 0.47 | 3.53 ± 2.14 |
| Riesen (Giants) | L-RS | tismir-4-1-116-g21.png | 136 | 0.95 ± 0.39 | 2.83 ± 1.96 |
| Feuerzauber (Magic fire) | L-Fe | tismir-4-1-116-g22.png | 112 | 1.18 ± 0.40 | 3.57 ± 1.09 |
| Schicksal (Fate) | L-SK | tismir-4-1-116-g23.png | 94 | 2.02 ± 0.47 | 8.11 ± 2.64 |
| Unmuth (Upset) | L-Un | tismir-4-1-116-g24.png | 92 | 1.87 ± 0.70 | 5.85 ± 3.21 |
| Liebe (Love) | L-Li | tismir-4-1-116-g25.png | 89 | 1.78 ± 0.51 | 5.54 ± 2.47 |
| Siegfried (Siegfried) | L-Si | tismir-4-1-116-g26.png | 86 | 2.88 ± 1.60 | 8.03 ± 5.46 |
| Mannen (Men) | L-Ma | tismir-4-1-116-g27.png | 83 | 1.15 ± 0.50 | 1.37 ± 0.70 |
| Vertrag (Contract) | L-Ve | tismir-4-1-116-g28.png | 83 | 2.29 ± 0.65 | 5.72 ± 2.12 |

Figure 3

Illustration of our ground truth occurrence annotations. Measures 112 to 390 from the first act of Siegfried are shown. For instance, L-Ni is active around measure 150, whereas L-SK is never active throughout this excerpt.

Table 2

Network architecture used for our RNN-based leitmotif activity detection system (adapted from Krause et al. (2020)).

| Layer | Output Shape | Parameters |
|---|---|---|
| Input | (431, 84) | |
| LSTM | (431, 128) | 109 056 |
| LSTM | (431, 128) | 131 584 |
| LSTM | (431, 128) | 131 584 |
| Batch normalization | (431, 128) | 512 |
| Dense (per frame) | (431, 21) | 2 709 |
| Output: Sigmoid | (431, 21) | |

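The parameter counts in Table 2 can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming standard LSTM layers (four gates, each with input weights, recurrent weights, and a bias), batch normalization with four values per channel, and a dense layer with bias. It also assumes that the second input dimension (84) is the feature dimension fed to the first LSTM. The function names are illustrative and not taken from the article.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def batch_norm_params(channels: int) -> int:
    """Scale, offset, moving mean, and moving variance per channel."""
    return 4 * channels


def dense_params(input_dim: int, output_dim: int) -> int:
    """Weight matrix plus bias vector."""
    return input_dim * output_dim + output_dim


# Layers of Table 2 in order (Input and Sigmoid carry no parameters).
rnn_counts = [
    lstm_params(84, 128),    # first LSTM reads the 84 input features (assumed)
    lstm_params(128, 128),   # second LSTM reads the 128 outputs of the first
    lstm_params(128, 128),   # third LSTM
    batch_norm_params(128),
    dense_params(128, 21),   # per-frame dense layer with 21 outputs
]
print(rnn_counts, sum(rnn_counts))
```

Under these assumptions, the computed values agree with the per-layer counts listed in Table 2.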
Table 3

Network architecture used for our CNN-based leitmotif activity detection system (inspired by Schlüter and Lehner (2018)). Note that all operations have stride one in time and pitch, except for MaxPool2D, which has stride three in the pitch direction. Dilation rates in time increase after each max-pooling operation.

| Layer | Kernel size | Strides | Dilations | Output Shape | Parameters |
|---|---|---|---|---|---|
| Input | | | | (431, 84) | |
| Expand | | | | (431, 84, 1) | |
| Conv2D | (3, 3) | (1, 1) | (1, 1) | (431, 84, 128) | 1 152 |
| Batch normalization | | | | (431, 84, 128) | 512 |
| Conv2D | (3, 3) | (1, 1) | (1, 1) | (431, 84, 64) | 73 728 |
| Batch normalization | | | | (431, 84, 64) | 256 |
| MaxPool2D | (3, 3) | (1, 3) | (1, 1) | (431, 29, 64) | |
| Conv2D | (3, 3) | (1, 1) | (3, 1) | (431, 29, 128) | 73 728 |
| Batch normalization | | | | (431, 29, 128) | 512 |
| Conv2D | (3, 3) | (1, 1) | (3, 1) | (431, 29, 64) | 73 728 |
| Batch normalization | | | | (431, 29, 64) | 256 |
| MaxPool2D | (3, 3) | (1, 3) | (3, 1) | (431, 10, 64) | |
| Conv2D | (3, 3) | (1, 1) | (9, 1) | (431, 10, 128) | 73 728 |
| Batch normalization | | | | (431, 10, 128) | 512 |
| Conv2D | (3, 3) | (1, 1) | (9, 1) | (431, 10, 64) | 73 728 |
| Batch normalization | | | | (431, 10, 64) | 256 |
| MaxPool2D | (3, 3) | (1, 3) | (9, 1) | (431, 4, 64) | |
| Conv2D | (1, 4) | (1, 1) | (1, 1) | (431, 1, 64) | 16 384 |
| Batch normalization | | | | (431, 1, 64) | 256 |
| Squeeze | | | | (431, 64) | |
| Conv1D | (3) | (1) | (27) | (431, 128) | 24 576 |
| Batch normalization | | | | (431, 128) | 512 |
| Conv1D | (3) | (1) | (27) | (431, 64) | 24 576 |
| Batch normalization | | | | (431, 64) | 256 |
| MaxPool1D | (3) | (1) | (27) | (431, 64) | |
| Dense (per frame) | | | | (431, 21) | 1 365 |
| Output: Sigmoid | | | | (431, 21) | |

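Two properties of Table 3 can be illustrated with simple arithmetic. The sketch below first recomputes the convolutional parameter counts, assuming convolutions without bias terms (an assumption that happens to match the listed numbers). It then estimates how many input frames influence one output frame, assuming that every operation with kernel size k and dilation d in time widens the temporal context by (k − 1)·d frames, which holds because all strides in time are one. Tuples are read as (time, pitch), following the caption. All names are illustrative.

```python
def conv_params(kernel: tuple, in_channels: int, out_channels: int,
                use_bias: bool = False) -> int:
    """Number of weights of a convolution layer (bias optional)."""
    size = 1
    for k in kernel:
        size *= k
    bias = out_channels if use_bias else 0
    return size * in_channels * out_channels + bias


# (kernel, input channels, output channels) for each convolution in Table 3.
CONV_LAYERS = [
    ((3, 3), 1, 128), ((3, 3), 128, 64),
    ((3, 3), 64, 128), ((3, 3), 128, 64),
    ((3, 3), 64, 128), ((3, 3), 128, 64),
    ((1, 4), 64, 64),
    ((3,), 64, 128), ((3,), 128, 64),
]
conv_counts = [conv_params(k, c_in, c_out) for k, c_in, c_out in CONV_LAYERS]

# (kernel size in time, dilation in time) for every convolution and pooling
# layer that spans more than one frame; the (1, 4) convolution is omitted
# because its kernel covers a single frame in time.
TIME_OPS = [(3, 1)] * 3 + [(3, 3)] * 3 + [(3, 9)] * 3 + [(3, 27)] * 3


def receptive_field(ops) -> int:
    """Temporal context in frames for stride-one layers."""
    field = 1
    for kernel, dilation in ops:
        field += (kernel - 1) * dilation
    return field


print(conv_counts)
print(receptive_field(TIME_OPS))
```

Under these assumptions, one output frame would depend on 241 of the 431 input frames. The article itself may define or report the receptive field differently.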
Table 4

Results for our deep learning-based leitmotif activity detection systems on the test set.

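| Leitmotif | RNN P | RNN R | RNN F | CNN P | CNN R | CNN F |
|---|---|---|---|---|---|---|
| L-Ni | 0.87 | 0.76 | 0.81 | 0.85 | 0.79 | 0.82 |
| L-Ri | 0.80 | 0.73 | 0.76 | 0.82 | 0.76 | 0.79 |
| L-NH | 0.89 | 0.78 | 0.83 | 0.91 | 0.82 | 0.86 |
| L-Mi | 0.86 | 0.86 | 0.86 | 0.87 | 0.79 | 0.83 |
| L-RT | 0.85 | 0.86 | 0.85 | 0.80 | 0.83 | 0.82 |
| L-Wa | 0.94 | 0.90 | 0.92 | 0.93 | 0.95 | 0.94 |
| L-WL | 0.86 | 0.85 | 0.85 | 0.83 | 0.85 | 0.84 |
| L-Ho | 0.80 | 0.76 | 0.78 | 0.82 | 0.80 | 0.81 |
| L-Ge | 0.89 | 0.81 | 0.85 | 0.85 | 0.81 | 0.83 |
| L-Sc | 0.74 | 0.72 | 0.73 | 0.83 | 0.72 | 0.77 |
| L-Ju | 0.82 | 0.68 | 0.74 | 0.87 | 0.78 | 0.82 |
| L-WH | 0.79 | 0.77 | 0.78 | 0.78 | 0.76 | 0.77 |
| L-RS | 0.87 | 0.84 | 0.86 | 0.86 | 0.81 | 0.84 |
| L-Fe | 0.87 | 0.88 | 0.88 | 0.93 | 0.86 | 0.89 |
| L-SK | 0.75 | 0.72 | 0.74 | 0.81 | 0.75 | 0.78 |
| L-Un | 0.79 | 0.75 | 0.77 | 0.84 | 0.81 | 0.83 |
| L-Li | 0.89 | 0.81 | 0.85 | 0.82 | 0.84 | 0.83 |
| L-Si | 0.78 | 0.75 | 0.76 | 0.83 | 0.80 | 0.81 |
| L-Ma | 0.79 | 0.81 | 0.80 | 0.87 | 0.79 | 0.83 |
| L-Ve | 0.84 | 0.73 | 0.78 | 0.83 | 0.83 | 0.83 |
| Class mean | 0.83 | 0.79 | 0.81 | 0.85 | 0.81 | 0.83 |
| Matrix mean | 0.83 | 0.78 | 0.80 | 0.85 | 0.80 | 0.82 |
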
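For readers unfamiliar with the P, R, and F columns, the following minimal sketch computes frame-wise precision, recall, and F-measure from binary activity sequences. The toy data, the function name, and the two averaging schemes are assumptions for illustration: "class mean" is read here as the average of the per-class scores, and "matrix mean" as the scores obtained by pooling all classes and frames into one evaluation. The article's exact definitions may differ.

```python
def prf(reference, estimate):
    """Frame-wise precision, recall, and F-measure for two equal-length
    binary sequences (1 = motif active, 0 = inactive)."""
    tp = sum(1 for r, e in zip(reference, estimate) if r and e)
    fp = sum(1 for r, e in zip(reference, estimate) if not r and e)
    fn = sum(1 for r, e in zip(reference, estimate) if r and not e)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy ground truth and thresholded predictions (hypothetical values).
reference = {"L-Ni": [1, 1, 1, 0, 0, 0], "L-Ri": [0, 0, 1, 1, 0, 0]}
estimate = {"L-Ni": [1, 1, 0, 0, 1, 0], "L-Ri": [0, 0, 1, 1, 1, 0]}

per_class = {name: prf(reference[name], estimate[name]) for name in reference}

# Assumed reading of "class mean": average the per-class scores.
class_mean = tuple(
    sum(scores[i] for scores in per_class.values()) / len(per_class)
    for i in range(3)
)

# Assumed reading of "matrix mean": evaluate all classes and frames at once.
flat_ref = [x for name in reference for x in reference[name]]
flat_est = [x for name in reference for x in estimate[name]]
matrix_mean = prf(flat_ref, flat_est)

print(per_class, class_mean, matrix_mean)
```

The two averages generally differ, because pooling weights each class by its number of active frames while the class average treats all classes equally.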
Figure 4

Illustration of results for our RNN-based leitmotif activity detection system (shown for measures 112 to 390 from the first act of Siegfried in P-Ba).

Figure 5

Results for our RNN-based leitmotif activity detection system on measures 117 to 123.5 of the first act of Siegfried in P-Ba (see also Figure 3 and Figure 4; outputs of the CNN-based model are similar). A prominent instance of L-Sc is being played in the higher registers, accompanied by low-frequency tremolo. The model input is shown in the upper row. The respective output activations for the L-Sc class are plotted underneath in red (solid line). The dashed blue line corresponds to the ground truth annotations for L-Sc. The input is given to the network (a) unchanged, (b) slowed down to 175% of the original length, (c) with a pitch shift of eleven semitones, (d) with motif frames replaced by noise, and (e) with motif frames shuffled along the time axis.

Figure 6

Results for our (a) RNN-based and (b) CNN-based leitmotif activity detection systems on the test set under tempo changes. The CQT input is stretched in time (using bilinear resampling) by the given percentage.
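The tempo modification can be sketched as a resampling of the input along the time axis. Because only the time axis changes size, bilinear resampling reduces to linear interpolation between neighboring frames. The sketch below is a dependency-free illustration on plain lists, not the authors' implementation. It assumes that the percentage denotes the new length relative to the original (as in "175% of the original length" in Figure 5) and that the first and last frames are kept fixed. The function name is hypothetical.

```python
def stretch_in_time(frames, percent):
    """Resample a list of frames (each a list of bin values) to
    round(len(frames) * percent / 100) frames by linear interpolation."""
    n_in = len(frames)
    n_out = max(1, round(n_in * percent / 100))
    if n_in == 1 or n_out == 1:
        return [list(frames[0]) for _ in range(n_out)]
    stretched = []
    for i in range(n_out):
        pos = i * (n_in - 1) / (n_out - 1)  # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        w = pos - lo                        # weight of the later frame
        stretched.append(
            [(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])]
        )
    return stretched


# Three frames with two bins each, stretched to twice the length.
example = [[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]]
slow = stretch_in_time(example, 200)
print(len(slow), slow[0], slow[-1])
```

Values above 100 slow the excerpt down, values below 100 speed it up, and 100 returns a copy of the input.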

Figure 7

Results for our (a) RNN-based and (b) CNN-based leitmotif activity detection systems on the test set under pitch shifts. The CQT input has been shifted (using nearest-neighbor padding) on the pitch axis by the given number of semitones (corresponding to CQT bins).
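A pitch shift of this kind can be illustrated as moving the content of each frame along the bin axis and filling the vacated bins with the nearest remaining value. The following minimal sketch assumes one bin per semitone (as stated in the caption), bin indices that increase with pitch, and "nearest-neighbor padding" in the sense of repeating the edge bin. It operates on plain lists, and the function name is hypothetical.

```python
def shift_pitch(frames, semitones):
    """Shift every frame by `semitones` bins (positive = upward).
    Vacated bins are filled with the value of the nearest edge bin."""
    shifted = []
    for frame in frames:
        n = len(frame)
        if n == 0:
            shifted.append([])
            continue
        k = max(-n, min(n, semitones))  # a shift cannot exceed the bin count
        if k >= 0:
            # Repeat the lowest bin k times, drop the k highest bins.
            new_frame = [frame[0]] * k + frame[:n - k]
        else:
            # Drop the |k| lowest bins, repeat the highest bin |k| times.
            new_frame = frame[-k:] + [frame[-1]] * (-k)
        shifted.append(new_frame)
    return shifted


frame = [[1, 2, 3, 4]]
print(shift_pitch(frame, 1))   # content moves one bin upward
print(shift_pitch(frame, -2))  # content moves two bins downward
```

The output keeps the input shape, so the shifted representation can be passed to the same network without further changes.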

Figure 8

Results for our (a) RNN-based and (b) CNN-based leitmotif activity detection systems on the test set when (1) replacing leitmotif frames by noise or (2) shuffling them along the time axis. The modifications are applied to the first, middle, or last third of each leitmotif instance (Start, Middle, End), to no frames (Unchanged), or to all leitmotif frames (All).
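The two modifications can be sketched for a single leitmotif instance as follows. This is an illustration under stated assumptions, not the authors' code: thirds are obtained by integer division of the instance length, the noise is drawn uniformly from [0, 1), and all function names are hypothetical. The caption does not specify the noise distribution or how instance lengths that are not divisible by three are split.

```python
import random


def segment_bounds(start, end, part):
    """Frame range [lo, hi) to modify for an instance spanning [start, end)."""
    third = (end - start) // 3
    if part == "Unchanged":
        return start, start          # empty range, nothing is modified
    if part == "Start":
        return start, start + third
    if part == "Middle":
        return start + third, end - third
    if part == "End":
        return end - third, end
    if part == "All":
        return start, end
    raise ValueError(f"unknown part: {part}")


def replace_by_noise(frames, lo, hi, rng):
    """Return a copy with frames lo..hi-1 replaced by uniform noise."""
    out = [list(f) for f in frames]
    for t in range(lo, hi):
        out[t] = [rng.random() for _ in out[t]]
    return out


def shuffle_in_time(frames, lo, hi, rng):
    """Return a copy with frames lo..hi-1 randomly permuted in time."""
    out = [list(f) for f in frames]
    segment = out[lo:hi]
    rng.shuffle(segment)
    out[lo:hi] = segment
    return out


rng = random.Random(0)                         # fixed seed for repeatability
frames = [[float(t)] for t in range(12)]       # 12 frames, one bin each
lo, hi = segment_bounds(3, 12, "Middle")       # instance covers frames 3..11
noisy = replace_by_noise(frames, lo, hi, rng)
shuffled = shuffle_in_time(frames, lo, hi, rng)
print(lo, hi)
```

Shuffling keeps the frames of the segment but destroys their temporal order, whereas noise replacement removes the spectral content altogether. Frames outside the selected range are left untouched in both cases.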

DOI: https://doi.org/10.5334/tismir.116 | Journal eISSN: 2514-3298
Language: English
Submitted on: May 26, 2021 | Accepted on: Sep 13, 2021 | Published on: Nov 2, 2021
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2021 Michael Krause, Meinard Müller, Christof Weiß, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.