
Figure 1
Illustration of a leitmotif (here the Ring motif L-Ri) and its manifestations as (a) leitmotif occurrences in the score, (b) leitmotif instances in several recorded performances (audio), (c) continuous leitmotif activity output by a detection system.

Figure 2
Structure of Richard Wagner’s Ring cycle and overview of 16 recorded performances, see also Zalkow et al. (2017a). Measure positions have been annotated manually for the topmost three performances (P-Ka, P-Ba, and P-Ha), which also constitute the test set in our performance split. The three middle performances (P-Sa, P-So, and P-We) constitute the validation set. All other performances are used for training.
Table 1
Overview of the 20 leitmotifs used in this study (the first ten of these motifs were previously used in Krause et al. (2020)). Score examples shown are adapted from Wagner (2013). Lengths are given as means and standard deviations over all annotated occurrences (in measures) or instances (in seconds) from all performances given in Figure 2. Counts and lengths differ from Krause et al. (2020), because we allow for concurrent motif activity in this study.
| Name (English translation) | ID | Score | # Occurrences | Length (measures) | Length (seconds) |
|---|---|---|---|---|---|
| Nibelungen (Nibelungs) | L-Ni | ![]() | 562 | 0.95 ± 0.24 | 1.72 ± 0.50 |
| Ring (Ring) | L-Ri | ![]() | 297 | 1.50 ± 0.66 | 3.77 ± 2.46 |
| Nibelungenhass (Nibelungs’ hate) | L-NH | ![]() | 252 | 0.96 ± 0.17 | 3.22 ± 1.20 |
| Mime (Mime) | L-Mi | ![]() | 243 | 0.83 ± 0.25 | 0.84 ± 0.20 |
| Ritt (Ride) | L-RT | ![]() | 228 | 0.66 ± 0.17 | 1.26 ± 0.38 |
| Waldweben (Forest murmurs) | L-Wa | ![]() | 228 | 1.10 ± 0.30 | 2.65 ± 0.73 |
| Waberlohe (Swirling blaze) | L-WL | ![]() | 194 | 1.21 ± 0.39 | 4.59 ± 1.70 |
| Horn (Horn) | L-Ho | ![]() | 195 | 1.30 ± 1.02 | 2.34 ± 1.51 |
| Geschwisterliebe (Siblings’ love) | L-Ge | ![]() | 158 | 1.32 ± 0.84 | 3.13 ± 2.65 |
| Schwert (Sword) | L-Sc | ![]() | 148 | 1.88 ± 0.63 | 3.73 ± 1.99 |
| Jugendkraft (Youthful vigor) | L-Ju | ![]() | 146 | 1.23 ± 0.57 | 0.96 ± 0.38 |
| Walhall-b (Valhalla-b) | L-WH | ![]() | 143 | 1.10 ± 0.47 | 3.53 ± 2.14 |
| Riesen (Giants) | L-RS | ![]() | 136 | 0.95 ± 0.39 | 2.83 ± 1.96 |
| Feuerzauber (Magic fire) | L-Fe | ![]() | 112 | 1.18 ± 0.40 | 3.57 ± 1.09 |
| Schicksal (Fate) | L-SK | ![]() | 94 | 2.02 ± 0.47 | 8.11 ± 2.64 |
| Unmuth (Upset) | L-Un | ![]() | 92 | 1.87 ± 0.70 | 5.85 ± 3.21 |
| Liebe (Love) | L-Li | ![]() | 89 | 1.78 ± 0.51 | 5.54 ± 2.47 |
| Siegfried (Siegfried) | L-Si | ![]() | 86 | 2.88 ± 1.60 | 8.03 ± 5.46 |
| Mannen (Men) | L-Ma | ![]() | 83 | 1.15 ± 0.50 | 1.37 ± 0.70 |
| Vertrag (Contract) | L-Ve | ![]() | 83 | 2.29 ± 0.65 | 5.72 ± 2.12 |

Figure 3
Illustration of our ground truth occurrence annotations. Measures 112 to 390 from the first act of Siegfried are shown. For instance, L-Ni is active around measure 150, whereas L-SK is never active throughout this excerpt.
Table 2
Network architecture used for our RNN-based leitmotif activity detection system (adapted from Krause et al. (2020)).
| Layer | Output Shape | Parameters |
|---|---|---|
| Input | (431, 84) | |
| LSTM | (431, 128) | 109 056 |
| LSTM | (431, 128) | 131 584 |
| LSTM | (431, 128) | 131 584 |
| Batch normalization | (431, 128) | 512 |
| Dense (per frame) | (431, 21) | 2 709 |
| Output: Sigmoid | (431, 21) | |
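
The parameter counts in Table 2 can be reproduced with a few lines of arithmetic. The sketch below is not the authors' code; the helper names are hypothetical, and it assumes standard LSTM layers with a bias term and a batch normalization layer that is counted with four values per channel (scale, offset, moving mean, moving variance), as Keras-style summaries do.

```python
def lstm_params(input_dim: int, units: int) -> int:
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (input_dim + units + 1) * units


def batch_norm_params(channels: int) -> int:
    # Assumption: scale, offset, moving mean, and moving variance are all counted.
    return 4 * channels


def dense_params(input_dim: int, units: int) -> int:
    # Weight matrix plus one bias per output unit.
    return (input_dim + 1) * units


# Layers in the order of Table 2: 84 CQT bins in, 21 output classes.
rnn_counts = [
    lstm_params(84, 128),
    lstm_params(128, 128),
    lstm_params(128, 128),
    batch_norm_params(128),
    dense_params(128, 21),
]
print(rnn_counts)  # [109056, 131584, 131584, 512, 2709]
```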
Table 3
Network architecture used for our CNN-based leitmotif activity detection system (inspired by Schlüter and Lehner (2018)). Note that all operations have stride one in time and pitch, except for MaxPool2D, which has stride three in the pitch direction. Dilation rates in time increase after each max-pooling operation.
| Layer (Kernel size), (Strides), (Dilations) | Output Shape | Parameters |
|---|---|---|
| Input | (431, 84) | |
| Expand | (431, 84, 1) | |
| Conv2D (3, 3), (1, 1), (1, 1) | (431, 84, 128) | 1 152 |
| Batch normalization | (431, 84, 128) | 512 |
| Conv2D (3, 3), (1, 1), (1, 1) | (431, 84, 64) | 73 728 |
| Batch normalization | (431, 84, 64) | 256 |
| MaxPool2D (3, 3), (1, 3), (1, 1) | (431, 29, 64) | |
| Conv2D (3, 3), (1, 1), (3, 1) | (431, 29, 128) | 73 728 |
| Batch normalization | (431, 29, 128) | 512 |
| Conv2D (3, 3), (1, 1), (3, 1) | (431, 29, 64) | 73 728 |
| Batch normalization | (431, 29, 64) | 256 |
| MaxPool2D (3, 3), (1, 3), (3, 1) | (431, 10, 64) | |
| Conv2D (3, 3), (1, 1), (9, 1) | (431, 10, 128) | 73 728 |
| Batch normalization | (431, 10, 128) | 512 |
| Conv2D (3, 3), (1, 1), (9, 1) | (431, 10, 64) | 73 728 |
| Batch normalization | (431, 10, 64) | 256 |
| MaxPool2D (3, 3), (1, 3), (9, 1) | (431, 4, 64) | |
| Conv2D (1, 4), (1, 1), (1, 1) | (431, 1, 64) | 16 384 |
| Batch normalization | (431, 1, 64) | 256 |
| Squeeze | (431, 64) | |
| Conv1D (3), (1), (27) | (431, 128) | 24 576 |
| Batch normalization | (431, 128) | 512 |
| Conv1D (3), (1), (27) | (431, 64) | 24 576 |
| Batch normalization | (431, 64) | 256 |
| MaxPool1D (3), (1), (27) | (431, 64) | |
| Dense (per frame) | (431, 21) | 1 365 |
| Output: Sigmoid | (431, 21) | |
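
As a rough cross-check of Table 3, the following sketch recomputes the convolutional parameter counts and estimates how many input frames influence a single output frame. It rests on two assumptions that the table does not state explicitly: the convolutions have no bias term (the tabulated counts match only in that case), and every tuple is ordered (time, pitch). The helper `conv_params` is hypothetical.

```python
def conv_params(kernel: tuple, in_channels: int, out_channels: int) -> int:
    # Assumption: no bias term, so only the kernel weights are counted.
    count = in_channels * out_channels
    for size in kernel:
        count *= size
    return count


# Convolutional layers in the order of Table 3.
conv_counts = [
    conv_params((3, 3), 1, 128),
    conv_params((3, 3), 128, 64),
    conv_params((3, 3), 64, 128),
    conv_params((3, 3), 128, 64),
    conv_params((3, 3), 64, 128),
    conv_params((3, 3), 128, 64),
    conv_params((1, 4), 64, 64),
    conv_params((3,), 64, 128),
    conv_params((3,), 128, 64),
]
dense_count = (64 + 1) * 21  # final per-frame dense layer, with bias

# Temporal receptive field: every layer with kernel size 3 in time
# (two convolutions and one max-pooling per dilation rate) widens the
# field by (kernel - 1) * dilation frames, since all time strides are one.
time_layers = [(3, 1)] * 3 + [(3, 3)] * 3 + [(3, 9)] * 3 + [(3, 27)] * 3
receptive_field = 1 + sum((k - 1) * d for k, d in time_layers)

print(conv_counts, dense_count)
print(receptive_field)  # 241 frames under the assumptions above
```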
Table 4
Results for our deep learning-based leitmotif activity detection systems on the test set.
| Leitmotif | RNN P | RNN R | RNN F | CNN P | CNN R | CNN F |
|---|---|---|---|---|---|---|
| L-Ni | 0.87 | 0.76 | 0.81 | 0.85 | 0.79 | 0.82 |
| L-Ri | 0.80 | 0.73 | 0.76 | 0.82 | 0.76 | 0.79 |
| L-NH | 0.89 | 0.78 | 0.83 | 0.91 | 0.82 | 0.86 |
| L-Mi | 0.86 | 0.86 | 0.86 | 0.87 | 0.79 | 0.83 |
| L-RT | 0.85 | 0.86 | 0.85 | 0.80 | 0.83 | 0.82 |
| L-Wa | 0.94 | 0.90 | 0.92 | 0.93 | 0.95 | 0.94 |
| L-WL | 0.86 | 0.85 | 0.85 | 0.83 | 0.85 | 0.84 |
| L-Ho | 0.80 | 0.76 | 0.78 | 0.82 | 0.80 | 0.81 |
| L-Ge | 0.89 | 0.81 | 0.85 | 0.85 | 0.81 | 0.83 |
| L-Sc | 0.74 | 0.72 | 0.73 | 0.83 | 0.72 | 0.77 |
| L-Ju | 0.82 | 0.68 | 0.74 | 0.87 | 0.78 | 0.82 |
| L-WH | 0.79 | 0.77 | 0.78 | 0.78 | 0.76 | 0.77 |
| L-RS | 0.87 | 0.84 | 0.86 | 0.86 | 0.81 | 0.84 |
| L-Fe | 0.87 | 0.88 | 0.88 | 0.93 | 0.86 | 0.89 |
| L-SK | 0.75 | 0.72 | 0.74 | 0.81 | 0.75 | 0.78 |
| L-Un | 0.79 | 0.75 | 0.77 | 0.84 | 0.81 | 0.83 |
| L-Li | 0.89 | 0.81 | 0.85 | 0.82 | 0.84 | 0.83 |
| L-Si | 0.78 | 0.75 | 0.76 | 0.83 | 0.80 | 0.81 |
| L-Ma | 0.79 | 0.81 | 0.80 | 0.87 | 0.79 | 0.83 |
| L-Ve | 0.84 | 0.73 | 0.78 | 0.83 | 0.83 | 0.83 |
| Class mean | 0.83 | 0.79 | 0.81 | 0.85 | 0.81 | 0.83 |
| Matrix mean | 0.83 | 0.78 | 0.80 | 0.85 | 0.80 | 0.82 |
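
The two summary rows of Table 4 can be read as two ways of averaging frame-wise precision (P), recall (R), and F-measure (F). The sketch below is an illustration only. It assumes that "Class mean" averages the per-class scores and that "Matrix mean" evaluates the full binary frames-by-classes matrix at once; both readings, the toy data, and all function names are assumptions rather than the authors' evaluation code.

```python
import numpy as np


def precision_recall_f(reference: np.ndarray, estimate: np.ndarray):
    # Count true positives, false positives, and false negatives over all entries.
    tp = int(np.sum(reference & estimate))
    fp = int(np.sum(~reference & estimate))
    fn = int(np.sum(reference & ~estimate))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def class_mean(reference: np.ndarray, estimate: np.ndarray):
    # Score each class (column) separately, then average the scores.
    per_class = [
        precision_recall_f(reference[:, c], estimate[:, c])
        for c in range(reference.shape[1])
    ]
    return tuple(float(v) for v in np.mean(per_class, axis=0))


def matrix_mean(reference: np.ndarray, estimate: np.ndarray):
    # Score the whole (frames, classes) matrix in one go.
    return precision_recall_f(reference, estimate)


# Toy example with four frames and two classes (already thresholded).
ref = np.array([[1, 0], [1, 0], [0, 1], [0, 0]]).astype(bool)
est = np.array([[1, 0], [0, 0], [0, 1], [0, 1]]).astype(bool)
print(class_mean(ref, est))
print(matrix_mean(ref, est))
```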

Figure 4
Illustration of results for our RNN-based leitmotif activity detection system (shown for measures 112 to 390 from the first act of Siegfried in P-Ba).

Figure 5
Results for our RNN-based leitmotif activity detection system on measures 117 to 123.5 of the first act of Siegfried in P-Ba (see also Figure 3 and Figure 4; outputs of the CNN-based model are similar). A prominent instance of L-Sc is being played in the higher registers, accompanied by low-frequency tremolo. The model input is shown in the upper row. The respective output activations for the L-Sc class are plotted underneath in red (solid line). The dashed blue line corresponds to the ground truth annotations for L-Sc. The input is given to the network (a) unchanged, (b) slowed down to 175% of the original length, (c) with a pitch shift of eleven semitones, (d) with motif frames replaced by noise, and (e) with motif frames shuffled along the time axis.

Figure 6
Results for our (a) RNN-based and (b) CNN-based leitmotif activity detection systems on the test set under tempo changes. The CQT input is stretched in time (using bilinear resampling) by the given percentage.
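
A tempo change of this kind could be simulated roughly as follows. This is a minimal sketch, not the authors' implementation: it assumes a CQT array of shape (frames, bins) and uses linear interpolation along the time axis, which is what bilinear resampling amounts to when the pitch axis keeps its size (up to differences in how edge samples are aligned). The function name `stretch_time` is hypothetical.

```python
import numpy as np


def stretch_time(cqt: np.ndarray, percent: float) -> np.ndarray:
    """Resample a (frames, bins) array to `percent` % of its original length."""
    n_in = cqt.shape[0]
    n_out = max(1, int(round(n_in * percent / 100)))
    # Positions in the original frame grid at which the output is sampled.
    source_pos = np.linspace(0, n_in - 1, n_out)
    frames = np.arange(n_in)
    # Interpolate every pitch bin independently along time.
    columns = [np.interp(source_pos, frames, cqt[:, b]) for b in range(cqt.shape[1])]
    return np.stack(columns, axis=1)


rng = np.random.default_rng(0)
excerpt = rng.random((431, 84))          # random stand-in for a CQT excerpt
slowed = stretch_time(excerpt, 175)      # longer input, i.e. slower tempo
print(excerpt.shape, slowed.shape)
```

For evaluation, the frame-wise ground truth would have to be stretched by the same factor; that step is not shown here.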

Figure 7
Results for our (a) RNN-based and (b) CNN-based leitmotif activity detection systems on the test set under pitch shifts. The CQT input has been shifted (using nearest-neighbor padding) on the pitch axis by the given number of semitones (corresponding to CQT bins).
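
The pitch shift described here can be sketched as an index shift along the pitch axis. The snippet assumes a CQT array of shape (frames, bins) with one bin per semitone and fills the vacated bins by repeating the nearest edge bin, which is one plausible reading of "nearest-neighbor padding"; the function name `shift_pitch` is hypothetical.

```python
import numpy as np


def shift_pitch(cqt: np.ndarray, semitones: int) -> np.ndarray:
    """Shift a (frames, bins) array along the pitch axis by `semitones` bins."""
    n_bins = cqt.shape[1]
    # Output bin b reads from input bin b - semitones; indices that fall
    # outside the array are clipped, which repeats the edge bin.
    source_bins = np.clip(np.arange(n_bins) - semitones, 0, n_bins - 1)
    return cqt[:, source_bins]


# Toy input in which every bin holds its own index, so shifts are easy to read.
toy = np.tile(np.arange(84, dtype=float), (5, 1))
up = shift_pitch(toy, 2)
down = shift_pitch(toy, -2)
print(up[0, :4], down[0, -4:])
```

Because only bin indices change, the annotations stay aligned in time and need no adjustment.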

Figure 8
Results for our (a) RNN-based and (b) CNN-based leitmotif activity detection systems on the test set when (1) replacing leitmotif frames by noise or (2) shuffling them along the time axis. The modifications have been applied to either the first, middle, or last third of each leitmotif instance (Start, Middle, End), for none (Unchanged), or for all leitmotif frames (All).
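
The two modifications could be sketched as follows for a single leitmotif instance. All names are hypothetical, and several details are assumptions that the caption leaves open: the instance is given as a frame range [start, end), thirds are obtained by integer division, and the noise is drawn uniformly between the minimum and maximum of the input.

```python
import numpy as np


def part_bounds(start: int, end: int, part: str):
    """Frame range [lo, hi) for one third of an instance, or for all of it."""
    if part == "all":
        return start, end
    length = end - start
    cuts = [start, start + length // 3, start + (2 * length) // 3, end]
    index = {"start": 0, "middle": 1, "end": 2}[part]
    return cuts[index], cuts[index + 1]


def corrupt(cqt, start, end, part, mode, rng):
    """Return a copy of a (frames, bins) array with one part of an instance modified."""
    out = cqt.copy()
    lo, hi = part_bounds(start, end, part)
    if mode == "noise":
        # Assumption: uniform noise within the value range of the input.
        out[lo:hi] = rng.uniform(cqt.min(), cqt.max(), size=out[lo:hi].shape)
    elif mode == "shuffle":
        # Permute whole frames along the time axis.
        out[lo:hi] = out[lo:hi][rng.permutation(hi - lo)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out


rng = np.random.default_rng(0)
toy = np.arange(48, dtype=float).reshape(12, 4)   # 12 frames, 4 bins
shuffled = corrupt(toy, 3, 9, "middle", "shuffle", rng)
noisy = corrupt(toy, 3, 9, "all", "noise", rng)
print(part_bounds(3, 9, "middle"))  # (5, 7)
```

Frames outside the selected part are left untouched, so any change in the detection output can be attributed to the modified third.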
