Score Following as a Multi-Modal Reinforcement Learning Problem

Open Access | Nov 2019

Figures & Tables

Figure 1

Sketch of score following in sheet music. Given the incoming audio, the score follower has to track the corresponding position in the score (image).

Figure 2

Sketch of the score following MDP. The agent receives the current state of the environment S_t and a scalar reward signal R_t for the action taken in the previous time step. Based on the current state, it has to choose an action (e.g., decide whether to increase, keep, or decrease its speed in the score) in order to maximize future reward by correctly following the performance in the score.
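
The interaction loop in this caption can be sketched as a toy environment. This is only an illustration under stated assumptions: the real state consists of a sheet image excerpt and a spectrogram excerpt, whereas here it is reduced to a (position, speed) pair. The class name `ToyScoreFollowingEnv` and the parameters `delta_v` and `max_dist` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

# Speed change per action index: 0 = decrease, 1 = keep, 2 = increase.
SPEED_CHANGES = (-1.0, 0.0, +1.0)


@dataclass
class ToyScoreFollowingEnv:
    """Toy stand-in for the score following MDP (state simplified to numbers)."""
    true_positions: list      # ground-truth score position per time step (pixels)
    delta_v: float = 1.0      # hypothetical speed increment per action
    max_dist: float = 50.0    # hypothetical distance at which the reward reaches 0
    t: int = 0
    pos: float = 0.0
    speed: float = 0.0

    def reset(self):
        """Start at the first ground-truth position with zero speed."""
        self.t, self.pos, self.speed = 0, float(self.true_positions[0]), 0.0
        return (self.pos, self.speed)

    def step(self, action):
        """Apply a speed change, move the agent, and return (state, reward, done)."""
        self.speed += SPEED_CHANGES[action] * self.delta_v
        self.pos += self.speed
        self.t += 1
        d_x = self.pos - self.true_positions[self.t]
        # Reward decays linearly with |d_x| (clipping at 0 is an assumption).
        reward = max(0.0, 1.0 - abs(d_x) / self.max_dist)
        # Episode ends at the last time step or when the agent is too far off.
        done = self.t == len(self.true_positions) - 1 or abs(d_x) >= self.max_dist
        return (self.pos, self.speed), reward, done
```

A policy would call `reset()` once and then `step(action)` repeatedly until `done` is true, accumulating the rewards it aims to maximize.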

Figure 3

Markov state of the score following MDP: the current sheet sliding window and spectrogram excerpt. To capture the dynamics of the environment, we also add the one-step differences (Δ) w.r.t. the previous time step (state).
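
As a rough illustration of the one-step differences, the sketch below pairs a 2-D input (a sheet window or a spectrogram excerpt) with its element-wise difference to the previous time step. The function name `with_delta` and the two-channel layout are assumptions made for this example; the caption does not specify how the differences are arranged in the actual input.

```python
def with_delta(current, previous):
    """Return [current, current - previous] for two equally sized 2-D lists.

    The two-channel layout is an assumption for illustration only.
    """
    delta = [
        [c - p for c, p in zip(cur_row, prev_row)]
        for cur_row, prev_row in zip(current, previous)
    ]
    return [current, delta]


# Tiny 2 x 2 stand-in for a spectrogram excerpt at two consecutive time steps.
previous = [[0.0, 1.0], [2.0, 3.0]]
current = [[0.5, 1.0], [2.0, 5.0]]
state = with_delta(current, previous)
```

The same helper would be applied separately to the sheet window and to the spectrogram excerpt.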

Figure 4

Reward definition in the score following MDP. The reward R_t (range [0, 1]) decays linearly with the agent’s distance d_x from the current true score position x.
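
A linear decay of this kind could be written as follows. The caption gives only the range [0, 1] and the linear shape, so the normalizing distance `max_dist` (the distance at which the reward reaches 0) and the clipping at 0 are assumptions of this sketch, not values from the paper.

```python
def linear_reward(agent_pos, true_pos, max_dist):
    """Reward in [0, 1] that decays linearly with the tracking distance.

    max_dist is a hypothetical parameter: the distance at which the reward is 0.
    """
    d_x = agent_pos - true_pos
    return max(0.0, 1.0 - abs(d_x) / max_dist)


on_target = linear_reward(100.0, 100.0, max_dist=50.0)   # distance 0
halfway = linear_reward(125.0, 100.0, max_dist=50.0)     # distance 25
too_far = linear_reward(10.0, 100.0, max_dist=50.0)      # distance 90, clipped
```

Because the absolute distance is used, being ahead of or behind the target by the same amount yields the same reward.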

Figure 5

Multi-modal network architecture used for our score following agents. Given state s, the policy network predicts the action selection probability π(a|s;θ) for the allowed actions a ∈ {−Δv_pxl, 0, +Δv_pxl}. The value network, sharing parameters with the policy network, provides a state-value estimate v̂(s;w) for the current state.
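
To make the policy output concrete, the sketch below turns three raw scores into action probabilities with a softmax and then picks one of the three speed actions. The logits are placeholders, and the helper names `softmax` and `select_action` are hypothetical; only the three-action set comes from the caption.

```python
import math
import random

ACTIONS = ("decrease", "keep", "increase")  # -Δv_pxl, 0, +Δv_pxl


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits, rng, greedy=False):
    """Return (action name, probabilities); sample unless greedy is set."""
    probs = softmax(logits)
    if greedy:
        index = max(range(len(probs)), key=probs.__getitem__)
    else:
        index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return ACTIONS[index], probs


# Placeholder logits, not values produced by the trained network.
action, probs = select_action([0.1, 0.3, 2.0], random.Random(0), greedy=True)
```

Sampling is typically used during training to keep exploring, while a greedy choice is one possible option at evaluation time.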

Figure 6

Optimal tempo curve and corresponding optimal actions A_t for a continuous agent (piece: J. S. Bach, BWV 994). The A_t would be the target values for training an agent with supervised, feed-forward regression.
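
One way to read this caption: if the true score position is known at every time step, the optimal speed is the position difference between consecutive steps, and the optimal continuous action is the change in that speed. The sketch below follows this reading; the function name `optimal_actions` and the assumed initial speed of 0 are illustrative choices, not details taken from the paper.

```python
def optimal_actions(positions, initial_speed=0.0):
    """Derive per-step speeds and speed changes from ground-truth positions.

    positions: true score position (e.g., in pixels) at each time step.
    initial_speed: assumed agent speed before the first step.
    """
    speeds = [nxt - cur for cur, nxt in zip(positions, positions[1:])]
    actions = []
    previous_speed = initial_speed
    for speed in speeds:
        actions.append(speed - previous_speed)
        previous_speed = speed
    return speeds, actions


speeds, actions = optimal_actions([0.0, 2.0, 4.0, 7.0])
```

Here `speeds` plays the role of the tempo curve and `actions` the regression targets A_t.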

Table 1

Network architecture used for the Nottingham dataset. Conv (3, stride-1)-16: 3 × 3 convolution, 16 feature maps and stride 1. No zero-padding is applied. We use ELU activation on all layers if not stated otherwise.

| Audio (Spectrogram) 78 × 40 | Sheet-Image 40 × 150 |
|---|---|
| Conv (3, stride-2)-16 | Conv (5, stride-(1, 2))-16 |
| Conv (3, stride-2)-32 | Conv (3, stride-2)-32 |
| Conv (3, stride-2)-32 | Conv (3, stride-2)-32 |
| Conv (3, stride-1)-64 | Conv (3, stride-(1, 2))-64 |
| Concatenation + Dense (256) (spans both columns) | |
| Dense (256) | Dense (256) |
| Dense (3) – Softmax | Dense (1) – Linear |
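
Since no zero-padding is applied, the feature map sizes of the two convolutional branches in Table 1 can be traced with the usual formula for unpadded convolutions, out = floor((in − kernel) / stride) + 1. The sketch below does this under two assumptions that the table does not state: input sizes are read as (height, width), and a stride written as (1, 2) is read as (vertical, horizontal).

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


def trace(shape, layers):
    """Apply (kernel, (stride_h, stride_w), channels) layers to an (h, w) shape."""
    h, w = shape
    channels = 1
    for kernel, (stride_h, stride_w), out_channels in layers:
        h = conv_out(h, kernel, stride_h)
        w = conv_out(w, kernel, stride_w)
        channels = out_channels
    return h, w, channels


# Layer lists transcribed from Table 1 (Nottingham architecture).
AUDIO_LAYERS = [(3, (2, 2), 16), (3, (2, 2), 32), (3, (2, 2), 32), (3, (1, 1), 64)]
SHEET_LAYERS = [(5, (1, 2), 16), (3, (2, 2), 32), (3, (2, 2), 32), (3, (1, 2), 64)]

audio_shape = trace((78, 40), AUDIO_LAYERS)    # (6, 2, 64) under these assumptions
sheet_shape = trace((40, 150), SHEET_LAYERS)   # (6, 8, 64) under these assumptions
```

Under this reading, the flattened branch outputs that feed the concatenation layer would have 6 × 2 × 64 = 768 and 6 × 8 × 64 = 3072 values, respectively.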
Table 2

Network architecture used for MSMD. DO: Dropout; Conv (3, stride-1)-16: 3 × 3 convolution, 16 feature maps and stride 1. No zero-padding is applied. We use ELU activation on all layers if not stated otherwise.

| Audio (Spectrogram) 78 × 40 | Sheet-Image 80 × 256 |
|---|---|
| Conv (3, stride-1)-16 | Conv (5, stride-(1, 2))-16 |
| Conv (3, stride-1)-16 | Conv (3, stride-1)-16 |
| Conv (3, stride-2)-32 | Conv (3, stride-2)-32 |
| Conv (3, stride-1)-32 + DO (0.2) | Conv (3, stride-1)-32 + DO (0.2) |
| Conv (3, stride-2)-64 | Conv (3, stride-2)-32 |
| Conv (3, stride-2)-96 | Conv (3, stride-2)-64 + DO (0.2) |
| Conv (1, stride-1)-96 + DO (0.2) | Conv (3, stride-2)-96 |
| Dense (512) | Conv (1, stride-1)-96 + DO (0.2) |
| | Dense (512) |
| Concatenation + Dense (512) (spans both columns) | |
| Dense (256) + DO (0.2) | Dense (256) + DO (0.2) |
| Dense (3) – Softmax | Dense (1) – Linear |

Table 3

Comparison of score following approaches. MIDI-ODTW considers a perfectly extracted score MIDI file and aligns it to a performance with ODTW. OMR-ODTW does the same, but uses a score MIDI file extracted by an OMR system. MM-Loc is obtained by using the method presented by Dorfer et al. (2016) with a temporal context of 4 and 2 seconds for Nottingham and MSMD, respectively. For MSMD, we use the models from the references and re-evaluate them on the cleaned dataset. For A2C, PPO and REINFORCEbl we report the average over 10 evaluation runs. The mean absolute tracking error and its standard deviation are given in centimeters.

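Test sets: Nottingham (monophonic, 46 test pieces) and MSMD (polyphonic, 94 test pieces).

| Method | Nottingham R_tue | Nottingham R_on | Nottingham mean abs(d_x) (cm) | Nottingham std abs(d_x) (cm) | MSMD R_tue | MSMD R_on | MSMD mean abs(d_x) (cm) | MSMD std abs(d_x) (cm) |
|---|---|---|---|---|---|---|---|---|
| MIDI-ODTW (upper bound) | 1.00 | 1.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.00 |
| OMR-ODTW | 0.89 | 0.95 | 0.04 | 0.09 | 0.77 | 0.87 | 0.63 | 0.98 |
| MM-Loc (Dorfer et al., 2018b) | 0.65 | 0.83 | 0.08 | 0.28 | 0.55 | 0.60 | 0.29 | 1.07 |
| A2C (Dorfer et al., 2018b) | 0.96 | 0.99 | 0.08 | 0.12 | 0.76 | 0.77 | 0.69 | 0.81 |
| REINFORCEbl | 0.97 | 0.99 | 0.06 | 0.09 | 0.59 | 0.70 | 1.06 | 1.07 |
| A2C | 0.96 | 0.99 | 0.07 | 0.09 | 0.75 | 0.77 | 0.68 | 0.82 |
| PPO | 0.99 | 0.99 | 0.06 | 0.09 | 0.81 | 0.80 | 0.65 | 0.81 |

Figure 7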

Two-dimensional t-SNE projection of the 512-dimensional embeddings taken from the network’s concatenation layer (see Figure 5). Each point in the scatter plot corresponds to an audio–score input tuple. The color encodes the predicted value v̂(s;w). (Figure inspired by Mnih et al. (2015).)

Figure 8

Two examples of policy outputs. (a) Agent is behind the target, resulting in a high probability for increasing the pixel speed (π(+Δv_pxl|s;θ) = 0.795). (b) Agent is ahead of the target, suggesting a reduction of pixel speed (π(−Δv_pxl|s;θ) = 0.903).

Figure 9

Visualization of the agent’s focus on different parts of the input state for the situation shown in Figure 8b. The salience map was created via integrated gradients (Sundararajan et al., 2017), a technique to identify the most relevant input features for the agent’s decision—in this case, for decreasing its pixel speed.
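
For readers unfamiliar with integrated gradients, the sketch below approximates the attribution integral with a midpoint Riemann sum along the straight path from a baseline to the input. It uses a toy quadratic model with an analytic gradient instead of the agent's network; `toy_model`, `toy_grad`, `WEIGHTS`, and the all-zero baseline are assumptions made for this example.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn maps an input (list of floats) to the gradient of the model output.
    """
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line between baseline and input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


WEIGHTS = [1.0, -2.0, 0.5]  # hypothetical toy model parameters


def toy_model(x):
    """Toy scalar model: weighted sum of squared inputs."""
    return sum(w * xi * xi for w, xi in zip(WEIGHTS, x))


def toy_grad(x):
    """Analytic gradient of toy_model."""
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)
```

A useful sanity check is the completeness property: the attributions should sum to the difference between the model output at the input and at the baseline.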

Table 4

Comparison of score following approaches on real performances. To get a more robust estimate of the performance of the RL agents (REINFORCEbl, A2C and PPO), we report the average over 50 evaluation runs. MM-Loc is the supervised baseline presented by Dorfer et al. (2016). MIDI-ODTW and OMR-ODTW are the ODTW baselines described in Section 4.2. The mean absolute tracking error and its standard deviation are given in centimeters.

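| Method | R_tue | R_on | mean abs(d_x) (cm) | std abs(d_x) (cm) |
|---|---|---|---|---|
| Original MIDI Synthesized (Score = Performance) | | | | |
| MIDI-ODTW | 1.00 | 1.00 | 0.00 | 0.01 |
| OMR-ODTW | 0.62 | 0.80 | 0.85 | 1.12 |
| MM-Loc | 0.44 | 0.45 | 0.38 | 1.14 |
| REINFORCEbl | 0.56 | 0.59 | 1.15 | 1.14 |
| A2C | 0.70 | 0.63 | 0.65 | 0.82 |
| PPO | 0.74 | 0.68 | 0.7 | 0.87 |
| Performance MIDI Synthesized | | | | |
| MIDI-ODTW | 0.81 | 0.94 | 0.50 | 0.76 |
| OMR-ODTW | 0.50 | 0.72 | 0.90 | 1.08 |
| MM-Loc | 0.25 | 0.51 | 0.36 | 0.99 |
| REINFORCEbl | 0.14 | 0.31 | 1.80 | 1.48 |
| A2C | 0.58 | 0.51 | 0.94 | 0.94 |
| PPO | 0.56 | 0.50 | 0.94 | 1.01 |
| Direct Out | | | | |
| MIDI-ODTW | 0.88 | 0.92 | 0.59 | 0.79 |
| OMR-ODTW | 0.50 | 0.67 | 0.93 | 1.15 |
| MM-Loc | 0.19 | 0.32 | 0.55 | 1.42 |
| REINFORCEbl | 0.33 | 0.43 | 1.42 | 1.29 |
| A2C | 0.49 | 0.55 | 0.97 | 1.06 |
| PPO | 0.51 | 0.53 | 1.01 | 1.11 |
| Room Recording | | | | |
| MIDI-ODTW | 0.81 | 0.93 | 0.64 | 0.84 |
| OMR-ODTW | 0.50 | 0.65 | 0.93 | 1.09 |
| MM-Loc | 0.00 | 0.19 | 0.68 | 1.58 |
| REINFORCEbl | 0.08 | 0.37 | 1.52 | 1.34 |
| A2C | 0.38 | 0.50 | 1.11 | 1.12 |
| PPO | 0.30 | 0.43 | 1.26 | 1.24 |

Table 5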

Hyperparameter overview.

| Hyperparameter | Value |
|---|---|
| Adam learning rate | 10^-4 |
| Adam decay rates (β1, β2) | (0.9, 0.999) |
| Patience | 50 |
| Learning rate multiplier | 0.1 |
| Refinements | 2 |
| Time horizon t_max | 15 |
| Number of actors | 8 |
| Entropy regularization | 0.05 |
| Discount factor γ | 0.9 |
| GAE parameter λ | 0.95 |
| PPO clipping parameter ε | 0.2 |
| PPO epochs | 1 |
| PPO batch size | 120 |
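
To show where the discount factor γ and the GAE parameter λ from Table 5 enter, the sketch below computes generalized advantage estimates for one short rollout using the standard recursion. The function name `gae_advantages` and the example rewards and values are illustrative, and the table does not state which of the agents use GAE, so this should be read as a generic sketch rather than the authors' training code.

```python
def gae_advantages(rewards, values, last_value, gamma=0.9, lam=0.95):
    """Generalized advantage estimation over one rollout.

    rewards[t], values[t]: reward and value estimate at step t.
    last_value: value estimate of the state after the final step
                (0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Illustrative two-step rollout with made-up rewards and zero value estimates.
advantages = gae_advantages([1.0, 1.0], [0.0, 0.0], last_value=0.0)
```

With a time horizon of t_max = 15, such a rollout would contain up to 15 steps per actor before an update.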
Table 6

Overview of the pieces from the MSMD dataset that were recorded as real performances. The pieces are played without repetitions.

| Composer | Piece name | Dur. (sec.) |
|---|---|---|
| Bach, Johann Sebastian | Polonaise in F major, BWV Anh. 117a | 47.32 |
| Bach, Johann Sebastian | Sinfonia in G minor, BWV 797 | 99.69 |
| Bach, Johann Sebastian | French Suite No. 6 in E major, Menuet, BWV 817 | 37.21 |
| Bach, Johann Sebastian | Partita in E minor, Allemande, BWV 830-2 | 86.73 |
| Bach, Johann Sebastian | Prelude in C major, BWV 924a | 40.43 |
| Bach, Johann Sebastian | Minuet in F major, BWV Anh. 113 | 40.49 |
| Bach, Johann Sebastian | Minuet in G major, BWV Anh. 116 | 51.56 |
| Bach, Johann Sebastian | Minuet in A minor, BWV Anh. 120 | 31.32 |
| Chopin, Frédéric François | Nocturne in B♭ minor, Op. 9, No. 1 | 328.92 |
| Mozart, Wolfgang Amadeus | Piano Sonata No. 11 in A major, 1st Movt, Variation 1, KV331 | 56.33 |
| Mussorgsky, Modest Petrovich | Pictures at an Exhibition, Promenade III | 27.17 |
| Schumann, Robert | Album für die Jugend, Op. 68, 1. Melodie | 45.50 |
| Schumann, Robert | Album für die Jugend, Op. 68, 6. Armes Waisenkind | 73.52 |
| Schumann, Robert | Album für die Jugend, Op. 68, 8. Wilder Reiter | 24.88 |
| Schumann, Robert | Album für die Jugend, Op. 68, 16. Erster Verlust | 55.83 |
| Schumann, Robert | Album für die Jugend, Op. 68, 26. Untitled | 74.40 |

DOI: https://doi.org/10.5334/tismir.31 | Journal eISSN: 2514-3298
Language: English
Submitted on: Feb 1, 2019
Accepted on: Sep 12, 2019
Published on: Nov 20, 2019
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2019 Florian Henkel, Stefan Balke, Matthias Dorfer, Gerhard Widmer, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.