Score Following as a Multi-Modal Reinforcement Learning Problem

Florian Henkel; Stefan Balke; Matthias Dorfer; Gerhard Widmer

doi:10.5334/tismir.31

Figures & Tables

Figure 1

Sketch of score following in sheet music. Given the incoming audio, the score follower has to track the corresponding position in the score (image).

Figure 2

Sketch of the score following MDP. The agent receives the current state of the environment *S_t* and a scalar reward signal *R_t* for the action taken in the previous time step. Based on the current state, it has to choose an action (e.g., decide whether to increase, keep, or decrease its speed in the score) in order to maximize future reward by correctly following the performance in the score.

Figure 3

Markov state of the score following MDP: the current sheet sliding window and spectrogram excerpt. To capture the dynamics of the environment, we also add the one-step differences (Δ) with respect to the previous time step (state).
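As a rough illustration of the state described in this caption, the sketch below assembles the four parts (sheet window, spectrogram excerpt, and their one-step differences) from the current and previous observations. The function names `delta` and `build_state`, the nested-list representation, and the zero difference on the very first step are assumptions made for this example, not details taken from the paper.

```python
def delta(current, previous):
    """Element-wise one-step difference between two equally shaped 2-D lists."""
    return [[c - p for c, p in zip(row_c, row_p)]
            for row_c, row_p in zip(current, previous)]


def build_state(sheet, spec, prev_sheet=None, prev_spec=None):
    """Return (sheet, spec, delta_sheet, delta_spec) for one time step.

    Assumption: if there is no previous observation (first step), the
    previous observation is taken to be the current one, so both
    differences are all zeros.
    """
    if prev_sheet is None:
        prev_sheet = sheet
    if prev_spec is None:
        prev_spec = spec
    return sheet, spec, delta(sheet, prev_sheet), delta(spec, prev_spec)
```

In a real agent the four parts would be arrays fed to the two network branches; plain lists are used here only to keep the sketch dependency-free.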

Figure 4

Reward definition in the score following MDP. The reward *R_t* (range [0, 1]) decays linearly with the agent’s distance *d_x* from the current true score position x.
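A minimal sketch of such a linearly decaying reward is given below. The caption only states the range [0, 1] and the linear decay; the parameter `max_dist` (the distance at which the reward reaches zero) and the clipping to zero beyond it are assumptions for illustration.

```python
def reward(agent_x, true_x, max_dist):
    """Linear reward in [0, 1]: 1 at the true position, 0 at max_dist or beyond.

    agent_x, true_x: positions in the score (e.g. in pixels).
    max_dist: hypothetical distance at which the reward has decayed to 0.
    """
    if max_dist <= 0:
        raise ValueError("max_dist must be positive")
    d_x = abs(agent_x - true_x)
    return max(0.0, 1.0 - d_x / max_dist)
```

For example, with `max_dist=50`, an agent 25 pixels away from the target would receive a reward of 0.5 under these assumptions.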

Figure 5

Multi-modal network architecture used for our score following agents. Given state s, the policy network predicts the action selection probability π(*a|s;θ*) for the allowed actions a ∈ {–Δν_pxl, 0, +Δν_pxl}. The value network, sharing parameters with the policy network, provides a state-value estimate v̂(s;w) for the current state.
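To make the discrete action set concrete, the following sketch samples one of the three speed changes from a given policy output and applies it to the current pixel speed. The function name `step_speed`, the argument names, and the idea of sampling (rather than, say, taking the most probable action) are assumptions for this example; the network that produces `probs` is not modeled here.

```python
import random


def step_speed(speed, probs, delta_v, rng=random):
    """Sample a speed change from {-delta_v, 0, +delta_v} and apply it.

    probs: three non-negative weights for (decrease, keep, increase),
           e.g. the softmax output of a policy network; not all zero.
    delta_v: hypothetical size of one speed adjustment in pixels per step.
    """
    actions = (-delta_v, 0.0, +delta_v)
    change = rng.choices(actions, weights=probs, k=1)[0]
    return speed + change
```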

Figure 6

Optimal tempo curve and corresponding optimal actions *A_t* for a continuous agent (piece: J. S. Bach, BWV 994). The *A_t* would be the target values for training an agent with supervised, feed-forward regression.
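One way to picture how such regression targets could be derived is sketched below: given ground-truth score positions per time step, the speed is the per-step position change and the action is the change in speed. This reading of "optimal actions", the function name `optimal_actions`, and the initial speed of zero are assumptions for illustration, not the paper's stated procedure.

```python
def optimal_actions(true_positions):
    """Derive per-step speeds and speed changes from true score positions.

    true_positions: ground-truth positions (e.g. in pixels), one per step.
    Returns (speeds, actions). Assumption: the agent starts at speed 0,
    so the first action equals the first speed.
    """
    speeds = [b - a for a, b in zip(true_positions, true_positions[1:])]
    if not speeds:
        return [], []
    actions = [speeds[0]] + [b - a for a, b in zip(speeds, speeds[1:])]
    return speeds, actions
```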

Table 1

Network architecture used for the Nottingham dataset. Conv (3, stride-1)-16: 3 × 3 convolution, 16 feature maps and stride 1. No zero-padding is applied. We use ELU activation on all layers if not stated otherwise.

Audio (Spectrogram) 78 × 40	Sheet-Image 40 × 150
Conv (3, stride-2)-16	Conv (5, stride-(1, 2))-16
Conv (3, stride-2)-32	Conv (3, stride-2)-32
Conv (3, stride-2)-32	Conv (3, stride-2)-32
Conv (3, stride-1)-64	Conv (3, stride-(1,2))-64
Concatenation + Dense (256)
Dense (256)	Dense (256)
Dense (3) – Softmax	Dense (1) – Linear
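Because no zero-padding is applied, the feature-map sizes in Table 1 shrink at every layer. The sketch below traces them with the usual unpadded-convolution formula, floor((n − k) / s) + 1 per axis. The helper names are hypothetical, and it is an assumption that the input sizes and stride pairs are listed in the same axis order as printed in the table.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded ("valid") convolution along one axis."""
    return (size - kernel) // stride + 1


def trace_shapes(input_shape, layers):
    """Apply (kernel, (stride_a, stride_b)) layers to a 2-D shape.

    Returns the list of shapes, starting with the input shape.
    """
    shapes = [input_shape]
    for kernel, (s0, s1) in layers:
        a, b = shapes[-1]
        shapes.append((conv_out(a, kernel, s0), conv_out(b, kernel, s1)))
    return shapes


# Layers transcribed from Table 1 as (kernel size, strides per axis).
AUDIO_LAYERS = [(3, (2, 2)), (3, (2, 2)), (3, (2, 2)), (3, (1, 1))]
SHEET_LAYERS = [(5, (1, 2)), (3, (2, 2)), (3, (2, 2)), (3, (1, 2))]

audio_shapes = trace_shapes((78, 40), AUDIO_LAYERS)
sheet_shapes = trace_shapes((40, 150), SHEET_LAYERS)
```

Under these assumptions the audio branch ends with 6 × 2 feature maps and the sheet branch with 6 × 8 feature maps before the concatenation layer.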

Table 2

Network architecture used for MSMD. DO: Dropout; Conv (3, stride-1)-16: 3 × 3 convolution, 16 feature maps and stride 1. No zero-padding is applied. We use ELU activation on all layers if not stated otherwise.

Audio (Spectrogram) 78 × 40	Sheet-Image 80 × 256
Conv (3, stride-1)-16	Conv (5, stride-(1, 2))-16
Conv (3, stride-1)-16	Conv (3, stride-1)-16
Conv (3, stride-2)-32	Conv (3, stride-2)-32
Conv (3, stride-1)-32 + DO (0.2)	Conv (3, stride-1)-32 + DO (0.2)
Conv (3, stride-2)-64	Conv (3, stride-2)-32
Conv (3, stride-2)-96	Conv (3, stride-2)-64 + DO (0.2)
Conv (1, stride-1)-96 + DO (0.2)	Conv (3, stride-2)-96
Dense (512)	Conv (1, stride-1)-96 + DO (0.2)
	Dense (512)
Concatenation + Dense (512)
Dense (256) + DO (0.2)	Dense (256) + DO (0.2)
Dense (3) – Softmax	Dense (1) – Linear

Table 3

Comparison of score following approaches. MIDI-ODTW considers a perfectly extracted score MIDI file and aligns it to a performance with ODTW. OMR-ODTW does the same, but uses a score MIDI file extracted by an OMR system. MM-Loc is obtained by using the method presented by Dorfer et al. (2016) with a temporal context of 4 and 2 seconds for Nottingham and MSMD, respectively. For MSMD, we use the models from the references and re-evaluate them on the cleaned dataset. For A2C, PPO, and REINFORCE_bl, we report the average over 10 evaluation runs. The mean absolute tracking error and its standard deviation are given in centimeters.

	Nottingham (monophonic, 46 test pieces)				MSMD (polyphonic, 94 test pieces)
Method	R_tue	R_on	$\overline{|d_x|}$	std(|d_x|)	R_tue	R_on	$\overline{|d_x|}$	std(|d_x|)
MIDI-ODTW (upper bound)	1.00	1.00	0.00	0.00	1.00	1.00	0.00	0.00
OMR-ODTW	0.89	0.95	0.04	0.09	0.77	0.87	0.63	0.98
MM-Loc (Dorfer et al., 2018b)	0.65	0.83	0.08	0.28	0.55	0.60	0.29	1.07
A2C (Dorfer et al., 2018b)	0.96	0.99	0.08	0.12	0.76	0.77	0.69	0.81
REINFORCE_bl	0.97	0.99	0.06	0.09	0.59	0.70	1.06	1.07
A2C	0.96	0.99	0.07	0.09	0.75	0.77	0.68	0.82
PPO	0.99	0.99	0.06	0.09	0.81	0.80	0.65	0.81
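The two error columns of Table 3 summarize the absolute tracking error |d_x|. A minimal sketch of how such a summary could be computed from per-step signed errors follows. The function name, the pixel-to-centimeter factor `cm_per_pixel`, and the use of the population standard deviation are assumptions; the source does not state which estimator or conversion was used.

```python
import statistics


def tracking_error_summary(errors_px, cm_per_pixel=1.0):
    """Mean and standard deviation of the absolute tracking error.

    errors_px: signed distances between agent and true position, in pixels.
    cm_per_pixel: hypothetical conversion factor to centimeters.
    Assumption: population standard deviation (statistics.pstdev).
    """
    abs_cm = [abs(e) * cm_per_pixel for e in errors_px]
    return statistics.mean(abs_cm), statistics.pstdev(abs_cm)
```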

Figure 7

Two-dimensional t-SNE projection of the 512-dimensional embeddings taken from the network’s concatenation layer (see Figure 5). Each point in the scatter plot corresponds to an audio–score input tuple. The color encodes the predicted value v̂(s;w). (Figure inspired by Mnih et al. (2015).)

Figure 8

Two examples of policy outputs. **(a)** Agent is behind the target, resulting in a high probability for increasing the pixel speed (π(+Δν_pxl|*s;θ*) = 0.795). **(b)** Agent is ahead of the target, suggesting a reduction of pixel speed (π(–Δν_pxl|*s;θ*) = 0.903).

Figure 9

Visualization of the agent’s focus on different parts of the input state for the situation shown in Figure 8b. The salience map was created via integrated gradients (Sundararajan et al., 2017), a technique to identify the most relevant input features for the agent’s decision, in this case for decreasing its pixel speed.

Table 4

Comparison of score following approaches on real performances. To get a more robust estimate of the performance of the RL agents (REINFORCE_bl, A2C and PPO), we report the average over 50 evaluation runs. MM-Loc is the supervised baseline presented by Dorfer et al. (2016). MIDI-ODTW and OMR-ODTW are the ODTW baselines described in Section 4.2. The mean absolute tracking error and its standard deviation are given in centimeters.

Method	R_tue	R_on	$\overline{|d_x|}$	std(|d_x|)
Original MIDI Synthesized (Score = Performance)
MIDI-ODTW	1.00	1.00	0.00	0.01
OMR-ODTW	0.62	0.80	0.85	1.12
MM-Loc	0.44	0.45	0.38	1.14
REINFORCE_bl	0.56	0.59	1.15	1.14
A2C	0.70	0.63	0.65	0.82
PPO	0.74	0.68	0.70	0.87
Performance MIDI Synthesized
MIDI-ODTW	0.81	0.94	0.50	0.76
OMR-ODTW	0.50	0.72	0.90	1.08
MM-Loc	0.25	0.51	0.36	0.99
REINFORCE_bl	0.14	0.31	1.80	1.48
A2C	0.58	0.51	0.94	0.94
PPO	0.56	0.50	0.94	1.01
Direct Out
MIDI-ODTW	0.88	0.92	0.59	0.79
OMR-ODTW	0.50	0.67	0.93	1.15
MM-Loc	0.19	0.32	0.55	1.42
REINFORCE_bl	0.33	0.43	1.42	1.29
A2C	0.49	0.55	0.97	1.06
PPO	0.51	0.53	1.01	1.11
Room Recording
MIDI-ODTW	0.81	0.93	0.64	0.84
OMR-ODTW	0.50	0.65	0.93	1.09
MM-Loc	0.00	0.19	0.68	1.58
REINFORCE_bl	0.08	0.37	1.52	1.34
A2C	0.38	0.50	1.11	1.12
PPO	0.30	0.43	1.26	1.24

Table 5

Hyperparameter overview.

Hyperparameter	Value
Adam learning rate	10^–4
Adam decay rates (β₁, β₂)	(0.9, 0.999)
Patience	50
Learning rate multiplier	0.1
Refinements	2
Time horizon t_max	15
Number of actors	8
Entropy regularization	0.05
Discount factor γ	0.9
GAE parameter λ	0.95
PPO clipping parameter ɛ	0.2
PPO epochs	1
PPO batch size	120
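To show where the discount factor γ, the GAE parameter λ, and the PPO clipping parameter ɛ from Table 5 typically enter, the sketch below implements the standard generalized advantage estimation recursion and the standard clipped surrogate term. It assumes the textbook formulations of both; the function names are hypothetical, and how the authors wire these into their training loop is not shown in this section.

```python
GAMMA = 0.9      # discount factor (Table 5)
LAMBDA = 0.95    # GAE parameter (Table 5)
CLIP_EPS = 0.2   # PPO clipping parameter (Table 5)


def gae(rewards, values, bootstrap_value, gamma=GAMMA, lam=LAMBDA):
    """Generalized advantage estimates for one rollout segment.

    rewards[t], values[t]: reward and value estimate at step t.
    bootstrap_value: value estimate for the state after the last step.
    """
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        td_error = rewards[t] + gamma * next_value - values[t]
        running = td_error + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio, advantage, eps=CLIP_EPS):
    """PPO clipped objective term for a single sample.

    ratio: new-policy probability divided by old-policy probability.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a time horizon of 15 steps, as listed in the table, `rewards` and `values` would each hold 15 entries per actor in this sketch.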

Table 6

Overview of the pieces from the MSMD dataset that were recorded as real performances. The pieces are played without repetitions.

Composer	Piece name	Dur. (sec.)
Bach, Johann Sebastian	Polonaise in F major, BWV Anh. 117a	47.32
Bach, Johann Sebastian	Sinfonia in G minor, BWV 797	99.69
Bach, Johann Sebastian	French Suite No. 6 in E major, Menuet, BWV 817	37.21
Bach, Johann Sebastian	Partita in E minor, Allemande, BWV 830-2	86.73
Bach, Johann Sebastian	Prelude in C major, BWV 924a	40.43
Bach, Johann Sebastian	Minuet in F major, BWV Anh. 113	40.49
Bach, Johann Sebastian	Minuet in G major, BWV Anh. 116	51.56
Bach, Johann Sebastian	Minuet in A minor, BWV Anh. 120	31.32
Chopin, Frédéric François	Nocturne in B♭ minor, Op. 9, No. 1	328.92
Mozart, Wolfgang Amadeus	Piano Sonata No. 11 in A major, 1st Movt, Variation 1, KV331	56.33
Mussorgsky, Modest Petrovich	Pictures at an Exhibition, Promenade III	27.17
Schumann, Robert	Album für die Jugend, Op. 68, 1. Melodie	45.50
Schumann, Robert	Album für die Jugend, Op. 68, 6. Armes Waisenkind	73.52
Schumann, Robert	Album für die Jugend, Op. 68, 8. Wilder Reiter	24.88
Schumann, Robert	Album für die Jugend, Op. 68, 16. Erster Verlust	55.83
Schumann, Robert	Album für die Jugend, Op. 68, 26. Untitled	74.40


