
Figure 1
Audio-sheet music pairs presented to the network for embedding space learning.

Figure 2
Example scores illustrating the range of music in MSMD, from simple to complex.

Figure 3
Core dataset workflow. To produce the alignment, the score must be “unrolled” into its individual staff systems, so that the ordering of noteheads in the score corresponds to the ordering of the notes in the MIDI file.
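Conceptually, this unrolling amounts to sorting noteheads by staff system (top to bottom) and, within each system, by horizontal position (left to right). A minimal sketch under these assumptions, using a hypothetical notehead record (the actual MSMD annotation format differs):

```python
# Hypothetical notehead records; the real MSMD tooling differs.
# "Unrolling" reads systems top to bottom and, within a system, noteheads
# left to right, so their order matches the note order in the MIDI file.

def unroll_noteheads(noteheads):
    """Sort noteheads into the temporal order implied by the score layout."""
    return sorted(noteheads, key=lambda n: (n["system"], n["x"]))

noteheads = [
    {"system": 1, "x": 30.0, "pitch": 64},   # second system, first note
    {"system": 0, "x": 120.5, "pitch": 67},  # first system, later note
    {"system": 0, "x": 42.0, "pitch": 60},   # first system, earlier note
]
print([n["pitch"] for n in unroll_noteheads(noteheads)])  # [60, 67, 64]
```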
Table 1
MSMD statistics for the recommended train/test splits. Note that the numbers of noteheads, events, and aligned pairs do not match, because (a) not every notehead is meant to be played, especially tied notes; (b) some onsets do not get a notehead of their own, e.g., ornaments; and (c) the alignment algorithm occasionally makes mistakes.
| Split Name | # Pieces / Aln. Pairs | Part | # Pieces | # Pages | # Noteheads | # Events | # Aln. Pairs |
|---|---|---|---|---|---|---|---|
| all | 479 / 344,742 | train | 360 | 970 | 316,038 | 310,377 | 308,761 |
|  |  | valid | 19 | 28 | 6,907 | 6,583 | 6,660 |
|  |  | test | 100 | 131 | 29,851 | 29,811 | 29,321 |
| bach-only | 173 / 108,316 | train | 100 | 251 | 77,834 | 75,283 | 74,769 |
|  |  | valid | 23 | 40 | 10,805 | 10,379 | 10,428 |
|  |  | test | 50 | 88 | 23,733 | 23,296 | 23,119 |
| bach-out | 479 / 344,742 | train | 281 | 725 | 235,590 | 233,041 | 231,617 |
|  |  | valid | 25 | 25 | 4,834 | 4,772 | 4,809 |
|  |  | test | 173 | 379 | 112,372 | 108,958 | 108,316 |

Figure 4
Overview of the image augmentation strategies. The size of the sliding image window remains constant (160 × 200 pixels), but its content changes depending on the augmentations applied; the spectrogram remains the same for all augmented image versions.
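The caption does not enumerate the augmentations themselves; purely to illustrate the fixed-window idea, the sketch below assumes a random zoom and a vertical shift of the window content while the window size stays at 160 × 200 (an illustrative assumption, not the paper's exact recipe):

```python
import numpy as np

WIN_H, WIN_W = 160, 200  # fixed sliding-window size from Figure 4

def cut_augmented_window(sheet, x0, y0, rng):
    """Cut a fixed-size window whose *content* varies with the augmentation.

    The zoom and shift ranges are illustrative assumptions only.
    """
    zoom = rng.uniform(0.9, 1.1)   # rescale the content inside the window
    dy = rng.integers(-10, 11)     # small vertical translation
    ys = np.clip(y0 + dy + (np.arange(WIN_H) / zoom).astype(int), 0, sheet.shape[0] - 1)
    xs = np.clip(x0 + (np.arange(WIN_W) / zoom).astype(int), 0, sheet.shape[1] - 1)
    return sheet[np.ix_(ys, xs)]   # always (160, 200)

rng = np.random.default_rng(0)
sheet = np.random.rand(835, 1181)  # stand-in for a rendered score page
assert cut_augmented_window(sheet, 300, 200, rng).shape == (WIN_H, WIN_W)
```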

Figure 5
Architecture of the correspondence-learning network. The network is trained to maximize the similarity (in embedding space) between corresponding audio and sheet-image snippets by minimizing a pair-wise ranking loss.
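The caption does not spell out the exact loss variant; one standard formulation of a pair-wise ranking loss is a max-margin hinge over mismatching pairs, assuming an embedding similarity $s$ (e.g., cosine similarity) and a margin hyper-parameter $\alpha$:

$$\mathcal{L}_{\text{rank}} = \sum_{i} \sum_{j \neq i} \max\bigl(0,\ \alpha - s(x_i, y_i) + s(x_i, y_j)\bigr)$$

where $x_i$ is the embedding of a sheet snippet, $y_i$ that of its matching audio excerpt, and the $y_j$ are the embeddings of mismatching excerpts in the same mini-batch.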
Table 2
Audio–sheet music model. BN: Batch Normalization (Ioffe and Szegedy, 2015), ELU: Exponential Linear Unit (Clevert et al., 2015), MP: Max Pooling, Conv(3, pad-1)-24: 3 × 3 convolution with 24 feature maps and padding 1.
| Sheet Image (80 × 100 pixels) | Audio Spectrogram (92 bins × 42 frames) |
|---|---|
| 2 × Conv(3, pad-1)-24 | 2 × Conv(3, pad-1)-24 |
| BN-ELU + MP(2) | BN-ELU + MP(2) |
| 2 × Conv(3, pad-1)-48 | 2 × Conv(3, pad-1)-48 |
| BN-ELU + MP(2) | BN-ELU + MP(2) |
| 2 × Conv(3, pad-1)-96 | 2 × Conv(3, pad-1)-96 |
| BN-ELU + MP(2) | BN-ELU + MP(2) |
| 2 × Conv(3, pad-1)-96 | 2 × Conv(3, pad-1)-96 |
| BN-ELU + MP(2) | BN-ELU + MP(2) |
| Conv(1, pad-0)-32-BN-LINEAR | Conv(1, pad-0)-32-BN-LINEAR |
| GlobalAveragePooling | GlobalAveragePooling |
| Embedding Layer + Ranking Loss | |
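A minimal PyTorch sketch of one tower following the column specification in Table 2 (it assumes BN and ELU follow each 3 × 3 convolution; a plausible reading of the table, not the authors' reference code):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # one Conv(3, pad-1) unit, assuming BN + ELU after each convolution
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ELU(),
    )

def tower():
    """One branch of the two-tower model; the sheet and audio towers share
    the layout in Table 2 and differ only in their input sizes."""
    layers, in_ch = [], 1
    for out_ch in (24, 48, 96, 96):
        layers += [conv_block(in_ch, out_ch), conv_block(out_ch, out_ch),
                   nn.MaxPool2d(2)]
        in_ch = out_ch
    layers += [
        nn.Conv2d(in_ch, 32, kernel_size=1),  # Conv(1, pad-0)-32
        nn.BatchNorm2d(32),                   # linear output, no nonlinearity
        nn.AdaptiveAvgPool2d(1),              # global average pooling
        nn.Flatten(),                         # 32-dimensional embedding
    ]
    return nn.Sequential(*layers)

sheet_net, audio_net = tower(), tower()
sheet_emb = sheet_net(torch.randn(4, 1, 80, 100))  # sheet-image snippets
audio_emb = audio_net(torch.randn(4, 1, 92, 42))   # spectrogram excerpts
assert sheet_emb.shape == audio_emb.shape == (4, 32)
```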

Figure 6
Sketch of sheet-music-from-audio retrieval. The blue dots represent the embedded candidate sheet music snippets; the red dot is the embedding of an audio query. The larger blue dot highlights the closest sheet music snippet, which is selected as the retrieval result.
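In code, this retrieval step is a nearest-neighbor search in the shared embedding space. A minimal sketch, assuming cosine similarity as the embedding-space distance:

```python
import numpy as np

def cosine_rank(query_emb, candidate_embs):
    """Rank candidates by cosine similarity to the query, best first."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))

candidates = np.random.randn(2000, 32)    # blue dots: embedded sheet snippets
query = np.random.randn(32)               # red dot: embedded audio query
best = cosine_rank(query, candidates)[0]  # the larger blue dot in Figure 6
```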
Table 3
Snippet retrieval results. The table compares the influence of the train/test splits and of data augmentation on retrieval performance in both directions. For the audio-augmentation experiments no sheet augmentation is applied, and vice versa; *none* denotes a single sound font, the original tempo, and no sheet augmentation. We limit the number of retrieval candidates to 2,000 for each split to make the comparison across the different test sets fair.
| Audio-to-Sheet Retrieval |  |  |  |  |  |  |  |  |  |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | bach-only |  |  |  | bach-out |  |  |  | all |  |  |  |
| Aug. | R@1 | R@25 | MRR | MR | R@1 | R@25 | MRR | MR | R@1 | R@25 | MRR | MR |
| none | 0.25 | 0.73 | 0.37 | 6 | 0.31 | 0.83 | 0.44 | 3 | 0.33 | 0.76 | 0.44 | 4 |
| sheet | 0.38 | 0.81 | 0.49 | 3 | 0.25 | 0.78 | 0.37 | 5 | 0.33 | 0.75 | 0.44 | 4 |
| audio | 0.48 | 0.87 | 0.59 | 2 | 0.38 | 0.83 | 0.50 | 2 | 0.46 | 0.82 | 0.57 | 2 |
| full | 0.52 | 0.87 | 0.62 | 1 | 0.46 | 0.86 | 0.57 | 2 | 0.50 | 0.83 | 0.60 | 2 |
| rand-bl | 0.00 | 0.01 | 0.00 | 1000 | 0.00 | 0.01 | 0.00 | 1000 | 0.00 | 0.01 | 0.00 | 1000 |
| Sheet-to-Audio Retrieval |  |  |  |  |  |  |  |  |  |  |  |  |
|  | bach-only |  |  |  | bach-out |  |  |  | all |  |  |  |
| Aug. | R@1 | R@25 | MRR | MR | R@1 | R@25 | MRR | MR | R@1 | R@25 | MRR | MR |
| none | 0.34 | 0.81 | 0.46 | 3 | 0.35 | 0.83 | 0.48 | 3 | 0.39 | 0.80 | 0.51 | 2 |
| sheet | 0.45 | 0.85 | 0.57 | 2 | 0.28 | 0.80 | 0.42 | 4 | 0.40 | 0.79 | 0.52 | 2 |
| audio | 0.51 | 0.87 | 0.62 | 1 | 0.39 | 0.85 | 0.52 | 2 | 0.49 | 0.84 | 0.59 | 2 |
| full | 0.56 | 0.89 | 0.66 | 1 | 0.46 | 0.87 | 0.57 | 2 | 0.51 | 0.85 | 0.61 | 1 |
| rand-bl | 0.00 | 0.01 | 0.00 | 1000 | 0.00 | 0.01 | 0.00 | 1000 | 0.00 | 0.01 | 0.00 | 1000 |
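For reference, all four reported metrics follow from the rank of each query's true counterpart in the returned candidate list: R@k is the fraction of queries whose counterpart lands in the top k, MRR the mean reciprocal rank, and MR the median rank. A small sketch:

```python
import numpy as np

def retrieval_metrics(ranks, ks=(1, 25)):
    """ranks: 1-based rank of the correct counterpart for each query."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["MRR"] = float(np.mean(1.0 / ranks))  # mean reciprocal rank
    metrics["MR"] = float(np.median(ranks))       # median rank
    return metrics

print(retrieval_metrics([1, 3, 1, 40, 2]))
# R@1 = 0.4, R@25 = 0.8, MRR ≈ 0.57, MR = 2.0
```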

Figure 7
Influence of training set size on test-set retrieval performance (MRR), evaluated on the bach-only split in the no-augmentation setting.

Figure 8
Concept of piece retrieval from an audio query. The pipeline consists of two stages: retrieval preparation and retrieval at runtime (best viewed in color; see Section 5 for details).
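The caption defers the details to Section 5; one straightforward realization of the runtime stage (an assumption for illustration, not necessarily the authors' exact procedure) is to let each audio query snippet vote for the piece that owns its nearest sheet-snippet embedding:

```python
from collections import Counter
import numpy as np

def identify_piece(query_embs, candidate_embs, piece_ids):
    """Rank pieces by snippet-level nearest-neighbor votes."""
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    votes = Counter()
    for q in query_embs:                      # one vote per query snippet
        sims = c @ (q / np.linalg.norm(q))
        votes[piece_ids[int(np.argmax(sims))]] += 1
    return [piece for piece, _ in votes.most_common()]  # best piece first

# 500 candidate snippets belonging to 20 pieces (25 snippets each)
ranked = identify_piece(np.random.randn(10, 32), np.random.randn(500, 32),
                        piece_ids=[i // 25 for i in range(500)])
```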
Table 4
Piece and performance identification results on synthetic data for all three splits.
|  |  |  | Synthesized-to-Score |  |  |  | Score-to-Synthesized |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|
| Train Split | # Test Pieces | Aug. | Rk@1 | Rk@5 | Rk@10 | >Rk10 | Rk@1 | Rk@5 | Rk@10 | >Rk10 |
| bach-only | 50 | none | 33 (0.66) | 46 (0.92) | 48 (0.96) | 2 (0.04) | 39 (0.78) | 48 (0.96) | 49 (0.98) | 1 (0.02) |
|  |  | full | 41 (0.82) | 49 (0.98) | 50 (1.00) | 0 (0.00) | 47 (0.94) | 50 (1.00) | 50 (1.00) | 0 (0.00) |
| bach-out | 173 | none | 125 (0.72) | 158 (0.91) | 163 (0.94) | 10 (0.06) | 145 (0.84) | 164 (0.95) | 166 (0.96) | 7 (0.04) |
|  |  | full | 143 (0.83) | 163 (0.94) | 167 (0.97) | 6 (0.03) | 149 (0.86) | 169 (0.98) | 172 (0.99) | 1 (0.01) |
| all | 100 | none | 67 (0.67) | 96 (0.96) | 98 (0.98) | 2 (0.02) | 94 (0.94) | 98 (0.98) | 99 (0.99) | 1 (0.01) |
|  |  | full | 82 (0.82) | 97 (0.97) | 99 (0.99) | 1 (0.01) | 92 (0.92) | 99 (0.99) | 100 (1.00) | 0 (0.00) |

Figure 9
Exemplar staff line automatically extracted from a scanned score version of Chopin’s Nocturne Op. 9 No. 3 in B major (Henle Urtext Edition; reproduced with permission). The blue box indicates an example sheet snippet fed to the image part of the retrieval embedding network.
Table 5
Evaluation on real data: Piece retrieval results on scanned sheet music and recordings of real performances. The model used for retrieval is trained on the all-split with full data augmentation.
|  |  | Synthesized-to-Real-Score |  |  |  | Real-Score-to-Synthesized |  |  |  |
|---|---|---|---|---|---|---|---|---|---|
| Composer | # Pieces | Rk@1 | Rk@5 | Rk@10 | >Rk10 | Rk@1 | Rk@5 | Rk@10 | >Rk10 |
| Mozart | 14 | 13 (0.93) | 14 (1.00) | 14 (1.00) | 0 (0.00) | 13 (0.93) | 14 (1.00) | 14 (1.00) | 0 (0.00) |
| Beethoven | 29 | 24 (0.83) | 27 (0.93) | 27 (0.93) | 2 (0.07) | 25 (0.86) | 27 (0.93) | 29 (1.00) | 0 (0.00) |
| Chopin | 150 | 127 (0.85) | 140 (0.93) | 145 (0.97) | 5 (0.03) | 112 (0.75) | 136 (0.91) | 142 (0.95) | 8 (0.05) |
|  |  | Performance-to-Real-Score |  |  |  | Real-Score-to-Performance |  |  |  |
| Composer | # Pieces | Rk@1 | Rk@5 | Rk@10 | >Rk10 | Rk@1 | Rk@5 | Rk@10 | >Rk10 |
| Mozart | 14 | 5 (0.36) | 14 (1.00) | 14 (1.00) | 0 (0.00) | 12 (0.86) | 13 (0.93) | 13 (0.93) | 1 (0.07) |
| Beethoven | 29 | 16 (0.55) | 25 (0.86) | 27 (0.93) | 2 (0.07) | 20 (0.69) | 28 (0.97) | 28 (0.97) | 1 (0.03) |
| Chopin | 150 | 36 (0.24) | 72 (0.48) | 91 (0.61) | 59 (0.39) | 58 (0.39) | 94 (0.63) | 111 (0.74) | 39 (0.26) |
