
Learning Audio–Sheet Music Correspondences for Cross-Modal Retrieval and Piece Identification

Open Access | Sep 2018

Figures & Tables

Figure 1

Audio-sheet music pairs presented to the network for embedding space learning.

Figure 2

Example scores illustrating the range of music in MSMD, from simple to complex.

Figure 3

Core dataset workflow. For producing the alignment, it is necessary to “unroll” the score using individual staff systems, so that the ordering of noteheads in the score corresponds to the ordering of the notes in the MIDI file.
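
The unrolling step can be illustrated with a minimal sketch. It assumes each notehead carries the index of its staff system and a horizontal position, and that MIDI events are (onset, pitch) tuples; the field names and the naive one-to-one pairing are hypothetical and ignore chords, ties, and ornaments.

```python
from typing import NamedTuple

class Notehead(NamedTuple):
    system: int   # index of the staff system, top to bottom (hypothetical field)
    x: float      # horizontal pixel position within the page
    pitch: int    # MIDI pitch number

def unroll(noteheads):
    """Order noteheads system by system, left to right within each system."""
    return sorted(noteheads, key=lambda n: (n.system, n.x))

def align(noteheads, midi_events):
    """Pair unrolled noteheads with (onset_time, pitch) events sorted by onset.

    Naive one-to-one pairing; pairs whose pitches disagree are dropped.
    """
    events = sorted(midi_events)
    return [(n, e) for n, e in zip(unroll(noteheads), events) if n.pitch == e[1]]

# Toy data: two noteheads on the first system, one on the second.
noteheads = [Notehead(1, 40.0, 64), Notehead(0, 120.0, 62), Notehead(0, 30.0, 60)]
midi = [(0.0, 60), (0.5, 62), (1.0, 64)]
pairs = align(noteheads, midi)
```

Sorting by page position alone would interleave noteheads from different systems, which is why the system index comes first in the sort key.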

Table 1

MSMD statistics for the recommended train/test splits. Note that the numbers of noteheads, events, and aligned pairs do not match, because (a) not every notehead is supposed to be played (especially in tied notes); (b) some onsets do not get a notehead of their own (e.g., ornaments); and (c) the alignment algorithm sometimes makes mistakes.

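| Split Name | # Pieces / Aln. Pairs | Part | # Pieces | # Pages | # Noteheads | # Events | # Aln. Pairs |
|---|---|---|---|---|---|---|---|
| all | 479 / 344,742 | train | 360 | 970 | 316,038 | 310,377 | 308,761 |
| | | valid | 19 | 28 | 6,907 | 6,583 | 6,660 |
| | | test | 100 | 131 | 29,851 | 29,811 | 29,321 |
| bach-only | 173 / 108,316 | train | 100 | 251 | 77,834 | 75,283 | 74,769 |
| | | valid | 23 | 40 | 10,805 | 10,379 | 10,428 |
| | | test | 50 | 88 | 23,733 | 23,296 | 23,119 |
| bach-out | 479 / 344,742 | train | 281 | 725 | 235,590 | 233,041 | 231,617 |
| | | valid | 25 | 25 | 4,834 | 4,772 | 4,809 |
| | | test | 173 | 379 | 112,372 | 108,958 | 108,316 |
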
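
As a quick consistency check on Table 1, the per-part piece and aligned-pair counts should add up to the totals given for each split. The sketch below copies the numbers from the table; the dictionary layout and function name are illustrative only.

```python
# (pieces, aligned pairs) per part, copied from Table 1
splits = {
    "all":       {"total": (479, 344_742),
                  "parts": [(360, 308_761), (19, 6_660), (100, 29_321)]},
    "bach-only": {"total": (173, 108_316),
                  "parts": [(100, 74_769), (23, 10_428), (50, 23_119)]},
    "bach-out":  {"total": (479, 344_742),
                  "parts": [(281, 231_617), (25, 4_809), (173, 108_316)]},
}

def is_consistent(split):
    """True if train + valid + test match the split's stated totals."""
    pieces = sum(p for p, _ in split["parts"])
    pairs = sum(a for _, a in split["parts"])
    return (pieces, pairs) == split["total"]

results = {name: is_consistent(s) for name, s in splits.items()}
# results == {"all": True, "bach-only": True, "bach-out": True}
```

The test part of bach-out (173 pieces, 108,316 pairs) has the same size as the whole bach-only split.
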
Figure 4

Overview of image augmentation strategies. The size of the sliding image window remains constant (160 × 200 pixels) but its content changes depending on the augmentations applied. The spectrogram remains the same for the augmented image versions.
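
A minimal sketch of the fixed-window idea follows. The caption does not list the individual augmentations, so the random shift used here is only a stand-in chosen for illustration; `shifted_window` and `max_shift` are hypothetical names. The window size is the one stated in the caption.

```python
import random

WINDOW_H, WINDOW_W = 160, 200  # window size stated in the caption

def crop_window(image, top, left):
    """Return a WINDOW_H x WINDOW_W crop of a 2-D list of pixel rows."""
    return [row[left:left + WINDOW_W] for row in image[top:top + WINDOW_H]]

def shifted_window(image, top, left, max_shift=5, rng=random):
    """Crop the window after a random shift, clamped to the image bounds.

    The shift is an assumed example of an augmentation, not the paper's list.
    """
    height, width = len(image), len(image[0])
    top = min(max(top + rng.randint(-max_shift, max_shift), 0), height - WINDOW_H)
    left = min(max(left + rng.randint(-max_shift, max_shift), 0), width - WINDOW_W)
    return crop_window(image, top, left)

# Synthetic 180 x 400 "image"; the crop always has the same shape.
image = [[(r + c) % 256 for c in range(400)] for r in range(180)]
window = shifted_window(image, top=10, left=100)
```

Only the image content under the window varies; the paired spectrogram excerpt would be left untouched, as the caption states.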

Figure 5

Architecture of correspondence learning network. The network is trained to optimize the similarity (in embedding space) between corresponding audio and sheet image snippets by minimizing a pair-wise ranking loss.
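
To make the pair-wise ranking loss concrete, here is a minimal sketch of a hinge-style formulation for one audio snippet against several sheet snippets. Cosine similarity and the margin value of 0.7 are assumptions made for this example; the caption does not specify them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def pairwise_ranking_loss(audio, sheets, match_index, margin=0.7):
    """Hinge loss that penalizes non-matching sheet snippets scoring
    within `margin` of the matching one (margin is an assumed value)."""
    s_pos = cosine(audio, sheets[match_index])
    return sum(
        max(0.0, margin - s_pos + cosine(audio, sheet))
        for j, sheet in enumerate(sheets)
        if j != match_index
    )

audio = [1.0, 0.0]
sheets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # index 0 is the true match
loss = pairwise_ranking_loss(audio, sheets, match_index=0)
```

In this toy case the orthogonal snippet contributes nothing, while the snippet at 45 degrees violates the margin and contributes roughly 0.41.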

Table 2

Audio – sheet music model. BN: Batch Normalization (Ioffe and Szegedy, 2015), ELU: Exponential Linear Unit (Clevert et al., 2015), MP: Max Pooling, Conv (3, pad-1)–16: 3 × 3 convolution, 16 feature maps and padding 1.

| Sheet-Image 80 × 100 | Audio (Spectrogram) 92 × 42 |
|---|---|
| 2 × Conv(3, pad-1)-24 | 2 × Conv(3, pad-1)-24 |
| BN-ELU + MP(2) | BN-ELU + MP(2) |
| 2 × Conv(3, pad-1)-48 | 2 × Conv(3, pad-1)-48 |
| BN-ELU + MP(2) | BN-ELU + MP(2) |
| 2 × Conv(3, pad-1)-96 | 2 × Conv(3, pad-1)-96 |
| BN-ELU + MP(2) | BN-ELU + MP(2) |
| 2 × Conv(3, pad-1)-96 | 2 × Conv(3, pad-1)-96 |
| BN-ELU + MP(2) | BN-ELU + MP(2) |
| Conv(1, pad-0)-32-BN-LINEAR | Conv(1, pad-0)-32-BN-LINEAR |
| GlobalAveragePooling | GlobalAveragePooling |
| Embedding Layer + Ranking Loss ||

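
The spatial sizes implied by Table 2 can be traced with a short sketch. It assumes stride-1 convolutions and non-overlapping 2 × 2 max pooling that discards incomplete border windows; neither detail is stated in the table, and the helper names are illustrative.

```python
def conv_same(h, w):
    """3x3 convolution, padding 1, stride 1: spatial size is unchanged."""
    return h + 2 * 1 - 3 + 1, w + 2 * 1 - 3 + 1

def max_pool(h, w, size=2):
    """Non-overlapping pooling; odd sizes are rounded down (assumption)."""
    return h // size, w // size

def feature_map_sizes(h, w, n_blocks=4):
    """Sizes after each block of 2 x Conv(3, pad-1) followed by MP(2)."""
    sizes = [(h, w)]
    for _ in range(n_blocks):
        h, w = conv_same(*conv_same(h, w))
        h, w = max_pool(h, w)
        sizes.append((h, w))
    return sizes

sheet_sizes = feature_map_sizes(80, 100)
# [(80, 100), (40, 50), (20, 25), (10, 12), (5, 6)]
audio_sizes = feature_map_sizes(92, 42)
# [(92, 42), (46, 21), (23, 10), (11, 5), (5, 2)]
```

Under these assumptions the two branches end with different spatial sizes, but the 1 × 1 convolution with 32 feature maps followed by global average pooling yields 32 values per branch either way.
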
Figure 6

Sketch of sheet music-from-audio retrieval. The blue dots represent the embedded candidate sheet music snippets. The red dot is the embedding of an audio query. The larger blue dot highlights the closest sheet music snippet candidate selected as retrieval result.
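
The retrieval step in the sketch amounts to a nearest-neighbor search in the embedding space. The example below is a minimal illustration that assumes cosine distance as the measure of closeness; the vectors are made up.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def retrieve(query, candidates):
    """Return candidate indices sorted from closest to farthest."""
    return sorted(range(len(candidates)),
                  key=lambda i: cosine_distance(query, candidates[i]))

audio_query = [0.9, 0.1]                                  # the "red dot"
sheet_candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]   # the "blue dots"
ranking = retrieve(audio_query, sheet_candidates)
# ranking == [1, 2, 0]; candidate 1 is the retrieval result
```

Swapping the roles of the two modalities gives the opposite retrieval direction with the same code.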

Table 3

Snippet retrieval results. The table compares the influence of train/test splits and data augmentation on retrieval performance in both directions. In the audio augmentation experiments no sheet augmentation is applied, and vice versa. The setting none uses one sound font at the original tempo, without sheet augmentation; rand-bl is a random baseline. To make the comparison across the different test sets fair, the number of retrieval candidates is limited to 2000 for each of the splits. R@k: Recall at k; MRR: Mean Reciprocal Rank; MR: Median Rank.

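Audio-to-Sheet Retrieval

| Aug. | bach-only R@1 | bach-only R@25 | bach-only MRR | bach-only MR | bach-out R@1 | bach-out R@25 | bach-out MRR | bach-out MR | all R@1 | all R@25 | all MRR | all MR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| none | 0.25 | 0.73 | 0.37 | 6 | 0.31 | 0.83 | 0.44 | 3 | 0.33 | 0.76 | 0.44 | 4 |
| sheet | 0.38 | 0.81 | 0.49 | 3 | 0.25 | 0.78 | 0.37 | 5 | 0.33 | 0.75 | 0.44 | 4 |
| audio | 0.48 | 0.87 | 0.59 | 2 | 0.38 | 0.83 | 0.50 | 2 | 0.46 | 0.82 | 0.57 | 2 |
| full | 0.52 | 0.87 | 0.62 | 1 | 0.46 | 0.86 | 0.57 | 2 | 0.50 | 0.83 | 0.60 | 2 |
| rand-bl | 0.00 | 0.01 | 0.00 | 1000 | 0.00 | 0.01 | 0.00 | 1000 | 0.00 | 0.01 | 0.00 | 1000 |

Sheet-to-Audio Retrieval

| Aug. | bach-only R@1 | bach-only R@25 | bach-only MRR | bach-only MR | bach-out R@1 | bach-out R@25 | bach-out MRR | bach-out MR | all R@1 | all R@25 | all MRR | all MR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| none | 0.34 | 0.81 | 0.46 | 3 | 0.35 | 0.83 | 0.48 | 3 | 0.39 | 0.80 | 0.51 | 2 |
| sheet | 0.45 | 0.85 | 0.57 | 2 | 0.28 | 0.80 | 0.42 | 4 | 0.40 | 0.79 | 0.52 | 2 |
| audio | 0.51 | 0.87 | 0.62 | 1 | 0.39 | 0.85 | 0.52 | 2 | 0.49 | 0.84 | 0.59 | 2 |
| full | 0.56 | 0.89 | 0.66 | 1 | 0.46 | 0.87 | 0.57 | 2 | 0.51 | 0.85 | 0.61 | 1 |
| rand-bl | 0.00 | 0.01 | 0.00 | 1000 | 0.00 | 0.01 | 0.00 | 1000 | 0.00 | 0.01 | 0.00 | 1000 |
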
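
The metrics reported in Table 3 can all be derived from the rank at which the correct counterpart is retrieved for each query. The following sketch assumes 1-based ranks and uses made-up values; it is not the evaluation code behind the table.

```python
from statistics import median

def recall_at_k(ranks, k):
    """Fraction of queries whose correct item is ranked k or better."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def median_rank(ranks):
    """Median position of the correct item."""
    return median(ranks)

ranks = [1, 2, 1, 30, 4]  # made-up ranks of the correct item, one per query
r_at_1 = recall_at_k(ranks, 1)     # 0.4
r_at_25 = recall_at_k(ranks, 25)   # 0.8
mrr = mean_reciprocal_rank(ranks)  # about 0.557
mr = median_rank(ranks)            # 2
```

With 2000 candidates, a ranking drawn uniformly at random places the correct item at a median position of about 1000, which matches the MR column of the rand-bl rows.
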
Figure 7

Influence of training set size on test set retrieval performance (MRR) evaluated on the bach-split in the no-augmentation setting.

Figure 8

Piece retrieval concept from audio query. The entire pipeline consists of two stages: retrieval preparation and retrieval at runtime (best viewed in color, for details see Section 5).
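
One way to picture the runtime stage is as vote counting over snippet-level retrieval results. The sketch below assumes that every retrieved snippet casts one vote for the piece it belongs to; this aggregation rule and the piece identifiers are assumptions for illustration, and Section 5 of the article describes the actual procedure.

```python
from collections import Counter

def identify_piece(retrieved_piece_ids):
    """Rank pieces by how often their snippets were retrieved.

    retrieved_piece_ids holds one list per query snippet, containing the
    piece id of each retrieved candidate snippet.
    """
    votes = Counter(pid for hits in retrieved_piece_ids for pid in hits)
    return [pid for pid, _ in votes.most_common()]

# Three query snippets from one recording, two retrieved candidates each.
hits = [
    ["piece_a", "piece_b"],
    ["piece_a", "piece_c"],
    ["piece_b", "piece_a"],
]
piece_ranking = identify_piece(hits)
# piece_ranking == ["piece_a", "piece_b", "piece_c"]
```

The preparation stage would correspond to embedding all candidate snippets once, so that only the query has to be embedded at runtime.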

Table 4

Piece and performance identification results on synthetic data for all three splits.

| | | | Synthesized-to-Score | | | | Score-to-Synthesized | | | |
| Train Split | # | Aug. | Rk@1 | Rk@5 | Rk@10 | >Rk10 | Rk@1 | Rk@5 | Rk@10 | >Rk10 |
|---|---|---|---|---|---|---|---|---|---|---|
| bach-only | 50 | none | 33 (0.66) | 46 (0.92) | 48 (0.96) | 2 (0.04) | 39 (0.78) | 48 (0.96) | 49 (0.98) | 1 (0.02) |
| | | full | 41 (0.82) | 49 (0.98) | 50 (1.00) | 0 (0.00) | 47 (0.94) | 50 (1.00) | 50 (1.00) | 0 (0.00) |
| bach-out | 173 | none | 125 (0.72) | 158 (0.91) | 163 (0.94) | 10 (0.06) | 145 (0.84) | 164 (0.95) | 166 (0.96) | 7 (0.04) |
| | | full | 143 (0.83) | 163 (0.94) | 167 (0.97) | 6 (0.03) | 149 (0.86) | 169 (0.98) | 172 (0.99) | 1 (0.01) |
| all | 100 | none | 67 (0.67) | 96 (0.96) | 98 (0.98) | 2 (0.02) | 94 (0.94) | 98 (0.98) | 99 (0.99) | 1 (0.01) |
| | | full | 82 (0.82) | 97 (0.97) | 99 (0.99) | 1 (0.01) | 92 (0.92) | 99 (0.99) | 100 (1.00) | 0 (0.00) |

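
The cells of Table 4 pair a count with its fraction of the test pieces. The sketch below reads Rk@k as the number of pieces whose correct counterpart is ranked at position k or better, and >Rk10 as the remainder; this reading is an assumption, supported by the fact that Rk@10 and >Rk10 add up to the number of pieces in every row. The ranks are made up.

```python
def format_cell(count, total):
    """Format a count with its fraction, e.g. 33 of 50 -> '33 (0.66)'."""
    return f"{count} ({count / total:.2f})"

def rank_buckets(ranks, thresholds=(1, 5, 10)):
    """Count pieces identified within each rank threshold, plus the rest."""
    total = len(ranks)
    cells = {f"Rk@{k}": format_cell(sum(r <= k for r in ranks), total)
             for k in thresholds}
    worst = max(thresholds)
    cells[f">Rk{worst}"] = format_cell(sum(r > worst for r in ranks), total)
    return cells

ranks = [1, 3, 7, 12, 1]  # made-up rank of the correct piece per query
cells = rank_buckets(ranks)
# {'Rk@1': '2 (0.40)', 'Rk@5': '3 (0.60)', 'Rk@10': '4 (0.80)', '>Rk10': '1 (0.20)'}
```

For instance, `format_cell(33, 50)` reproduces the first cell of the bach-only row, `33 (0.66)`.
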
Figure 9

Exemplar staff line automatically extracted from a scanned score version of Chopin’s Nocturne Op. 9 No. 3 in B major (Henle Urtext Edition; reproduced with permission). The blue box indicates an example sheet snippet fed to the image part of the retrieval embedding network.

Table 5

Evaluation on real data: Piece retrieval results on scanned sheet music and recordings of real performances. The model used for retrieval is trained on the all-split with full data augmentation.

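| | | Synthesized-to-Real-Score | | | | Real-Score-to-Synthesized | | | |
| Composer | # | Rk@1 | Rk@5 | Rk@10 | >Rk10 | Rk@1 | Rk@5 | Rk@10 | >Rk10 |
|---|---|---|---|---|---|---|---|---|---|
| Mozart | 14 | 13 (0.93) | 14 (1.00) | 14 (1.00) | 0 (0.00) | 13 (0.93) | 14 (1.00) | 14 (1.00) | 0 (0.00) |
| Beethoven | 29 | 24 (0.83) | 27 (0.93) | 27 (0.93) | 2 (0.07) | 25 (0.86) | 27 (0.93) | 29 (1.00) | 0 (0.00) |
| Chopin | 150 | 127 (0.85) | 140 (0.93) | 145 (0.97) | 5 (0.03) | 112 (0.75) | 136 (0.91) | 142 (0.95) | 8 (0.05) |

| | | Performance-to-Real-Score | | | | Real-Score-to-Performance | | | |
| Composer | # | Rk@1 | Rk@5 | Rk@10 | >Rk10 | Rk@1 | Rk@5 | Rk@10 | >Rk10 |
|---|---|---|---|---|---|---|---|---|---|
| Mozart | 14 | 5 (0.36) | 14 (1.00) | 14 (1.00) | 0 (0.00) | 12 (0.86) | 13 (0.93) | 13 (0.93) | 1 (0.07) |
| Beethoven | 29 | 16 (0.55) | 25 (0.86) | 27 (0.93) | 2 (0.07) | 20 (0.69) | 28 (0.97) | 28 (0.97) | 1 (0.03) |
| Chopin | 150 | 36 (0.24) | 72 (0.48) | 91 (0.61) | 59 (0.39) | 58 (0.39) | 94 (0.63) | 111 (0.74) | 39 (0.26) |
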
DOI: https://doi.org/10.5334/tismir.12 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jan 25, 2018
Accepted on: Mar 20, 2018
Published on: Sep 4, 2018
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2018 Matthias Dorfer, Jan Hajič jr., Andreas Arzt, Harald Frostel, Gerhard Widmer, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.