Automatic Generation of Piano Score Following Videos
By: Mengyi Shan and T. J. Tsai
Open Access | Mar 2021

Figures & Tables

Figure 1

Architecture of the proposed approach. The audio query is converted into a bootleg score and used to find a match in a precomputed database of sheet music bootleg scores (retrieval). The matching sheet music and the audio query are then aligned, and the predicted alignment is used to generate a score following video.

Figure 2

Computing a bootleg score from audio (top) and sheet music (bottom).

Figure 3

When converting MIDI data to a piano bootleg score, one can interpret black notes on the piano as sharps (lower left) or as flats (lower right). Both versions are processed during the search, and the one with a higher match score is kept.
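The sharp-versus-flat selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `match_score` stands in for the retrieval scoring function, and staff placement is approximated by snapping each black key to the white key below (sharp spelling) or above (flat spelling).

```python
# Pitch classes of the black keys on the piano (C#, D#, F#, G#, A#).
BLACK_KEYS = {1, 3, 6, 8, 10}

def spell(midi_note, prefer="sharp"):
    """Return the MIDI number of the white key whose staff position the note occupies."""
    if midi_note % 12 not in BLACK_KEYS:
        return midi_note                       # white key: staff position is fixed
    # sharp: written as the letter below, raised; flat: the letter above, lowered
    return midi_note - 1 if prefer == "sharp" else midi_note + 1

def best_bootleg(notes, match_score):
    """Build both enharmonic spellings; keep whichever matches the sheet music better."""
    sharp = [spell(n, "sharp") for n in notes]
    flat = [spell(n, "flat") for n in notes]
    return max((sharp, flat), key=match_score)
```

For example, MIDI note 61 (C#4/Db4) is placed at C4's staff position under the sharp spelling and at D4's position under the flat spelling; `best_bootleg` keeps the version the scoring function prefers.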

Figure 4

Illustration of Hierarchical DTW on a given piece. Four lines in the sheet music are performed in the following order: line 1, line 2, line 3, line 2, line 3, line 4. On the left, we use subsequence DTW to perform feature-level alignment with each line of sheet music. On the right, the segment-level data matrices are shown. Cseg records all subsequence scores from the four lines (indicated by the green, blue, yellow, and red rows). Tseg records the starting location of subsequence paths. Dseg records the optimal cumulative path scores at the segment level. The upper illustration of Dseg shows the possible transitions for two elements in the matrix, where the optimal transition is indicated by a highlighted arrow. The lower illustration of Dseg indicates the optimal path as a series of black dots. The optimal path induces a segmentation of the audio recording, which corresponds to the time intervals where the corresponding sheet music line should be shown.
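The feature-level building block described above is subsequence DTW, which scores a query against every contiguous region of a reference; each such score fills one entry of Cseg for one sheet music line. A minimal sketch, using scalar features and absolute-difference cost for simplicity (the system uses bootleg score features, and names here are illustrative):

```python
def subsequence_dtw(query, ref):
    """Best cumulative cost of aligning `query` to any contiguous region of `ref`."""
    INF = float("inf")
    n, m = len(query), len(ref)
    # D[i][j]: best cost of aligning query[:i] with a subsequence of ref ending at j.
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0] = [0.0] * (m + 1)                     # the subsequence may start anywhere
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - ref[j - 1])
            D[i][j] = cost + min(D[i - 1][j - 1],   # both advance
                                 D[i - 1][j],       # query advances
                                 D[i][j - 1])       # reference advances
    return min(D[n][1:])                       # the subsequence may end anywhere
```

Running this once per sheet music line yields one row of Cseg; the segment-level pass over Dseg then chains these local matches into a globally optimal sequence of line segments, which is what handles jumps and repeats.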

Figure 5

A single frame of the video generated from Chopin Nocturne Op. 9 No. 1. The video shows the estimated line of music and uses a red cursor to indicate the predicted location.

Figure 6

Generating audio with various types of repeats. To generate data with repeats at line breaks, we segment the original audio recording at sheet music line breaks, sample boundary points without replacement, and then splice and concatenate audio segments as shown above. To generate data with repeats that can occur mid-line, we first sample lines and then randomly choose time points in those lines.
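The line-break repeat generation above can be sketched as follows. This is an illustrative sketch, not the authors' data pipeline: `segments` is assumed to be the audio already cut at sheet music line breaks, and the function name is hypothetical.

```python
import random

def splice_with_repeat(segments, rng=None):
    """Pick two line-break boundaries a < b (without replacement) and replay segments[a:b] once."""
    rng = rng or random.Random()
    a, b = sorted(rng.sample(range(len(segments) + 1), 2))  # sample boundaries without replacement
    # play up to b, jump back to a, replay a..b, then continue to the end
    return segments[:b] + segments[a:b] + segments[b:]
```

Repeating this sampling step yields queries with two or three repeats; for mid-line repeats, the boundary set would instead be random time points inside sampled lines.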

Table 1

System performance on the audio-sheet image retrieval task with all solo piano sheet music images in IMSLP. Results are reported for five different repeat benchmarks and across two types of audio.

| Benchmark | MRR (Synthetic) | MRR (Real) |
| --- | --- | --- |
| No Repeat | 0.77 | 0.63 |
| Repeat 1 | 0.76 | 0.63 |
| Repeat 2 | 0.75 | 0.61 |
| Repeat 3 | 0.75 | 0.60 |
| D.S. al fine | 0.78 | 0.63 |
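Mean reciprocal rank (MRR), the metric reported above, averages the reciprocal rank of the correct sheet music item in each query's ranked retrieval list. A minimal sketch (function and argument names are illustrative):

```python
def mean_reciprocal_rank(ranked_lists, correct_items):
    """Average of 1/rank of the correct item over all queries."""
    total = 0.0
    for ranked, correct in zip(ranked_lists, correct_items):
        rank = ranked.index(correct) + 1   # 1-based rank of the true match
        total += 1.0 / rank
    return total / len(correct_items)
```

An MRR of 0.77 therefore means the correct piece is typically at or very near the top of the ranked list.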
Figure 7

Comparison of system performance on the audio-sheet image alignment task with various types of jumps. The bar heights indicate accuracy with a scoring collar of 0.5 seconds on real audio. The short black lines indicate accuracy with scoring collars of 0 and 1.0 seconds on real audio. The top lines in the same color as the bars indicate that system's accuracy with a scoring collar of 0.5 seconds on synthetic audio.
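Accuracy with a scoring collar, as used in the figure above, can be sketched as follows (an assumed definition for illustration: a predicted timestamp counts as correct if it falls within `collar` seconds of its ground truth timestamp):

```python
def collar_accuracy(pred_times, true_times, collar=0.5):
    """Fraction of predictions within `collar` seconds of ground truth."""
    hits = sum(abs(p - t) <= collar for p, t in zip(pred_times, true_times))
    return hits / len(true_times)
```

A collar of 0 demands exact agreement, while a 1.0-second collar tolerates small timing offsets, which is why the short black lines bracket each bar.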

Figure 8

Visualization of system predictions for a query with no repeats (top half) and a query with three repeats (bottom half). The gray stripes represent the duration of the whole audio recording. The black vertical lines show ground truth locations of line breaks, and the red regions indicate times when an incorrect line of sheet music is being shown. The thick blue lines indicate the positions of jumps and repeats.

Table 2

Runtime information for the alignment and retrieval subsystems. These times exclude the time required to extract features.

| Benchmark | Alignment, All Pages: avg (min) | Alignment, All Pages: std (min) | Alignment, Matching Pages Only: avg (sec) | Alignment, Matching Pages Only: std (sec) | Retrieval: avg (sec) | Retrieval: std (sec) |
| --- | --- | --- | --- | --- | --- | --- |
| No Repeat | 8.0 | 62.8 | 27.9 | 17.4 | 3.95 | 3.57 |
| Repeat 1 | 11.2 | 69.6 | 32.4 | 20.3 | 5.24 | 3.83 |
| Repeat 2 | 14.9 | 99.6 | 42.1 | 23.9 | 6.77 | 3.33 |
| Repeat 3 | 20.4 | 103.4 | 79.8 | 69.0 | 8.34 | 19.22 |
| D.S. al fine | 13.7 | 78.2 | 46.3 | 30.7 | 5.30 | 3.49 |
Table 3

Runtime information for all components of the system on an average-length piece (4 minutes of audio, 13 pages of sheet music).

| System Component | AMT | Retrieval | Alignment | Video Generation | Total |
| --- | --- | --- | --- | --- | --- |
| Time (sec) | 30 | 5 | 30 | 20 | 85 |
| Percentage (%) | 35 | 6 | 35 | 24 | 100 |
Table 4

Assessing the effect of jump locations on the audio-sheet image alignment task. Two conditions are compared: when jump locations occur only at line breaks (column 3) and when jump locations can occur anywhere in a line (column 4). Column 5 shows the performance difference between these two conditions.

| System | Benchmark | Line Breaks Only (%) | Random Location (%) | Difference (%) |
| --- | --- | --- | --- | --- |
| JumpDTW | No Repeat | 71.5 | 71.5 | N/A |
| JumpDTW | Repeat 1 | 71.9 | 69.0 | –2.9 |
| JumpDTW | Repeat 2 | 70.5 | 67.2 | –3.3 |
| JumpDTW | Repeat 3 | 71.4 | 65.8 | –5.6 |
| JumpDTW | D.S. al fine | 70.7 | 66.1 | –4.6 |
| HierDTW | No Repeat | 84.8 | 84.8 | N/A |
| HierDTW | Repeat 1 | 84.5 | 81.2 | –3.3 |
| HierDTW | Repeat 2 | 82.5 | 78.4 | –4.1 |
| HierDTW | Repeat 3 | 81.9 | 76.4 | –5.5 |
| HierDTW | D.S. al fine | 81.8 | 77.4 | –4.4 |
DOI: https://doi.org/10.5334/tismir.69 | Journal eISSN: 2514-3298
Language: English
Submitted on: Aug 28, 2020
Accepted on: Feb 4, 2021
Published on: Mar 26, 2021
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2021 Mengyi Shan, T. J. Tsai, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.