Investigating the Perceptual Validity of Evaluation Metrics for Automatic Piano Music Transcription

Adrien Ycart; Lele Liu; Emmanouil Benetos; Marcus T. Pearce

doi:10.5334/tismir.57

Figures & Tables

Figure 1

Screenshot of the listening test website.

Table 1

Benchmark evaluation metrics (precision P, recall R, and F-measure F, in percent) for all systems, evaluated on the MAPS subsets ENSTDkCl and ENSTDkAm. OAF obtains the best value in every column.

System	P_f	R_f	F_f	P_n,On	R_n,On	F_n,On	P_n,OnOff	R_n,OnOff	F_n,OnOff
STF	67.2	60.0	62.7	49.8	32.0	38.3	16.5	11.3	13.2
CNN	80.2	58.2	66.1	77.0	54.9	63.2	33.5	24.6	28.0
NMF	71.3	63.3	66.4	79.6	57.0	65.7	35.7	26.4	30.0
OAF	89.0	79.5	83.8	85.9	84.1	84.9	66.9	65.5	66.2
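As a rough illustration of how an onset-only note-level score such as F_n,On can be obtained, the sketch below matches estimated notes to reference notes of the same pitch whose onsets fall within a tolerance window. It is a simplified greedy matcher and not necessarily the procedure used to produce Table 1; the function name `onset_prf`, the `(onset, pitch)` note representation, and the 50 ms default tolerance are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical note representation: (onset time in seconds, MIDI pitch).
Note = Tuple[float, int]


def onset_prf(reference: List[Note], estimated: List[Note],
              tolerance: float = 0.05) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for onset-only note matching.

    Each estimated note is greedily matched to the earliest unmatched
    reference note with the same pitch and an onset within `tolerance`
    seconds. The 50 ms default is an assumption for this sketch.
    """
    unmatched = sorted(reference)
    true_positives = 0
    for onset, pitch in sorted(estimated):
        for i, (ref_onset, ref_pitch) in enumerate(unmatched):
            if ref_pitch == pitch and abs(ref_onset - onset) <= tolerance:
                true_positives += 1
                del unmatched[i]  # each reference note is matched at most once
                break
    precision = true_positives / len(estimated) if estimated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure
```

Widening or narrowing `tolerance` changes which notes count as correct, which is the kind of threshold variation the later figures on agreement with raters refer to.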

Figure 2

Vote proportion in pairwise comparisons of the systems. Blue bars represent the proportion of times the system on the left was chosen over the one on the right. For each pair, the percentage in parentheses is the average F_n,On computed on the specific examples included in the comparison.

Figure 3

Proportion of agreement, across all examples, between raters and various evaluation metrics (F_f with various frame sizes, and F_n,On with various tolerance thresholds).

Figure 4

Proportion of agreement, across all examples, between raters and F_n,OnOff, with various onset and offset tolerance thresholds.
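The agreement proportions described in these captions can be thought of as the fraction of pairwise ratings in which the rater chose the transcription that the metric also scores higher. The sketch below shows one way to compute such a proportion. The function name `agreement_proportion`, the 0/1 encoding of rater choices, and the decision to skip ties in the metric are all assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence


def agreement_proportion(rater_choices: Sequence[int],
                         metric_a: Sequence[float],
                         metric_b: Sequence[float]) -> float:
    """Fraction of ratings where the rater and the metric prefer the same option.

    rater_choices[i] is 0 if the rater preferred transcription A and 1 if
    they preferred transcription B (hypothetical encoding). metric_a[i] and
    metric_b[i] are the metric values of the two transcriptions.
    """
    agreements = 0
    counted = 0
    for choice, a, b in zip(rater_choices, metric_a, metric_b):
        if a == b:
            continue  # the metric expresses no preference; skipped (assumption)
        metric_choice = 0 if a > b else 1
        counted += 1
        if choice == metric_choice:
            agreements += 1
    return agreements / counted if counted else float("nan")
```

Recomputing `metric_a` and `metric_b` under different frame sizes or tolerance thresholds and calling the function again would give one agreement value per setting.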

Table 2

Coefficients and p-values for the linear fixed effects model using agreement with F_n,On as dependent variable and features as fixed effects.

Feature	Coefficient	P-value
ΔF	0.539	<0.001
F_best	0.330	<0.001
Gold-MSI	–0.007	0.232
Known	0.014	0.391
Difficulty	–0.044	<0.001

Figure 5

Agreement between ratings and F_n,On for each reported difficulty level.

Figure 6

Distribution of difficulty ratings (lightest = 1, darkest = 5) for each pair of systems.

Table 3

Coefficients and p-values for the linear fixed effects model using difficulty as dependent variable and features as fixed effects.

Feature	Coefficient	P-value
ΔF	–1.564	<0.001
F_best	–0.608	<0.001
Gold-MSI	–0.227	<0.001
Known	–0.153	0.002
Agree	–0.423	<0.001

Table 4

Coefficients and p-values for the linear fixed effects model using agreement with F_n,On as dependent variable and features as fixed effects, on confident answers only.

Feature	Coefficient	P-value
ΔF	0.584	<0.001
F_best	0.349	<0.001
Gold-MSI	–0.014	0.011
Known	0.002	0.912
Difficulty	–0.036	<0.001

Figure 7

Proportion of agreement depending on the difference in F_n,On between the two options, computed on confident answers only.

Table 5

Coefficients and p-values for the linear fixed effects model using agreement among raters as dependent variable and features as fixed effects.

Feature	Coefficient	P-value
ΔF	0.496	<0.001
F_best	–0.092	0.423
Gold-MSI_avg	–0.071	0.004
Gold-MSI_std	–0.016	0.778
Difficulty_avg	–0.176	0.003

Figure 8

A_conf measure for each tested configuration, averaged across folds. The dotted line represents A_conf for F_n,On. Descriptions of each configuration are given in Table 6. Colors represent the p-value when testing whether each metric is different from the “All” configuration. Asterisks represent results significantly different from All (*: p < 0.1, **: p < 0.05, ***: p < 0.01).

Table 6

Description of each tested feature configuration.

Configuration	Removed features
All	None
NoBench	Benchmark metrics
NoFeatures	All features, except benchmark metrics
NoHighLow	Mistakes in highest and lowest notes
NoLoud	Loudness of false negatives
NoOutKey	Out-of-key false positives
NoRepeat	Repeated and merged notes
NoSpecific	Specific pitch mistakes
NoPoly	Polyphony level difference
NoRhythm	Rhythm histogram flatness and rhythm dispersion
NoFramewise	Framewise benchmark metrics, framewise highest and lowest note mistakes, framewise specific pitch errors, polyphony level difference, consonance measures
NoSpecOut	Specific pitch mistakes and out-of-key false positives
