‘We are Not Groupies… We are Band Aids’: Assessment Reliability in the AI Song Contest

Open Access
Dec 2021

Figures & Tables

Table 1

Entries and final places for the AI Song Contest 2020.

| Place | Country | Team | Song |
| --- | --- | --- | --- |
| 1 | Australia | Uncanny Valley | Beautiful the World |
| 2 | Germany | Dadabots × Portrait XO | I’ll Marry You Punk Come |
| 3 | The Netherlands | Can AI Kick It | Abbus |
| 4 | France | Algomus & Friends | I Keep Counting |
| 5 | The Netherlands | COMPUTD/Shuman & Angel-Eye | I Write a Song |
| 6 | United Kingdom | Brentry | Hope Rose High |
| 7 | Belgium | Polaris | Princess |
| 8 | Belgium | Beatroots | Violent Delights Have Violent Ends |
| 9 | France | DataDada | Je secoue le monde |
| 10 | Sweden | KTH/KMH + Doremir | Come To Ge Ther |
| 11 | Germany | OVGneUrovision | Traveller in Time |
| 12 | Germany | Ligatur | Offshore in Deep Water |
| 13 | Switzerland | New Piano | Painful Words |

Figure 1

Development of voters’ average scores over time and final jury scores. The voting sites were open from 10 April 2020 through 10 May 2020; the jury scores and final results were announced in a live broadcast on 12 May 2020. Each song’s final score was the sum of its average voter score and its score from the jury. The jury favourite, ‘I’ll Marry You Punk Come’, was a notable point of disagreement between the jury and the voters.
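The scoring rule stated in this caption is simple arithmetic. A minimal sketch follows; the function name `final_score` and the example numbers are hypothetical and are not contest data.

```python
from statistics import mean


def final_score(voter_scores, jury_score):
    """Return the average voter score plus the jury score, as the caption describes."""
    if not voter_scores:
        raise ValueError("at least one voter score is required")
    return mean(voter_scores) + jury_score


# Hypothetical numbers for illustration only (not contest data).
example = final_score([8.0, 6.0, 7.0], jury_score=9.5)
```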

Figure 2

Distribution of votes across teams and voters’ frequency of voting. Most voters voted either for all thirteen teams (11%) or just one (67%). One-time voters were distributed quite unevenly across the entries and constituted 45% of all votes for the extreme case of ‘Beautiful the World’.
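The two quantities summarised in this caption (how many entries each voter rated, and the share of an entry's votes that came from one-time voters) could be computed along the following lines. This is a sketch under assumptions: the `(voter_id, song)` record layout and both function names are hypothetical, and the toy data are invented.

```python
from collections import Counter


def songs_per_voter(votes):
    """Map each voter to the set of songs they rated; votes are (voter_id, song) pairs."""
    rated = {}
    for voter, song in votes:
        rated.setdefault(voter, set()).add(song)
    return rated


def voting_frequency_shares(votes):
    """Share of voters who rated exactly k songs, for each observed k."""
    freq = Counter(len(songs) for songs in songs_per_voter(votes).values())
    total = sum(freq.values())
    return {k: count / total for k, count in sorted(freq.items())}


def one_time_share(votes, song):
    """Share of a song's voters who rated that song and nothing else."""
    rated = songs_per_voter(votes)
    voters_for_song = [v for v, songs in rated.items() if song in songs]
    if not voters_for_song:
        raise ValueError("no votes recorded for this song")
    one_timers = [v for v in voters_for_song if len(rated[v]) == 1]
    return len(one_timers) / len(voters_for_song)


# Invented toy data, for illustration only.
toy_votes = [("a", "X"), ("b", "X"), ("b", "Y"), ("c", "X"), ("d", "Y")]
```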

Table 2

Prior and hyper-prior distributions for the hierarchical Rasch models. The choices are weakly informative with regularising tails.

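Priors

| Parameter | Description |
| --- | --- |
| $\gamma_n \sim \mathcal{N}(\mu_\gamma, \sigma_\gamma)$ | Logit three-inflation |
| $\theta_n \sim \mathcal{N}(0, \sigma_\theta)$ | Song quality |
| $\delta_i \sim \mathcal{N}(\mu_\delta, \sigma_\delta)$ | Criterion difficulty |
| $\lambda_j \sim \mathcal{N}(0, \sigma_\lambda)$ | Voter or judge severity |
| $\tau_k \sim \mathcal{N}(0, \sigma_\tau)$ | Rating-threshold offset |
| $\zeta_{ik} \sim \mathcal{N}(0, \sigma_\zeta)$ | Partial-credit interaction |

Hyper-Priors

| Parameter | Description |
| --- | --- |
| $\mu_\gamma \sim \mathcal{N}(0, 1)$ | Mean logit three-inflation |
| $\mu_\delta \sim \mathcal{N}(0, 1)$ | Intercept |
| $\sigma_\gamma \sim \mathcal{N}^{+}(0, 1)$ | SD logit three-inflation |
| $\sigma_\theta \sim \mathcal{N}^{+}(0, 1)$ | SD song quality |
| $\sigma_\delta \sim \mathcal{N}^{+}(0, 1)$ | SD criterion difficulty |
| $\sigma_\lambda \sim \mathcal{N}^{+}(0, 1)$ | SD voter or judge severity |
| $\sigma_\tau \sim \mathcal{N}^{+}(0, 1)$ | SD threshold offset |
| $\sigma_\zeta \sim \mathcal{N}^{+}(0, 1)$ | SD partial-credit interaction |
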
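To make the hierarchy in Table 2 concrete, the sketch below draws one set of parameters from the listed priors: hyper-parameters first, then the parameters that depend on them. It assumes that N+ denotes a half-normal distribution, that the second argument of N is a standard deviation (as the "SD" descriptions suggest), and that n, i, j, and k index songs, criteria, raters, and thresholds respectively. The function names and dimensions are hypothetical, and the sketch covers the priors only, not the Rasch likelihood.

```python
import random


def half_normal(rng, sd=1.0):
    """Draw from N+(0, sd), taken here to be the absolute value of a zero-mean normal."""
    return abs(rng.gauss(0.0, sd))


def sample_priors(n_songs, n_criteria, n_raters, n_thresholds, seed=0):
    """Draw one parameter set from the priors and hyper-priors of Table 2."""
    rng = random.Random(seed)

    # Hyper-priors: two means and six standard deviations.
    mu_gamma = rng.gauss(0.0, 1.0)
    mu_delta = rng.gauss(0.0, 1.0)
    sd = {name: half_normal(rng)
          for name in ("gamma", "theta", "delta", "lambda", "tau", "zeta")}

    # Priors, conditional on the hyper-parameters drawn above.
    return {
        "gamma": [rng.gauss(mu_gamma, sd["gamma"]) for _ in range(n_songs)],
        "theta": [rng.gauss(0.0, sd["theta"]) for _ in range(n_songs)],
        "delta": [rng.gauss(mu_delta, sd["delta"]) for _ in range(n_criteria)],
        "lambda": [rng.gauss(0.0, sd["lambda"]) for _ in range(n_raters)],
        "tau": [rng.gauss(0.0, sd["tau"]) for _ in range(n_thresholds)],
        "zeta": [[rng.gauss(0.0, sd["zeta"]) for _ in range(n_thresholds)]
                 for _ in range(n_criteria)],
    }


# Hypothetical dimensions, chosen only to exercise the function.
draw = sample_priors(n_songs=13, n_criteria=4, n_raters=5, n_thresholds=3)
```

Repeating such draws and inspecting the implied parameter ranges is one informal way to see what "weakly informative" means in practice.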
Table 3

Approximate information criteria under leave-one-song-out cross-validation (LOO-IC; lower is better). The observations are weighted such that voters’ ratings (on scales of 0 to 3) and the judges’ ratings (on scales of 0 to 2) contribute equally to the likelihood. The intercept-only model serves as a simple baseline. Leave-one-song-out cross-validation is conservative, and only three models outperform the baseline (in italics and bold); the many-facet partial-credit model without three inflation performs best.

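| Model | Facets | Three Inflation | Parameter Count | LOO-IC |
| --- | --- | --- | --- | --- |
| Intercept Only | Single | No | 15 | 144 944 |
| Rating Scale | Single | No | 30 | 147 378 |
| Rating Scale | Single | Yes | 45 | 150 455 |
| Partial Credit | Single | No | 49 | 153 181 |
| Partial Credit | Single | Yes | 64 | 144 044 |
| Rating Scale | Many | No | 3860 | 145 056 |
| Rating Scale | Many | Yes | 3875 | 144 796 |
| Partial Credit | Many | No | 3879 | 143 402 |
| Partial Credit | Many | Yes | 3894 | 151 248 |
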
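The comparison described in the Table 3 caption can be checked directly from the tabulated values. The sketch below transcribes the table and lists the models whose LOO-IC falls below the intercept-only baseline; the variable names are hypothetical.

```python
# (model, facets, three_inflation, parameter_count, loo_ic), transcribed from Table 3.
TABLE_3 = [
    ("Intercept Only", "Single", False, 15, 144944),
    ("Rating Scale", "Single", False, 30, 147378),
    ("Rating Scale", "Single", True, 45, 150455),
    ("Partial Credit", "Single", False, 49, 153181),
    ("Partial Credit", "Single", True, 64, 144044),
    ("Rating Scale", "Many", False, 3860, 145056),
    ("Rating Scale", "Many", True, 3875, 144796),
    ("Partial Credit", "Many", False, 3879, 143402),
    ("Partial Credit", "Many", True, 3894, 151248),
]

# LOO-IC of the intercept-only baseline.
baseline = next(row[4] for row in TABLE_3 if row[0] == "Intercept Only")

# Models that beat the baseline (lower is better), best first.
better = sorted((row for row in TABLE_3 if row[4] < baseline), key=lambda row: row[4])
best = better[0]
```

Here `better` holds three rows and `best` is the many-facet partial-credit model without three inflation, matching the caption.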
Figure 3

Rasch calibrations for the AI Song Contest evaluation scheme. (A–C) present kernel density estimates of the calibrations for song quality, voters’ criterion difficulty, and the jury’s criterion difficulty, all on a standard T scale (M = 50, SD = 10). Plot labels are followed by point estimates of the calibrations (posterior medians) as well as the reliability coefficients for these estimates (in parentheses). The density estimates are marked with their medians and the 2.5% and 97.5% quantiles (i.e., a 95% credible interval). For the rating criteria, the densities for each step of the scale are shown individually. (D) presents density estimates for the severities of the voters who voted for each entry, coloured by tail probability (dark blue for the median ranging to yellow at the extrema of the distributions). ‘Groupies’ are prominently visible, especially for ‘Beautiful the World’ and ‘I Write a Song’. (E) shows the same visualisation after excluding all voters who gave perfect scores or perfect zeros; with these voters excluded, the groupie effect disappears.
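Two of the summaries named in this caption, the T scale (M = 50, SD = 10) and the median with 2.5% and 97.5% quantiles, are easy to illustrate. The sketch below assumes that the T scale is obtained by standardising a set of values and rescaling them, and that the interval is the central 95% interval of posterior draws; how the authors actually standardised their calibrations is not stated here, and the function names are hypothetical.

```python
from statistics import mean, median, pstdev, quantiles


def to_t_scale(values):
    """Rescale values so that their mean is 50 and their (population) SD is 10."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        raise ValueError("values must not all be equal")
    return [50 + 10 * (v - m) / s for v in values]


def summarise(draws):
    """Median and central 95% interval (2.5% and 97.5% quantiles) of posterior draws."""
    # n=40 yields cut points at 2.5%, 5%, ..., 97.5%; the first and last bound the interval.
    cuts = quantiles(draws, n=40, method="inclusive")
    return {"median": median(draws), "lower": cuts[0], "upper": cuts[-1]}


# Invented values, for illustration only.
t_values = to_t_scale([1.0, 2.0, 3.0, 4.0])
summary = summarise(list(range(101)))
```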

Figure 4

Posterior predictive checks on the distribution of ratings. Observed data appear as histograms; black lines cover 95% of the corresponding histograms from 2000 simulated data sets using parameter values sampled from the posterior distribution. For the voters, we provide an analysis per song, but in order to preserve the anonymity of judges, we only provide aggregated data for the jury. In general, the model seems to be well calibrated, but there are a few notable miscalibrations for ‘I’ll Marry You Punk Come’ and ‘Traveller in Time’.
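A posterior predictive check of this kind can be sketched as follows: build a histogram of ratings for each simulated data set, take a central 95% band per rating category, and flag categories where the observed count falls outside the band. Reading "cover 95%" as a per-category central interval is an assumption, the function names are hypothetical, and the toy simulations stand in for draws from a fitted model.

```python
from collections import Counter
from statistics import quantiles


def histogram(ratings, categories):
    """Count how often each rating category occurs."""
    counts = Counter(ratings)
    return [counts.get(c, 0) for c in categories]


def predictive_band(simulated, categories):
    """Per category, the 2.5% and 97.5% quantiles of counts across simulated data sets."""
    hists = [histogram(sim, categories) for sim in simulated]
    band = []
    for idx in range(len(categories)):
        cuts = quantiles([h[idx] for h in hists], n=40, method="inclusive")
        band.append((cuts[0], cuts[-1]))
    return band


def miscalibrated(observed, simulated, categories):
    """Return the categories whose observed count lies outside the predictive band."""
    obs = histogram(observed, categories)
    band = predictive_band(simulated, categories)
    return [c for c, o, (lo, hi) in zip(categories, obs, band) if not lo <= o <= hi]


# Toy stand-in for posterior simulations: 100 data sets of ten binary ratings each.
toy_simulated = [[0] * k + [1] * (10 - k) for k in range(3, 8)] * 20
```

With the toy simulations, an observed data set with five ratings in each category falls inside the band, whereas ten identical ratings are flagged in both categories.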

DOI: https://doi.org/10.5334/tismir.102 | Journal eISSN: 2514-3298
Language: English
Submitted on: Mar 1, 2021
Accepted on: Jul 5, 2021
Published on: Dec 3, 2021
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2021 John Ashley Burgoyne, Hendrik Vincent Koops, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.