
Creating DALI, a Large Dataset of Synchronized Audio, Lyrics, and Notes

Open Access | June 2020

Figures & Tables

Table 1

Comparison of lyrics alignment datasets.

| Dataset | Number of songs | Language | Audio type | Granularity |
|---|---|---|---|---|
| (Iskandar et al., 2006) | No training, 3 test songs | English | Polyphonic | Syllables |
| (Wong et al., 2007) | 14 songs divided into 70 segments of 20 s length | Cantonese | Polyphonic | Words |
| (Müller et al., 2007) | 100 songs | English | Polyphonic | Words |
| (Kan et al., 2008) | 20 songs | English | Polyphonic | Sections, lines |
| (Mesaros and Virtanen, 2010) | Training: 49 fragments of ~25 s for adapting a phonetic model; testing: 17 songs | English | Training: a cappella; testing: vocals after source separation | Lines |
| (Hansen, 2012) | 9 pop songs | English | Both accompanied and a cappella | Words, lines |
| (Mauch et al., 2012) | 20 pop songs | English | Polyphonic | Words |
| DAMP dataset (Smith, 2013) | 34k amateur versions of 301 songs | English | A cappella | Textual lyrics only, not time-aligned |
| DAMPB dataset (Kruspe, 2016) | Subset of DAMP with 20 performances of 301 songs | English | A cappella | Words, phonemes |
| (Dzhambazov, 2017) | 70 fragments of 20 s | Chinese, Turkish | Polyphonic | Phonemes |
| (Lee and Scott, 2017) | 20 pop songs | English | Polyphonic | Words |
| (Gupta et al., 2018) | Subset of DAMP with 35,662 segments of 10 s length | English | A cappella | Lines |
| Jamendo aligned (Ramona et al., 2008; Stoller et al., 2019) | 20 Creative Commons songs | English | Polyphonic | Words |
| DALI v1 (Meseguer-Brocal et al., 2018) | 5358 songs in full duration | Many | Polyphonic | Notes, words, lines and paragraphs |
| DALI v2 | 7756 songs in full duration | Many | Polyphonic | Notes, words, phonemes, lines and paragraphs |
Figure 1

[Left] Our inputs are karaoke-user annotations, presented as tuples of {time (start and duration), musical note, text}. [Right] Our method automatically finds the corresponding full audio track and globally aligns the vocal melody and lyrics to it. The close-up of the spectrogram illustrates the alignment of a small excerpt at two levels of granularity: notes and lines.
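To make the annotation format concrete, here is a minimal sketch of one annotation as the tuple described above; the class and field names are illustrative, not DALI's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """One karaoke annotation: {time (start, duration), musical note, text}.

    Field names are illustrative; DALI's actual storage format may differ.
    """
    start: float     # start time, in frames on the annotation grid
    duration: float  # duration, in frames
    note: int        # note in semitones relative to C3 (0 = C3, per Table 5)
    text: str        # the annotated lyric fragment (e.g. a syllable or word)

# Example: a syllable sung on E3 (4 semitones above C3).
syllable = Annotation(start=120.0, duration=15.0, note=4, text="love")
```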

Table 2

DALI dataset general overview.

| Version | Songs | Artists | Genres | Languages | Decades |
|---|---|---|---|---|---|
| 1.0 | 5358 | 2274 | 61 | 30 | 10 |
| 2.0 | 7756 | 2866 | 63 | 32 | 10 |
Table 3

Statistics for the different DALI datasets. One song can have several genres.

| Version | Average songs per artist | Average duration per song | Full duration | Top 3 genres | Top 3 languages | Top 3 decades |
|---|---|---|---|---|---|---|
| 1.0 | 2.36 | Audio: 231.95 s; with vocals: 118.87 s | Audio: 344.9 hrs; with vocals: 176.9 hrs | Pop: 2662; Rock: 2079; Alternative: 869 | ENG: 4018; GER: 434; FRA: 213 | 2000s: 2318; 1990s: 1020; 2010s: 668 |
| 2.0 | 2.71 | Audio: 226.78 s; with vocals: 114.73 s | Audio: 488.1 hrs; with vocals: 247.2 hrs | Pop: 3726; Rock: 2794; Alternative: 1241 | ENG: 5913; GER: 615; FRA: 304 | 2000s: 3248; 1990s: 1409; 2010s: 1153 |
Table 4

Proposed split with respect to the time-correlation values; NCC_t is defined in Section 4.3. A split-rule sketch follows the table.

| Split | Correlations | Tracks |
|---|---|---|
| Test | NCC_t ≥ .94 | 1.0: 167; 2.0: 402 |
| Validation | .94 > NCC_t ≥ .925 | 1.0: 423; 2.0: 439 |
| Train | .925 > NCC_t ≥ .8 | 1.0: 4768; 2.0: 6915 |
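Read as a rule, the thresholds above partition tracks by their NCC_t score. A minimal sketch, where the function name and the handling of scores below .8 are assumptions:

```python
def assign_split(ncc_t: float) -> str:
    """Map a track's NCC_t score to a partition, using the Table 4 thresholds."""
    if ncc_t >= 0.94:
        return "test"
    if ncc_t >= 0.925:
        return "validation"
    if ncc_t >= 0.8:
        return "train"
    return "excluded"  # assumption: tracks below .8 are not part of any split
```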
Table 5

Overview of terms: definition of each term used in this article. NCC_t is defined in Section 4.3.

| Term | Definition |
|---|---|
| Notes | Time-aligned symbolic vocal melody annotations. |
| Annotation | Basic alignment unit: a tuple of time (start and duration in frames), musical note (with 0 = C3) and text. |
| A file with annotations | Group of annotations that defines the alignment of a particular song. |
| Offset time (o) | The start of the annotations. |
| Frame rate (fr) | The reciprocal of the annotation grid size. |
| Voice annotation sequence (vas(t) ∈ {0,1}) | A vector that defines when the singing voice (SV) is active according to the karaoke users' annotations. |
| Predictions (p̂(t) ∈ [0,1]) | Probability sequence indicating whether or not the singing voice is active at each frame, provided by our singing voice detection system. |
| Labels | Label sequence of well-known ground-truth datasets checked by the MIR community. |
| Teacher | SV detection (SVD) system used for selecting audio candidates and aligning the annotations to them. |
| Student | New SVD system trained on the vas(t) of the subset selected by the Teacher, i.e. tracks with NCC(ô, f̂r) ≥ T_corr. |
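Several of these terms meet in the Student row: the Student is trained only on tracks whose annotations correlate well with the Teacher's predictions. As a rough illustration, a normalized cross-correlation score between vas(t) and p̂(t) at a fixed alignment might look as follows (a sketch only; the exact NCC formulation is given in Section 4.3):

```python
import numpy as np

def ncc(vas: np.ndarray, p_hat: np.ndarray) -> float:
    """Normalized correlation between a binary voice annotation sequence
    vas(t) and a singing-voice probability sequence p_hat(t), both sampled
    on the same frame grid. Illustrative; see Section 4.3 for the paper's
    exact definition."""
    n = min(len(vas), len(p_hat))
    vas, p_hat = vas[:n], p_hat[:n]
    denom = np.sqrt((vas ** 2).sum() * (p_hat ** 2).sum())
    return float((vas * p_hat).sum() / denom) if denom > 0 else 0.0
```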
Figure 2

[Left part] Target lyrics lines and paragraphs as provided in WASABI. [Right part] The melody paragraphs p^m are created by merging the melody lines l^m into an existing target paragraph p^t. Note how line l^t_11 in p^t_2 has no direct counterpart in l^m_* and verse p^m_3 does not appear in any p^t_*.
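A plausible way to realize this merge, purely as an illustration, is to attach each melody line to the target paragraph containing the most textually similar target line; the similarity measure and the 0.6 threshold below are assumptions, not the paper's method:

```python
from difflib import SequenceMatcher

def best_target_paragraph(melody_line: str,
                          target_paragraphs: list[list[str]]) -> int | None:
    """Index of the target paragraph whose best-matching line is most
    similar to `melody_line`, or None when nothing matches well enough.
    Hypothetical criterion; the paper's actual merge may differ."""
    best_idx, best_score = None, 0.0
    for i, paragraph in enumerate(target_paragraphs):
        for target_line in paragraph:
            score = SequenceMatcher(None, melody_line.lower(),
                                    target_line.lower()).ratio()
            if score > best_score:
                best_idx, best_score = i, score
    return best_idx if best_score >= 0.6 else None
```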

Figure 3

Architecture of our Singing Voice Detection (SVD) system using CNNs.

Figure 4

The input is a vas_{o,fr}(t) (blue part, top-left area) and a set of audio candidates retrieved from YouTube. The similarity estimation method uses an SVD model to convert each candidate into a p̂(t) (orange part, lower-left area). We measure the similarity between the vas_{o,fr}(t) and each p̂(t) using the cross-correlation method argmax_{o,fr} NCC(o, fr) described in Section 4.3 (red part, right area). The output is the audio file with the highest NCC(ô, f̂r) and the annotations aligned to it, according to the parameters f̂r and ô.
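The search for (ô, f̂r) can be pictured as a grid search over candidate offsets and frame rates. A minimal sketch, where render_vas is a hypothetical callable producing vas_{o,fr}(t) on the prediction's frame grid and ncc is the same helper sketched after Table 5:

```python
import numpy as np

def ncc(vas: np.ndarray, p_hat: np.ndarray) -> float:
    """Correlation at a fixed alignment (same sketch as after Table 5)."""
    denom = np.sqrt((vas ** 2).sum() * (p_hat ** 2).sum())
    return float((vas * p_hat).sum() / denom) if denom > 0 else 0.0

def align(render_vas, p_hat: np.ndarray, offsets, frame_rates):
    """Grid search for argmax_{o,fr} NCC(o, fr), as in Figure 4.
    `render_vas(o, fr)` is a hypothetical callable returning vas_{o,fr}(t)
    sampled on p_hat's frame grid; the candidate ranges are illustrative."""
    best = (None, None, -np.inf)
    for o in offsets:
        for fr in frame_rates:
            score = ncc(render_vas(o, fr), p_hat)
            if score > best[2]:
                best = (o, fr, score)
    return best  # (o_hat, fr_hat, NCC(o_hat, fr_hat))
```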

Figure 5

Creating the DALI dataset using the teacher-student paradigm.

Figure 6

We create three SVD systems (teachers) using the ground truth datasets (Jamendo, MedleyDB and Both). The three systems generate three new datasets (DALI v0) used to train three new SVD systems (the first-generation students). Now, we use the best student, J+M, to define DALI v1 (Meseguer-Brocal et al., 2018). We train a second-generation student using DALI v1 and create DALI v2.
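Schematically, the generations in Figure 6 form a short loop: train a detector, use it to build a dataset, then train the next detector on that dataset. The sketch below captures only the control flow; train_svd and select_and_align are placeholders for the paper's actual training and NCC-based selection steps:

```python
def train_svd(dataset):
    """Placeholder: train a singing voice detection model on `dataset`."""
    ...

def select_and_align(model, karaoke_annotations):
    """Placeholder: keep the tracks whose NCC(o_hat, fr_hat) passes the
    correlation threshold and align their annotations (Section 4.3)."""
    ...

def teacher_student(ground_truth, karaoke_annotations, generations=2):
    """Control flow of Figure 6: teacher -> DALI v1 -> student -> DALI v2."""
    model = train_svd(ground_truth)      # teacher, trained on labels
    dataset = None
    for _ in range(generations):
        dataset = select_and_align(model, karaoke_annotations)
        model = train_svd(dataset)       # next-generation student on vas(t)
    return model, dataset
```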

Table 6

Singing voice detection performance, measured as mean accuracy and standard deviation. The number of tracks is shown in parentheses. Nomenclature: T = Teacher, S = Student, J = Jamendo, M = MedleyDB, J+M = Jamendo + MedleyDB, 2G = second generation; in brackets we specify the name of the teacher used for training a student. Dashes mark test sets that overlap a system's training data and are therefore not evaluated.

| SVD system | J_Test (16) | M_Test (36) | J_Test+Train (77) | M_Test+Train (98) |
|---|---|---|---|---|
| T_J_Train (61) | 88.95% ± 5.71 | 83.27% ± 16.6 | – | 81.83% ± 16.8 |
| S [T_J_Train] (2673) | 87.08% ± 6.75 | 82.05% ± 15.3 | 87.87% ± 6.34 | 84.00% ± 13.9 |
| T_M_Train (98) | 76.61% ± 12.5 | 84.14% ± 17.4 | 76.32% ± 11.2 | – |
| S [T_M_Train] (1596) | 82.73% ± 10.6 | 79.89% ± 17.8 | 84.12% ± 9.00 | 82.03% ± 16.4 |
| T_J+M_Train (159) | 83.63% ± 7.13 | 83.24% ± 13.9 | – | – |
| S [T_J+M_Train] (2440) | 87.79% ± 8.82 | 85.87% ± 13.6 | 89.09% ± 6.21 | 86.78% ± 12.3 |
| 2G [S [T_J+M_Train]] (5253) | 93.37% ± 3.61 | 88.64% ± 13.0 | 92.70% ± 3.85 | 88.90% ± 11.7 |
Table 7

Alignment performance for the teachers and students: mean offset deviation (offset_d) in seconds, mean frame-rate deviation (fr_d) in frames; pos is the position in the overall ranking.

| SVD system | mean offset rank | pos | mean offset_d | mean fr rank | pos | mean fr_d |
|---|---|---|---|---|---|---|
| T_J_Train (61) | 2.79 ± .48 | 4 | 0.082 ± 0.17 | 1.18 ± .41 | 4 | 0.51 ± 1.24 |
| S [T_J_Train] (2673) | 2.37 ± .19 | 3 | 0.046 ± 0.05 | 1.06 ± .23 | 3 | 0.25 ± 0.88 |
| T_M_Train (98) | 4.85 ± .50 | 7 | 0.716 ± 2.74 | 1.89 ± .72 | 7 | 2.65 ± 2.96 |
| S [T_M_Train] (1596) | 4.29 ± .37 | 6 | 0.164 ± 0.10 | 1.30 ± .48 | 5 | 0.88 ± 1.85 |
| T_J+M_Train (159) | 3.42 ± .58 | 5 | 0.370 ± 1.55 | 1.47 ± .68 | 6 | 1.29 ± 2.29 |
| S [T_J+M_Train] (2440) | 2.23 ± .07 | 2 | 0.043 ± 0.05 | 1.04 ± .19 | 2 | 0.25 ± 0.85 |
| 2G [S [T_J+M_Train]] (5253) | 1.82 ± .07 | 1 | 0.036 ± 0.06 | 1.01 ± .10 | 1 | 0.21 ± 0.83 |
Figure 7

Local errors still present in DALI. [Left] A misalignment in time due to an imperfect annotation. [Right] An individual mis-annotated note. These problems remain to be addressed in future versions of the dataset.

DOI: https://doi.org/10.5334/tismir.30 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jan 24, 2019
Accepted on: Apr 9, 2020
Published on: Jun 11, 2020
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2020 Gabriel Meseguer-Brocal, Alice Cohen-Hadria, Geoffroy Peeters, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.