JSD: A Dataset for Structure Analysis in Jazz Music

Figures & Tables

Figure 1

(a) Above: overview of the Jazz Structure dataset (JSD). (b) Below: running example “Jordu” by Clifford Brown. The figure shows a novelty function and structure annotations within a web-based interface (T = theme; the pictograms indicate the current soloist and the accompaniment).

Figure 2

Statistics for the large-scale annotations of the SALAMI database. (a) Distribution of the number of segments per recording. The total number of segments is 12634. On average, each recording consists of 10.37 segments. (b) Distribution of segment durations (seconds). On average, a segment has a duration of 25.21 seconds.

Figure 3

Examples of structure annotations on the chorus and solo level.

Figure 4

Raw annotation format for “Jordu” as contained in the JSD. Each row of the CSV file corresponds to a segment. The columns indicate the start time, the end time, the label, and the instrumentation of each segment.
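
As a rough illustration of how such a file could be read, the following Python sketch parses rows with the four columns described in the caption. The column names, the comma delimiter, the presence of a header row, and the sample values are assumptions made for illustration; the actual JSD files may be laid out differently.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Segment:
    start: float          # segment start time in seconds
    end: float            # segment end time in seconds
    label: str            # e.g. "intro", "theme", "solo" (hypothetical spellings)
    instrumentation: str  # e.g. "tp" for trumpet (hypothetical value format)


def read_segments(handle):
    """Parse one annotation file; assumes a header row with these four names."""
    reader = csv.DictReader(handle)
    return [
        Segment(float(r["start"]), float(r["end"]), r["label"], r["instrumentation"])
        for r in reader
    ]


# Made-up sample rows, not taken from the dataset.
sample = io.StringIO(
    "start,end,label,instrumentation\n"
    "0.0,12.5,intro,p\n"
    "12.5,60.0,theme,tp\n"
)
segments = read_segments(sample)
```

Keeping each row as a small typed record makes it straightforward to filter by label or instrument later on.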

Table 1

Overview of annotated (chorus-level) segments for the 340 recordings. From the segments, we derive 4365 segment boundaries (the 4025 start positions of all segments plus the 340 end positions of the last segments), of which 3005 are musical and 1360 are non-musical.

| Type | # Segments | Total duration (min) |
|---|---|---|
| Intro | 229 | 59.76 |
| Theme | 813 | 546.74 |
| Solo | 2223 | 1325.31 |
| Outro | 80 | 35.15 |
| Silence | 680 | 36.93 |
| Σ | 4025 | 2003.89 |

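
The boundary count in the caption can be reproduced with a short sketch: every segment contributes its start position, and only the last segment of a recording contributes its end position. The helper name `boundaries` and the toy times are illustrative assumptions, and the sketch assumes that the segments of a recording are contiguous.

```python
def boundaries(segments):
    """Return boundary times for one recording.

    `segments` is a list of (start, end) pairs sorted by time and assumed to
    be contiguous, so each segment contributes its start and only the last
    segment contributes its end.
    """
    if not segments:
        return []
    return [start for start, _ in segments] + [segments[-1][1]]


# Toy recording with made-up times: 3 segments give 3 + 1 = 4 boundaries.
toy = [(0.0, 10.0), (10.0, 42.0), (42.0, 60.0)]
toy_boundaries = boundaries(toy)

# Corpus-level count from Table 1: 4025 segment starts plus one final end
# for each of the 340 recordings.
total_boundaries = 4025 + 340
```
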
Figure 5

(a) Distribution of the number of (chorus-level) segments per recording (silence segments are discarded). The total number of segments is 3345 (sum of Intro, Theme, Solo, and Outro segments, excluding silence segments). On average, a recording consists of 3345/340 ≈ 9.84 segments. (b) Distribution of segment durations (seconds) of all 3345 segments. On average, a segment has a duration of 35 seconds.
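
The two averages quoted in this caption follow from the totals in Table 1. The snippet below is only a plausibility check; the variable names are chosen for illustration.

```python
# Chorus-level totals from Table 1: (number of segments, total duration in minutes).
# Silence segments are left out, as in the caption.
table1 = {
    "Intro": (229, 59.76),
    "Theme": (813, 546.74),
    "Solo": (2223, 1325.31),
    "Outro": (80, 35.15),
}
num_recordings = 340

num_segments = sum(n for n, _ in table1.values())
total_minutes = sum(d for _, d in table1.values())

segments_per_recording = num_segments / num_recordings
mean_duration_seconds = total_minutes * 60 / num_segments
```
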

Table 2

List of instrument types occurring in the JSD. The abbreviations are used as instrument identifiers. The last four columns indicate the number of solos (#Solo), the number of solo choruses (#Chorus), the number of transcribed solos (#Trans.), and the percentage of transcribed solos (%Trans.). Note that the number of solo choruses is not identical to the number of solo segments from Table 1 (2467 vs. 2223). The former can be higher since there can be multiple soloists in a single solo section (e.g., drums and bass).

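| # | Abbr. | Instrument | #Solo | #Chorus | #Trans. | %Trans. |
|---|---|---|---|---|---|---|
| 0 | cl | Clarinet | 23 | 35 | 15 | 65.22 |
| 1 | bcl | Bass clarinet | 4 | 14 | 2 | 50 |
| 2 | ss | Soprano saxophone | 25 | 74 | 23 | 92 |
| 3 | as | Alto saxophone | 107 | 239 | 80 | 74.77 |
| 4 | ts | Tenor saxophone | 245 | 727 | 158 | 64.49 |
| 5 | bs | Baritone saxophone | 21 | 35 | 11 | 52.38 |
| 6 | tp | Trumpet | 170 | 376 | 102 | 60 |
| 7 | fln | Flugelhorn | 2 | 4 | 0 | 0 |
| 8 | cor | Cornet | 18 | 24 | 15 | 83.33 |
| 9 | tb | Trombone | 38 | 83 | 26 | 68.42 |
| 10 | p | Piano | 222 | 456 | 6 | 2.70 |
| 11 | key | Keyboard | 3 | 8 | 0 | 0 |
| 12 | vib | Vibraphone | 15 | 28 | 12 | 80 |
| 13 | voc | Vocals | 8 | 15 | 0 | 0 |
| 14 | fl | Flute | 4 | 6 | 0 | 0 |
| 15 | g | Guitar | 39 | 88 | 6 | 15.38 |
| 16 | bjo | Banjo | 1 | 1 | 0 | 0 |
| 17 | vc | Violoncello | 1 | 2 | 0 | 0 |
| 18 | b | Bass | 61 | 113 | 0 | 0 |
| 19 | dr | Drums | 65 | 131 | 0 | 0 |
| 20 | perc | Percussion | 2 | 8 | 0 | 0 |
| | | | 1074 | 2467 | 456 | 33.75 |
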
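
The %Trans. column can be recomputed from #Solo and #Trans. as sketched below. The check also suggests how the last row is summarized: the printed 33.75 matches the mean of the 21 row percentages, whereas pooling all solos (456 of 1074) would give about 42.46. The variable names are illustrative.

```python
# (abbreviation, number of solos, number of transcribed solos) from Table 2.
TABLE2 = [
    ("cl", 23, 15), ("bcl", 4, 2), ("ss", 25, 23), ("as", 107, 80),
    ("ts", 245, 158), ("bs", 21, 11), ("tp", 170, 102), ("fln", 2, 0),
    ("cor", 18, 15), ("tb", 38, 26), ("p", 222, 6), ("key", 3, 0),
    ("vib", 15, 12), ("voc", 8, 0), ("fl", 4, 0), ("g", 39, 6),
    ("bjo", 1, 0), ("vc", 1, 0), ("b", 61, 0), ("dr", 65, 0), ("perc", 2, 0),
]

# Per-instrument share of transcribed solos in percent.
percent = {abbr: 100 * trans / solos for abbr, solos, trans in TABLE2}

total_solos = sum(solos for _, solos, _ in TABLE2)
total_trans = sum(trans for _, _, trans in TABLE2)

# Two possible summaries of the last column.
mean_of_rows = sum(percent.values()) / len(percent)  # average of the 21 row values
pooled = 100 * total_trans / total_solos             # all solos pooled together
```
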
Figure 6

(a) Accumulated duration of all solos (minutes) per instrument. The total duration of all solos is 1325 minutes. (b) Statistics on the durations of solo sections (seconds), broken down by instrument. The outlier “Impressions” by John Coltrane (containing a 13-minute-long saxophone solo) is not shown.

Table 3

Layer structure of the CNN-based approach.

| Layer Type | Size | Output Shape |
|---|---|---|
| InputLayer | | (116, 80, 1) |
| Batch Normalization | | (116, 80, 1) |
| Conv2D (ReLU) | 8×6 | (109, 75, 32) |
| MaxPooling2D | 3×6 | (36, 12, 32) |
| Conv2D (ReLU) | 6×3 | (31, 10, 64) |
| Flatten | | (19,840) |
| Dropout (50%) | | (19,840) |
| Dense (ReLU) | | (128) |
| Dropout (50%) | | (128) |
| Dense (Sigmoid) | | (1) |

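
The output shapes in Table 3 are consistent with convolutions that use no padding and a stride of 1, and with pooling whose stride equals the pool size. These settings are not stated in the table; they are assumptions inferred from the shapes. The filter counts (32 and 64) are read off the last dimension of the output shapes. A minimal sketch of the shape arithmetic:

```python
def conv2d_valid(shape, kernel, filters):
    """Output shape of a stride-1 convolution without padding."""
    h, w, _ = shape
    kh, kw = kernel
    return (h - kh + 1, w - kw + 1, filters)


def max_pool(shape, pool):
    """Output shape of non-overlapping max pooling (stride equals pool size)."""
    h, w, c = shape
    ph, pw = pool
    return (h // ph, w // pw, c)


shape = (116, 80, 1)                     # input (and batch normalization) shape
conv1 = conv2d_valid(shape, (8, 6), 32)  # first Conv2D row of Table 3
pool1 = max_pool(conv1, (3, 6))          # MaxPooling2D row
conv2 = conv2d_valid(pool1, (6, 3), 64)  # second Conv2D row
flat = conv2[0] * conv2[1] * conv2[2]    # size after Flatten
```
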
Table 4

Overview of the splits for the datasets SALAMI, JSD, and SALAMI+JSD. The numbers refer to recordings (with the corresponding percentage given in brackets).

| Dataset | Training Set | Val. Set | Test Set | Σ |
|---|---|---|---|---|
| SALAMI (S) | 772 (56.8%) | 100 (7.4%) | 487 (35.8%) | 1359 |
| JSD (J) | 244 (71.7%) | 28 (8.16%) | 68 (20.1%) | 340 |
| SALAMI+JSD (S+J) | 1016 (59.9%) | 128 (7.5%) | 555 (32.7%) | 1699 |

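
The totals and shares in Table 4 can be recomputed from the recording counts, as in the sketch below. Shares recomputed this way may differ slightly from the printed percentages. The dictionary layout is an illustrative choice.

```python
# Recording counts per split from Table 4: (training, validation, test).
splits = {
    "SALAMI": (772, 100, 487),
    "JSD": (244, 28, 68),
    "SALAMI+JSD": (1016, 128, 555),
}

for name, counts in splits.items():
    total = sum(counts)
    shares = [round(100 * c / total, 1) for c in counts]
    print(name, total, shares)

# The combined dataset is the element-wise sum of the two individual datasets.
combined = tuple(s + j for s, j in zip(splits["SALAMI"], splits["JSD"]))
```
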
Table 5

Evaluation results for boundary detection on the test sets of (a) SALAMI and (b) JSD. The precision, recall, and F-measure values shown are averaged over the tracks of the respective test set.

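(a) Evaluation results for SALAMI.

| Method | τ = 0.5 s: P0.5 | R0.5 | F0.5 | τ = 3.0 s: P3 | R3 | F3 |
|---|---|---|---|---|---|---|
| Ullrich (S, short) | 0.422 | 0.490 | 0.422 | | | |
| CNN (S, short) | 0.357 | 0.414 | 0.358 | 0.419 | 0.750 | 0.512 |
| CNN (S, long) | 0.234 | 0.223 | 0.213 | 0.563 | 0.672 | 0.580 |
| CNN (J, short) | 0.231 | 0.075 | 0.100 | 0.432 | 0.420 | 0.386 |
| CNN (J, long) | 0.136 | 0.049 | 0.066 | 0.494 | 0.233 | 0.287 |
| CNN (S+J, short) | 0.347 | 0.423 | 0.357 | 0.484 | 0.660 | 0.522 |
| CNN (S+J, long) | 0.242 | 0.226 | 0.221 | 0.508 | 0.729 | 0.571 |
| Foote (short) | 0.227 | 0.274 | 0.223 | 0.467 | 0.610 | 0.477 |
| Foote (long) | 0.199 | 0.167 | 0.169 | 0.534 | 0.466 | 0.463 |
| Baseline (equal) | 0.042 | 0.041 | 0.043 | 0.237 | 0.231 | 0.244 |

(b) Evaluation results for JSD.

| Method | τ = 0.5 s: P0.5 | R0.5 | F0.5 | τ = 3.0 s: P3 | R3 | F3 |
|---|---|---|---|---|---|---|
| CNN (S, short) | 0.186 | 0.230 | 0.189 | 0.297 | 0.610 | 0.382 |
| CNN (S, long) | 0.122 | 0.126 | 0.118 | 0.423 | 0.579 | 0.465 |
| CNN (J, short) | 0.303 | 0.125 | 0.165 | 0.428 | 0.556 | 0.452 |
| CNN (J, long) | 0.193 | 0.117 | 0.139 | 0.615 | 0.439 | 0.482 |
| CNN (S+J, short) | 0.242 | 0.269 | 0.232 | 0.409 | 0.531 | 0.428 |
| CNN (S+J, long) | 0.199 | 0.169 | 0.166 | 0.401 | 0.682 | 0.485 |
| Foote (short) | 0.186 | 0.247 | 0.192 | 0.436 | 0.601 | 0.454 |
| Foote (long) | 0.216 | 0.185 | 0.184 | 0.548 | 0.505 | 0.488 |
| Baseline (equal) | 0.051 | 0.051 | 0.051 | 0.225 | 0.225 | 0.225 |
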
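
To make the measures in Table 5 concrete, the following sketch computes precision, recall, and F-measure for a single track with a tolerance τ. It is a simplified stand-in, not the evaluation code behind the table: it matches sorted boundaries greedily with each boundary used at most once, whereas evaluation toolkits typically formulate this as a bipartite matching. The function name and the example times are made up for illustration.

```python
def boundary_prf(reference, estimated, tau):
    """Precision, recall, and F-measure for boundary detection.

    An estimated boundary counts as a hit if it lies within `tau` seconds of
    a reference boundary; each boundary is matched at most once.
    """
    ref, est = sorted(reference), sorted(estimated)
    i = j = hits = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tau:
            hits += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1  # estimate too early to match any remaining reference
        else:
            i += 1  # reference too early to match any remaining estimate
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f


# Made-up boundary times in seconds, not taken from the dataset.
reference = [0.0, 10.0, 42.0, 60.0]
estimated = [0.2, 12.0, 41.8, 50.0, 59.0]

p05, r05, f05 = boundary_prf(reference, estimated, tau=0.5)
p3, r3, f3 = boundary_prf(reference, estimated, tau=3.0)
```

Because the table reports per-track values averaged over a test set, an averaged F-measure need not equal the harmonic mean of the averaged precision and recall in the same row.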
Figure 7

Overview of the evaluation results for all recordings contained in the JSD’s test set. The link (red arrow) leads to the details page as depicted in Figure 8.

Figure 8

(a) Evaluation web page showing the output of all methods for the running example “Jordu” by Clifford Brown. (b) Evaluation results of Foote’s method with the input SSM based on MFCCs. (c) Evaluation results of the CNN-based approach, showing the novelty curves of five networks and the bagged novelty curve.
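
For readers unfamiliar with Foote’s method mentioned in (b), the sketch below shows the core idea: a checkerboard kernel is slid along the main diagonal of a self-similarity matrix (SSM), and peaks of the resulting novelty curve indicate boundary candidates. This is a bare-bones illustration under simplifying assumptions: it uses a plain box kernel (Foote’s kernel is usually tapered with a Gaussian), a toy binary SSM instead of MFCC-based similarities, and arbitrary function names. It does not reproduce the configuration used for the figure.

```python
def checkerboard_kernel(half):
    """Box-shaped checkerboard kernel of size 2*half x 2*half (no Gaussian taper)."""
    size = 2 * half
    return [
        [1.0 if (r < half) == (c < half) else -1.0 for c in range(size)]
        for r in range(size)
    ]


def novelty(ssm, half):
    """Correlate the kernel along the main diagonal of a square SSM."""
    n = len(ssm)
    kernel = checkerboard_kernel(half)
    curve = [0.0] * n
    for t in range(half, n - half + 1):
        total = 0.0
        for r in range(2 * half):
            for c in range(2 * half):
                total += kernel[r][c] * ssm[t - half + r][t - half + c]
        curve[t] = total
    return curve


# Toy SSM with two homogeneous blocks: frames 0-3 and frames 4-7.
block = [0, 0, 0, 0, 1, 1, 1, 1]
ssm = [[1.0 if a == b else 0.0 for b in block] for a in block]

curve = novelty(ssm, half=2)
peak = curve.index(max(curve))  # frame index where the second block starts
```
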

DOI: https://doi.org/10.5334/tismir.131 | Journal eISSN: 2514-3298
Language: English
Submitted on: Mar 6, 2022
Accepted on: Jun 28, 2022
Published on: Nov 7, 2022
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2022 Stefan Balke, Julian Reck, Christof Weiß, Jakob Abeßer, Meinard Müller, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.