
Four-way Classification of Tabla Strokes with Transfer Learning Using Western Drums

Open Access | Sep 2023

Figures & Tables

Figure 1

The tabla set – (left) bayan or dagga and (right) dayan or tabla.

Table 1

The four target tabla stroke categories, their acoustic characteristics, constituent stroke types (resonant/damped/no hit) on each drum (bass and treble) and some typical bols in each category.

| TABLA STROKE CATEGORY | ACOUSTIC CHARACTERISTICS | BASS | TREBLE | BOLS |
|---|---|---|---|---|
| Damped (D) | No sustained harmonics, burst of energy at onset | No hit | Damped | Ti, Ta, Te, Re, Da, Tak, Tra |
| | | Damped | No hit | Ke, Kat |
| | | Damped | Damped | Kda |
| Resonant Treble (RT) | Strong onset followed by sustained F0 (>150 Hz) and harmonics | No hit | Resonant | Na, Tin, Tun, Din |
| | | Damped | Resonant | Tin (Ke on bayan) |
| Resonant Bass (RB) | Weak onset burst followed by sustained F0 (~100 Hz) and few short-lived harmonics | Resonant | No hit | Ghe |
| | | Resonant | Damped | Dhe, Dhi, Dhet |
| Resonant Both (B) | Combined characteristics of resonant treble and bass | Resonant | Resonant | Dha, Dhin |
Table 2

Description of the train and test datasets for drums (Gillet and Richard, 2006; Southall et al., 2017b; Dittmar and Gärtner, 2014) and tabla (Rohit et al., 2021).

| INSTRUMENT | DATASET | INSTRUMENTS | DURATION |
|---|---|---|---|
| Tabla | Train/Val: Solo | 10 | 76 min. |
| Tabla | Test: Accompaniment | 3 | 20 min. |
| Drums | Train: ENST+MDB | 26 | 163 min. |
| Drums | Test: IDMT | 4 | 123 min. |
Table 3

The list of bols used to train the models for each of the atomic strokes. In bold are bols common to more than one atomic stroke.

| MODEL | BOLS USED FOR TRAINING |
|---|---|
| D | Ti, Ta, Te, Re, Da, Tak, Tra, Ke, Kat, Kda |
| RT-any | Na, Tin, Tun, Din, **Dha**, **Dhin** |
| RB-any | Ghe, Dhe, Dhi, Dhet, **Dha**, **Dhin** |
Figure 2

Schematic of the four-way classification system using three one-way CNN models to predict presence of atomic strokes (D, RT-any and RB-any) in a given audio frame. If both RT and RB onsets are detected then the onset is marked B. Any D that co-occurs with RT or RB is ignored.
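The combination rule described in the caption can be sketched as a small post-processing step. The following is an illustrative reimplementation (function and variable names are ours, not the paper's), assuming each one-way model emits onset times in seconds and using a tolerance window like those swept in Table 5:

```python
def merge_onsets(rt_onsets, rb_onsets, d_onsets, tol=0.040):
    """Combine per-model onset times (seconds) into four-way labels.

    RT-any and RB-any onsets within `tol` of each other become a single
    'B' (resonant both) onset; unmatched ones remain 'RT'/'RB'. Any 'D'
    onset that co-occurs with an RT or RB onset is ignored.
    """
    events = []
    rb_left = list(rb_onsets)
    for t in rt_onsets:
        match = next((u for u in rb_left if abs(u - t) <= tol), None)
        if match is not None:
            rb_left.remove(match)
            events.append(((t + match) / 2, "B"))  # simultaneous RT+RB -> B
        else:
            events.append((t, "RT"))
    events += [(u, "RB") for u in rb_left]
    resonant_times = [t for t, _ in events]
    for t in d_onsets:
        # Keep D only if it does not coincide with a resonant onset.
        if all(abs(t - u) > tol for u in resonant_times):
            events.append((t, "D"))
    return sorted(events)
```

The 40 ms default mirrors one of the better-performing tolerance windows in Table 5.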

Figure 3

General CNN model architecture for all experiments.

Table 4

Hyperparameter values for the CNN model architecture (of Figure 3) used for each stroke category.

| MODEL | HYPERPARAMETERS |
|---|---|
| D | N1 = 16, N2 = 32, N3 = 256 |
| RT-any | N1 = 32, N2 = 64, N3 = 128 |
| RB-any | N1 = 16, N2 = 32, N3 = 128 |
Figure 4

The transfer learning approach where the three one-way CNNs are first pretrained to predict single drum onsets and then fine-tuned on corresponding tabla stroke category data. Model input is a small portion of a track’s spectrogram and the output is the onset prediction for the center frame. Figure inspired by Vogl et al. (2017). The final spectrogram pairs demonstrate the acoustic similarity across the mapped source and target classes.
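The pretrain-then-fine-tune idea can be illustrated with a deliberately tiny stand-in: a NumPy logistic-regression "model" pretrained on a plentiful source task and then fine-tuned on a small, related target task. This is synthetic data and not the paper's CNN or datasets; it only demonstrates the weight-transfer mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60, 60)))

def train(X, y, w=None, lr=0.5, steps=200):
    """Logistic regression as a stand-in onset 'model'.
    Passing pretrained weights `w` turns this into a fine-tuning run."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w = w - lr * grad
    return w

# "Source" task with plenty of data (stand-in for drum onsets).
Xs = rng.normal(size=(500, 20))
ys = (Xs[:, 0] > 0).astype(float)

# Related "target" task with scarce data (stand-in for tabla strokes).
Xt = rng.normal(size=(40, 20))
yt = (Xt[:, 0] + 0.1 * Xt[:, 1] > 0).astype(float)

w_pre = train(Xs, ys)                            # pretraining
w_ft = train(Xt, yt, w=w_pre.copy(), steps=20)   # fine-tuning
w_scratch = train(Xt, yt, steps=20)              # no transfer, for contrast

def accuracy(w):
    return float(((sigmoid(Xt @ w) > 0.5) == (yt > 0.5)).mean())
```

Because the pretrained weights already encode a related decision boundary, the fine-tuned model needs far fewer target-domain updates than the model trained from scratch, which is the premise behind mapping acoustically similar drum classes to tabla stroke categories.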

Figure 5

Distribution of drum and tabla strokes in the training and test datasets across the various drum and tabla stroke categories.

Figure 6

Spectrograms of resonant both tabla strokes showing the common F0 modulation types for the constituent resonant bass stroke: (a) No modulation, (b) Up modulation, and (c) Down modulation (ignore accompanying resonant treble harmonics and bursts of energy between 0.45–0.75 seconds in (a) and (c) from onsets of subsequent strokes on the treble drum).

Figure 7

The five cluster centroids resulting from the k-means model fitted on resonant bass F0 contours of the tabla training dataset. The legend shows the fraction of data points assigned to each cluster.
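Fitting k-means to F0 contours can be sketched as follows, using a hand-rolled Lloyd's iteration on synthetic contours shaped like the no/up/down modulation types of Figure 6. The paper presumably uses a standard k-means implementation and five clusters; this three-cluster toy version, with our own function names and seeded initialization, is only for illustration:

```python
import numpy as np

def kmeans(X, k, iters=20, init=None):
    """Minimal Lloyd's k-means on rows of X; returns (centroids, labels)."""
    X = np.asarray(X, dtype=float)
    if init is None:
        init = np.arange(k)              # naive default: first k rows
    centroids = X[list(init)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Euclidean distance of every contour to every centroid, then reassign.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Synthetic 20-point F0 contours: no, upward, and downward modulation.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 20)
shapes = [np.full_like(t, 100.0), 95.0 + 10.0 * t, 105.0 - 10.0 * t]
X = np.vstack([s + rng.normal(scale=0.5, size=t.size)
               for s in shapes for _ in range(10)])

# Seed one centroid in each group so the toy example converges cleanly.
centroids, labels = kmeans(X, 3, init=[0, 10, 20])
```

The resulting centroids are the cluster-average contours, analogous to the five centroid curves plotted in Figure 7.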

Table 5

Mean CV F-score of RT, RB and B classification using different tolerance windows for combining simultaneous RT-any and RB-any onsets to B. The models were pretrained on unmodified drum data and fine-tuned on tabla data.

| Tolerance (ms) | 10 | 20 | 40 | 80 | 160 |
|---|---|---|---|---|---|
| Mean F-score (RT, RB, B) | 69.0 | 73.3 | 73.8 | 73.9 | 67.0 |
Table 6

CV f-scores comparing stroke classification performance of the three atomic strokes (D, RT-any and RB-any) with the differently trained models of this work. ‘Untrained’ represents a model with random weights. Indentation in ‘Method’ column represents nested experiments. Values in bold are the highest in each column. An asterisk represents a significant difference (p < 0.001) between the f-scores of the best fine-tuned (selected from rows 3–3g) and the corresponding retrained model (of row 4).

| # | METHOD | D | RT-ANY | RB-ANY |
|---|---|---|---|---|
| 1 | Untrained | 24.0 | 19.0 | 15.6 |
| 2 | Drum-pretrained | 55.6 | 37.1 | 28.8 |
| 3 | Fine-tuned | 88.3 | 92.9 | 87.7 |
| 3a | &nbsp;&nbsp;+ HH only | 88.3 | | |
| 3b | &nbsp;&nbsp;&nbsp;&nbsp;+ RF aug | **88.5** | | |
| 3c | &nbsp;&nbsp;+ SD-KD data repeat | | **93.1***| 88.1 |
| 3d | &nbsp;&nbsp;&nbsp;&nbsp;+ PS aug | | 93.0 | **88.8** |
| 3e | &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;+ Bass mod aug | | | 87.4 |
| 3f | &nbsp;&nbsp;&nbsp;&nbsp;+ RS aug | | 92.7 | 87.2 |
| 3g | &nbsp;&nbsp;&nbsp;&nbsp;+ PS, RS aug | | 92.4 | 86.8 |
| 4 | Retrained | 88.2 | 91.9 | 88.3 |
| 4a | &nbsp;&nbsp;+ Bass mod aug | | | 88.2 |
Figure 8

Automatic transcription for the RB-any in a short tabla solo segment. Top: model output activations versus time (horizontal dashed line is the peak-picking threshold); bottom: estimated onset locations (solid blue) and ground truth (dashed red). From left to right – untrained, drum-pretrained, tabla fine-tuned and tabla retrained models.
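The peak-picking step in the caption (thresholding the activation curve, then taking well-separated local maxima) can be sketched as below. The threshold, hop size, and minimum inter-onset gap are illustrative values of ours, not the paper's:

```python
def pick_peaks(activations, threshold=0.3, hop=0.01, min_gap=0.05):
    """Local maxima above `threshold`, at least `min_gap` seconds apart.
    `hop` is the frame hop in seconds; returns onset times in seconds."""
    onsets = []
    last = float("-inf")
    for i in range(1, len(activations) - 1):
        a = activations[i]
        if (a >= threshold and a >= activations[i - 1]
                and a > activations[i + 1]):
            t = i * hop
            if t - last >= min_gap:   # suppress peaks too close together
                onsets.append(t)
                last = t
    return onsets

# Two clear activation peaks at frames 10 and 50 (0.10 s and 0.50 s).
act = [0.0] * 100
act[10], act[50] = 0.9, 0.8
```

Applied to a model's frame-wise activations, this yields the estimated onset locations plotted in the bottom panels of Figure 8.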

Table 7

ADT performance (test f-scores) on the IDMT set, comparing the state of the art from the ADT literature against our models trained with different drum augmentation methods.

| METHOD | HH | SD | KD | MEAN |
|---|---|---|---|---|
| Southall et al. 2017 (CNN) | | | | 83.1 |
| Our models | 82.7 | 85.3 | 90.2 | 86.1 |
| &nbsp;&nbsp;+ HH only | 80.2 | | | |
| &nbsp;&nbsp;&nbsp;&nbsp;+ RF aug | 84.5 | | | |
| &nbsp;&nbsp;+ SD-KD data repeat | | 85.5 | 93.9 | |
| &nbsp;&nbsp;&nbsp;&nbsp;+ PS aug | | 80.2 | 93.7 | |
| &nbsp;&nbsp;&nbsp;&nbsp;+ RS aug | | 82.7 | 90.3 | |
| &nbsp;&nbsp;&nbsp;&nbsp;+ PS, RS aug | | 81.1 | 90.6 | |
Table 8

Four-way CV/test classification f-scores for the transfer learning and retraining methods. The best set of D, RT-any, RB-any models combines the individual highest scoring D, RT-any and RB-any methods from the fine-tuned models of Table 6 (i.e. the values in bold in Table 6). Overall f-score is across the four categories. As expected from our understanding of transfer learning, all values in row 2 (fine-tuned model) are significantly higher than those in row 1 (p < 0.001). Values in bold are highest across rows 2 and 3. The asterisk marks the only significant difference (p = 0.002 in this case) between f-scores of corresponding fine-tuned and retrained models of this work (i.e. rows 2 and 3).

| # | METHOD | D | RT | RB | B | OVERALL |
|---|---|---|---|---|---|---|
| 1 | Drum-pretrained | 57.8/44.0 | 40.9/48.7 | 16.2/17.5 | 3.6/2.1 | 44.8/36.2 |
| 2 | Best set of D, RT-any, RB-any models (Table 6) | 89.2/83.0 | 86.1*/86.0 | 73.2/63.6 | 89.2/81.5 | 86.7/81.2 |
| 3 | Retrained | 89.2/83.6 | 84.3/86.6 | 73.7/66.9 | 89.0/82.7 | 86.3/82.1 |
| 4 | Retrained Rohit et al. (2021) | 88.2/83.8 | 83.7/84.6 | 71.2/34.0 | 87.9/82.0 | 85.5/79.5 |
DOI: https://doi.org/10.5334/tismir.150 | Journal eISSN: 2514-3298
Language: English
Submitted on: Sep 18, 2022
Accepted on: Jun 24, 2023
Published on: Sep 20, 2023
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2023 Rohit M. Ananthanarayana, Amitrajit Bhattacharjee, Preeti Rao, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.