Four-way Classification of Tabla Strokes with Transfer Learning Using Western Drums

Rohit M. Ananthanarayana; Amitrajit Bhattacharjee; Preeti Rao

doi:10.5334/tismir.150

Figures & Tables

Figure 1

The tabla set – (left) bayan or dagga and (right) dayan or tabla.

Table 1

The four target tabla stroke categories, their acoustic characteristics, constituent stroke types (resonant/damped/no hit) on each drum (bass and treble) and some typical bols in each category.

TABLA STROKE CATEGORY	ACOUSTIC CHARACTERISTICS	BASS STROKE TYPE	TREBLE STROKE TYPE	BOLS
Damped (D)	No sustained harmonics, burst of energy at onset	No hit	Damped	Ti, Ta, Te, Re, Da, Tak, Tra
		Damped	No hit	Ke, Kat
		Damped	Damped	Kda
Resonant Treble (RT)	Strong onset followed by sustained F0 (>150 Hz) and harmonics	No hit	Resonant	Na, Tin, Tun, Din
		Damped	Resonant	Tin (Ke on bayan)
Resonant Bass (RB)	Weak onset burst followed by sustained F0 (~100 Hz) and few short-lived harmonics	Resonant	No hit	Ghe
		Resonant	Damped	Dhe, Dhi, Dhet
Resonant Both (B)	Combined characteristics of resonant treble and bass	Resonant	Resonant	Dha, Dhin

Table 2

Description of the train and test datasets for drums (Gillet and Richard, 2006; Southall et al., 2017b; Dittmar and Gartner, 2014) and tabla (Rohit et al., 2021).

INSTRUMENT	DATASETS	INSTRUMENTS	DURATION
Tabla	Train/Val: Solo	10	76 min.
Tabla	Test: Accompaniment	3	20 min.
Drums	Train: ENST+MDB	26	163 min.
Drums	Test: IDMT	4	123 min.

Table 3

The list of bols used to train the models for each of the atomic strokes. Bols common to more than one atomic stroke (here Dha and Dhin) are set in bold in the original table.

MODEL	BOLS USED FOR TRAINING
D	Ti, Ta, Te, Re, Da, Tak, Tra, Ke, Kat, Kda
RT-any	Na, Tin, Tun, Din, Dha, Dhin
RB-any	Ghe, Dhe, Dhi, Dhet, Dha, Dhin

Figure 2

Schematic of the four-way classification system using three one-way CNN models to predict the presence of atomic strokes (D, RT-any and RB-any) in a given audio frame. If both RT and RB onsets are detected, the onset is marked B. Any D that co-occurs with RT or RB is ignored.
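The combination rule in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `combine_onsets`, the use of onset times in seconds, the midpoint timestamp for a merged B onset, and the reuse of one tolerance `tol` for both the RT/RB merge and the D suppression are all assumptions. The default of 40 ms is simply one of the tolerance values listed in Table 5.

```python
def combine_onsets(d, rt_any, rb_any, tol=0.04):
    """Merge three one-way onset lists (in seconds) into four-way labels.

    Hypothetical helper: tol, the midpoint timestamp for B, and the greedy
    nearest-neighbour pairing are illustrative assumptions.
    """
    rt_any, rb_any = sorted(rt_any), sorted(rb_any)
    labelled = []
    used_rb = set()
    for t in rt_any:
        # Find the nearest unused RB-any onset within the tolerance window.
        match = None
        for j, u in enumerate(rb_any):
            if j in used_rb or abs(u - t) > tol:
                continue
            if match is None or abs(u - t) < abs(rb_any[match] - t):
                match = j
        if match is None:
            labelled.append((t, "RT"))
        else:
            # Simultaneous RT-any and RB-any onsets are marked B.
            used_rb.add(match)
            labelled.append(((t + rb_any[match]) / 2, "B"))
    # RB-any onsets without a simultaneous RT-any onset stay RB.
    labelled += [(u, "RB") for j, u in enumerate(rb_any) if j not in used_rb]
    # Any D that co-occurs with a resonant onset is ignored.
    resonant_times = [t for t, _ in labelled]
    for t in sorted(d):
        if all(abs(t - r) > tol for r in resonant_times):
            labelled.append((t, "D"))
    return sorted(labelled)


example = combine_onsets(d=[0.10, 0.50, 0.90],
                         rt_any=[0.10, 0.30],
                         rb_any=[0.11, 0.70])
print([label for _, label in example])
```

In this toy input, the D onset at 0.10 s is dropped because it coincides with the merged B onset, while the D onsets at 0.50 s and 0.90 s survive.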

Figure 3

General CNN model architecture for all experiments.

Table 4

Hyperparameter values for the CNN model architecture (of Figure 3) used for each stroke category.

MODEL	HYPERPARAMETERS
D	N₁=16, N₂=32, N₃=256
RT-any	N₁=32, N₂=64, N₃=128
RB-any	N₁=16, N₂=32, N₃=128

Figure 4

The transfer learning approach where the three one-way CNNs are first pretrained to predict single drum onsets and then fine-tuned on corresponding tabla stroke category data. Model input is a small portion of a track’s spectrogram and the output is the onset prediction for the center frame. Figure inspired by Vogl et al. (2017). The final spectrogram pairs demonstrate the acoustic similarity across the mapped source and target classes.

Figure 5

Distribution of drum and tabla strokes in the training and test datasets across the various drum and tabla stroke categories.

Figure 6

Spectrograms of resonant both tabla strokes showing the common F0 modulation types for the constituent resonant bass stroke: (a) No modulation, (b) Up modulation, and (c) Down modulation (ignore accompanying resonant treble harmonics and bursts of energy between 0.45–0.75 seconds in (a) and (c) from onsets of subsequent strokes on the treble drum).

Figure 7

The five cluster centroids resulting from the k-means model fitted on resonant bass F0 contours of the tabla training dataset. The legend shows the fraction of data points assigned to each cluster.
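As a rough illustration of how centroids and cluster fractions like these could be obtained, the sketch below runs plain Lloyd's k-means on fixed-length F0 contours. Everything specific here is an assumption: the function name `kmeans_contours`, the random initialisation, the iteration count, and the synthetic toy contours (which are not data from the paper). The source only states that a k-means model with five clusters was fitted.

```python
import random


def kmeans_contours(contours, k, iters=50, seed=0):
    """Plain Lloyd's k-means on equal-length F0 contours (lists of Hz values).

    Hypothetical helper; returns the centroids and the fraction of contours
    assigned to each cluster.
    """
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(contours, k)]

    def nearest(p):
        # Index of the centroid with the smallest squared Euclidean distance.
        return min(range(k),
                   key=lambda c: sum((x - y) ** 2
                                     for x, y in zip(p, centroids[c])))

    for _ in range(iters):
        assign = [nearest(p) for p in contours]
        for c in range(k):
            members = [p for p, a in zip(contours, assign) if a == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    assign = [nearest(p) for p in contours]
    fractions = [assign.count(c) / len(contours) for c in range(k)]
    return centroids, fractions


# Synthetic toy contours (flat, rising, falling); NOT data from the paper.
toy = ([[100.0, 100.0, 100.0, 100.0]] * 4
       + [[100.0, 105.0, 110.0, 115.0]] * 3
       + [[110.0, 105.0, 100.0, 95.0]] * 3)
centroids, fractions = kmeans_contours(toy, k=3)
print(fractions)
```

The returned fractions play the role of the legend values in the figure: each is the share of contours assigned to one centroid.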

Table 5

Mean CV F-score of RT, RB and B classification using different tolerance windows for combining simultaneous RT-any and RB-any onsets to B. The models were pretrained on unmodified drum data and fine-tuned on tabla data.

Tolerance (ms)	10	20	40	80	160
Mean F-score (RT, RB, B)	69.0	73.3	73.8	73.9	67.0

Table 6

CV f-scores comparing stroke classification performance of the three atomic strokes (D, RT-any and RB-any) with the differently trained models of this work. ‘Untrained’ represents a model with random weights. Indentation in ‘Method’ column represents nested experiments. Values in bold are the highest in each column. An asterisk represents a significant difference (p < 0.001) between the f-scores of the best fine-tuned (selected from rows 3–3g) and the corresponding retrained model (of row 4).

	METHOD	D	RT-ANY	RB-ANY
1	Untrained	24.0	19.0	15.6
2	Drum-pretrained	55.6	37.1	28.8
3	Fine-tuned	88.3	92.9	87.7
3a	+ HH only	88.3	–	–
3b	+ RF aug	88.5	–	–
3c	+ SD-KD data repeat	–	93.1*	88.1
3d	+ PS aug	–	93.0	88.8
3e	+ Bass mod aug	–	–	87.4
3f	+ RS aug	–	92.7	87.2
3g	+ PS, RS aug	–	92.4	86.8
4	Retrained	88.2	91.9	88.3
4a	+ Bass mod aug	–	–	88.2
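For readers unfamiliar with onset-based F-scores, the following sketch shows one common way such a score is computed: estimated onsets are matched one-to-one to reference onsets within a tolerance, and the F-score is the harmonic mean of precision and recall. The function name `onset_f_score`, the greedy matching, and the 50 ms evaluation tolerance are assumptions for illustration. The source does not state these details here, and this tolerance is distinct from the RT/RB merging tolerance of Table 5.

```python
def onset_f_score(estimated, reference, tol=0.05):
    """F-score of estimated vs. reference onset times (seconds).

    Hypothetical helper: greedy one-to-one matching of sorted onsets within
    +/- tol; the tolerance value is an assumption.
    """
    est, ref = sorted(estimated), sorted(reference)
    i = j = true_pos = 0
    while i < len(est) and j < len(ref):
        if abs(est[i] - ref[j]) <= tol:
            true_pos += 1  # matched pair: consume both onsets
            i += 1
            j += 1
        elif est[i] < ref[j]:
            i += 1  # unmatched estimate: false positive
        else:
            j += 1  # unmatched reference: false negative
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(est)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


score = onset_f_score([0.10, 0.52, 0.90, 1.30], [0.11, 0.50, 1.00])
print(round(100 * score, 1))  # the tables report F-scores as percentages
```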

Figure 8

Automatic transcription of the RB-any stroke in a short tabla solo segment. Top: model output activations versus time (horizontal dashed line is the peak-picking threshold); bottom: estimated onset locations (solid blue) and ground truth (dashed red). From left to right: untrained, drum-pretrained, tabla fine-tuned and tabla retrained models.
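The step from activations to onset locations shown in this figure can be sketched with a simple thresholded peak picker. This is an assumption-laden illustration: the function name `pick_onsets`, the local-maximum rule, and the frame rate of 100 frames per second are hypothetical, and the authors' actual peak-picking procedure may differ (for example, it may include smoothing or a minimum inter-onset distance).

```python
def pick_onsets(activations, threshold, frame_rate):
    """Return onset times (seconds) at local maxima at or above threshold.

    Hypothetical helper: a frame is a peak if it is strictly greater than
    the previous frame and not smaller than the next one, so a flat-topped
    peak yields a single onset at its first frame.
    """
    onsets = []
    for i in range(1, len(activations) - 1):
        a = activations[i]
        if (a >= threshold
                and a > activations[i - 1]
                and a >= activations[i + 1]):
            onsets.append(i / frame_rate)
    return onsets


# Toy activation curve; the values and frame rate are made up.
acts = [0.0, 0.2, 0.9, 0.3, 0.1, 0.4, 0.35, 0.8, 0.8, 0.1]
print(pick_onsets(acts, threshold=0.5, frame_rate=100))
```

Raising the threshold trades recall for precision, which is why the dashed threshold line in the figure matters when comparing the four models.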

Table 7

ADT performance (test f-scores) on the IDMT set, comparing the SOTA from the ADT literature against our models trained using different drum augmentation methods.

METHOD	HH	SD	KD	MEAN
Southall et al. 2017 (CNN)	–	–	–	83.1
Our models	82.7	85.3	90.2	86.1
+ HH only	80.2	–	–	–
+ RF aug	84.5	–	–	–
+ SD-KD data repeat	–	85.5	93.9	–
+ PS aug	–	80.2	93.7	–
+ RS aug	–	82.7	90.3	–
+ PS, RS aug	–	81.1	90.6	–

Table 8

Four-way CV/test classification f-scores for the transfer learning and retraining methods. The best set of D, RT-any, RB-any models combines the individual highest scoring D, RT-any and RB-any methods from the fine-tuned models of Table 6 (i.e. the values in bold in Table 6). Overall f-score is across the four categories. As expected from our understanding of transfer learning, all values in row 2 (fine-tuned model) are significantly higher than those in row 1 (p < 0.001). Values in bold are highest across rows 2 and 3. The asterisk marks the only significant difference (p = 0.002 in this case) between f-scores of corresponding fine-tuned and retrained models of this work (i.e. rows 2 and 3).

	METHOD	D	RT	RB	B	OVERALL
1	Drum-pretrained	57.8/44.0	40.9/48.7	16.2/17.5	3.6/2.1	44.8/36.2
2	Best set of D, RT-any, RB-any models (Table 6)	89.2/83.0	86.1*/86.0	73.2/63.6	89.2/81.5	86.7/81.2
3	Retrained	89.2/83.6	84.3/86.6	73.7/66.9	89.0/82.7	86.3/82.1
4	Retrained Rohit et al. (2021)	88.2/83.8	83.7/84.6	71.2/34.0	87.9/82.0	85.5/79.5
