
Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss

Open Access | Jul 2025

Figures & Tables

Figure 1

Proposed system for multimodal classification from pose and audio time series extracted from 12‑s video examples. We show here unimodal classification from video (A) and audio (B), including gradient reversal (GR). Blocks D1 and D2 are auxiliary blocks used in GR. GR is discussed in Section 3.3.2. (C) denotes the multimodal classification experiments, which are discussed in detail in Section 3.4.

Table 1

Summary statistics for our dataset.

Number of singers | 11 (5M, 6F)
Number of ragas | 9
Number of alap recordings | 199
Total recording time (mins) | 609
Average time per alap (mins) | 03:18
Table 2

The pitch sets employed by the nine ragas.

Raga | Scale
Bageshree (Bag) | S R g m P D n
Bahar | S R g m P D n N
Bilaskhani Todi (Bilas) | S r g m P d n
Jaunpuri (Jaun) | S R g m P d n
Kedar | S R G m M P D N
Marwa | S r G M D N
Miyan ki Malhar (MM) | S R g m P D n N
Nand | S R G m M P D N
Shree | S r G M P d N

[i] Lower‑case letters refer to the lower (flatter) alternative; upper‑case letters refer to the higher (sharper) pitch in each case (Clayton et al., 2022).

Figure 2

Schematic representation of unseen singer–raga Split 1. All 12-s snips belonging to the green singer–raga combinations are in the training set, and those in blue are in the validation set.

Table 3

Count of 12‑s segments for train and validation for the three splits.

Split | Total | Train | Val | Train % | Val %
Split 1 | 18273 | 14170 | 4103 | 77.5 | 22.4
Split 2 | 18273 | 14179 | 4094 | 77.5 | 22.4
Split 3 | 18253 | 14269 | 4004 | 78.1 | 21.9
Figure 3

Architecture for unimodal classification (without gradient reversal (GR)).

Figure 4

Detailed structure of the inception block of Figure 3. ‘k Conv (n), S’ indicates a convolution layer with n k‑sized kernels and a stride S. ‘p’ indicates the pooling size of the pool layer and ‘P’ indicates the type of pooling. k, p, P, and the number of filters were determined by hyperparameter tuning. S = 1 for the video and S = 2 for the audio model. The inception block is similar to that used in the work of Clayton et al. (2022).
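To make the block concrete, the following is a minimal PyTorch sketch of an inception-style 1-D block: parallel convolutions with different kernel sizes plus a pooling branch, concatenated along the channel axis. The kernel sizes, filter count, pooling size, and stride stand in for the tuned hyperparameters k, n, p/P, and S; the exact branch layout of Figure 4 is not reproduced here.

```python
import torch
import torch.nn as nn

class InceptionBlock1D(nn.Module):
    """Sketch of an inception-style block for time series: parallel 1-D
    convolutions with different kernel sizes plus a pooling branch, whose
    outputs are concatenated along the channel dimension. Kernel sizes,
    the number of filters, pooling size, and stride are hyperparameters
    (cf. Figure 4); this is an illustrative layout, not the paper's code."""

    def __init__(self, in_channels, n_filters=32, kernel_sizes=(3, 5, 7),
                 pool_size=3, stride=1):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(in_channels, n_filters, k, stride=stride, padding=k // 2)
            for k in kernel_sizes
        ])
        # Pooling branch: max pool followed by a 1x1 conv to set the channel count.
        self.pool_branch = nn.Sequential(
            nn.MaxPool1d(pool_size, stride=stride, padding=pool_size // 2),
            nn.Conv1d(in_channels, n_filters, 1),
        )
        self.act = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, time)
        outs = [branch(x) for branch in self.branches] + [self.pool_branch(x)]
        return self.act(torch.cat(outs, dim=1))
```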

Figure 5

Gradient reversal (GR) schematic diagram, shown here with respect to gesture features; GR can equally be applied to audio features. The D1 block here corresponds to the D1 block in Figure 1. All layers are trainable.
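For reference, a minimal PyTorch sketch of the standard gradient-reversal trick (identity in the forward pass, negated and scaled gradient in the backward pass) attached to an auxiliary singer classifier. The names encoder, raga_head, and singer_head and the weighting lam are illustrative assumptions, not the authors' code.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies the gradient by -lam in the
    backward pass, so the shared feature extractor is pushed to remove
    singer information while the auxiliary head still tries to predict it."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage sketch (illustrative names):
# feats = encoder(gesture_input)                               # shared features
# raga_logits = raga_head(feats)                               # main task
# singer_logits = singer_head(grad_reverse(feats, lam=1.0))    # adversarial arm
# loss = ce(raga_logits, raga_y) + ce(singer_logits, singer_y)
```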

Figure 6

Proposed framework to understand the different options available in multimodal fusion.

Table 4

A framework to study the body of existing multimodal fusion techniques.

Parameter | Options | Comments | Used in
Place of fusion | Source fusion | Fuse input features | Chen et al. (2014); Clayton et al. (2022); Gavahi et al. (2023)
Place of fusion | Latent fusion | Fuse hidden layers of network | Tang et al. (2022); Clayton et al. (2022); Jin et al. (2020)
Place of fusion | Decision fusion | Combine predictions of models | Clayton et al. (2022); Nemati et al. (2019)
Operation of fusion | Concatenation | Need to have compatible dimensions | Rajinikanth et al. (2020)
Operation of fusion | Element-wise addition | | Raza et al. (2020)
Operation of fusion | Depthwise stacking | | Chu (2024)
Include multimodal loss | No | Do not include multimodal loss | Zhou et al. (2020)
Include multimodal loss | Yes | Choices for multimodal loss discussed later |
Architecture of fusion layer | CNN-based | CNN | Rajinikanth et al. (2020)
Architecture of fusion layer | Attention-based | Attention across fused layers | Praveen et al. (2022)
Architecture of fusion layer | Transformer-based | Transformer-based models | Gong et al. (2022)
Training schedule | Frozen unimodal layers | | Gammulle et al. (2021)
Training schedule | Trainable unimodal layers | | Li et al. (2023a)
Training schedule | Parts of a network | Loss is used to update weights of network parts | Yang et al. (2022)
Training schedule | Alternate training epochs | Different epochs update different network parts | Gong et al. (2022)
Multimodal loss options: type of loss | Unsupervised | Multimodal loss does not use label information | Yang et al. (2022)
Multimodal loss options: type of loss | Supervised | | Franceschini et al. (2022)
Multimodal loss options: samples used in loss | Matched positive examples | | Yang et al. (2022)
Multimodal loss options: samples used in loss | All samples in batch | | Franceschini et al. (2022); Li et al. (2023a)
Multimodal loss options: samples used in loss | Additional negative samples | | Oramas et al. (2018); Puttagunta et al. (2023)
Multimodal loss options: layers used in loss | Fusion layer | | Franceschini et al. (2022); Mai et al. (2022)
Multimodal loss options: layers used in loss | Layer prior to fusion layer | | Yang et al. (2021)
Multimodal loss options: layers used in loss | Unimodal output softmax layer | | Yang et al. (2022)
Multimodal loss options: layers used in loss | Multiple layers of network | | Fan et al. (2016); Wang et al. (2023)

[i] CNN: Convolutional Neural Network
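As a concrete illustration of the "operation of fusion" options in Table 4, a minimal PyTorch sketch of the three operations applied to two modality embeddings (the dimensions are illustrative, not the paper's layer sizes):

```python
import torch

# Illustrative embeddings for one batch; sizes are assumptions, not the
# paper's actual layer dimensions.
audio = torch.randn(8, 64)   # (batch, embed_dim)
video = torch.randn(8, 64)

# Concatenation: all dimensions other than the fused axis must match.
fused_concat = torch.cat([audio, video], dim=1)    # (8, 128)

# Element-wise addition: both embeddings must have identical shape.
fused_add = audio + video                          # (8, 64)

# Depthwise stacking: keep the modalities as separate channels, e.g. as
# input to a convolutional fusion layer operating across modalities.
fused_stack = torch.stack([audio, video], dim=1)   # (8, 2, 64)
```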

Figure 7

General structure of multimodal loss–based fusion. The blocks in grey are taken from the best models of the individual modalities and are kept frozen or trainable in the various experiments; if trainable, they are initialized from the weights of the best unimodal models. The blue boxes mark the layers with which we compute the cross-entropy losses with respect to the ground truth (shown in green). The final predicted output, with respect to which we report accuracies, corresponds to the multimodal softmax block. Other hidden layers of the architecture are shown in white.
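A minimal sketch of how the losses in Figure 7 could be combined, assuming equal weighting of the cross-entropy terms and a generic multimodal loss term; the exact set of terms and their weights are a modelling choice and are not taken from the paper:

```python
import torch.nn.functional as F

def multimodal_total_loss(mm_logits, audio_logits, video_logits,
                          audio_emb, video_emb, y, mm_loss_fn, alpha=1.0):
    """Cross-entropy on the multimodal softmax plus cross-entropies on the
    unimodal branches (the blue boxes), plus a multimodal loss (e.g. a
    contrastive term) computed between the two modality embeddings.
    Equal weighting and the scalar alpha are illustrative assumptions."""
    ce = (F.cross_entropy(mm_logits, y)
          + F.cross_entropy(audio_logits, y)
          + F.cross_entropy(video_logits, y))
    return ce + alpha * mm_loss_fn(audio_emb, video_emb)
```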

Table 5

Hyperparameter configurations for different models.

Used in Model | Parameter | Values
Audio and video unimodal, source fusion, latent fusion | Temporal resolution | 10, 20 ms
 | No. of conv layers | 1, 2
 | No. of conv filters | 4, 8, 16, 32, 64, 128
 | Kernel size | 3, 5, 7
 | Num inception blocks | 1
 | Inception_filters | 4, 8, 16, 32, 64, 128
 | Regularization (L2) weight | 0–1e-4
 | Dropout rate | 0–0.5
 | Learning rate | 0.01, 0.001, 1e-4
Multimodal decision fusion | Support vector machine: regularizer weight | 0.01, 0.1, 1, 10, 100
 | Support vector machine: kernel | RBF
Models with multimodal loss | Common embed dimension | 2–128
 | Temperature (only for BPCL) | 0–1
 | Regularization (L2) weight | 0–1e-4
 | Dropout rate | 0–0.5
 | Learning rate | 0.01, 0.001, 1e-4

[i] BPCL: Batch Pairwise Contrastive Loss
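For convenience, the unimodal search space in Table 5 can be written as a plain grid for a hyperparameter tuner; this is an illustrative encoding, not the authors' tuning code:

```python
# Illustrative encoding of the unimodal / source-fusion / latent-fusion
# search space from Table 5; continuous ranges are written as (low, high).
unimodal_search_space = {
    "temporal_resolution_ms": [10, 20],
    "num_conv_layers": [1, 2],
    "num_conv_filters": [4, 8, 16, 32, 64, 128],
    "kernel_size": [3, 5, 7],
    "num_inception_blocks": [1],
    "inception_filters": [4, 8, 16, 32, 64, 128],
    "l2_weight": (0.0, 1e-4),
    "dropout_rate": (0.0, 0.5),
    "learning_rate": [0.01, 0.001, 1e-4],
}
```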

Table 6

Unimodal singer classification accuracy (%) for validation data using various features.

Modality | Feature | Split 1 | Split 2 | Split 3 | Mean
Audio | F0 + VM | 48.1 | 45.9 | 52.3 | 48.8
Video | VA-W | 88.6 | 89.6 | 89.2 | 89.1
Video | VA-WE | 91.8 | 92.7 | 93.2 | 92.5
Video | PVA-W | 95.9 | 96.0 | 96.1 | 96.0
Video | PVA-WE | 97.8 | 97.3 | 97.4 | 97.5

[i] VM: Voicing Mask, P: Position, V: Velocity, A: Acceleration, W: Wrist, E: Elbow
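As a sketch of how the P, V, A features for wrist and elbow keypoints might be derived from pose time series using finite differences (the frame rate, smoothing, and normalization used in the paper are not specified here):

```python
import numpy as np

def pva_features(positions, fps=25.0):
    """positions: (T, K, 2) array of x, y keypoint coordinates over time,
    e.g. wrist and elbow tracks for one 12-s snip. Returns per-frame
    position, velocity, and acceleration features concatenated along the
    feature axis. The frame rate and the use of simple finite differences
    are illustrative assumptions."""
    velocity = np.gradient(positions, 1.0 / fps, axis=0)
    acceleration = np.gradient(velocity, 1.0 / fps, axis=0)
    T = positions.shape[0]
    # Flatten keypoints/coordinates and concatenate P, V, A per frame.
    return np.concatenate(
        [positions.reshape(T, -1),
         velocity.reshape(T, -1),
         acceleration.reshape(T, -1)], axis=1)
```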

Table 7

Unimodal accuracies (%) on validation data with and without gradient reversal (GR) for different feature combinations.

Modality | Feature | Split 1 (Raga / Singer) | Split 2 (Raga / Singer) | Split 3 (Raga / Singer) | Mean (Raga / Singer)
Audio | F0 (RF) | 57.2 / – | 56.7 / – | 58.5 / – | 57.4 / –
Audio | F0 | 63.0 / – | 65.8 / – | 60.9 / – | 63.2 / –
Audio | F0 + VM | 83.0 / – | 84.9 / – | 81.3 / – | 83.1 / –
Audio | F0 + VM + GR | 86.1* / 13.2 | 84.3 / 11.1 | 82.7 / 11.5 | 84.3 / 11.9
Video | PVA-WE (RF) | 10.7 / – | 11.9 / – | 11.0 / – | 11.2 / –
Video | VA-W | 13.8 / – | 15.3 / – | 14.6 / – | 14.5 / –
Video | VA-W + GR | 17.8* / 11.8 | 18.0* / 12.1 | 17.0* / 14.2 | 17.6* / 12.7
Video | VA-WE | 14.2 / – | 14.6 / – | 13.4 / – | 14.1 / –
Video | VA-WE + GR | 16.1* / 13.5 | 16.5* / 12.9 | 15.8 / 12.6 | 16.1* / 13.0
Video | PVA-W | 11.1 / – | 13.7 / – | 13.5 / – | 12.7 / –
Video | PVA-W + GR | 17.8* / 14.2 | 17.9* / 13.8 | 17.1* / 13.5 | 17.6* / 13.8
Video | PVA-WE | 10.7 / – | 11.5 / – | 11.0 / – | 11.1 / –
Video | PVA-WE + GR | 18.4* / 12.4 | 19.8* / 10.1 | 18.3* / 12.8 | 18.8* / 12.1

[i] The first row in each modality results corresponds to a random forest model on the stated feature (F0 for audio, PVA‑WE for video). Singer scores are accuracies on the auxiliary singer classification arm for GR models and hence not relevant for models without GR. Bold numbers indicate the best val. accuracy for each split in each modality. Bold feature indicates the best feature in each modality by mean across splits—these models are used in multimodal experiments reported in Table 8. (*) indicates where the model with GR is statistically better () than the model without it.

[ii] GR: Gradient Reversal, VM: Voicing Mask, P: Position, V: Velocity, A: Acceleration, W: Wrist, E: Elbow

Table 8

Different multimodal fusion approaches and their split‑wise validation accuracies.

Model | Place of fusion | Unimodal weights | Layers updated by multimodal loss | Type of loss | Samples used in loss | Split 1 (%) | Split 2 (%) | Split 3 (%) | Mean (%)
SF | Raw features | Trainable | – | NA | – | 60.2 | 61.5 | 56.4 | 59.4
DF | Unimodal softmax | Frozen | – | – | – | 83.7 | 86.1* | 83.2 | 84.3
LF | Conv. O/P | Frozen conv. | – | – | – | 76.1 | 73.6 | 73.0 | 74.2
MCL | Inception O/P | Trainable conv. + inception | Both modalities | Unsup. | Paired | 78.9 | 79.1 | 80.1 | 79.4
MCL | Inception O/P | Trainable conv. + inception | Video | Unsup. | Paired | 79.1 | 79.3 | 80.5 | 79.6
MCL | Inception O/P | Frozen conv. + inception | Both modalities | Unsup. | Paired | 87.6* | 86.4 | 83.1 | 85.7*
MCL | Inception O/P | Frozen conv. + inception | Video | Unsup. | Paired | 87.8* | 86.5* | 83.5 | 86.0*
MCL + NS | Inception O/P | Frozen conv. + inception | Video | Unsup. | Paired + random vec. | 80.4 | 79.5 | 79.9 | 79.9
MCL + NS | Inception O/P | Frozen conv. + inception | Video | Sup. | Paired + neg. | 86.9* | 84.9 | 82.4 | 84.7
MCL + NS | Inception O/P | Frozen conv. + inception | Video | Sup. | Paired + hard neg. | 87.1* | 85.5 | 82.6 | 85.1
BPCL | Inception O/P | Frozen conv. + inception | Both modalities | Unsup. | All samples in batch | 87.9* | 86.1* | 84.7* | 86.2*

[i] For latent fusion, operation of fusion is depthwise stacking; for all others, it is concatenation. Except for source fusion, weights of best unimodal models including gradient reversal (GR) are either frozen or trainable layers initialized with them. The best unimodal models from Table 7 are ‘F0 + VM + GR’ and ‘PVA‑WE + GR’ for audio and video, respectively. (*) indicates where the multimodal model results are statistically better () than the corresponding results for audio alone. Bold indicates the best‑performing model for that split.

[ii] SF: Source Fusion, LF: Latent Fusion, DF: Decision Fusion, MCL: Multimodal Contrastive Loss, MCL + NS: Multimodal Contrastive Loss with Negative Sampling, BPCL: Batch Pairwise Contrastive Loss
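For readers unfamiliar with batch-level contrastive objectives, the following is an InfoNCE-style sketch of a batch pairwise contrastive loss over matched audio–video embeddings, with a temperature as in Table 5; this is an assumed formulation and not necessarily the paper's exact BPCL:

```python
import torch
import torch.nn.functional as F

def batch_pairwise_contrastive_loss(audio_emb, video_emb, temperature=0.1):
    """Each audio embedding should be closest to the video embedding from
    the same 12-s segment, with every other sample in the batch acting as
    a negative; symmetric over the two retrieval directions. Illustrative
    InfoNCE-style formulation."""
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(video_emb, dim=1)
    logits = a @ v.t() / temperature                 # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```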

Table 9

Average accuracy (%) across 30 splits. (*) indicates statistically significant with respect to audio ().

Split Type | Audio (%) | Video (%) | MM (%)
Addl-Splits | 78.7 | 19.8 | 80.4*
Test-Splits | 76.0 | 12.9 | 77.1*
Figure 8

Normalized confusion matrices on validation data across three splits. Numbers represent percentage of the total validation data. The models are the best unimodal and multimodal models viz. F0 + VM + GR for audio, PVA‑WE + GR for video, and BPCL for multimodal.

Figure 9

Histogram indicating percentage of the validation data predicted correctly (1) or incorrectly (0) by audio, video, and multimodal models. For example, 011 indicates incorrect prediction by audio but correct predictions by video and multimodal classifiers. The models are the best unimodal and multimodal models viz. F0 + VM + GR for audio, PVA‑WE + GR for video, and BPCL for multimodal.
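The agreement codes in Figure 9 can be computed directly from per-sample correctness flags; a small sketch with illustrative variable names:

```python
from collections import Counter

def agreement_histogram(y_true, pred_audio, pred_video, pred_mm):
    """Percentage of validation samples for each audio/video/multimodal
    correctness pattern, e.g. '011' = audio wrong, video and multimodal
    correct."""
    codes = [
        f"{int(a == t)}{int(v == t)}{int(m == t)}"
        for t, a, v, m in zip(y_true, pred_audio, pred_video, pred_mm)
    ]
    counts = Counter(codes)
    total = len(codes)
    return {code: 100.0 * n / total for code, n in sorted(counts.items())}
```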

Table 10

Accuracy (%) for 12‑s tokens with total silence duration greater/less than 2 s across the validation data of three splits.

Silence | % Samples | Audio (%) | Video (%) | MM (%)
<= 2 s | 84.6 | 86.3 | 18.7 | 87.7*
> 2 s | 15.4 | 73.3 | 19.0 | 77.6**
Overall | 100.0 | 84.3 | 18.8 | 86.2*

[i] (*) denotes multimodal accuracy is better than audio accuracy with , (**) indicates .

DOI: https://doi.org/10.5334/tismir.221 | Journal eISSN: 2514-3298
Language: English
Submitted on: Sep 2, 2024
Accepted on: Jun 9, 2025
Published on: Jul 21, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Sujoy Roychowdhury, Preeti Rao, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.