
Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss


1 Introduction

Ragas form a tonal framework for composition and improvisation in Indian art music. A raga can be viewed as falling somewhere between a scale and a tune in terms of its defining grammar, which specifies the tonal material, tonal hierarchy, and characteristic melodic phrases (Powers and Widdess, 2001; Rao and Rao, 2014). Raga characteristics have been extensively explored via features computed from the predominant pitch contour extracted from vocal performance audio (Ganguli and Rao, 2018; Koduri et al., 2012). It is well known, however, that vocalists performing Indian art music use a wide range of manual gestures to accompany their singing: the relationship between their hand movements and the acoustic content of their music has been compared to that between gesture and speech (Clayton, 2007; Leante, 2009; Leante, 2013; Leante, 2018; Rahaim, 2009). The earliest known study of the relationship between gesture and music, at least for Indian classical music, goes back to 1934 (Roy, 1934), where the author proposes a non-exhaustive set of eight melodic curves he calls 'aesthetic units' and suggests that these melodic units are reflected in the motions of the hand during a musical performance. Empirical studies have related gesture to perceived effort and apparent manipulation of imagined objects by singers (Paschalidou et al., 2016), and they have also demonstrated an increase in coordinated head movement between soloists and accompanists at cadential moments (Clayton et al., 2019). Whilst gestures are important, research in musicology (Rahaim, 2009) suggests that, in most cases, gestures do not relate to the meaning of song texts, especially in the khyal and dhrupad genres, but rather complement vocal improvisation. Rahaim further emphasizes that gestures in music, unlike those in dance, are not taught or rehearsed, although there are often striking resemblances in gestural patterns across musical lineages. The sound–gesture relationship has also been explored in the related Carnatic (South Indian) music tradition (Pearson, 2013; Pearson, 2016; Pearson and Pouw, 2022). Existing studies have incorporated both empirical analysis and ethnographic enquiry.

The empirical study of movement is made possible by various combinations of motion‑capture and video‑based tracking of individual body parts. In the case of motion capture, the difficulty of data collection, particularly in natural contexts, limits the scope of research. Capture of full‑body position information directly from video, such as is now possible using pose‑estimation algorithms, significantly increases the scope of multimodal analysis, with the possibility of collecting movement data from natural performance contexts extending over long durations. This makes it possible to explore sound–movement relationships of many kinds.

In recently published work, Clayton et al. (2024) look at raga classification and singer classification, using video only, from randomly selected clips of a fixed 12-s duration taken from recordings of three singers singing nine ragas. They use OpenPose (Cao et al., 2018) for 2D keypoint detection and obtain 3D keypoints for all upper-body joints via a monocular pose-estimation algorithm. For the prediction tasks, they use an action-recognition paradigm with a multiscale temporal graph convolutional neural network (MS-G3D) (Liu et al., 2020). With a separate model for each singer, one take of each raga held out for testing and the other used for training, they obtain a mean accuracy of 38.2%. None of these splits has any unseen singer–raga combination. With a different (unseen singer) split, they obtain a maximum accuracy of 17.2% on their three-singer dataset. On the three-way singer identification task, they report an accuracy of 100%.

Multimodal raga classification, on the other hand, was explored by Clayton et al. (2022). These authors consider the same data of three singers performing nine ragas and use audio, video, and multimodal techniques for raga classification. They classify randomly chosen 12-s clips of video (front view only) into one of the ragas using 2D wrist keypoints of both hands, obtained from OpenPose. From the audio they take the pitch and include a voicing mask, which is 1 for voiced segments and 0 otherwise. Converting the wrist keypoints and audio features into time series, they use deep learning–based classifiers. They demonstrate that fusing selected intermediate-layer embeddings from the network of each modality improves performance from an audio-only accuracy of 82.8% to a multimodal accuracy of 85.1%. Their research is, to the best of our knowledge, the first work toward multimodal raga classification. However, their study is limited to three singers and, more importantly, the same singer–raga combinations were present in both train and test data; in this setting they achieve a nine-way raga-classification accuracy from gesture of 35%. Each singer performed two three-minute takes (of alap) for each raga, pieces of which are expected to be far more similar to each other in audio and gesture than to the alap of the same raga by a different singer. In the unseen-singer context, video-classification accuracies were observed to be near chance level.

Both the above-reviewed works used the unmetered alap, or the improvised raga exposition section, of a concert, thus restricting any dependence of gesture to the melodic aspects of the singing. They investigate the possibility that gesture is sufficiently closely related to the melodic movement of Hindustani ragas that movement data may be used to help predict the identity of the raga being sung. A greatly expanded version of the dataset, with 11 singers performing alap in the same nine ragas in the khyal genre, was introduced in the work of Nadkarni et al. (2023). The dataset is available, with consent obtained from the singers, in an Open Science Framework (OSF) project (Rao et al., 2024). Their goal, however, was to model audio–gesture correspondence at the level of melodic motifs using musicologically motivated features. Instead of raw position alone, they utilized kinematic parameters, namely position (P), velocity (V), and acceleration (A), to represent gesture, as in previous work by Paschalidou et al. (2016) and Pearson and Pouw (2022). We adopt this dataset for the current work on the classification of raga from randomly chosen, fixed-duration excerpts of audio and video. Dataset and media-processing details are available on GitHub.

The audiovisual recordings include synchronized videos from three views and audio from a separate high-quality microphone. Pitch (fundamental frequency, F0) is extracted from the audio and normalized using the tonic of each individual singer. Wrist and elbow keypoints extracted from each view are used to estimate the 3D coordinates. In recent work, Roychowdhury et al. (2024) use the same dataset as Nadkarni et al. (2023) and evaluate multiple keypoint detection, reconstruction, and multiview fusion techniques. Their results validate the choice of VideoPose3D as the framework for 3D reconstruction.

The recordings are split into randomly chosen 12-s clips, matching the typical duration of musical phrases. Using these 12-s clips, we attempt raga classification from the audio and the video (gesture) information alone; for this, we train convolutional neural network (CNN) based time series classifiers similar to those used by Clayton et al. (2022). Our train–val split is chosen such that the same singer–raga combination is not present across the train and validation sets. We call this the unseen singer–raga combination setting: every singer and every raga in the validation set also appears in the train set, but the singer–raga combinations are mutually exclusive between the two sets. Given the expected singer dependence of gestures, we investigate gradient reversal (GR) based methods to disentangle raga from singer information, which gives us some improvement in classification accuracy. For combining audio and video information, we propose a unified framework for different multimodal fusion approaches that facilitates systematic investigation in the context of our music-classification task. An overall schematic of our proposed solution is provided in Figure 1. Our objective is to achieve improved performance via multimodal classification over that of the more dominant audio modality.

[Image: tismir-8-1-221-g1.png]
Figure 1

Proposed system for multimodal classification from pose and audio time series extracted from 12‑s video examples. We show here unimodal classification from video (A) and audio (B), including gradient reversal (GR). Blocks D1 and D2 are auxiliary blocks used in GR. GR is discussed in Section 3.3.2. (C) denotes the multimodal classification experiments, which are discussed in detail in Section 3.4.

The new contributions in this work are as follows:

  • Use of an expanded dataset over that of Clayton et al. (2022) and Clayton et al. (2024) for a similar raga-classification task, but in the practically more interesting context of unseen singer–raga combination train–validation splits.

  • Use of GR techniques to disentangle singer information from embeddings for raga classification.

  • A comprehensive review and investigation of multimodal fusion in order to benefit from any complementarity across the two modalities for potential improvements in classification accuracy over that with audio alone.

The rest of the paper is organized as follows. We briefly describe the dataset and the splits we have used in this study in Section 2. In Section 3, we discuss the unimodal classification and then propose a framework to study multimodal fusion classification. We discuss the different multimodal fusion experiments conducted by us in the later part of Section 3. We present our experimental results in Section 4 and discuss them in Section 5.

2 Dataset

As presented in Table 1, 11 professional singers (5 male, 6 female) performed two takes of alap in each of nine ragas. The ragas, as listed in Table 2, offer a cross‑section of raga features in aspects such as the mood or character with which they are associated (serious, joyful, etc.), typical speed and complexity of melodic movement, and predominant melodic range (i.e., favoring the upper or lower tetrachord). All singers self‑declared themselves to be right‑handed.

Table 1

Summary statistics for our dataset.

Number of singers | 11 (5M, 6F)
Number of ragas | 9
Number of alap recordings | 199
Total recording time (mins) | 609
Average time per alap (mins) | 03:18
Table 2

The pitch sets employed by the nine ragas.

Raga | Scale
Bageshree (Bag) | S R g m P D n
Bahar | S R g m P D n N
Bilaskhani Todi (Bilas) | S r g m P d n
Jaunpuri (Jaun) | S R g m P d n
Kedar | S R G m M P D N
Marwa | S r G M D N
Miyan ki Malhar (MM) | S R g m P D n N
Nand | S R G m M P D N
Shree | S r G M P d N

[i] Lower‑case letters refer to the lower (flatter) alternative; upper‑case letters refer to the higher (sharper) pitch in each case (Clayton et al., 2022).

Of interest to us are the melodic features extracted from the audio—namely, the singer's F0 contour and voicing information—and the kinematic parameters from video pose estimation, given by position (P), velocity (V), and acceleration (A) in 3D, sampled at 10-ms intervals. Both are available at the dataset GitHub link.

2.1 Data splits

Following Clayton et al. (2022), the audio–gesture synchronized data for each recording is split into 12‑s segments. The start times of the 12‑s segments are separated by a randomly selected value in the interval [0.8, 2.4] s.

We then divide the entire set of 12-s segments into train and validation subsets. Our train–val split strategy is based on choosing two ragas from each singer as part of the validation set and the rest as part of the train set, creating a relatively balanced dataset across singers and ragas. We create three distinct splits with this strategy. Figure 2 shows the distribution of singer–raga labels across the train and val sets for Split 1, with the actual counts for the three splits given in Table 3. Approximately 22% of the data are in the validation sets.
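As an illustration, the following Python sketch mirrors the segmentation and split logic described above under simplifying assumptions: recordings and labels are abstracted, function names are ours, and the balancing constraints across singers and ragas are not enforced.

```python
import random

def segment_starts(total_dur_s, seg_dur_s=12.0, gap_range=(0.8, 2.4), seed=0):
    """Start times of 12-s segments whose successive starts are separated by a
    random value drawn from [0.8, 2.4] s (Section 2.1)."""
    rng = random.Random(seed)
    starts, t = [], 0.0
    while t + seg_dur_s <= total_dur_s:
        starts.append(round(t, 2))
        t += rng.uniform(*gap_range)
    return starts

def unseen_combination_split(singers, ragas, n_val_ragas=2, seed=0):
    """Per singer, place two ragas in validation and the rest in train, so that
    singer-raga combinations never co-occur across the two subsets."""
    rng = random.Random(seed)
    train, val = [], []
    for singer in singers:
        val_ragas = set(rng.sample(ragas, n_val_ragas))
        for raga in ragas:
            (val if raga in val_ragas else train).append((singer, raga))
    return train, val

# Example: a three-minute alap yields heavily overlapping 12-s windows.
# print(len(segment_starts(180.0)))
```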

[Image: tismir-8-1-221-g2.png]
Figure 2

Schematic representation of unseen singer–raga Split 1. All 12‑s snips belonging to the green singer–raga combinations are in train and those in blue are in validation.

Table 3

Count of 12‑s segments for train and validation for the three splits.

Split | Total | Train | Val | Train % | Val %
Split 1 | 18273 | 14170 | 4103 | 77.5 | 22.4
Split 2 | 18273 | 14179 | 4094 | 77.5 | 22.4
Split 3 | 18253 | 14269 | 4004 | 78.1 | 21.9

3 Methodology

We describe in this section the methodology for unimodal and multimodal classification.

3.1 Features

In the unimodal classification tasks, we use raw features drawn from either the audio or the video. The features (or a subset thereof) used in our classification are:

  • Audio features—we use two audio features, viz. the pitch contour and the voicing mask. The pitch contour is normalized using the singer's tonic. The voicing mask (VM) is 0 when the frame is unvoiced (in which case the pitch is assigned a default value of −3000) and is otherwise equal to 1 (i.e., for voiced segments). The voicing mask is useful since the default value of −3000 is arbitrary, and without the mask the neural network could confuse real pitch values with default values.

  • Gesture features—we use the 3D P–V–A features for the left and right wrists (W) and elbows (E). We use the individual x, y, and z components of P, V, and A as features; thus, we use nine features for each keypoint (wrist/elbow) of each hand. As a result, we have 36 gesture features. The x and y coordinates are in pixel coordinates and with respect to the torso, as per Pavllo et al. (2019).

For a 12‑s excerpt, we obtain 1200 samples for each of the time series corresponding to the feature value at 10‑ms intervals.
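A minimal sketch of how these per-frame time series might be assembled, assuming NumPy arrays; the cents-based tonic normalization and the finite-difference computation of velocity and acceleration are illustrative assumptions, since the dataset already provides the kinematic parameters.

```python
import numpy as np

FRAME_HOP_S = 0.01          # 10-ms frames -> 1200 frames per 12-s excerpt
UNVOICED_DEFAULT = -3000.0  # arbitrary default pitch for unvoiced frames

def audio_features(f0_hz, tonic_hz):
    """Tonic-normalized pitch contour plus voicing mask, shape (2, T).
    Cents relative to the tonic is one plausible normalization."""
    voiced = f0_hz > 0
    pitch = np.full_like(f0_hz, UNVOICED_DEFAULT, dtype=float)
    pitch[voiced] = 1200.0 * np.log2(f0_hz[voiced] / tonic_hz)
    voicing_mask = voiced.astype(float)
    return np.stack([pitch, voicing_mask])

def gesture_features(position):
    """3D P-V-A for left/right wrist and elbow keypoints.
    `position` has shape (T, 4, 3): 4 keypoints x (x, y, z).
    Returns a (36, T) array: 9 features per keypoint."""
    velocity = np.gradient(position, FRAME_HOP_S, axis=0)
    acceleration = np.gradient(velocity, FRAME_HOP_S, axis=0)
    pva = np.concatenate([position, velocity, acceleration], axis=-1)  # (T, 4, 9)
    return pva.reshape(pva.shape[0], -1).T
```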

3.2 Architecture

For our classification tasks, we use the architecture shown in Figure 3, following earlier work on raga classification in the study by Clayton et al. (2022). It consists of convolutional layers followed by batch normalization (BN) and rectified linear units (ReLUs). This is followed by a 1D inception layer block (Szegedy et al., 2016). Figure 4 shows a detailed structure of the inception block used in the architecture. The inception layer, having convolutional filters of different kernel sizes, provides the capability of extracting features at different temporal resolutions. The inception layer is followed by dense layers and the final softmax layer for classification.

[Image: tismir-8-1-221-g3.png]
Figure 3

Architecture for unimodal classification (without gradient reversal (GR)).

[Image: tismir-8-1-221-g4.png]
Figure 4

Detailed structure of the inception block of Figure 3. ‘k Conv (n), S’ indicates a convolution layer with n k‑sized kernels and a stride S. ‘p’ indicates the pooling size of the pool layer and ‘P’ indicates the type of pooling. k, p, P, and the number of filters were determined by hyperparameter tuning. S = 1 for the video and S = 2 for the audio model. The inception block is similar to that used in the work of Clayton et al. (2022).
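A minimal PyTorch sketch of the unimodal classifier of Figures 3 and 4; the filter counts, kernel sizes, and pooling choices shown are placeholders for the values found by hyperparameter tuning (Table 5), and the model returns logits to be used with a cross-entropy loss rather than an explicit softmax layer.

```python
import torch
import torch.nn as nn

class InceptionBlock1D(nn.Module):
    """Parallel 1D convolutions with different kernel sizes plus a pooling
    branch, concatenated along the channel axis (cf. Figure 4)."""
    def __init__(self, in_ch, n_filters=32, kernel_sizes=(3, 5, 7), stride=1):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv1d(in_ch, n_filters, k, stride=stride, padding=k // 2),
                          nn.BatchNorm1d(n_filters), nn.ReLU())
            for k in kernel_sizes
        ])
        self.pool = nn.Sequential(nn.MaxPool1d(3, stride=stride, padding=1),
                                  nn.Conv1d(in_ch, n_filters, 1),
                                  nn.BatchNorm1d(n_filters), nn.ReLU())

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches] + [self.pool(x)], dim=1)

class UnimodalRagaClassifier(nn.Module):
    """Conv + BN + ReLU, one inception block, then dense layers (cf. Figure 3)."""
    def __init__(self, in_ch, n_classes=9, conv_filters=64, stride=1):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(in_ch, conv_filters, 5, padding=2),
                                  nn.BatchNorm1d(conv_filters), nn.ReLU())
        self.inception = InceptionBlock1D(conv_filters, stride=stride)
        self.head = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                  nn.Linear(4 * 32, 64), nn.ReLU(),
                                  nn.Linear(64, n_classes))

    def forward(self, x):            # x: (batch, channels, 1200)
        return self.head(self.inception(self.conv(x)))
```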

3.3 Disentanglement—GR

Raga classification from gesture, especially for unseen singers, has been shown to be almost at chance by Clayton et al. (2022) and Clayton et al. (2024). On the other hand, strong singer dependence has been suggested by both musicological studies (Rahaim, 2009) and Music Information Retrieval (MIR) results (Clayton et al., 2024; Roychowdhury et al., 2024). To improve the raga-classification results from gesture, we considered methods for disentangling singer information from the embeddings obtained in video-based classification. As there is little literature on multimodal music classification, we looked at similar work in other domains, such as emotion classification.

3.3.1 Background—disentanglement in multimodal emotion classification

Emotion classification from speech has been done based on audio (Badshah et al., 2017; Mao et al., 2014), videos of the face (Fan et al., 2016), and multimodal features (Rajagopalan et al., 2016; Zadeh et al., 2017). Gesture in the expression of emotion is also idiosyncratic, and various researchers have tried to disentangle speaker information from the features to improve emotion classification (Peri et al., 2021; Tu et al., 2019).

Disentanglement of speaker information, or domain information, for emotion recognition has been done by Tu et al. (2019). In their approach, which they call domain adversarial training, they use two separate classifier layers on top of a convolutional network for obtaining an embedding. The authors use GR, introduced by Ganin and Lempitsky (2015) for digit recognition. The GR layer multiplies the gradient from the domain classifier by a negative number before passing it to the feature extraction layers.

3.3.2 GR—raga classification

We use a GR approach to try to disentangle singer information from the representations to improve the raga-classification task, as depicted in Figure 5, which shows the relevant architecture and losses. Thus, if $\theta_f$, $\theta_r$, and $\theta_s$ are the model parameters for the feature extraction (including the inception layer), the raga-classification branch, and the singer-classification branch, respectively, the GR layer multiplies the gradients of the singer-classification loss $L_s$ by $-\lambda$ in the feature layers. $L_s$ is the categorical cross-entropy loss for the singer labels. The effect of the GR layer is that the model tries to confuse the singer classifier while improving the accuracy of the raga classifier.

[Image: tismir-8-1-221-g5.png]
Figure 5

Gradient reversal (GR) schematic diagram—we show this with respect to gesture features. However, GR can be used even for audio features. The D1 block here corresponds to the D1 block in Figure 1. All layers are trainable.

To obtain a disentangled representation, the singer-classifier subnetwork is preceded by a GR layer, which multiplies the gradients by a negative value $-\lambda$, whereas, in the forward pass, there is no change and it acts as an identity mapping.

Thus, the total loss is given by

$$L_{\text{total}}(\theta_f, \theta_r, \theta_s) = L\big(y_r, \hat{y}_r\big) - \lambda\, L\big(y_s, \hat{y}_s\big) \qquad (1)$$

where $\theta = (\theta_f, \theta_r, \theta_s)$ denotes the model parameters and $L$ represents the categorical cross-entropy loss between the actual and predicted labels: $y_r$ and $\hat{y}_r$ are the actual and predicted raga labels, whereas $y_s$ and $\hat{y}_s$ are the actual and predicted singer labels. $f(x; \theta_f)$ represents the features learned by the network.

In Figure 5, we show gesture features, but this is applicable for audio features too. The value of $\lambda$ is not constant; instead, similar to Osumi et al. (2019), we choose a hyperparameter $\gamma$ and update $\lambda$ based on the number of network updates (i.e., the number of batches $n$) processed, using the following equation:

$$\lambda = \frac{2}{1 + e^{-\gamma n}} - 1 \qquad (2)$$

At the beginning of the training, $\lambda$ is small and increases as further batches are processed. This ensures that the very large losses of the singer classifier at the beginning of training do not start affecting the raga classifier network. $\gamma$ is a hyperparameter chosen between 0 and 1.
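A minimal PyTorch sketch of the GR layer and the lambda schedule; the functional form of the schedule follows the common domain-adversarial-training adaptation and should be read as an assumption, as should the names used here.

```python
import math
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies gradients by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam):
    return GradReverse.apply(x, lam)

def lambda_schedule(n_batches, gamma=0.1):
    """Lambda grows from 0 towards 1 with the number of batches processed;
    the exact functional form here is an assumption (cf. Equation 2)."""
    return 2.0 / (1.0 + math.exp(-gamma * n_batches)) - 1.0

# Training step (sketch): shared features feed both classifier branches.
# feats = feature_extractor(x)
# raga_logits = raga_head(feats)
# singer_logits = singer_head(grad_reverse(feats, lambda_schedule(step)))
# loss = ce(raga_logits, y_raga) + ce(singer_logits, y_singer)
```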

3.4 Multimodal fusion: background and methods

To exploit the complementary nature of the audio and video modalities, it is important to fuse the information from the two modalities. As the problem of multimodal classification in music settings is relatively new, we looked at the literature on other video tasks. Individual modalities can perform very differently at the final task, and it can be challenging to obtain any improvement over the stronger modality. Drawing complementary information across the individual modalities has been attempted via contrastive losses by Franceschini et al. (2022), Mai et al. (2022), and Yang et al. (2022). The loss may be applied at different layers (Fan et al., 2016; Franceschini et al., 2022; Wang et al., 2023) to exploit the expected correspondence between modalities in different ways.

We observe that there are a number of design choices available for fusing information from multiple modalities. We propose a framework for the different options in multimodal fusion; a schematic diagram of the same is given in Figure 6. We suggest a hierarchical structure consisting of design decisions at multiple levels, going from top to bottom in Figure 6. Each of the boxes in the schematic refers to a decision parameter with multiple options. Table 4 presents the decision parameters, options, and some studies that have used these options.

[Image: tismir-8-1-221-g6.png]
Figure 6

Proposed framework to understand the different options available in multimodal fusion.

Table 4

A framework to study the body of existing multimodal fusion techniques.

Parameter | Options | Comments | Used in
Place of fusion | Source fusion | Fuse input features | Chen et al. (2014); Clayton et al. (2022); Gavahi et al. (2023)
 | Latent fusion | Fuse hidden layers of network | Tang et al. (2022); Clayton et al. (2022); Jin et al. (2020)
 | Decision fusion | Combine predictions of model | Clayton et al. (2022); Nemati et al. (2019)
Operation of fusion | Concatenation | Need to have compatible dimensions | Rajinikanth et al. (2020)
 | Element-wise addition | | Raza et al. (2020)
 | Depthwise stacking | | Chu (2024)
Include multimodal loss | No | Do not include multimodal loss | Zhou et al. (2020)
 | Yes | Choices for multimodal loss discussed later |
Architecture of fusion layer | CNN-based | CNN | Rajinikanth et al. (2020)
 | Attention-based | Attention across fused layers | Praveen et al. (2022)
 | Transformer-based | Transformer-based models | Gong et al. (2022)
Training schedule | Frozen unimodal layers | | Gammulle et al. (2021)
 | Trainable unimodal layers | | Li et al. (2023a)
 | Parts of a network | Loss is used to update weights of network parts | Yang et al. (2022)
 | Alternate training epochs | Different epochs update different network parts | Gong et al. (2022)
Multimodal loss options: type of loss | Unsupervised | Multimodal loss does not use label information | Yang et al. (2022)
 | Supervised | | Franceschini et al. (2022)
Multimodal loss options: samples used in loss | Matched positive examples | | Yang et al. (2022)
 | All samples in batch | | Franceschini et al. (2022); Li et al. (2023a)
 | Additional negative samples | | Oramas et al. (2018); Puttagunta et al. (2023)
Multimodal loss options: layers used in loss | Fusion layer | | Franceschini et al. (2022); Mai et al. (2022)
 | Layer prior to fusion layer | | Yang et al. (2021)
 | Unimodal output softmax layer | | Yang et al. (2022)
 | Multiple layers of network | | Fan et al. (2016); Wang et al. (2023)

[i] CNN: Convolutional Neural Network

3.4.1 Fusion without multimodal loss

For fusion without considering multimodal loss, we try the following techniques:

  • Source fusion—we concatenate the audio and gesture features and have convolution, inception, and dense layers using the combined features.

  • Latent fusion—we consider the weights from convolutional layers (including BN and ReLUs) of the best unimodal models and add common inception layers and dense layers. We stack depthwise the output from the convolutional layers in each modality.

  • Decision fusion—we consider the softmax output vector of each modality and train a classifier on the concatenation of the two. We train support vector machine models for the classification task and use 10‑fold cross‑validation for hyperparameter tuning.
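A minimal scikit-learn sketch of the decision-fusion variant above, with an RBF-kernel support vector machine trained on concatenated unimodal softmax outputs and 10-fold cross-validation over the regularizer weight (cf. Table 5); variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def decision_fusion(softmax_audio, softmax_video, labels):
    """softmax_audio/softmax_video: (N, 9) unimodal softmax outputs; labels: (N,)."""
    X = np.concatenate([softmax_audio, softmax_video], axis=1)   # (N, 18)
    grid = GridSearchCV(SVC(kernel="rbf"),
                        param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                        cv=10)
    grid.fit(X, labels)
    return grid.best_estimator_
```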

3.4.2 Fusion based on multimodal loss

We consider different variations of a multimodal loss, which we use for multimodal fusion. The general architecture for this is represented in Figure 7. We consider the best unimodal models (including those having GR) up to the inception layer. We flatten the inception layer outputs and project them into a common space; we refer to these as the unimodal embedding layers in subsequent sections. We either keep these layers frozen or initialize the trainable layers with the weights of the best unimodal models. We add dense layers and softmax layers for the audio and video outputs. In addition, we concatenate the flattened unimodal layers and learn a multimodal softmax for classification. We report multimodal accuracies based on the output of this layer.

[Image: tismir-8-1-221-g7.png]
Figure 7

General structure of multimodal loss–based fusion. The blocks in grey are from the best models of the individual modalities. The grey blocks are kept frozen or trainable in various experiments. If trainable, they are initialized from the weights of the best unimodal models. The blue boxes reflect the layers with which we compute the cross-entropy losses with respect to the ground truth (shown in green). The final predicted output, with respect to which we report accuracies, corresponds to the multimodal softmax block. Other hidden layers of the architecture are shown in white.

Let $\hat{y}^a_i$, $\hat{y}^v_i$, and $\hat{y}^m_i$ be the predicted softmax outputs from the audio, video, and multimodal arms, respectively, for the $i$-th sample. Let $y_i$ be the actual class label for the $i$-th sample, expressed as a one-hot encoded vector.

Then, the audio, video, and multimodal categorical cross-entropy losses $L_a$, $L_v$, and $L_m$, respectively, are given by

$$L_a = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \hat{y}^a_{i,c} \qquad (3)$$

$$L_v = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \hat{y}^v_{i,c} \qquad (4)$$

$$L_m = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \hat{y}^m_{i,c} \qquad (5)$$

where $C$ is the number of classes and $N$ is the number of samples in the training dataset.

One important thing to note is that, since our unimodal networks are independently hyperparameter-tuned, the inception layer outputs may not be of the same dimension. We need to transform these embeddings to a common dimension, learned via hyperparameter tuning, to apply a multimodal loss across them. As we need to ensure that these unimodal embeddings still represent suitable information for classification, we need $L_a$ and $L_v$ in our multimodal experiments.

The unimodal embeddings $e^a_i$ and $e^v_i$ are obtained from the flattened outputs of the inception layers:

$$e^a_i = g_a\big(f^a_i\big) \qquad (6)$$

$$e^v_i = g_v\big(f^v_i\big) \qquad (7)$$

where $g_a$ and $g_v$ represent transformations of the flattened inception layer outputs $f^a_i$ and $f^v_i$ for audio and video, respectively, to a common space.
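A minimal PyTorch sketch of the projection in Equations 6 and 7; the use of a single linear layer per modality is an assumption.

```python
import torch.nn as nn

class EmbeddingProjection(nn.Module):
    """Project flattened audio/video inception outputs of different sizes to a
    common dimension (a hyperparameter, cf. Table 5) so a multimodal loss can
    compare them."""
    def __init__(self, audio_dim, video_dim, common_dim=64):
        super().__init__()
        self.g_audio = nn.Linear(audio_dim, common_dim)
        self.g_video = nn.Linear(video_dim, common_dim)

    def forward(self, flat_audio, flat_video):
        return self.g_audio(flat_audio), self.g_video(flat_video)
```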

3.4.3 Multimodal contrastive loss (MCL) between paired positive examples

For MCL, similar to Yang et al. (2022), we consider paired examples of embeddings from each modality:

$$L_{MCL} = \frac{1}{N}\sum_{i=1}^{N}\Big(1 - \cos\big(e^a_i, e^v_i\big)\Big) \qquad (8)$$

where $i$ is the index of the sample and $N$ is the number of samples.

The total loss is given by

$$L_{\text{total}} = L_a + L_v + L_m + L_{MCL} \qquad (9)$$

We consider different training schedules, as follows:

  • $L_{MCL}$ is used to update the weights of the layers of both modalities.

  • $L_{MCL}$ is used to update the weights of only the video layers, since this is the 'weaker' modality and we want the contrastive loss to improve it. This is similar to the architecture proposed in the work of Yang et al. (2022).

In each case, we consider the unimodal layers to be either frozen or trainable, initialized with the weights from the best unimodal model.
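A minimal PyTorch sketch of the paired contrastive loss and the total loss of Equations 8 and 9; the cosine-distance form and the use of a detached audio embedding to realize the video-only update schedule are assumptions consistent with the description above.

```python
import torch.nn.functional as F

def mcl_paired(e_audio, e_video):
    """Mean cosine distance between paired audio/video embeddings (Equation 8)."""
    return (1.0 - F.cosine_similarity(e_audio, e_video, dim=1)).mean()

def total_loss(logits_a, logits_v, logits_m, y, e_a, e_v, update_audio=True):
    """Unimodal and multimodal cross-entropy plus the paired contrastive term
    (Equation 9). Detaching the audio embedding realizes the schedule where the
    contrastive loss updates only the video layers."""
    ce = F.cross_entropy
    loss = ce(logits_a, y) + ce(logits_v, y) + ce(logits_m, y)
    e_a = e_a if update_audio else e_a.detach()
    return loss + mcl_paired(e_a, e_v)
```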

3.4.4 MCL with negative sampling

Similar to Oramas et al. (2018), we consider negative samples for each modality in addition to contrastive loss between modalities.

We would like the embedding for the $i$-th sample from the audio modality, $e^a_i$, to be as dissimilar as possible from a randomly sampled embedding from the video modality, $\tilde{e}^v_i$. Similarly, the embedding from the video modality, $e^v_i$, should be dissimilar from a random audio embedding $\tilde{e}^a_i$. This encourages well-distributed cross-modal embeddings.

To account for this, we introduce two additional negative sampling losses:

$$L^a_{neg} = \frac{1}{N}\sum_{i=1}^{N}\max\Big(0,\ \cos\big(e^a_i, \tilde{e}^v_i\big) - \alpha\Big) \qquad (10)$$

$$L^v_{neg} = \frac{1}{N}\sum_{i=1}^{N}\max\Big(0,\ \cos\big(e^v_i, \tilde{e}^a_i\big) - \alpha\Big) \qquad (11)$$

$\alpha$ is a margin between 0 and 1.

Our total loss is given by

$$L_{\text{total}} = L_a + L_v + L_m + L_{MCL} + L^a_{neg} + L^v_{neg} \qquad (12)$$

We consider three approaches to choose negative samples, as follows:

  • Random vectors similar to Oramas et al. (2018). This is an unsupervised multimodal loss.

  • Random vectors from a different class. Thus, this is a supervised multimodal loss.

  • Hard negative sampling (Smirnov et al., 2018)—where we choose an example from a different class closest to the current sample.
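A minimal PyTorch sketch of the negative-sampling ingredients above; the hinge form of the penalty and the cosine-based hard-negative selection are assumptions consistent with the description, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def negative_loss(anchor, negative, margin=0.2):
    """Hinge penalty on cross-modal similarity to a negative embedding
    (one plausible reading of Equations 10 and 11)."""
    return F.relu(F.cosine_similarity(anchor, negative, dim=1) - margin).mean()

def hard_negatives(e_anchor, e_other, labels):
    """For each sample, pick the most similar other-modality embedding that
    belongs to a different class (hard negative sampling)."""
    sim = F.cosine_similarity(e_anchor.unsqueeze(1), e_other.unsqueeze(0), dim=2)
    same_class = labels.unsqueeze(1) == labels.unsqueeze(0)
    sim = sim.masked_fill(same_class, float("-inf"))
    return e_other[sim.argmax(dim=1)]
```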

3.4.5 Batch pairwise contrastive loss (BPCL)

Following Franceschini et al. (2022), we consider a contrastive loss between all samples in a batch. We call this loss BPCL. Whereas MCL with negative sampling uses one negative sample, this approach uses a contrastive loss involving all samples in the batch, arising from the original context of this method in an unsupervised setting.

For any sample $i$ in the batch, the loss for that sample for the audio-to-video pair is given as

$$\ell^{a \to v}_i = -\log \frac{\exp\!\big(\cos(e^a_i, e^v_i)/\tau\big)}{\sum_{j=1,\, j \neq i}^{B} \exp\!\big(\cos(e^a_i, e^v_j)/\tau\big)} \qquad (13)$$

where $B$ is the number of samples in the batch, $\tau$ is a temperature parameter, and $\cos(\cdot,\cdot)$ denotes the cosine similarity between two vectors. The term in the numerator therefore corresponds to the exponential of the cosine similarity, scaled by the temperature, between modalities for the positive paired sample, and the term in the denominator is that for all negative pairs. The total loss across samples is given by

$$L^{a \to v} = \frac{1}{B}\sum_{i=1}^{B} \ell^{a \to v}_i \qquad (14)$$

and the total BPCL is

$$L_{BPCL} = L^{a \to v} + L^{v \to a} \qquad (15)$$

Overall, we minimize

$$L_{\text{total}} = L_a + L_v + L_m + L_{BPCL} \qquad (16)$$
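To make the batch pairwise loss concrete, the following PyTorch sketch implements an InfoNCE-style version of Equations 13 to 15, computing cosine similarities over the whole batch with a temperature; the function and variable names are ours and the normalization details are assumptions.

```python
import torch
import torch.nn.functional as F

def bpcl(e_audio, e_video, temperature=0.1):
    """Symmetric batch pairwise contrastive loss over paired audio/video
    embeddings of shape (B, D), cf. Equations 13-15."""
    a = F.normalize(e_audio, dim=1)
    v = F.normalize(e_video, dim=1)
    sim = a @ v.t() / temperature                      # (B, B) cosine / tau
    pos = sim.diag()                                   # matched pairs
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_a2v = torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1)
    neg_v2a = torch.logsumexp(sim.t().masked_fill(eye, float("-inf")), dim=1)
    loss_a2v = (-(pos - neg_a2v)).mean()               # Equation 14 (a -> v)
    loss_v2a = (-(pos - neg_v2a)).mean()               # Equation 14 (v -> a)
    return loss_a2v + loss_v2a                         # Equation 15
```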

3.5 Environment

Experiments were done on a single DGX A100 GPU having 80 GB of GPU RAM. Both PyTorch and TensorFlow were used as frameworks for training the deep learning models.

The hyperparameter configurations for the different models, based on previous work (Clayton et al., 2022; Smith, 2018), are given in Table 5. Hyperparameter tuning of a unimodal model takes about 12 hours on one A100 GPU, and training with frozen hyperparameters takes about an hour. Hyperparameter tuning of the models with multimodal loss takes about one hour on an A100 GPU and about 15 minutes with frozen hyperparameters.

Table 5

Hyperparameter configurations for different models.

Used in Model | Parameter | Values
Audio and video unimodal, source fusion, latent fusion | Temporal resolution | 10, 20 ms
 | No. of conv layers | 1, 2
 | No. of conv filters | 4, 8, 16, 32, 64, 128
 | Kernel size | 3, 5, 7
 | Num inception blocks | 1
 | Inception filters | 4, 8, 16, 32, 64, 128
 | Regularization (L2) weight | 0–1e-4
 | Dropout rate | 0–0.5
 | Learning rate | 0.01, 0.001, 1e-4
Multimodal decision fusion | Support vector machine: regularizer weight | 0.01, 0.1, 1, 10, 100
 | Support vector machine: kernel | RBF
Models with multimodal loss | Common embed dimension | 2–128
 | Temperature (only for BPCL) | 0–1
 | Regularization (L2) weight | 0–1e-4
 | Dropout rate | 0–0.5
 | Learning rate | 0.01, 0.001, 1e-4

[i] BPCL: Batch Pairwise Contrastive Loss

4 Experimental Results

In this section, we present the results of our experiments and an analysis of the same.

4.1 Singer classification: unimodal results

Although not our primary task, we consider unimodal singer classification (without GR) to get an understanding of the amount of singer information present in the data. These results are presented in Table 6. We see that the audio-based classification accuracy is around 50%. Since our features are F0 and the voicing mask, and we have normalized for the singer's tonic, much of the singer-specific information has already been removed from the data. On the other hand, video features lead to much higher accuracies for different feature combinations. This confirms the strong singer-dependency of gestures, as brought out by musicological studies like Rahaim (2009). We are thus motivated to try to remove singer dependencies from the gesture embeddings by GR.

Table 6

Unimodal singer classification accuracy (%) for validation data using various features.

Modality | Feature | Split 1 | Split 2 | Split 3 | Mean
Audio | F0 + VM | 48.1 | 45.9 | 52.3 | 48.8
Video | VA-W | 88.6 | 89.6 | 89.2 | 89.1
Video | VA-WE | 91.8 | 92.7 | 93.2 | 92.5
Video | PVA-W | 95.9 | 96.0 | 96.1 | 96.0
Video | PVA-WE | 97.8 | 97.3 | 97.4 | 97.5

[i] VM: Voicing Mask, P: Position, V: Velocity, A: Acceleration, W: Wrist, E: Elbow

4.2 Raga classification: Unimodal results

In Table 7, we present the raga-classification results with the best-obtained hyperparameters for different features across modalities. Additionally, to contrast with the deep learning model of this work, we test a random forest classifier on the same input features. For audio, the feature set is based on the pitch contour shape represented by Daubechies-4 wavelet coefficients (Rowe and Abbott, 1995), with the feature dimension reduced by principal component analysis (PCA), choosing as many dimensions as explain 90% of the variance on the train data. Similarly, for video features, we compute the Daubechies-4 wavelets for each of the position, velocity, and acceleration (PVA) features, combine them, and finally reduce dimensions via PCA.
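A minimal sketch of this baseline, assuming pywt for the Daubechies-4 decomposition and scikit-learn for PCA and the random forest; the decomposition level and pipeline names are illustrative.

```python
import numpy as np
import pywt
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

def db4_features(series_2d, level=4):
    """Daubechies-4 wavelet coefficients for each row of a (channels, T) array,
    concatenated into one feature vector per 12-s excerpt."""
    coeffs = [np.concatenate(pywt.wavedec(row, "db4", level=level))
              for row in series_2d]
    return np.concatenate(coeffs)

# X: stacked db4 feature vectors per excerpt, y: raga labels
# model = make_pipeline(PCA(n_components=0.90), RandomForestClassifier())
# model.fit(X_train, y_train)
```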

Table 7

Unimodal accuracies (%) on validation data with and without gradient reversal (GR) for different feature combinations.

Modality | Feature | Split 1 (Raga / Singer) | Split 2 (Raga / Singer) | Split 3 (Raga / Singer) | Mean (Raga / Singer)
Audio | F0 (RF) | 57.2 / – | 56.7 / – | 58.5 / – | 57.4 / –
Audio | F0 | 63.0 / – | 65.8 / – | 60.9 / – | 63.2 / –
Audio | F0 + VM | 83.0 / – | 84.9 / – | 81.3 / – | 83.1 / –
Audio | F0 + VM + GR | 86.1* / 13.2 | 84.3 / 11.1 | 82.7 / 11.5 | 84.3 / 11.9
Video | PVA-WE (RF) | 10.7 / – | 11.9 / – | 11.0 / – | 11.2 / –
Video | VA-W | 13.8 / – | 15.3 / – | 14.6 / – | 14.5 / –
Video | VA-W + GR | 17.8* / 11.8 | 18.0* / 12.1 | 17.0* / 14.2 | 17.6* / 12.7
Video | VA-WE | 14.2 / – | 14.6 / – | 13.4 / – | 14.1 / –
Video | VA-WE + GR | 16.1* / 13.5 | 16.5* / 12.9 | 15.8 / 12.6 | 16.1* / 13.0
Video | PVA-W | 11.1 / – | 13.7 / – | 13.5 / – | 12.7 / –
Video | PVA-W + GR | 17.8* / 14.2 | 17.9* / 13.8 | 17.1* / 13.5 | 17.6* / 13.8
Video | PVA-WE | 10.7 / – | 11.5 / – | 11.0 / – | 11.1 / –
Video | PVA-WE + GR | 18.4* / 12.4 | 19.8* / 10.1 | 18.3* / 12.8 | 18.8* / 12.1

[i] The first row in each modality's results corresponds to a random forest model on the stated feature (F0 for audio, PVA-WE for video). Singer scores are accuracies on the auxiliary singer-classification arm for GR models and hence are not relevant for models without GR. Bold numbers indicate the best val. accuracy for each split in each modality. A bold feature indicates the best feature in each modality by mean across splits; these models are used in the multimodal experiments reported in Table 8. (*) indicates where the model with GR is statistically better (p < 0.05) than the model without it.

[ii] GR: Gradient Reversal, VM: Voicing Mask, P: Position, V: Velocity, A: Acceleration, W: Wrist, E: Elbow

For GR-based models, we also present the singer classification accuracy on the singer classification auxiliary branch (D1 and D2 in Figure 1). We observe that, for the same configuration, the application of GR improves classification accuracy by a few percentage points. We also note that using all gesture features and GR gives the best mean accuracy across splits. Recalling that we have 11 singers and thus a chance accuracy of approximately 9%, we observe that, even for video features, the singer dependency has decreased to a little above chance, indicating its near-complete removal. All the same, the limited improvement in raga classification with the disentangled video features suggests that gestural consistency across singers for the same raga is probably relatively weak. To compare the classification results with and without GR, we check for statistical significance by representing the prediction for each sample by 1 if correct and 0 if wrong. We then use McNemar's test (Lachenbruch, 2014) to test for statistical significance. Results with a star (*) in Table 7 show those cases where the model with GR is statistically better (p < 0.05) than that without it, for the same split or across all splits for mean scores. We observe that, for audio, the improvement is not statistically significant.
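A minimal sketch of the significance test, assuming the statsmodels implementation of McNemar's test; the 2x2 contingency table is built from per-sample correctness of the two models being compared.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue(correct_a, correct_b):
    """correct_a / correct_b: boolean arrays of per-sample correctness of the
    two models on the same validation samples."""
    a, b = np.asarray(correct_a), np.asarray(correct_b)
    table = [[np.sum(a & b),  np.sum(a & ~b)],
             [np.sum(~a & b), np.sum(~a & ~b)]]
    return mcnemar(table, exact=False, correction=True).pvalue
```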

We consider the best-performing audio and video models in the subsequent multimodal experiments. These are 'F0 + VM + GR' and 'PVA-WE + GR' for audio and video, respectively.

4.3 Raga classification: multimodal results

In Table 8, we report the classification accuracies for the various fusion strategies discussed earlier. Except for source fusion, we use the best separately trained unimodal models (including GR) and either use their weights as frozen layers or initialize trainable layers with them.

Table 8

Different multimodal fusion approaches and their split‑wise validation accuracies.

Model | Place of fusion | Unimodal weights | Layers updated by multimodal loss | Type of loss | Samples used in loss | Split 1 (%) | Split 2 (%) | Split 3 (%) | Mean (%)
SF | Raw features | Trainable | NA | NA | NA | 60.2 | 61.5 | 56.4 | 59.4
DF | Unimodal softmax | Frozen | – | – | – | 83.7 | 86.1* | 83.2 | 84.3
LF | Conv. O/P | Frozen conv. | – | – | – | 76.1 | 73.6 | 73.0 | 74.2
MCL | Inception O/P | Trainable conv. + inception | Both modalities | Unsup. | Paired | 78.9 | 79.1 | 80.1 | 79.4
MCL | Inception O/P | Trainable conv. + inception | Video | Unsup. | Paired | 79.1 | 79.3 | 80.5 | 79.6
MCL | Inception O/P | Frozen conv. + inception | Both modalities | Unsup. | Paired | 87.6* | 86.4 | 83.1 | 85.7*
MCL | Inception O/P | Frozen conv. + inception | Video | Unsup. | Paired | 87.8* | 86.5* | 83.5 | 86.0*
MCL + NS | Inception O/P | Frozen conv. + inception | Video | Unsup. | Paired + random vec. | 80.4 | 79.5 | 79.9 | 79.9
MCL + NS | Inception O/P | Frozen conv. + inception | Video | Sup. | Paired + neg. | 86.9* | 84.9 | 82.4 | 84.7
MCL + NS | Inception O/P | Frozen conv. + inception | Video | Sup. | Paired + hard neg. | 87.1* | 85.5 | 82.6 | 85.1
BPCL | Inception O/P | Frozen conv. + inception | Both modalities | Unsup. | All samples in batch | 87.9* | 86.1* | 84.7* | 86.2*

[i] For latent fusion, the operation of fusion is depthwise stacking; for all others, it is concatenation. Except for source fusion, the weights of the best unimodal models including gradient reversal (GR) are either frozen or used to initialize trainable layers. The best unimodal models from Table 7 are 'F0 + VM + GR' and 'PVA-WE + GR' for audio and video, respectively. (*) indicates where the multimodal model results are statistically better (p < 0.05) than the corresponding results for audio alone. Bold indicates the best-performing model for that split.

[ii] SF: Source Fusion, LF: Latent Fusion, DF: Decision Fusion, MCL: Multimodal Contrastive Loss, MCL + NS: Multimodal Contrastive Loss with Negative Sampling, BPCL: Batch Pairwise Contrastive Loss

The source and latent fusion methods do not employ any multimodal loss function in training. These are seen to perform much worse than the audio‑only model. Source fusion does not use any information from the unimodal models, and latent fusion only uses the frozen convolutional layer weights. Our results indicate that the larger number of (relatively weak) video features overpowers the audio features, leading to overall poorer performance. In decision fusion, we obtain a performance similar to audio‑only accuracy, but not better.

When we look at the models with multimodal fusion incorporated, we see that, for MCL, keeping the weights trainable reduces the performance because we have a very large number of trainable parameters across modalities and the overall results become weaker. The MCL with frozen convolutional and inception layers, however, helps in improving over audio accuracy, particularly when the fusion loss does not update the audio embedding weights—the performance in the latter case is competitive on average with the best model (BPCL) and is the best‑performing model for one split. Adding random negative sampling does not benefit the training process, but supervised negative samples from other classes, especially hard negative samples, are beneficial. The BPCL model takes multiple pairwise losses and thus brings in the contrastive loss across modalities in a much stronger way. We see that this approach thus has the best average performance.

We check for statistical significance for the multimodal model against that of the audio models using McNemar’s test. Results with star (*) in Table 8 are those where the multimodal model is better than the corresponding ones for audio alone. We see that BPCL performs consistently better (p < 0.05) than audio.

In summary, although the video modality is much ‘weaker’ in the raga‑classification task, we observe that the decision fusion models are competitive with the audio modality, and some of the fusion loss–based models outperform the audio‑only results, successfully exploiting complementary information. The steep difference in performance across the two modalities, however, limits the overall improvement in multimodal settings.

4.4 Additional splits and statistical testing

To compare the performance of the best models in each modality, we make 30 additional splits of train and val using the same strategy as described in Section 2.1. We call this set of 30 train and val splits Addl-Splits.

We also create 30 splits with a distinct test set. Here, for each singer, we consider one raga to be in test, two to be in val, and the rest in train. We ensure that the two ragas do not appear in the val set of multiple singers and that all ragas appear in the val set of some singer. We thus have approximately 11% of the data in the test set, 22% in the val set, and the rest in the train set. We call this set of 30 splits Test-Splits.

Our results in Tables 7 and 8 are based on hyperparameters tuned for each individual split. To reduce overall compute time for testing on the new larger set of splits, we use the best of the three sets of hyperparameters obtained on the original three splits by picking the set that provides the best mean performance across the three splits for a given modality. We then train individual models using these hyperparameters on each split in Addl‑Splits and Test‑Splits. For the latter, we evaluate on the test set. The results for the additional splits are presented in Table 9 for our best feature combinations (viz. F0 + VM + GR, PVA‑WE + GR, and BPCL for audio, video and multimodal respectively). We observe that the accuracies for audio and multimodal for Addl‑Splits are less than those in Tables 7 and 8, respectively. This is because hyperparameters have not been chosen specifically for any split. However, the multimodal accuracies are statistically better than those for audio for both Addl‑Splits and Test‑Splits.

Table 9

Average accuracy (%) across 30 splits. (*) indicates statistically significant improvement with respect to audio (p < 0.05).

Split Type | Audio (%) | Video (%) | MM (%)
Addl-Splits | 78.7 | 19.8 | 80.4*
Test-Splits | 76.0 | 12.9 | 77.1*

5 Discussion

From Table 7, we see the best audio model is ‘F0 + VM + GR’ and the video model is ‘PVA‑WE + GR.’ The best multimodal model from Table 8 is BPCL. Figure 8 shows the normalized confusion matrices for these best unimodal and multimodal models from the combined validation data of the three splits.

[Image: tismir-8-1-221-g8.png]
Figure 8

Normalized confusion matrices on validation data across three splits. Numbers represent percentage of the total validation data. The models are the best unimodal and multimodal models viz. F0 + VM + GR for audio, PVA‑WE + GR for video, and BPCL for multimodal.

From the audio confusion matrix, we can observe some important aspects of misclassification. We note that there are some misclassifications between Miyan ki Malhar (MM) and Bahar. These ragas have similar scales and melodic movement. Other confusions exist between Kedar and Nand (same scale) and between Bageshree (Bag) and Bahar (the latter uses one additional note) (Rao and van der Meer, 2010). These observations are in line with those made on a smaller dataset in Clayton et al. (2022).

The video confusion matrix is much harder to read and interpret because of the poorer performance. When we look at the multimodal classification matrix, we observe that the Nand–Kedar misclassification reduces. Bahar predicted as MM reduces (from 1.3% to 0.7%), whereas MM predicted as Bahar remains unchanged at 1.2%. Bageshree predicted as Bahar remains unchanged at 0.7%, but the reverse deteriorates a bit in multimodal classification (from 0.4% to 0.7%).

Figure 9 shows the histogram of classification correctness with reference to the combined unimodal and multimodal predictions at the 12‑s excerpt level for the validation dataset. As expected from overall accuracies, most samples are correct in audio and multimodal and wrong in video. However, for 4.9% of cases (001, 011 keys), the multimodal prediction corrects the wrong predictions of the audio modality. We also note that only 1.4% of samples are present where audio and multimodal are both wrong but video is correct, indicating that the multimodal approaches have done most of the possible corrections.

[Image: tismir-8-1-221-g9.png]
Figure 9

Histogram indicating percentage of the validation data predicted correctly (1) or incorrectly (0) by audio, video, and multimodal models. For example, 011 indicates incorrect prediction by audio but correct predictions by video and multimodal classifiers. The models are the best unimodal and multimodal models viz. F0 + VM + GR for audio, PVA‑WE + GR for video, and BPCL for multimodal.

We also carry out an analysis of errors and find that 12-s tokens containing sizeable silence (in the audio segment) benefit the most from multimodal classification. We speculate that this is because, while singing pauses may not signal raga identity, the singer's gestures tend to be uninterrupted and continuous across such regions. The observed accuracies for segments containing different durations of silence are presented in Table 10. More detailed results regarding the variation of audio and video performance with duration of silence are provided in the supplementary material.
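A minimal sketch of this analysis, assuming the voicing mask of Section 3.1 (0 for unvoiced frames at 10-ms hops); the function names and the exact bucketing are illustrative.

```python
import numpy as np

FRAME_HOP_S = 0.01  # 10-ms frames

def silence_seconds(voicing_mask):
    """Total unvoiced duration (s) in one 12-s token; mask is 0 when unvoiced."""
    return float(np.sum(voicing_mask == 0)) * FRAME_HOP_S

def accuracy_by_silence(voicing_masks, correct, threshold_s=2.0):
    """Group per-token correctness by total silence above/below a 2-s threshold."""
    sil = np.array([silence_seconds(m) for m in voicing_masks])
    correct = np.asarray(correct, dtype=float)
    return {"<=2s": correct[sil <= threshold_s].mean(),
            ">2s": correct[sil > threshold_s].mean()}
```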

Table 10

Accuracy (%) for 12‑s tokens with total silence duration greater/less than 2 s across the validation data of three splits.

Silence | % Samples | Audio | Video | MM
<= 2 s | 84.6 | 86.3 | 18.7 | 87.7*
> 2 s | 15.4 | 73.3 | 19.0 | 77.6**
Overall | 100.0 | 84.3 | 18.8 | 86.2*

[i] (*) denotes that multimodal accuracy is better than audio accuracy with p < 0.05; (**) indicates p < 0.01.

6 Conclusion and Future Work

Gestures are an integral part of Hindustani vocal performances. Empirical observations in the available musicological studies indicate that gestures are highly idiosyncratic to the singer, making it challenging to draw correspondences between the movement and the underlying raga-related characteristics in a cross-singer setting. In this study, we utilize deep learning models to exploit any implicit correlations between melody and gesture for a raga-classification task. Using tonic-normalized pitch contours and the 3D position, velocity, and acceleration of the wrist and elbow joints, we presented raga classification with each modality. We also disentangle singer information from the embeddings by using GR, and our results show a small improvement in accuracy for each modality. Further, a range of multimodal fusion methods that differ in the place of fusion, type of loss, and training protocol are tested. We find that the best multimodal fusion method is BPCL, which improves significantly upon audio-only accuracies. Our study, although limited to raga classification, provides useful pointers to the use of feature disentanglement and multimodal fusion in more general MIR settings.

Our gesture analyses utilized wrist and elbow keypoints. It remains to be seen whether other keypoints, such as the shoulder and finger joints, can contribute usefully. Further, a more semantic segmentation of the audiovisual time series could prove more informative than the fixed-duration excerpts of this work. Apart from the potential to provide musicologically interesting insights, the deep learning methods of this work can contribute to the development of a digital avatar for audiovisual vocal performance driven by the specification of melodic characteristics such as raga.

7 Reproducibility

Code to reproduce the work is available at: https://github.com/DAP-Lab/multimodal_raga_processing. The same repository has the Supplementary Material in the ‘Supplementary Material’ folder.

Competing Interests

PR is a member of the editorial team and has recused themselves from any editorial involvement with this article.

Note

DOI: https://doi.org/10.5334/tismir.221 | Journal eISSN: 2514-3298
Language: English
Submitted on: Sep 2, 2024
Accepted on: Jun 9, 2025
Published on: Jul 21, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Sujoy Roychowdhury, Preeti Rao, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.