
Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss

Open Access
Jul 2025

Abstract

The art music of North India is rich in the use of hand gestures that accompany vocal performance. However, such gestures are idiosyncratic and are neither taught nor rehearsed by the singer. Recent advances in computer vision allow us to analyze the accompanying gestures computationally and look for complementarity with the audio. Using an available dataset of Hindustani raga performances by 11 singers, we extract features from audio and video (gesture) and apply deep learning models to classify the raga from short excerpts. With gesture-based classification performing approximately at chance, we attempt to disentangle the singer information from the raga classification embeddings using a gradient reversal approach. We next experiment with a range of existing multimodal fusion techniques for raga classification. Despite the much weaker performance of the video modality relative to audio, we achieve a singer-feature-disentangled multimodal fusion system that slightly, but consistently, outperforms audio-only classification.
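The gradient reversal approach mentioned in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation; the layer sizes, class counts, and module names below are illustrative assumptions. The key idea is a layer that acts as the identity in the forward pass but negates gradients in the backward pass, so a shared encoder is trained to be *uninformative* for an adversarial singer classifier while remaining informative for the raga classifier.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient
    in the backward pass, as in domain-adversarial training."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing back into the encoder.
        return -ctx.lam * grad_output, None


class SingerDisentangler(nn.Module):
    """Shared encoder with a raga head and an adversarial singer head
    placed behind a gradient reversal layer (hypothetical dimensions)."""

    def __init__(self, in_dim=64, emb_dim=32, n_ragas=9, n_singers=11, lam=1.0):
        super().__init__()
        self.lam = lam
        self.encoder = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU())
        self.raga_head = nn.Linear(emb_dim, n_ragas)
        self.singer_head = nn.Linear(emb_dim, n_singers)

    def forward(self, x):
        z = self.encoder(x)
        raga_logits = self.raga_head(z)
        # Singer gradients are reversed before reaching the encoder,
        # pushing the embedding to discard singer identity.
        singer_logits = self.singer_head(GradReverse.apply(z, self.lam))
        return raga_logits, singer_logits
```

In training, both heads would receive ordinary cross-entropy losses; the reversal alone turns the singer objective adversarial with respect to the shared embedding.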

DOI: https://doi.org/10.5334/tismir.221 | Journal eISSN: 2514-3298
Language: English
Submitted on: Sep 2, 2024
Accepted on: Jun 9, 2025
Published on: Jul 21, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Sujoy Roychowdhury, Preeti Rao, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.