Abstract
The art music of North India makes rich use of hand gestures accompanying vocal performance. However, such gestures are idiosyncratic, and are neither taught to nor rehearsed by the singer. Recent advances in computer vision allow us to analyze these accompanying gestures computationally and to look for complementarity with the audio. Using an available dataset of Hindustani raga performances by 11 singers, we extract features from the audio and video (gesture) streams and apply deep learning models to classify the raga from short excerpts. With gesture-based classification performing approximately at chance, we attempt to disentangle singer information from the raga classification embeddings using a gradient reversal approach. We next investigate multimodal raga classification through experiments with a framework that draws on the body of existing multimodal fusion techniques. Despite the much weaker performance of the video modality relative to audio, we obtain a singer-feature-disentangled multimodal fusion system that slightly, but consistently, outperforms audio-only classification.
