
Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss

Open Access | Jul 2025

References

  1. Badshah, A. M., Ahmad, J., Rahim, N., and Baik, S. W. (2017). Speech emotion recognition from spectrograms with deep convolutional neural network. In 2017 International Conference on Platform Technology and Service (PlatCon) (pp. 1–5). IEEE.
  2. Cao, Z., Hidalgo, G., Simon, T., Wei, S.‑E., and Sheikh, Y. (2019). OpenPose: Realtime multi‑person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1), 172–186.
  3. Chen, J., Chen, Z., Chi, Z., and Fu, H. (2014). Emotion recognition in the wild with feature fusion and multiple kernel learning. In Proceedings of the 16th International Conference on Multimodal Interaction (pp. 508–513).
  4. Chu, K. (2024). Application of animation products via multimodal information and semantic analogy. Multimedia Tools and Applications, 83(9), 26031–26054.
  5. Clayton, M. (2017). Time, gesture, and attention in a khyāl performance. In Ethnomusicology: A Contemporary Reader (Vol. II, pp. 249–266). Routledge.
  6. Clayton, M., Jakubowski, K., and Eerola, T. (2019). Interpersonal entrainment in Indian instrumental music performance: Synchronization and movement coordination relate to tempo, dynamics, metrical and cadential structure. Musicae Scientiae, 23(3), 304–331.
  7. Clayton, M., Li, J., Clarke, A., and Weinzierl, M. (2023). Hindustani raga and singer classification using 2D and 3D pose estimation from video recordings. Journal of New Music Research, 52(4), 285–300.
  8. Clayton, M., Rao, P., Shikarpur, N., Roychowdhury, S., and Li, J. (2022). Raga classification from vocal performances using multimodal analysis. In Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR, Bengaluru, India (pp. 283–290).
  9. Fan, Y., Lu, X., Li, D., and Liu, Y. (2016). Video‑based emotion recognition using CNN‑RNN and C3D hybrid networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (pp. 445–450).
  10. Franceschini, R., Fini, E., Beyan, C., Conti, A., Arrigoni, F., and Ricci, E. (2022). Multimodal emotion recognition with modality‑pairwise unsupervised contrastive loss. In 2022 26th International Conference on Pattern Recognition (ICPR) (pp. 2589–2596). IEEE.
  11. Gammulle, H., Denman, S., Sridharan, S., and Fookes, C. (2021). TMMF: Temporal multi‑modal fusion for single‑stage continuous gesture recognition. IEEE Transactions on Image Processing, 30, 7689–7701.
  12. Ganguli, K. K., and Rao, P. (2018). On the distributional representation of ragas: Experiments with allied raga pairs. Transactions of the International Society for Music Information Retrieval, 1(1), 79–95.
  13. Ganin, Y., and Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (pp. 1180–1189). PMLR.
  14. Gavahi, K., Foroumandi, E., and Moradkhani, H. (2023). A deep learning‑based framework for multi‑source precipitation fusion. Remote Sensing of Environment, 295, 113723.
  15. Gong, Y., Liu, A. H., Rouditchenko, A., and Glass, J. (2022). UAVM: Towards unifying audio and visual models. IEEE Signal Processing Letters, 29, 2437–2441.
  16. Jin, B. T., Abdelrahman, L., Chen, C. K., and Khanzada, A. (2020). Fusical: Multimodal fusion for video sentiment. In Proceedings of the 2020 International Conference on Multimodal Interaction (pp. 798–806).
  17. Koduri, G., Gulati, S., Rao, P., and Serra, X. (2012). Raga recognition based on pitch distribution methods. Journal of New Music Research, 41(4), 337–350.
  18. Lachenbruch, P. A. (2014). McNemar test. Wiley StatsRef: Statistics Reference Online.
  19. Leante, L. (2009). The lotus and the king: Imagery, gesture and meaning in a Hindustani rāg. Ethnomusicology Forum, 18(2), 185–206.
  20. Leante, L. (2013). Gesture and imagery in music performance: Perspectives from North Indian classical music. In T. Shephard and A. Leonard (Eds.), The Routledge companion to music and visual culture (pp. 145–152). Routledge.
  21. Leante, L. (2018). The cuckoo’s song: Imagery and movement in monsoon ragas. In I. Rajamani, M. Pernau, and K. R. B. Schofield (Eds.), Monsoon feelings: A history of emotions in the rain. Niyogi Books.
  22. Li, P., Liu, G., He, J., Zhao, Z., and Zhong, S. (2023). Masked vision and language pre‑training with unimodal and multimodal contrastive losses for medical visual question answering. In International Conference on Medical Image Computing and Computer‑Assisted Intervention (pp. 374–383). Springer.
  23. Liu, Z., Zhang, H., Chen, Z., Wang, Z., and Ouyang, W. (2020). Disentangling and unifying graph convolutions for skeleton‑based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 143–152).
  24. Mai, S., Zeng, Y., Zheng, S., and Hu, H. (2022). Hybrid contrastive learning of tri‑modal representation for multimodal sentiment analysis. IEEE Transactions on Affective Computing, 14(3), 2276–2289.
  25. Mao, Q., Dong, M., Huang, Z., and Zhan, Y. (2014). Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Transactions on Multimedia, 16(8), 2203–2213.
  26. Nadkarni, S., Roychowdhury, S., Rao, P., and Clayton, M. (2023). Exploring the correspondence of melodic contour with gesture in raga alap singing. In Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR, Milan, Italy.
  27. Nemati, S., Rohani, R., Basiri, M. E., Abdar, M., Yen, N. Y., and Makarenkov, V. (2019). A hybrid latent space data fusion method for multimodal emotion recognition. IEEE Access, 7, 172948–172964.
  28. Oramas, S., Barbieri, F., Nieto, O., and Serra, X. (2018). Multimodal deep learning for music genre classification. Transactions of the International Society for Music Information Retrieval, 1(1), 4–21.
  29. Osumi, K., Yamashita, T., and Fujiyoshi, H. (2019). Domain adaptation using a gradient reversal layer with instance weighting. In 2019 16th International Conference on Machine Vision Applications (MVA) (pp. 1–5). IEEE.
  30. Paschalidou, S., and Clayton, M. (2015). Towards a sound‑gesture analysis in Hindustani dhrupad vocal music: Effort and raga space. Proceedings of ICMEM, 23, 25.
  31. Paschalidou, S., Eerola, T., and Clayton, M. (2016). Voice and movement as predictors of gesture types and physical effort in virtual object interactions of classical Indian singing. In Proceedings of the 3rd International Symposium on Movement and Computing, MOCO ’16, New York, NY, USA. Association for Computing Machinery.
  32. Pavllo, D., Feichtenhofer, C., Grangier, D., and Auli, M. (2019). 3D human pose estimation in video with temporal convolutions and semi‑supervised training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7753–7762).
  33. Pearson, L. (2013). Gesture and the sonic event in Karnatak music. Empirical Musicology Review, 8(1), 2–14.
  34. Pearson, L. (2016). Coarticulation and gesture: An analysis of melodic movement in South Indian raga performance. Music Analysis, 35(3), 280–313.
  35. Pearson, L., and Pouw, W. (2022). Gesture–vocal coupling in Karnatak music performance: A neuro–bodily distributed aesthetic entanglement. Annals of the New York Academy of Sciences, 1515(1), 219–236.
  36. Peri, R., Parthasarathy, S., Bradshaw, C., and Sundaram, S. (2021). Disentanglement for audio‑visual emotion recognition using multitask setup. In ICASSP 2021‑2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6344–6348). IEEE.
  37. Powers, H. S., and Widdess, R. (2001). India, subcontinent of, Chapter III: Theory and practice of classical music. In S. Sadie (Ed.), New Grove dictionary of music (2nd ed., pp. 170–210). Macmillan.
  38. Praveen, R. G., de Melo, W. C., Ullah, N., Aslam, H., Zeeshan, O., Denorme, T., Pedersoli, M., Koerich, A. L., Bacon, S., and Cardinal, P. (2022). A joint cross‑attention model for audio‑visual fusion in dimensional emotion recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2486–2495).
  39. Puttagunta, R. S., Li, Z., Bhattacharyya, S., and York, G. (2023). Appearance label balanced triplet loss for multi‑modal aerial view object classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 534–542).
  40. Rahaim, M. (2009). Gesture, melody, and the paramparic body in Hindustani vocal music. University of California.
  41. Rajagopalan, S. S., Morency, L.‑P., Baltrusaitis, T., and Goecke, R. (2016). Extending long short‑term memory for multi‑view structured learning. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14 (pp. 338–353). Springer.
  42. Rajinikanth, V., Joseph Raj, A. N., Thanaraj, K. P., and Naik, G. R. (2020). A customized VGG19 network with concatenation of deep and handcrafted features for brain tumor detection. Applied Sciences, 10(10), 3429.
  43. Rao, P., Roychowdhury, S., and Clayton, M. (2024). Hindustani vocal alap audiovisual correspondence. https://osf.io/qjkzs/.
  44. Rao, S., and Rao, P. (2014). An overview of Hindustani music in the context of computational musicology. Journal of New Music Research, 43(1), 24–33.
  45. Rao, S., and van der Meer, W. (2010). Music in motion. https://autrimncpa.wordpress.com/.
  46. Raza, A., Huo, H., and Fang, T. (2020). PFAF‑Net: Pyramid feature network for multimodal fusion. IEEE Sensors Letters, 4(12), 1–4.
  47. Rowe, A. C., and Abbott, P. C. (1995). Daubechies wavelets and Mathematica. Computers in Physics, 9(6), 635–648.
  48. Roy, R. L. (1934). Hindustani ragas. The Musical Quarterly, XX(3), 320–333.
  49. Roychowdhury, S., Rao, P., and Chandran, S. (2024). Human pose estimation for expressive movement descriptors in vocal music performances. In Proceedings of the 25th International Society for Music Information Retrieval Conference, ISMIR, San Francisco, USA.
  50. Smirnov, E., Melnikov, A., Oleinik, A., Ivanova, E., Kalinovskiy, I., and Luckyanets, E. (2018). Hard example mining with auxiliary embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 37–46).
  51. Smith, L. N. (2018). A disciplined approach to neural network hyper‑parameters: Part 1–learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820.
  52. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  53. Tang, K., Ma, Y., Miao, D., Song, P., Gu, Z., Tian, Z., and Wang, W. (2022). Decision fusion networks for image classification. IEEE Transactions on Neural Networks and Learning Systems.
  54. Tu, M., Tang, Y., Huang, J., He, X., and Zhou, B. (2019). Towards adversarial learning of speaker‑invariant representation for speech emotion recognition. arXiv preprint arXiv:1903.09606.
  55. Wang, Y., Peng, J., Zhang, J., Yi, R., Wang, Y., and Wang, C. (2023). Multimodal industrial anomaly detection via hybrid fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8032–8041).
  56. Yang, J., Bisk, Y., and Gao, J. (2021). TACo: Token‑aware cascade contrastive learning for video‑text alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 11562–11572).
  57. Yang, L., Wu, Z., Hong, J., and Long, J. (2022). MCL: A contrastive learning method for multimodal data fusion in violence detection. IEEE Signal Processing Letters.
  58. Zadeh, A., Chen, M., Poria, S., Cambria, E., and Morency, L.‑P. (2017). Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250.
  59. Zhou, T., Fu, H., Chen, G., Shen, J., and Shao, L. (2020). Hi‑Net: Hybrid‑fusion network for multi‑modal MR image synthesis. IEEE Transactions on Medical Imaging, 39(9), 2772–2781.
DOI: https://doi.org/10.5334/tismir.221 | Journal eISSN: 2514-3298
Language: English
Submitted on: Sep 2, 2024
Accepted on: Jun 9, 2025
Published on: Jul 21, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Sujoy Roychowdhury, Preeti Rao, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.