
Investigating Auditory–Visual Perception Using Multi-Modal Neural Networks with the SoundActions Dataset

Open Access | Mar 2026

References

  1. Adami, E. (2016). Introducing multimodality. In The Oxford handbook of language and society (pp. 451–472).
  2. Akbari, H., Yuan, L., Qian, R., Chuang, W.‑H., Chang, S.‑F., Cui, Y., and Gong, B. (2021). VATT: Transformers for multimodal self‑supervised learning from raw video, audio and text. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems (NeurIPS) (Vol. 34, pp. 24206–24221). Curran Associates, Inc.
  3. Alwassel, H., Mahajan, D., Korbar, B., Torresani, L., Ghanem, B., and Tran, D. (2020). Self‑supervised learning by cross‑modal audio‑video clustering. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Advances in Neural Information Processing Systems (NeurIPS) (Vol. 33, pp. 9758–9770). Curran Associates, Inc.
  4. Arandjelovic, R., and Zisserman, A. (2017). Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) (pp. 609–617).
  5. Aytar, Y., Vondrick, C., and Torralba, A. (2016). SoundNet: Learning sound representations from unlabeled video. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Advances in Neural Information Processing Systems (NeurIPS) (Vol. 29). Curran Associates, Inc.
  6. Bagher Zadeh, A., Liang, P. P., Poria, S., Cambria, E., and Morency, L.‑P. (2018). Multimodal language analysis in the wild: CMU‑MOSEI dataset and interpretable dynamic fusion graph. In I. Gurevych and Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2236–2246). Melbourne, Australia: Association for Computational Linguistics.
  7. Baltrušaitis, T., Ahuja, C., and Morency, L.‑P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443.
  8. Barreiro, D. L. (2010). Sonic image and acousmatic listening. Organised Sound, 15(1), 35–42.
  9. Batziou, E., Michail, E., Avgerinakis, K., Vrochidis, S., Patras, I., and Kompatsiaris, I. (2018). Visual and audio analysis of movies video for emotion detection. In The Emotional Impact of Movies task at MediaEval 2018.
  10. Biewald, L. (2020). Experiment tracking with Weights & Biases. Software. https://www.wandb.com.
  11. Chen, H., Xie, W., Vedaldi, A., and Zisserman, A. (2020). VGGSound: A large‑scale audio‑visual dataset. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 721–725). Barcelona, Spain: IEEE.
  12. Chen, K., Du, X., Zhu, B., Ma, Z., Berg‑Kirkpatrick, T., and Dubnov, S. (2022). HTS‑AT: A hierarchical token‑semantic audio transformer for sound classification and detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 646–650). Singapore: IEEE.
  13. Chion, M. (1983). Guide des objets sonores: Pierre Schaeffer et la recherche musicale. Bibliothèque de recherche musicale. Buchet/Chastel.
  14. Chion, M., and Murch, W. (2019a). Audio‑Vision: Sound on Screen. Columbia University Press.
  15. Chion, M., and Murch, W. (2019b). Lines and Points: Horizontal and Vertical Perspectives on Audiovisual Relations (pp. 35–65). Columbia University Press.
  16. Chion, M., and Murch, W. (2019c). The Three Listening Modes (pp. 22–34). Columbia University Press.
  17. Christodoulou, A.‑M., Lartillot, O., and Jensenius, A. R. (2024). Multimodal music datasets? Challenges and future goals in music processing. International Journal of Multimedia Information Retrieval, 13(3), 37.
  18. Couprie, P. (2003). La musique électroacoustique: analyse morphologique et représentation analytique [Doctoral thesis]. Université Paris‑Sorbonne.
  19. Damen, D., Doughty, H., Farinella, G. M., Furnari, A., Ma, J., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., and Wray, M. (2022). Rescaling egocentric vision: Collection, pipeline and challenges for EPIC‑KITCHENS‑100. International Journal of Computer Vision (IJCV), 130, 33–55.
  20. Delalande, F., Formosa, M., Frémiot, M., Gobin, P., Malbosc, P., Mandelbrojt, J., and Pedler, E. (1996). Les unités sémiotiques temporelles: éléments nouveaux d’analyse musicale. Edition MIM.
  21. Deldjoo, Y., Kille, B., Schedl, M., Lommatzsch, A., and Shen, J. (2019). The 2019 multimedia for recommender system task: MovieREC and NewsREEL at MediaEval. In MediaEval, Sophia Antipolis, France.
  22. Deng, J., Dong, W., Socher, R., Li, L.‑J., Li, K., and Fei‑Fei, L. (2009). ImageNet: A large‑scale hierarchical image database. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 248–255). Miami Beach, FL, USA: IEEE.
  23. Duan, H., Xia, Y., Mingze, Z., Tang, L., Zhu, J., and Zhao, Z. (2023). Cross‑modal prompts: Adapting large pre‑trained models for audio‑visual downstream tasks. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Advances in Neural Information Processing Systems (NeurIPS) (Vol. 36, pp. 56075–56094). Curran Associates, Inc.
  24. Falcon, W., and The PyTorch Lightning team. (2019). PyTorch Lightning. Software.
  25. Ganaie, M. A., Hu, M., Malik, A. K., Tanveer, M., and Suganthan, P. N. (2022). Ensemble deep learning: A review. Engineering Applications of Artificial Intelligence, 115, 105151.
  26. Gaver, W. W. (1993). What in the world do we hear?: An ecological approach to auditory event perception. Ecological Psychology, 5(1), 1–29.
  27. Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. (2017). Audio Set: An ontology and human‑labeled dataset for audio events. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 776–780). New Orleans, LA, USA.
  28. Geslin, Y., Mullon, P., and Jacob, M. (2002). Ecrins: An audio‑content description environment for sound samples. In Proceedings of the 2002 International Computer Music Conference (ICMC), Gothenburg, Sweden.
  29. He, J., Zhou, C., Ma, X., Berg‑Kirkpatrick, T., and Neubig, G. (2022). Towards a unified view of parameter‑efficient transfer learning. In International Conference on Learning Representations (ICLR) (virtual conference).
  30. Holbrook, U. A. (2022). Sonic design and spatial features. In Proceedings of the International Conference in Sonic Design: Explorations between Art and Science (pp. 175–191). Springer.
  31. Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. (2019). Parameter‑efficient transfer learning for NLP. In K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research (pp. 2790–2799). Long Beach, CA, USA: PMLR.
  32. Hu, E. J., Shen, Y., Wallis, P., Allen‑Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2022). LoRA: Low‑rank adaptation of large language models. In International Conference on Learning Representations (ICLR) (virtual conference).
  33. Huang, Y., Lin, J., Zhou, C., Yang, H., and Huang, L. (2022). Modality competition: What makes joint training of multi‑modal network fail in deep learning? (Provably). In Proceedings of the 39th International Conference on Machine Learning (ICML) (Vol. 162, pp. 9226–9259). Proceedings of Machine Learning Research, Baltimore, MD, USA: PMLR.
  34. Huh, J., Chalk, J., Kazakos, E., Damen, D., and Zisserman, A. (2023). Epic‑sounds: A large‑scale dataset of actions that sound. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5).
  35. Jensenius, A. R. (2022). Sound Actions: Conceptualizing Musical Instruments. MIT Press.
  36. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., and Zisserman, A. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
  37. Kim, S., Yang, H., Kim, Y., Hong, Y., and Park, E. (2024). Hydra: Multi‑head low‑rank adaptation for parameter efficient fine‑tuning. Neural Networks, 178, 106414.
  38. Kim, W., Son, B., and Kim, I. (2021). ViLT: Vision‑and‑language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML) (pp. 5583–5594). PMLR.
  39. Korbar, B., Tran, D., and Torresani, L. (2018). Cooperative learning of audio and video models from self‑supervised synchronization. In Advances in Neural Information Processing Systems (NeurIPS), 31.
  40. Kwok, C. Y., Li, S., Yip, J. Q., and Chng, E. S. (2024). Low‑resource language adaptation with ensemble of PEFT approaches. In Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 1–6). Macau, China: IEEE.
  41. Lartillot, O. (2024). Musicological and technological perspectives on computational analysis of electroacoustic music. In A. R. Jensenius (Ed.), Proceedings of the International Conference in Sonic Design: Explorations Between Art and Science (pp. 271–297). Cham: Springer Nature Switzerland.
  42. Lin, T.‑H., Wang, H.‑S., Weng, H.‑Y., Peng, K.‑C., Chen, Z.‑C., and Lee, H.‑y. (2024). PEFT for speech: Unveiling optimal placement, merging strategies, and ensemble techniques. In IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (pp. 705–709). Seoul, Korea: IEEE.
  43. Lin, Y.‑B., Sung, Y.‑L., Lei, J., Bansal, M., and Bertasius, G. (2023). Vision transformers are parameter‑efficient audio‑visual learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2299–2309).
  44. Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., Wei, F., and Guo, B. (2022). Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 12009–12019). New Orleans, LA, USA.
  45. Lux, M., Riegler, M., Dang‑Nguyen, D.‑T., Pirker, J., Potthast, M., and Halvorsen, P. (2019). GameStory task at MediaEval 2019. In MediaEval, Sophia Antipolis, France.
  46. Ma, S., Zeng, Z., McDuff, D., and Song, Y. (2021). Active contrastive learning of audio‑visual video representations. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria (virtual conference).
  47. Mallya, A., Davis, D., and Lazebnik, S. (2018). Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.
  48. McGurk, H., and MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588), 746–748.
  49. Meredith, M. A., and Stein, B. E. (1986). Visual, auditory, and somatosensory convergence on cells in superior colliculus results in multisensory integration. Journal of Neurophysiology, 56(3), 640–662.
  50. Nakatani, T., and Okuno, H. G. (1998). Sound ontology for computational auditory scene analysis. In Proceedings of the 15th National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence (pp. 1004–1010). Madison, WI, USA.
  51. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. (2011). Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML) (pp. 689–696). Bellevue, WA, USA.
  52. Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., and Freeman, W. T. (2016). Visually indicated sounds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2405–2413). Las Vegas, NV, USA.
  53. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., . . . Chintala, S. (2019). PyTorch: An imperative style, high‑performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché‑Buc, E. Fox, and R. Garnett (Eds.), Advances in Neural Information Processing Systems (NeurIPS) (Vol. 32). Curran Associates, Inc.
  54. Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572.
  55. Peng, X., Wei, Y., Deng, A., Wang, D., and Hu, D. (2022). Balanced multimodal learning via on‑the‑fly gradient modulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 8238–8247).
  56. Poignant, J., Bredin, H., and Barras, C. (2015). Multimodal person discovery in broadcast TV at MediaEval 2015. In MediaEval, Wurzen, Germany.
  57. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In M. Meila and T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research (pp. 8748–8763). Vienna, Austria (virtual conference). PMLR.
  58. Randolph, J. J. (2005). Free‑marginal multirater kappa (multirater κfree): An alternative to Fleiss’ fixed‑marginal multirater kappa. In Proceedings of the Joensuu Learning and Instruction Symposium, Joensuu, Finland.
  59. Roy, S. (2004). L’analyse des musiques électroacoustiques: modèles et propositions. L’Harmattan.
  60. Sager, S., Borth, D., Elizalde, B., Schulze, C., Raj, B., Lane, I., and Dengel, A. (2016). AudioSentibank: Large‑scale semantic ontology of acoustic concepts for audio content analysis. arXiv preprint arXiv:1607.03766.
  61. Salamon, J., Jacoby, C., and Bello, J. P. (2014). A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM International Conference on Multimedia (pp. 1041–1044). Lisbon, Portugal.
  62. Schaeffer, P. (1969). Traité des objets musicaux. Revue de Métaphysique et de Morale, 74(3).
  63. Schaeffer, P., Reibel, G., Ferreyra, B., Chiarucci, H., Bayle, F., Tanguy, A., Ducarme, J.‑L., Pontefract, J.‑F., and Schwarz, J. (1967). Solfège de l’objet sonore. INA GRM.
  64. Schafer, R. M. (1993). The soundscape: Our sonic environment and the tuning of the world. Simon and Schuster.
  65. Schomaker, L. (1995). A taxonomy of multimodal interaction in the human information processing system. Technical report, NICI Institute, Nijmegen.
  66. Smalley, D. (1986). Spectro‑morphology and structuring processes. In The Language of Electroacoustic Music (pp. 61–93). Springer.
  67. Sun, Z., Ke, Q., Rahmani, H., Bennamoun, M., Wang, G., and Liu, J. (2022). Human action recognition from various data modalities: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3200–3225.
  68. Thoresen, L., and Hedman, A. (2007). Spectromorphological analysis of sound objects: An adaptation of Pierre Schaeffer’s typomorphology. Organised Sound, 12(2), 129–141.
  69. Thoresen, L., and Hedman, A. (2015). Emergent Musical Forms: Aural Explorations. University of Western Ontario.
  70. Tian, Y., Hu, D., and Xu, C. (2021). Cyclic co‑learning of sounding object visual grounding and sound separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2745–2754).
  71. Tian, Y., Li, D., and Xu, C. (2020). Unified multisensory perception: Weakly‑supervised audio‑visual video parsing. In A. Vedaldi, H. Bischof, T. Brox, and J.‑M. Frahm (Eds.), Computer Vision (ECCV 2020) (pp. 436–454). Springer International Publishing.
  72. Tian, Y., Shi, J., Li, B., Duan, Z., and Xu, C. (2018). Audio‑visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 247–263).
  73. Wei, Y., and Hu, D. (2024). MMPareto: Boosting multimodal learning with innocent unimodal assistance. In Proceedings of the 41st International Conference on Machine Learning (ICML) (pp. 52559–52572). PMLR.
  74. Wei, Y., Hu, D., Tian, Y., and Li, X. (2022). Learning in audio‑visual context: A review, analysis, and new perspective. arXiv preprint arXiv:2208.09579.
  75. Xia, W., Zhao, X., Pang, X., Zhang, C., and Hu, D. (2023). Balanced audiovisual dataset for imbalance analysis. arXiv preprint arXiv:2302.10912.
  76. Xia, Y., and Zhao, Z. (2022). Cross‑modal background suppression for audio‑visual event localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 19989–19998). New Orleans, LA, USA.
  77. Xu, L., Xie, H., Qin, S.‑Z. J., Tao, X., and Wang, F. L. (2023). Parameter‑efficient fine‑tuning methods for pretrained language models: A critical review and assessment. arXiv preprint arXiv:2312.12148.
  78. Yao, L., Huang, R., Hou, L., Lu, G., Niu, M., Xu, H., Liang, X., Li, Z., Jiang, X., and Xu, C. (2021). FILIP: Fine‑grained interactive language‑image pre‑training. arXiv preprint arXiv:2111.07783.
  79. Yuhas, B. P., Goldstein, M. H., and Sejnowski, T. J. (1989). Integration of acoustic and visual speech signals using neural networks. IEEE Communications Magazine, 27(11), 65–71.
  80. Zellers, R., Lu, J., Lu, X., Yu, Y., Zhao, Y., Salehi, M., Kusupati, A., Hessel, J., Farhadi, A., and Choi, Y. (2022). MERLOT Reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 16375–16387). New Orleans, LA, USA.
  81. Zhou, J., Wang, J., Zhang, J., Sun, W., Zhang, J., Birchfield, S., Guo, D., Kong, L., Wang, M., and Zhong, Y. (2022). Audio–visual segmentation. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 386–403). Springer.
DOI: https://doi.org/10.5334/tismir.223 | Journal eISSN: 2514-3298
Language: English
Submitted on: Sep 1, 2024
Accepted on: Sep 17, 2025
Published on: Mar 11, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Jinyue Guo, Jim Tørresen, Alexander Refsum Jensenius, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.