
Investigating Auditory–Visual Perception Using Multi-Modal Neural Networks with the SoundActions Dataset

Open Access | Mar 2026

References

  1. Adami, E. (2016). Introducing multimodality. In The Oxford handbook of language and society (pp. 451–472).
  2. Akbari, H., Yuan, L., Qian, R., Chuang, W.‑H., Chang, S.‑F., Cui, Y., and Gong, B. (2021). VATT: Transformers for multimodal self‑supervised learning from raw video, audio and text. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems (NeurIPS) (Vol. 34, pp. 24206–24221). Curran Associates, Inc.
  3. Alwassel, H., Mahajan, D., Korbar, B., Torresani, L., Ghanem, B., and Tran, D. (2020). Self‑supervised learning by cross‑modal audio‑video clustering. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Advances in Neural Information Processing Systems (NeurIPS) (Vol. 33, pp. 9758–9770). Curran Associates, Inc.
  4. Arandjelovic, R., and Zisserman, A. (2017). Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) (pp. 609–617).
  5. Aytar, Y., Vondrick, C., and Torralba, A. (2016). SoundNet: Learning sound representations from unlabeled video. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Advances in Neural Information Processing Systems (NeurIPS) (Vol. 29). Curran Associates, Inc.
  6. Bagher Zadeh, A., Liang, P. P., Poria, S., Cambria, E., and Morency, L.‑P. (2018). Multimodal language analysis in the wild: CMU‑MOSEI dataset and interpretable dynamic fusion graph. In I. Gurevych and Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2236–2246). Melbourne, Australia: Association for Computational Linguistics.
  7. Baltrušaitis, T., Ahuja, C., and Morency, L.‑P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443.
  8. Barreiro, D. L. (2010). Sonic image and acousmatic listening. Organised Sound, 15(1), 35–42.
  9. Batziou, E., Michail, E., Avgerinakis, K., Vrochidis, S., Patras, I., and Kompatsiaris, I. (2018). Visual and audio analysis of movies video for emotion detection. In The Emotional Impact of Movies task at MediaEval 2018.
  10. Biewald, L. (2020). Experiment tracking with Weights & Biases. Software. https://www.wandb.com.
  11. Chen, H., Xie, W., Vedaldi, A., and Zisserman, A. (2020). VGGSound: A large‑scale audio‑visual dataset. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 721–725). Barcelona, Spain: IEEE.
  12. Chen, K., Du, X., Zhu, B., Ma, Z., Berg‑Kirkpatrick, T., and Dubnov, S. (2022). HTS‑AT: A hierarchical token‑semantic audio transformer for sound classification and detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 646–650). Singapore: IEEE.
  13. Chion, M. (1983). Guide des objets sonores: Pierre Schaeffer et la recherche musicale. Bibliothèque de recherche musicale. Buchet/Chastel.
  14. Chion, M., and Murch, W. (2019a). Audio‑Vision: Sound on Screen. Columbia University Press.
  15. Chion, M., and Murch, W. (2019b). Lines and Points: Horizontal and Vertical Perspectives on Audiovisual Relations (pp. 35–65). Columbia University Press.
  16. Chion, M., and Murch, W. (2019c). The Three Listening Modes (pp. 22–34). Columbia University Press.
  17. Christodoulou, A.‑M., Lartillot, O., and Jensenius, A. R. (2024). Multimodal music datasets? Challenges and future goals in music processing. International Journal of Multimedia Information Retrieval, 13(3), 37.
  18. Couprie, P. (2003). La musique électroacoustique: analyse morphologique et représentation analytique [Doctoral thesis]. Université Paris‑Sorbonne.
  19. Damen, D., Doughty, H., Farinella, G. M., Furnari, A., Ma, J., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., and Wray, M. (2022). Rescaling egocentric vision: Collection, pipeline and challenges for EPIC‑KITCHENS‑100. International Journal of Computer Vision (IJCV), 130, 33–55.
  20. Delalande, F., Formosa, M., Frémiot, M., Gobin, P., Malbosc, P., Mandelbrojt, J., and Pedler, E. (1996). Les unités sémiotiques temporelles: éléments nouveaux d’analyse musicale. Edition MIM.
  21. Deldjoo, Y., Kille, B., Schedl, M., Lommatzsch, A., and Shen, J. (2019). The 2019 multimedia for recommender system task: MovieREC and NewsREEL at MediaEval. In MediaEval, Sophia Antipolis, France.
  22. Deng, J., Dong, W., Socher, R., Li, L.‑J., Li, K., and Fei‑Fei, L. (2009). ImageNet: A large‑scale hierarchical image database. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 248–255). Miami Beach, FL, USA: IEEE.
  23. Duan, H., Xia, Y., Mingze, Z., Tang, L., Zhu, J., and Zhao, Z. (2023). Cross‑modal prompts: Adapting large pre‑trained models for audio‑visual downstream tasks. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Advances in Neural Information Processing Systems (NeurIPS) (Vol. 36, pp. 56075–56094). Curran Associates, Inc.
  24. Falcon, W., and The PyTorch Lightning team. (2019). PyTorch Lightning. Software.
  25. Ganaie, M. A., Hu, M., Malik, A. K., Tanveer, M., and Suganthan, P. N. (2022). Ensemble deep learning: A review. Engineering Applications of Artificial Intelligence, 115, 105151.
  26. Gaver, W. W. (1993). What in the world do we hear?: An ecological approach to auditory event perception. Ecological Psychology, 5(1), 1–29.
  27. Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. (2017). Audio Set: An ontology and human‑labeled dataset for audio events. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 776–780). New Orleans, LA, USA.
  28. Geslin, Y., Mullon, P., and Jacob, M. (2002). Ecrins: An audio‑content description environment for sound samples. In Proceedings of the 2002 International Computer Music Conference (ICMC), Gothenburg, Sweden.
  29. He, J., Zhou, C., Ma, X., Berg‑Kirkpatrick, T., and Neubig, G. (2022). Towards a unified view of parameter‑efficient transfer learning. In International Conference on Learning Representations (ICLR) (virtual conference).
  30. Holbrook, U. A. (2022). Sonic design and spatial features. In Proceedings of the International Conference in Sonic Design: Explorations between Art and Science (pp. 175–191). Springer.
  31. Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. (2019). Parameter‑efficient transfer learning for NLP. In K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research (pp. 2790–2799). Long Beach, CA, USA: PMLR.
  32. Hu, E. J., Shen, Y., Wallis, P., Allen‑Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2022). LoRA: Low‑rank adaptation of large language models. In International Conference on Learning Representations (ICLR) (virtual conference).
  33. Huang, Y., Lin, J., Zhou, C., Yang, H., and Huang, L. (2022). Modality competition: What makes joint training of multi‑modal network fail in deep learning? (Provably). In Proceedings of the 39th International Conference on Machine Learning (ICML) (Vol. 162, pp. 9226–9259). Proceedings of Machine Learning Research, Baltimore, MD, USA: PMLR.
  34. Huh, J., Chalk, J., Kazakos, E., Damen, D., and Zisserman, A. (2023). Epic‑sounds: A large‑scale dataset of actions that sound. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5).
  35. Jensenius, A. R. (2022). Sound Actions: Conceptualizing Musical Instruments. MIT Press.
  36. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., and Zisserman, A. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
  37. Kim, S., Yang, H., Kim, Y., Hong, Y., and Park, E. (2024). Hydra: Multi‑head low‑rank adaptation for parameter efficient fine‑tuning. Neural Networks, 178, 106414.
  38. Kim, W., Son, B., and Kim, I. (2021). ViLT: Vision‑and‑language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML) (pp. 5583–5594). PMLR.
  39. Korbar, B., Tran, D., and Torresani, L. (2018). Cooperative learning of audio and video models from self‑supervised synchronization. In Advances in Neural Information Processing Systems (NeurIPS), 31.
  40. Kwok, C. Y., Li, S., Yip, J. Q., and Chng, E. S. (2024). Low‑resource language adaptation with ensemble of PEFT approaches. In Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 1–6). Macau, China: IEEE.
  41. Lartillot, O. (2024). Musicological and technological perspectives on computational analysis of electroacoustic music. In A. R. Jensenius (Ed.), Proceedings of the International Conference in Sonic Design: Explorations Between Art and Science (pp. 271–297). Cham: Springer Nature Switzerland.
  42. Lin, T.‑H., Wang, H.‑S., Weng, H.‑Y., Peng, K.‑C., Chen, Z.‑C., and Lee, H.‑y. (2024). PEFT for speech: Unveiling optimal placement, merging strategies, and ensemble techniques. In IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (pp. 705–709). Seoul, Korea: IEEE.
  43. Lin, Y.‑B., Sung, Y.‑L., Lei, J., Bansal, M., and Bertasius, G. (2023). Vision transformers are parameter‑efficient audio‑visual learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2299–2309).
  44. Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., Wei, F., and Guo, B. (2022). Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 12009–12019). New Orleans, LA, USA.
  45. Lux, M., Riegler, M., Dang‑Nguyen, D.‑T., Pirker, J., Potthast, M., and Halvorsen, P. (2019). GameStory task at MediaEval 2019. In MediaEval, Sophia Antipolis, France.
  46. Ma, S., Zeng, Z., McDuff, D., and Song, Y. (2021). Active contrastive learning of audio‑visual video representations. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria (virtual conference).
  47. Mallya, A., Davis, D., and Lazebnik, S. (2018). Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.
  48. McGurk, H., and MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588), 746–748.
  49. Meredith, M. A., and Stein, B. E. (1986). Visual, auditory, and somatosensory convergence on cells in superior colliculus results in multisensory integration. Journal of Neurophysiology, 56(3), 640–662.
  50. Nakatani, T., and Okuno, H. G. (1998). Sound ontology for computational auditory scene analysis. In Proceedings of the 15th National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence (pp. 1004–1010). Madison, WI, USA.
  51. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. (2011). Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML) (pp. 689–696). Bellevue, WA, USA.
  52. Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., and Freeman, W. T. (2016). Visually indicated sounds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2405–2413). Las Vegas, NV, USA.
  53. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., . . . Chintala, S. (2019). PyTorch: An imperative style, high‑performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché‑Buc, E. Fox, and R. Garnett (Eds.), Advances in Neural Information Processing Systems (NeurIPS) (Vol. 32). Curran Associates, Inc.
  54. Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572.
  55. Peng, X., Wei, Y., Deng, A., Wang, D., and Hu, D. (2022). Balanced multimodal learning via on‑the‑fly gradient modulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 8238–8247).
  56. Poignant, J., Bredin, H., and Barras, C. (2015). Multimodal person discovery in broadcast TV at MediaEval 2015. In MediaEval, Wurzen, Germany.
  57. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In M. Meila and T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research (pp. 8748–8763). Vienna, Austria (virtual conference). PMLR.
  58. Randolph, J. J. (2005). Free‑marginal multirater kappa (multirater κfree): An alternative to Fleiss’ fixed‑marginal multirater kappa. In Proceedings of the Joensuu Learning and Instruction Symposium, Joensuu, Finland.
  59. Roy, S. (2004). L’analyse des musiques électroacoustiques: modèles et propositions. L’Harmattan.
  60. Sager, S., Borth, D., Elizalde, B., Schulze, C., Raj, B., Lane, I., and Dengel, A. (2016). AudioSentibank: Large‑scale semantic ontology of acoustic concepts for audio content analysis. arXiv preprint arXiv:1607.03766.
  61. Salamon, J., Jacoby, C., and Bello, J. P. (2014). A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM International Conference on Multimedia (pp. 1041–1044). Lisbon, Portugal.
  62. Schaeffer, P. (1969). Traité des objets musicaux. Revue de Métaphysique et de Morale, 74(3).
  63. Schaeffer, P., Reibel, G., Ferreyra, B., Chiarucci, H., Bayle, F., Tanguy, A., Ducarme, J.‑L., Pontefract, J.‑F., and Schwarz, J. (1967). Solfège de l’objet sonore. INA GRM.
  64. Schafer, R. M. (1993). The soundscape: Our sonic environment and the tuning of the world. Simon and Schuster.
  65. Schomaker, L. (1995). A taxonomy of multimodal interaction in the human information processing system. Technical report, NICI Institute, Nijmegen.
  66. Smalley, D. (1986). Spectro‑morphology and structuring processes. In The Language of Electroacoustic Music (pp. 61–93). Springer.
  67. Sun, Z., Ke, Q., Rahmani, H., Bennamoun, M., Wang, G., and Liu, J. (2022). Human action recognition from various data modalities: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3200–3225.
  68. Thoresen, L., and Hedman, A. (2007). Spectromorphological analysis of sound objects: An adaptation of Pierre Schaeffer’s typomorphology. Organised Sound, 12(2), 129–141.
  69. Thoresen, L., and Hedman, A. (2015). Emergent Musical Forms: Aural Explorations. University of Western Ontario.
  70. Tian, Y., Hu, D., and Xu, C. (2021). Cyclic co‑learning of sounding object visual grounding and sound separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2745–2754).
  71. Tian, Y., Li, D., and Xu, C. (2020). Unified multisensory perception: Weakly‑supervised audio‑visual video parsing. In A. Vedaldi, H. Bischof, T. Brox, and J.‑M. Frahm (Eds.), Computer Vision (ECCV 2020) (pp. 436–454). Springer International Publishing.
  72. Tian, Y., Shi, J., Li, B., Duan, Z., and Xu, C. (2018). Audio‑visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 247–263).
  73. Wei, Y., and Hu, D. (2024). MMPareto: Boosting multimodal learning with innocent unimodal assistance. In Proceedings of the 41st International Conference on Machine Learning (ICML) (pp. 52559–52572). PMLR.
  74. Wei, Y., Hu, D., Tian, Y., and Li, X. (2022). Learning in audio‑visual context: A review, analysis, and new perspective. arXiv preprint arXiv:2208.09579.
  75. Xia, W., Zhao, X., Pang, X., Zhang, C., and Hu, D. (2023). Balanced audiovisual dataset for imbalance analysis. arXiv preprint arXiv:2302.10912.
  76. Xia, Y., and Zhao, Z. (2022). Cross‑modal background suppression for audio‑visual event localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 19989–19998). New Orleans, LA, USA.
  77. Xu, L., Xie, H., Qin, S.‑Z. J., Tao, X., and Wang, F. L. (2023). Parameter‑efficient fine‑tuning methods for pretrained language models: A critical review and assessment. arXiv preprint arXiv:2312.12148.
  78. Yao, L., Huang, R., Hou, L., Lu, G., Niu, M., Xu, H., Liang, X., Li, Z., Jiang, X., and Xu, C. (2021). FILIP: Fine‑grained interactive language‑image pre‑training. arXiv preprint arXiv:2111.07783.
  79. Yuhas, B. P., Goldstein, M. H., and Sejnowski, T. J. (1989). Integration of acoustic and visual speech signals using neural networks. IEEE Communications Magazine, 27(11), 65–71.
  80. Zellers, R., Lu, J., Lu, X., Yu, Y., Zhao, Y., Salehi, M., Kusupati, A., Hessel, J., Farhadi, A., and Choi, Y. (2022). MERLOT Reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 16375–16387). New Orleans, LA, USA.
  81. Zhou, J., Wang, J., Zhang, J., Sun, W., Zhang, J., Birchfield, S., Guo, D., Kong, L., Wang, M., and Zhong, Y. (2022). Audio–visual segmentation. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 386–403). Springer.
DOI: https://doi.org/10.5334/tismir.223 | Journal eISSN: 2514-3298
Language: English
Submitted on: Sep 1, 2024
Accepted on: Sep 17, 2025
Published on: Mar 11, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Jinyue Guo, Jim Tørresen, Alexander Refsum Jensenius, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.