
Investigating Auditory–Visual Perception Using Multi-Modal Neural Networks with the SoundActions Dataset

Open Access | Mar 2026

Abstract

Musicologists, psychologists, and computer scientists study the relationships between auditory and visual stimuli from very different perspectives, using distinct terminologies and methodologies. This article aims to bridge the gap between phenomenological sound theory, auditory–visual theory, and audio–video processing and machine learning. We introduce the SoundActions dataset, a collection of 365 audio–video recordings of (primarily) short sound actions. Each recording has been human-labeled and annotated according to Pierre Schaeffer’s theory of reduced listening, which describes the properties of the sound itself (e.g., ‘an impulsive sound’) rather than its source (e.g., ‘a bird sound’). Using these reduced-type labels, we conducted two experiments: (1) fine-tuning a state-of-the-art audio–video transformer model on the reduced-type labels in the SoundActions dataset, showing that the model can recognize reduced-type labels and observing a modality-imbalance phenomenon similar to Michel Chion’s theory of added value, and (2) proposing the Ensemble of Perception Mode Adapters method, inspired by Pierre Schaeffer’s three listening modes, which further improves the audio–video model on reduced-type tasks.

DOI: https://doi.org/10.5334/tismir.223 | Journal eISSN: 2514-3298
Language: English
Submitted on: Sep 1, 2024
Accepted on: Sep 17, 2025
Published on: Mar 11, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Jinyue Guo, Jim Tørresen, Alexander Refsum Jensenius, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.