
Investigating Auditory–Visual Perception Using Multi-Modal Neural Networks with the SoundActions Dataset

Open Access | Mar 2026

Figures & Tables

Figure 1

A sketch of the three main sound types in Pierre Schaeffer’s typology (impulsive, sustained, and iterative) with related action profiles (Jensenius, 2022). Dashed vertical lines show the perceived onset and end points.

Table 1

Popular labeled audio‑visual datasets. In the ‘Label modality’ column, ‘audio&video’ means some labels are based on audio information and others on video information, while ‘combined’ means labels are based on the combined information of both modalities. In the ‘Perception mode’ and ‘Label ontology’ columns, we use colors as an additional indicator of perception type: red stands for causal labels, blue for reduced labels, and green for semantic labels. Although ‘emotion’ was not part of Schaeffer’s listening‑mode framework, we loosely regard it as a mixture of causal and semantic information.

| Dataset | Year | # of Clips | Total Duration | Source | Label Modality | Perception Mode | Label Ontology |
|---|---|---|---|---|---|---|---|
| AudioSet | 2017 | 2M+ | 5,800+ h | YouTube | audio | causal | events |
| Kinetics‑400 | 2017 | 306k+ | 850+ h | YouTube | combined | causal | actions |
| EPIC‑KITCHENS | 2018 | 39,594 | 55 h | original | audio&video | causal | objects/actions |
| CMU‑MOSEI | 2018 | 2,199 | 2+ h | YouTube | combined | causal/semantic | emotion |
| AVE | 2018 | 4,143 | 11+ h | YouTube | combined | causal | events |
| LLP | 2020 | 11,849 | 32.9 h | YouTube | audio&video | causal | events |
| VGGSound | 2020 | 200k+ | 550+ h | YouTube | combined | causal | events |
| SSW60 | 2022 | 9.2k | 25.7 h | original | combined | causal | events |
| SoundActions | – | 365 | 1 h | original | combined | causal+reduced | events, objects, actions, environment, enjoyability, perception type |
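For readers who want to filter such datasets programmatically, the following is a minimal Python sketch encoding the table’s perception‑mode taxonomy. It is not part of the paper’s code release; all names here are illustrative.

```python
# Illustrative encoding of Table 1's taxonomy (not the authors' code).
from dataclasses import dataclass
from enum import Enum

class PerceptionMode(Enum):
    CAUSAL = "causal"      # labels name the sound-producing event, object, or action
    REDUCED = "reduced"    # labels describe qualities of the sound itself
    SEMANTIC = "semantic"  # labels carry interpreted meaning (e.g., emotion)

@dataclass
class DatasetEntry:
    name: str
    n_clips: str           # kept as strings: the survey reports "2M+", "9.2k", etc.
    total_duration: str
    label_modality: str    # "audio", "combined", or "audio&video"
    perception_modes: tuple

soundactions = DatasetEntry(
    name="SoundActions", n_clips="365", total_duration="1 h",
    label_modality="combined",
    perception_modes=(PerceptionMode.CAUSAL, PerceptionMode.REDUCED),
)
print([m.value for m in soundactions.perception_modes])
```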
Figure 2

Thumbnails of the SoundActions dataset.

Figure 3

Histogram of the durations of SoundActions samples. Sample counts are shown on a logarithmic scale to reveal the long tail.
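A minimal matplotlib sketch of such a log‑scaled histogram, assuming a `durations` list of per‑clip lengths in seconds; this is not the authors’ plotting code.

```python
# Sketch of a duration histogram with log-scaled counts, as in Figure 3.
import matplotlib.pyplot as plt

def plot_duration_histogram(durations, bins=30):
    """Plot clip durations; log-scaled counts keep the long tail visible."""
    fig, ax = plt.subplots()
    ax.hist(durations, bins=bins)
    ax.set_yscale("log")  # a handful of very long clips would otherwise vanish
    ax.set_xlabel("Duration (s)")
    ax.set_ylabel("Sample count (log scale)")
    return fig
```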

Figure 4

Recording a sound action using a lightweight setup with a mobile phone equipped with a USB microphone.

Table 2

Statistics of the reduced‑type labels in the SoundActions dataset.

| | PerceptionType | Enjoyability |
|---|---|---|
| Class counts | Impulsive: 124 | Yes: 49 |
| | Sustained: 84 | Neutral: 203 |
| | Iterative: 125 | No: 72 |
| | No majority: 32 | No majority: 41 |
| Multirater κ_free | 0.46 | 0.20 |
| Agreed ≥ 2 | 91.5% | 88.5% |
| Agreed = 3 | 43.3% | 25.4% |
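Assuming κ_free denotes the free‑marginal multirater kappa (chance agreement fixed at 1/q for q categories) with three raters per clip, a minimal computation sketch follows; it is not taken from the paper’s code release.

```python
# Sketch of free-marginal multirater kappa. `counts[i][c]` is the number
# of raters assigning item i to category c; the rater count per item is
# assumed constant (three raters, matching the "Agreed = 3" row).
def free_marginal_kappa(counts):
    q = len(counts[0])          # number of categories
    n = sum(counts[0])          # raters per item (assumed constant)
    # Observed agreement: average pairwise agreement across items.
    p_obs = sum(
        sum(c * (c - 1) for c in row) / (n * (n - 1)) for row in counts
    ) / len(counts)
    p_exp = 1.0 / q             # chance agreement under free marginals
    return (p_obs - p_exp) / (1 - p_exp)

# Toy example: 3 raters, 3 categories; rows are per-clip label counts.
ratings = [[3, 0, 0], [2, 1, 0], [1, 1, 1]]
print(round(free_marginal_kappa(ratings), 2))
```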
Figure 5

Three factors of the audio–video fine‑tuning experiment: (1) fine‑tuning range, (2) modality combination, and (3) label type. All combinations of the three factors were tested on the SoundActions dataset with five‑fold cross‑validation.
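A minimal sketch of enumerating this 2 × 5 × 2 factorial grid with five folds, assuming the levels shown in Table 3 (‘cls’/‘all’ fine‑tuning ranges, five modality pairs, two label types); `run_experiment` is a hypothetical stand‑in for the paper’s training code.

```python
# Sketch of the factorial design in Figure 5 (levels inferred from Table 3).
from itertools import product

FINE_TUNE_RANGES = ["cls", "all"]   # assumed: classifier head only vs. all parameters
MODALITY_COMBOS = [                 # (fine-tune modality, validation modality)
    ("av", "av"), ("a", "a"), ("v", "v"), ("av", "a"), ("av", "v"),
]
LABEL_TYPES = ["PerceptionType", "Enjoyability"]

def run_experiment(ft_range, ft_mod, val_mod, label, fold):
    """Hypothetical stand-in for one fine-tuning run."""
    ...

for ft_range, (ft_mod, val_mod), label in product(
        FINE_TUNE_RANGES, MODALITY_COMBOS, LABEL_TYPES):
    for fold in range(5):           # five-fold cross-validation
        run_experiment(ft_range, ft_mod, val_mod, label, fold)
```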

Figure 6

Left: Classic audio‑visual adapter structure used by LAVisH (Lin et al., 2023) and DG‑SCT (Duan et al., 2023). Right: Ensemble of Perception Mode Adapters (EoPMA model). The causal adapters are trained on the AVE dataset (Tian et al., 2018); the reduced adapters are trained on reduced labels in the SoundActions dataset.
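Figure 6 only sketches the fusion; as a rough illustration of the parallel‑branch idea, the following PyTorch‑style snippet assumes a simple β‑weighted sum of the two adapter branches. The paper’s actual fusion rule is given by its Equations 2 and 4, which are not reproduced here.

```python
# Rough sketch of an ensemble of perception-mode adapter branches.
# The weighted-sum fusion with `beta` is our assumption, not the paper's rule.
import torch
import torch.nn as nn

class EoPMABlock(nn.Module):
    def __init__(self, causal_adapter, reduced_adapter, beta=0.5):
        super().__init__()
        self.causal = causal_adapter    # branch trained on AVE (causal labels)
        self.reduced = reduced_adapter  # branch trained on SoundActions reduced labels
        self.beta = beta

    def forward(self, x):
        # Blend the two branches; the exact fusion is an assumption here.
        return self.causal(x) + self.beta * self.reduced(x)

# Toy usage with linear layers standing in for the real adapter modules.
block = EoPMABlock(nn.Linear(16, 16), nn.Linear(16, 16), beta=0.3)
out = block(torch.randn(2, 16))
```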

Table 3

Five‑fold fine‑tuning results of DG‑SCT with EoPMA on SoundActions.

| Fine‑Tune Type | Fine‑Tune Modality | Validation Modality | PerceptionType: Group | PerceptionType: Mean (Each Fold) | Enjoyability: Group | Enjoyability: Mean (Each Fold) |
|---|---|---|---|---|---|---|
| cls | av | av | A1 | 0.54 (0.54, 0.57, 0.55, 0.51, 0.51) | B1 | 0.58 (0.57, 0.55, 0.58, 0.57, 0.60) |
| cls | a | a | A2 | 0.55 (0.53, 0.54, 0.56, 0.58, 0.55) | B2 | 0.57 (0.60, 0.58, 0.56, 0.56, 0.59) |
| cls | v | v | A3 | 0.44 (0.45, 0.42, 0.51, 0.38, 0.45) | B3 | 0.57 (0.56, 0.57, 0.58, 0.56, 0.58) |
| cls | av | a | A4 | 0.35 (0.34, 0.35, 0.34, 0.36, 0.37) | B4 | 0.56 (0.55, 0.55, 0.56, 0.57, 0.59) |
| cls | av | v | A5 | 0.43 (0.41, 0.45, 0.48, 0.37, 0.44) | B5 | 0.57 (0.56, 0.58, 0.58, 0.55, 0.58) |
| all | av | av | C1 | 0.60 (0.61, 0.59, 0.62, 0.60, 0.59) | D1 | 0.59 (0.63, 0.55, 0.58, 0.56, 0.62) |
| all | a | a | C2 | 0.66 (0.62, 0.66, 0.67, 0.66, 0.68) | D2 | 0.56 (0.55, 0.55, 0.59, 0.55, 0.57) |
| all | v | v | C3 | 0.48 (0.42, 0.46, 0.48, 0.49, 0.54) | D3 | 0.57 (0.60, 0.55, 0.58, 0.55, 0.56) |
| all | av | a | C4 | 0.53 (0.43, 0.55, 0.58, 0.52, 0.55) | D4 | 0.58 (0.59, 0.59, 0.56, 0.60, 0.56) |
| all | av | v | C5 | 0.45 (0.42, 0.41, 0.51, 0.47, 0.44) | D5 | 0.56 (0.56, 0.55, 0.56, 0.55, 0.56) |
Figure 7

Qualitative principal component analysis visualization of the embedding spaces of different modality setups and tasks.

Table 4

Comparison of AVE validation accuracy across different ensemble methods, data, and labels. EoPMA: Ensemble of Perception Mode Adapters. The ‘Original DG‑SCT’ row is our re‑evaluation of the officially provided DG‑SCT AVE checkpoint.

| Ensemble Method | Ensemble Data | Fine‑Tuned SoundActions Label | Validation Accuracy on AVE: Mean (Each Run) |
|---|---|---|---|
| (Original DG‑SCT) | – | – | 81.67 |
| Ensemble of adapters | AVE (different random seed) | – | 81.79 |
| Ensemble of final embedding | SoundActions | PerceptionType | 81.81 (81.79, 81.82, 81.84) |
| Ensemble of final embedding | SoundActions | Enjoyability | 81.80 (81.77, 81.82, 81.78) |
| EoPMA | SoundActions | PerceptionType | 81.95 (82.09, 81.92, 81.87) |
| EoPMA | SoundActions | Enjoyability | 81.88 (81.84, 81.92, 81.89) |
Figure 8

Accuracy of the original DG‑SCT model (baseline), an ensemble of two DG‑SCT models with different random seeds (ensemble of baselines), and the EoPMA models (PT for PerceptionType, EJ for Enjoyability; fine‑tuned modality; ensembling modality). β is the hyperparameter of EoPMA defined in Equations 2 and 4. Accuracies are calculated on the validation set of AVE.
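Figure 8 reports accuracy as a function of β; a minimal sketch of such a sweep follows. The β grid and both callables are assumptions standing in for the paper’s EoPMA construction and AVE evaluation pipeline.

```python
# Sketch of the kind of beta sweep behind Figure 8 (not the authors' code).
def sweep_beta(build_model, evaluate, betas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return {beta: AVE validation accuracy}. `build_model(beta)` and
    `evaluate(model)` are caller-supplied, hypothetical stand-ins."""
    return {beta: evaluate(build_model(beta)) for beta in betas}

# Toy usage with dummy callables, just to show the sweep's shape.
print(sweep_beta(lambda b: b, lambda m: 0.8 + 0.01 * m))
```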

DOI: https://doi.org/10.5334/tismir.223 | Journal eISSN: 2514-3298
Language: English
Submitted on: Sep 1, 2024
Accepted on: Sep 17, 2025
Published on: Mar 11, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Jinyue Guo, Jim Tørresen, Alexander Refsum Jensenius, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.