
Investigating Auditory–Visual Perception Using Multi-Modal Neural Networks with the SoundActions Dataset

Open Access | Mar 2026

Figures & Tables

Figure 1

A sketch of the three main sound types in Pierre Schaeffer’s typology (impulsive, sustained, and iterative) with related action profiles (Jensenius, 2022). Dashed vertical lines show the perceived onset and end points.

Table 1

Popular labeled audio‑visual datasets. In the ‘Label modality’ column, ‘audio&video’ means some labels are based on audio information and others on video information, while ‘combined’ means labels are based on the combined information of both modalities. In the ‘Perception mode’ and ‘Label ontology’ columns, we use colors as an additional indicator of perception type: red stands for causal labels, blue for reduced labels, and green for semantic labels. Although ‘emotion’ was not part of Schaeffer’s listening‑mode framework, we loosely regard it as a mixture of causal and semantic information.

| Dataset | Year | # of Clips | Total Duration | Source | Label Modality | Perception Mode | Label Ontology |
|---|---|---|---|---|---|---|---|
| AudioSet | 2017 | 2M+ | 5,800+ h | YouTube | audio | causal | events |
| Kinetics‑400 | 2017 | 306k+ | 850+ h | YouTube | combined | causal | actions |
| EPIC‑KITCHENS | 2018 | 39,594 | 55 h | original | audio&video | causal | objects/actions |
| CMU‑MOSEI | 2018 | 2,199 | 2+ h | YouTube | combined | causal/semantic | emotion |
| AVE | 2018 | 4,143 | 11+ h | YouTube | combined | causal | events |
| LLP | 2020 | 11,849 | 32.9 h | YouTube | audio&video | causal | events |
| VGGSound | 2020 | 200k+ | 550+ h | YouTube | combined | causal | events |
| SSW60 | 2022 | 9.2k | 25.7 h | original | combined | causal | events |
| SoundActions | – | 365 | 1 h | original | combined | causal+reduced | events, objects, actions, environment, enjoyability, perception type |
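For readers who want to filter such datasets programmatically, the following is a minimal Python sketch encoding the table’s perception‑mode taxonomy. It is not part of the paper’s code release; all names here are illustrative.

```python
# Illustrative encoding of Table 1's taxonomy (not the authors' code).
from dataclasses import dataclass
from enum import Enum

class PerceptionMode(Enum):
    CAUSAL = "causal"      # labels name the sound-producing event, object, or action
    REDUCED = "reduced"    # labels describe qualities of the sound itself
    SEMANTIC = "semantic"  # labels carry interpreted meaning (e.g., emotion)

@dataclass
class DatasetEntry:
    name: str
    n_clips: str           # kept as strings: the survey reports "2M+", "9.2k", etc.
    total_duration: str
    label_modality: str    # "audio", "combined", or "audio&video"
    perception_modes: tuple

soundactions = DatasetEntry(
    name="SoundActions", n_clips="365", total_duration="1 h",
    label_modality="combined",
    perception_modes=(PerceptionMode.CAUSAL, PerceptionMode.REDUCED),
)
print([m.value for m in soundactions.perception_modes])
```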
Figure 2

Thumbnails of the SoundActions dataset.

Figure 3

Histogram of the durations of SoundActions samples. Sample counts are shown on a logarithmic scale to reveal the long tail.
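A minimal matplotlib sketch of such a log‑scaled histogram, assuming a `durations` list of per‑clip lengths in seconds; this is not the authors’ plotting code.

```python
# Sketch of a duration histogram with log-scaled counts, as in Figure 3.
import matplotlib.pyplot as plt

def plot_duration_histogram(durations, bins=30):
    """Plot clip durations; log-scaled counts keep the long tail visible."""
    fig, ax = plt.subplots()
    ax.hist(durations, bins=bins)
    ax.set_yscale("log")  # a handful of very long clips would otherwise vanish
    ax.set_xlabel("Duration (s)")
    ax.set_ylabel("Sample count (log scale)")
    return fig
```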

Figure 4

Recording a sound action using a lightweight setup with a mobile phone equipped with a USB microphone.

Table 2

Statistics of the reduced‑type labels in the SoundActions dataset.

| | PerceptionType | Enjoyability |
|---|---|---|
| Class counts | Impulsive: 124 | Yes: 49 |
| | Sustained: 84 | Neutral: 203 |
| | Iterative: 125 | No: 72 |
| | No majority: 32 | No majority: 41 |
| Multirater κ_free | 0.46 | 0.20 |
| Agreed ≥ 2 | 91.5% | 88.5% |
| Agreed = 3 | 43.3% | 25.4% |
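Assuming κ_free denotes the free‑marginal multirater kappa (chance agreement fixed at 1/q for q categories) with three raters per clip, a minimal computation sketch follows; it is not taken from the paper’s code release.

```python
# Sketch of free-marginal multirater kappa. `counts[i][c]` is the number
# of raters assigning item i to category c; the rater count per item is
# assumed constant (three raters, matching the "Agreed = 3" row).
def free_marginal_kappa(counts):
    q = len(counts[0])          # number of categories
    n = sum(counts[0])          # raters per item (assumed constant)
    # Observed agreement: average pairwise agreement across items.
    p_obs = sum(
        sum(c * (c - 1) for c in row) / (n * (n - 1)) for row in counts
    ) / len(counts)
    p_exp = 1.0 / q             # chance agreement under free marginals
    return (p_obs - p_exp) / (1 - p_exp)

# Toy example: 3 raters, 3 categories; rows are per-clip label counts.
ratings = [[3, 0, 0], [2, 1, 0], [1, 1, 1]]
print(round(free_marginal_kappa(ratings), 2))
```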
Figure 5

Three factors of the audio–video fine‑tuning experiment: (1) fine‑tuning range, (2) modality combination, and (3) label type. All combinations of the three factors were tested on the SoundActions dataset with five‑fold cross‑validation.
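A minimal sketch of enumerating this 2 × 5 × 2 factorial grid with five folds, assuming the levels shown in Table 3 (‘cls’/‘all’ fine‑tuning ranges, five modality pairs, two label types); `run_experiment` is a hypothetical stand‑in for the paper’s training code.

```python
# Sketch of the factorial design in Figure 5 (levels inferred from Table 3).
from itertools import product

FINE_TUNE_RANGES = ["cls", "all"]   # assumed: classifier head only vs. all parameters
MODALITY_COMBOS = [                 # (fine-tune modality, validation modality)
    ("av", "av"), ("a", "a"), ("v", "v"), ("av", "a"), ("av", "v"),
]
LABEL_TYPES = ["PerceptionType", "Enjoyability"]

def run_experiment(ft_range, ft_mod, val_mod, label, fold):
    """Hypothetical stand-in for one fine-tuning run."""
    ...

for ft_range, (ft_mod, val_mod), label in product(
        FINE_TUNE_RANGES, MODALITY_COMBOS, LABEL_TYPES):
    for fold in range(5):           # five-fold cross-validation
        run_experiment(ft_range, ft_mod, val_mod, label, fold)
```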

Figure 6

Left: Classic audio‑visual adapter structure used by LAVisH (Lin et al., 2023) and DG‑SCT (Duan et al., 2023). Right: Ensemble of Perception Mode Adapters (EoPMA model). The causal adapters are trained on the AVE dataset (Tian et al., 2018); the reduced adapters are trained on reduced labels in the SoundActions dataset.
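Figure 6 only sketches the fusion; as a rough illustration of the parallel‑branch idea, the following PyTorch‑style snippet assumes a simple β‑weighted sum of the two adapter branches. The paper’s actual fusion rule is given by its Equations 2 and 4, which are not reproduced here.

```python
# Rough sketch of an ensemble of perception-mode adapter branches.
# The weighted-sum fusion with `beta` is our assumption, not the paper's rule.
import torch
import torch.nn as nn

class EoPMABlock(nn.Module):
    def __init__(self, causal_adapter, reduced_adapter, beta=0.5):
        super().__init__()
        self.causal = causal_adapter    # branch trained on AVE (causal labels)
        self.reduced = reduced_adapter  # branch trained on SoundActions reduced labels
        self.beta = beta

    def forward(self, x):
        # Blend the two branches; the exact fusion is an assumption here.
        return self.causal(x) + self.beta * self.reduced(x)

# Toy usage with linear layers standing in for the real adapter modules.
block = EoPMABlock(nn.Linear(16, 16), nn.Linear(16, 16), beta=0.3)
out = block(torch.randn(2, 16))
```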

Table 3

Five‑fold fine‑tuning results of DG‑SCT with EoPMA on SoundActions.

| Fine‑Tune Type | Fine‑Tune Modality | Validation Modality | PerceptionType: Group | PerceptionType: Mean (Each Fold) | Enjoyability: Group | Enjoyability: Mean (Each Fold) |
|---|---|---|---|---|---|---|
| cls | av | av | A1 | 0.54 (0.54, 0.57, 0.55, 0.51, 0.51) | B1 | 0.58 (0.57, 0.55, 0.58, 0.57, 0.60) |
| cls | a | a | A2 | 0.55 (0.53, 0.54, 0.56, 0.58, 0.55) | B2 | 0.57 (0.60, 0.58, 0.56, 0.56, 0.59) |
| cls | v | v | A3 | 0.44 (0.45, 0.42, 0.51, 0.38, 0.45) | B3 | 0.57 (0.56, 0.57, 0.58, 0.56, 0.58) |
| cls | av | a | A4 | 0.35 (0.34, 0.35, 0.34, 0.36, 0.37) | B4 | 0.56 (0.55, 0.55, 0.56, 0.57, 0.59) |
| cls | av | v | A5 | 0.43 (0.41, 0.45, 0.48, 0.37, 0.44) | B5 | 0.57 (0.56, 0.58, 0.58, 0.55, 0.58) |
| all | av | av | C1 | 0.60 (0.61, 0.59, 0.62, 0.60, 0.59) | D1 | 0.59 (0.63, 0.55, 0.58, 0.56, 0.62) |
| all | a | a | C2 | 0.66 (0.62, 0.66, 0.67, 0.66, 0.68) | D2 | 0.56 (0.55, 0.55, 0.59, 0.55, 0.57) |
| all | v | v | C3 | 0.48 (0.42, 0.46, 0.48, 0.49, 0.54) | D3 | 0.57 (0.60, 0.55, 0.58, 0.55, 0.56) |
| all | av | a | C4 | 0.53 (0.43, 0.55, 0.58, 0.52, 0.55) | D4 | 0.58 (0.59, 0.59, 0.56, 0.60, 0.56) |
| all | av | v | C5 | 0.45 (0.42, 0.41, 0.51, 0.47, 0.44) | D5 | 0.56 (0.56, 0.55, 0.56, 0.55, 0.56) |
Figure 7

Qualitative principal component analysis visualization of the embedding spaces of different modality setups and tasks.

Table 4

Comparison of AVE validation accuracy across different ensemble methods, data, and labels. EoPMA: Ensemble of Perception Mode Adapters. The ‘Original DG‑SCT’ row is our re‑evaluation of the officially provided DG‑SCT AVE checkpoint.

| Ensemble Method | Ensemble Data | Fine‑Tuned SoundActions Label | Validation Accuracy on AVE: Mean (Each Run) |
|---|---|---|---|
| (Original DG‑SCT) | – | – | 81.67 |
| Ensemble of adapters | AVE (different random seed) | – | 81.79 |
| Ensemble of final embedding | SoundActions | PerceptionType | 81.81 (81.79, 81.82, 81.84) |
| Ensemble of final embedding | SoundActions | Enjoyability | 81.80 (81.77, 81.82, 81.78) |
| EoPMA | SoundActions | PerceptionType | 81.95 (82.09, 81.92, 81.87) |
| EoPMA | SoundActions | Enjoyability | 81.88 (81.84, 81.92, 81.89) |
Figure 8

Accuracy of the original DG‑SCT model (baseline), an ensemble of two DG‑SCT models with different random seeds (ensemble of baselines), and the EoPMA models (PT for PerceptionType, EJ for Enjoyability; fine‑tuned modality; ensembling modality). β is the hyperparameter of EoPMA defined in Equations 2 and 4. Accuracies are calculated on the validation set of AVE.
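Figure 8 reports accuracy as a function of β; a minimal sketch of such a sweep follows. The β grid and both callables are assumptions standing in for the paper’s EoPMA construction and AVE evaluation pipeline.

```python
# Sketch of the kind of beta sweep behind Figure 8 (not the authors' code).
def sweep_beta(build_model, evaluate, betas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return {beta: AVE validation accuracy}. `build_model(beta)` and
    `evaluate(model)` are caller-supplied, hypothetical stand-ins."""
    return {beta: evaluate(build_model(beta)) for beta in betas}

# Toy usage with dummy callables, just to show the sweep's shape.
print(sweep_beta(lambda b: b, lambda m: 0.8 + 0.01 * m))
```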

DOI: https://doi.org/10.5334/tismir.223 | Journal eISSN: 2514-3298
Language: English
Submitted on: Sep 1, 2024
Accepted on: Sep 17, 2025
Published on: Mar 11, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Jinyue Guo, Jim Tørresen, Alexander Refsum Jensenius, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.