
Investigating Auditory–Visual Perception Using Multi-Modal Neural Networks with the SoundActions Dataset

Open Access | Mar 2026


1 Introduction

So‑called multi‑modal deep learning has experienced a resurgence in machine learning (ML) studies since the emergence of the Contrastive Language–Image Pre‑training (CLIP) model (Radford et al., 2021). While the initial focus of multi‑modal transformers was mainly on images and natural language (Kim et al., 2021; Yao et al., 2021), other modalities have been increasingly investigated, including audio and video (Akbari et al., 2021; Zellers et al., 2022). From a music technology perspective, multi‑modal neural networks open new possibilities for studying music, since the musicking process—the engagement with music in any form, from performance to perception—is multi‑modal by nature (Jensenius, 2022, p. xv).

While the term ‘audio‑visual’ has been widely used by both ML and psychological studies, we find it confusing when conducting interdisciplinary research involving these fields. Therefore, we will use ‘audio–video’ to describe media of recorded sound and motion and ‘auditory–visual’ to denote perceptual features. The term ‘audio‑visual’ will only be used when referring to previous works. For a detailed discussion of the terminologies, please refer to Section 2.5.1.

Various audio–video tasks have improved significantly since the introduction of multi‑modal learning, including audio–video‑based action recognition (Sun et al., 2022), auditory–visual localization and parsing (Tian et al., 2018, 2020), and audio‑visual segmentation (Zhou et al., 2022). Tian et al. (2021) and Duan et al. (2023) introduce audio–video foundation models that have general auditory–visual understanding abilities and can be transferred to multiple downstream tasks. Baltrušaitis et al. (2019) summarize the applications of audio–video learning into six categories: (1) speech recognition, (2) event detection, (3) emotion and affect, (4) media description, (5) multimedia retrieval, and (6) multimedia generation. They also categorize the tasks into five classes: (1) representation, (2) translation, (3) alignment, (4) fusion, and (5) co‑learning. Wei et al. (2022) further classify audio–video learning tasks into three cognition‑inspired categories: (1) auditory–visual boosting, where the tasks are initially unimodal but can be enhanced with additional modalities; (2) auditory–visual perception, where the tasks are intrinsically multi‑modal; and (3) auditory–visual collaboration, where the tasks require deeper spatial and temporal reasoning about the objects involved.

On the other hand, psychologists, neuroscientists, and music theorists have been studying multi‑modality and auditory–visual perception for decades (Chion and Murch, 2019a; Meredith and Stein, 1986). Important phenomena and theories, such as the McGurk effect (McGurk and MacDonald, 1976), sonic imaging (Barreiro, 2010), acousmatic listening (Schaeffer, 1969; Schaeffer et al., 1967), reduced listening (Chion, 1983; Schaeffer, 1969), and synchresis (Chion and Murch, 2019a), have been discovered and widely adopted by artists, composers, musicians, and musicologists for artistic creation and analysis. Pierre Schaeffer’s sound theory is one of the most influential, categorizing human listening into three modes (Schaeffer, 1969): causal listening, semantic listening, and reduced listening. Reduced listening refers to listening to the properties of the sound itself rather than to its source (causal) or to speech content (semantic). The idea of synchresis claims that auditory and visual perception are intrinsically unified instead of being two separate senses. For a detailed explanation of the three listening modes and synchresis, please refer to Section 2.1.

While ML has made significant progress in traditional audio–video tasks, there is still a gap between ML‑based studies and psychological and aesthetic‑based studies in terms of constructing tasks related to aesthetic concepts and using perceptual phenomena to improve existing models.

This article aims to bridge gaps between musicological and technological approaches to auditory–visual and audio–video analyses. We summarize our research questions as follows:

  1. What are the theoretical gaps between audio–video deep learning research and auditory–visual perception theories (namely Schaeffer (1969))?

  2. Can we observe any behavior of audio–video deep learning models similar to phenomena in auditory–visual theory (e.g., added value (Chion and Murch, 2019b))?

  3. Can we create deep learning methods inspired by auditory–visual theory to improve the audio–video deep learning models?

To answer these questions, moving from the ‘soft’ theoretical perspectives to the ‘hard’ technological perspectives, we pursue a chain of four steps in this article:

  1. From the theoretical side, we reviewed and compared the methods of current audio–video deep‑learning research from the perspectives of relevant perception theories, identifying missing points in existing datasets and tasks in Section 2.

  2. To address the lack of reduced‑type audio–video datasets, we present the SoundActions dataset, a collection of 365 short recordings of sound‑producing actions that have been manually annotated with various labels inspired by Schaeffer (1969) and Chion and Murch (2019a) in Section 3.1.

  3. To compare the behaviors of audio–video models to auditory–visual phenomena, we performed a set of fine‑tuning experiments of the latest audio–video transformer on the reduced‑type labels of the SoundActions dataset in Section 3.2. We examined the model’s ability to recognize reduced‑type auditory–visual components through experiments on different fine‑tuning types and modality combinations and investigated the modality imbalance phenomenon in the model as discussed in Section 4.1.

  4. Inspired by Schaeffer (1969), we proposed an ensemble method named ‘Ensemble of Perception Mode Adapters’ (EoPMA) that uses a small amount of reduced‑type data to improve performance on existing causal‑type tasks, as explained in Sections 3.3 and 4.2.

2 Background

2.1 Auditory–Visual theory

Musicological theories of sound perception have evolved over the last decades, supported by the development of sound technologies. Some of these theories have been combined with film theories and related technologies. In this subsection, we briefly introduce Pierre Schaeffer’s concept of reduced listening, sound objects, and Michel Chion’s synchresis principle and the phenomenon of added value.

2.1.1 Three listening modes

As the pioneer of musique concrète, Pierre Schaeffer contributed to the modern experimental music scene not only through the practice and theory of music and sound but also through the theorization of listening. The invention of sound recording and playback devices made it possible to replay a sound disconnected from its sound‑producing objects and actions. Schaeffer (1969) categorizes human auditory perception in three modes:

  1. Causal listening: Listening to a sound to gather information about its cause (or source). Examples: ‘a bird sound,’ ‘a machine sound.’

  2. Semantic listening: Listening to a sentence in a spoken language or information in a code. Examples: A sentence spoken in English, a series of Morse code.

  3. Reduced listening: A listening mode that focuses on the traits of the sound itself, independent of its cause and meaning. Examples: ‘A loud, impulsive sound with a long reverb’; ‘a high, iterative sound gradually fading.’

While causal and semantic listening are relatively common and easy to understand, Schaeffer named and emphasized reduced listening as ‘an instructive experience’ that is ‘opening up our ears and sharpening our power of listening’ (Chion and Murch, 2019c, p. 31). The adjective ‘reduced’ is borrowed from Husserl’s phenomenological notion of reduction (according to Chion and Murch (2019c, p. 33)). Schaeffer’s utilization of reduced listening in his musique concrète pieces and acousmatic music pieces has inspired many experimental musicians since the 1950s.

2.1.2 Schaeffer’s typology of reduced listening

One of the challenges of reduced listening is to find a terminology to describe the sound itself—that is, without mentioning any causal or semantic aspects such as events, objects, actions, materials, and environments, we need to look for new descriptions of the properties of the sound. Starting from the foundation of sound and temporality, Pierre Schaeffer introduced a universal typology for describing the temporal property of sound: impulsive, sustained, and iterative. Figure 1 illustrates the energy curve of these three types of sounds and their corresponding sound‑producing actions: an ‘impulsive’ event is a short but abrupt action/sound (e.g., clapping), a ‘sustained’ event is a long sound with stable action/sound energy (e.g., a legato note of a violin), and an ‘iterative’ event is a long sound but with fluctuating sound energy, while the action energy can be either fluctuating or stable (e.g., a vibrato note or an engine sound).

Figure 1

A sketch map of the three main sound types of Pierre Schaeffer’s typology (impulsive, sustained, and iterative) with related action profiles (Jensenius, 2022). Dashed vertical lines show the perceived onset/end points.

While Schaeffer’s typology is a valuable theoretical framework for electroacoustic music analysis, musicologists have also questioned it. One of the major criticisms is that the typology focuses on separable, ‘monophonic’ objects and thus cannot ‘handle the “polyphony” of superimposed sound objects nor more complex music productions featuring continuously evolving elements’ (Lartillot, 2024, p. 275). Musicologists have proposed various new analysis frameworks as improvements of Schaeffer’s typology, including R. Murray Schafer’s ‘soundscape’ analysis (Schafer, 1993), Denis Smalley’s Spectro‑Morphology (Smalley, 1986), Stephane Roy’s Hierarchical and Functional Analysis (Roy, 2004), Lasse Thoresen’s Graphical Formalisation (Thoresen and Hedman, 2007, 2015), Ecrin’s Audio‑Content description (Geslin et al., 2002), Structural and Functional Analysis (Delalande et al., 1996; Thoresen and Hedman, 2007), and Pierre Couprie’s Morphology (Couprie, 2003). These theories address and improve on the drawbacks of Schaeffer’s typology. Still, some act more like personal compositional frameworks than generalized theories, while others are piece‑by‑piece analysis tools that are neither categorizable nor computable.

2.1.3 From auditory to auditory–visual: Synchresis and added value

Since the invention of sound films in the 1920s, musicians and sound artists have started to think about the relationship between audio and video: ‘Despite all appearances, we do not see and hear a film, we hear/see it’ (Chion and Murch, 2019a). Audio and video are not simply perceived by two separate neural routes; they affect each other, creating a complicated, interacting, and merged percept.

Continuing Schaeffer’s theory of listening modes, Chion and Murch (2019a) expand the concept to auditory–visual perception. The argument is that causal perception typically combines auditory and visual information. Chion and Murch mention two particular phenomena: mental imaging and deceptive causes. Mental imaging occurs when we perceive information from one modality and imagine the other, for example, imagining a radio announcer’s appearance from their voice. Deceptive causes occur when our attempt to find the cause is manipulated by the audio–video scene, for example, fake computer sounds in sci‑fi movies. In semantic perception, the McGurk effect (McGurk and MacDonald, 1976) is a classic example showing that the perception of the consonants ‘b,’ ‘p,’ ‘g,’ and ‘k’ can be changed by different combinations of their sounds and videos. In reduced perception, Chion and Murch emphasized the benefit of reduced listening, namely that ‘the emotional, physical, and aesthetic value of a sound is linked not only to the causal explanation we attribute to it but also to its own qualities of timbre and texture, to its own personal vibration’ (Chion and Murch, 2019a, p. 31). Therefore, the aesthetic value of the sound can be linked with the aesthetic value of the video through their shared reduced characteristics.

Finally, Chion and Murch (2019a) introduced two key concepts in their ‘audio‑vision’ theory: a principle called synchresis and a phenomenon called added value. Synchresis, a portmanteau of synchronism and synthesis, is defined as ‘the spontaneous and irresistible weld produced between a particular auditory phenomenon and visual phenomenon when they co‑occur. This join results independently of any rational logic’ (Chion and Murch, 2019a, p. XIX). In other words, synchresis is a particular type of perception with two prerequisites: (1) audio and video are both perceived and (2) the audio and the video events happen simultaneously. A famous example of synchresis would be the scene in The Shining when Jack breaks the wooden door with an axe. The simultaneous sound of the axe splitting the door and the visual display of the axe going deeper after every hit together create an unforgettable experience. As a phenomenon during the synchresis process, added value describes a deceptive impression that the visual information dominates human perception, and that other modalities—such as sound—can be removed without harming the core expression of the visuals. However, when the sound is removed, people immediately realize that the feeling is gone. Considering the axe‑hitting scene again, you might think that the visual information is enough to convey that the door is breaking. Still, if you watch the scene without the sound of the breaking wood and Wendy’s scream, you would not feel the thrill as much as when the audio and visuals are combined.

2.1.4 Challenges in automatic recognition of Schaeffer’s typology

There have been some attempts at systematizing and computing the aforementioned electroacoustic theories with signal‑based feature‑extraction techniques. Through temporal and spectral calculations such as root‑mean‑square, Mel‑Spectrogram, and Mel‑frequency cepstral coefficients, Lartillot (2024) attempted to quantify theoretical concepts, including identifying sound objects and their dynamics, facture, and mass. While making progress toward a universal Toolbox des Objets Sonores, ‘(t)here remains a large gap between, on one side, the overarching analytical methodologies and ideals developed by musicologists and, on the other side, the relatively modest contribution of what computational automation can offer today’ (Lartillot, 2024, p. 291). The author mentions deep learning as a potentially more robust approach to solving questions regarding sound object detection. Still, this specific task has not been studied due to a lack of labeled data (Lartillot, 2024, p. 282). After all, ‘the problem of automated detection of sound objects in electroacoustic music remains unsolved’ (Lartillot, 2024).
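To make the kind of signal-based quantification discussed above more concrete, the following is a minimal sketch of a heuristic that maps a frame-wise RMS energy envelope onto Schaeffer's three temporal types. It assumes the librosa library; the thresholds and decision rule are illustrative assumptions only and do not correspond to Lartillot's (2024) toolbox or to the labeling procedure used later in this article.

```python
# Illustrative, signal-based heuristic for Schaeffer's temporal typology
# (impulsive / sustained / iterative). Thresholds are assumptions.
import numpy as np
import librosa


def rough_typology(path, frame_length=2048, hop_length=512):
    y, sr = librosa.load(path, sr=None, mono=True)
    # Frame-wise RMS energy envelope, normalized to its peak.
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)[0]
    rms = rms / (rms.max() + 1e-9)

    # Time (in seconds) during which the envelope stays above 20% of its peak.
    active = rms > 0.2
    active_dur = active.sum() * hop_length / sr

    # Frame-to-frame fluctuation of the envelope within the active region.
    fluctuation = np.std(np.diff(rms[active])) if active.sum() > 2 else 0.0

    if active_dur < 0.3:        # short, abrupt burst of energy
        return "impulsive"
    if fluctuation > 0.05:      # long, but with fluctuating energy
        return "iterative"
    return "sustained"          # long, with stable energy


# Example call (hypothetical file name):
# print(rough_typology("clip_cutting_paper.wav"))
```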

Besides computational challenges, the gap between clean, recorded audio signals and the human auditory perception of sound in complex environments also adds to the difficulty. Holbrook (2022, p. 180) gives an example of how the same impulsive sound can be perceived as impulsive or sustained in spaces with different reverberance parameters. Therefore, we argue that human labeling is necessary to account for human perception when developing sound object classification systems.

Finally, since we extended Schaeffer’s typology from auditory to auditory–visual typologies, we would like to consider visual actions as well, which presents further challenges for pure Digital Signal Processing (DSP) methods that combine audio and video processing techniques.

2.2 Audio–Video datasets and labeling

Nowadays, recording audio–video data is easy thanks to the widespread availability of high‑quality personal recording devices and access to online streaming platforms for sharing. However, collecting an audio–video dataset still requires much effort to standardize the content, define classes, and organize metadata. Table 1 shows the popular audio–video datasets that provide labels. This section will briefly introduce their characteristics.

Table 1

Popular labeled audio‑visual datasets. In label modality, ‘audio&video’ means some labels are based on audio information while others are based on video information, while ‘combined’ means labels are based on the combined information of both modalities. In ‘Perception mode’ and ‘Label ontology’ columns, we use colors as an additional indicator of the perception types: red stands for causal labels, blue stands for reduced labels, and green stands for semantic labels. While ‘emotion’ was not considered in Schaeffer’s listening mode framework, we loosely regard it as a mixture of causal and semantic information.

Dataset | Year | # of Clips | Total Duration | Source | Label Modality | Perception Mode | Label Ontology
AudioSet | 2017 | 2M+ | 5,800+ h | YouTube | audio | causal | events
Kinetics‑400 | 2017 | 306k+ | 850+ h | YouTube | combined | causal | actions
EPIC‑KITCHENS | 2018 | 39,594 | 55 h | original | audio&video | causal | objects/actions
CMU‑MOSEI | 2018 | 2,199 | 2+ h | YouTube | combined | causal/semantic | emotion
AVE | 2018 | 4,143 | 11+ h | YouTube | combined | causal | events
LLP | 2020 | 11,849 | 32.9 h | YouTube | audio&video | causal | events
VGGSound | 2020 | 200k+ | 550+ h | YouTube | combined | causal | events
SSW60 | 2022 | 9.2k | 25.7 h | original | combined | causal | events
SoundActions | – | 365 | 1 h | original | combined | causal+reduced | events, objects, actions, environment, enjoyability, perception type

Gemmeke et al. (2017) introduced AudioSet, a large‑scale dataset of sound events with manual labels. The dataset contains over 2 million audio–video samples with a summed duration of more than 5,800 h, collected from YouTube. Apart from the data, they also introduced a hierarchical ontology of sound events that has greatly influenced sound event detection research. At the top layer of the AudioSet ontology, sound events are classified into seven main categories: (1) human sounds, (2) animal sounds, (3) natural sounds, (4) music, (5) sounds of things, (6) source‑ambiguous sounds, and (7) channel, environment, and background sounds. As a continuation of some classical sound event detection works (Gaver, 1993; Nakatani and Okuno, 1998; Sager et al., 2016; Salamon et al., 2014), the AudioSet ontology focuses on causal listening (Chion and Murch, 2019a, pp. 22–34), which means the labels describe the sound‑producing action or the material of the sound‑producing objects. The AudioSet ontology also does not include visual aspects of the events, such as positional information and surroundings.

Tian et al. (2018) introduced the AVE dataset, a subset of AudioSet with frame‑level event labels. The AVE dataset contains 4,143 video clips covering 28 event categories, which form a subset of the AudioSet ontology. While the size and the number of categories are smaller than AudioSet, the frame‑level labels enable a new type of task, auditory–visual event localization (AVEL), which recognizes both the type of event and its timing and duration. Another sub‑dataset is LLP (Tian et al., 2020), containing 11,849 video clips in 25 categories. The LLP dataset specifies separate time information for the audio and video streams, since the sounds and related actions might not happen simultaneously. For example, a video clip might have the sound of a dog first, with the figure of the dog appearing one second later. In this case, the ‘dog’ label for audio and video would have different start times. Tian et al. (2020) introduce a new task called ‘audio‑visual video parsing’ (AVVP), in which the model gives separate outputs for auditory and visual events. Although the timelines are separated, the ontology of the audio and video labels in the LLP dataset remains the same subset of the AudioSet ontology.

The AudioSet data and ontology have widely influenced many other datasets. Meanwhile, datasets with different ontologies have also been created to meet the needs of various tasks. The Kinetics‑400 dataset (Kay et al., 2017) focuses on human actions, containing 306,245 clips in 400 action classes. The action ontology was first gathered from previous action and motion‑capture datasets and then extended and modified by Mechanical Turk workers during the crowd‑sourced labeling process. The VGGSound dataset (Chen et al., 2020) contains over 200,000 clips in 300 classes, with most class labels containing two parts of information, the action and the object, for example, ‘playing violin’ or ‘llama humming.’ For inseparable events, for example, ‘thunder,’ there is only one descriptor in the label. Unlike AudioSet, the VGGSound dataset uses a flat hierarchy in which each sample only contains one label layer.

The datasets mentioned above were collected from YouTube, which is a good source for building large datasets; however, there are many challenges regarding privacy, copyright, quality control, and reproducibility. Huh et al. (2023) and Damen et al. (2022) collected a large dataset of original, manually labeled, egocentric audio–video recordings of kitchen activities. The dataset contains 117,000 frame‑level segments of events totaling more than 100 h in duration. The EPIC‑KITCHENS and EPIC‑SOUNDS datasets introduce a more detailed ontology. First, they separate visual events and auditory events in both temporal and semantic dimensions. For example, the action of ‘cutting a tomato’ is only recognizable as a visual event, whereas the corresponding auditory event is ‘cutting objects,’ and the visual and auditory events can have different durations. Reviewers assigning auditory labels only heard the audio of the samples during labeling, and reviewers assigning visual labels only saw the video. Second, they classify events into two types, actions and collisions of different materials, which clarifies whether the auditory–visual event is caused by an action or by some objects.

Finally, the MediaEval community has proposed several tasks that are relevant to audio–video information retrieval, including E‑Sports game storytelling (Lux et al., 2019), Multimedia for Recommender System (Deldjoo et al., 2019), visual and audio analysis of movie videos for emotion detection (Batziou et al., 2018), and Multi‑modal Person Discovery in Broadcast TV (Poignant et al., 2015). While not designed explicitly for data‑rich ML approaches, these tasks and datasets show the potential for audio–video analysis in more professionally focused applications.

The ontologies used by existing datasets primarily fall into Schaeffer’s category of causal listening or causal perception. To our knowledge, CMU‑MOSEI (Bagher Zadeh et al., 2018) is the only audio–video dataset with non‑causal labels, focusing on human facial emotions, which could be seen as falling under Schaeffer’s definition of semantic perception. Existing audio–video datasets contain very few descriptions of the auditory or visual characteristics of the data. For example, one clip in the AudioSet class ‘music’ could be a quiet, mid‑frequency chamber music piece without much visual action, while another sample with the same label could be a loud, full‑bandwidth experimental noise piece with abrupt visuals. To our knowledge, no existing audio–video dataset provides reduced‑type labels.

2.3 Audio–Video machine learning models

Before the latest revival of contrastive multi‑modal learning led by CLIP (Radford et al., 2021), the first attempts at multi‑modal deep learning (Ngiam et al., 2011; Yuhas et al., 1989) were actually on audio and video data. These pioneering works (Alwassel et al., 2020; Arandjelovic and Zisserman, 2017; Aytar et al., 2016; Korbar et al., 2018; Ma et al., 2021; Owens et al., 2016; Tian et al., 2021) investigated two main issues: different network structures for information fusion (early, mid, and late fusion) and the understanding of cross‑modal temporality through recursive or sequential networks. In the age of transformers, Latent Audio‑Visual Hybrid adapters (LAVisH) (Duan et al., 2023; Lin et al., 2023) have become the most powerful technique in audio–video models since they solve two problems at the same time: parameter‑efficient fine‑tuning (PEFT) and cross‑modal information fusion. With the help of LAVisH, it is now possible to take existing large vision and audio transformers pre‑trained on large datasets and fine‑tune them on smaller datasets for various downstream tasks, including AVEL, AVVP, audio‑visual segmentation, and audio‑visual question answering.

Despite the rapid development in model structures and training efficiency, the limitation of existing datasets—all being causal or semantic—makes it hard to test the models’ ability to recognize reduced listening components in audio–video samples. Furthermore, even though modality comparisons are often included in the ablation studies, existing works usually only compare three types of situations, using (1) audio, (2) video, and (3) audio and video. It is hard to validate whether the models have learned synchresis or the added‑value phenomenon based on these three cases alone. We will further discuss the gap between audio–video learning and auditory–visual theory in Section 2.5.

2.4 Deep ensemble learning and ensemble of PEFT

Deep ensemble learning is a technique that combines the parameters or outputs of multiple deep learning models to achieve superior results compared to the individual models (Ganaie et al., 2022). As PEFT methods (Xu et al., 2023) have become widely adopted, some recent works attempt to ensemble PEFT modules instead of entire models. Kim et al. (2024) investigated ensembles of sequential (Houlsby et al., 2019) and parallel (He et al., 2022) adapters, showing superior performance in multiple language and vision tasks compared to single adapters. Kwok et al. (2024) and Lin et al. (2024) extend the investigation to a broader range of PEFT methods, including LoRA (Hu et al., 2022) and PiggyBack (Mallya et al., 2018). Kwok et al. (2024) conclude that an ensemble of different PEFT methods, for example, LoRA plus a parallel adapter, outperforms an ensemble of the same PEFT method trained with different random seeds. Since PEFT ensembling and audio‑visual adapters were developed around the same time, no attempts have yet been made to apply ensemble methods to audio‑visual adapters.

2.5 The gap between audio–video learning and auditory–visual theory

Musicologists and ML scientists share an interest in auditory–visual phenomena, but they study them using widely different theoretical and methodological approaches. In this section, we discuss three major issues worth considering when working between the disciplines: the difference in the definitions of ‘multi‑modality’ and ‘audio‑visuality,’ the lack of reduced‑listening audio–video datasets, and the comparison between the concepts of modality imbalance and added value.

2.5.1 Multi‑Modality and audio–visuality

Although ‘multi‑modal’ and ‘audio‑visual’ have been widely used in ML studies, their definitions and relationships in ML have deviated from their original definitions in psychological and linguistic studies.

In psychology, ‘multi‑modality’ refers to the fusion of multiple modalities (visual, auditory, tactile, olfactory, gustatory, and vestibular) based on the respective senses (sight, hearing, touch, smell, taste, and balance) (Schomaker, 1995). In cognitive linguistics, ‘multi‑modality’ is seen as ‘phenomena of human communication as a combination of different semiotic resources, . . . such as still and moving image, speech, writing, layout, gesture, and/or proxemics’ (Adami, 2016). Neuroscientists define ‘audio–visuality’ as the phenomena of ‘how single neurons deal with simultaneous cues from different sensory modalities’ (Meredith and Stein, 1986).

On the other hand, in ML studies, ‘multi‑modal’ generally refers to models that take more than one input type and is often narrowed to image and text inputs. For example, a model using an optical (scanned image) and a symbolic (MIDI or MusicXML) representation of a musical score as input would still be considered ‘multi‑modal’ (Christodoulou et al., 2024). Often, ‘audio–visual’ is considered a subset of ‘multi‑modal’ using audio and image/video files as input. That is why we believe more precise terms would be ‘audio‑image’ and ‘audio–video’ for models that take audio and images, or audio and video, as input. We will use this separation throughout this article, except when describing explicit model names.

2.5.2 Lack of reduced listening data

As listed in Table 1, although the label ontology of current audio–video datasets varies between events, objects, actions, or emotions, none of them includes reduced perception components in their labels. For example, the seven main categories of the AudioSet ontology (described in Section 2.2) all describe the sound source, regardless of their reduced properties. In one case, a sample categorized as music can be iterative and loud, similar to machine sounds, while another can be sustained and gentle, similar to environmental sounds.

While causal listening is the most common and ‘intuitive’ perception mode, ‘we must not overestimate the accuracy and potential of causal listening, its capacity to furnish sure, precise data solely based on analyzing sound. In reality, causal listening is not only the most common but also the most easily influenced and deceptive mode of listening’ (Chion and Murch, 2019a, pp. 22–34). Therefore, we need datasets with reduced perception labels, and, more importantly, the samples should be collected with the awareness of reduced listening to guarantee the representation of different reduced perception patterns in the dataset.

2.5.3 Added value and the modality imbalance problem

Chion and Murch (2019a, pp. 35–65) define the phenomenon of added value as the deceptive impression that vision contains all relevant information and that the other modalities act as unnecessary ‘supplements.’ However, when the other modalities are removed, the perceivers realize that the vision does not constitute a complete expression.

ML researchers have observed a similar behavior in the models, named the modality imbalance problem. Audio–video models might rely on one modality and discard the other modality, either because of the modality imbalance in the dataset (Xia et al., 2023) or the imbalance of modality gradients during the training process (Huang et al., 2022; Peng et al., 2022; Wei and Hu, 2024). The phenomenon might be harmful to the performance and robustness of the models.

Despite the similarity between added value and modality imbalance, researchers have different attitudes toward these two concepts. Although added value is regarded as a deceptive impression, musicologists acknowledge it as a natural part of human perception. Modality imbalance, on the other hand, is considered harmful to the model, and ML scientists have been looking for solutions to balance the modalities as much as possible.

3 Method

We propose one dataset and two related experiments to bridge the above‑mentioned theory gaps and support future multidisciplinary research. The SoundActions dataset provides causal and reduced labels. Our first experiment is based on fine‑tuning the DG‑SCT model on the SoundActions dataset, and our second experiment, EoPMA, uses an ensemble method inspired by Pierre Schaeffer’s typology. The dataset and experiments are explained in the following subsections.

3.1 The SoundActions dataset

The SoundActions dataset [1] is a collection of 365 selected real‑world action–sound pairs with various causal and reduced‑listening labels. Each recording in the dataset focuses on a single action or a continuous series of actions. Most recordings are short, around 5–10 s, although some are considerably longer. Figure 2 shows a summary of all SoundActions samples, and Figure 3 shows the durations of the samples in the SoundActions dataset.

Figure 2

Thumbnails of the SoundActions dataset.

Figure 3

Histogram of durations of SoundActions samples. Sample counts are shown on a logarithmic scale to reveal the long tail.

3.1.1 Data collection, naming, and cleaning

The last author made one recording for the SoundActions dataset daily throughout 2022, hence the 365 recordings. All recordings were made with a Samsung S21 Ultra mobile phone and a Røde VideoMic Me‑C microphone. In most recordings, the phone stood on a tripod, with the microphone facing the sound‑producing object. All recordings were made in real‑world settings, but care was taken to reduce the background noise. Figure 4 shows the recording process of one sample in the SoundActions dataset.

Figure 4

Recording a sound action using a lightweight setup with a mobile phone equipped with a USB microphone.

Since the recordings were made over the span of a year, the dataset includes seasonal actions such as walking in the snow in winter and trimming bushes in summer. The actions in the dataset were pre‑selected and made as diverse as possible following the environment–action–object typology (Jensenius, 2022), described as follows:

  • Environment: The action’s surroundings. We included nine classes of environments: ‘outdoor,’ ‘office,’ ‘room,’ ‘corridor,’ ‘hall,’ ‘bathroom,’ ‘kitchen,’ ‘car,’ and ‘other indoors.’

  • Action: The (inter)action that leads to sound production with the object(s). Actions are described as free texts, for example, ‘stepping,’ ‘cutting,’ ‘pressing.’

  • Object: The sound‑producing object(s) involved in the interaction. Sometimes, it is one object; other times, it is two or more equal objects (e.g., two stones hitting each other); and other times, it is a tool interacting with an object (e.g., a drumstick). Objects are described as free texts, for example, ‘snow,’ ‘button,’ ‘hands,’ ‘key.’

In addition to the listed typologies, we also included 15 classes of materials: ‘metal,’ ‘wood,’ ‘plastic,’ ‘organic,’ ‘skin,’ ‘stone,’ ‘fabric,’ ‘rubber,’ ‘liquid,’ ‘snow,’ ‘ceramic,’ ‘paper,’ ‘leather,’ ‘electronic,’ and ‘others.’

After recording, the files were manually trimmed (using FFmpeg to avoid recompression) to contain only one sound‑producing action or a series of actions forming one coherent sound object, following Schaeffer’s theory. The file names contain information about the sound‑producing action and objects, such as ‘cutting paper’ or ‘pouring rice.’
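The exact trimming commands are not specified here; as one common approach, the following is a minimal sketch of trimming a clip with FFmpeg's stream-copy mode, which avoids recompression by copying the audio and video streams as-is. The file names and timestamps are hypothetical.

```python
# Trim a clip without re-encoding by copying the streams ("-c copy").
# File names and timestamps below are hypothetical examples.
import subprocess


def trim_copy(src, dst, start, end):
    """Cut the span [start, end] (seconds or HH:MM:SS strings) from src into dst.

    Stream copy avoids recompression, at the cost of video cuts landing
    only on existing keyframes.
    """
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-to", str(end),
         "-i", src, "-c", "copy", dst],
        check=True,
    )


# trim_copy("2022-03-14_pouring_rice_raw.mp4",
#           "pouring_rice.mp4", "00:00:02", "00:00:09")
```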

3.1.2 Data labeling

Three research assistants with experience in music technology and psychology tagged all 365 samples independently. They were provided links to the files and brief instructions about the dataset and the labels. For example, for the PerceptionType label, they were instructed to follow Pierre Schaeffer’s typology and expand it to cover auditory–visual information. They were asked to tag based on a holistic impression of the audio–video sample, since the visual and auditory streams may sometimes not match. For example, an impulsive action may result in an iterative or sustained sound due to the mechanical or digital construction of the sound‑producing objects. There were no restrictions on the tagging order, and the participants could watch/listen to the files as many times as they wished. The participants could not see each other’s results during the process. For the PerceptionType and Enjoyability labels, the participants were asked to tag into predefined categories; for the other labels, free text was allowed. No personal information was collected during the labeling process.

After tagging, the authors manually cleaned the labeling results to standardize the categories; for example, material labels such as ‘copper,’ ‘iron,’ and ‘steel’ were all standardized into ‘metal.’ Finally, the tags of the three participants were merged into one final version. For single‑hot labels, including PerceptionType, Enjoyability, and Environment, a majority vote defined the final label, and samples without a majority were tagged as ‘unknown.’ For multi‑hot labels, including Material, Objects, and Actions, all tagged labels were included.
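As a minimal sketch of this merging rule, the following assumes three raters per sample; the function names and example tags are illustrative, not the actual processing script.

```python
# Majority vote (with an 'unknown' fallback) for single-hot labels,
# union of all tags for multi-hot labels.
from collections import Counter


def merge_single_hot(tags):
    """Majority vote over one single-hot label (e.g., PerceptionType)."""
    label, n = Counter(tags).most_common(1)[0]
    # With three raters, a majority requires at least two identical votes.
    return label if n >= 2 else "unknown"


def merge_multi_hot(tag_sets):
    """Union of all tags for one multi-hot label (e.g., Material)."""
    merged = set()
    for tags in tag_sets:
        merged.update(tags)
    return sorted(merged)


print(merge_single_hot(["impulsive", "impulsive", "iterative"]))   # impulsive
print(merge_single_hot(["impulsive", "sustained", "iterative"]))   # unknown
print(merge_multi_hot([{"metal"}, {"metal", "wood"}, {"wood"}]))   # ['metal', 'wood']
```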

Table 2 shows the class counts and agreement levels of the two reduced‑type labels in the SoundActions dataset. The classes in PerceptionType are relatively balanced, with slightly fewer ‘sustained’ labels than the other two classes. Among the samples, 8.5% were without a majority vote in the ratings, either because the three participants all voted for different options or because one participant voted for ‘unknown’ and the other two voted for different classes. There is no sample for which more than one participant voted for ‘unknown’; only one participant used ‘unknown’ (for four samples), while the other two did not use it at all. In terms of agreement level, we calculated the multirater free‑marginal kappa (Randolph, 2005), which measures the level of agreement as a score between −1 and 1, where −1 means total disagreement, 0 corresponds to chance‑level (random) voting, and 1 means total agreement (a minimal sketch of this computation is given after Table 2). The κfree value for PerceptionType is 0.46, indicating a moderate agreement level among the three participants. Moreover, at least two participants agreed on 91.5% of the samples, and all three agreed on 43.3%. This means the PerceptionType labels have a certain level of reliability. On the other hand, while the percentage of samples on which at least two participants agreed is still 88.5% for Enjoyability, the percentage of full agreement and the κfree value are considerably lower. This means that the Enjoyability voting is much more subjective than the PerceptionType voting. The Enjoyability votes also had a less balanced distribution between the classes, with about 55.6% of the samples labeled as ‘neutral.’

Table 2

Statistics of the reduced‑type labels in the SoundActions dataset.

 | PerceptionType | Enjoyability
Class counts | Impulsive: 124 | Yes: 49
 | Sustained: 84 | Neutral: 203
 | Iterative: 125 | No: 72
 | No majority: 32 | No majority: 41
Multirater κfree | 0.46 | 0.20
Agreed ≥ 2 | 91.5% | 88.5%
Agreed = 3 | 43.3% | 25.4%
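The following is a minimal sketch of Randolph's (2005) free-marginal multirater kappa referenced above, assuming an equal number of raters per sample; the toy ratings are illustrative only.

```python
# Free-marginal multirater kappa: (P_obs - P_exp) / (1 - P_exp),
# with chance agreement P_exp = 1 / number of categories.
from collections import Counter


def kappa_free(ratings, n_categories):
    """ratings: list of per-sample rating lists, each with the same number of raters."""
    n_raters = len(ratings[0])
    p_obs = 0.0
    for item in ratings:
        counts = Counter(item)
        # Proportion of agreeing rater pairs for this sample.
        pairs = sum(c * (c - 1) for c in counts.values())
        p_obs += pairs / (n_raters * (n_raters - 1))
    p_obs /= len(ratings)
    p_exp = 1.0 / n_categories
    return (p_obs - p_exp) / (1.0 - p_exp)


# Three raters, three categories (impulsive / sustained / iterative):
toy = [["impulsive", "impulsive", "impulsive"],
       ["sustained", "iterative", "sustained"],
       ["iterative", "impulsive", "sustained"]]
print(round(kappa_free(toy, n_categories=3), 2))
```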

3.2 Audio–Video fine‑tuning on reduced‑type labels

The first experiment was to use an audio–video deep learning model to recognize the reduced‑type labels in the SoundActions dataset. We used DG‑SCT (Duan et al., 2023), the current state‑of‑the‑art model for auditory–visual event recognition, to investigate the ability to recognize reduced‑type labels under different conditions [2]. As demonstrated in Figure 5, three main factors are considered in the experiments:

  1. Fine‑tuning range: Two kinds of fine‑tuning were tested: fine‑tuning only the final layer after all transformer and adapter layers, labeled as ‘cls,’ or fine‑tuning all adapters, labeled as ‘all.’ The ‘cls’ groups test the model’s ability after training on large‑scale, causally labeled datasets, while the ‘all’ groups investigate the model’s best achievable performance on reduced‑type labels. The ‘cls’ groups have 531K trainable parameters out of the total 460M parameters, while the ‘all’ groups have 71.1M out of 460M trainable parameters.

  2. Modality combination: To investigate how the model extracts information from different modalities, we tested three combinations of input modalities: combined audio–video (‘av’), audio‑only (‘a’), and visual‑only (‘v’). When only one modality is used, we create an all‑zero matrix as a dummy input for the other modality (a minimal sketch of this dummy‑input setup is given after Figure 5). Like the fine‑tuning data, the validation data can be provided as ‘av,’ ‘a,’ or ‘v.’ If the model is fine‑tuned with ‘a’ or ‘v,’ we do not provide the unseen modality during validation. For models fine‑tuned with ‘av,’ however, we additionally withdraw one modality during validation to test their reliance on each modality.

  3. Label type: We use the two reduced‑type labels provided by the SoundActions dataset: ‘PerceptionType,’ which includes Impulsive, Sustained, Iterative, and Unknown, and ‘Enjoyability,’ which includes Yes, Neutral, No, and Unknown.

Figure 5

Three factors of the audio–video fine‑tuning experiment: (1) fine‑tuning range, (2) modality combination, and (3) label type. All combinations of the three factors were tested on the SoundActions dataset with a fivefold cross‑validation.
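As referenced in the modality-combination item above, the following is a minimal sketch of replacing a withheld modality with an all-zero dummy tensor. The tensor shapes and the helper function are illustrative assumptions, not the released DG-SCT code.

```python
# Replace a withheld modality with a zero tensor of the expected shape.
import torch


def make_inputs(audio, video, modality):
    """Return (audio, video) for modality in {'av', 'a', 'v'}; a withheld modality becomes zeros."""
    if modality == "a":
        video = torch.zeros_like(video)   # withhold the visual stream
    elif modality == "v":
        audio = torch.zeros_like(audio)   # withhold the auditory stream
    return audio, video


# Illustrative shapes: a (batch, time, mel) spectrogram and (batch, frames, C, H, W) video.
audio = torch.randn(4, 1000, 64)
video = torch.randn(4, 10, 3, 224, 224)
a, v = make_inputs(audio, video, modality="a")
print(v.abs().sum().item())   # 0.0: the video stream carries no information
```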

Due to the limited size of the SoundActions dataset, we used five‑fold cross‑validation for each group of experiments and calculated the mean and standard deviation as the final metrics. The training and validation data might use different modalities according to the experiment setups. The official checkpoint of DG‑SCT on the AVE task was used as the pre‑trained model for the experiment. The audio backbone HTS‑AT (Chen et al., 2022) was pre‑trained on AudioSet (Gemmeke et al., 2017), the video backbone Swin‑T (Liu et al., 2022) was pre‑trained on ImageNet‑22K (Deng et al., 2009), and the DG‑SCT adapters were pre‑trained on the AVE dataset (Tian et al., 2018). All backbones and adapters are fixed for the ‘cls’ groups, and only the final CMBS layer (Xia and Zhao, 2022), with 531K trainable parameters, is activated. For the ‘all’ groups, the backbones are fixed, while the adapters and the CMBS layer are activated, resulting in 189M trainable parameters.

The fine‑tuning experiments were conducted with a learning rate of 5×10⁻⁴ after a grid search between 1×10⁻³ and 4×10⁻². The learning rate is halved if the training loss stops decreasing for more than five epochs, and the training is terminated when the validation loss stops decreasing for more than 10 epochs. The ‘cls’ groups were trained with a batch size of 16, while the ‘all’ groups used a batch size of 4 with gradient accumulation over four batches due to their heavier computational needs. Both groups used nine types of video data‑augmentation transforms, while no audio augmentation was used. An NVIDIA A100 GPU was used to conduct the experiments. PyTorch (Paszke et al., 2019), PyTorch‑Lightning (Falcon and The PyTorch Lightning team, 2019), and Weights and Biases (Biewald, 2020) were used for the training scripts and experiment logs.
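For orientation, the following is a minimal sketch of the plateau-based schedule described above (halving the learning rate when the training loss plateaus and stopping when the validation loss plateaus), using PyTorch's ReduceLROnPlateau. The model and data are placeholders; this is not the authors' training script.

```python
# Plateau-based LR halving and early stopping, on a placeholder model/data.
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = torch.nn.Linear(16, 4)                       # placeholder model
x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))
x_val, y_val = torch.randn(16, 16), torch.randint(0, 4, (16,))
criterion = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=5)

best_val, stale_epochs = float("inf"), 0
for epoch in range(200):
    optimizer.zero_grad()
    train_loss = criterion(model(x), y)
    train_loss.backward()
    optimizer.step()
    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val).item()
    scheduler.step(train_loss.item())    # halve the LR when the training loss plateaus
    if val_loss < best_val:
        best_val, stale_epochs = val_loss, 0
    else:
        stale_epochs += 1
    if stale_epochs > 10:                # stop when the validation loss plateaus
        break
```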

3.3 EoPMA: Ensemble of perception mode adapters

The second experiment aims to improve the audio–video model using Schaeffer’s theory of three listening modes (Schaeffer et al., 1967) and its extension to auditory–visual perception modes (Chion and Murch, 2019a). Instead of being triggered separately or exclusively in human perception, the three listening modes ‘overlap and combine in the complex and varied context of (the film) sound track’ (Chion and Murch, 2019c, p. 33). Inspired by this, we propose the EoPMA method, which combines multiple adapters trained with different perception mode labels. Figure 6 shows the structure of a classic audio–video adapter and the EoPMA model. Since no speech data are involved in the AVE or SoundActions datasets, we include two modes of adapters: the causal perception adapters that are trained on causal labels in the AVE dataset, and the reduced perception adapters that are trained on reduced labels in the SoundActions dataset. After each adapter layer, the audio and/or video embeddings from the two adapters are combined and fed into the next transformer layer, depending on the modality setup.

Figure 6

Left: Classic audio‑visual adapter structure used by LAVisH (Lin et al., 2023) and DG‑SCT (Duan et al., 2023). Right: Ensemble of Perception Mode Adapters (EoPMA model). The causal adapters are trained on the AVE dataset (Tian et al., 2018); the reduced adapters are trained on reduced labels in the SoundActions dataset.

There are two benefits of using the adapter structure for mixing the perception modes: (1) compared to fine‑tuning the entire audio–video backbone, the parameter‑efficient nature of the adapters makes training more efficient and reduces overfitting on small datasets such as SoundActions; and (2) compared to only adjusting the final embeddings, which cannot modify low‑level features in the shallow layers, inserting adapters throughout all backbone layers allows the model to attend to reduced‑type features at all abstraction levels.

Similar to Duan et al. (2023), we define the audio input of layer $l$ as $a^{(l)} \in \mathbb{R}^{T \times (L^{(l)} \cdot F^{(l)}) \times C_a^{(l)}}$ and the video input of layer $l$ as $v^{(l)} \in \mathbb{R}^{T \times (H^{(l)} \cdot W^{(l)}) \times C_v^{(l)}}$, where $T$ and $C$ denote the timestamps and channels, $L$ and $F$ denote the length of each time frame and the frequencies of the audio spectrogram, and $H$ and $W$ denote the height and width of the video frame. We then denote the causal adapter blocks as $\Omega^{c}_{v2a}(v^{(l)}, a^{(l)})$ and $\Omega^{c}_{a2v}(a^{(l)}, v^{(l)})$, where $\Omega^{c}_{v2a}$ is the video‑to‑audio attention block and $\Omega^{c}_{a2v}$ is the audio‑to‑video attention block. Analogously, we define the reduced adapter blocks as $\Omega^{r}_{v2a}(v^{(l)}, a^{(l)})$ and $\Omega^{r}_{a2v}(a^{(l)}, v^{(l)})$.

The operations of the causal adapters in each layer can be written as:

$$a_{fc}^{(l)} = \Omega^{c}_{v2a}(v^{(l)}, a^{(l)}), \qquad v_{fc}^{(l)} = \Omega^{c}_{a2v}(a^{(l)}, v^{(l)}) \tag{1}$$

where $a_{fc}^{(l)}$ denotes the causal audio attention of the first attention block and $v_{fc}^{(l)}$ the corresponding video attention. Analogously, we obtain the reduced attentions $a_{fr}^{(l)}$ and $v_{fr}^{(l)}$ by using the reduced adapters. The attentions from the two perception‑mode adapters are then combined using the hyperparameter $\beta \in [0, 1]$, which controls the ratio between the causal and reduced embeddings:

$$a_{y}^{(l)} = a^{(l)} + \mathrm{MHA}(a^{(l)}) + (1-\beta)\,a_{fc}^{(l)} + \beta\,a_{fr}^{(l)}, \qquad v_{y}^{(l)} = v^{(l)} + \mathrm{MHA}(v^{(l)}) + (1-\beta)\,v_{fc}^{(l)} + \beta\,v_{fr}^{(l)} \tag{2}$$

where $a_{y}^{(l)}$ is the interim audio embedding after the first attention block and $v_{y}^{(l)}$ the interim video embedding. Within the same layer $l$, the process is then repeated with the second group of causal attention blocks $\Omega^{yc}_{v2a}$, $\Omega^{yc}_{a2v}$ and reduced attention blocks $\Omega^{yr}_{v2a}$, $\Omega^{yr}_{a2v}$:

$$a_{yfc}^{(l)} = \Omega^{yc}_{v2a}(v_{y}^{(l)}, a_{y}^{(l)}), \qquad v_{yfc}^{(l)} = \Omega^{yc}_{a2v}(a_{y}^{(l)}, v_{y}^{(l)}) \tag{3}$$

where $a_{yfc}^{(l)}$ and $v_{yfc}^{(l)}$ are the causal audio and causal video attentions. The reduced attentions $a_{yfr}^{(l)}$ and $v_{yfr}^{(l)}$ are obtained analogously. Finally, we obtain the output audio and video embeddings of the layer, $a^{(l+1)}$ and $v^{(l+1)}$:

$$a^{(l+1)} = a_{y}^{(l)} + \mathrm{MLP}(a_{y}^{(l)}) + (1-\beta)\,a_{yfc}^{(l)} + \beta\,a_{yfr}^{(l)}, \qquad v^{(l+1)} = v_{y}^{(l)} + \mathrm{MLP}(v_{y}^{(l)}) + (1-\beta)\,v_{yfc}^{(l)} + \beta\,v_{yfr}^{(l)} \tag{4}$$

where MHA and MLP refer to multi‑head attention and multi‑layer perceptron, respectively.
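The following is a simplified sketch of the β-weighted mixing in Equation 2; the same pattern repeats for Equations 3 and 4. The cross-modal adapter blocks are generic stand-ins rather than the actual DG-SCT attention modules, and the token shapes are illustrative.

```python
# Beta-weighted ensemble of a causal and a reduced adapter (cf. Equation 2).
import torch
import torch.nn as nn


class CrossModalAdapter(nn.Module):
    """Generic stand-in for a cross-modal attention block Ω(source, target)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, source, target):
        # The target modality's tokens attend to the source modality's tokens.
        out, _ = self.attn(target, source, source)
        return out


class EoPMALayer(nn.Module):
    def __init__(self, dim, beta=0.01):
        super().__init__()
        self.beta = beta
        self.self_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.causal_v2a = CrossModalAdapter(dim)    # stands in for the AVE-trained (causal) adapter
        self.causal_a2v = CrossModalAdapter(dim)
        self.reduced_v2a = CrossModalAdapter(dim)   # stands in for the SoundActions (reduced) adapter
        self.reduced_a2v = CrossModalAdapter(dim)

    def forward(self, a, v):
        mha_a, _ = self.self_attn(a, a, a)
        mha_v, _ = self.self_attn(v, v, v)
        # Backbone path plus (1 - beta) * causal term + beta * reduced term.
        a_y = a + mha_a + (1 - self.beta) * self.causal_v2a(v, a) + self.beta * self.reduced_v2a(v, a)
        v_y = v + mha_v + (1 - self.beta) * self.causal_a2v(a, v) + self.beta * self.reduced_a2v(a, v)
        return a_y, v_y


# Illustrative token shapes: (batch, tokens, dim) for each modality.
layer = EoPMALayer(dim=64, beta=0.01)
a_out, v_out = layer(torch.randn(2, 50, 64), torch.randn(2, 49, 64))
print(a_out.shape, v_out.shape)
```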

In the inference tests for EoPMA, we used the pre‑trained adapters from the original DG‑SCT checkpoints on the AVE dataset as the causal perception adapters and the SoundActions fine‑tuned adapters (groups C and D in Table 3) as reduced perception adapters. The SoundActions fine‑tuned adapters with the highest validation accuracy within each group are used for the ensemble. No further training is required for the EoPMA inference test. Instead, we tested three variables: the hyperparameter β, the SoundActions adapters from different fine‑tuning setups as in Table 3, and the modality combinations in the ensemble process. We tested the EoPMA on the validation set of the AVE dataset and compared the accuracy with the classic audio–video adapters, including LAVisH (Lin et al., 2023) and DG‑SCT (Duan et al., 2023).

Table 3

Five‑fold fine‑tuning results of DG‑SCT with EoPMA on SoundActions.

Fine‑Tune Type | Fine‑Tune Modality | Validation Modality | PerceptionType Group | PerceptionType Accuracy: Mean (Each Fold) | Enjoyability Group | Enjoyability Accuracy: Mean (Each Fold)
cls | av | av | A1 | 0.54 (0.54, 0.57, 0.55, 0.51, 0.51) | B1 | 0.58 (0.57, 0.55, 0.58, 0.57, 0.60)
cls | a | a | A2 | 0.55 (0.53, 0.54, 0.56, 0.58, 0.55) | B2 | 0.57 (0.60, 0.58, 0.56, 0.56, 0.59)
cls | v | v | A3 | 0.44 (0.45, 0.42, 0.51, 0.38, 0.45) | B3 | 0.57 (0.56, 0.57, 0.58, 0.56, 0.58)
cls | av | a | A4 | 0.35 (0.34, 0.35, 0.34, 0.36, 0.37) | B4 | 0.56 (0.55, 0.55, 0.56, 0.57, 0.59)
cls | av | v | A5 | 0.43 (0.41, 0.45, 0.48, 0.37, 0.44) | B5 | 0.57 (0.56, 0.58, 0.58, 0.55, 0.58)
all | av | av | C1 | 0.60 (0.61, 0.59, 0.62, 0.60, 0.59) | D1 | 0.59 (0.63, 0.55, 0.58, 0.56, 0.62)
all | a | a | C2 | 0.66 (0.62, 0.66, 0.67, 0.66, 0.68) | D2 | 0.56 (0.55, 0.55, 0.59, 0.55, 0.57)
all | v | v | C3 | 0.48 (0.42, 0.46, 0.48, 0.49, 0.54) | D3 | 0.57 (0.60, 0.55, 0.58, 0.55, 0.56)
all | av | a | C4 | 0.53 (0.43, 0.55, 0.58, 0.52, 0.55) | D4 | 0.58 (0.59, 0.59, 0.56, 0.60, 0.56)
all | av | v | C5 | 0.45 (0.42, 0.41, 0.51, 0.47, 0.44) | D5 | 0.56 (0.56, 0.55, 0.56, 0.55, 0.56)

Four comparison groups are added to the experiment: (1) the original DG‑SCT adapters trained on the AVE dataset; (2) using only the reduced‑tuned model to validate on the AVE dataset; (3) an ensemble of the final embeddings of the reduced‑tuned model and the original AVE‑trained model; and (4) an ensemble of two original AVE‑trained models trained with different random seeds, following the method of Kwok et al. (2024). The hypothesis is that group 2 will have lower accuracy than group 1 because it is trained on a different task; groups 3 and 4 should perform better than group 1; and, finally, the EoPMA method should perform better than all the comparison groups.

There are two main differences between EoPMA and previous works on PEFT ensembles (Kim et al., 2021; Kwok et al., 2024; Lin et al., 2024). First, even though the two perception mode adapters in EoPMA have the same structure, they are not trained on the same dataset with merely different random seeds, as in Kwok et al. (2024); instead, the perception mode adapters are trained with different perception mode labels. Second, the ensemble method used by Kwok et al. (2024) is a post‑processing step that only combines the last few layers of the models, whereas in EoPMA the embeddings are merged at every layer.

4 Results and Analysis

4.1 Reduced‑Type fine‑tuning results

Table 3 lists the results of all experiment groups. As discussed in Section 2.5, we analyze the results with auditory–visual theory from three aspects: the ability of the models to recognize reduced‑type labels, the level of synchresis in the models, and the added value and modality imbalance of the models. Results of ‘cls’ fine‑tuning demonstrate the model’s ability to recognize reduced‑type labels after being pre‑trained on causal labels from AudioSet and AVE. The accuracy of groups A1 and B1 shows that the model has learned the general properties of audio–video data in the original dataset.

Comparing the performance of different fine‑tuning types (A1 vs. C1, B1 vs. D1), we see that fine‑tuning all adapters (‘all’) gives better performance than only fine‑tuning the classifier (‘cls’), since fine‑tuning all adapters allows the model to adjust low‑level information in the shallow layers. While the validation accuracies shown in Table 3 do not differ significantly, we notice that the training accuracy of the ‘cls’ groups is much lower than that of the ‘all’ groups. This means that while both group types suffer from overfitting due to the small dataset, the ‘all’ groups generalize better.

The results within each group (e.g., A1 to A5) show the modality imbalance under the same fine‑tuning type and labels. Both groups using ‘PerceptionType’ labels have a significant performance drop when modalities are withdrawn in either fine‑tuning or validation. Fine‑tuning with only one modality (e.g., A2 and A3) shows the best performance achievable with that single modality, while withdrawing one modality during validation (e.g., A4 and A5) reveals the model’s reliance on each modality when both are provided during fine‑tuning. The model performs worse in the second case, which suggests that, when fine‑tuned with both modalities, the model has learned to fuse the information from both, resembling the synchresis phenomenon in human perception mentioned in Section 2.1.3.

Finally, comparing the performance on the two different labels (groups A & C vs. B & D), we observed two phenomena: (1) when using ‘cls’ fine‑tuning (A1 vs. B1), the performance on ‘PerceptionType’ is lower than on ‘Enjoyability’; however, when using ‘all’ fine‑tuning (C1 vs. D1), ‘PerceptionType’ performs slightly better; (2) the performance of the ‘PerceptionType’ groups (A, C) changes more drastically across different modality combinations. Observation 1 suggests that the ‘Enjoyability’ labels are similar to the original labels used for pre‑training, indicating that they contain more causal‑type information. In contrast, the ‘PerceptionType’ labels are less similar and contain more reduced‑type information that the model did not learn during pre‑training. Observation 2 suggests that the ‘PerceptionType’ labels require a higher level of synchresis and rely more on audio information than on video (C2 vs. C3, C4 vs. C5).

Figure 7 shows a visualization of dimension‑reduced embedding spaces of different modalities, calculated using principal component analysis (Pearson, 1901); a minimal sketch of this projection is given after Figure 7. When comparing PT4–PT5 and EJ4–EJ5, we see that, when only one modality is provided during fine‑tuning, the video groups (PT5, EJ5) perform significantly better than the audio groups (PT4, EJ4). This shows that the labels correlate more with the video information, which means the participants relied more on the video than on the audio when labeling. However, when both audio and video are provided during fine‑tuning (PT1–3, EJ1–3), instead of relying on the easier video embeddings (PT3, EJ3), the audio embeddings contribute more to the separation (PT2, EJ2). This result of the ML model is similar to the added value concept in Chion’s theory (Chion and Murch, 2019a).

Figure 7

Qualitative principal component analysis visualization of the embedding spaces of different modality setups and tasks.
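As referenced above, the following is a minimal sketch of such a projection using scikit-learn's PCA; the embeddings and labels are random placeholders standing in for the model's clip-level embeddings.

```python
# Project clip-level embeddings to two principal components and color by label.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(365, 256))     # placeholder clip embeddings
labels = rng.integers(0, 3, size=365)        # placeholder PerceptionType labels

proj = PCA(n_components=2).fit_transform(embeddings)   # dimension reduction
for c, name in enumerate(["impulsive", "sustained", "iterative"]):
    pts = proj[labels == c]
    plt.scatter(pts[:, 0], pts[:, 1], s=8, label=name)
plt.legend()
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.savefig("embedding_pca.png", dpi=150)
```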

4.2 EoPMA results

Table 4 shows the validation accuracy of applying different ensemble methods to the AVE dataset. As a comparison, the SoundActions fine‑tuned models with the original AVE classifier do not perform well on the AVE evaluation, since the label sets of the two datasets focus on different perception modes. When ensembling the output embeddings before the classifier, the performance is slightly improved compared to the original model trained on AVE. However, the ensemble of two AVE‑trained models with different random seeds is not better, matching the findings of Kwok et al. (2024). Meanwhile, the EoPMA method’s accuracy is higher than when using only different random seeds, and this holds for both reduced‑type labels. This suggests that the reduced perception adapters extracted additional information from the same audio–video input that helped the causal perception adapters, even though the AVE task is causal. The PerceptionType labels are more helpful than the Enjoyability labels, which might be due to their better agreement level and label quality (Table 2).

Table 4

Comparison of AVE validation accuracy for different ensemble methods, data, and labels. EoPMA: Ensemble of Perception Mode Adapters. The row ‘(Original DG‑SCT)’ is our re‑evaluation of the officially provided DG‑SCT AVE checkpoint.

Ensemble Method | Ensemble Data | Fine‑Tuned SoundActions Label | Validation Accuracy on AVE (Mean and Each Run)
(Original DG‑SCT) | – | – | 81.67
Ensemble of adapters | AVE (different random seed) | – | 81.79
Ensemble of final embedding | SoundActions | PerceptionType | 81.81 (81.79, 81.82, 81.84)
Ensemble of final embedding | SoundActions | Enjoyability | 81.80 (81.77, 81.82, 81.78)
EoPMA | SoundActions | PerceptionType | 81.95 (82.09, 81.92, 81.87)
EoPMA | SoundActions | Enjoyability | 81.88 (81.84, 81.92, 81.89)

Figure 8 plots the accuracy change along different β values. The range of β values that improves the accuracy on the AVE validation set is between 0.001 and 0.027, which is relatively small since the task is still causal. Nonetheless, within this range of β values, the improvement is consistent across all setups of reduced labels and modality combinations. Resonating with the observations in Figure 7, we also find that, while PT_a_a (trained on audio only and ensembling audio embeddings only, similar to PT4) performs worse than PT_v_v (trained on video only and ensembling video embeddings only, similar to PT5), the audio–video model’s audio embedding PT_av_a (similar to PT2) performs better than PT_av_v (similar to PT3). This again shows the similarity between the modality imbalance in the DG‑SCT model and the concept of added value. The improvement in accuracy might seem small, but it is achieved with only 365 samples in the SoundActions dataset, compared to the 4,143 samples in the original AVE dataset. This shows the potential of reduced‑type labels to yield further improvements if the dataset can be extended in the future.

Figure 8

Accuracy of the original DG‑SCT model (baseline), an ensemble of two DG‑SCT models with different random seeds (ensemble of baselines), and the EoPMA models (PT for PerceptionType/EJ for Enjoyability, fine‑tuned modality, ensembling modality). β is the hyperparameter of EoPMA defined in Equations 2 and 4. Accuracies are calculated on the validation set of AVE.

5 Conclusions and Limitations

This article attempts to bridge the gap between phenomenological sound theory, auditory–visual theory, and audio–video‑based ML. First, we reviewed similarities and differences between phenomenological, psychological, and technological approaches, considering their historical and methodological developments, addressing gaps, and proposing solutions to bridge them. Second, we introduced the SoundActions dataset that addresses the lack of reduced‑type labels in existing audio–video datasets for ML. Third, we performed a fine‑tuning experiment on the DG‑SCT model with different fine‑tuning ranges, modality combinations, and reduced‑type labels. The fine‑tuning results demonstrated its modality‑imbalance behaviors, akin to the added‑value concept in the auditory–visual theory of Chion and Murch (2019a). Fourth, we proposed the EoPMA method, inspired by Pierre Schaeffer’s theory of three listening modes (Schaeffer et al., 1967). The EoPMA improves DG‑SCT’s performance on the AVE dataset by providing reduced‑type information to the model despite the AVE task focusing on causal descriptors. We believe the EoPMA method shows the potential of reduced‑type labels in further improving existing tasks if the dataset can be extended in the future.

Hand‑crafting and manually annotating an audio–video dataset is a laborious endeavor, yet it can result in a higher ‘signal‑to‑noise ratio’ than scraping‑based data‑collection approaches. The downside is that the proposed SoundActions dataset is small for modern‑day transformer models, making it difficult to generalize the results. Also, the tagging process was conducted on a small scale with only three participants, which might bias the labels toward certain groups. Still, we believe our approach is relevant in the context of the arts and humanities, where musicologists or electroacoustic composers and performers often work individually on crafting datasets. Finally, due to computational constraints, the fine‑tuning and EoPMA experiments were only conducted on the DG‑SCT model, so the findings might not generalize to other model structures or pre‑training datasets. In the future, we would like to integrate auditory–visual theory more closely into the structure, loss computation, or training method of audio–video neural networks to improve their performance in traditional or new theory‑inspired audio–video tasks. We are also looking for methods to extend the reduced‑type dataset both in size and in label types, for example, through a crowd‑sourced platform for openly labeling SoundActions samples. Finally, we hope to see more multidisciplinary work at the intersection of audio–video ML and auditory–visual theory, as we believe it would be a win–win for both fields.

Authors’ Contributions

JG: Conceptualization, Data curation, Model and experiment design, Analysis, Writing—original draft, Writing—review & editing. JT: Supervision, Writing—review & editing. ARJ: Conceptualization, Investigation, Data collection, Funding acquisition, Supervision, Writing—original draft, Writing—review & editing.

Funding Information

The Research Council of Norway supported this study through projects 262762 (RITMO), 324003 (AMBIENT), and 322364 (fourMs).

Competing Interests

ARJ is a board member of the scientific committee of the Sound and Music Computing conference and was the lead of the International Conference on New Interfaces for Musical Expression from 2011 to 2022.

Notes

[1] The SoundActions dataset: https://osf.io/3z65b/.

[2] Detailed result sheets and code for EoPMA: https://github.com/fisheggg/soundactions.

DOI: https://doi.org/10.5334/tismir.223 | Journal eISSN: 2514-3298
Language: English
Submitted on: Sep 1, 2024
Accepted on: Sep 17, 2025
Published on: Mar 11, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Jinyue Guo, Jim Tørresen, Alexander Refsum Jensenius, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.