
Predicting Perceived Semantic Expression of Functional Sounds Using Unsupervised Feature Extraction and Ensemble Learning

Open Access
Mar 2026

1 Introduction

Functional sounds, typically brief audio cues designed to facilitate interaction with electronic devices, are a ubiquitous component of contemporary user experience across smartphones, cars, and smart home technologies. Although these sounds are commonly associated with utilitarian signaling, such as notifications, alerts, or system prompts (Hermann et al., 2011), they are increasingly crafted with careful attention to aesthetic detail. This involves the use of acoustic parameters, such as timbre, pitch, loudness, and duration, to evoke intended perceptions and behaviors in users. The importance of functional sounds in human–computer interaction, and how their properties shape meaning, has been explored extensively by Serafin et al. (2022). Despite their prevalence, functional sounds remain largely unexplored in music information retrieval (MIR) and digital musicology, particularly with respect to how listeners interpret their communicative expression.

This study shows that functional sounds, though not musical in the traditional sense, have musical structures that can be meaningfully analyzed and modeled using MIR and machine learning techniques. Iconic examples, such as the Windows startup chime, the PlayStation 2 boot sequence, and the Twitter notification sound, demonstrate how short audio signals can be immediately recognizable, convey brand identity, and communicate intent clearly through carefully designed sonic features. These examples illustrate how musical attributes are leveraged in non‑musical contexts to facilitate interpretation, memorability, and user response (Knoeferle, 2012; Mas et al., 2021).

Previous research in MIR has primarily focused on music, speech, and environmental sounds, with functional sounds receiving relatively little attention. One notable exception is Anzenbacher et al. (2017), who used MIR techniques to link acoustic properties of audio logos to their industry or semantic category. Similarly, although the auditory dimensions of user interaction have been examined in the fields of human–computer interaction (Rocchesso et al., 2019), sound design (Özcan and van Egmond, 2012), and sound branding research (Graakjær and Bonde, 2018), existing studies often lack a theoretical basis grounded in computational musicology and communication theory. This study aims to address this gap by investigating whether high‑level audio features extracted from functional sounds can predict how listeners perceive meaning along the expressive dimensions of functional sound communication.

To this end, we present a three‑step methodology. First, we extract high‑level features using signal processing methods from MIR and unsupervised machine learning. Second, we train a regression model to predict the perceived expression of functional sounds. Finally, we validate the predictions of the best‑performing model in a listening experiment. We use the FBMSet‑805 dataset from Virkus et al. (2025a) as our ground truth, consisting of 805 functional sounds annotated using the FBMUX framework, along with measurement instruments designed to capture functional and brand‑related perceived semantic expression. Our goal is to predict the perception of functional sounds using our understanding of how auditory cues convey complex semantic messages. We divide this work into three substudies (Figure 1), answering the following research questions:

Figure 1

Schematic overview of methodological steps.

  1. Feature Extraction: Which timbre, chroma, and loudness high‑level features can be identified as a dense representation of functional sounds?

  2. Prediction: To what extent can perceived semantic expression in functional sounds—measured along the FBMUX dimensions—be predicted and explained using audio features?

  3. Validation: How do predictions of the best‑performing model compare to user perceptions?

2 Related Work

Decades of psychoacoustic research (Aures, 1985; von Bismarck, 1974) have shown that perceptual attributes such as sharpness and roughness reliably correspond to spectral–temporal sound properties. Applied studies in sound branding and warning‑sound design (Anzenbacher et al., 2017; Fastl et al., 2007; Menzel et al., 2011) further demonstrate that these attributes shape semantic interpretation and listener response. This literature provides a foundation for modeling semantic qualities of functional sounds using psychoacoustic and MIR‑based features.

2.1 High‑level feature extraction with topic modelling

In MIR, feature extraction compresses audio waveforms into information‑rich representations suitable for statistical and machine learning analysis. Early approaches relied on low‑level descriptors like Mel‑Frequency Cepstral Coefficients (MFCCs) and chroma features, supported by toolkits such as MIR Toolbox (Lartillot et al., 2008), Marsyas (Tzanetakis and Cook, 2000), and PsySound3 (Cabrera et al., 2008). Initial studies, e.g., Tzanetakis and Cook (2002), used principal component analysis on rhythm, timbre, and chroma features for genre classification, while Aucouturier and Pachet (2004) introduced Gaussian mixture models (GMMs) to statistically summarize MFCC distributions as scalable timbre abstractions.

The evolution to latent feature modeling employed probabilistic generative methods like latent Dirichlet allocation (LDA) to learn higher‑level audio representations. Hu and Saul (2009) applied LDA to pitch‑class profiles in symbolic and audio music, while Kim et al. (2012) extended this to sound effect classification by treating clustered MFCC vectors as 'acoustic words.' These works established topic models as effective tools for extracting semantically meaningful audio dimensions.

A particularly influential application of this modeling paradigm is the work by Mauch et al. (2015), who applied a two‑stage statistical process to extract high‑level semantic features from low‑level audio descriptors for large‑scale music analysis. Their method involved: (1) extracting timbre (12 MFCCs, zero‑crossing rate) and harmony (12‑bin chromagram) features, (2) clustering frames via GMM into timbre and harmony clusters, (3) aggregating these into song‑level histograms, and (4) applying LDA to infer topic distributions over timbre and harmony. This enabled the quantification of abstract musical traits directly from audio and informs the approach taken in the present study.

2.2 Prediction of perceived semantic audio expression

Once high‑level features for longer audio segments have been extracted, they can be applied to a variety of MIR tasks. One key application is predicting perceptual or semantic dimensions from audio, such as emotion, mood, or brand identity.

The MIREX 2007 automatic mood classification task aimed to predict discrete mood categories from audio. Initial models using support vector machines and low‑level descriptors (e.g., MFCCs, chroma) performed poorly due to small datasets and the absence of high‑level representations (Peeters, 2008). Subsequent approaches leveraged Gaussian super vectors and GMMs (Cao and Li, 2009; Tardieu et al., 2011), followed by Convolutional Neural Networks (CNNs) and Long Short‑Term Memory (LSTM) models trained on spectrograms. However, even deep learning models rarely surpassed traditional baselines, with top accuracies around 61% (Bian, 2018; Song et al., 2018).

Dimensional emotion recognition reframes emotion prediction as regression in a continuous valence–arousal space (Russell, 1980). Early models used linear and tree‑based regressors with low‑level features (Yang et al., 2008), later extended by GMM‑based models (Wang et al., 2012) and ensemble methods (Cai et al., 2022), reaching moderate performance (e.g., R2=0.64 for arousal). Deep learning improved temporal modeling but remained limited by small datasets (Kang and Herremans, 2024).

The ABC_DJ project,1 which aimed to predict perceived semantics in the context of music branding (Herzog et al., 2016), forms a methodological and conceptual foundation for the present study. Through multiple iterations, this work resulted in the construction of measurement instruments for semantic expression in popular music excerpts across four to five latent dimensions: Arousal, Valence, Authenticity, Timeliness, and Eroticity. Respective ground truth data for training and evaluating machine learning models was collected. Prediction models for semantic attributes such as “Easy‑going” and “Authentic” could be created with moderate to high explained variance (R2=0.22 to 0.74), particularly when using random forest (RF) and linear hierarchical stepwise regression, based on both expert‑driven machine learning classifications and algorithmic MIR features (Lepa et al., 2020a, 2020b).

Based on the concept of sound as a stable, nonverbal sign carrier, recent research has extended these methodologies to the domain of functional sounds. The work of Virkus et al. (2025c) and Virkus et al. (2025b) investigates three distinct communication levels—status, appeal, and brand identity—by adapting the four‑sided communication model from Schulz von Thun (1981). The resulting measurement instruments define seven dimensions for status, five for appeal, and seven for brand identity of functional sounds, with each dimension represented by three questionnaire items. The full list of dimensions for status, appeal, and brand identity can be found in Supplement A, and the annotation procedure is described in more detail in Section 4.1. The FBMUX framework (Functional and Brand Meaning of User Experience Sounds) comprises the measurement instruments developed by Virkus et al. (2025c) and Virkus et al. (2025b). Building on this framework, Virkus et al. (2025a) created the FBMSet‑805 dataset, which consists of 805 functional sounds from seven industry sectors annotated using FBMUX.

2.3 The challenge with ground truth on sound semantics

Modeling perceptual attributes in music is fundamentally limited by the absence of objective ground truth. As Flexer et al. (2021) argue, musical meaning is inherently subjective and context‑dependent, emerging only through human perception. Prediction tasks in perceived semantic expression must therefore respect the limits of human agreement.

This subjectivity imposes an upper bound on model performance: no system can exceed the consistency of human judgments. Prior MIR evaluations—e.g., in MIREX similarity tasks—demonstrate that inter‑rater agreement constrains achievable accuracy (Flexer, 2014; Schedl et al., 2013).

Perceptually informed evaluation frameworks address this by benchmarking model outputs against measures of human agreement (e.g., Cronbach's alpha, inter‑rater correlations). Studies such as those published by Flexer (2014) and Friberg et al. (2014) show that MIR features grounded in auditory perception outperform generic, mathematically defined descriptors by better aligning with subjective ratings.
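As a concrete illustration of one such agreement measure, Cronbach's alpha can be computed from a respondents-by-items rating matrix. The following is a minimal sketch of the classical formula on toy data, not the exact procedure used by the cited studies:

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for a (n_respondents, n_items) rating matrix."""
    k = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1)      # variance of each item
    total_var = ratings.sum(axis=1).var(ddof=1)  # variance of the summed scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Five respondents rating three perfectly consistent items yield alpha = 1.
consistent = np.tile(np.arange(5.0), (3, 1)).T
alpha = cronbach_alpha(consistent)
```

In practice such agreement scores serve as the ceiling against which model accuracy is judged.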

This justifies the present study's use of perceptual consistency as a reference point for model evaluation.

3 Study 1: Feature Extraction

The first substudy focuses on implementing a feature‑extraction process, applying unsupervised machine learning on the low‑level MIR descriptors (timbre, chroma, and loudness) of the FBMSet‑805 sounds and analyzing the distribution of the resulting high‑level representations.

3.1 Methods

3.1.1 Data

This substudy builds on the FBMSet‑805 ground truth dataset from Virkus et al. (2025a), which contains 805 functional sounds annotated with perceived expression across the FBMUX dimensions and categorized by industry domain (Apps, Consumer Electronics, Home Devices, Health Devices, Future Industry, Mobility, or Operating Systems). For the following section, only the 805 sounds are relevant, and no modifications were made to the raw audio files.

3.1.2 Feature extraction framework

As discussed in Section 2.1, attaining musically informed features through an unsupervised feature‑extraction approach appears to be the most viable solution for the novel prediction task with small datasets. Although we add one acoustic property (loudness), we base our work mainly on that of Mauch et al. (2015), extracting topics from timbre, chroma, and loudness using the same two‑step procedure:

  1. Low‑level descriptors: Signal‑processing techniques are applied to compute audio descriptors from the waveform.

  2. Topic modelling: Unsupervised learning is applied in a two‑stage process (frame‑wise and sample‑wise) to derive topic distributions from the extracted descriptors.

The resulting topic distribution captures structural and perceptual characteristics for each audio property. General signal processing and data handling were performed in Python using Scikit‑learn (Pedregosa et al., 2011), Numpy (Harris et al., 2020), and Pandas frameworks (McKinney et al., 2010).
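The two‑step procedure above can be sketched end to end. The following minimal illustration runs on synthetic frame features; the function name, toy dimensions, and fixed cluster/topic counts are our own simplifications (the study selects these counts via BIC and coherence, as described below):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.mixture import GaussianMixture

def extract_topics(frame_features, n_clusters, n_topics, seed=0):
    """Two-stage sketch: frame-wise clustering, then sample-wise topic modelling.

    frame_features: list of (n_frames_i, n_dims) arrays, one per sound.
    Returns an (n_sounds, n_topics) topic-distribution matrix.
    """
    stacked = np.vstack(frame_features)
    # Stage 1: cluster all frames across the dataset (a GMM stands in for the
    # BIC-selected model described in the text).
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed).fit(stacked)
    # Stage 2: one cluster histogram per sound, then LDA topic distributions.
    histograms = np.array([
        np.bincount(gmm.predict(f), minlength=n_clusters) for f in frame_features
    ])
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    return lda.fit_transform(histograms)

# Toy usage with random "frames" standing in for MIR descriptors.
rng = np.random.default_rng(0)
frames = [rng.normal(size=(int(rng.integers(20, 40)), 5)) for _ in range(10)]
topics = extract_topics(frames, n_clusters=4, n_topics=3)
```

Each row of the result is a per-sound probability distribution over latent topics, which is the dense representation used as model input later on.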

Timbre

Using the librosa library (McFee et al., 2015), we extract two low‑level MIR descriptors for 93‑ms frames with a frame size of 2048 and 75% overlap at a sample rate of 22050 Hz: zero‑crossing rate and 24 MFCCs. We select 24 MFCCs because preliminary GMM model‑selection tests showed that higher‑order coefficients yield more stable and better‑separated timbral clusters than the standard 12–13 MFCCs. The normalized features across all frames are concatenated into a 25‑dimensional timbre vector that captures spectral and temporal characteristics.

The topic feature‑extraction process begins by identifying timbre similarities across all frames. First, we apply principal component analysis to all frame‑wise timbre feature vectors to reduce dimensionality, retaining only components with eigenvalues ≥ 1 (Kaiser criterion), and normalize afterwards. As defined by Kaiser (1960), the criterion selects components explaining at least as much variance as a single original feature, ensuring preservation of relevant information. Next, we fit a GMM using the reduced frame‑wise timbre vectors, selecting the optimal number of GMM components via the Bayesian information criterion, which balances model fit and complexity (Schwarz, 1978). This prevents overfitting while capturing meaningful timbre structure. Each frame is then assigned to the timbre cluster for which it has the highest membership probability.

The second step aggregates frame‑level timbre clusters into sound‑level topics describing full samples of functional sounds. To do so, frame‑level timbre clusters are aggregated into sound‑level histograms, representing the distribution of timbre across each sound sample. LDA then models co‑occurrence patterns to extract latent timbre topics. We employ a grid search to optimize the number of topics based on the coherence measure, which quantifies topic consistency (Řehůřek and Sojka, 2010).
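The two unsupervised stages applied to the frame‑wise timbre vectors (Kaiser‑criterion PCA followed by BIC‑based GMM model selection) can be illustrated on synthetic data. The injected cluster structure and the tested component range are our own toy assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 25))  # stand-in for 25-dim frame-wise timbre vectors
X[:, :3] += rng.choice([-4.0, 4.0], size=(500, 1))  # inject cluster structure

# Kaiser criterion: on standardized data, keep components whose eigenvalue is
# at least 1, i.e. they explain at least as much variance as one feature.
Xs = StandardScaler().fit_transform(X)
pca = PCA().fit(Xs)
n_keep = int(np.sum(pca.explained_variance_ >= 1.0))
X_red = StandardScaler().fit_transform(pca.transform(Xs)[:, :n_keep])

# BIC-based model selection over the number of GMM components.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X_red).bic(X_red)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
labels = GaussianMixture(n_components=best_k, random_state=0).fit_predict(X_red)
```

The frame-to-cluster assignments in `labels` correspond to the highest-probability memberships described in the text.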

Chroma

Chroma descriptors are extracted via librosa as 12‑dimensional chromagrams per frame and temporally aligned with the timbre features. The dominant pitch class (or silence) of each frame is identified by applying a 3 × 3 median filter and thresholding values below 0.8. The pitch indices (0–11) are then transposed to align the dominant class to 0. This yields a sequence of relative chroma indices for each sound sample, with notes in class 0 referred to as Base Notes and −1 marking low‑energy frames. This abstraction discards absolute tonality in the sense of Western harmonic rules, focusing instead on relational pitch structure. LDA then uncovers topics with similar chroma structures from sample‑wise relative pitch distributions.
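The smoothing, thresholding, and transposition steps might look like the following sketch; `relative_chroma` is a hypothetical helper of our own, not code from the study:

```python
import numpy as np
from scipy.ndimage import median_filter

def relative_chroma(chromagram, threshold=0.8):
    """Map a (12, n_frames) chromagram to relative pitch-class indices.

    Frames whose strongest class falls below `threshold` after 3x3 median
    smoothing are marked -1 (low energy); all other indices are rotated so
    the globally dominant class becomes 0 (the "Base Note").
    """
    smoothed = median_filter(chromagram, size=3)
    peaks = smoothed.argmax(axis=0)
    low_energy = smoothed.max(axis=0) < threshold
    dominant = np.bincount(peaks[~low_energy], minlength=12).argmax()
    rel = (peaks - dominant) % 12
    rel[low_energy] = -1
    return rel

# A constant, full-energy chromagram maps every frame to the Base Note (0).
rel = relative_chroma(np.ones((12, 10)))
```

Because the indices are relative, two sounds playing the same interval pattern in different keys produce identical sequences.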

Loudness

We extract perceived loudness using the Zwicker method (ISO 532‑1:2017) via the Mosqito package (Green Forge Coop, 2024). This method accounts for the ear's nonlinear sensitivity and temporal masking. Due to package constraints, the loudness analysis is performed at 48 kHz, then down‑sampled to match the timbre and chroma frame rates and normalized. Frame‑level loudness is aggregated using a GMM, with the number of clusters optimized via the Bayesian information criterion. Cluster sequences are aggregated into histograms and then classified using LDA, yielding loudness topics that represent dynamic articulatory similarities across sounds. Timbre, chroma, and loudness topic distributions are then used in combination for the subsequent prediction tasks in terms of integrating spectral, harmonic, and dynamic information.

3.1.3 Interpretation of topic distributions

To interpret and label the topics found through unsupervised learning and classification with LDA, we employ (1) analysis of LDA component distributions to identify prototype histograms and (2) inspection of representative sounds per topic, i.e., those with the greatest probabilities in a single topic. We consulted with two sound designers to support this process (see Acknowledgments). The final topic labels derived in the process reflect different acoustic patterns within the timbre, chroma, and loudness domain. To evaluate the topic distributions, we averaged topic probabilities over all samples within each timbre, chroma, and loudness topic.

3.2 Results

Feature extraction resulted in 48 timbre clusters and 16 timbre topics, 38 loudness clusters and 8 loudness topics, and 14 chroma topics. Model‑selection details are presented in Supplements B and C (Figures 1, 2, and 3); prototypical LDA components providing the basis for topic naming are shown in Supplement D.

Figure 2

Mean topic probabilities across the FBMSet‑805 dataset grouped by acoustic property.

Figure 3

Comparison of factor reliabilities across participants in the ground truth data and explanatory power on the full dataset measured by R2 in the regressor predictions. Factor reliabilities are taken from Virkus et al. (2025c) and Virkus et al. (2025b).

The average topic probabilities across all 805 sounds (Figure 2) provide the overall topic distribution across the dataset.

The timbre topics reveal a mix of balanced and dominant categories, with smooth tonal textures (Airy Sines, Warm Decaying Sines) and high‑energy percussive elements (Overall High Energy, Aggressive Beeps II) prevalent, while noisy/transient sounds are less frequent. Chroma topic distribution is more skewed, dominated by stable harmonic patterns (Minor Thirds, Repeated Base Notes, Ascending/Descending Fourths and Fifths), with dissonant intervals appearing rarely, indicating a prevalence of low‑integer frequency ratios in the tonal material of the dataset. Critically, the octave‑equivalence of chroma features means the Repeated Base Note topic represents both the Prime and the Octave interval classes. Loudness topics exhibit more even variation, featuring fading dynamics (Loudness Descends) and rhythmic bursts (Short Pulses, Short Accented Sequences), with extreme dynamics less common.

Overall, timbre and loudness topics show broad diversity with some dominant categories, whereas chroma concentrates around specific tonal features. Across all three acoustic topic dimensions, features associated with perceptual pleasantness in the prior research literature—tonal timbres, consonant intervals, and moderate dynamics—predominate. These patterns describe the dataset's learnable structure and indicate that contemporary user experience (UX) sound design systematically avoids highly dissonant or extreme acoustic material.

4 Study 2: Prediction

This substudy aims to predict perceived expression across the 19 FBMUX dimensions using a machine learning regression model that performs supervised learning with the features extracted in the preceding substudy. The performance of different machine learning paradigms, hyperparameters, and model configurations is evaluated through a model selection process during training. We then analyze the best‑performing model using explainable artificial intelligence techniques and ambiguity analysis.

4.1 Methods

4.1.1 Target data and preprocessing

In this study, the annotations from the FBMSet‑805 dataset reflecting user‑perceived expression across the FBMUX dimensions per sound are used as target data for the prediction model. Each of the 19 FBMUX dimensions is measured by exactly three items from the questionnaires developed by Virkus et al. (2025c) and Virkus et al. (2025b), and scores are integrated by confirmatory factor analysis.

The 19 FBMUX dimensions grouped by communication level are the following:

  1. Status (internal device processes): Having finished successfully, Having a problem, Process ongoing, Being ready, Having news, Being empty, and Shutting down

  2. Appeal (prompts to the user): Negative warnings, Urgency reminder, Encouraging confirmations, Starting prompts, and Waiting prompts

  3. Brand identity (device branding): Sophistication, Positivity, Progressiveness, Dominance, Solidity, Purity, and Playfulness

In the listening study reported by Virkus et al. (2025a), 805 sounds were rated on 57 items using a 0–100‑point scale. Statistical application of the measurement instruments yields factor scores for the 19 FBMUX dimensions on a Z‑standardized scale (with the unit being standard deviations). For our analysis, median factor scores across raters were computed for each sound, and factor reliabilities were documented using McDonald's omega, a measure of internal factor consistency indicating how well the items capture the latent factor meaning (McDonald, 1999) in terms of user rating consistency. This produces 805 sound–factor score pairs. The factor reliabilities are provided in Supplement A.

4.1.2 Model selection and training

To identify the best‑performing predictive model, we applied hyperparameter tuning and stratified k‑fold cross‑validation during training across multiple model configurations (Table 1). To support reproducibility, the Python code as well as model training data used for this process was archived (Frommholz, 2026).

Table 1

Overview of model configurations tested during model selection, combining two learning paradigms, three output strategies, and optional metadata inclusion.

Model Selection Criteria | Tested Configurations
Learning Paradigm
  • Deep neural network

  • Random forest

Model Output Configuration
  • Multi‑output (all)

  • 3 × Multi‑output (communication level‑wise)

  • 19 × Single‑output (communication dimension‑wise)

Inclusion of Metadata
  • Yes

  • No

Learning paradigms

Two complementary learning paradigms are explored. RFs (Breiman, 2001) are suitable for small‑to‑medium‑sized datasets due to their robustness and interpretability, while deep neural networks (DNNs) are ideal for capturing complex, nonlinear relationships (Goodfellow et al., 2016). We use the RandomForestRegressor() from scikit‑learn and implement a custom DNNRegressor() architecture consisting of fully connected layers with batch normalization using PyTorch Lightning (Paszke et al., 2019).

Model configurations

To address the possible impact of factor correlations on model performance, three output configuration strategies are compared:

  • Multi‑output (all): Joint prediction of all 19 FBMUX dimensions to capture factor correlations

  • Multi‑output (communication level‑wise): Three submodels (status, appeal, brand identity) capturing factor correlations across dimensions within the respective communication level

  • Single‑output: One model per dimension, maximizing target‑specific performance

Industry metadata

We tested the impact of including industry metadata corresponding to the original FBMSet‑805 industry classification of the device sound as a seven‑dimensional one‑hot vector. This vector is concatenated with the 38 audio topics to form a 45‑dimensional input vector.
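Constructing the 45‑dimensional input is a simple concatenation. In this sketch, `encode_input` is a hypothetical helper and the topic vector is random; only the sector list and the 38 + 7 = 45 dimensionality follow the text:

```python
import numpy as np

# The seven FBMSet-805 industry sectors named earlier in the paper.
SECTORS = ["Apps", "Consumer Electronics", "Home Devices", "Health Devices",
           "Future Industry", "Mobility", "Operating Systems"]

def encode_input(topic_vector, sector):
    """Concatenate a 38-dim topic vector with a 7-dim one-hot sector vector."""
    onehot = np.zeros(len(SECTORS))
    onehot[SECTORS.index(sector)] = 1.0
    return np.concatenate([topic_vector, onehot])  # 45-dim model input

x = encode_input(np.random.default_rng(0).random(38), "Mobility")
```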

Hyperparameter tuning and cross‑validation

During training, we evaluated 12 model configurations (two learning paradigms and three output strategies, each with and without industry metadata). Each configuration was assessed using k‑fold cross‑validation with k ∈ {3, 4, 5, 6}, applying nonparametric stratification to maintain consistent target distributions across folds. For each fold, a grid search was conducted over paradigm‑specific hyperparameters (see Supplement E); deep learning models were trained for 60 epochs. The size of the validation set used for grid‑search evaluation depends on the number of folds; the test set (the holdout set for final model evaluation) is of the same size. The best configuration was selected based on average performance across folds and retrained on the full training data using optimal settings. As a baseline, we also trained a multi‑output linear regression model using LinearRegression() from scikit‑learn, with all audio topics and industry metadata as input.
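The stratification-plus-grid-search loop might look like the following single-target sketch. The quantile binning, grid values, and synthetic data are illustrative assumptions, not the study's actual settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(1)
X = rng.random((120, 45))                           # stand-in for topic + metadata input
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=120)   # one synthetic target dimension

# Nonparametric stratification: bin the continuous target into quantiles so
# every fold sees a similar target distribution.
bins = np.digitize(y, np.quantile(y, [0.25, 0.5, 0.75]))
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
splits = list(cv.split(X, bins))

# Grid search over a small, illustrative hyperparameter grid.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=splits,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
best = search.best_params_
```

Passing the precomputed `splits` to GridSearchCV lets the regression task reuse folds stratified on the binned targets.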

4.1.3 Evaluation metrics and interpretability techniques

Performance metrics

We evaluated model performance using the root mean squared error (RMSE) and the coefficient of determination (R2). RMSE guides optimization during training by minimizing prediction error for DNN weights or RF thresholds and determining the best hyperparameter configuration. It is defined as:

(1)

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2},$$

where $N$ is the number of predictions, $y_i$ is the true target value, and $\hat{y}_i$ is the predicted value.

The performance of the final model is reported using R2, which quantifies explained variance:

(2)

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2},$$

where $\bar{y}$ is the mean of the true values. A higher $R^2$ indicates a better fit.
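As a sanity check, both metrics can be computed directly from their definitions and compared against scikit-learn's implementations (the toy values below are our own):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = np.array([0.5, 1.0, 1.5, 3.0])

# RMSE and R^2 computed term by term from the equations above.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The manual formulas agree with scikit-learn's implementations.
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(r2, r2_score(y_true, y_pred))
```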

Consistency metrics

We complement traditional regression metrics with ambiguity‑aware validation. This allows us to evaluate not only accuracy but also perceptual plausibility within the bounds of human agreement.

To realize this, we model alignment with the psychological consistency of the FBMUX dimensions by correlating per‑dimension R2 scores (full dataset) with factor reliabilities (McDonald's omega). We hypothesize that dimensions with greater internal consistency will be better predictable from audio topics.

Next, we quantify how the model distinguishes between sounds with clear versus ambiguous perceived expression on the FBMUX dimensions, using Mahalanobis distance as a proxy. This distance measure from Mahalanobis (1936) accounts for covariance structure, making it suitable for our multivariate correlated feature spaces. Each communication level—status (seven‑dimensional), appeal (five‑dimensional), and brand identity (seven‑dimensional)—defines a correlated feature space in which ideal, unambiguous expressions correspond to unit vectors. The distances from samples to these vectors, computed for both ground truth and predictions, form matrices of shape samples × dimensions. Correlating predicted and ground‑truth distances within each space measures the model's capacity to capture perceptual ambiguity within each communication level.
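A sketch of this ambiguity analysis under our own toy assumptions (ideal expressions as unit vectors, one covariance estimate per score matrix, synthetic "appeal" scores):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

def distances_to_unit_vectors(scores):
    """(n_samples, n_dims) factor scores -> (n_samples, n_dims) Mahalanobis
    distances to each ideal (unit-vector) expression."""
    vi = np.linalg.inv(np.cov(scores, rowvar=False))  # inverse covariance
    ideals = np.eye(scores.shape[1])
    return np.array([[mahalanobis(s, e, vi) for e in ideals] for s in scores])

rng = np.random.default_rng(3)
truth = rng.normal(size=(50, 5))                     # e.g. the 5 appeal dimensions
pred = truth + rng.normal(scale=0.3, size=(50, 5))   # noisy stand-in predictions

d_true = distances_to_unit_vectors(truth)
d_pred = distances_to_unit_vectors(pred)
r = np.corrcoef(d_true.ravel(), d_pred.ravel())[0, 1]
```

A high correlation `r` indicates that the model places sounds at similar distances from the ideal expressions as the ground truth does, i.e., it reproduces perceptual (un)ambiguity.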

Interpretability techniques

To interpret model predictions, we use SHapley Additive exPlanations (SHAP) values. SHAP is a model‑agnostic approach grounded in cooperative game theory (Lundberg and Lee, 2017). SHAP values quantify the marginal contribution of each feature to the model's prediction by answering the question, ‘How, and to what extent, would the prediction change if this feature were included?' For tree‑based models, such as RFs, we apply TreeSHAP (Lundberg et al., 2020), and for DNNs, we use the general implementation (Lundberg and Lee, 2017) to compute SHAP values.2 We compute SHAP values for all samples and all 19 output dimensions, enabling both global and local interpretability. To support flexible analysis, we aggregate SHAP values at multiple levels: per communication level (status, appeal, and brand identity) and by feature group (timbre, chroma, and loudness).

4.2 Results

4.2.1 Prediction

Following the model selection procedure (Section 4.1.2), we identified the best‑performing model for each combination of learning paradigm, output configuration, metadata inclusion, and cross‑validation split. The full set of test results, including all single‑output models, is reported in Supplement G.

Overall model selection results

To compare configurations, we selected the cross‑validation fold with the best performance in terms of RMSE for each combination of learning paradigm, output configuration, and metadata inclusion. For multi‑output (level‑wise), we averaged results across the three communication levels (status, appeal, brand identity); for single output, we averaged results across all 19 FBMUX dimensions. Table 2 reports the mean R2 and standard deviations for all configurations.

Table 2

Overview of prediction performance on the test dataset measured by coefficient of determination R2 for the tested criteria. Mean and standard deviation values of test R2 values are given where results were averaged over multiple models.

Output Configuration | DNN, No Metadata | DNN, With Metadata | RF, No Metadata | RF, With Metadata
multi‑output_all | 0.029 | 0.054 | 0.126 | 0.160
multi‑output_level | 0.042 ± 0.02 | 0.053 ± 0.06 | 0.104 ± 0.05 | 0.139 ± 0.07
single‑output | 0.010 ± 0.06 | 0.023 ± 0.09 | 0.122 ± 0.09 | 0.135 ± 0.09

Baseline (linear regression): 0.077

Of the two learning paradigms evaluated, RFs outperformed DNNs as well as the baseline linear regression model. The optimal configuration was a multi‑output RF regressor that included industry metadata, which proved superior to single‑output or level‑wise models. This approach achieved a coefficient of determination R2=.16 on the test dataset. This optimal configuration serves as the basis for all subsequent analysis and interpretation. The hyperparameters of the best‑performing model are documented in Supplement F.

Consistency of the best‑performing model

To evaluate how well the best‑performing model aligns with the structure of the perceptual ground truth, we compare the model's predictions to factor reliabilities and semantic clarity. A moderate positive correlation (r = .64, p = .003) is observed between factor reliabilities and R2, showing that the model achieves better prediction for internally consistent dimensions (Figure 3).

Being ready and Having news achieved the highest R2, while Process ongoing and Dominance were poorly predicted. Shutting down and Waiting prompts displayed discrepancies between factor reliability and prediction accuracy. Overall, predictability varied substantially across dimensions. The average correlations between the Mahalanobis distances of the model predictions and those of the ground truth factor scores are r(Status) = 0.738, r(Appeal) = 0.858, and r(Brand Identity) = 0.959.

4.2.2 Interpretation of model predictions

Since the best‑performing model is an RF regressor, we apply permutation importance from the scikit‑learn package (Breiman, 2001), which quantifies the contribution of each feature to global model performance by measuring the change in the evaluation metric (RMSE) when that feature is randomly permuted. Global model explanations using permutation importances are shown in Figure 4. Topics with high permutation importance represent globally informative predictors across multiple FBMUX dimensions.
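On toy data, the scikit-learn permutation-importance routine behaves as described; the synthetic target below depends only on the first feature, so that feature should dominate (data and dimensions are illustrative, not the study's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(7)
X = rng.random((200, 5))
# Target depends only on feature 0, so permuting it should hurt RMSE most.
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0,
                                scoring="neg_root_mean_squared_error")
top_feature = int(result.importances_mean.argmax())
```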

Figure 4

Permutation importances of high‑level topics (model input) for the best‑performing random forest regression model grouped by acoustic property. Feature importance reflects model sensitivity to specific acoustic patterns.

Of the timbre topics, Low Pulses is the most important feature, followed by Bells and High Rings, suggesting that more tonal low‑frequency and high‑frequency timbres carry distinguishing information for semantic differentiation. On the other hand, Aggressive Toots and Spacy Noisy Swooshes are the least impactful. Among the chroma features, the most important are Minor Thirds II, Tritones, and Major Seconds (C–D), while Short Base Notes (I, II, III) are the least important. This may indicate that intervals with dissonant or tension‑bearing character provide overall useful cues. Loudness features are lower and relatively uniform in importance, suggesting that loudness patterns may be dimension‑specific rather than global predictors.

SHAP values offer a more local interpretation of model predictions. Figure 5 shows the results averaged over all samples, aggregated across audio properties, and grouped by communication level.

Figure 5

SHAP values for the best‑performing random forest regression model averaged over all samples, aggregated across acoustic property, and grouped by communication level. These values indicate the contribution of each property to predicted semantic expression in the respective communication level.

Figure 5 shows the aggregated SHAP values for each audio property (timbre, chroma, loudness), averaged across all samples and grouped by communication level. These values indicate the average contribution of each property. However, they must be interpreted with caution: positive or negative SHAP values reflect whether the topic tends to increase or decrease the model's predictions, but they do not reveal whether high or low feature values cause that effect. For status, loudness contributes most positively to model predictions, suggesting the model learns status messages through intensity patterns. For appeal dimensions, both loudness and timbre show strong positive contributions, reflecting a similar mechanism, as appeal also relies on loudness to grab attention and may use timbre variations to evoke specific user reactions. For brand identity, timbre is the dominant contributor, suggesting that aesthetic messages rely strongly on the interplay of spectral–temporal properties. Chroma topics did not improve predictions substantially for any communication level.
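Because Figure 5 aggregates SHAP values over samples and acoustic properties, the underlying bookkeeping can be sketched with plain NumPy. The SHAP matrix, the topic-to-property grouping, and all numbers below are invented for illustration only:

```python
import numpy as np

# Hypothetical precomputed SHAP matrix: rows = sounds, columns = topics.
rng = np.random.default_rng(1)
shap_values = rng.standard_normal((805, 9)) * 0.05
shap_values[:, 0:3] += 0.2          # make the timbre topics push predictions up

# Illustrative mapping of topic columns to acoustic properties.
groups = {"timbre": [0, 1, 2], "chroma": [3, 4, 5], "loudness": [6, 7, 8]}

# Average over samples, then sum within each property group,
# mirroring the kind of aggregation shown in Figure 5.
per_topic_mean = shap_values.mean(axis=0)
per_property = {name: per_topic_mean[cols].sum() for name, cols in groups.items()}
```

Note the caveat from the text applies here too: a positive aggregate says a property tends to raise the prediction, not whether high or low feature values cause that effect.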

5 Study 3: Validation

To address the third research question—whether model predictions reflect device users' perceptions—a validation study was conducted, employing an online listening experiment. The goal was to assess whether categorical model predictions (indicating the highest probability for any FBMUX dimension within the communication level in a sample) would align with perception regarding these communication levels.

5.1 Methods

5.1.1 Participants

This substudy included 46 participants, recruited via a university newsletter and professional and personal networks. The mean age of the participants was 33.04 years (SD = 8.93). Regarding gender, 37 individuals identified as male (80.4%), seven identified as female (15.2%), and two identified as diverse (4.4%). Most participants reported German as their mother tongue (42 participants, 91.3%), while two participants (4.4%) reported English as such and two (4.4%) indicated other languages. In terms of educational attainment, four participants (8.7%) completed secondary education, 21 (45.7%) held a bachelor's degree, and 21 (45.7%) held a master's degree.

5.1.2 Stimuli

The stimuli consisted of 17 sounds from the FBMSet‑805 dataset (Section 3.1.1), selected for the highest semantic ratings and lowest variance across participants. Each sound was chosen to predominantly represent one FBMUX dimension while maintaining low scores across all other dimensions within the same communication level. This selection aimed to minimize ambiguity in semantic interpretation. Because no sounds showed clear maxima in the Process ongoing and Urgency reminder dimensions, these two dimensions were not represented, resulting in a total of 17 stimuli.
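One plausible way to formalize this selection is a target-versus-rest margin: pick, for each dimension, the sound that scores highest on it while staying low on all competing dimensions. The margin rule, the data, and the function name are our illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np

# Hypothetical mean factor scores: rows = sounds, columns = dimensions
# within one communication level (random stand-in data).
rng = np.random.default_rng(3)
scores = rng.random((805, 7))

def pick_stimulus(scores: np.ndarray, dim: int) -> int:
    """Pick the sound scoring highest on `dim` relative to its best
    off-target dimension (largest target-vs-rest margin)."""
    margin = scores[:, dim] - np.delete(scores, dim, axis=1).max(axis=1)
    return int(margin.argmax())

# One candidate stimulus per dimension of this communication level.
chosen = [pick_stimulus(scores, d) for d in range(scores.shape[1])]
```

A dimension for which no sound achieves a positive margin would have no unambiguous exemplar, which is exactly the situation reported for Process ongoing and Urgency reminder.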

5.1.3 Procedure

The listening experiment was conducted online via LimeSurvey and lasted 10–15 min. Participants received an introduction outlining the study's aim to enhance functional sound intuitiveness and completed an initial volume‑adjustment sound check. The study task itself comprised three stages corresponding to the communication levels: status, appeal, and brand identity. In each stage, participants classified sounds by selecting the best‑fitting category from seven status, five appeal, or seven brand identity options. The response options reflected FBMUX dimensions but were presented as relatable phrases from Virkus et al. (2025c) (e.g., Encouraging confirmations appeared as 'Nice! / Very good! / Great, keep going!') to aid understanding. To prevent order effects, both stimuli and options were presented in randomized order. Demographic data were collected at the very end of the study.

5.1.4 Data analysis

Since the machine learning model predictions from Substudy 2 were initially continuous, we converted them into discrete class labels based on the highest predicted factor score per communication level. For each communication level, participants' classifications of each sound were directly compared to the model's predicted class labels, with agreement indicating alignment between human classification and model output. A binomial test was conducted per communication level to determine whether the overall overlap between rater judgments and model predictions was significantly greater than chance. All analyses were performed in Python using the pandas and scikit‑learn packages.
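The label conversion and the binomial test can be sketched as follows, using `scipy.stats.binomtest` and the status-level figures later reported in Table 3; the continuous predictions here are random stand-ins:

```python
import numpy as np
from scipy.stats import binomtest

# Continuous factor-score predictions for one communication level
# (rows = sounds, columns = the seven status dimensions; illustrative values).
rng = np.random.default_rng(2)
predictions = rng.random((17, 7))
predicted_labels = predictions.argmax(axis=1)   # discrete class per sound

# Binomial test: did raters match the model more often than chance?
n_trials = 329          # status-level judgments (from Table 3)
n_matches = 126         # ~38.3% agreement
chance = 1 / 7          # seven status response options
result = binomtest(n_matches, n_trials, chance, alternative="greater")
```

With 126 matches out of 329 trials against a chance rate of 1/7, the resulting p‑value falls far below 0.001.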

5.2 Results

The validation study assessed how well model predictions aligned with human classifications across the three communication levels: status, appeal, and brand identity. Table 3 presents the results, which include classification accuracy and statistical significance based on a binomial test. Overall, the model accuracy in terms of validly predicting device user perception was highest for status (38.3%), followed by brand identity (32.22%) and appeal (27.66%). The statistical analysis revealed significant deviations from random guessing for all three levels, with p‑values of <0.001 for status and brand identity and 0.003 for appeal, indicating that the model captures meaningful patterns in human perception, though with varying degrees of alignment.

Table 3

Binomial test results of the validation experiment comparing model predictions against user perception within communication level for 17 sounds from FBMSet‑805.

Level       Status    Appeal    Brand identity
n           329       235       329
Accuracy    38.3%     27.66%    32.22%
p-value     <0.001    0.003     <0.001

To assess the consistency of participant responses within each communication level, Fleiss' kappa was calculated as a measure of inter‑rater agreement. Results indicate only slight agreement among participants across all levels, with the highest agreement observed for status (κ=.186), followed by brand identity (κ=.144) and appeal (κ=.099). According to the common interpretation of Fleiss' kappa, values below .20 suggest slight agreement, implying that participant responses were not highly consistent within each level.
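Fleiss' kappa has a compact closed form that can be implemented directly from a counts table (items × response categories, with a fixed number of raters per item). The table below is invented for illustration:

```python
import numpy as np

def fleiss_kappa(table: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) count table,
    assuming the same number of raters n for every item."""
    table = np.asarray(table, dtype=float)
    n = table[0].sum()                          # raters per item
    p_j = table.sum(axis=0) / table.sum()       # category proportions
    P_i = (np.square(table).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)            # chance-corrected agreement

# Illustrative: 4 sounds, 2 response options, 5 raters each.
table = np.array([[5, 0], [0, 5], [5, 0], [0, 5]])   # perfect agreement
kappa = fleiss_kappa(table)                          # -> 1.0
```

Values near 1 indicate raters consistently chose the same category per item; values near 0 indicate chance-level agreement, which is the regime the reported kappas (.099–.186) fall close to.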

6 Discussion

The present study applied MIR and machine learning methods to predict perceived semantic expression in functional sounds. By extracting low‑level timbre, chroma, and loudness features; aggregating them into high‑level topics via GMMs and LDA; and employing RF multi‑output regression, we showed that MIR‑based audio representations capture meaningful aspects of communicative expression. Despite modest overall predictive performance (R² = 0.16), Substudy 3 demonstrated significant alignment between model predictions and user perception, indicating that our machine learning model learned salient auditory patterns linked to semantic meaning as perceived by human device users.

6.1 Interpretation of results

Analyzing the topic distribution in Figure 2 in Section 3.2 revealed that the functional sounds of the FBMSet‑805 exhibit a balanced spread of timbre and loudness topics, whereas chroma is concentrated around a few tonal configurations (same note, octave, and minor thirds). This distribution aligns with established Human Computer Interaction (HCI) and auditory‑display design practices, where timbral and dynamic variation serve as primary means of communicative differentiation and pitch content is used sparingly for signaling or emphasis (Brewster et al., 1994; Serafin et al., 2022). The topic modeling approach thus effectively captures acoustic structures that are perceptually and semantically relevant to sound design in human–computer interaction.

The RF multi‑output regression effectively captured nonlinear relationships and mitigated overfitting, which was beneficial given the moderate dataset size. In contrast, the DNN architectures showed converging and plateauing validation losses despite extensive architectural and hyperparameter exploration, suggesting that the tested configurations reached their practical limit on this dataset. Further, leveraging shared variance across multiple dimensions, along with incorporating industry metadata as contextual priors, improved performance. This suggests that the industry origin of a functional sound systematically shapes its semantic perception, implying that UX sound design should account for domain‑specific communication conventions. Although overall predictive power was modest (R² = 0.16), this is notable given the domain's inherent subjectivity, perceptual ambiguity, and task complexity. Predicting 19 semantic dimensions across three communication levels far exceeds the dimensionality of valence–arousal MER (Kang and Herremans, 2024) or of predicting the semantic expression of music branding on five dimensions (Lepa et al., 2020a). Positive correlations between per‑dimension R² and factor reliabilities suggest that well‑defined constructs are better learnable (see Figure 3), while the alignment of model‑ and ground‑truth Mahalanobis distances indicates sensitivity to perceptual clarity versus ambiguity.
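The modeling setup described above, a single forest jointly predicting all semantic dimensions with industry metadata appended to the topic features, can be sketched as follows; all shapes, values, and the one-hot encoding detail are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in inputs: topic activations plus a categorical industry label.
rng = np.random.default_rng(5)
topics = rng.random((300, 9))                 # high-level topic features
industry = rng.integers(0, 4, size=300)       # e.g., 4 industry categories
Y = rng.random((300, 19))                     # 19 semantic target dimensions

# Append one-hot industry metadata as contextual priors.
X = np.hstack([topics, np.eye(4)[industry]])

# One forest fits all 19 dimensions jointly, sharing variance across targets.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, Y)
preds = model.predict(X[:5])                  # shape (5, 19)
```

Passing a 2‑D target to `fit` is what makes scikit‑learn's forest a multi‑output regressor; per‑target forests would forgo the shared‑variance benefit discussed above.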

Analysis of feature permutation importance showed that globally important features are not the ones prevalent in the dataset (see Figure 2) because common features often carry redundant information. Our machine learning model mainly learned to pay attention to low‑ and high‑frequency tonal timbres to differentiate functional sounds, which aligns with research on auditory displays by Brewster et al. (1994). The model identified the importance of dissonant tension‑bearing chroma intervals for distinguishing expression of functional sounds, consistent with research that links dissonance to emotional differentiation (Di Stefano et al., 2022). Because no loudness topic stood out in terms of importance, we suggest that loudness patterns are dimension‑specifically relevant rather than globally informative. SHAP analysis (see Figure 5) revealed that the influence of acoustic properties differs across communication levels. For status and appeal, the model primarily relies on loudness modulation, consistent with prior findings linking loudness to perceived urgency and affect (Bailes et al., 2015). For appeal, timbre additionally contributes, suggesting the model uses spectral–temporal cues to differentiate valence responses, in line with evidence that roughness modulates aversion (Di Stefano et al., 2022). In contrast, brand identity predictions depend predominantly on timbre, reflecting higher‑order aesthetic processing consistent with audio branding research (Techawachirakul et al., 2023). These patterns indicate that functional dimensions (status, appeal) exploit primal arousal–valence pathways for rapid, context‑independent signaling, whereas aesthetic dimensions (brand identity) engage more complex cognitive mechanisms to convey ongoing experience and brand personality. 
Overall, these results demonstrate that MIR methods, when adapted to functional sounds, can yield domain‑relevant musical patterns and that the resulting prediction model attends to similar psychoacoustic cues as human listeners.

Validation results showed the best prediction accuracy for sonically expressed status and the poorest for appeal, likely due to appeal's higher subjectivity and context dependence. Low inter‑rater agreement and modest model–human alignment (roughly 28%–38%; see Table 3) underscore the challenge of modeling ambiguous perceptual dimensions and highlight the importance of incorporating perceptual uncertainty in machine learning model evaluation.

6.2 Limitations

The predominantly Western dataset and moderate sample size limit cultural generalizability and model complexity, though such constraints are typical of UX sound research. Feature extraction via static frame slicing and LDA captures interpretable summaries but omits temporal sequencing, likely contributing to the modest predictive performance (R²). The harmonic ambiguity of chroma features and the lack of rhythmic modeling further restrict expressive detail. A limitation of the validation strategy is that it focused on clearly identifiable dimensions, while more ambiguous ones (Process ongoing, Urgency reminder) lacked distinct examples and were hence not validated. The relatively small validation sample (n = 46) constrains generalizability, though the significant model–human alignment supports perceptual relevance.

6.3 Implications and future work

The findings of this study can have important implications for MIR, sound design, digital musicology, and human–machine interaction research. For MIR, our findings demonstrate that tools originally developed for the computational analysis of long‑form music (Mauch et al., 2015) can be applied to rather short functional sounds through extending the GMM/LDA method by the domain‑relevant (Menzel et al., 2011) and perceptually grounded (Flexer, 2014) loudness property. Using signal‑level descriptors, topic modeling, and regression‑based prediction leads to perceptually meaningful predictions of communicative expression. Finally, this work contributes to the growing interest in using MIR for more socially and contextually grounded applications. For sound design and UX, the proposed prediction model provides a data‑driven, interpretable pipeline that can reduce reliance on iterative prototyping, enabling designers to better align acoustic features with communicative intent. It supports an interpretability‑driven approach to UX sound design and connects to current research on auditory interface design principles (Serafin et al., 2022). In this way, our approach bridges auditory UX research and digital musicology, with the potential to transform design workflows across industries. In human–machine interaction, it supports the development of emotionally and functionally intelligent auditory interfaces. In digital musicology, this work extends topic‑modeling approaches such as that of Mauch et al. (2015) by coupling probabilistic audio features with semantic prediction, moving beyond descriptive analysis of musical patterns. The proposed GMM/LDA–RF pipeline thus enables computational modeling of the perceived meaning of functional sounds and offers a transferable framework for future studies on industry aesthetics and cultural variation.

Future work should enhance feature extraction to better capture melodic gestures and temporal dynamics, exploring pitch‑height sensitive representations such as MIDI or constant‑Q transform as well as N‑gram topic modeling (Jurafsky and Martin, 2025). Including additional timbral descriptors into our feature extraction such as roughness and hardness also seems promising, as these psychoacoustic features are known to influence semantic expression of urgency (Arnal et al., 2015), with hardness being a particularly relevant descriptor for sound effects (Pearce et al., 2019). Apart from methodological improvement, future work should explore the direction of using our framework to formulate UX sound design recommendations on dimension level.

6.4 Conclusion

This study shows that combining MIR with interpretable machine learning techniques advances research on the perceived semantic expression of functional UX sounds. Although not considered 'musical' in the traditional sense, functional sounds exhibit structured, communicative properties that lend themselves to musicological analysis. We demonstrate that a two‑stage pipeline consisting of probabilistic feature extraction and multi‑output regression can reveal latent sound topics that predict human interpretation. While the machine learning model's prediction accuracy is modest, its outputs correlate with perceptual clarity and semantic reliability, validating the approach. This work therefore expands the scope of digital musicology and MIR through three core contributions:

  1. A computational framework linking acoustic features to perceived communicative meaning, grounded in a theoretical communication model;

  2. Evidence that MIR techniques effectively extract musically interpretable and perceptually relevant structures from functional sounds, expanding the methods' applicability; and

  3. Methodological innovations in feature modeling (GMM/LDA for short sounds) suited for small‑data, high‑ambiguity domains in MIR, UX, and digital musicology research.

This study contributes to the growing interdisciplinary field at the intersection of MIR, UX design, and digital musicology. It highlights the potential of computational methods to enhance human–machine interaction through perceptually grounded audio design. In doing so, the study supports a broader reconceptualization of 'musical meaning' as applicable not only to artistic expression but also to designed auditory experiences.

Acknowledgments

We thank the Sound Innovation Lab for conceiving and supporting the overall project, as well as for their intensive intellectual and practical contributions throughout its development. In particular, we thank Paul Schulze and Andreas Hoppe for consultation and Siamend Darwesh for code review and technical consultation. We also thank Marc Voigt from the Audio Communication Group at TU Berlin for managing necessary computational resources, along with all participants of the validation study.

Funding Information

Funded by the ProValid program, State of Berlin, project #VAL40/2023.

Competing Interests

The authors have no competing interests to declare. None of the authors are currently members of the journal’s editorial team or board, nor have they held such a position within the past three years.

Authors’ Contributions

All authors contributed to study design. Annika Frommholz conducted the signal processing, experiments, coding, and analyses. Annika Frommholz, Steffen Lepa, and Tom Virkus co‑designed the listening test. Steffen Lepa consulted the main author with audio feature engineering, machine learning workflow, and statistical methods. Tom Virkus consulted in terms of functional sound communication. Johannes Helberger supervised research from a professional sound design perspective, and Stefan Weinzierl supervised research from an audio communication perspective. Finally, Annika Frommholz drafted the manuscript, and all authors revised it.

Reproducibility Statement

Substudy 1 materials and algorithms are subject to third‑party licensing and confidentiality agreements and cannot be publicly shared; methodologies are described in detail to support reproducibility within these constraints. Code and data for Substudy 2 are publicly available at https://doi.org/10.5281/zenodo.18404881.

Ethics and Consent Statement

Substudy 3 was conducted in accordance with the Declaration of Helsinki. All participants provided informed consent, and data were anonymized.

Additional File

The additional file for this article can be found as follows:

Supplementary Appendix A

Prompt specification. DOI: https://doi.org/10.5334/tismir.290.s1.

DOI: https://doi.org/10.5334/tismir.290 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jun 30, 2025 | Accepted on: Jan 26, 2026 | Published on: Mar 2, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Annika Frommholz, Steffen Lepa, Tom Virkus, Stefan Weinzierl, Johannes Helberger, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.