Predicting Perceived Semantic Expression of Functional Sounds Using Unsupervised Feature Extraction and Ensemble Learning

Annika Frommholz; Steffen Lepa; Tom Virkus; Stefan Weinzierl; Johannes Helberger

doi:10.5334/tismir.290

Figures & Tables

Figure 1

Schematic overview of methodological steps.

Figure 2

Mean topic probabilities across the FBMSet‑805 dataset grouped by acoustic property.

Figure 3

Comparison of factor reliabilities across participants in the ground truth data with the explanatory power of the regressor predictions on the full dataset, measured by $R^{2}$. Factor reliabilities are taken from Virkus et al. (2025c) and Virkus et al. (2025b).

Table 1

Overview of model configurations tested during model selection, combining two learning paradigms, three output strategies, and optional metadata inclusion.

Model Selection Criterion	Tested Configurations
Learning Paradigm	Deep neural network; Random forest
Model Output Configuration	Multi‑output (all); 3 × Multi‑output (communication level‑wise); 19 × Single‑output (communication dimension‑wise)
Inclusion of Metadata	Yes; No
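
The configuration grid in Table 1 can be enumerated programmatically. The following is a minimal sketch, not the authors' code: the identifiers (`PARADIGMS`, `OUTPUT_STRATEGIES`, `enumerate_configurations`) are hypothetical, and the count of one model for the all-outputs strategy is an assumption, since Table 1 states counts only for the level-wise (3) and dimension-wise (19) strategies.

```python
from itertools import product

# Names below are illustrative only; they do not come from the paper.
PARADIGMS = ["deep_neural_network", "random_forest"]

# Number of models trained per output strategy. The values 3 and 19 are
# taken from Table 1; the value 1 for "multi_output_all" is an assumption.
OUTPUT_STRATEGIES = {
    "multi_output_all": 1,
    "multi_output_level": 3,
    "single_output": 19,
}

METADATA_OPTIONS = [False, True]


def enumerate_configurations():
    """Return every paradigm x output strategy x metadata combination."""
    return [
        {
            "paradigm": paradigm,
            "output": output,
            "metadata": metadata,
            "n_models": OUTPUT_STRATEGIES[output],
        }
        for paradigm, output, metadata in product(
            PARADIGMS, OUTPUT_STRATEGIES, METADATA_OPTIONS
        )
    ]


configs = enumerate_configurations()
total_models = sum(c["n_models"] for c in configs)
```

Under these assumptions the grid has 2 × 3 × 2 = 12 configurations, which matches the twelve non-baseline cells of Table 2.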

Table 2

Overview of prediction performance on the test dataset, measured by the coefficient of determination $R^{2}$, for the tested criteria. Where results were averaged over multiple models, the mean and standard deviation of the test $R^{2}$ are given.

Output Configuration	Deep Neural Network, No Industry Metadata	Deep Neural Network, With Industry Metadata	Random Forest, No Industry Metadata	Random Forest, With Industry Metadata
multi‑output_all	0.029	0.054	0.126	0.160
multi‑output_level	0.042 ± 0.02	0.053 ± 0.06	0.104 ± 0.05	0.139 ± 0.07
single‑output	0.010 ± 0.06	0.023 ± 0.09	0.122 ± 0.09	0.135 ± 0.09
Baseline (Linear Regression)	0.077
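
For readers unfamiliar with the metric in Table 2, the sketch below shows how a coefficient of determination and a "mean ± standard deviation" summary across several models could be computed. It is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, and whether the paper reports the population or the sample standard deviation is not stated here (the population form is used).

```python
from statistics import mean, pstdev


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


def summarize_scores(scores):
    """Mean and (population) standard deviation of per-model R^2 values."""
    return mean(scores), pstdev(scores)
```

A model that always predicts the mean of the targets scores $R^{2} = 0$, so the small positive values in Table 2 indicate only a modest improvement over that trivial predictor.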

Figure 4

Permutation importances of high‑level topics (model input) for the best‑performing random forest regression model grouped by acoustic property. Feature importance reflects model sensitivity to specific acoustic patterns.

Figure 5

SHAP values for the best‑performing random forest regression model, averaged over all samples, aggregated by acoustic property, and grouped by communication level. These values indicate the contribution of each property to the predicted semantic expression at the respective communication level.

Table 3

Binomial test results of the validation experiment comparing model predictions against user perception within each communication level for 17 sounds from FBMSet‑805.

Level	Status	Appeal	Brand Identity
n	329	235	329
Accuracy	38.30%	27.66%	32.22%
p‑value	<0.001	0.003	<0.001
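
A binomial test of the kind summarized in Table 3 can be sketched as follows. This is a hedged illustration, not the authors' analysis: the chance level is not stated in this section, so `chance_level` below is a purely hypothetical placeholder; the one-sided (upper-tail) form is an assumption; and the success count is reconstructed from the rounded accuracy reported for the Status level.

```python
from math import exp, lgamma, log


def binomial_upper_tail(successes, n, p0):
    """Exact one-sided p-value P(X >= successes) for X ~ Binomial(n, p0)."""
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")

    def log_pmf(k):
        # Log-space evaluation avoids overflow in the binomial coefficient.
        return (
            lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p0) + (n - k) * log(1.0 - p0)
        )

    return min(1.0, sum(exp(log_pmf(k)) for k in range(successes, n + 1)))


# Status level from Table 3: n = 329 trials, accuracy 38.3%.
n_trials = 329
observed = round(0.383 * n_trials)  # 126, reconstructed from the rounded accuracy
chance_level = 0.25  # HYPOTHETICAL: the study's actual chance level is not given here
p_value = binomial_upper_tail(observed, n_trials, chance_level)
```

Because the p-value depends strongly on the assumed chance level, the value computed here should not be compared with the p-values in Table 3 unless the study's actual chance level is substituted.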
