Make That Sound More Metallic: Towards a Perceptually Relevant Control of the Timbre of Synthesizer Sounds Using a Variational Autoencoder

Open Access | May 2021

Figures & Tables

Figure 1

Global diagram of (V)AE-based sound analysis-transformation-synthesis.

Figure 2

General architecture of a VAE. Grey dotted arrows represent sampling processes.

Figure 3

Global diagram of the semantic proximity analysis.

Table 1

Frequency and transversality measures of the eight most frequent and transverse semantic clusters, represented by the most frequent and transverse verbal descriptor within each cluster. The frequency of the semantic clusters is expressed as percentages of the evaluated sounds for each participant and their transversality as percentages of the total number of participants. The frequency and transversality of the verbal descriptors are expressed as percentages of the expressions within each cluster.

| Semantic Cluster | Frequency (in %): Semantic Cluster | Frequency (in %): Isolated Verbal Desc. | Transversality (in %): Semantic Cluster | Transversality (in %): Isolated Verbal Desc. |
|---|---|---|---|---|
| Qui résonne (cluster of 8 expressions) | 13.5 | 25.7 | 47.5 | 37.5 |
| Métallique (cluster of 4 expressions) | 10.6 | 52.6 | 43.6 | 75.0 |
| Agressif (cluster of 4 expressions) | 9.9 | 47.6 | 40.6 | 48.8 |
| Qui vibre (cluster of 7 expressions) | 7.8 | 43.5 | 46.5 | 40.4 |
| Chaud (cluster of 4 expressions) | 7.7 | 45.8 | 36.6 | 40.5 |
| Qui évolue (cluster of 8 expressions) | 5.7 | 27.0 | 29.7 | 33.3 |
| Soufflé (cluster of 5 expressions) | 4.5 | 43.5 | 25.7 | 57.7 |
| Percussif (cluster of 4 expressions) | 3.6 | 37.8 | 25.7 | 26.9 |

Table 2

Intra- and inter-listener agreement on the eight perceptual dimensions for the second perceptual test. The first two columns describe the intra-listener agreement: the average Pearson’s coefficient R and the percentage of participants with R > 0.5. The last four columns report the inter-listener agreement observed over the whole group of participants and within each group of participants identified by the HAC analysis (the percentage of participants in each group is given in brackets).

| Perceptual dimension | Intra-listener agreement: Average Pearson’s R | Intra-listener agreement: % of part. for whom R > 0.5 | Inter-listener agreement, Average Pearson’s R: All | 1st cluster (selected) (% of participants) | 2nd cluster (% of participants) | 3rd cluster (% of participants) |
|---|---|---|---|---|---|---|
| Métallique | 0.59 | 69.0% | 0.38 | 0.47 (59.2%) | 0.40 (40.8%) | |
| Chaud | 0.50 | 64.8% | 0.36 | 0.48 (43.5%) | 0.39 (37.0%) | 0.45 (19.5%) |
| Soufflé | 0.58 | 66.2% | 0.31 | 0.40 (53.2%) | 0.35 (46.8%) | |
| Qui vibre | 0.38 | 49.3% | 0.23 | 0.42 (62.9%) | 0.30 (37.1%) | |
| Percussif | 0.81 | 87.3% | 0.56 | 0.62 (85.5%) | 0.40 (14.5%) | |
| Qui résonne | 0.41 | 57.7% | 0.23 | 0.33 (56.1%) | 0.27 (43.9%) | |
| Qui évolue | 0.54 | 67.6% | 0.42 | 0.47 (70.8%) | 0.49 (29.2%) | |
| Agressif | 0.68 | 81.7% | 0.51 | 0.58 (60.3%) | 0.60 (27.6%) | 0.57 (12.1%) |

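
The two intra-listener columns of Table 2 (average Pearson’s R and the share of participants with R > 0.5) can be illustrated with a minimal sketch. It assumes, as this page does not state, that each participant rated the same stimuli twice and that R is computed between the two passes. The function names and the toy ratings are hypothetical and are not taken from the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def intra_listener_summary(ratings, threshold=0.5):
    """ratings maps a participant id to (first_pass, second_pass) rating lists.

    Returns (average R, percentage of participants with R > threshold).
    The test-retest layout is an assumption made for this sketch.
    """
    rs = [pearson_r(first, second) for first, second in ratings.values()]
    average_r = sum(rs) / len(rs)
    share_above = 100.0 * sum(r > threshold for r in rs) / len(rs)
    return average_r, share_above

# Hypothetical toy ratings, invented for illustration only.
toy_ratings = {
    "p1": ([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),  # perfectly consistent listener
    "p2": ([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]),  # perfectly inverted listener
}
average_r, share_above = intra_listener_summary(toy_ratings)
```

Reporting both numbers is useful because an average R can hide a split between very consistent and very inconsistent listeners, as the toy data shows.
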
Figure 4

Performance of the classic VAE and the proposed perceptually-regularized VAE in terms of (a) RMSE (in dB) and (b) PEMO-Q scores, for three values of α (error bars represent 95% confidence intervals calculated with paired t-tests considering the classic VAE as the reference).

Figure 5

Spearman correlation coefficients between extracted latent dimensions (first three rows) and perceptual ratings (last row).
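
As a reminder of what a Spearman coefficient measures, the sketch below computes it as the Pearson correlation of rank-transformed values, with tied values sharing their average rank. It is a generic illustration, not the authors' evaluation code. The helper names and the toy latent values and ratings are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (spread_x * spread_y)

# Hypothetical values: a latent coordinate and perceptual ratings that grow
# together but non-linearly, so the rank correlation is still maximal.
latent_values = [0.1, 0.4, 0.9, 1.6, 2.5]
perceptual_ratings = [1, 2, 4, 8, 16]
rho = spearman_rho(latent_values, perceptual_ratings)
```

Because only ranks matter, any monotonic relationship between a latent dimension and a perceptual rating yields a coefficient of magnitude 1, whether or not the relationship is linear.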

Figure 6

Interpretability measure (Pati and Lerch, 2020) for the first eight dimensions of the latent space.

Table 3

Averaged mapping and disentanglement metrics (Pati and Lerch, 2020) obtained for the classic and perceptually-regularized VAE models.

| Model | SCC | Interpretability | MIG | SAP |
|---|---|---|---|---|
| Classic VAE | 0.3216 | 0.0762 | 0.0035 | 0.0264 |
| Perceptually-regularized VAE | 0.7895 | 0.6448 | 0.0513 | 0.4275 |

Table 4

Statistical results of the perceptual A/B test for the five selected perceptual dimensions. Effect of the train/test data origin factor (first two columns). Comparison of the perceptual choice (A/B) with chance level (third and fourth columns). Inter-listener agreement using Randolph’s free-marginal multi-rater kappa (Randolph, 2005) (last column).

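| Dimension | “train/test dataset” factor: χ2 | “train/test dataset” factor: p-value | Chance threshold comparison: z | Chance threshold comparison: p-value | Inter-listener agreement: Randolph’s κ |
|---|---|---|---|---|---|
| Agressif | 0.026 | 0.87 | 3.39 | ≪0.0001 | 0.50 |
| Chaud | 1.07 | 0.30 | 0.37 | 0.36 | 0.08 |
| Métallique | 0.43 | 0.61 | 1.47 | 0.07 | 0.29 |
| Soufflé | 0.002 | 0.96 | 0.25 | 0.40 | 0.36 |
| Qui vibre | 0.76 | 0.38 | 3.93 | ≪0.0001 | 0.27 |
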
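
The last column of Table 4 uses Randolph’s free-marginal multi-rater kappa. A minimal sketch of that statistic follows, using its standard definition: observed pairwise agreement corrected by a chance level of 1/k for k response categories (k = 2 for an A/B choice). The function name and the toy vote counts are hypothetical, and the sketch assumes that every item is rated by the same number of listeners.

```python
def randolph_kappa(counts):
    """Free-marginal multi-rater kappa.

    counts has one row per item and one column per response category;
    each cell holds the number of raters who chose that category.
    Every item must have the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_categories = len(counts[0])
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of raters")
    # Ordered pairs of raters that agree, summed over items and categories.
    agreeing_pairs = sum(c * (c - 1) for row in counts for c in row)
    p_observed = agreeing_pairs / (n_items * n_raters * (n_raters - 1))
    p_chance = 1.0 / n_categories  # free-marginal chance agreement
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical A/B votes from 10 listeners on 3 stimulus pairs
# (columns: votes for A, votes for B); invented for illustration only.
toy_votes = [[10, 0], [5, 5], [8, 2]]
kappa = randolph_kappa(toy_votes)
```

A value of 1 means unanimous choices on every item, 0 means agreement at chance level, and negative values mean listeners agree less often than chance would predict.
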
Figure 7

Results of the A/B perceptual test for five perceptual dimensions, for labeled stimuli (train) and new, unknown ones (test). Bars represent mean values, error bars represent 95% confidence intervals, and the red line indicates chance level.

DOI: https://doi.org/10.5334/tismir.76 | Journal eISSN: 2514-3298
Language: English
Submitted on: Sep 21, 2020
Accepted on: Mar 29, 2021
Published on: May 18, 2021
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2021 Fanny Roche, Thomas Hueber, Maëva Garnier, Samuel Limier, Laurent Girin, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.