
Multi-Objective Investigation of Six Feature Source Types for Multi-Modal Music Classification

By: Igor Vatolkin and Cory McKay
Open Access | Jan 2022

Figures & Tables

Figure 1

Examples of non-dominated feature sets (connected circles) after feature selection in an experiment on Rock music (see Section 5), using two criteria: the first is the binary classification error me, which is minimized, and the second is the proportion gk of features from the k-th group, which is maximized. The share of album cover features is maximized in the upper sub-figure, and the share of model-predicted semantic tags in the lower sub-figure.

Figure 2

Theoretically possible non-dominated fronts for the minimization of me and maximization of gk.
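The fronts in Figures 1 and 2 follow the usual Pareto-dominance logic for the two criteria: a feature set survives only if no other set achieves both a lower error me and a higher group share gk. The Python sketch below illustrates such a filter; it is an illustration of the concept only, not the authors' implementation, and all names and values are hypothetical.

```python
def non_dominated(solutions):
    """Pareto filter for two criteria: minimize error m_e, maximize group share g_k.

    `solutions` is a list of (m_e, g_k) pairs; a solution is kept if no other
    solution is at least as good in both criteria and strictly better in one.
    """
    front = []
    for i, (err_i, share_i) in enumerate(solutions):
        dominated = any(
            (err_j <= err_i and share_j >= share_i) and
            (err_j < err_i or share_j > share_i)
            for j, (err_j, share_j) in enumerate(solutions) if j != i
        )
        if not dominated:
            front.append((err_i, share_i))
    # Sorting by error makes the front easy to plot as connected circles (cf. Figure 1).
    return sorted(front)

# Hypothetical feature sets evaluated during feature selection: (error, group share).
candidates = [(0.20, 0.1), (0.18, 0.3), (0.25, 0.9), (0.18, 0.2), (0.30, 0.5)]
print(non_dominated(candidates))  # -> [(0.18, 0.3), (0.25, 0.9)]
```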

Figure 3

Binary classification performance of symbolic (top) and model-based (bottom) features on Traditional Blues music (see Section 5), based on minimization of both me and gk.

Table 1

Summary of feature groups associated with each of the six modalities. The complete list of features is provided in the supplementary material, Sections A.1 to A.6.

| Group | Sub-Groups | Sample Features | Dim. |
| --- | --- | --- | --- |
| Audio signal | Timbre, pitch + harmony, tempo + rhythm + structure, structural complexity | MFCCs and delta MFCCs (Lartillot and Toiviainen, 2007), CMRARE modulation features (Martin and Nagathil, 2009), chroma DCT-reduced log pitch (Müller and Ewert, 2011), structural complexity (Mauch and Levy, 2011) for chroma, chords, harmony, tempo/rhythm, timbre | 908 |
| Model-based | Instruments, instrumental complexity, moods, various semantic descriptors | Share of guitar, piano, wind, and strings; semantic descriptors annotated by music experts: orchestra occurrence, clear or rough vocals, melodic range, dynamics, digital effects, level of activation | 494 |
| Symbolic | Pitch, melodic, chords, rhythm, tempo, instrument presence, instruments, texture, dynamics | Pitch class histogram, amount of arpeggiation, tempo, number of instruments, dynamic range and variation | 789 |
| Album covers | | SIFT descriptors (Lowe, 2004) | 100 |
| Playlists | | Co-occurrences of artists (Vatolkin et al., 2014) | 293 |
| Lyrics | | Average number of syllables per word, rate of misspelling, vocabulary size, bag-of-words, Doc2Vec | 87/219 |
Table 2

Fold assignments in cross-validation splits.

| Split | Training | Validation | Test |
| --- | --- | --- | --- |
| 1 | Fold 1 | Fold 2 | Fold 3 |
| 2 | Fold 2 | Fold 3 | Fold 1 |
| 3 | Fold 3 | Fold 1 | Fold 2 |
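Read row-wise, the scheme simply rotates the three folds through the training, validation, and test roles. A minimal sketch of such a rotation, assuming folds are stored as lists of track identifiers (names are hypothetical):

```python
def rotating_splits(folds):
    """Yield (training, validation, test) fold assignments for a 3-fold rotation (cf. Table 2).

    `folds` is a sequence of three disjoint collections of track identifiers.
    Split i uses fold i for training, fold i+1 for validation, and fold i+2 for testing.
    """
    n = len(folds)
    for i in range(n):
        yield folds[i], folds[(i + 1) % n], folds[(i + 2) % n]

# Example with three hypothetical folds of track IDs.
folds = [["t1", "t2"], ["t3", "t4"], ["t5", "t6"]]
for split_no, (train, valid, test) in enumerate(rotating_splits(folds), start=1):
    print(split_no, train, valid, test)
```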
Table 3

Numbers of positive and negative tracks in the training, validation, and test sets for a split.

| Tracks | Training | Validation | Test |
| --- | --- | --- | --- |
| LMD-aligned genres | | | |
| Positives | 105 | 105 | 105 |
| Negatives | 104 | 420 | 420 |
| SLAC genres | | | |
| Positives | 16 | 16 | 16 |
| Negatives | 16 | 64 | 64 |
| SLAC sub-genres | | | |
| Positives | 8 | 8 | 8 |
| Negatives | 16 | 72 | 72 |
Table 4

Comparison of the six feature types based on h(me ↓, gk) FS optimization. Means and standard deviations are estimated across the three folds in the splits in which they respectively played a test role (see Section 5.8), and across all ten repetitions of each experiment. All rows except the two starting with “Combined” report the mean best test classification error me for pure feature groups only (lower values are better) and the mean normalized multi-group feature importance ih (higher values are better). me values are averaged across the ten repetitions, and each ih value corresponds to the highest-importance non-dominated solution among all ten experimental trials. The mean best me and ih for each class are in bold, and cell background color indicates sorted mean ih values: deep red indicates the highest importance and deep blue the lowest importance for a given column and its folds. Finally, the values in the rows starting with “Combined” indicate the smallest mean test error me obtained across all non-dominated solutions for each class, including (in these rows only) mixed feature sets. The following procedure was used to estimate this me: first, the smallest error among all non-dominated solutions of each individual experimental run is noted; this is then averaged across the ten experimental trials, and the minimum is taken across all six feature groups.
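The procedure described for the “Combined” rows amounts to a three-step reduction: take the smallest error over the non-dominated solutions of each run, average these minima over the ten repetitions, and finally take the minimum over the six feature groups. The following Python sketch illustrates that reduction under an assumed data layout; it is not the authors' code, and all names and values are hypothetical.

```python
from statistics import mean

def combined_best_error(errors_per_group):
    """Smallest mean test error across feature groups (cf. the 'Combined' rows of Table 4).

    `errors_per_group` maps a feature-group name to a list of runs, where each run
    is the list of test errors m_e of its non-dominated solutions.
    """
    group_means = {}
    for group, runs in errors_per_group.items():
        # Step 1: best (smallest) error among the non-dominated solutions of each run.
        best_per_run = [min(run) for run in runs]
        # Step 2: average across the repetitions of the experiment (ten in the paper).
        group_means[group] = mean(best_per_run)
    # Step 3: minimum across all feature groups.
    return min(group_means.values())

# Toy example with two groups and two repetitions each (hypothetical values).
errors = {
    "audio": [[0.21, 0.19, 0.25], [0.22, 0.20]],      # run minima 0.19, 0.20 -> mean ~0.195
    "symbolic": [[0.18, 0.24], [0.23, 0.22]],          # run minima 0.18, 0.22 -> mean ~0.20
}
print(combined_best_error(errors))  # prints the smaller group mean, ~0.195
```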

Figure 4

Share of each non-playlist feature group in the feature subsets with the smallest test errors for each genre. A: audio; M: model-predicted tag; S: symbolic; C: album cover; T: lyrics. Results are based on h(me, gk) FS optimization, and are shown for each of the three folds separately, for the splits in which they played a test role.
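The group shares plotted in Figure 4 can be obtained by counting how many selected feature dimensions belong to each group and normalizing by the size of the subset. A minimal sketch under that assumption (feature names, group labels, and the mapping are hypothetical):

```python
from collections import Counter

def group_shares(selected_features, feature_to_group):
    """Proportion of each feature group within a selected feature subset (cf. Figure 4).

    `selected_features` is an iterable of feature identifiers and `feature_to_group`
    maps each identifier to its group label (e.g. 'A' for audio, 'S' for symbolic).
    """
    counts = Counter(feature_to_group[f] for f in selected_features)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Hypothetical subset of three selected features from two modalities.
mapping = {"mfcc_3": "A", "chroma_1": "A", "pitch_hist": "S", "tempo": "S"}
print(group_shares(["mfcc_3", "chroma_1", "pitch_hist"], mapping))
# -> {'A': ~0.67, 'S': ~0.33}
```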

Table 5

Normalized multi-group feature redundancy (Rh) comparison of the five feature types left after excluding playlist features (lower values are better). The mean and standard deviation are shown across three folds. The best value for each class is in bold. Deep red indicates the best mean Rh and deep blue the worst (equal values are possible).

Table 6

Comparison of 15 feature sub-groups from the modalities Audio Signal, Model-Based, and Symbolic with respect to normalized multi-group importance ih, after h(me ↓, gk) FS optimization. Higher values are better. Mean values and standard deviations across the three folds are reported. The highest mean ih value for each genre is in bold; cells with higher values are marked in red, and cells with lower values in blue.

DOI: https://doi.org/10.5334/tismir.67 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jun 17, 2020
Accepted on: Dec 2, 2021
Published on: Jan 24, 2022
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2022 Igor Vatolkin, Cory McKay, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.