
Figure 1
Flowchart of the musical genre classification method using data augmentation. The Global Features block, made of Descriptors and Summary, is computed for each element produced by the data augmentation block.
Table 1
Convention for the transformation strength.
| Γ or γ = 0 | no transformation |
| Γ or γ = 0.5 | very light transformations |
| Γ or γ = 1 | medium transformations |
| Γ or γ = 1.5 | strong transformations |
| Γ or γ = 2 | exaggerated degradations |
Table 2
Evaluation of segmentation. Accuracy (%) for ISMIR-2004.
| train\test | no seg. | 80 s | 30 s | 15 s |
| no seg. | 81.0 | 84.3 | 84.9 | 81.2 |
| 80 s | 83.7 | 86.5 | 86.8 | 85.3 |
| 30 s | 85.6 | 86.5 | 89.3 | 86.1 |
| 15 s | 85.3 | 85.3 | 87.2 | 87.3 |
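Table 2 compares segment lengths of 80 s, 30 s and 15 s. As a rough illustration of what cutting a track into fixed-length segments can look like, here is a minimal Python sketch. The function name `split_into_segments`, the non-overlapping layout, and the choice to drop a trailing remainder are assumptions made for illustration; the captions do not specify how the segments were actually cut.

```python
from typing import List, Sequence


def split_into_segments(
    samples: Sequence[float], sample_rate: int, segment_seconds: float
) -> List[Sequence[float]]:
    """Cut a signal into consecutive, non-overlapping segments.

    Assumption: a trailing remainder shorter than one segment is dropped,
    and a signal shorter than one segment is returned whole.
    """
    seg_len = int(round(segment_seconds * sample_rate))
    if seg_len <= 0:
        raise ValueError("segment length must be at least one sample")
    segments = [
        samples[start:start + seg_len]
        for start in range(0, len(samples) - seg_len + 1, seg_len)
    ]
    # Keep very short signals as a single segment instead of discarding them.
    return segments if segments else [samples]


# Toy example: 10 samples at 2 Hz, cut into 2-second (4-sample) segments.
toy_segments = split_into_segments(list(range(10)), sample_rate=2, segment_seconds=2)
```

In this toy example the last two samples are discarded; a real pipeline might instead pad or overlap segments.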
Table 3
Evaluation of transformations. Accuracy mean (%) for ISMIR-2004, using a transformation strength Γ* = 1. The small numbers in parentheses are the standard deviations (in percentage points, pp) computed with 25 repetitions. Note that the 95% confidence interval of the accuracy mean is less than 0.37pp.
| train\test | original | +1 transf. | +2 transf. | +4 transf. | +14 transf. |
| original | 81.0(0.0) | 78.7(0.6) | 77.6(0.7) | 77.2(0.8) | 76.3(0.5) |
| +1 transf. | 84.5(0.6) | 83.6(0.7) | 83.4(0.9) | 83.3(0.7) | 81.9(0.6) |
| +2 transf. | 84.5(0.8) | 84.7(0.7) | 84.8(0.8) | 84.4(0.7) | 84.6(0.6) |
| +4 transf. | 85.0(0.9) | 84.2(0.9) | 84.3(0.5) | 85.4(0.7) | 85.4(0.6) |
| +14 transf. | 84.4(0.4) | 84.9(0.6) | 85.1(0.6) | 85.5(0.7) | 85.8(0.4) |
Table 4
Testing combinations of segmentation and transformation. Accuracy mean (%) for ISMIR-2004. The symbols S and T respectively mean that segmentation or transformation is used during training (rows) or testing (columns), and the symbols S̄ and T̄ denote that the respective method is not used. Note that for the experiments with sound transformations, the standard deviations of the accuracy are less than 0.94pp and the 95% confidence intervals of the accuracy mean are less than 0.37pp.
| train\test | S̄ T̄ | S̄ T | S T̄ | S T |
| S̄ T̄ | 81.0 | 77.2 | 84.3 | 78.1 |
| S̄ T | 85.0 | 85.4 | 84.4 | 85.4 |
| S T̄ | 83.7 | 81.8 | 86.5 | 83.2 |
| S T | 85.8 | 86.3 | 86.3 | 87.1 |
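The confidence-interval bounds quoted in the captions can be related to the standard deviations and the 25 repetitions. The sketch below assumes a normal approximation, with half-width z·σ/√n and z = 1.96; the captions do not state which formula was used, so this is only a plausibility check, and the function name `ci_half_width` is hypothetical.

```python
import math


def ci_half_width(std_pp: float, n_repetitions: int, z: float = 1.96) -> float:
    """Half-width of a confidence interval for a mean, in percentage points.

    Assumption: normal approximation, z = 1.96 for a 95% interval.
    """
    return z * std_pp / math.sqrt(n_repetitions)


# Standard deviation bound of 0.94 pp with 25 repetitions.
bound = ci_half_width(0.94, 25)
print(f"{bound:.3f} pp")  # prints "0.368 pp"
```

Under these assumptions, a standard deviation below 0.94pp gives a half-width of roughly 0.37pp, which is consistent with the bound quoted in the caption.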
Table 5
Natural vs artificial data augmentation. Accuracy mean (%) for FMA (10-fold cross validation). The first column corresponds to small training sets with 1000 songs, and the second to larger training sets with 5000 songs. Note that the results in bold font use training sets with the same size.
| | Small (1000) | Big (5000) |
| No augmentation | 45.8 | 54.9 |
| Segmentation | 48.6 | 55.2 |
| Transformation | 48.5 | 54.7 |
Table 6
Robustness to degradation, shown by mean prediction accuracy (%) for ISMIR-2004. Rows represent amount of data augmentation; columns represent transformation strength Γ*. The standard deviations are less than 0.88pp and the 95% confidence intervals of the mean are less than 0.35pp.
| Γ* for testing → | 0 | 0.5 | 1 | 1.5 | 2 |
| original | 86.5 | 73.2 | 74.2 | 71.7 | 68.9 |
| +1 transf. | 86.3 | 85.2 | 84.5 | 82.7 | 81.1 |
| +2 transf. | 86.4 | 86.0 | 85.6 | 83.9 | 82.1 |
| +4 transf. | 86.3 | 86.8 | 86.0 | 84.7 | 83.0 |
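To make the robustness trend in Table 6 easier to read, the following sketch computes, for each training condition, the accuracy lost between Γ* = 0 and Γ* = 2. The values are copied from the table; the dictionary layout and the helper name `accuracy_drop` are illustrative assumptions.

```python
# Accuracies (%) from Table 6, for test strengths 0, 0.5, 1, 1.5 and 2.
table6 = {
    "original": [86.5, 73.2, 74.2, 71.7, 68.9],
    "+1 transf.": [86.3, 85.2, 84.5, 82.7, 81.1],
    "+2 transf.": [86.4, 86.0, 85.6, 83.9, 82.1],
    "+4 transf.": [86.3, 86.8, 86.0, 84.7, 83.0],
}


def accuracy_drop(row):
    """Accuracy lost (in pp) between the weakest and strongest test strength."""
    return round(row[0] - row[-1], 1)


drops = {name: accuracy_drop(values) for name, values in table6.items()}
for name, drop in drops.items():
    print(f"{name}: -{drop} pp")
```

The drop shrinks from 17.6pp without augmentation to 3.3pp with four transformations per training signal.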

Figure 2
Individual and chained transformations. Each horizontal colored bar represents the mean accuracy, and each black segment its standard deviation, computed with 25 repetitions of each experiment. The vertical dashed line represents the accuracy without transformation, and the dotted line represents the mean accuracy for the transformation chain used in this paper.

Figure 3
Testing of transformation overfitting. The rows represent the transformations used during training, and the columns represent the transformations of the test signals (ISMIR-2004).
Table 7
Classification accuracy (%), showing the effect of transformations for cross-dataset issues. The SVM parameters C and σ are fixed to 1.
| Training set | Testing set (only original) | |
| ISMIR-2004 | ISMIR-2004 | 1517-Artists |
| original | 85.0 | 40.6 |
| +2 transf. | 85.3 | 46.5 |
| 1517-Artists | ISMIR-2004 | 1517-Artists |
| original | 57.0 | 58.6 |
| +2 transf. | 56.9 | 63.1 |
Table 8
Evaluation of transformations. Accuracy mean (%) for ISMIR-2004, Γ* = 1, using StdDesc+ModSpec+GMM. The small numbers in parentheses are the standard deviations (pp) computed with 25 repetitions. Note that the 95% confidence interval of the accuracy mean is less than 0.49pp.
| train\test | original | +1 transf. | +2 transf. | +4 transf. | +14 transf. |
| original | 83.0(0.6) | 79.8(0.9) | 79.1(0.9) | 78.9(1.2) | 78.5(1.1) |
| +1 transf. | 82.7(0.9) | 82.8(0.9) | 83.1(0.8) | 83.4(0.9) | 83.6(0.9) |
| +2 transf. | 83.3(1.0) | 83.3(0.8) | 83.6(0.8) | 84.3(1.1) | 84.5(0.8) |
| +4 transf. | 83.1(0.5) | 83.5(0.8) | 84.0(0.8) | 84.4(0.7) | 84.8(0.6) |
| +14 transf. | 83.7(0.9) | 84.1(0.7) | 84.7(0.7) | 85.0(0.5) | 85.3(0.6) |
Table 9
Evaluation of segmentation. Accuracy (%) for ISMIR-2004, using: StdDesc+ModSpec+GMM. The small numbers given between parentheses are the standard deviations (pp) computed with 25 repetitions. Note that the 95% confidence interval of the accuracy mean is less than 0.4pp.
| train\test | no seg. | 80 s | 30 s | 15 s |
| no seg. | 83.9(0.8) | 83.8(1.0) | 83.6(0.8) | 82.0(1.0) |
| 80 s | 85.2(0.6) | 86.4(0.4) | 86.3(0.5) | 85.8(0.4) |
| 30 s | 85.3(0.3) | 85.9(0.4) | 86.9(0.1) | 86.9(0.1) |
| 15 s | 84.8(0.3) | 85.6(0.2) | 85.8(0.4) | 85.9(0.2) |
Table 10
Testing combinations of segmentation and transformation. Accuracy (%) for ISMIR-2004, using StdDesc+ModSpec+GMM. See Table 4 for an explanation of the symbols. Note that the 95% confidence interval of the accuracy mean is less than 0.43pp.
| train\test | S̄ T̄ | S̄ T | S T̄ | S T |
| S̄ T̄ | 83.1(0.7) | 78.6(1.1) | 83.2(0.5) | 80.3(0.8) |
| S̄ T | 83.1(0.7) | 84.4(0.6) | 83.5(0.7) | 85.1(0.7) |
| S T̄ | 85.3(0.3) | 83.2(0.7) | 85.8(0.6) | 84.4(0.5) |
| S T | 83.8(0.6) | 84.3(0.6) | 84.3(0.7) | 85.0(0.6) |
Table 11
Natural vs artificial data augmentation. Accuracy (%) for FMA using StdDesc+ModSpec+GMM. See Table 5 for an explanation.
| | Small (1000) | Big (5000) |
| No augmentation | 41.7 | 54.0 |
| Segmentation | 48.2 | 54.0 |
| Transformations | 48.0 | 53.4 |
Table 12
Robustness to degradation, shown by mean prediction accuracy (%) for ISMIR-2004, using StdDesc+ModSpec+GMM. Rows represent amount of data augmentation; columns represent transformation strength Γ*. The standard deviations are less than 1.05pp and the 95% confidence intervals of the mean are less than 0.41pp.
| Γ* for testing → | 0 | 0.5 | 1 | 1.5 | 2 |
| original | 85.8 | 79.9 | 78.1 | 75.8 | 72.2 |
| +1 transf. | 84.9 | 84.2 | 83.7 | 82.5 | 80.6 |
| +2 transf. | 84.4 | 84.3 | 83.8 | 83.1 | 81.4 |
| +4 transf. | 84.2 | 84.2 | 84.4 | 83.5 | 81.6 |
Table 13
Classification accuracy (%), showing the effect of transformations for cross-dataset issues, using StdDesc+ModSpec+GMM.
| Training set | Testing set (only original) | |
| ISMIR-2004 | ISMIR-2004 | 1517-Artists |
| original | 86.1 | 47.8 |
| +2 transf. | 85.2 | 47.5 |
| 1517-Artists | ISMIR-2004 | 1517-Artists |
| original | 46.7 | 60.7 |
| +2 transf. | 56.7 | 62.9 |
