Abstract
Supervised machine learning relies on the availability of large annotated datasets, since small datasets generally lead to overfitting when training high-dimensional models. Because manually annotating such large datasets is a long, tedious and expensive process, an alternative is to artificially increase the size of an existing dataset. This is known as data augmentation. In this paper we provide an in-depth analysis of two data augmentation methods: sound transformations and sound segmentation. The first transforms a music track into a set of new tracks by applying processes such as pitch-shifting, time-stretching or filtering. The second splits a long sound signal into a set of shorter time segments. We study the effect of these two techniques (and of their parameters) on a genre classification task using public datasets. The main contribution of this work is to detail, through experimentation, the benefits of these methods, used alone or together, during training and/or testing. We also demonstrate their use in improving robustness to potentially unknown sound degradations. Based on these results, we provide good-practice recommendations.
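
To make the two methods concrete, the sketch below illustrates transformation-based and segmentation-based augmentation of a single audio signal. It is a minimal illustration assuming the librosa library; the function names, shift amounts, stretch rates and segment length are hypothetical choices, not the exact configuration used in the experiments.

```python
# Illustrative sketch of the two augmentation methods, assuming librosa.
# Parameter values (semitone shifts, stretch rates, segment length) are
# hypothetical examples, not the paper's experimental settings.
import librosa
import numpy as np


def transform_augment(y, sr):
    """Derive new tracks from one track via pitch-shifting and time-stretching."""
    return [
        librosa.effects.pitch_shift(y, sr=sr, n_steps=2),   # up two semitones
        librosa.effects.pitch_shift(y, sr=sr, n_steps=-2),  # down two semitones
        librosa.effects.time_stretch(y, rate=1.1),          # 10% faster
        librosa.effects.time_stretch(y, rate=0.9),          # 10% slower
    ]


def segment_augment(y, sr, seg_seconds=10.0):
    """Split a long signal into fixed-length, non-overlapping segments."""
    seg_len = int(seg_seconds * sr)
    n_segments = len(y) // seg_len
    return [y[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]
```

Either function maps one labeled track to several labeled examples sharing the same genre label, which is what allows the training set to grow without additional manual annotation.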
