A Gaussian–Based WGAN–GP Oversampling Approach for Solving the Class Imbalance Problem

By: Qian Zhou and Bo Sun
Open Access | Jun 2024

DOI: https://doi.org/10.61822/amcs-2024-0021 | Journal eISSN: 2083-8492 | Journal ISSN: 1641-876X
Language: English
Page range: 291 - 307
Submitted on: Jul 31, 2023
Accepted on: Feb 21, 2024
Published on: Jun 25, 2024
Published by: University of Zielona Góra
In partnership with: Paradigm Publishing Services
Publication frequency: 4 times per year

© 2024 Qian Zhou, Bo Sun, published by University of Zielona Góra
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.