This study compared methods for handling imbalanced data when predicting customer churn in banking and e-commerce, using datasets whose features were selected with SHAP and MRMR. Two families of approaches were evaluated: data-level methods (Oversampling, Undersampling, and Hybrid resampling) and algorithm-level methods. Oversampling performed best on small to medium datasets, whereas Undersampling improved Recall at the cost of Precision, which lowered overall performance. Ensemble models outperformed single models; among single models, Decision Trees learned best from imbalanced data. The study recommends ensemble models for churn prediction and proposes the SHAP framework to improve their interpretability through global and local explanations. Two models, ROS-CatBoost and CW-XGBoost, achieved Accuracy, Precision, Recall, F1-score, ROC AUC, and PR AUC all above 0.9, indicating strong predictive performance for both churn and retention. These findings highlight the effectiveness of ensemble models and interpretability tools in addressing imbalanced data challenges.
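The contrast between data-level and algorithm-level handling can be illustrated with a minimal sketch. It assumes that "ROS" stands for random oversampling and "CW" for class weighting, which the abstract does not spell out. The function names `random_oversample` and `balanced_class_weights` are hypothetical, the toy data is invented for illustration, and the code is not the authors' pipeline.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Data-level: duplicate randomly chosen minority rows until every
    class has as many rows as the largest class (assumed meaning of ROS)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

def balanced_class_weights(y):
    """Algorithm-level: leave the data unchanged and weight each class by
    n_samples / (n_classes * class_count) (assumed meaning of CW)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

# Toy, made-up data: 8 retained customers (0) and 2 churners (1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_ros, y_ros = random_oversample(X, y)
weights = balanced_class_weights(y)

print(Counter(y_ros))  # both classes now have 8 rows
print(weights)         # the minority class gets the larger weight
```

In practice, resampling would be applied to the training split only, so that duplicated rows do not leak into evaluation. Class weights like these would be passed to the learner's loss, for example through a model's class-weight or sample-weight option.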
© 2025 Luong Thanh Tam, Luong Gia Vi, Nguyen Manh Tuan, published by Bulgarian Academy of Sciences, Institute of Information and Communication Technologies
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.