Robust Hybrid Data-Level Approach for Handling Skewed Fat-Tailed Distributed Datasets and Diverse Features in Financial Credit Risk

Open Access | Jun 2025

Abstract

Skewed fat-tailed distributed (imbalanced, or class-imbalanced) datasets cause severe aberrations in many machine learning (ML) algorithms. The problem is acute in real-life applications such as credit risk modelling, where default cases (the minority class) are often outnumbered by non-default cases (the majority class), or vice versa. Data-level (DL) approaches have been suggested in the recent literature as remedies for skewed fat-tailed distributed datasets. The most popular DL approach in contemporary studies is the synthetic minority over-sampling technique (SMOTE), together with its variants, which can mitigate the risk of overfitting and reduce generalization error. However, these approaches can introduce noisy instances that diminish the robustness of ML algorithms. They are also often susceptible to nominal features with mismatching labels, which are inherent in real-world datasets. To bridge these gaps, we propose a hybrid framework that simultaneously mitigates the aberrations caused by nominal features with mismatching labels and by noisy instances. The proposed approach is SMOTE-edited nearest neighbors-encoding nominal and continuous (SMOTEENN-ENC) features. Its efficacy was evaluated against DL approaches suggested in the literature that are designed to handle skewed fat-tailed distributed datasets with inherently diverse features. This approach was coupled with two widely used ensemble algorithms, random forest (RF) and extreme gradient boosting (XGBoost). The results suggest that SMOTEENN-ENC integrated with the XGBoost algorithm delivers superior and more stable predictive performance on skewed fat-tailed distributed datasets with inherently diverse features.
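The two building blocks named in the abstract, SMOTE over-sampling and edited nearest neighbours (ENN) cleaning applied to encoded nominal and continuous features, can be illustrated with a minimal sketch. This is not the authors' SMOTEENN-ENC implementation, whose encoding details are not given in the abstract. It assumes a plain one-hot encoding, squared Euclidean distance, and naive interpolation (which can produce fractional indicator values). The toy data and the helper names `one_hot`, `nearest`, `smote`, and `enn` are hypothetical.

```python
import random
from collections import Counter

def one_hot(rows, nominal_idx):
    """Encode nominal columns as 0/1 indicators; keep other columns as floats."""
    levels = {j: sorted({r[j] for r in rows}) for j in nominal_idx}
    encoded = []
    for r in rows:
        vec = []
        for j, v in enumerate(r):
            if j in levels:
                vec.extend(1.0 if v == lv else 0.0 for lv in levels[j])
            else:
                vec.append(float(v))
        encoded.append(vec)
    return encoded

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(idx, X, candidates, k):
    """Indices of the (up to) k candidates closest to X[idx], excluding idx."""
    others = [i for i in candidates if i != idx]
    return sorted(others, key=lambda i: sq_dist(X[idx], X[i]))[:k]

def smote(X, y, minority, n_new, k=3, seed=0):
    """Append n_new synthetic minority points, each interpolated between
    a random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    X_new, y_new = list(X), list(y)
    for _ in range(n_new):
        i = rng.choice(minority_idx)
        j = rng.choice(nearest(i, X, minority_idx, k))
        gap = rng.random()
        X_new.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_new.append(minority)
    return X_new, y_new

def enn(X, y, k=3):
    """Drop every point whose label disagrees with the majority vote
    of its k nearest neighbours (applied to all classes)."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in nearest(i, X, range(len(X)), k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: (scaled income, employment type); 1 = default.
rows = [
    (0.9, "salaried"), (0.8, "salaried"), (0.85, "self-employed"),
    (0.7, "salaried"), (0.75, "self-employed"), (0.95, "salaried"),
    (0.2, "self-employed"), (0.25, "self-employed"),
]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

X = one_hot(rows, nominal_idx={1})
X_bal, y_bal = smote(X, labels, minority=1, n_new=4)
X_clean, y_clean = enn(X_bal, y_bal)
print(Counter(y_bal), Counter(y_clean))
```

In practice the resampled data would then be passed to a classifier such as RF or XGBoost, with resampling applied only to the training split so that the evaluation data keep their original class distribution.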

DOI: https://doi.org/10.2478/fcds-2025-0009 | Journal eISSN: 2300-3405 | Journal ISSN: 0867-6356
Language: English
Page range: 229 - 270
Submitted on: Aug 31, 2024
Accepted on: Feb 13, 2025
Published on: Jun 10, 2025
Published by: Poznan University of Technology
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2025 Keith R Musara, Edmore Ranganai, Charles Chimedza, Florence Matarise, Sheunesu Munyira, published by Poznan University of Technology
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.