Robust Distance-Based SMOTE Approaches for Skewed Fat-Tailed Distributed Datasets with Heterogeneous Features


Abstract

Skewed, fat-tailed (class-imbalanced) datasets with inherently heterogeneous features often cause serious problems in machine learning (ML) applications, particularly in credit risk assessment. To mitigate these problems, data-level (DL) approaches have been suggested in the literature. Notably, the benchmark DL approaches consist of variants of the synthetic minority oversampling technique (SMOTE) that are customized to handle various data aberrations using distance measures. To counter the limitations of the Euclidean distance (ED) in capturing heterogeneous features, these variants enhance their computational efficiency by using the modified ED (MED). However, the MED fails to account for correlations between features. To circumvent this shortcoming, some studies have suggested using the Mahalanobis distance (MD), which accounts for correlations between features. Nevertheless, this measure falls short because it cannot capture heterogeneous features and is susceptible to outliers. Therefore, this study proposes a modified Mahalanobis distance (MMD) designed to capture heterogeneous features. Additionally, the MMD parameters are estimated using the median absolute deviation (MAD), which limits the influence of outliers. The study evaluates the computational efficiency of SMOTE-based approaches combined with the MAD-based MMD, using the random forest (RF) algorithm, one of the most widely employed ensemble-based ML approaches. The findings indicate that the proposed approach significantly outperformed conventional approaches. This investigation enriches the understanding of generalizable predictive performance in the credit risk landscape.
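
The abstract does not state the MMD formula, so the following is only a minimal sketch of the general idea of a Mahalanobis-style distance whose scale parameters come from the MAD. It assumes numeric features only and approximates the covariance as D R D, where D holds MAD-based scales and R is the ordinary Pearson correlation matrix. The function names `mad_scale` and `robust_mahalanobis`, the 1.4826 consistency constant, and the D R D construction are illustrative assumptions, not the authors' method, and the sketch does not address heterogeneous (mixed-type) features.

```python
import numpy as np


def mad_scale(X):
    """Return the per-feature median and scaled MAD of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    # 1.4826 makes the MAD comparable to a standard deviation under normality.
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    return med, mad


def robust_mahalanobis(x, y, X):
    """Mahalanobis-style distance between x and y with MAD-based scales.

    Assumption (not from the paper): covariance ~ D R D, with D = diag(MAD)
    and R the Pearson correlation matrix of the reference data X.
    Requires at least two features and a non-zero MAD for each feature.
    """
    _, mad = mad_scale(X)
    R = np.corrcoef(X, rowvar=False)
    cov = np.outer(mad, mad) * R  # elementwise: cov[i, j] = mad[i] * mad[j] * R[i, j]
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve cov @ z = diff instead of forming the inverse explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))
```

In a SMOTE-style oversampler, a distance of this kind would stand in for the Euclidean distance when selecting the nearest minority-class neighbours between which synthetic points are interpolated.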
