Extended Hybrid Resampling Architecture for Addressing Imbalanced Datasets in Multi-Label Classification
Abstract
Class imbalance is a common problem in multi-label classification (MLC) and can reduce the predictive accuracy of classifiers. To address it, recent studies have proposed hybrid resampling approaches that combine data-level balancing techniques for MLC. This research aims to improve the performance of multi-label classifiers on imbalanced datasets by developing and testing an extended hybrid resampling architecture based on the REMEDIAL-Hybrid-with-Resampling (R-HwR) variants R-HwR-ROS and R-HwR-SMT. The architecture extends R-HwR-ROS and R-HwR-SMT with three resampling strategies: Multi-Label Edited Nearest Neighbor (MLeNN), Multi-Label Tomek Link (MLTL), and Multi-Label Random Under Sampling (MLRUS). Each combination was tested with five multi-label classifiers: Binary Relevance (BR), Classifier Chain (CC), Calibrated Label Ranking (CLR), Label Powerset (LP), and Multi-Label k-Nearest Neighbor (ML-kNN). Classifier performance was evaluated across several benchmark datasets using Micro-F1, Macro-F1, and Hamming Loss, and the Wilcoxon signed-rank and Friedman tests were used to identify significant improvements and optimal setups. The hybrid Base + MLTL significantly improved both R-HwR-ROS and R-HwR-SMT, whereas Base + MLeNN significantly improved R-HwR-ROS (p < 0.05). Among the classifiers, CC emerged as the most reliable. In R-HwR-ROS, MLeNN outperformed the other combinations with the BR, CC, and CLR classifiers, whereas MLTL outperformed the other combinations with the LP and ML-kNN classifiers. In R-HwR-SMT, MLTL outperformed the other combinations for all classifiers. Overall, hybrid resampling with MLeNN or MLTL improved classifier robustness and balance across varied datasets.
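As an illustration of the evaluation measures named above, the following minimal sketch computes Micro-F1, Macro-F1, and Hamming Loss on a toy label matrix. It is not the authors' evaluation code: the function names, the toy data, and the convention of scoring a label with no true or predicted positives as F1 = 0 are assumptions made for this example.

```python
from typing import List, Tuple

# Rows are instances, columns are labels, entries are 0 or 1.
Matrix = List[List[int]]


def label_counts(y_true: Matrix, y_pred: Matrix) -> List[Tuple[int, int, int]]:
    """Return (true positives, false positives, false negatives) per label."""
    counts = []
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    return counts


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Pool the counts over all labels, then compute a single F1."""
    counts = label_counts(y_true, y_pred)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1_from_counts(tp, fp, fn)


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Compute F1 per label, then take the unweighted mean."""
    counts = label_counts(y_true, y_pred)
    return sum(f1_from_counts(*c) for c in counts) / len(counts)


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of instance-label pairs that are predicted incorrectly."""
    n_pairs = len(y_true) * len(y_true[0])
    wrong = sum(
        1 for t, p in zip(y_true, y_pred) for tj, pj in zip(t, p) if tj != pj
    )
    return wrong / n_pairs


# Toy data: the third label is rare and the classifier never predicts it.
Y_TRUE = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [0, 1, 0]]
Y_PRED = [[1, 0, 0], [1, 1, 0], [1, 0, 0], [1, 1, 0]]

if __name__ == "__main__":
    print(f"Micro-F1:     {micro_f1(Y_TRUE, Y_PRED):.3f}")
    print(f"Macro-F1:     {macro_f1(Y_TRUE, Y_PRED):.3f}")
    print(f"Hamming Loss: {hamming_loss(Y_TRUE, Y_PRED):.3f}")
```

On this toy data Micro-F1 is 10/12 while Macro-F1 is 13/21, because the missed rare label contributes a per-label F1 of zero to the macro average. This gap is why imbalance studies typically report both measures.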