Self-organizing maps as a way to evaluate optimal strategies for balancing binary class distributions: a methodological approach
Abstract
Since machine learning algorithms rely on data, the way datasets are collected significantly impacts their performance. Data must be carefully gathered to minimize missing values and class imbalance. However, the inherent nature of the data can sometimes lead to such imbalances. An unbalanced dataset can produce biased models whose predictions are dominated by the majority class. To avoid this problem, balancing strategies can be applied to equalize the number of instances in each class. This paper introduces a methodological approach to evaluate which balancing strategies yield the best results for a given dataset. We leverage self-organizing maps, an unsupervised neural network model, to identify which strategy generates the most suitable balanced synthetic data. By considering the topological structure of the data, we propose a metric that uses the trained map to measure changes between the original dataset and the transformed dataset after applying different strategies. This metric is based on the idea that synthetic data resembling the original dataset more closely is preferable.
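The general idea can be illustrated with a minimal sketch. The abstract does not specify the metric, so everything below is an assumption made for illustration: a small NumPy-only self-organizing map is trained on the original data, and the "change" introduced by a balancing strategy is measured as the total variation distance between the best-matching-unit histograms of the minority class before and after balancing. The function names (`train_som`, `hit_histogram`, `map_shift`) and the two toy strategies are hypothetical and are not the authors' implementation.

```python
import numpy as np


def train_som(data, rows=6, cols=6, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM with the classic online update rule."""
    rng = np.random.default_rng(seed)
    # Grid coordinates of every unit, shape (rows * cols, 2).
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise the weights from randomly chosen training samples.
    weights = data[rng.integers(0, len(data), rows * cols)].astype(float)
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(len(data)):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5    # neighbourhood width stays positive
            x = data[idx]
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            grid_dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights


def hit_histogram(weights, data):
    """Fraction of samples whose best-matching unit is each SOM unit."""
    d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(d2.argmin(axis=1), minlength=len(weights))
    return counts / counts.sum()


def map_shift(weights, original, transformed):
    """Assumed stand-in metric: total variation distance between hit histograms."""
    diff = hit_histogram(weights, original) - hit_histogram(weights, transformed)
    return 0.5 * np.abs(diff).sum()


def oversample_duplicates(minority, n_new, rng):
    """Toy strategy: random oversampling with replacement."""
    return np.vstack([minority, minority[rng.integers(0, len(minority), n_new)]])


def oversample_interpolate(minority, n_new, rng):
    """Toy strategy: interpolate random minority pairs (a simplification, not true SMOTE)."""
    a = minority[rng.integers(0, len(minority), n_new)]
    b = minority[rng.integers(0, len(minority), n_new)]
    return np.vstack([minority, a + rng.random((n_new, 1)) * (b - a)])


# Synthetic two-class data, purely illustrative.
rng = np.random.default_rng(42)
majority = rng.normal(0.0, 0.5, size=(200, 2))
minority = rng.normal(3.0, 0.5, size=(40, 2))

som = train_som(np.vstack([majority, minority]))
n_new = len(majority) - len(minority)
candidates = {
    "duplicates": oversample_duplicates(minority, n_new, rng),
    "interpolate": oversample_interpolate(minority, n_new, rng),
}
for name, balanced_minority in candidates.items():
    # A lower shift means the balanced minority class occupies the map more like the original.
    print(name, round(float(map_shift(som, minority, balanced_minority)), 3))
```

Under this assumed metric, a strategy with a lower shift would be preferred, in line with the stated idea that synthetic data resembling the original more closely is preferable. Comparing only the minority class is a design choice of this sketch: comparing the full datasets would register the intended change in class proportions as a shift.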