Impact of Dataset Imbalance on Machine Learning Models for Diabetes Mellitus Prediction


Abstract

Diabetes mellitus is a chronic medical condition that requires early detection and management to prevent severe complications. Machine learning (ML) models have shown promise in predicting diabetes, yet their effectiveness can be significantly hindered by dataset imbalance. In medical datasets, particularly those used for diabetes prediction, imbalance often occurs when diabetic cases (the minority class) are vastly underrepresented compared to non-diabetic cases (the majority class). This imbalance can skew model performance: the model becomes more likely to predict the majority class, resulting in high accuracy but poor sensitivity for actual diabetic cases. This paper explores the impact of dataset imbalance on machine learning models used for diabetes prediction, highlighting challenges such as model overfitting, misclassification, and unreliable performance metrics. It also reviews strategies for addressing dataset imbalance, including data-level methods such as oversampling and undersampling, algorithm-level approaches such as cost-sensitive learning, and hybrid solutions. Case studies from recent research are presented to demonstrate the consequences of imbalance and the improvements achieved by applying balancing techniques. The findings emphasize the critical need for more representative datasets and for the adoption of advanced techniques to enhance the predictive accuracy and reliability of ML models in healthcare applications. This study provides insights into future directions for diabetes prediction systems, focusing on improving data quality, model robustness, and, ultimately, patient outcomes.
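The gap between accuracy and sensitivity described in the abstract can be illustrated with a minimal sketch. The labels below are hypothetical toy data (95 non-diabetic, 5 diabetic), not figures from the paper, and the helper names `accuracy`, `sensitivity`, and `random_oversample` are introduced here for illustration only. The oversampler is a plain random duplication of minority examples, one of the data-level methods the abstract mentions, and is not necessarily the variant used in the reviewed studies.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred):
    """Recall on the positive (diabetic = 1) class: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


# Hypothetical toy data: 95 non-diabetic (0) and 5 diabetic (1) patients.
y_true = [0] * 95 + [1] * 5
X = [[float(i)] for i in range(100)]  # placeholder single-feature rows

# A degenerate model that always predicts the majority class.
y_pred_majority = [0] * 100

acc = accuracy(y_true, y_pred_majority)
sens = sensitivity(y_true, y_pred_majority)
print(f"accuracy={acc:.2f}, sensitivity={sens:.2f}")

# Balance the classes by random oversampling of the minority class.
X_bal, y_bal = random_oversample(X, y_true)
print(f"balanced size={len(y_bal)}, positives={sum(y_bal)}")
```

In this toy setting the always-negative model scores well on accuracy while detecting no diabetic case at all, which is why sensitivity should be reported alongside accuracy. In practice, oversampling would be applied to the training split only, so that duplicated minority rows do not leak into the evaluation data.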
