The Impact of the SMOTE Method on Machine Learning and Ensemble Learning Performance Results in Addressing Class Imbalance in Data Used for Predicting Total Testosterone Deficiency in Type 2 Diabetes Patients

Mehmet Kivrak
Ugur Avci
Hakki Uzun
Cuneyt Ardic

Abstract

Background and Objective: Diabetes mellitus is a long-term, multifaceted metabolic condition that requires ongoing medical management. Hypogonadism is a syndrome marked by clinical and/or biochemical signs of testosterone deficiency. Cross-sectional studies have reported that 20–80.4% of all men with Type 2 diabetes have hypogonadism, and that Type 2 diabetes is associated with low testosterone. This study analyzes the use of machine learning (ML) and ensemble learning (EL) classifiers in predicting testosterone deficiency. We compared traditional ML classifiers and three EL classifiers, all optimized with grid search and stratified k-fold cross-validation. We used the synthetic minority over-sampling technique (SMOTE) to address the class imbalance problem.

Methods: The database contains 3397 patients assessed for testosterone deficiency. Of these, the 1886 patients with Type 2 diabetes were included in the study. In the data preprocessing stage, outlier (extreme observation) analysis was performed with the local outlier factor (LOF), and missing-value analysis was performed with random forest. SMOTE was then applied to generate synthetic samples of the minority class. Four base classifiers, namely multilayer perceptron (MLP), random forest (RF), extreme learning machine (ELM) and logistic regression (LR), were used as first-level classifiers. Three ensemble classifiers, namely AdaBoost (ADA), XGBoost and stochastic gradient boosting (SGB), were used as second-level classifiers.

Results: After SMOTE, diagnostic accuracy decreased in all base classifiers except ELM, while sensitivity increased in all classifiers. Similarly, specificity decreased in all classifiers, while the F1 score increased. Among the base classifiers, RF gave the most successful results on the training dataset. The most successful ensemble classifier on the training dataset was ADA, with both the original data and the SMOTE data. On the testing data, XGBoost was the most suitable model. XGBoost, which shows balanced performance especially when SMOTE is used, can be preferred when class imbalance needs to be corrected.

Conclusions: SMOTE is used to correct class imbalance in the original data. As seen in this study, when SMOTE was applied, diagnostic accuracy decreased in some models but sensitivity increased significantly. This shows the positive effect of SMOTE on predicting the minority class.
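The abstract's two central ideas, how SMOTE creates synthetic minority samples and how accuracy, sensitivity, specificity and F1 are derived from a confusion matrix, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `smote_sketch` and `binary_metrics`, the toy feature vectors and the confusion-matrix counts are all hypothetical, and the sampling is simplified (a random base point is drawn for each synthetic sample, whereas library implementations typically generate a fixed number per minority sample).

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (assumes non-zero denominators)."""
    sensitivity = tp / (tp + fn)   # recall on the positive (minority) class
    specificity = tn / (tn + fp)   # recall on the negative (majority) class
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical toy data, not taken from the study.
toy_minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(toy_minority, n_new=6, k=2)
print(len(new_points))
print(binary_metrics(tp=30, fp=20, tn=140, fn=10))
```

Because every synthetic point lies on a segment between two real minority samples, oversampling pushes a classifier toward predicting the minority class more often. That tends to raise sensitivity at the cost of more false positives, which is one plausible reading of why accuracy and specificity fell while sensitivity and F1 rose in the reported results.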

Version published to 10.3390/diagnostics14232634
Nov 22, 2024
Version published to 10.20944/preprints202410.1324.v1
Oct 16, 2024

Related articles

Exploration and Analysis of Risk Factors for Coronary Artery Disease with Type 2 Diabetes Based on SHAP Explainable Machine Learning Algorithm

This article has 7 authors:
1. Dandan Tang
2. Fengwei Liang
3. Xingli Gu
4. Yuanyuan Jin
5. Xuanjie Hu
6. Fen Liu
7. Yining Yang
This article has no evaluations
Latest version May 21, 2025

Machine Learning and Deep Learning Approaches for Predicting Diabetes Progression: A Comparative Analysis

This article has 3 authors:
1. Oluwafisayo Babatope Ayoade
2. Seyed Shahrestani
3. Chun Ruan
This article has no evaluations
Latest version Jun 26, 2025

Prostate Cancer Prediction Model Based on Machine Learning

This article has 8 authors:
1. Long Zhang
2. JianMei Zhang
3. Ru Huang
4. YongXue Du
5. QiHang Duan
6. XingYu Chen
7. SuPing Wang
8. Ying Zhang
This article has no evaluations
Latest version May 7, 2025
