Comparing Algorithm Effectiveness in Health Data Analysis

Abstract

Stroke remains one of the leading causes of death and long-term disability worldwide, highlighting the critical need for early detection and prevention. Recent advances in machine learning (ML) offer promising solutions for identifying individuals at high risk based on clinical and demographic factors. This study presents a comparative analysis of six supervised ML algorithms, namely Logistic Regression (LR), Random Forest (RF), XGBoost (XGB), K-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), and Support Vector Classifier (SVC), for predicting stroke occurrence using the Healthcare Stroke Dataset from Kaggle. Preprocessing included handling missing values, encoding categorical variables, normalization, and addressing class imbalance with the Synthetic Minority Oversampling Technique (SMOTE). Each model was evaluated using five-fold cross-validation on Accuracy, Precision, Recall, and F1-score. RF achieved the highest overall accuracy (0.928), while SVC achieved the highest recall (0.790), indicating superior sensitivity in detecting true stroke cases. The findings demonstrate that integrating data balancing with multi-metric evaluation significantly enhances predictive performance and clinical reliability in stroke prediction systems.
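The distinction the abstract draws between accuracy and recall can be illustrated with a minimal Python sketch. It is not the authors' code and does not use their data: the function `binary_metrics`, the `Metrics` container, and the two toy prediction lists are hypothetical names and values chosen only for illustration. The sketch computes the four reported metrics from confusion-matrix counts on an imbalanced label set (5 positives in 100 cases) and shows why the most accurate classifier is not necessarily the most sensitive one.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = stroke)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    # Undefined ratios (zero denominator) are reported as 0.0 by convention here.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return Metrics(accuracy, precision, recall, f1)


# Hypothetical toy labels, NOT the study's data: 5 stroke cases among 100 patients.
y_true = [1] * 5 + [0] * 95

# Classifier A (hypothetical): always predicts "no stroke".
always_negative = [0] * 100

# Classifier B (hypothetical): finds 4 of the 5 strokes but raises 10 false alarms.
sensitive = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85

m_neg = binary_metrics(y_true, always_negative)
m_sens = binary_metrics(y_true, sensitive)

for name, m in [("always negative", m_neg), ("sensitive", m_sens)]:
    print(f"{name}: acc={m.accuracy:.3f} prec={m.precision:.3f} "
          f"rec={m.recall:.3f} f1={m.f1:.3f}")
```

In this toy setting the always-negative classifier reaches an accuracy of 0.95 with a recall of 0, while the sensitive classifier has a lower accuracy of 0.89 but a recall of 0.80. This is the general reason an imbalanced problem such as stroke prediction is usually reported with several metrics rather than accuracy alone; it says nothing about how the RF and SVC figures in the abstract were obtained.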
