Machine Learning–Based Prediction of Heart Disease Using Logistic Regression, Support Vector Machine, and Random Forest Classifier
Abstract
Heart disease remains the leading cause of death worldwide, which underscores the need for early and accurate diagnostic methods that support clinical decision-making. This project develops and evaluates a machine learning–based predictive system for heart disease using a real-world Heart Failure Prediction dataset of 918 anonymized patient records with 11 clinical attributes. During preprocessing, medically impossible values were identified and treated: invalid cholesterol readings were replaced with the median, nonsensical entries were removed, categorical variables were encoded, and features were standardized before model training. Three supervised learning algorithms were implemented and compared on the binary classification task: Logistic Regression, Support Vector Machine (SVM) with an RBF kernel, and Random Forest. Exploratory Data Analysis (EDA) and cross-validation were used to check data quality and model reliability. Model performance was evaluated using accuracy, precision, recall, F1-score, confusion matrices, and ROC–AUC. The Random Forest classifier achieved the best overall performance, with an accuracy of 87.50%, precision of 91.59%, recall of 87.50%, F1-score of 89.50%, and an AUC of 0.9391, outperforming both SVM and Logistic Regression. Logistic Regression provided an interpretable baseline, but its higher false-negative rate made it less suitable for high-risk clinical applications. SVM showed strong non-linear classification capability but required more computationally intensive tuning. Taken together, these results indicate that Random Forest was the most reliable and robust of the three models for heart disease prediction on this dataset.
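As a quick consistency check on the reported Random Forest metrics, the F1-score can be recomputed from the stated precision and recall, since F1 is their harmonic mean. The sketch below uses only the figures quoted in the abstract; the helper name `f1_from` is illustrative and does not come from the preprint.

```python
def f1_from(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Random Forest precision and recall as reported in the abstract, as fractions.
rf_f1 = f1_from(0.9159, 0.8750)
print(f"{rf_f1:.4f}")  # 0.8950
```

The recomputed value agrees with the reported F1-score of 89.50% to the stated precision.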
Future work should incorporate broader lifestyle factors, improved data collection, more sophisticated outlier handling, and additional machine learning models, and could extend to deployment as a clinical decision-support tool through web or mobile applications.
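The preprocessing and model comparison described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' code. It runs on synthetic stand-in data rather than the real 918-record dataset. The column names (`Age`, `RestingBP`, `Cholesterol`, `MaxHR`, `Sex`, `ChestPainType`, `HeartDisease`), the rule that a zero reading counts as invalid, the hyperparameters, and the 5-fold ROC-AUC scoring are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): NOT the dataset used in the preprint.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Age": rng.integers(30, 80, n),
    "RestingBP": rng.integers(90, 180, n),
    "Cholesterol": rng.integers(120, 400, n),
    "MaxHR": rng.integers(70, 200, n),
    "Sex": rng.choice(["M", "F"], n),
    "ChestPainType": rng.choice(["ATA", "NAP", "ASY", "TA"], n),
})
# Toy label loosely tied to age and maximum heart rate, plus noise.
risk = 0.04 * (df["Age"] - 55) - 0.02 * (df["MaxHR"] - 135) + rng.normal(0, 1, n)
df["HeartDisease"] = (risk > 0).astype(int)

# Inject medically impossible values so the cleaning step has work to do.
df.loc[df.sample(20, random_state=0).index, "Cholesterol"] = 0
df.loc[0, "RestingBP"] = 0


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with impossible blood pressure; median-fill invalid cholesterol."""
    out = frame[frame["RestingBP"] > 0].copy()
    out["Cholesterol"] = out["Cholesterol"].astype(float)
    valid = out["Cholesterol"] > 0
    out.loc[~valid, "Cholesterol"] = out.loc[valid, "Cholesterol"].median()
    return out.reset_index(drop=True)


numeric = ["Age", "RestingBP", "Cholesterol", "MaxHR"]
categorical = ["Sex", "ChestPainType"]

# Standardize numeric features and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm_rbf": SVC(kernel="rbf"),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}


def evaluate(frame: pd.DataFrame, cv: int = 5) -> dict:
    """Return the mean cross-validated ROC-AUC for each candidate model."""
    X = frame[numeric + categorical]
    y = frame["HeartDisease"]
    scores = {}
    for name, model in models.items():
        pipe = Pipeline([("prep", preprocess), ("clf", model)])
        scores[name] = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean()
    return scores


cleaned = clean(df)
results = evaluate(cleaned)
for name, auc in results.items():
    print(f"{name}: mean ROC-AUC = {auc:.3f}")
```

Wrapping the scaler and encoder in a `Pipeline` means they are refit on each training fold, which keeps test-fold statistics out of training. The median fill here is computed on the full table for brevity; a stricter setup would compute it inside each fold as well. Scores from this synthetic data say nothing about the figures reported in the abstract.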