Comparative Study of Machine Learning Models for Textual Medical Notes Classification

Abstract

The expansion of electronic health records (EHRs) has generated a large amount of unstructured textual data, such as clinical notes and medical reports, which contain diagnostic and prognostic information. Effective classification of these textual medical notes is critical for improving clinical decision support and healthcare data management. This study presents a comparative analysis of four traditional machine learning algorithms (Random Forest, Logistic Regression, Multinomial Naive Bayes, and Support Vector Machine) for multiclass classification of medical notes into four disease categories: Neoplasms, Digestive System Diseases, Nervous System Diseases, and Cardiovascular Diseases. A dataset containing 9,633 labeled medical notes was preprocessed through text cleaning, lemmatization, stop-word removal, and vectorization using term frequency-inverse document frequency (TF-IDF) representation. Each model was tuned using grid search and cross-validation to optimize classification performance. Evaluation metrics, including accuracy, precision, recall, and F1-score, were used to assess model performance. The results indicate that Logistic Regression achieved the highest overall accuracy (0.83), followed closely by Random Forest, Support Vector Machine, and Multinomial Naive Bayes (0.80 each). These findings confirm that traditional machine learning models remain robust, interpretable, and computationally efficient tools for textual medical note classification.
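
The abstract describes a standard pipeline of TF-IDF vectorization followed by grid-searched classifiers. The sketch below is a minimal illustration of that workflow in scikit-learn, not the authors' code: the file name medical_notes.csv, its note/label columns, and the hyperparameter grids are assumptions, and the lemmatization step mentioned in the abstract is omitted for brevity.

```python
# Minimal sketch of a TF-IDF + grid-searched classifier comparison.
# Assumed inputs: a CSV file "medical_notes.csv" with "note" and "label" columns.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

df = pd.read_csv("medical_notes.csv")  # hypothetical file name and schema
X_train, X_test, y_train, y_test = train_test_split(
    df["note"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

# Candidate models with illustrative hyperparameter grids
# (the paper does not specify its exact grids).
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1, 10]}),
    "random_forest":       (RandomForestClassifier(), {"clf__n_estimators": [200, 500]}),
    "naive_bayes":         (MultinomialNB(), {"clf__alpha": [0.1, 0.5, 1.0]}),
    "svm":                 (LinearSVC(), {"clf__C": [0.1, 1, 10]}),
}

for name, (model, grid) in candidates.items():
    # TF-IDF with stop-word removal, followed by the classifier under test.
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
        ("clf", model),
    ])
    search = GridSearchCV(pipe, grid, cv=5, scoring="f1_macro", n_jobs=-1)
    search.fit(X_train, y_train)
    print(name, search.best_params_)
    # Precision, recall, and F1 per class, matching the metrics reported in the study.
    print(classification_report(y_test, search.predict(X_test)))
```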
