Comparative Study of Machine Learning Models for Textual Medical Note Classification
Abstract
The expansion of electronic health records (EHRs) has generated a large amount of unstructured textual data, such as clinical notes and medical reports, which contain diagnostic and prognostic information. Effective classification of these textual medical notes is critical for improving clinical decision support and healthcare data management. This study presents a statistically rigorous comparative analysis of four traditional machine learning algorithms—Random Forest, Logistic Regression, Multinomial Naive Bayes, and Support Vector Machine—for multiclass classification of medical notes into four disease categories: Neoplasms, Digestive System Diseases, Nervous System Diseases, and Cardiovascular Diseases. A dataset containing 9633 labeled medical notes was preprocessed through text cleaning, lemmatization, stop-word removal, and vectorization using term frequency-inverse document frequency (TF–IDF) representation. The models were trained and optimized through GridSearchCV with 5-fold cross-validation and evaluated across five independent stratified 90-10 train–test splits. Evaluation metrics, including accuracy, precision, recall, F1-score, and multiclass ROC-AUC, were used to assess model performance. Logistic Regression demonstrated the strongest overall performance, achieving an average accuracy of 0.8469 and high macro and weighted F1 scores, followed by Support Vector Machine and Multinomial Naive Bayes. Misclassification patterns revealed substantial lexical overlap between digestive and neurological disease notes, underscoring the limitations of TF–IDF representations in capturing deeper semantic distinctions. These findings confirm that traditional machine learning models remain robust, interpretable, and computationally efficient tools for textual medical note classification, and the study establishes a transparent and reproducible benchmark that provides a solid foundation for future methodological advancements in clinical natural language processing.
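The pipeline described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' exact configuration: the toy notes, labels, and hyperparameter grid are placeholders, while the TF–IDF vectorization, GridSearchCV with 5-fold cross-validation, stratified 90-10 split, and Logistic Regression classifier mirror the stated setup.

```python
# Hedged sketch of the described pipeline: TF-IDF features fed to a
# Logistic Regression classifier tuned via GridSearchCV (5-fold CV).
# The notes/labels below are toy stand-ins for the 9,633 labeled medical
# notes across four disease categories; the hyperparameter grid is
# hypothetical, as the abstract does not list the exact grid searched.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy corpus: one short "note" per disease category, repeated.
notes = [
    "tumor mass biopsy oncology lesion",
    "gastric ulcer abdominal pain reflux",
    "seizure neuropathy migraine tremor",
    "myocardial infarction arrhythmia hypertension",
] * 10
labels = ["neoplasm", "digestive", "nervous", "cardiovascular"] * 10

# Stratified 90-10 train-test split, as in the study.
X_tr, X_te, y_tr, y_te = train_test_split(
    notes, labels, test_size=0.10, stratify=labels, random_state=0
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the regularization strength C; 5-fold CV.
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_tr, y_tr)

acc = accuracy_score(y_te, grid.predict(X_te))
print(f"held-out accuracy: {acc:.4f}")
```

In practice the study also applies text cleaning, lemmatization, and stop-word removal before vectorization; here only scikit-learn's built-in English stop-word filtering stands in for that preprocessing.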