An enhanced explainable thyroid disease diagnosis by leveraging Cluster-SMOTE and machine learning models

Abstract

Thyroid disorders represent a major public health concern worldwide, affecting metabolic regulation and increasing the risk of cardiovascular and systemic complications when not detected early. Existing machine learning (ML) approaches for thyroid disease prediction are often limited by severe class imbalance, suboptimal calibration, and a lack of model interpretability. This study combines the Cluster-based Synthetic Minority Oversampling Technique (Cluster-SMOTE), used to preserve minority class structure, with multiple ML models. The Random Forest classifier emerged as the best-performing model based on the F1-score criterion. Model reliability was further assessed using calibration analysis, Brier score evaluation, and Decision Curve Analysis (DCA). SHapley Additive exPlanations (SHAP) were employed to provide both global and local explanations of model predictions. Experimental evaluation on a publicly available thyroid disease dataset demonstrated that the proposed Random Forest–based framework achieved an F1-score of 0.99, accuracy of 0.99, precision of 0.99, recall of 0.99, AUC of 0.99, and a Brier score of 0.003. DCA further confirmed that the proposed model yields higher net clinical benefit across a wide range of threshold probabilities. These findings demonstrate that combining Cluster-SMOTE, a robust Random Forest classifier, and explainable AI (XAI) validation with SHAP produces an accurate, well-calibrated, and clinically interpretable thyroid disease prediction framework.
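For readers unfamiliar with the reliability metrics named above, the following is a minimal sketch of how a Brier score and the net benefit used in Decision Curve Analysis are commonly computed. It assumes the standard textbook definitions and is not the authors' code; the function names and the toy labels and probabilities are invented for illustration and are unrelated to the thyroid dataset.

```python
import numpy as np


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on every case whose predicted risk is >= threshold.

    Standard DCA formula: TP/n - (FP/n) * threshold / (1 - threshold).
    """
    y_true = np.asarray(y_true, dtype=int)
    y_prob = np.asarray(y_prob, dtype=float)
    n = y_true.size
    flagged = y_prob >= threshold
    tp = np.sum(flagged & (y_true == 1))
    fp = np.sum(flagged & (y_true == 0))
    return float(tp / n - (fp / n) * threshold / (1.0 - threshold))


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy in DCA: act on every case regardless of predicted risk."""
    prevalence = float(np.mean(np.asarray(y_true, dtype=float)))
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy data (assumption): 1 = disease present, 0 = absent.
y = [0, 0, 1, 1, 0, 1, 0, 0]
p = [0.05, 0.10, 0.90, 0.80, 0.20, 0.95, 0.02, 0.60]

print("Brier score:", brier_score(y, p))
for t in (0.1, 0.3, 0.5):
    print(t, net_benefit(y, p, t), net_benefit_treat_all(y, t))
```

A lower Brier score indicates predicted probabilities closer to the observed outcomes. In a decision curve, a model is considered useful at a given threshold when its net benefit exceeds both the treat-all reference and the treat-none reference, which is zero by definition.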
