Calibrated, explainable machine learning on routine laboratory data for multiclass differential diagnosis of rheumatic diseases: a retrospective study of 12,085 patients

Abstract

Background: Overlapping clinical features and routine biomarker profiles hinder timely differentiation of common rheumatic diseases, especially seronegative spondyloarthritis. We assessed whether machine learning (ML) models trained on routinely collected laboratory data can deliver accurate, calibrated, and explainable multiclass diagnoses.

Methods: In a retrospective analysis of a fully de-identified public dataset (n=12,085), adults (≥18 years) with confirmed diagnoses and ≤30% biomarker missingness were included. Nineteen predictors (demographics, ESR/CRP, and serology) and four engineered features were used. Missing values (~14.5%) were imputed using multiple imputation by chained equations (MICE), and continuous variables were standardized. Stratified sampling created 80/20 train–test splits. We trained Random Forest, LightGBM, XGBoost, CatBoost, and TabNet with prespecified regularization. Performance was evaluated on an independent test set with 5-fold internal cross-validation, pairwise McNemar testing, and calibration metrics (Brier score and expected calibration error, ECE). SHAP provided explainability. A predefined seronegative subgroup (negative for RF and anti-CCP antibodies) was examined.

Results: All models performed well; XGBoost achieved the highest accuracy (85.48%). Random Forest (83.78%) was selected for detailed interpretation owing to its accuracy–calibration balance. Macro-averaged AUC values exceeded 0.92. Calibration analyses showed close agreement between predicted and observed probabilities. SHAP ranked ESR/CRP, RF/anti-CCP, HLA-B27, and complement C3/C4 as the dominant contributors, consistent with disease biology. The per-class results mirrored known challenges: SLE showed excellent detection (precision, 100.0%; recall, 97.9%), whereas ankylosing spondylitis (AS) had the lowest recall (57.6%). Across 2,417 test cases, 381 (15.76%) were misclassified; the most frequent error was AS → RA (109 cases; 28.6% of errors). In seronegative patients (n=390), HLA-B27 prevalence was higher (+6.5%; p=0.024) and anti-La prevalence was lower (–11.6%; p=0.001); SHAP identified HLA-B27 as the leading marker when RF/anti-CCP was absent.

Conclusions: Routine laboratory data can be transformed into calibrated, explainable probabilities to aid multiclass differential diagnoses in rheumatology. The remaining gaps, particularly for seronegative spondyloarthritis, motivate external/temporal validation, integration of clinical and imaging features, prospective utility evaluation, and prespecified fairness assessments.
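
The abstract names the two calibration metrics (Brier score and ECE) but does not define them. The following is a minimal sketch of how they are commonly computed for multiclass probabilities. It is not the authors' code: the function names, the sum-over-classes form of the Brier score, the top-label form of ECE, the choice of 10 equal-width bins, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def multiclass_brier(probs, labels):
    """Mean squared distance between predicted probability vectors and one-hot labels.

    Assumed definition: squared errors are summed over classes, then averaged over cases.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))


def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE: weighted mean gap between accuracy and confidence per bin.

    Assumed definition: confidence is the largest predicted probability, and
    bins are equal-width intervals (lo, hi] on [0, 1].
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin indices run from 0 to n_bins - 1.
    bin_ids = np.digitize(confidence, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the share of cases in this bin
    return float(ece)


# Hypothetical toy predictions for three classes (not study data).
toy_probs = np.array([
    [0.85, 0.10, 0.05],
    [0.65, 0.30, 0.05],
    [0.20, 0.75, 0.05],
    [0.10, 0.15, 0.75],
])
toy_labels = np.array([0, 1, 1, 2])

brier = multiclass_brier(toy_probs, toy_labels)
ece = expected_calibration_error(toy_probs, toy_labels)
print(f"Brier score: {brier:.4f}, ECE: {ece:.4f}")
```

Lower values indicate better calibration for both metrics. Reported ECE values depend on the binning scheme, so figures computed with a different number of bins are not directly comparable.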
