Model uncertainty quantification: A post hoc calibration approach for heart disease prediction
Abstract
We investigate whether post-hoc calibration improves the clinical trustworthiness of heart-disease predictions beyond conventional accuracy metrics. Using a structured clinical dataset (1,025 records; 85/15 train-test split), we benchmarked six classifiers (logistic regression, SVM, k-nearest neighbors, naïve Bayes, random forest, and XGBoost) on accuracy, ROC-AUC, precision, recall, and F1, and then evaluated probability quality before and after Platt (sigmoid) and isotonic calibration using the Brier score, expected calibration error (ECE), log loss, Spiegelhalter's Z-test, and reliability diagrams. Baseline discrimination was high (e.g., SVM: accuracy 92.9%, ROC-AUC 99.4%, F1 92.8%), and the ensembles achieved perfect test-set scores (random forest and XGBoost: 100% across all metrics), which motivated the calibration analysis. Isotonic calibration consistently improved probability quality for most models: random forest Brier score from 0.007 to 0.002, ECE from 0.051 to 0.011, log loss from 0.056 to 0.012; naïve Bayes Brier score from 0.162 to 0.132, ECE from 0.145 to 0.118, log loss from 1.936 to 0.446; SVM ECE from 0.086 to 0.044 and log loss from 0.142 to 0.133. Platt scaling helped some models but occasionally worsened calibration (e.g., KNN ECE from 0.035 to 0.081). Reliability diagrams corroborated these trends, with isotonic calibration yielding curves closer to the 45° diagonal, while Spiegelhalter's test moved toward non-significance for several models after calibration. Overall, isotonic calibration delivered the most consistent gains in probability reliability while preserving discrimination, strengthening the interpretability and clinical actionability of model outputs.
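The post-hoc calibration workflow described above can be sketched in a few lines with scikit-learn. The snippet below is a minimal illustration only: it uses a synthetic binary dataset in place of the heart-disease records (which are not reproduced here), a random forest as the base classifier, and a simple binned ECE helper, since ECE is not built into scikit-learn. It compares Platt scaling (`method="sigmoid"`) against isotonic regression (`method="isotonic"`) on Brier score, ECE, and log loss, mirroring the paper's evaluation.

```python
# Hedged sketch of post-hoc calibration (Platt vs. isotonic) on synthetic
# data; dataset, model choice, and bin count are illustrative assumptions.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binned ECE: |bin accuracy - bin confidence| weighted by bin size."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # First bin is closed on the left so p == 0.0 is not dropped.
        mask = (y_prob <= hi) if lo == 0.0 else (y_prob > lo) & (y_prob <= hi)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# Synthetic stand-in for the clinical dataset: 1,025 rows, 85/15 split.
X, y = make_classification(n_samples=1025, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.15, random_state=0, stratify=y)

base = RandomForestClassifier(random_state=0)
for method in ("sigmoid", "isotonic"):  # Platt scaling vs. isotonic regression
    cal = CalibratedClassifierCV(base, method=method, cv=5).fit(X_tr, y_tr)
    p = cal.predict_proba(X_te)[:, 1]
    print(f"{method:>8}: Brier={brier_score_loss(y_te, p):.3f} "
          f"ECE={expected_calibration_error(y_te, p):.3f} "
          f"log-loss={log_loss(y_te, p):.3f}")
```

Reliability diagrams over the same held-out probabilities (e.g., via `sklearn.calibration.calibration_curve`) would show whether each method pulls the curve toward the 45° diagonal, as reported for isotonic calibration.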