Not the Models You Are Looking For: Traditional ML Outperforms LLMs in Clinical Prediction Tasks

Katherine E. Brown
Chao Yan
Zhuohang Li
Xinmeng Zhang
Benjamin X. Collins
You Chen
Ellen Wright Clayton
Murat Kantarcioglu
Yevgeniy Vorobeychik
Bradley A. Malin

Abstract

Objectives

To determine the extent to which current Large Language Models (LLMs) can serve as substitutes for traditional machine learning (ML) as clinical predictors using data from electronic health records (EHRs), we investigated various factors that can impact their adoption, including overall performance, calibration, fairness, and resilience to privacy protections that reduce data fidelity.

Materials and Methods

We evaluated GPT-3.5, GPT-4, and traditional ML (gradient-boosted trees) on clinical prediction tasks using EHR data from Vanderbilt University Medical Center (VUMC) and MIMIC-IV. We measured predictive performance with AUROC and model calibration with the Brier Score. To evaluate the impact of data privacy protections, we assessed AUROC when demographic variables are generalized. We evaluated algorithmic fairness using equalized odds and statistical parity across patient race, sex, and age. We also considered the impact of in-context learning, in which labeled examples are incorporated within the prompt.
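For readers unfamiliar with the metrics named above, the following is a minimal sketch of how AUROC, the Brier Score, statistical parity, and equalized odds can be computed. It is not the authors' evaluation code: the function names, the 0.5 decision threshold, the "largest gap between groups" formulation of the two fairness metrics, and the toy data are all assumptions made for illustration.

```python
# Illustrative sketch only: toy data and helper names are hypothetical,
# not taken from the study.

def auroc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def _rate(preds):
    return sum(preds) / len(preds)

def statistical_parity_diff(y_pred, group):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = [_rate([p for p, g in zip(y_pred, group) if g == v]) for v in set(group)]
    return max(rates) - min(rates)

def equalized_odds_diff(y_true, y_pred, group):
    """Largest between-group gap in true-positive or false-positive rate.

    Assumes every group contains at least one positive and one negative case.
    """
    gaps = []
    for label in (1, 0):  # label 1 gives TPR, label 0 gives FPR
        rates = []
        for v in set(group):
            preds = [p for y, p, g in zip(y_true, y_pred, group) if g == v and y == label]
            rates.append(_rate(preds))
        gaps.append(max(rates) - min(rates))
    return max(gaps)

# Toy example: eight patients in two hypothetical demographic groups.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.7, 0.7, 0.3, 0.1]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]
y_pred = [int(p >= 0.5) for p in y_prob]  # assumed 0.5 threshold

print("AUROC:", auroc(y_true, y_score=y_prob))
print("Brier:", brier_score(y_true, y_prob))
print("Statistical parity difference:", statistical_parity_diff(y_pred, group))
print("Equalized odds difference:", equalized_odds_diff(y_true, y_pred, group))
```

In this toy data both groups receive positive predictions at the same rate, so the statistical parity difference is zero, yet their error rates differ, so the equalized odds difference is not. This is one reason the two fairness criteria are reported separately.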

Results

Traditional ML (AUROC: 0.847 on VUMC, 0.894 on MIMIC) substantially outperformed GPT-3.5 (AUROC: 0.537, 0.517) and GPT-4 (AUROC: 0.629, 0.602), both with and without in-context learning, in predictive performance and in output probability calibration (Brier Score for ML versus GPT-3.5 versus GPT-4: 0.134 versus 0.384 versus 0.251 on VUMC; 0.042 versus 0.06 versus 0.219 on MIMIC). Traditional ML was also more robust than GPT-3.5 and GPT-4 when demographic information was generalized to protect privacy. GPT-4 was the fairest model according to our selected metrics, but at the cost of poor predictive performance.
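To make "generalizing demographic information" concrete, here is a small sketch of one common way to coarsen a patient record before it reaches a model. The abstract does not describe the generalization scheme actually used, so the 10-year age bands, the suppressed `race` field, and the helper names below are purely hypothetical.

```python
# Hypothetical generalization scheme for illustration; the study's actual
# scheme is not described in the abstract.

def generalize_age(age, width=10):
    """Replace an exact age with a band such as '60-69'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_record(record, age_width=10, suppress=("race",)):
    """Return a coarsened copy of the record; the input is left unchanged."""
    out = dict(record)
    out["age"] = generalize_age(record["age"], age_width)
    for field in suppress:
        if field in out:
            out[field] = "*"  # suppressed value
    return out

patient = {"age": 67, "sex": "F", "race": "Asian"}
coarse = generalize_record(patient)
print(coarse)
```

Robustness in the sense used above would then be assessed by comparing AUROC on the original records with AUROC on their coarsened counterparts.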

Conclusion

These findings suggest that LLMs are much less effective and robust than locally trained ML for clinical prediction tasks, although their performance is improving over time: GPT-4 achieved a higher AUROC than GPT-3.5 on both datasets.

Version published to 10.1101/2024.12.03.24318400v1 on medRxiv
Dec 5, 2024

Related articles

Fair machine learning models for disease prediction: In-depth interviews with key health experts

This article has 2 authors:
1. Nhung Nghiem
2. Ramona Tiatia
This article has no evaluations. Latest version: Feb 6, 2025

Leveraging Temporal Learning with Dynamic Range (TLDR) for Enhanced Prediction of Outcomes in Recurrent Exposure and Treatment Settings in Electronic Health Records

This article has 7 authors:
1. Jingya Cheng
2. Jonas Hügel
3. Jiazi Tian
4. Alaleh Azhir
5. Shawn N. Murphy
6. Jeffrey G. Klann
7. Hossein Estiri
This article has no evaluations. Latest version: Mar 20, 2025

Do Language Models Think Like Doctors?

This article has 15 authors:
1. Liam G. McCoy
2. Rajiv Swamy
3. Nidhish Sagar
4. Minjia Wang
5. James Cao
6. Stephen Bacchi
7. Nigel Fong
8. Nigel CK Tan
9. Kevin Tan
10. Thomas A. Buckley
11. Peter Brodeur
12. Leo Anthony Celi
13. Arjun Manrai
14. Aloysius Humbert
15. Adam Rodman
This article has no evaluations. Latest version: Feb 12, 2025
