Temperature-Driven Variability in Emergency Diagnostic Accuracy by a Leading Language Model

Philip C. Jarrett
Jared Hill
Marshall Howell
Kristen Grabow Moore
Joby J. Thoppil
Laura Vargas Ortiz
Samuel T. Parnell
D. Mark Courtney
Samuel A. McDonald
Deborah B. Diercks
Andrew R. Jamieson
Dazhe Cao

Abstract

Objective

To determine the impact of the temperature parameter on GPT-4o’s diagnostic accuracy when evaluating emergency medicine cases, and to assess its effect on diagnostic divergence across iterations.

Methods

We conducted a simulation-based diagnostic accuracy study using four challenging emergency medicine cases adapted from the Foundations of Emergency Medicine curriculum. Each case was submitted to GPT-4o 250 times at each of five temperature settings (0.0, 0.25, 0.50, 0.75, 1.0), both with and without physical examination findings, yielding 10,000 total outputs (4 cases × 5 temperatures × 2 exam conditions × 250 runs). Each output contained exactly three differential diagnoses, one of which was designated the leading diagnosis. Diagnostic accuracy was assessed by comparing outputs against predetermined gold-standard diagnoses.
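The design described above can be enumerated as a simple grid to check the output count. This is a minimal sketch, not the authors' pipeline: the case labels and the `build_run_grid` helper are hypothetical placeholders, and only the counts (four cases, five temperatures, two exam conditions, 250 runs) come from the abstract.

```python
from itertools import product

# Placeholder case labels; the abstract does not list all four case names.
CASES = ["case_1", "case_2", "case_3", "case_4"]
TEMPERATURES = [0.0, 0.25, 0.50, 0.75, 1.0]
EXAM_CONDITIONS = [True, False]  # with / without physical examination findings
RUNS_PER_CELL = 250


def build_run_grid():
    """Return one record per planned model submission (hypothetical helper)."""
    return [
        {"case": case, "temperature": temp, "with_exam": exam, "run": run}
        for case, temp, exam, run in product(
            CASES, TEMPERATURES, EXAM_CONDITIONS, range(RUNS_PER_CELL)
        )
    ]


grid = build_run_grid()
print(len(grid))  # 4 * 5 * 2 * 250 = 10000
```

Each record identifies one cell of the design (case, temperature, exam condition) plus a run index, so results can later be grouped per cell.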

Results

At temperature 0.0, GPT-4o achieved 100% leading diagnosis accuracy across all cases with physical exam data. As temperature increased, accuracy declined systematically to 89.4% at temperature 1.0. Diagnostic divergence rose from an average of 4.5 unique diagnoses at temperature 0.0 to 26.25 at temperature 1.0, a 5.8-fold rise (26.25 is 583% of 4.5, which is a 483% increase). Temperature sensitivity varied markedly by case: ascending cholangitis was the most sensitive (accuracy dropping from 100% to 70.4%), while carbon monoxide poisoning maintained 100% accuracy across all settings.
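The two reported quantities, leading diagnosis accuracy and the number of unique diagnoses, can be sketched as follows. This is an illustration under stated assumptions: the abstract does not say how diagnoses were matched to the gold standard, so the case-insensitive string comparison here is an assumption, and the function names and toy outputs are hypothetical.

```python
def leading_accuracy(leading_diagnoses, gold):
    """Fraction of runs whose leading diagnosis equals the gold standard.

    Assumption: matching is exact after trimming and lowercasing.
    """
    target = gold.strip().lower()
    hits = sum(d.strip().lower() == target for d in leading_diagnoses)
    return hits / len(leading_diagnoses)


def unique_diagnosis_count(outputs):
    """Count distinct diagnoses across all runs (each run lists three)."""
    return len({d.strip().lower() for run in outputs for d in run})


def fold_change(baseline, value):
    """Ratio of value to baseline."""
    return value / baseline


def relative_increase(baseline, value):
    """Increase relative to baseline, as a fraction."""
    return (value - baseline) / baseline


# Toy data, invented for illustration only.
toy_outputs = [
    ["Ascending cholangitis", "Cholecystitis", "Pancreatitis"],
    ["cholecystitis", "ascending cholangitis", "Hepatitis"],
]
toy_leading = [run[0] for run in toy_outputs]

print(leading_accuracy(toy_leading, "ascending cholangitis"))  # 0.5
print(unique_diagnosis_count(toy_outputs))  # 4

# Averages reported in the abstract: 4.5 at temperature 0.0, 26.25 at 1.0.
print(round(fold_change(4.5, 26.25), 2))        # 5.83
print(round(relative_increase(4.5, 26.25), 2))  # 4.83
```

With the reported averages, the ratio is about 5.83 and the relative increase is about 4.83, so it matters which of the two a percentage figure refers to.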

Discussion

Higher temperatures introduced concerning diagnostic inconsistency rather than beneficial exploration, with substantial accuracy degradation in temperature-sensitive cases.
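As background for why higher temperature produces more varied outputs, the standard temperature-scaled softmax can be sketched as below. This is the textbook formulation, not a description of GPT-4o's internal decoding, which the abstract does not specify. The logits are invented for illustration, and treating temperature 0 as a greedy (argmax) choice is a common convention assumed here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature.

    Assumption: temperature <= 0 is treated as the greedy limit,
    placing all probability on the highest logit.
    """
    if temperature <= 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Invented logits for three candidate tokens.
example_logits = [2.0, 1.0, 0.5]

for temp in (0.0, 0.25, 1.0):
    probs = softmax_with_temperature(example_logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-ranked option, while temperature 1.0 leaves more mass on the alternatives, which is consistent with the wider spread of diagnoses reported at higher settings.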

Conclusions

Lower temperature settings promote diagnostic accuracy and consistency, making them preferable for clinical applications requiring high reliability. Transparent reporting of temperature settings is essential for reproducible clinical artificial intelligence research.

KEY MESSAGES

What is already known on this topic

Large language models demonstrate promising diagnostic capabilities in medical reasoning tasks, but their non-deterministic nature and sensitivity to parameter settings remain poorly understood in clinical contexts.

What this study adds

The temperature parameter significantly affects both diagnostic accuracy and consistency, with higher settings causing dramatic increases in diagnostic divergence.

How this study might affect research, practice or policy

These findings call for transparent reporting of temperature settings in clinical AI research and suggest that low-temperature configurations should be prioritized for high-reliability medical applications.

Version published to 10.1101/2025.06.04.25328288 on medRxiv
Jun 6, 2025
