Temperature-Driven Variability in Emergency Diagnostic Accuracy by a Leading Language Model
Abstract
Objectives
Large language models (LLMs) are increasingly deployed in healthcare to improve clinical decision support. Yet the “temperature” parameter, which controls the randomness of LLM output, may significantly influence diagnostic performance and is often left at its default value. Understanding this parameter’s effect on automated diagnostic tasks is crucial to the safe application of LLMs in high-risk settings. We evaluated the effect of temperature on diagnostic accuracy and breadth for high-risk conditions using large-scale iterative sampling.
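As a rough illustration of why temperature controls randomness, the sketch below applies the standard temperature-scaled softmax to a set of made-up scores. It is a generic textbook formulation, not a description of how GPT-4o is implemented; the function name and the example scores are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    if temperature == 0.0:
        # Limiting case: all probability mass goes to the highest score(s).
        top = max(logits)
        winners = [1.0 if x == top else 0.0 for x in logits]
        total = sum(winners)
        return [w / total for w in winners]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate outputs (illustrative only).
logits = [2.0, 1.0, 0.0]
for t in (0.0, 0.5, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Under these assumptions, lower temperatures concentrate probability on the top-scoring candidate, while higher temperatures spread it across alternatives.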
Methods
We conducted a simulation-based diagnostic accuracy experiment using four emergent medical cases from a widely adopted, login-protected emergency medicine (EM) curriculum. The cases (myxedema coma, ascending cholangitis, carbon monoxide poisoning, cryptococcal meningitis) were selected for their discrete, unambiguous diagnoses. Each case was presented to GPT-4o in two data states (with and without physical exam) at five distinct temperature settings (0.0 to 1.0), 250 times per combination, resulting in 10,000 LLM diagnostic outputs (4 cases × 2 data states × 5 temperatures × 250 runs). Diagnostic accuracy and breadth were reported across all cases and temperatures.
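The sampling design can be sketched as a simple grid to confirm the total of 10,000 outputs. The five evenly spaced temperature values below are an assumption, since only the range 0.0 to 1.0 is stated; the variable names are illustrative.

```python
from itertools import product

cases = [
    "myxedema coma",
    "ascending cholangitis",
    "carbon monoxide poisoning",
    "cryptococcal meningitis",
]
data_states = ["with physical exam", "without physical exam"]
# Assumption: five evenly spaced values; the source gives only the range.
temperatures = [0.0, 0.25, 0.5, 0.75, 1.0]
runs_per_cell = 250

# One cell per (case, data state, temperature) combination.
cells = list(product(cases, data_states, temperatures))
total_outputs = len(cells) * runs_per_cell
print(len(cells), total_outputs)  # 40 10000
```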
Results
With the physical exam included, GPT-4o achieved perfect diagnostic accuracy at temperature 0.0 for all cases. As temperature increased, overall diagnostic accuracy decreased from 100% to 89.4% (95% confidence interval [CI]: 87.3-91.2%). For ascending cholangitis, diagnostic accuracy fell to 70.4% at the maximum temperature of 1.0 (95% CI: 64.5-75.7%). Across all cases, unique diagnoses rose from 18 (temp 0.0) to 105 (temp 1.0), a 5.8-fold rise (483% increase) in breadth as temperature increased. Physical exam was critical to diagnostic performance in some cases but non-contributory in others.
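The abstract does not say how the confidence intervals were computed. As a hedged check, the sketch below uses the Wilson score interval, which is an assumption, with counts inferred from the reported percentages and design (894 of 1,000 overall at the highest temperature with physical exam; 176 of 250 for ascending cholangitis). Under those assumptions it reproduces the reported bounds to one decimal place.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Counts are inferred, not reported in the source (assumptions):
# 89.4% of 4 cases x 250 runs = 894/1000; 70.4% of 250 runs = 176/250.
for label, k, n in [("overall", 894, 1000), ("ascending cholangitis", 176, 250)]:
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {100 * k / n:.1f}% (95% CI {100 * lo:.1f}-{100 * hi:.1f}%)")
# overall: 89.4% (95% CI 87.3-91.2%)
# ascending cholangitis: 70.4% (95% CI 64.5-75.7%)
```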
Conclusions
Temperature affects LLM diagnostic accuracy and breadth. Lower temperatures yielded accurate diagnoses, while higher temperatures increased diagnostic breadth at the cost of occasional misdiagnosis. Appropriate tuning of temperature is critical for the reliable application of LLMs. These findings are relevant to LLM-based diagnostic decision support tools and underscore the need for transparent reporting of temperature settings.