Exploring Temperature Effects on Large Language Models Across Various Clinical Tasks

Dhavalkumar Patel
Prem Timsina
Ganesh Raut
Robert Freeman
Matthew A Levin
Girish N Nadkarni
Benjamin S Glicksberg
Eyal Klang


Abstract

Large Language Models (LLMs) are becoming integral to healthcare analytics. However, the influence of the temperature hyperparameter, which controls output randomness, remains poorly understood in clinical tasks. This study evaluates the effects of different temperature settings across various clinical tasks. We conducted a retrospective cohort study using electronic health records from the Mount Sinai Health System, collecting a random sample of 1283 patients from January to December 2023. Three LLMs (GPT-4, GPT-3.5, and Llama-3-70b) were tested at five temperature settings (0.2, 0.4, 0.6, 0.8, 1.0) on three tasks: predicting in-hospital mortality (binary classification), predicting length of stay (regression), and assigning medical codes (clinical reasoning). For mortality prediction, all models’ accuracies were generally stable across temperatures. Llama-3 showed the highest accuracy, around 90%, followed by GPT-4 (80-83%) and GPT-3.5 (74-76%). For length-of-stay prediction, all models likewise performed consistently across temperatures. In the medical coding task, performance was also stable across temperatures, with GPT-4 achieving the highest complete-code accuracy at 17%. Our study demonstrates that LLMs maintain consistent accuracy across different temperature settings for varied clinical tasks, challenging the assumption that lower temperatures are necessary for clinical reasoning.
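
To make concrete what the temperature hyperparameter does to output randomness, here is a minimal sketch of temperature-scaled softmax, the standard way temperature is applied to a model's token scores before sampling. It is not the study's code: the function name `softmax_with_temperature` and the logits are hypothetical and chosen only for illustration. The five temperature values are the ones listed in the abstract.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens (illustrative only,
# not taken from any model or from the study).
example_logits = [2.0, 1.0, 0.5]

# The five temperature settings named in the abstract.
temperatures = [0.2, 0.4, 0.6, 0.8, 1.0]

# Probability assigned to the highest-scoring token at each temperature.
top_token_probs = {}
for t in temperatures:
    probs = softmax_with_temperature(example_logits, t)
    top_token_probs[t] = probs[0]
    print(f"T={t:.1f}  " + "  ".join(f"{p:.3f}" for p in probs))
```

Lower temperatures concentrate probability on the highest-scoring token, and higher temperatures spread it across the alternatives. The ranking of the tokens does not change with temperature. That is a property of the scaling itself and does not by itself explain the stable accuracy reported above.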

Version published to 10.1101/2024.07.22.24310824 on medRxiv
Jul 22, 2024

Related articles

Responsible AI for Sepsis Prediction: Bridging the Gap Between Machine Learning Performance and Clinical Trust

This article has 6 authors:
1. Thiago Q. Oliveira
2. Leandro A. Carvalho
3. Flávio R. C. Sousa
4. João B. F. Filho
5. Khalil F. Oliveira
6. Daniel A. B. Tavares
This article has no evaluations. Latest version: Jan 30, 2026

Personalized Disease Risk Prediction from Multimodal Health Data Using Large Language Models

This article has 2 authors:
1. Hanieh Arjmand
2. Alexandre Tomberg
This article has no evaluations. Latest version: Jan 25, 2026

Benchmarking large language models for cardiovascular risk stratification using clinical vignettes

This article has 11 authors:
1. José Ferreira Santos
2. Regina Brito Duarte
3. Inês Mota
4. Rita Carvalheira Santos
5. José Maria Moreira
6. Joana Campos
7. Nuno André Silva
8. Bernardo Neves
9. Ricardo Ladeiras-Lopes
10. Francisca Leite
11. Helder Dores
This article has no evaluations. Latest version: Dec 30, 2025
