Across Generations, Sizes, and Types, Large Language Models Poorly Report Self-Confidence in Gastroenterology Clinical Reasoning Tasks

Abstract

This study evaluated confidence calibration across 48 large language models (LLMs) using 300 gastroenterology board-exam-style questions. Regardless of response accuracy, all models demonstrated poor certainty estimation. Even the best-calibrated systems (o1 preview, GPT-4o, Claude-3.5-Sonnet) showed substantial overconfidence (Brier scores 0.15-0.2, AUROC ~0.6). Most concerning, models maintained high certainty regardless of question difficulty or their actual knowledge limitations. This metacognitive deficiency poses significant challenges for the safe clinical implementation of current LLMs in gastroenterology.
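
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how a Brier score and an AUROC can be computed from a model's self-reported confidence and the correctness of its answers. It is not the authors' evaluation code. The function names and the sample values are hypothetical, and it assumes confidence is expressed on a 0 to 1 scale with correctness coded as 1 (correct) or 0 (incorrect).

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence (0..1) and outcome (0 or 1).

    Lower is better: 0.0 means perfectly confident and always right.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((c - y) ** 2 for c, y in zip(confidences, correct))
    return total / len(confidences)


def auroc(confidences, correct):
    """Probability that a correct answer received higher confidence than an
    incorrect one, counting ties as half (pairwise Mann-Whitney formulation).

    0.5 means confidence carries no information about correctness.
    """
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up illustration only, not data from the study: a model that
    # reports high confidence whether or not its answer is right.
    stated_confidence = [0.9, 0.9, 0.8, 0.9, 0.7, 0.9]
    answer_correct = [1, 0, 1, 0, 1, 1]
    print("Brier score:", brier_score(stated_confidence, answer_correct))
    print("AUROC:", auroc(stated_confidence, answer_correct))
```

Under this reading, an AUROC near 0.5 indicates that stated confidence barely separates correct answers from incorrect ones, which is why a value of about 0.6 is treated as poor discrimination.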
