Across Generations, Sizes, and Types, Large Language Models Poorly Report Self-Confidence in Gastroenterology Clinical Reasoning Tasks

Nariman Naderi
Seyed Amir Ahmad Safavi-Naini
Thomas Savage
Mohammad Amin Khalafi
Peter Lewis
Zahra Atf
Girish Nadkarni
Ali Soroush


Abstract

This study evaluated confidence calibration across 48 large language models (LLMs) using 300 gastroenterology board-exam-style questions. Regardless of response accuracy, all models demonstrated poor certainty estimation. Even the best-calibrated systems (o1 preview, GPT-4o, Claude-3.5-Sonnet) showed substantial overconfidence (Brier scores 0.15-0.2, AUROC ~0.6). Most concerning, models maintained high certainty regardless of question difficulty or their actual knowledge limitations. This metacognitive deficiency poses significant challenges for safe clinical implementation of current LLMs in gastroenterology.
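
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of their standard definitions, not the authors' evaluation code. It assumes each answer comes with a self-reported confidence on a 0-1 scale and a 0/1 correctness label; the function names and the toy data are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence (0-1) and correctness (0 or 1).

    Lower is better; 0.0 means perfectly confident and always right.
    """
    if len(confidences) != len(correct) or len(correct) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def auroc(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Chance that a correct answer carries higher confidence than an incorrect one.

    Ties count as 0.5, so 0.5 overall means confidence does not separate
    right answers from wrong ones.
    """
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: a model that reports 90% confidence on every
# question but answers only half of them correctly.
toy_confidences = [0.9, 0.9, 0.9, 0.9]
toy_correct = [1, 0, 1, 0]

print("Brier score:", brier_score(toy_confidences, toy_correct))
print("AUROC:", auroc(toy_confidences, toy_correct))
```

In this toy case the uniform 90% confidence gives a Brier score of about 0.41 and an AUROC of 0.5, which is the pattern the abstract describes: certainty that stays high whether or not the answer is right. Under these definitions, an AUROC near 0.6 indicates confidence that is only slightly better than chance at separating correct from incorrect answers.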

Version published to 10.21203/rs.3.rs-6725427/v1 on Research Square
Jun 4, 2025

Related articles

Benchmarking Large Language Models for Replication of Guideline-Based PGx Recommendations

This article has 7 authors:
1. Mike Zack
2. Ioan Skobodchikov
3. Danil Stupichev
4. Alex Moore
5. David Sokolov
6. Igor Trifonov
7. Allan Gobbs
This article has no evaluations. Latest version: May 15, 2025

Comparative Evaluation the Knowledge of Large Language Models about Response Evaluation Criteria in Solid Tumors?

This article has 3 authors:
1. Eren Çamur
2. Turay Cesur
3. Yasin Celal Güneş
This article has no evaluations. Latest version: May 7, 2025

Using OpenAI Models for Abstract Screening

This article has 4 authors:
1. Andrew Taylor
2. Josephine Usow
3. Eli Miller
4. Dilay Kalinoglu
This article has no evaluations. Latest version: Jun 20, 2025
