Unmasking True Clinical Competence: The Importance of Adaptive and Open-Ended Evaluation for Large Language Models in Cardiology
Abstract
Background
Large language models (LLMs) achieve impressive accuracy on multiple-choice (MC) medical examinations, but this performance alone may not accurately reflect their clinical reasoning abilities. MC evaluations of LLMs risk inflating apparent competence by rewarding recall rather than genuine clinical adaptability, particularly in specialized medical domains such as cardiology.
Aim
To rigorously evaluate the clinical reasoning, contextual adaptability, and answer-reasoning consistency of a state-of-the-art reasoning-based LLM in cardiology using MC, open-ended, and clinically modified question formats.
Methods
We assessed GPT o1-preview using 185 board-style cardiology questions from the American College of Cardiology Self-Assessment Program (ACCSAP) in MC and open-ended formats. A subset of 66 questions underwent modifications of critical clinical parameters (e.g., ascending aorta diameter, ejection fraction) to evaluate model adaptability to context changes. The model’s answer and reasoning correctness were graded by cardiology experts. Statistical differences were analyzed using exact McNemar tests.
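The exact McNemar test used here compares paired correct/incorrect outcomes on the same questions under two conditions, using only the discordant pairs. A minimal sketch of that computation (not the authors' code; the counts in the usage note are hypothetical):

```python
from math import comb

def exact_mcnemar_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: questions correct in one format but not the other;
    c: the reverse. Under H0 each discordant pair is a fair
    coin, so the test reduces to an exact binomial test with
    p = 0.5 on n = b + c trials.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Double the smaller binomial tail P(X <= k); cap at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)
```

For example, with hypothetical counts of 5 questions gained and 15 lost between formats, `exact_mcnemar_p(5, 15)` gives p ≈ 0.041. The same result is available from `statsmodels.stats.contingency_tables.mcnemar` with `exact=True`.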
Results
GPT o1-preview demonstrated high baseline accuracy on MC questions (93.0% answers, 92.4% reasoning). Performance decreased significantly with open-ended questions (80.0% answers, 80.5% reasoning; p<0.001). For modified MC questions, accuracy decreased significantly (answers: 93.9% to 66.7%; reasoning: 93.9% to 71.2%; both p<0.001), as did answer-reasoning concordance (93.9% to 66.7%, p<0.001).
Conclusions
Using existing MC question formats substantially overestimates the performance and clinical reasoning capabilities of GPT o1-preview. Incorporating open-ended, clinically adaptive questions and evaluating answer-reasoning concordance are essential for accurately assessing the real-world clinical decision-making competencies of LLMs in cardiology.