Diagnostic Performance of Expert Physicians Versus General-Purpose Artificial Intelligence Using Standardized Static Coronary CT Images: A Dual-Reference Validation
Abstract
Background: Coronary CT angiography (CCTA) is a first-line diagnostic modality for coronary artery disease (CAD), yet its interpretation requires substantial expert experience. Although general-purpose multimodal artificial intelligence (GP-AI) models have shown promise in text-based medical tasks, their visual diagnostic performance on complex CCTA data remains poorly defined.

Methods: This single-center retrospective study included 63 patients (252 vessel-based image sets) who underwent both CCTA and invasive coronary angiography. An expert physician consensus and four frontier GP-AI models (GPT-4o, Gemini 2.5, Claude 3.5 Sonnet, and Grok 4) evaluated identical standardized static images in a zero-shot setting with default generation parameters. Obstructive disease was defined as ≥50% luminal stenosis. Diagnostic performance was validated against expert consensus for plaque characterization and against quantitative coronary angiography (QCA) for stenosis severity.

Results: Expert consensus demonstrated robust agreement with QCA across all coronary territories (kappa = 0.774–0.933, p < 0.001). In contrast, the GP-AI models showed a marked performance gap: none achieved statistically significant agreement with QCA in the prognostically critical left anterior descending artery (LAD) or left main coronary artery (LMCA) (p > 0.05). Although Gemini 2.5 showed a moderate correlation in the right coronary artery (ICC = 0.515), continuous stenosis assessment and plaque characterization remained uniformly limited and clinically unreliable across all models.

Conclusion: Expert physician interpretation remains the reference standard for CCTA. Current frontier GP-AI models are not suitable for independent clinical interpretation of coronary imaging, particularly in anatomically complex segments.
These findings emphasize that general visual reasoning cannot yet replace domain-specific cardiovascular AI solutions or expert clinical judgment in specialized radiological tasks.
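For readers unfamiliar with the agreement statistic reported in the Results, Cohen's kappa corrects observed rater agreement for the agreement expected by chance. The sketch below is purely illustrative: the ratings are hypothetical binary calls (1 = obstructive, i.e. ≥50% stenosis) and are not data from this study.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal category frequencies.
    """
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the two raters' marginal distributions
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical vessel-level calls (NOT study data): 1 = obstructive
expert = [1, 0, 1, 1, 0, 0, 1, 0]
qca    = [1, 0, 1, 0, 0, 0, 1, 0]
print(cohen_kappa(expert, qca))  # prints 0.75
```

Here seven of eight calls agree (p_o = 0.875) while chance agreement is 0.5, giving kappa = 0.75; values in the 0.774–0.933 range reported for the experts indicate substantial-to-almost-perfect agreement on common interpretive scales.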