Cardiology Knowledge Assessment of Retrieval-Augmented Open versus Proprietary Large Language Models

Abstract

Objectives

To evaluate the performance of open and proprietary large language models (LLMs), with and without Retrieval-Augmented Generation (RAG), on cardiology board-style questions and to benchmark them against the human average.

Materials and Methods

We tested 14 LLMs (6 open-weight, 8 proprietary) on 449 multiple-choice questions from the American College of Cardiology Self-Assessment Program (ACCSAP). Accuracy was measured as percent correct. RAG was implemented using a knowledge base of 123 guideline and textbook documents.
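The abstract does not describe the evaluation pipeline in detail; as a rough illustration only, the sketch below shows one common way such a RAG-based multiple-choice evaluation can be wired together. The embed() and generate() callables, the question dictionary layout, and the top-k retrieval step are assumptions made for this sketch rather than the authors' actual implementation; accuracy is computed as percent correct, as stated above.

```python
# Minimal sketch of a RAG-style multiple-choice evaluation loop.
# embed(), generate(), and the question/document structures are
# hypothetical stand-ins, not the authors' actual pipeline.
import numpy as np

def retrieve(question_text, doc_texts, doc_embeddings, embed, k=5):
    """Return the k knowledge-base passages most similar to the question."""
    q = embed(question_text)                  # (d,) query embedding
    sims = doc_embeddings @ q                 # cosine similarity if vectors are normalized
    top = np.argsort(sims)[::-1][:k]
    return [doc_texts[i] for i in top]

def answer_with_rag(question, doc_texts, doc_embeddings, embed, generate):
    """Prepend retrieved guideline/textbook passages to the prompt, then ask the model."""
    context = "\n\n".join(retrieve(question["stem"], doc_texts, doc_embeddings, embed))
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question['stem']}\n"
        f"Options: {question['options']}\n"
        "Answer with the single best option letter."
    )
    return generate(prompt).strip()

def accuracy(questions, doc_texts, doc_embeddings, embed, generate):
    """Accuracy reported as percent of questions answered correctly."""
    correct = sum(
        answer_with_rag(q, doc_texts, doc_embeddings, embed, generate) == q["answer"]
        for q in questions
    )
    return 100.0 * correct / len(questions)
```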

Results

The open-weight model DeepSeek R1 achieved the highest accuracy at 86.7% (95% CI: 83.7–89.9%), outperforming proprietary models and the human average of 78%. GPT 4o (80.8%, 95% CI: 77.2–84.5%) and the commercial platform OpenEvidence (80.4%, 95% CI: 76.7–84.0%) demonstrated similar performance. A positive correlation between model size and performance was observed within model families, but across families, substantial variability persisted among models with similar parameter counts. After RAG, all models improved, and open-weight models like Mistral Large 2 (78.0%, 95% CI: 74.1–81.8%) performed comparably to proprietary alternatives like GPT 4o.
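The abstract does not state how the 95% confidence intervals were obtained; purely as an illustration under that caveat, the sketch below computes a nonparametric bootstrap CI for the top accuracy figure, reconstructing the per-question correctness vector from the reported 86.7% on 449 questions.

```python
# Illustrative 95% CI for an accuracy proportion via nonparametric bootstrap.
# The correctness vector is reconstructed from the reported 86.7% on 449
# questions; the authors' actual CI method is not stated in the abstract.
import numpy as np

rng = np.random.default_rng(0)
n_questions = 449
n_correct = round(0.867 * n_questions)            # ~389 correct answers
outcomes = np.array([1] * n_correct + [0] * (n_questions - n_correct))

boot_acc = np.array([
    rng.choice(outcomes, size=n_questions, replace=True).mean()
    for _ in range(10_000)
])
low, high = np.percentile(boot_acc, [2.5, 97.5])
print(f"accuracy: {outcomes.mean():.1%}, 95% CI: {low:.1%}-{high:.1%}")
# -> roughly 86.6%, CI ~83.5%-89.8%, consistent with the reported interval
```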

Discussion

LLMs are increasingly integrated into clinical workflows, yet their performance in cardiovascular medicine remains insufficiently evaluated. Open-weight models can match or exceed proprietary systems in cardiovascular knowledge, with RAG particularly beneficial for smaller models. Given their transparency, configurability, and potential for local deployment, strategically augmented open-weight models represent viable, lower-cost alternatives for clinical applications.

Conclusion

Open-weight LLMs demonstrate competency in cardiovascular medicine comparable to or exceeding that of proprietary models, both with and without RAG, depending on the model.

Author Summary

In this work, we set out to understand how today’s artificial intelligence systems perform when tested on the kind of questions cardiologists face during board examinations. We compared a wide range of large language models, including both freely available “open” models and commercial “proprietary” ones, and also tested whether giving the models access to trusted cardiology textbooks and guidelines could improve their answers. We found that the best open model actually outperformed all of the commercial models we tested, even exceeding the average score of practicing cardiologists. When we gave the models access to medical reference material, nearly all of them improved, with the biggest gains seen in the smaller and weaker models. This shows that careful design and support can allow smaller, more accessible systems to reach high levels of accuracy. Our results suggest that open models, which can be used locally without sending sensitive patient information to outside servers, may be a safe and cost-effective alternative to commercial products. This matters because it could make powerful AI tools more widely available across hospitals and clinics, while also reducing risks related to privacy, transparency, and cost.
