Novel Insights into the Application of Large Language Models in the Diagnosis and Treatment of Complex Cardiovascular Diseases: A Comparative Study

Abstract

Background The rapid evolution of large language models (LLMs) in the medical field, particularly in automating medical tasks and supporting diagnosis and treatment, has shown promising potential. However, their accuracy, comprehensiveness, and safety in managing complex cardiovascular diseases have not been systematically assessed.
Objective This study aims to evaluate and compare the diagnostic and therapeutic performance of two prominent LLMs, GPT-4.0 and Kimi, in managing complex cardiovascular diseases, and to assess their safety, providing valuable insights for their future clinical application.
Methods A total of 200 case reports from the Journal of the American College of Cardiology (JACC), published between January 2020 and August 2024, were analyzed. Standardized extraction forms were used to collect case information. GPT-4.0 and Kimi were both prompted with identical queries to generate diagnostic and treatment plans, covering diagnosis, treatment recommendations, and long-term management strategies. Three independent cardiovascular specialists evaluated the outputs on accuracy and comprehensiveness using a Likert scale, while a risk matrix scoring system was employed for safety assessment. Statistical analyses were conducted using the paired Mann-Whitney U test.
Results In terms of preliminary diagnosis, the accuracy rates of GPT-4.0 and Kimi were 96.0% and 93.5%, respectively (P = 0.66), but GPT-4.0 demonstrated superior comprehensiveness (96.5% vs. 91.0%, P < 0.001). For treatment recommendations, GPT-4.0 outperformed Kimi in both accuracy (97.0% vs. 94.0%, P < 0.05) and comprehensiveness (98.0% vs. 91.5%, P < 0.001). Regarding long-term management, GPT-4.0 also exhibited superior performance (95.5% vs. 92.0%, P < 0.001). Safety assessment revealed that 93.5% of GPT-4.0’s recommendations were free of potential harm, compared to 85.5% for Kimi, with high-risk cases accounting for 1.5% and 4.5%, respectively.
Conclusions LLMs, particularly GPT-4.0, exhibit significant promise in the diagnosis and treatment of complex cardiovascular diseases, showing superior accuracy, comprehensiveness, and safety compared to Kimi. Despite their high accuracy and safety, LLMs still require clinician oversight, especially in the formulation of personalized treatment plans and complex decision-making scenarios, to ensure their reliable integration into clinical practice.
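
To put the reported percentages in concrete terms, the short sketch below converts a few of them into implied case counts. It assumes every percentage is taken over the full set of 200 case reports, which the abstract suggests but does not state explicitly; the helper name `implied_count` and the dictionary layout are illustrative only and do not come from the study.

```python
# Assumption: each reported percentage uses all 200 case reports as its denominator.
N_CASES = 200

# Percentages as reported in the Results, in the order (GPT-4.0, Kimi).
reported = {
    "diagnosis accuracy": (96.0, 93.5),
    "treatment accuracy": (97.0, 94.0),
    "free of potential harm": (93.5, 85.5),
    "high-risk": (1.5, 4.5),
}


def implied_count(percent, n=N_CASES):
    """Return the number of cases implied by a percentage of n cases."""
    return round(percent * n / 100)


for label, (gpt4, kimi) in reported.items():
    print(f"{label}: GPT-4.0 {implied_count(gpt4)}/{N_CASES}, "
          f"Kimi {implied_count(kimi)}/{N_CASES}")
```

Under that assumption, the high-risk rates of 1.5% and 4.5% correspond to 3 and 9 cases, respectively, and the gap in diagnostic accuracy (96.0% vs. 93.5%) corresponds to 5 cases.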
