Benchmarking large language models for cardiovascular risk stratification using clinical vignettes
Abstract
Large language models (LLMs) show promise for cardiovascular risk stratification, but their performance against clinical guidelines requires validation. We benchmarked eleven contemporary LLMs on 30 bilingual (Portuguese/English) outpatient vignettes, comparing their classifications against expert-adjudicated European Society of Cardiology (ESC) guideline categories derived from SCORE2. Models achieved near-perfect extraction of traditional risk factors (micro-F1 0.97–0.99) but only moderate agreement on the three-class ESC risk categories (best weighted kappa 0.69, 95% CI 0.44–0.84). Ten of eleven models systematically underestimated risk. LLMs struggled with numeric SCORE2 computation, with mean absolute error exceeding 5 percentage points in all but one model. Most models correctly identified guideline exceptions requiring alternative assessment beyond SCORE2 in more than 95% of cases. No significant performance differences were found between languages. While LLMs excel at structured data extraction and eligibility screening, their inconsistent risk stratification and poor numeric accuracy preclude autonomous clinical use and warrant further refinement.
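For readers unfamiliar with the agreement statistic reported above, the following is a minimal, purely illustrative sketch of weighted Cohen's kappa with linear weights, applied to hypothetical three-class risk labels (not the study's data or code):

```python
# Illustrative sketch: linear-weighted Cohen's kappa for ordinal agreement.
# Labels and data below are hypothetical, not taken from the study.

def linear_weighted_kappa(ref, pred, k):
    """Weighted kappa with linear weights w[i][j] = |i - j| / (k - 1)."""
    n = len(ref)
    # Observed weighted disagreement between reference and predicted labels
    obs = sum(abs(r - p) / (k - 1) for r, p in zip(ref, pred))
    # Expected (chance) weighted disagreement from the marginal distributions
    ref_counts = [ref.count(c) for c in range(k)]
    pred_counts = [pred.count(c) for c in range(k)]
    exp = sum(
        abs(i - j) / (k - 1) * ref_counts[i] * pred_counts[j] / n
        for i in range(k) for j in range(k)
    )
    return 1 - obs / exp

# Hypothetical adjudicated vs. model labels
# (0 = low/moderate, 1 = high, 2 = very high risk)
reference = [0, 0, 1, 1, 2, 2]
model     = [0, 0, 1, 0, 2, 1]
print(round(linear_weighted_kappa(reference, model, k=3), 3))  # → 0.625
```

Linear weights penalize a two-category miss (e.g. very high rated as low/moderate) twice as heavily as an adjacent-category miss, which is why weighted kappa is a common choice for ordinal risk scales like the three-class ESC categories.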