Development of an LLM Pipeline Surpassing Physicians in Cardiovascular Risk Score Calculation

Abstract

Background

Risk scores are essential to evidence-based cardiovascular care, but manual calculation is labor-intensive and error-prone. Large language models (LLMs) could automate this process, yet LLMs are limited by their propensity for calculation errors and factual hallucinations. Pipelines that separate LLM-based data extraction from deterministic score computation may improve reliability and transparency.
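The separation described above can be illustrated with a minimal sketch of the deterministic half of such a pipeline. It assumes the LLM has already extracted a set of structured fields from the clinical report; the `ExtractedFields` class and its field names are hypothetical and are not taken from the study. The point values follow the standard published CHA₂DS₂-VASc definition.

```python
from dataclasses import dataclass


@dataclass
class ExtractedFields:
    """Hypothetical structured output of the LLM extraction step."""
    age: int
    female: bool
    heart_failure: bool
    hypertension: bool
    diabetes: bool
    stroke_or_tia: bool      # prior stroke, TIA, or thromboembolism
    vascular_disease: bool


def cha2ds2_vasc(f: ExtractedFields) -> int:
    """Compute CHA2DS2-VASc with plain arithmetic, no LLM involved."""
    score = 0
    score += 1 if f.heart_failure else 0
    score += 1 if f.hypertension else 0
    # Age contributes 2 points at 75 or older, 1 point for 65 to 74.
    if f.age >= 75:
        score += 2
    elif f.age >= 65:
        score += 1
    score += 1 if f.diabetes else 0
    score += 2 if f.stroke_or_tia else 0
    score += 1 if f.vascular_disease else 0
    score += 1 if f.female else 0
    return score


# Illustrative patient, not study data.
example = ExtractedFields(
    age=78, female=True, heart_failure=False, hypertension=True,
    diabetes=False, stroke_or_tia=True, vascular_disease=False,
)
print(cha2ds2_vasc(example))
```

Because the arithmetic is ordinary code, any error in the final score can be traced back to a specific extracted field rather than to an opaque model calculation.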

Methods

We conducted a retrospective diagnostic study at a quaternary heart center in Germany (January 2020 – July 2023). Patients with atrial fibrillation (n=179) from an ablation registry and patients with severe aortic stenosis (n=76) evaluated by a heart team were included. Five LLMs (DeepSeek-R1, Qwen3, GPT-4 Turbo, Llama 3.1, and PaLM 2) were tested in standalone and pipeline configurations to compute HAS-BLED, CHA₂DS₂-VASc, and EuroSCORE II scores from routine clinical reports. Accuracy was assessed by comparing predictions to expert-adjudicated ground truth, using root mean squared error (RMSE), Krippendorff’s alpha for categorical agreement, and calibration analysis.
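As a brief illustration of the RMSE metric named above, the following sketch compares predicted scores against reference scores. The function name and the sample values are illustrative assumptions and do not reproduce the study's data or analysis code.

```python
import math


def rmse(predicted, reference):
    """Root mean squared error between two equal-length score lists."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up scores for four patients, for illustration only.
predicted_scores = [3, 2, 5, 1]
expert_scores = [3, 3, 4, 1]
print(round(rmse(predicted_scores, expert_scores), 3))
```

A lower RMSE means the predicted scores lie closer, on average, to the expert-adjudicated values.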

Results

Pipeline-generated scores showed substantially higher agreement with expert adjudication than either standalone LLMs or treating clinicians (mean Krippendorff’s alpha: 0.79 vs 0.32 vs 0.31, respectively) and demonstrated superior calibration. The Qwen3-based pipeline achieved the highest accuracy, with lower RMSEs than clinicians for HAS-BLED (0.20 vs 0.87), CHA₂DS₂-VASc (0.53 vs 1.08), and EuroSCORE II (1.99 vs 2.05).

Conclusion

LLM-based pipelines enable accurate, well-calibrated, and scalable cardiovascular risk score computation from unstructured real-world clinical data. They outperform both clinicians and standalone LLMs and have the potential to reduce clinician workload and support evidence-based care.
