Development of an LLM Pipeline Surpassing Physicians in Cardiovascular Risk Score Calculation
Abstract
Background
Risk scores are essential to evidence-based cardiovascular care, but manual calculation is labor-intensive and error-prone. Large language models (LLMs) could automate this process, yet they are prone to calculation errors and factual hallucinations. Pipelines that separate LLM-based data extraction from deterministic score computation may improve reliability and transparency.
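To illustrate the separation described above, the following is a minimal sketch of the deterministic half of such a pipeline for CHA₂DS₂-VASc. It is not the authors' implementation: the `ExtractedFindings` record and its field names are hypothetical stand-ins for whatever structured output the LLM extraction step produces, and the extraction step itself is not shown. Only the point weights follow the published CHA₂DS₂-VASc definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedFindings:
    """Hypothetical structured output of the LLM extraction step."""
    age: int
    female: bool
    heart_failure: bool
    hypertension: bool
    diabetes: bool
    stroke_tia_thromboembolism: bool
    vascular_disease: bool


def cha2ds2_vasc(f: ExtractedFindings) -> int:
    """Compute CHA2DS2-VASc with plain arithmetic; no LLM is involved here."""
    score = 0
    score += 1 if f.heart_failure else 0                # C: congestive heart failure
    score += 1 if f.hypertension else 0                 # H: hypertension
    score += 1 if f.diabetes else 0                     # D: diabetes mellitus
    score += 2 if f.stroke_tia_thromboembolism else 0   # S2: prior stroke/TIA/thromboembolism
    score += 1 if f.vascular_disease else 0             # V: vascular disease
    score += 1 if f.female else 0                       # Sc: sex category (female)
    # Age contributes 2 points at 75 or older, 1 point for 65 to 74.
    if f.age >= 75:
        score += 2
    elif f.age >= 65:
        score += 1
    return score


# Made-up example patient, for illustration only.
patient = ExtractedFindings(
    age=78,
    female=True,
    heart_failure=False,
    hypertension=True,
    diabetes=False,
    stroke_tia_thromboembolism=True,
    vascular_disease=False,
)
print(cha2ds2_vasc(patient))  # 6
```

Because the scoring function is ordinary code, every point in the total can be traced back to a specific extracted field, which is where the transparency benefit of the pipeline design would come from.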
Methods
We conducted a retrospective diagnostic study at a quaternary heart center in Germany (January 2020 – July 2023). Patients with atrial fibrillation (n=179) from an ablation registry and patients with severe aortic stenosis (n=76) evaluated by a heart team were included. Five LLMs (DeepSeek-R1, Qwen3, GPT-4 Turbo, Llama 3.1, and PaLM 2) were tested in standalone and pipeline configurations to compute HAS-BLED, CHA₂DS₂-VASc, and EuroSCORE II scores from routine clinical reports. Accuracy was assessed by comparing predictions to expert-adjudicated ground truth, using root mean squared error (RMSE), Krippendorff’s alpha for categorical agreement, and calibration analysis.
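As a reminder of how the RMSE metric mentioned above is defined, here is a minimal sketch that compares predicted scores against reference scores. The function name `rmse` and the example values are illustrative assumptions, not data or code from the study.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean squared error between predicted and reference scores."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one pair of scores is required")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up scores for illustration: one of four predictions is off by 2 points.
example_predicted = [2, 3, 5, 1]
example_reference = [2, 3, 3, 1]
print(rmse(example_predicted, example_reference))  # 1.0
```

Because errors are squared before averaging, a single large miss raises RMSE more than several small ones, so the metric penalizes occasional gross miscalculations.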
Results
Pipeline-generated scores showed substantially higher agreement with expert adjudication than those of standalone LLMs and treating clinicians (mean Krippendorff’s alpha: 0.79 vs 0.32 vs 0.31) and demonstrated superior calibration. The Qwen3-based pipeline achieved the highest accuracy, with lower RMSEs than clinicians for HAS-BLED (0.20 vs 0.87), CHA₂DS₂-VASc (0.53 vs 1.08), and EuroSCORE II (1.99 vs 2.05).
Conclusion
LLM-based pipelines enable accurate, well-calibrated, and scalable cardiovascular risk score computation from unstructured real-world clinical data. They outperform both clinicians and standalone LLMs and have the potential to reduce clinician workload and support evidence-based care.