Diagnostic Performance and Cost-Efficiency of Large Language Models in Secondary Hypertension: A Blinded Comparative Study
Abstract
Background/Objectives: Secondary hypertension requires complex diagnostic reasoning and guideline-based management, posing challenges for artificial intelligence–based clinical decision-support systems. This study aimed to comparatively evaluate the performance of three large language models (LLMs) in diagnostic reasoning, clinical management, follow-up planning, and patient-oriented communication related to secondary hypertension.

Methods: In this cross-sectional blinded study, three LLMs (ChatGPT-5.2, Claude Sonnet 4.6, and Gemini 3.0 Pro) were evaluated using 10 expert-developed clinical case vignettes representing major etiologies of secondary hypertension. Model outputs were anonymized and independently assessed by three senior clinicians (two endocrinologists and one cardiologist) using a 7-point Likert scale across five domains: (1) accuracy and hallucination control, (2) quality and comprehensiveness, (3) reliability and clinical guidance, (4) cost-efficiency, and (5) clinical usability. Group differences were analyzed using Kruskal–Wallis tests with Bonferroni-corrected pairwise comparisons. Inter-rater agreement was evaluated using two-way mixed-effects intraclass correlation coefficients with absolute agreement.

Results: A total of 90 blinded expert ratings were analyzed. Claude Sonnet 4.6 achieved the highest composite performance score (6.63 ± 0.45), followed by ChatGPT-5.2 (5.82 ± 0.55) and Gemini 3.0 Pro (5.27 ± 0.89) (H = 40.055, p < 0.001). Claude Sonnet 4.6 significantly outperformed both other models across all evaluation domains. ChatGPT-5.2 demonstrated intermediate performance and significantly exceeded Gemini 3.0 Pro in reliability and clinical usability. Performance differences were most pronounced in domains requiring complex clinical reasoning, whereas cost-efficiency scores were relatively comparable among models. Claude Sonnet 4.6 ranked first in nine of ten clinical vignettes. Inter-rater agreement analysis demonstrated consistent ranking patterns among evaluators.

Conclusions: Large language models exhibit heterogeneous performance in secondary hypertension–related clinical tasks. Although advanced models show promising capabilities as clinical decision-support tools, performance remains model-dependent, particularly in complex endocrine–metabolic scenarios. Domain-specific validation and prospective clinical studies are required before routine clinical implementation.