Diagnostic Performance and Cost-Efficiency of Large Language Models in Secondary Hypertension: A Blinded Comparative Study


Abstract

Background/Objectives: Secondary hypertension requires complex diagnostic reasoning and guideline-based management, posing challenges for artificial intelligence–based clinical decision-support systems. This study aimed to comparatively evaluate the performance of three large language models (LLMs) in diagnostic reasoning, clinical management, follow-up planning, and patient-oriented communication related to secondary hypertension. Methods: In this cross-sectional blinded study, three LLMs (ChatGPT-5.2, Claude Sonnet 4.6, and Gemini 3.0 Pro) were evaluated using 10 expert-developed clinical case vignettes representing major etiologies of secondary hypertension. Model outputs were anonymized and independently assessed by three senior clinicians (two endocrinologists and one cardiologist) using a 7-point Likert scale across five domains: (1) accuracy and hallucination control, (2) quality and comprehensiveness, (3) reliability and clinical guidance, (4) cost-efficiency, and (5) clinical usability. Group differences were analyzed using Kruskal–Wallis tests with Bonferroni-corrected pairwise comparisons. Inter-rater agreement was evaluated using two-way mixed-effects intraclass correlation coefficients with absolute agreement. Results: A total of 90 blinded expert ratings were analyzed. Claude Sonnet 4.6 achieved the highest composite performance score (6.63 ± 0.45), followed by ChatGPT-5.2 (5.82 ± 0.55) and Gemini 3.0 Pro (5.27 ± 0.89) (H = 40.055, p < 0.001). Claude Sonnet 4.6 significantly outperformed both other models across all evaluation domains. ChatGPT-5.2 demonstrated intermediate performance and significantly exceeded Gemini 3.0 Pro in reliability and clinical usability. Performance differences were most pronounced in domains requiring complex clinical reasoning, whereas cost-efficiency scores were relatively comparable among models. Claude Sonnet 4.6 ranked first in nine of ten clinical vignettes. Inter-rater agreement demonstrated consistent ranking patterns among evaluators. Conclusions: Large language models exhibit heterogeneous performance in secondary hypertension–related clinical tasks. Although advanced models show promising capabilities as clinical decision-support tools, performance remains model-dependent, particularly in complex endocrine–metabolic scenarios. Domain-specific validation and prospective clinical studies are required before routine clinical implementation.
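The statistical workflow described in the Methods (a Kruskal–Wallis omnibus test across the three models, followed by Bonferroni-corrected pairwise comparisons) can be sketched as below. The ratings here are simulated for illustration only; they are not the study's data, and the group means/SDs are merely seeded to resemble the reported composite scores.

```python
# Sketch of the abstract's analysis plan: Kruskal-Wallis across three
# rating groups, then Bonferroni-corrected pairwise Mann-Whitney U tests.
# All data below are simulated placeholders, not the study's ratings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated 7-point Likert composite scores (30 ratings per model),
# loosely matching the reported means and SDs.
claude = rng.normal(6.6, 0.45, 30).clip(1, 7)
chatgpt = rng.normal(5.8, 0.55, 30).clip(1, 7)
gemini = rng.normal(5.3, 0.89, 30).clip(1, 7)

# Omnibus nonparametric test for any group difference
h, p = stats.kruskal(claude, chatgpt, gemini)
print(f"Kruskal-Wallis: H = {h:.3f}, p = {p:.3g}")

# Pairwise comparisons with Bonferroni correction (3 comparisons)
pairs = [
    ("Claude vs ChatGPT", claude, chatgpt),
    ("Claude vs Gemini", claude, gemini),
    ("ChatGPT vs Gemini", chatgpt, gemini),
]
alpha_corrected = 0.05 / len(pairs)
for label, a, b in pairs:
    u, p_pair = stats.mannwhitneyu(a, b, alternative="two-sided")
    flag = "significant" if p_pair < alpha_corrected else "n.s."
    print(f"{label}: U = {u:.1f}, p = {p_pair:.3g} ({flag} at "
          f"alpha = {alpha_corrected:.4f})")
```

Note that the abstract does not specify which pairwise test was used after the Kruskal–Wallis; Mann–Whitney U with Bonferroni correction is one conventional choice and is assumed here.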
