Boundary-Specific Failure Modes and Safety Trade-offs of Large Language Models in Chronic Kidney Disease Renoprotective Therapy Review: A Stratified Synthetic Benchmark

Abstract

Background

Renoprotective therapies, namely SGLT2 inhibitors, finerenone, and renin-angiotensin system inhibitors (RASi), remain underutilised in chronic kidney disease (CKD). Large language models (LLMs) may detect therapy omissions, but their performance across CKD severity strata and at clinical decision boundaries has not been evaluated.

Methods

We constructed 100 synthetic CKD vignettes (G3a-G5D; 75 with prespecified omissions, 25 decoys) and submitted each vignette to four LLMs three times at temperature 0 (1,200 calls in total). Omission criteria were adapted from KDIGO 2024 and included an investigator-defined gray-zone RASi initiation criterion at eGFR<15 mL/min/1.73 m². Two nephrologists independently classified a stratified 20-case subset.
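The call count follows directly from the design: 100 vignettes × 4 models × 3 repeats = 1,200. A minimal sketch of that query grid is given below. The function and field names are hypothetical, since the abstract reports only the counts and the temperature setting, and no model API is called.

```python
from itertools import product

# Counts taken from the Methods; all identifiers here are illustrative assumptions.
N_VIGNETTES = 100  # 75 with prespecified omissions + 25 decoys
N_MODELS = 4
N_REPEATS = 3
TEMPERATURE = 0


def build_call_grid(n_vignettes=N_VIGNETTES, n_models=N_MODELS, n_repeats=N_REPEATS):
    """Enumerate every (vignette, model, repeat) query implied by the design."""
    return [
        {"vignette": v, "model": m, "repeat": r, "temperature": TEMPERATURE}
        for v, m, r in product(range(n_vignettes), range(n_models), range(n_repeats))
    ]


calls = build_call_grid()
print(len(calls))  # 1200
```

Enumerating the grid up front makes it easy to check that every vignette was queried the same number of times per model before any metric is computed.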

Results

For SGLT2 inhibitor and finerenone omissions, all models achieved near-ceiling sensitivity (97-100%). For RASi, performance diverged at the eGFR<15 boundary, where sensitivity was 85% for Grok 4.1 Fast versus 55% for GPT-5.4, 10% for Gemini, and 10% for DeepSeek R1. Gap-detection inter-rater agreement was perfect (kappa = 1.000). Clinically incorrect reasoning rates ranged from 0% (GPT-5.4) to 27% (DeepSeek R1); of 52 instances, 31 were factual pharmacology errors and 21 reflected conservative boundary-discordant reasoning. Reproducibility (Jaccard) ranged from 0.74 to 0.93.
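The abstract does not state how the Jaccard reproducibility score was aggregated over the three repeated calls. One plausible reading, shown here purely as an assumption, is the mean pairwise Jaccard index over the sets of omissions flagged in each run, alongside a stricter all-runs-identical check. The drug labels and helper names are illustrative.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| for two sets of flagged omissions."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two runs that both flag nothing count as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard index over all pairs of repeated runs for one vignette."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def full_agreement(runs):
    """True only if every repeated run flagged exactly the same set."""
    first = set(runs[0])
    return all(set(r) == first for r in runs[1:])


# Toy example (not study data): the third run drops the RASi flag.
runs = [{"SGLT2i", "RASi"}, {"SGLT2i", "RASi"}, {"SGLT2i"}]
print(round(mean_pairwise_jaccard(runs), 3))  # 0.667
print(full_agreement(runs))  # False
```

The two measures answer different questions: the Jaccard score gives partial credit for overlapping outputs, whereas full agreement is all-or-nothing, which would explain why a full-agreement percentage can sit below the corresponding Jaccard value.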

Conclusions

This boundary-aware synthetic benchmark showed that aggregate sensitivity can conceal clinically important operational-rule discordance. Rule-based SGLT2 inhibitor and finerenone omissions were detected with near-ceiling sensitivity, whereas an investigator-defined gray-zone RASi criterion at eGFR<15 exposed model-specific boundary behaviour. Evaluation of LLM-based CKD decision support should report boundary-specific performance, reproducibility, and clinically incorrect reasoning alongside aggregate metrics.
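The point about aggregate sensitivity masking boundary behaviour can be illustrated with a small stratified calculation. The sketch below uses toy counts chosen only for illustration, not the study's data, and the stratum labels and function name are assumptions.

```python
from collections import defaultdict


def sensitivity_by_stratum(records):
    """Compute per-stratum and pooled sensitivity.

    records: iterable of (stratum, detected) pairs, one per true omission.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for stratum, detected in records:
        totals[stratum] += 1
        hits[stratum] += int(detected)
    per_stratum = {s: hits[s] / totals[s] for s in totals}
    pooled = sum(hits.values()) / sum(totals.values())
    return per_stratum, pooled


# Toy data: 40 rule-based omissions all detected, 10 boundary omissions with 1 detected.
records = (
    [("rule_based", True)] * 40
    + [("boundary", True)] * 1
    + [("boundary", False)] * 9
)
per_stratum, pooled = sensitivity_by_stratum(records)
print(per_stratum)  # {'rule_based': 1.0, 'boundary': 0.1}
print(pooled)  # 0.82
```

In this toy case the pooled figure of 0.82 looks acceptable while the boundary stratum sits at 0.10, which is the kind of discordance that only stratified reporting reveals.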

Key Learning Points

What was known

  • SGLT2 inhibitors, finerenone, and renin-angiotensin system inhibitors reduce kidney and cardiovascular risk in eligible CKD patients but remain underused.

  • Large language models are being considered for clinical decision support, but most benchmarks report aggregate accuracy rather than boundary-specific safety.

This study adds

  • A boundary-aware synthetic benchmark was constructed to evaluate CKD renoprotective-therapy omission detection across clear rule-based indications, an investigator-defined eGFR<15 gray-zone RASi criterion, decoys, reproducibility, and clinically incorrect reasoning.

  • All four evaluated LLMs detected SGLT2 inhibitor and finerenone omissions with near-ceiling sensitivity, but RASi detection diverged sharply at the eGFR<15 boundary, revealing a model-specific conservative non-initiation pattern hidden by aggregate results.

  • Safety profiles differed across models: reproducibility ranged from 60% to 89% full agreement, and clinically incorrect reasoning rates ranged from 0% to 27%.

Potential impact

  • LLM evaluation for nephrology decision support should report boundary performance, reproducibility, and clinically incorrect reasoning rates alongside aggregate sensitivity.

  • Based on the observed boundary discordance and 0–27% clinically incorrect reasoning rate, none of the four evaluated models demonstrated sufficient reproducibility or reasoning accuracy to support unsupervised use for advanced-CKD renoprotective therapy recommendations, particularly around eGFR<15 decisions; comparative human–LLM validation is required before any deployment decision.
