Boundary-Specific Failure Modes and Safety Trade-offs of Large Language Models in Chronic Kidney Disease Renoprotective Therapy Review: A Stratified Synthetic Benchmark

Shang-En Yeh
Hsuan-Jen Lin
Wu-Wei Lai
Hsuan-Ming Lin


Abstract

Background

Renoprotective therapies, namely SGLT2 inhibitors, finerenone, and renin-angiotensin system inhibitors (RASi), remain underutilised in chronic kidney disease (CKD). Large language models (LLMs) may detect therapy omissions, but their performance across CKD severity strata and at clinical decision boundaries has not been evaluated.

Methods

We constructed 100 synthetic CKD vignettes (G3a-G5D; 75 with prespecified omissions, 25 decoys) and queried four LLMs three times each at temperature 0 (100 × 4 × 3 = 1,200 calls). Omission criteria were adapted from KDIGO 2024 and included an investigator-defined gray-zone RASi initiation criterion at eGFR <15 mL/min/1.73 m². Two nephrologists independently classified a stratified 20-case subset.
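The query design can be sketched as a simple grid, shown here only to make the call count explicit. This is a minimal illustration, not the authors' pipeline: the function name `build_call_grid`, the placeholder model labels, and the record fields are all hypothetical.

```python
from itertools import product


def build_call_grid(n_vignettes=100,
                    models=("model_a", "model_b", "model_c", "model_d"),
                    n_repeats=3,
                    temperature=0.0):
    """Enumerate one record per (vignette, model, repeat) query.

    All names and fields are illustrative assumptions; the abstract only
    states 100 vignettes, four models, three repeats, and temperature 0.
    """
    vignette_ids = range(1, n_vignettes + 1)
    repeats = range(1, n_repeats + 1)
    return [
        {"vignette_id": v, "model": m, "repeat": r, "temperature": temperature}
        for v, m, r in product(vignette_ids, models, repeats)
    ]


grid = build_call_grid()
print(len(grid))  # 100 vignettes x 4 models x 3 repeats = 1200
```

Enumerating the grid up front makes it straightforward to check afterwards that every vignette received the same number of repeats from every model.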

Results

For SGLT2 inhibitor and finerenone omissions, all models achieved near-ceiling sensitivity (97-100%). For RASi, performance diverged at the eGFR<15 boundary: detection was 85% for Grok 4.1 Fast versus 55% for GPT-5.4 and 10% each for Gemini and DeepSeek R1. Gap-detection inter-rater agreement was perfect (kappa = 1.000). Clinically incorrect reasoning rates ranged from 0% (GPT-5.4) to 27% (DeepSeek R1); of the 52 instances identified, 31 were factual pharmacology errors and 21 reflected conservative boundary-discordant reasoning. Reproducibility (Jaccard) ranged from 0.74 to 0.93.
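The agreement statistics reported above can be illustrated with a short sketch. The abstract does not say how the Jaccard reproducibility score was aggregated, so the mean pairwise Jaccard across the three runs used here is an assumption, as are the function names, the convention that two empty sets score 1.0, and the example labels. Cohen's kappa follows its standard definition, (observed - expected) / (1 - expected).

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets are scored 1.0 (assumed convention)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard index over all pairs of repeated runs (assumed aggregation)."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)


def full_agreement(runs):
    """True only if every repeated run flagged exactly the same set."""
    first = set(runs[0])
    return all(set(r) == first for r in runs[1:])


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of cases")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical flags from three repeated runs on one vignette.
runs = [{"SGLT2i", "RASi"}, {"SGLT2i"}, {"SGLT2i", "RASi"}]
print(mean_pairwise_jaccard(runs), full_agreement(runs))

# Hypothetical 20-case subset with identical labels from both raters.
rater_1 = ["gap"] * 15 + ["no_gap"] * 5
rater_2 = ["gap"] * 15 + ["no_gap"] * 5
print(cohens_kappa(rater_1, rater_2))
```

Reporting a set-overlap score alongside a strict full-agreement rate is useful because a single dropped flag lowers the Jaccard value only partially but breaks full agreement outright.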

Conclusions

This boundary-aware synthetic benchmark showed that aggregate sensitivity can conceal clinically important operational-rule discordance. Rule-based SGLT2 inhibitor and finerenone omissions were detected with near-ceiling sensitivity, whereas an investigator-defined gray-zone RASi criterion at eGFR<15 exposed model-specific boundary behaviour. Evaluation of LLM-based CKD decision support should report boundary-specific performance, reproducibility, and clinically incorrect reasoning alongside aggregate metrics.
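How an aggregate figure can mask a boundary failure is easy to show with stratified sensitivity. The sketch below uses invented counts purely for illustration; the stratum names, the 55/20 split, and the function `sensitivity_by_stratum` are hypothetical and are not taken from the study data.

```python
from collections import defaultdict


def sensitivity_by_stratum(records):
    """Return (overall, per-stratum) sensitivity.

    `records` is an iterable of (stratum, detected) pairs, one per
    true-omission case; `detected` is True when the model flagged it.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for stratum, detected in records:
        totals[stratum] += 1
        hits[stratum] += int(detected)
    per_stratum = {s: hits[s] / totals[s] for s in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_stratum


# Hypothetical counts: 55 clear-rule omissions all detected,
# 20 boundary omissions of which only 2 are detected.
records = (
    [("clear_rule", True)] * 55
    + [("egfr_lt_15_rasi", True)] * 2
    + [("egfr_lt_15_rasi", False)] * 18
)

overall, per_stratum = sensitivity_by_stratum(records)
print(f"overall: {overall:.2f}")
for stratum, value in per_stratum.items():
    print(f"{stratum}: {value:.2f}")
```

With these invented counts the overall sensitivity is 0.76 while the boundary stratum sits at 0.10, which is the kind of gap a single aggregate number would hide.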

Key Learning Points

What was known

SGLT2 inhibitors, finerenone, and renin-angiotensin system inhibitors reduce kidney and cardiovascular risk in eligible CKD patients but remain underused.
Large language models are being considered for clinical decision support, but most benchmarks report aggregate accuracy rather than boundary-specific safety.

This study adds

A boundary-aware synthetic benchmark was constructed to evaluate CKD renoprotective-therapy omission detection across clear rule-based indications, an investigator-defined eGFR<15 gray-zone RASi criterion, decoys, reproducibility, and clinically incorrect reasoning.
All four evaluated LLMs detected SGLT2 inhibitor and finerenone omissions with near-ceiling sensitivity, but RASi detection diverged sharply at the eGFR<15 boundary, revealing a model-specific conservative non-initiation pattern hidden by aggregate results.
Safety profiles differed across models: reproducibility ranged from 60% to 89% full agreement, and clinically incorrect reasoning rates ranged from 0% to 27%.

Potential impact

LLM evaluation for nephrology decision support should report boundary performance, reproducibility, and clinically incorrect reasoning rates alongside aggregate sensitivity.
Given the observed boundary discordance and the 0–27% range of clinically incorrect reasoning rates, none of the four evaluated models demonstrated sufficient reproducibility or reasoning accuracy to support unsupervised use for advanced-CKD renoprotective therapy recommendations, particularly around eGFR<15 decisions. Comparative human–LLM validation is required before any deployment decision.

Version published to 10.64898/2026.05.28.26353938 on medRxiv
May 30, 2026

Related articles

Frontier Large Language Models for Comprehensive Medication Review in CKD Patients with Polypharmacy: A Trap-Embedded Synthetic Benchmark

This article has 3 authors:
1. Kai-Chou Chuang
2. Hsuan-Jen Lin
3. Hsuan-Ming Lin
This article has no evaluations. Latest version: May 26, 2026

General-Purpose vs. Domain-Specific Large Language Models in Antibiotic Clinical Decision-Making: A Double-Blind Evaluation with a 2×2 Factorial Design

This article has 7 authors:
1. Yang Liu
2. Changjing Zhang
3. Feifei Wang
4. Wei Xu
5. Yunhe Zhang
6. Shaolin Ma
7. Haitao Zhang
This article has no evaluations. Latest version: Jul 13, 2026

Algorithmic Versus Expert Rankings of Large Language Models in Peritoneal Dialysis Prescription Review: A Trap-Embedded Synthetic Benchmark

This article has 4 authors:
1. Chao-Hsuan Wei
2. Hsuan-Jen Lin
3. Wu-Wei Lai
4. Hsuan-Ming Lin
This article has no evaluations. Latest version: Jun 1, 2026
