Frontier Large Language Models for Comprehensive Medication Review in CKD Patients with Polypharmacy: A Trap-Embedded Synthetic Benchmark

Kai-Chou Chuang
Hsuan-Jen Lin
Hsuan-Ming Lin


Abstract

Background

Patients with CKD and polypharmacy face high rates of drug-related problems, yet comprehensive medication review remains time-intensive and inconsistently performed. Large language models (LLMs) may augment this process, but existing benchmarks use multiple-choice formats that do not reflect open-ended, nephrology-specific review. We developed a trap-embedded synthetic CKD benchmark and evaluated five current-generation LLMs (GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4.1 Fast, DeepSeek R1; tested April–May 2026) for open-ended medication review.

Methods

Fifty synthetic CKD cases across three complexity groups (G3a–G3b [n = 20], G4 [n = 15], G5/G5D/transplant [n = 15]) with 8–12 medications and ≥2 embedded clinical traps each were scored against nephrologist-adjudicated gold standards. Each model produced three independent responses per case (temperature 0; 750 total outputs). The primary endpoint was per-case macro F1; secondary endpoints were safety-critical omission rate, PI-adjudicated hallucination rate, and intra-model consistency. Blinded inter-rater reliability for gold-standard item detection was assessed on a 30% sample.
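As an illustration of how a per-case macro F1 could be computed from open-ended outputs, the sketch below scores the set of gold-standard items a model detected against the gold set for each case, then averages across cases. The abstract does not specify the scoring code or the consensus rule, so everything here is an assumption: the function names, the item labels, and the strict-majority rule for combining the three responses are hypothetical.

```python
from statistics import mean


def case_f1(detected: set[str], gold: set[str]) -> float:
    """F1 for one case: harmonic mean of precision and recall over item labels."""
    true_positives = len(detected & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(detected)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def consensus(runs: list[set[str]]) -> set[str]:
    """Items flagged in a strict majority of runs (assumed rule, not from the paper)."""
    counts: dict[str, int] = {}
    for run in runs:
        for item in run:
            counts[item] = counts.get(item, 0) + 1
    return {item for item, count in counts.items() if count > len(runs) / 2}


def macro_f1(cases: list[tuple[set[str], set[str]]]) -> float:
    """Unweighted mean of per-case F1 over (detected, gold) pairs."""
    return mean(case_f1(detected, gold) for detected, gold in cases)


# Hypothetical case: three gold items, three responses from one model.
gold_items = {"nsaid_in_ckd", "binder_timing", "hyperkalemia_combo"}
responses = [
    {"nsaid_in_ckd", "hyperkalemia_combo"},
    {"nsaid_in_ckd"},
    {"nsaid_in_ckd", "metformin_dose"},
]
agreed = consensus(responses)
case_score = case_f1(agreed, gold_items)
```

Averaging per case rather than pooling all items gives every case equal weight regardless of how many traps it contains, which is one plausible reading of "per-case macro F1".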

Results

Consensus-level macro F1 ranged from 0.41 (Claude Sonnet 4.6) to 0.49 (Grok 4.1 Fast) (Friedman P < 0.001). Phosphate binder timing (11%) and hyperkalemia combinations (33%) were poorly detected across all models. Safety-critical omission rate ranged from 22% to 48% (P < 0.001); PI-adjudicated hallucination ranged from 0% (GPT-5.4) to 54% (DeepSeek R1), including fabricated dose caps and non-existent guideline citations. Blinded reliability for gold-standard item detection was high (κ = 0.934, n = 92).
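For readers unfamiliar with the κ statistic, a minimal sketch of Cohen's kappa for two raters follows. It assumes each rater marks every sampled gold-standard item as detected (1) or not detected (0); the abstract does not state which kappa variant was used, and the function name and toy ratings are hypothetical.

```python
def cohen_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("ratings must be non-empty and of equal length")
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal label rates.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy ratings on four items (illustrative only, not study data).
kappa = cohen_kappa([1, 1, 0, 0], [1, 1, 0, 1])
```

In the toy example the raters agree on three of four items (0.75) while chance agreement is 0.5, so kappa is 0.5.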

Conclusions

This nephrology-specific benchmark exposes clinically important LLM blind spots that generic multiple-choice evaluations would not detect. Heterogeneous hallucination and omission rates indicate that model selection and domain-specific guardrails should precede any clinical deployment of LLM-assisted CKD medication review. Prospective validation with real patient data and human comparators is required before deployment recommendations can be made.

Key Points

All five LLMs poorly detected phosphate binder timing (11%) and hyperkalemia combinations (33%) in nephrology-specific medication review.
Macro F1 ranged from 0.41 to 0.49, safety-critical omissions from 22% to 48%, and hallucination from 0% to 54%, with qualitatively distinct error patterns across models.
Open-ended scoring exposed LLM failures invisible to multiple-choice benchmarks, supporting specialty evaluation before deployment.

Version published to 10.64898/2026.05.23.26353939 on medRxiv, May 26, 2026
