Algorithmic Versus Expert Rankings of Large Language Models in Peritoneal Dialysis Prescription Review: A Trap-Embedded Synthetic Benchmark

Chao-Hsuan Wei
Hsuan-Jen Lin
Wu-Wei Lai
Hsuan-Ming Lin

Abstract

Background

Clinical LLM benchmarks rarely test whether algorithmic rankings agree with expert clinical judgment. We developed a trap-embedded peritoneal dialysis (PD) benchmark comparing multiple scoring constructs with blinded nephrologist ratings.

Methods

We generated 125 synthetic PD cases containing 13 ISPD-aligned trap types. Five LLMs (Claude Sonnet 4.5, GPT-5.4, Gemini 3.1 Pro, DeepSeek-R1, Grok 4.1 Fast) each evaluated every case three times at temperature 0 (1,875 calls). The primary outcome was the must-identify TDR (TDR_must), analyzed with GEE and a case-clustered bootstrap. Secondary analyses included a verbosity-sensitive alarm-burden proxy, WCS, relaxed-match scoring, WCS sensitivity analyses, and a 25-output blinded expert adequacy substudy. Must-identify kappa was 0.89 in Stage 1 and 0.92 in Stage 2.
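As an illustration of the case-clustered bootstrap mentioned above, the following is a minimal sketch, not the authors' analysis code. It assumes each case contributes a list of 0/1 detection outcomes (one per repeated run) and that whole cases, not individual runs, are resampled so that within-case correlation is preserved. The function names, the toy data, and the percentile-interval choice are all hypothetical.

```python
import random

# Call count stated in the abstract: 5 models x 125 cases x 3 runs.
N_CALLS = 5 * 125 * 3

def pooled_rate(cases):
    """Fraction of detections over all runs in the given cases."""
    hits = sum(sum(runs) for runs in cases)
    total = sum(len(runs) for runs in cases)
    return hits / total

def cluster_bootstrap_ci(cases, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the pooled rate, resampling whole cases.

    `cases` is a list of per-case outcome lists (assumed layout).
    """
    rng = random.Random(seed)
    n = len(cases)
    stats = []
    for _ in range(n_boot):
        # Draw n cases with replacement; all runs of a case travel together.
        sample = [cases[rng.randrange(n)] for _ in range(n)]
        stats.append(pooled_rate(sample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Toy data (hypothetical): 5 cases, 3 runs each, 1 = trap identified.
toy_cases = [[1, 1, 1], [1, 1, 0], [1, 1, 1], [0, 0, 1], [1, 1, 1]]
point = pooled_rate(toy_cases)
low, high = cluster_bootstrap_ci(toy_cases)
```

Resampling at the case level rather than the run level matters here because three runs of the same case at temperature 0 are far from independent; treating them as independent would make the interval too narrow.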

Results

Rankings were discordant. Recall ranked Claude (0.977) and GPT-5.4 (0.955) above the other models (0.86–0.90, p<0.0001). The alarm-burden proxy favored concise models (Grok 0.689; 21.6 vs 2.4 issues/case), while WCS produced a third ordering. In the expert substudy, inter-rater concordance was strong (rho 0.977), but WCS did not show a positive association with expert adequacy (rho -0.17, p=0.41).
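The rho values above are rank correlations. A minimal sketch of how a Spearman coefficient can be computed is given below, assuming paired scores (for example, an algorithmic metric and an expert rating per output). The helper names and the toy inputs are hypothetical and do not reproduce the study's data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho as the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Toy inputs (hypothetical): a metric that orders outputs opposite to experts.
metric_scores = [0.9, 0.7, 0.5, 0.3]
expert_ratings = [1, 2, 3, 4]
rho = spearman_rho(metric_scores, expert_ratings)
```

A coefficient near zero or below, as reported for WCS against expert adequacy, means the metric's ordering of outputs carries little or no information about the experts' ordering.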

Conclusion

Clinical LLM rankings in PD prescription review depend strongly on scoring construct. Algorithmic metrics should be reported alongside blinded expert adequacy ratings and should not alone determine deployment.

Version published to 10.64898/2026.05.28.26354383 on medRxiv
Jun 1, 2026

