Claim-Level Transparency Analysis of LLM-Generated Diagnostic Reports: A Metabolic and Endocrine Biomarker Study

Andrii Yasinetsky
Caleb Geniesse
Elena Ikonomovska

Abstract

Large language models are increasingly deployed in clinical decision-support contexts, yet systematic evaluation of their factual reliability in generating patient-specific diagnostic reports remains sparse, particularly for laboratory interpretation tasks. This study presents a controlled transparency experiment in which four frontier LLMs (Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.2, and Gemini 3.1 Pro) each generated diagnostic reports for 36 patients (29 female, 7 male; aged 27–64) with biomarker profiles spanning metabolic, endocrine, and nutritional markers. A transparency engine extracted up to 50 claims per report (3,035 total), searched for supporting scientific evidence, and classified each claim as supported by science, plausible, or unsupported. Unsupported claims were uncommon: the transparency engine classified 2.7% of claims as unsupported (hereafter, the pipeline-measured hallucination rate; naive claim-level 95% Wilson CI: 2.2%–3.4%), with GPT-5.2 at the lowest observed rate (1.7%) and Claude Opus 4.6 at the highest (3.6%). However, mechanistic verification revealed a much larger plausibility gap: 915 claims (30.2%) were biologically reasonable but lacked a fully verified evidence chain, bringing the share of claims not fully supported by direct evidence to 32.9%. Gemini 3.1 Pro produced the highest plausible proportion (39.6%), suggesting a more conservative but less fully grounded reasoning profile. Although coarse support-level distributions were broadly similar across models (Cramér's V = 0.081), claim-level analysis revealed substantial narrative divergence: 61.2% of claims were unique to a single model, and matched-claim agreement was low (Cohen's kappa = 0.233), indicating that models generate substantively different clinical narratives for the same patient data despite comparable aggregate support profiles. These findings show that hallucination metrics alone understate the share of claims not fully verified under this protocol, and that claim-level mechanistic verification is needed to distinguish the proven from the merely plausible in metabolic and endocrine laboratory interpretation. Generalizability to other clinical domains requires further study.
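The abstract reports a "naive claim-level 95% Wilson CI" for the unsupported-claim rate. As a minimal sketch of how such an interval can be computed, the Python function below applies the standard Wilson score formula. The function name `wilson_interval` is illustrative, and the count of 83 unsupported claims is an assumption: the abstract gives only the rounded rate (2.7%) and the total (3,035), not the raw count. Under that assumed count, the bounds round to the reported 2.2% and 3.4%. "Naive" here means every claim is treated as an independent draw, so any clustering of claims within a report or patient is ignored.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")

    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: 83 unsupported claims out of 3,035. The raw count is not
# stated in the abstract; 83 is one value consistent with the reported 2.7%.
low, high = wilson_interval(83, 3035)
print(f"{low:.1%} to {high:.1%}")
```

An interval that accounts for claims being nested within reports and patients would generally be wider than this claim-level one.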

Version published to 10.64898/2026.05.03.721751 on bioRxiv
May 6, 2026

Related articles

AI-Generated Prior Authorization Letters: Strong Clinical Content, Weak Administrative Scaffolding

This article has 2 authors:
1. Moiz Sadiq Awan
2. Maryam Raza
This article has no evaluations
Latest version Apr 14, 2026

Quantifying Scientific Consensus in Biomedical Hypotheses via LLM-Assisted Literature Screening

This article has 3 authors:
1. Uiyun Kim
2. Ohhyeon Kwon
3. Doheon Lee
This article has no evaluations
Latest version Apr 9, 2026

AI-driven credibility profiling of real-world patient experiences suggests overlooked kidney stone therapies warrant further investigation

This article has 3 authors:
1. Alfredo Parra Hinojosa
2. Daniel C. Elton
3. Andrés Gómez-Emilsson
This article has no evaluations
Latest version Apr 17, 2026
