Auditing frontier general-purpose large language models in biomedical tasks: reasoning gains, extraction limits, and benchmark reliability
Abstract
As large language models approach clinical deployment, their reliability in practice and the validity of the benchmarks used to assess it remain insufficiently examined. Here, we present a unified, reproducible, human-centric audit of frontier general-purpose language models on representative biomedical text-mining tasks and nine biomedical question-answering benchmarks spanning reasoning-intensive, extraction-oriented, and multimodal settings. We observe consistent gains in clinical reasoning and multimodal biomedical QA; however, despite narrowing gaps with supervised systems, limitations in format-constrained tasks such as span-level extraction and evidence-dense summarization continue to hinder integration into structured clinical workflows. Blinded expert adjudication confirms more coherent and clinically plausible reasoning, and further reveals that a substantial fraction of apparent errors stems from outdated or ambiguous benchmark annotations, suggesting that current benchmarks may misestimate model capability and misguide deployment decisions. Cost-normalized analyses show that recent frontier models achieve higher accuracy at substantially lower cost per correct answer, reshaping practical trade-offs for scalable digital medicine systems. Together, these findings suggest that general-purpose language models are approaching deployment-relevant reliability; safe and effective clinical use, however, will require hybrid architectures, external grounding, and human-in-the-loop expert oversight.