Auditing frontier general-purpose large language models in biomedical tasks: reasoning gains, extraction limits, and benchmark reliability
Abstract
As large language models approach clinical deployment, their reliability in practice and the validity of the benchmarks used to assess it remain insufficiently examined. Here, we present a unified, reproducible, human-centric audit of frontier general-purpose language models on representative biomedical text-mining tasks and nine biomedical question-answering benchmarks spanning reasoning-intensive, extraction-oriented, and multimodal settings. We observe consistent gains in clinical reasoning and multimodal biomedical QA; however, despite narrowing gaps with supervised systems, limitations in format-constrained tasks such as span-level extraction and evidence-dense summarization continue to hinder integration into structured clinical workflows. Blinded expert adjudication confirms more coherent and clinically plausible reasoning, and further reveals that a substantial fraction of apparent errors stems from outdated or ambiguous benchmark annotations, suggesting that current benchmarks may misestimate model capability and misguide deployment decisions. Cost-normalized analyses show that recent frontier models achieve higher accuracy at substantially lower cost per correct answer, reshaping practical trade-offs for scalable digital medicine systems. Together, these findings suggest that general-purpose language models are approaching deployment-relevant reliability; safe and effective clinical use, however, will require hybrid architectures, external grounding, and human-in-the-loop expert oversight.