Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Abstract

Operational decisions governing patient flow, cost, and quality of care demand specialized predictive models, yet most clinical NLP efforts focus on medical knowledge benchmarks. We introduce Lang1, a family of language models (100M-7B parameters) pretrained on 80 billion clinical tokens from NYU Langone Health electronic health records (EHRs) blended with 627 billion internet tokens. We evaluate Lang1 on the REalistic Medical Evaluation (ReMedE), an evaluation suite derived from 668,331 EHR notes spanning five tasks: readmission prediction, mortality prediction, length-of-stay prediction, comorbidity coding, and insurance denial prediction. In zero-shot settings, both general-purpose and biomedical models underperform on four of the five tasks. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger. Joint multi-task finetuning yields cross-task transfer, and Lang1-1B transfers effectively to unseen tasks and to an external health system. These results demonstrate that effective healthcare AI requires in-domain pretraining, supervised finetuning, and evaluation beyond proxy benchmarks.