A Longitudinal Clinical Foundation Model on Nationwide Veteran Health Trajectories

Abstract

We present VA-LLM, a 1.62-billion-parameter autoregressive transformer pre-trained from scratch on 1.74 trillion tokens of clinical text spanning 22 years of care for 13.8 million patients in the Veterans Health Administration, with mortality outcomes confirmed through the National Death Index for 7.8 million patients. In a retrospective–prospective evaluation on 107,555 withheld patients, VA-LLM achieved higher 5-year AUPRC than Llama-2 (7 billion parameters), BioGPT_large (1.57 billion parameters), and GatorTron (3.91 billion parameters), matching GatorTron’s 100,000-patient performance with only 10,000 labeled patients. In a clinical validation against the VA’s operational Care Assessment Need (CAN) score on 5.5 million patients one year beyond the pre-training corpus, VA-LLM achieved a 90-day mortality AUROC of 90.00% versus 87.74% for CAN (p < 0.001) and a 45% relative improvement in AUPRC; post-hoc recalibration recovered calibration comparable to CAN (Brier 0.0091 versus 0.0093) without sacrificing discrimination. Across 21 pre-training checkpoints, discriminative performance correlated more strongly with cumulative mortality experience (CME), the total person-years contributed by patients with confirmed deaths, than with token count (ΔR² = 0.15; Williams p < 10⁻⁶). Performance plateaued once marginal cohorts added fewer confirmed deaths, even as pre-training loss continued to decrease. These findings suggest that the clinical composition of pre-training data, particularly the completeness of documented patient trajectories, correlates with predictive performance more closely than corpus size alone.
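
As a rough illustration of two quantities named in the abstract, the sketch below computes CME as the sum of person-years over patients with confirmed deaths, and the Brier score as the mean squared difference between predicted probability and observed outcome. The `PatientRecord` type, its field names, and both function names are hypothetical and introduced only for this example; the abstract does not describe how the authors implemented or measured either quantity.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class PatientRecord:
    # Hypothetical fields for illustration; not taken from the paper.
    years_observed: float   # length of the documented trajectory, in years
    death_confirmed: bool   # True if the death was confirmed


def cumulative_mortality_experience(records: Iterable[PatientRecord]) -> float:
    """Sum person-years over patients with a confirmed death.

    Follows the abstract's one-line definition of CME; the exact
    accounting used by the authors is an assumption here.
    """
    return sum(r.years_observed for r in records if r.death_confirmed)


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


if __name__ == "__main__":
    cohort = [
        PatientRecord(years_observed=10.0, death_confirmed=True),
        PatientRecord(years_observed=5.0, death_confirmed=False),
        PatientRecord(years_observed=2.5, death_confirmed=True),
    ]
    # Only the two patients with confirmed deaths contribute person-years.
    print(cumulative_mortality_experience(cohort))
    print(brier_score([0.5, 0.5], [0, 1]))
```

Under this reading, adding patients without confirmed deaths increases token count but leaves CME unchanged, which is the distinction the checkpoint analysis draws.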
