A Longitudinal Clinical Foundation Model on Nationwide Veteran Health Trajectories

Rafael Zamora-Resendiz
Junqi Yin
Nathan A. Kimbrel
Jean C. Beckham
Million Veteran Program
Silvia Crivelli

Abstract

We present VA-LLM, a 1.62-billion-parameter autoregressive transformer pre-trained from scratch on 1.74 trillion tokens of clinical text spanning 22 years of care for 13.8 million patients in the Veterans Health Administration, with mortality outcomes confirmed through the National Death Index for 7.8 million patients. In a retrospective–prospective evaluation on 107,555 withheld patients, VA-LLM achieved higher 5-year AUPRC than Llama-2 (7 billion parameters), BioGPT_large (1.57 billion parameters), and GatorTron (3.91 billion parameters), matching GatorTron’s 100,000-patient performance with only 10,000 labeled patients. In a clinical validation against the VA’s operational Care Assessment Need (CAN) score on 5.5 million patients one year beyond the pre-training corpus, VA-LLM achieved a 90-day mortality AUROC of 90.00% versus 87.74% (p < 0.001) and a 45% relative improvement in AUPRC; post-hoc recalibration recovered calibration comparable to CAN (Brier 0.0091 versus 0.0093) without sacrificing discrimination. Across 21 pre-training checkpoints, discriminative performance correlated more strongly with cumulative mortality experience (CME), the total person-years contributed by patients with confirmed deaths, than with token count (ΔR² = 0.15; Williams p < 10⁻⁶). Performance plateaued once marginal cohorts added fewer confirmed deaths, even as pre-training loss continued to decrease. These findings suggest that the clinical composition of pre-training data, particularly the completeness of documented patient trajectories, correlates with predictive performance more closely than corpus size alone.
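As a rough illustration of two quantities named in the abstract, the sketch below computes cumulative mortality experience (CME) as the sum of person-years over patients with a confirmed death, and a Brier score as the mean squared difference between predicted probabilities and observed 0/1 outcomes. The `Patient` record layout, the function names, and the toy cohort are hypothetical and are not taken from the preprint; in particular, how the authors count person-years is not stated in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical record layout; the preprint does not specify one.
    years_observed: float  # person-years of documented care
    death_confirmed: bool  # True if a death record was confirmed


def cumulative_mortality_experience(patients):
    """Sum person-years over patients with a confirmed death."""
    return sum(p.years_observed for p in patients if p.death_confirmed)


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Toy cohort (made-up numbers, for illustration only).
cohort = [
    Patient(years_observed=10.0, death_confirmed=True),
    Patient(years_observed=4.5, death_confirmed=False),
    Patient(years_observed=2.5, death_confirmed=True),
]

cme = cumulative_mortality_experience(cohort)
score = brier_score([0.9, 0.1], [1, 0])
```

Under this reading, a cohort that adds many tokens but few confirmed deaths contributes little CME, which is the contrast with raw token count that the abstract draws.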

Version published to 10.64898/2026.05.13.26353133 on medRxiv
May 17, 2026

Calibrated and Interpretable Machine Learning for ICU Mortality Prediction Using First 24-Hour Clinical Data

This article has 3 authors:
1. Abdallah Alsammani
2. Merasia Johnson
3. Jessica Elrefaei
This article has no evaluations
Latest version Jun 2, 2026

Patient Versus Prediction-Level Evaluation of a Dynamic Clinical Prediction Model of Sepsis

This article has 8 authors:
1. Marcelle Tuttle
2. Carolien C H M Maas
3. Jennie An
4. Benjamin S Wessler
5. William F. Harvey
6. Harry P Selker
7. David van Klaveren
8. David M Kent
This article has no evaluations
Latest version May 27, 2026

Cross-Model Variability in Large Language Model Triage Behavior for Potential Stroke Symptoms

This article has 4 authors:
1. Daniel A Dworkis
2. Jon Stenstrom
3. Ayan Sen
4. Richard T Lucarelli
This article has no evaluations
Latest version May 25, 2026
