Generalizable CT Vision-Language Modeling for Population Health and Disease Risk
Abstract
Vision-language foundation models (VLMs) for computed tomography (CT) are emerging tools capable of learning generalizable representations from large-scale clinical imaging data. Yet it remains unclear to what extent these models encode biologically meaningful information relevant to real-world clinical variation. We introduce Percival, a CT-native VLM trained on more than 400,000 CT-report pairs from the Penn Medicine BioBank using a dual-encoder symmetric contrastive framework, with the objective of characterizing the biological associations embedded through contrastive pretraining. Across over 20,000 held-out participants, Percival’s latent space shows strong alignment with clinical attributes, body-size measures, and multiple laboratory biomarkers. Phenome-wide analyses further reveal broad correspondence between latent features and disease phenotypes, including conditions not typically evaluated by CT; survival analyses demonstrate that the embeddings capture longitudinal risk patterns. Together, these findings indicate that CT-VLMs uncover a rich latent structure aligned with physiological measurements and disease phenotypes spanning the disease-prevalence spectrum.
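The abstract describes a dual-encoder symmetric contrastive pretraining framework. As a point of reference, the sketch below shows what such an objective typically looks like: a CLIP-style symmetric cross-entropy over cosine similarities between paired image and text embeddings. The function name, temperature value, and embedding dimensions are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders for
    matched CT volumes and report texts. Hypothetical sketch, not the
    paper's implementation.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image),
    # averaged, gives the symmetric objective.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    # Random stand-ins for CT-volume and report embeddings (batch of 8, dim 512).
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(symmetric_contrastive_loss(img, txt))
```

Under this kind of objective, the learned latent space places matched CT volumes and reports close together, which is what makes the downstream alignment with clinical attributes and phenotypes measurable from the embeddings alone.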