Generalizable CT Vision-Language Modeling for Population Health and Disease Risk

Abstract

Vision-language foundation models (VLMs) for computed tomography (CT) are emerging tools capable of learning generalizable representations from large-scale clinical imaging data. Yet it remains unclear to what extent these models encode biologically meaningful information relevant to real-world clinical variation. We introduce Percival, a CT-native VLM trained on more than 400,000 CT-report pairs from the Penn Medicine BioBank using a dual-encoder symmetric contrastive framework, with the objective of characterizing the biological associations embedded through contrastive pretraining. Across over 20,000 held-out participants, Percival’s latent space shows strong alignment with clinical attributes, body-size measures, and multiple laboratory biomarkers. Phenome-wide analyses further reveal broad correspondence between latent features and disease phenotypes, including conditions not typically evaluated by CT, and survival analyses demonstrate that the embeddings capture longitudinal risk patterns. Together, these findings indicate that CT-VLMs learn a rich latent structure aligned with physiological measurements and with disease phenotypes across the prevalence spectrum.
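For readers unfamiliar with the training objective named in the abstract, the sketch below illustrates a generic dual-encoder symmetric contrastive loss of the kind used in CLIP-style image-text pretraining. It is a minimal illustration only: the function name, the temperature value, and the assumption that matched CT volumes and reports share a batch index are ours, not details of the Percival implementation.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric InfoNCE loss over a batch of image-report pairs.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    Matched pairs share the same row index; all other rows serve as negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The correct pairing lies on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image), averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

The symmetry (averaging the two directional losses) encourages both encoders to map matched scans and reports to nearby points in a shared latent space, which is the latent structure the abstract's downstream analyses probe.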
