A Scalable Framework to Integrate Social Determinants of Health into Disease Risk Models using Biobank Survey Data

Abstract

Complex diseases are a major global health burden, yet our ability to predict who is at risk remains limited. Risk is shaped by both heritable factors and non-genetic environmental, behavioral, and social determinants of health, but these are rarely modeled together. The growing availability of large-scale, multimodal biobanks creates new opportunities to integrate diverse data types into more accurate disease risk models. Here, we apply Multiple Correspondence Analysis (MCA) to over 100 environmental, behavioral, and social variables from the All of Us biobank (N = 171,614) to generate low-dimensional embeddings that quantify non-genetic risk for six common chronic conditions. These embeddings recovered known and novel risk factors and consistently improved prediction beyond demographics and polygenic scores (PGS), with contributions to model performance (ROC-AUC) ranging from 0.03 to 0.05. For five of six diseases, the gains from MCA embeddings surpassed those attributable to PGS. Genetic and non-genetic risks combined largely additively: we observed little evidence of interaction effects (ΔAUC < 0.001) and highly stable variant effect sizes when embeddings were included in genetic association models (r > 0.98). In summary, we introduce a scalable, interpretable framework that summarizes survey-based environmental, behavioral, and social factors without prior assumptions about disease-specific variables. Our results demonstrate that these non-genetic contexts substantially improve prediction while acting largely additively with genetic risk, clarifying the role of gene-environment interplay in complex disease and supporting more equitable and robust risk modeling across diverse populations.
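
To make the embedding step concrete, the following is a minimal sketch of how MCA can turn categorical survey answers into low-dimensional coordinates. It is an illustration under stated assumptions, not the authors' pipeline: the function names `one_hot` and `mca_embed`, the toy survey answers, and the choice of two components are all hypothetical, and the sketch uses the textbook formulation of MCA as correspondence analysis of the one-hot indicator matrix.

```python
import numpy as np


def one_hot(categories):
    """Build an indicator matrix from an (n, q) array of categorical answers."""
    categories = np.asarray(categories)
    blocks = []
    for j in range(categories.shape[1]):
        column = categories[:, j]
        levels = np.unique(column)  # only levels that actually occur
        blocks.append((column[:, None] == levels[None, :]).astype(float))
    return np.hstack(blocks)


def mca_embed(categories, n_components=2):
    """Hypothetical helper: MCA as correspondence analysis of the indicator matrix.

    Returns the row (participant) principal coordinates and the principal
    inertias (squared singular values) of the retained components.
    """
    Z = one_hot(categories)
    P = Z / Z.sum()          # correspondence matrix
    r = P.sum(axis=1)        # row masses
    c = P.sum(axis=0)        # column masses
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    # Row principal coordinates: D_r^{-1/2} U Sigma.
    F = (U[:, :n_components] * s[:n_components]) / np.sqrt(r)[:, None]
    return F, s[:n_components] ** 2


# Toy, made-up survey answers (not All of Us data): 6 participants, 3 questions.
answers = np.array([
    ["rent", "never", "low"],
    ["own", "daily", "high"],
    ["rent", "daily", "low"],
    ["own", "never", "high"],
    ["rent", "never", "high"],
    ["own", "daily", "low"],
])
embedding, inertia = mca_embed(answers, n_components=2)
print(embedding.shape)  # one row of coordinates per participant
```

In a risk model of the kind the abstract describes, coordinates like these would be added as covariates alongside demographics and PGS, and their contribution measured as the change in ROC-AUC relative to a model without them.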
