Identifying Effect Modification of Latent Population Characteristics on Risk Factors with a Sparse Varying Coefficient Regression
Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Leveraging observational data to understand the associations between risk factors and disease outcomes and conduct disease risk prediction is a common task in epidemiology. While traditional linear regression and other machine learning models have been extensively implemented for this task, the associations between risk factors and disease outcomes are typically deemed fixed. In many cases, however, such associations may vary by some underlying features of the individuals, which may involve certain subpopulation characteristics and environmental factors. While data for these latent features may not be available, the observed data on risk factors may have captured some proportion of the variation in these features. Thus extracting latent factors from risk factors and incorporating this effect modification into the model may better capture the underlying data structure and improve inference. We develop a novel regression model with some coefficients varying as functions of latent features extracted from the risk factors. We have demonstrated the superiority of our approach in various data settings via simulation studies. An application on a dataset for lung cancer patients from The Cancer Genome Atlas (TCGA) Program showed that our approach led to a 6% - 118% increase in (AUC-0.5) for distinguishing between different lung cancer stages compared to the classic lasso and elastic net regressions and identified interesting latent effect modifications associated with certain gene pathways.