Estimating population structure using epigenome-wide methylation data


Abstract

Introduction

In epigenome-wide association analysis (EWAS), unaddressed population stratification often inflates test statistics. We aimed to compute methylation population scores (MPSs) that predict genetic principal components (GPCs) using a feature selection and regression approach.

Methods

We used multi-ethnic methylation data (Illumina 450K/EPIC array) from unrelated MESA (n=929), CARDIA (n=1123), JHS (n=1365), ARIC (n=2338), and HCHS/SOL (n=1475) individuals, randomly assigning 85% of participants from each cohort to a training dataset and the remaining 15% to a test dataset. First, we estimated the association of each GPC with each available CpG methylation site using linear regression within each cohort, adjusting for age, sex, smoking status, race/ethnic background (as a proxy for lifestyle and other environmental exposures that may affect methylation), alcohol use status, body mass index, and cell type proportions. We meta-analyzed the associations across cohorts and selected CpG sites with an FDR-adjusted q-value < 0.05. Next, we pooled individual-level data across the cohort-specific training datasets and applied two-stage weighted least squares Lasso regression, with the GPCs as the outcomes and the selected CpG sites as penalized predictors, adjusting for the aforementioned covariates. Each MPS is the weighted sum of the CpG sites selected by the Lasso. To evaluate the MPSs, we constructed them in the test dataset and compared them with the GPCs and with MPSs constructed based on a previously published paper. Comparison was based on correlation analysis and data visualization. We also demonstrated the use of the MPSs in EWAS.
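The CpG selection step (per-cohort association estimates, meta-analysis, FDR filtering) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes an inverse-variance fixed-effect meta-analysis with a normal approximation and Benjamini-Hochberg adjustment, neither of which is specified in the abstract, and the function names and CpG identifiers are hypothetical.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance fixed-effect meta-analysis of one CpG across cohorts.

    Assumption: the abstract does not name the meta-analysis model used.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return beta, se, p

def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q

def select_cpgs(cohort_stats, alpha=0.05):
    """Keep CpGs whose meta-analysis q-value is below alpha.

    cohort_stats maps a CpG id to ([beta per cohort], [SE per cohort]).
    """
    ids = list(cohort_stats)
    pvals = [fixed_effect_meta(*cohort_stats[c])[2] for c in ids]
    qvals = bh_qvalues(pvals)
    return [c for c, q in zip(ids, qvals) if q < alpha]

# Hypothetical per-cohort estimates for two CpGs in two cohorts.
example = {
    "cg_a": ([0.5, 0.6], [0.05, 0.05]),
    "cg_b": ([0.01, -0.01], [0.05, 0.05]),
}
selected = select_cpgs(example)
```

In this toy input only `cg_a` has a consistent effect across cohorts, so it is the only site passed on to the penalized regression step.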

Results

In the test dataset, the MPSs were highly correlated with the GPCs, with correlation decreasing, though not monotonically, for later components. Specifically, MPS1 and GPC1 had R² = 0.99, while MPS7 and GPC7 had R² = 0.27 (the lowest observed value). In data visualization, MPSs showed patterns similar to those of GPCs in differentiating self-reported White, Black, and Hispanic/Latino groups, and outperformed MPC constructed using alternative published methods. In EWAS, MPSs performed comparably to GPCs in reducing some of the inflation.
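As a rough illustration of the comparison metric, the sketch below computes R² as the squared Pearson correlation between one MPS and its matching GPC in a held-out sample. That reading of R² is an assumption, since the abstract does not define it, and the function name is hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has undefined correlation")
    return sxy ** 2 / (sxx * syy)
```

Because the value is squared, the sign of the score does not matter: an MPS that is a sign-flipped copy of its GPC still yields R² = 1.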

Conclusions

Methylation-based population scores provide a reliable estimate of population structure in the data and can complement GPCs when genetic data are absent. Unlike previous methods based on unsupervised methylation PCA, MPSs use supervised learning with covariate adjustment to capture genetic structure across diverse populations. The weights derived in our study for each GPC can be applied to generate MPSs in other studies.
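Applying published weights in a new study amounts to a weighted sum over the selected CpG sites. The sketch below assumes the weights arrive as a mapping from CpG identifier to coefficient and that methylation values are on the same scale used when the weights were derived. The abstract does not describe any intercept, centering, or scaling, so none is applied here, and all names and values are hypothetical.

```python
def methylation_population_score(weights, methylation):
    """Weighted sum of methylation values over the CpGs that carry a weight.

    weights:      {cpg_id: coefficient} for one GPC (hypothetical format)
    methylation:  {cpg_id: methylation value} for one individual
    """
    missing = [cpg for cpg in weights if cpg not in methylation]
    if missing:
        # Fail loudly instead of silently dropping sites from the score.
        raise KeyError(f"methylation values missing for: {missing}")
    return sum(w * methylation[cpg] for cpg, w in weights.items())

# Hypothetical weights and one individual's methylation values.
weights_gpc1 = {"cg_x": 2.0, "cg_y": -1.0}
person = {"cg_x": 0.5, "cg_y": 0.25, "cg_z": 0.9}
score = methylation_population_score(weights_gpc1, person)
```

Sites without a weight, such as `cg_z` here, are ignored. Raising an error on missing weighted sites is a design choice for this sketch, because how to handle sites absent from a given array is not addressed in the abstract.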
