Limitations of principal components in quantitative genetic association models for human studies

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This simulation study is of interest to geneticists, especially those carrying out Genome-wide Association Studies (GWAS). It compares two major approaches for dealing with "population structure" in GWAS: Principal Component Analysis (PCA) and Linear Mixed-effects Models (LMMs). This is a subject of considerable practical importance and the study nicely reviews the theoretical underpinnings and concludes - based on the review and the extensive simulations - that there is every reason to believe LMMs to be superior (although PCA is more widely used). Although this point has been made before, it is worth making again given the ubiquity of these analyses. There are some concerns about the general validity of the claim given that the simulations fail to address several real-world problems.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 agreed to share their name with the authors.)

This article has been Reviewed by the following groups

Abstract

Principal Component Analysis (PCA) and the Linear Mixed-effects Model (LMM), sometimes in combination, are the most common genetic association models. Previous PCA-LMM comparisons give mixed results, unclear guidance, and have several limitations, including not varying the number of principal components (PCs), simulating simple population structures, and inconsistent use of real data and power evaluations. We evaluate PCA and LMM, both with varying numbers of PCs, in realistic genotype and complex trait simulations including admixed families, subpopulation trees, and real multiethnic human datasets with simulated traits. We find that LMM without PCs usually performs best, with the largest effects in family simulations and real human datasets and traits without environment effects. Poor PCA performance on human datasets is driven by large numbers of distant relatives more than the smaller number of closer relatives. While PCA was known to fail on family data, we report strong effects of family relatedness in genetically diverse human datasets, not avoided by pruning close relatives. Environment effects driven by geography and ethnicity are better modeled with LMM including those labels instead of PCs. This work better characterizes the severe limitations of PCA compared to LMM in modeling the complex relatedness structures of multiethnic human data for association studies.

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    This is a simulation study comparing the performance of two major approaches for dealing with “population structure” when carrying out Genome-wide Association Studies - Principal Component Analysis and Linear Mixed-effects Models - a subject of considerable practical importance. The author correctly notes that previous comparisons have been quite limited. In particular, any study not concluding that LMM was superior has relied on very simple models of structure.

    The paper is clearly written and beautifully reviews the theoretical underpinnings (albeit in a manner that will be difficult to penetrate without deep knowledge of several fields). The simulations are well-designed and far better than previous studies. From a theoretical point of view, the work is somewhat limited by being strongly anchored in a very classical quantitative genetics framework that is focused on allele frequencies and inbreeding coefficients, and totally ignores coalescent theory, but this is a minor quibble. The simulations are limited by utilizing ridiculously small sample sizes by the standards of modern human GWAS. And of course, they do not include all the complexities of real data.

    The quantitative genetics framework we used was ideal for motivating and interpreting LMMs in particular, since they model relatedness with a kinship matrix which consists of IBD probabilities, all of which arose from quantitative genetics.

    We also added the following text to the discussion: “However, our conclusions are not expected to change with larger sample sizes, as cryptic family relatedness will continue to be abundant in such data, if not increase in abundance, and thus give LMMs an advantage over PCA (Henn et al., 2012; Shchur et al., 2018; Loh et al., 2018).”

    The main conclusion of the study is that LMM really are generally superior - as expected on theoretical grounds. However, the authors do not address whether switching to LMM really is practicable given the sample size and lack of data sharing that characterize human genetics. Nor is it clear whether the difference in performance matters in real life given that the entire framework used is an idealized one - the fact that real human data suffers from environmental confounders that are correlated with “ancestry” is not addressed, to take the most obvious example. That said, it is surely important to note that the approach routinely used by the majority of users (PCA with 10 PCs) is mostly used for historical reasons and has little theoretical or empirical justification.

    We added simulations with environment effects correlated with ancestry, which we hope will make our study even more relevant as it does make our evaluations even more realistic than before. In the presence of environment effects, LMM without PCs remains among the best approaches, although occasionally LMM with PCs or PCA will perform slightly better. However, modeling environment directly (with the true variables) improves performance much more than by using PCs to model environment indirectly, so we believe that is not a strong reason for continuing to use PCs (in LMMs or otherwise) unless there is no choice.

    We also added the following text to the discussion: “However, recent approaches not tested in this work have made LMMs more scalable and applicable to biobank-scale data (Loh et al., 2015; Zhou et al., 2018; Mbatchou et al., 2021), so one clear next step is carefully evaluating these approaches in simulations with larger sample sizes.” As stated earlier, we believe that the difference in performance between LMM and PCA will remain in larger sample sizes because cryptic relatedness is more prevalent in that setting.

    We excluded the “lack of data sharing” point from our discussion because it does not align well with the goals of our manuscript. The current solution to the lack of data sharing is meta-analysis, but its use does not give PCA or LMM an inherent advantage, since it can be applied to the summary statistics of either (or even a combination of models, in theory). There is interesting recent work on “federated” PCA and LMM association (both versions exist), which allows a single model to be fit jointly to separate datasets (residing in different buildings across the world) as if they were combined into a single dataset. Thus, these issues do not explain or motivate why PCA or LMM should be used.

    Reviewer #2 (Public Review):

    Yao and Ochoa present a very nice paper examining the age-old question of whether LMM or PCA is a better way to adjust for structure (population, family, admixture). The authors provide a very nice and detailed overview of the previous research addressing this question, summarizing it in a table. They find that LMMs are generally better at accounting for population structure. However, I feel there are a couple of important factors that are missing. One is the consideration of environmental structure. Another is that the relationship between PCA and LMM is usually a bit more complicated in practice than depicted here, where the devil really lies in the details. Also, I think there are a couple of key reasons why LMMs haven’t been adopted as quickly as one might have expected, including case-control imbalance and cohort meta-analyses, which I feel the authors could point out. In fact, I believe LMMs have become sort of popular in recent years (e.g. Japan Biobank GWAS results).

    We added environment simulations, which we agree was an important shortcoming of the previous version of our work.

    We now discuss how the PCA and LMM connection can be more complicated in practice, but as the main difference is in how LD is handled, once that is correctly adjusted, PCs and random effects are still mostly modeling the same relatedness signals. Ultimately, our main conclusion is unchanged, namely that only LMMs can model family relatedness, which is their key advantage.
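
    The connection mentioned above can be made concrete with the textbook forms of the two models. This is a sketch in generic notation, not copied from the manuscript: the symbols (trait y, tested genotype x, kinship matrix Φ with eigendecomposition UΛUᵀ, and U_r for its top r eigenvectors) are assumptions chosen for illustration, and in practice PCs are computed from an estimate of the kinship or genotype covariance matrix.

```latex
\begin{align*}
\text{PCA:}\quad & \mathbf{y} = \mathbf{1}\alpha + \mathbf{x}\beta + \mathbf{U}_r \boldsymbol{\gamma} + \boldsymbol{\epsilon}, \\
\text{LMM:}\quad & \mathbf{y} = \mathbf{1}\alpha + \mathbf{x}\beta + \mathbf{s} + \boldsymbol{\epsilon},
  \qquad \mathbf{s} \sim \mathcal{N}\!\left(\mathbf{0},\, 2\sigma_s^2 \boldsymbol{\Phi}\right), \\
& \boldsymbol{\Phi} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top
  \;\Longrightarrow\; \mathbf{s} = \mathbf{U}\mathbf{b},
  \qquad \mathbf{b} \sim \mathcal{N}\!\left(\mathbf{0},\, 2\sigma_s^2 \boldsymbol{\Lambda}\right).
\end{align*}
```

    Read this way, the random effect is a regression on all eigenvectors with coefficients shrunk according to their eigenvalues, whereas PCA keeps only the top r eigenvectors as unshrunk fixed effects and drops the rest. Family relatedness is typically spread over many small-eigenvalue directions, which is consistent with the conclusion that only the LMM models it.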

    We briefly commented on case-control imbalance in our discussion (now made more clear), but since this involves binary traits, which we did not explicitly test in this work, it is out of scope.

    Cohort meta-analysis does not influence whether to use PCA or LMM, since it can be performed with summary statistics from either model (and in theory even a combination of different models per cohort). The broad use of meta-analysis does not in itself prevent users from using PCA or LMM within individual cohorts. The use of meta-analysis is very interesting in its own right, but it is outside the scope of this work.
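
    To illustrate why meta-analysis is agnostic to the per-cohort model, here is a minimal sketch of fixed-effect inverse-variance-weighted meta-analysis, a standard way to combine summary statistics. It is not taken from the manuscript; the function name `inverse_variance_meta` and the toy numbers are hypothetical. Each cohort contributes only an effect estimate and its standard error, which could come from PCA, an LMM, or a mix of the two.

```python
import math


def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted meta-analysis for one variant.

    betas: per-cohort effect estimates (from any association model).
    ses:   per-cohort standard errors of those estimates.
    Returns (combined_beta, combined_se, two_sided_p).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    if any(se <= 0 for se in ses):
        raise ValueError("standard errors must be positive")
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    # Two-sided p-value from a standard normal approximation.
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p


# Hypothetical example: cohort 1 analyzed with PCA, cohort 2 with an LMM.
beta, se, p = inverse_variance_meta([0.10, 0.20], [0.05, 0.10])
print(round(beta, 4), round(se, 4))  # combined beta is 0.12
```

    Nothing in the calculation depends on how each cohort obtained its estimate, which is the point made in the response above.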

    Reviewer #3 (Public Review):

    This paper examines the relative performance of linear mixed models (LMMs), principal components (PCA), and their combination (PCA-LMM) for genetic association studies in human populations. The authors claim that previous papers examining this question are inadequate and that: (i) there remains confusion on which method is best and in which context, (ii) that the metrics used in previous evaluations were insufficient, and (iii) that the simulation settings used in previous papers were not comprehensive. To fix these problems the authors perform an extensive set of simulations within several frameworks and suggest two new metrics for evaluating performance.

    Strengths:

    The simulation framework used in this paper and the extensive number of simulations provide an opportunity to examine the relative properties of the three approaches (LMM, PCA, PCA-LMM) in a variety of contexts.

    The parameters of the simulation framework are based on highly diverged populations, which is an increasingly common analysis choice that has not been examined in detail via simulation previously.

    The evaluation metrics used in this paper are AUC and a test of the uniformity of the p-value distribution under the null. This is an improvement over some previous analyses which did not examine power and relied on less sensitive tests of type I error.
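
    The two kinds of metric described above can be sketched in a few lines. This is an illustrative stand-in, not the authors' implementation: the review does not specify which uniformity test or which curve the AUC is taken under, so the sketch assumes a Kolmogorov-Smirnov distance to Uniform(0, 1) for null p-values and a ROC AUC for separating causal from null loci. The function names are hypothetical.

```python
def ks_uniform_statistic(null_pvalues):
    """Kolmogorov-Smirnov distance between null p-values and Uniform(0, 1).

    Values near 0 indicate well-calibrated null p-values; larger values
    indicate inflation or deflation of the test statistics.
    """
    if not null_pvalues:
        raise ValueError("need at least one p-value")
    p = sorted(null_pvalues)
    n = len(p)
    d = 0.0
    for i, value in enumerate(p):
        # Compare the empirical CDF just after and just before each point.
        d = max(d, (i + 1) / n - value, value - i / n)
    return d


def roc_auc(causal_pvalues, null_pvalues):
    """Probability that a causal locus gets a smaller p-value than a null one.

    Ties count as one half. This O(n*m) loop favors clarity over speed.
    """
    if not causal_pvalues or not null_pvalues:
        raise ValueError("need p-values for both causal and null loci")
    wins = 0.0
    for pc in causal_pvalues:
        for pn in null_pvalues:
            if pc < pn:
                wins += 1.0
            elif pc == pn:
                wins += 0.5
    return wins / (len(causal_pvalues) * len(null_pvalues))


# Hypothetical toy p-values, for illustration only.
print(ks_uniform_statistic([0.125, 0.375, 0.625, 0.875]))
print(roc_auc([0.001, 0.2], [0.05, 0.5, 0.9]))
```

    Using both kinds of metric matters because a method can rank causal loci well while still producing miscalibrated null p-values, or the reverse.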

    Weaknesses:

    This paper has a limited set of population frameworks just like all papers before it. The breakdown of which method is best (LMM, PCA, PCA-LMM) will be a function of the simulation framework chosen.

    Ameliorating this issue, we added additional simulations with low heritability and with environment effects. We are pleased to report that all of our conclusions hold at low heritability (h2 = 0.3), and for the most part under environment effects (which occasionally give LMM with PCs and PCA a small advantage, but often LMM with no PCs remains best, and we show PCs are no replacement for directly modeling these environment effects).

    The frameworks chosen for this paper are certainly not comprehensive in contemporary human genetic studies. In fact, the authors make a number of unusual choices. For example, the populations in the simulated study have extremely large Fsts. While this is also a strength, the lack of more standard study designs is a weakness. More importantly, there is no simulation of family effects, which is the basis of many of the PCA-LMM papers reported in Table 1.

    We now better motivate in the introduction our focus on association studies of multiethnic and admixed individuals, which are nowadays very common and which have greater FST values than earlier studies. In reference to higher simulated FSTs, we also now cite our recent work, which has found that many previous FST estimates are downwardly biased (Ochoa and Storey, 2021, 2019). We simulated data that was fit to each of our three real datasets using our unbiased methods, so those values that (understandably) appear high are actually more correct (for multiethnic populations such as those in 1000 Genomes, HGDP, etc) than previous estimates in the literature. In our previous work we also determined that only previous pairwise FST estimators are unbiased (under some conditions), and using a previous pairwise FST estimator (from Bhatia et al., 2013) we obtained equally high values between the most diverged human populations (values from a revised version of Ochoa and Storey, 2019 that isn’t on bioRxiv yet): in HGDP, the largest pairwise FST is 0.479, between Pima and PapuanSepik; in Human Origins, the largest estimate is 0.396, between Cabecar and Baining_Malasait; lastly, in 1000 Genomes, the largest estimate is 0.135, between YRI and JPT. (1000 Genomes was generally less structured than HGDP and Human Origins, because the latter include more diverse populations.) Several previous estimates from the literature, all between one hunter-gatherer Sub-Saharan African subpopulation and one non-African subpopulation, resulted in values of about 0.25 (Bowcock et al., 1991, Henn et al., 2011, Bergstrom et al., 2020). FST estimates are also greater from whole-genome sequencing versus array data (revised version of Ochoa and Storey, 2019).

    Family (household) effects are a case where PCA is not expected to outperform LMM, though standard LMMs do not model this effect explicitly either and may not do much better. As this is a feature of family studies that ought to be absent in population studies (as usually only siblings are in the same household, and not more distant relatives), it is also not entirely relevant to the majority of our simulations. In these ways, including such a feature in our simulations does not align with the goals of this present work, but we agree this is an important framework that deserves more attention in future evaluations.

    The discussion (and simulations) of LMM vs PCA, particularly LMMs with PCs as fixed effects misses the critical distinction of whether PCs are in-sample (in which case including PCs as fixed effects effectively serves as a preconditioner for the kinship matrix, speeding up iterative methods such as BOLT), or projections of individuals onto out-of-sample principal axes. There is also no discussion of LOO methods to address “proximal contamination”, also quite relevant in evaluating power as a function of the number of PCs.

    We added the following to our discussion concerning out-of-sample PC projections: “We do not consider the case where samples are projected onto PCs estimated from an external sample (Prive et al., 2020), which is uncommon in association studies, and whose primary effect is shrinkage, so if all samples are projected then they are all equally affected and larger regression coefficients compensate for the shrinkage, although this will no longer be the case if only a portion of the sample is projected onto the PCs of the rest of the sample.”

    We also added the following to the discussion concerning the LOCO approach: “Similarly, the leave-one-chromosome-out (LOCO) approach for estimating kinship matrices for LMMs prevents the test locus and loci in LD with it from being modeled by the random effect as well, which is called “proximal contamination” (Lippert et al., 2011, Yang et al., 2014). While LOCO kinship estimates vary for each chromosome, they continue to model family relatedness, thus maintaining their key advantage over PCA.”

    The same new discussion paragraph closes with the following thoughts concerning LOCO and related approaches: “LD effects must be adjusted for, if present, so in unfiltered data we advise the previous methods be applied. However, in this work, simulated genotypes do not have LD, and the real datasets were filtered to remove LD, so here there is no proximal contamination and LD confounding is minimized if present at all, so these evaluations may be considered the ideal situation where LD effects have been adjusted successfully, and in this setting LMM outperforms PCA. Overall, these alternative PCs or kinship matrices differ from their basic counterparts by either the extent to which LD influences the estimates (which may be a confounder in a small portion of the genome, by definition) or by sampling noise, neither of which are expected to change our key conclusion.”
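
    The LOCO idea discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: genotypes are a small loci-by-individuals list of 0/1/2 counts, the kinship estimator is a simple ratio-of-sums form chosen for brevity (the manuscript's own estimator may differ), and the function names `kinship` and `loco_kinships` are hypothetical.

```python
def kinship(genotypes):
    """Simple ratio-of-sums kinship estimate (illustrative assumption).

    genotypes: list of loci, each a list of 0/1/2 allele counts per individual.
    Returns an n-by-n matrix as nested lists.
    """
    n = len(genotypes[0])
    k = [[0.0] * n for _ in range(n)]
    denom = 0.0
    for row in genotypes:
        p = sum(row) / (2.0 * n)  # sample allele frequency at this locus
        if p <= 0.0 or p >= 1.0:
            continue  # fixed loci carry no information
        centered = [x - 2.0 * p for x in row]
        for j in range(n):
            for m in range(n):
                k[j][m] += centered[j] * centered[m]
        denom += 4.0 * p * (1.0 - p)
    if denom == 0.0:
        raise ValueError("no polymorphic loci available")
    return [[value / denom for value in r] for r in k]


def loco_kinships(genotypes, chromosomes):
    """One kinship matrix per chromosome, each estimated without that chromosome.

    When testing a locus on chromosome c, the LMM would use result[c], so the
    test locus and loci in LD with it are not part of the random effect.
    """
    result = {}
    for chrom in sorted(set(chromosomes)):
        kept = [row for row, c in zip(genotypes, chromosomes) if c != chrom]
        result[chrom] = kinship(kept)
    return result


# Hypothetical toy data: 4 loci on 2 chromosomes, 4 individuals.
genotypes = [[0, 1, 2, 1], [2, 1, 0, 1], [0, 0, 1, 1], [1, 2, 1, 0]]
chromosomes = [1, 1, 2, 2]
loco = loco_kinships(genotypes, chromosomes)
print(sorted(loco))  # one matrix per chromosome label
```

    Each matrix still uses every other chromosome, so pairs of relatives continue to show elevated kinship, which is why LOCO keeps the LMM's ability to model family relatedness.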

    Lastly, we added the following to a different discussion paragraph: “A different benefit for including PCs was recently reported for BOLT-LMM, which does not result in greater power but rather in reduced runtime, a property that may be specific to its use of scalable algorithms such as conjugate gradient and variational Bayes (Loh et al., 2018).”

    There is no discussion/simulation of spatial/environmental effects or rare vs common PCs as raised in Zaidi et al 2020. There are some open questions here regarding relative performance the authors could have looked at. Same for LMMs with multiple GRMs corresponding to maf/ld bins and thresholded GRMs. For example, it would be helpful to know if multiple-GRM LMMs mitigate some of the problems raised in the Zaidi paper.

    We added simulations with environment effects, which are based on a two-level hierarchy of population labels, so they are spatial to the extent that these labels capture spatial relationships between populations. However, our small sample size data are not well suited to study rare variants and their structure, so that topic is out of scope. (The sample size limitation is also covered in a new discussion paragraph.) We hope to tackle this very interesting question in future work.

    We added the following paragraph to our discussion: “Another limitation of this work is ignoring rare variants, a necessity given our smaller sample sizes, where rare variant association is miscalibrated and underpowered. Using simulations mimicking the UK Biobank, recent work has found that rare variants can have a more pronounced structure than common variants, and that modeling this rare variant structure (with either PCA or LMM) may better model environment confounding, improve inflation in association studies, and ameliorate stratification in polygenic risk scores (Zaidi and Mathieson, 2020). Better modeling rare variants and their structure is a key next step in association studies.”

  2. Evaluation Summary:

    This simulation study is of interest to geneticists, especially those carrying out Genome-wide Association Studies (GWAS). It compares two major approaches for dealing with "population structure" in GWAS: Principal Component Analysis (PCA) and Linear Mixed-effects Models (LMMs). This is a subject of considerable practical importance and the study nicely reviews the theoretical underpinnings and concludes - based on the review and the extensive simulations - that there is every reason to believe LMMs to be superior (although PCA is more widely used). Although this point has been made before, it is worth making again given the ubiquity of these analyses. There are some concerns about the general validity of the claim given that the simulations fail to address several real-world problems.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 agreed to share their name with the authors.)

  3. Reviewer #1 (Public Review):

    This is a simulation study comparing the performance of two major approaches for dealing with "population structure" when carrying out Genome-wide Association Studies - Principal Component Analysis and Linear Mixed-effects Models - a subject of considerable practical importance. The author correctly notes that previous comparisons have been quite limited. In particular, any study not concluding that LMM was superior has relied on very simple models of structure.

    The paper is clearly written and beautifully reviews the theoretical underpinnings (albeit in a manner that will be difficult to penetrate without deep knowledge of several fields). The simulations are well-designed and far better than previous studies. From a theoretical point of view, the work is somewhat limited by being strongly anchored in a very classical quantitative genetics framework that is focused on allele frequencies and inbreeding coefficients, and totally ignores coalescent theory, but this is a minor quibble. The simulations are limited by utilizing ridiculously small sample sizes by the standards of modern human GWAS. And of course, they do not include all the complexities of real data.

    The main conclusion of the study is that LMM really are generally superior - as expected on theoretical grounds. However, the authors do not address whether switching to LMM really is practicable given the sample size and lack of data sharing that characterize human genetics. Nor is it clear whether the difference in performance matters in real life given that the entire framework used is an idealized one - the fact that real human data suffers from environmental confounders that are correlated with "ancestry" is not addressed, to take the most obvious example. That said, it is surely important to note that the approach routinely used by the majority of users (PCA with 10 PCs) is mostly used for historical reasons and has little theoretical or empirical justification.

  4. Reviewer #2 (Public Review):

    Yao and Ochoa present a very nice paper examining the age-old question of whether LMM or PCA is a better way to adjust for structure (population, family, admixture). The authors provide a very nice and detailed overview of the previous research addressing this question, summarizing it in a table. They find that LMMs are generally better at accounting for population structure. However, I feel there are a couple of important factors that are missing. One is the consideration of environmental structure. Another is that the relationship between PCA and LMM is usually a bit more complicated in practice than depicted here, where the devil really lies in the details. Also, I think there are a couple of key reasons why LMMs haven't been adopted as quickly as one might have expected, including case-control imbalance and cohort meta-analyses, which I feel the authors could point out. In fact, I believe LMMs have become sort of popular in recent years (e.g. Japan Biobank GWAS results).

  5. Reviewer #3 (Public Review):

    This paper examines the relative performance of linear mixed models (LMMs), principal components (PCA), and their combination (PCA-LMM) for genetic association studies in human populations. The authors claim that previous papers examining this question are inadequate and that: (i) there remains confusion on which method is best and in which context, (ii) that the metrics used in previous evaluations were insufficient, and (iii) that the simulation settings used in previous papers were not comprehensive. To fix these problems the authors perform an extensive set of simulations within several frameworks and suggest two new metrics for evaluating performance.

    Strengths:

    The simulation framework used in this paper and the extensive number of simulations provide an opportunity to examine the relative properties of the three approaches (LMM, PCA, PCA-LMM) in a variety of contexts.

    The parameters of the simulation framework are based on highly diverged populations, which is an increasingly common analysis choice that has not been examined in detail via simulation previously.

    The evaluation metrics used in this paper are AUC and a test of the uniformity of the p-value distribution under the null. This is an improvement over some previous analyses which did not examine power and relied on less sensitive tests of type I error.

    Weaknesses:

    This paper has a limited set of population frameworks just like all papers before it. The breakdown of which method is best (LMM, PCA, PCA-LMM) will be a function of the simulation framework chosen.

    The frameworks chosen for this paper are certainly not comprehensive in contemporary human genetic studies. In fact, the authors make a number of unusual choices. For example, the populations in the simulated study have extremely large Fsts. While this is also a strength, the lack of more standard study designs is a weakness. More importantly, there is no simulation of family effects, which is the basis of many of the PCA-LMM papers reported in Table 1.

    The discussion (and simulations) of LMM vs PCA, particularly LMMs with PCs as fixed effects misses the critical distinction of whether PCs are in-sample (in which case including PCs as fixed effects effectively serves as a preconditioner for the kinship matrix, speeding up iterative methods such as BOLT), or projections of individuals onto out-of-sample principal axes. There is also no discussion of LOO methods to address "proximal contamination", also quite relevant in evaluating power as a function of the number of PCs.

    There is no discussion/simulation of spatial/environmental effects or rare vs common PCs as raised in Zaidi et al 2020. There are some open questions here regarding relative performance the authors could have looked at. Same for LMMs with multiple GRMs corresponding to maf/ld bins and thresholded GRMs. For example, it would be helpful to know if multiple-GRM LMMs mitigate some of the problems raised in the Zaidi paper.

  6. Reviewer #4 (Public Review):

    Yao and Ochoa conducted a systematic examination of association studies using PCA and LMM with both simulated and empirical datasets. The overall finding is that LMM should be used and that there is generally no need to include a few PCs in LMM. While similar studies with the same comparison goal have been conducted earlier, the authors made additional effort to conduct this extensive study. Many scenarios were considered, and the results were clearly presented. This paper is interesting to researchers in statistical genetics.