Sparse dimensionality reduction approaches in Mendelian randomisation with highly correlated exposures

Curation statements for this article:
  • Curated by eLife


Multivariable Mendelian randomisation (MVMR) is an instrumental variable technique that generalises the MR framework to multiple exposures. Framed as a regression problem, it is subject to the pitfall of multicollinearity, so the bias and efficiency of MVMR estimates depend heavily on the correlation of exposures. Dimensionality reduction techniques such as principal component analysis (PCA) provide transformations of all the included variables that are effectively uncorrelated. We propose the use of sparse PCA (sPCA) algorithms that create principal components from subsets of the exposures, with the aim of providing more interpretable and reliable MR estimates. The approach consists of three steps. We first apply a sparse dimensionality reduction method and transform the variant-exposure summary statistics to principal components. We then choose a subset of the principal components based on data-driven cutoffs and estimate their strength as instruments with an adjusted F-statistic. Finally, we perform MR with these transformed exposures. This pipeline is demonstrated in a simulation study of highly correlated exposures and in an applied example using summary data from a genome-wide association study of 97 highly correlated lipid metabolites. As a positive control, we tested the causal associations of the transformed exposures on coronary heart disease (CHD). Compared to the conventional inverse-variance weighted MVMR method and a weak-instrument-robust MVMR method (MR GRAPPLE), sparse component analysis achieved a superior balance of sparsity and biologically insightful grouping of the lipid traits.
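The three-step pipeline above can be sketched on simulated summary statistics. The snippet below is a minimal illustration, not the authors' implementation: it uses scikit-learn's `SparsePCA` as a stand-in for the sPCA algorithms compared in the paper, a simple non-zero-loading check in place of the paper's data-driven cutoffs and adjusted F-statistic, and a basic inverse-variance weighted regression for the final MR step.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
n_variants, n_exposures = 150, 10

# Simulate two blocks of highly correlated exposures (e.g. lipid traits):
# each exposure's variant associations are a noisy copy of a block factor.
factors = rng.normal(size=(n_variants, 2))
betas_X = np.hstack([
    factors[:, [0]] + 0.05 * rng.normal(size=(n_variants, 5)),
    factors[:, [1]] + 0.05 * rng.normal(size=(n_variants, 5)),
])

# Outcome associations driven by the first block only, plus noise.
se_Y = np.full(n_variants, 0.05)
betas_Y = 0.3 * factors[:, 0] + se_Y * rng.normal(size=n_variants)

# Step 1: sparse PCA on the variant-exposure effect-estimate matrix.
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0)
scores = spca.fit_transform(betas_X)      # variant-PC "associations"

# Step 2: component selection -- a crude non-zero-loading check here,
# in place of the paper's data-driven cutoffs and adjusted F-statistic.
keep = np.abs(spca.components_).sum(axis=1) > 0
scores = scores[:, keep]

# Step 3: IVW multivariable MR of outcome betas on the PC scores
# (weighted least squares with inverse-variance weights).
W = 1.0 / se_Y**2
ivw_estimates = np.linalg.solve((scores.T * W) @ scores,
                                (scores.T * W) @ betas_Y)
```

With block-structured exposures like these, the sparse loadings tend to separate the two blocks, so each IVW estimate can be read as the effect of a group of traits rather than of any single exposure.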

Article activity feed

  1. eLife assessment

    This paper is of broad interest for inferring the causal effect of exposures on outcomes. It proposes an interesting idea for the identification of risk factors amongst highly correlated traits in a Mendelian randomization paradigm. The intuition for this method is clearly presented. However, critical details about implementation are missing, and its application is not sufficiently demonstrated in its current form.

  2. Reviewer #1 (Public Review):

    This is an interesting paper that presents a novel idea for the identification of risk factors amongst highly correlated traits in a Mendelian randomization paradigm - a previous investigation has considered PCA, but not sparse PCA. There are clear conceptual reasons why sparse PCA may be an improvement, as detailed in this paper. Overall, the paper does a good job in terms of motivating this work and comparing the methods. A large chunk of the motivation for the method is conceptual (rather than empirical), and it's unlikely that any method would outperform others in all circumstances, but the authors do a good job of illustrating differences and giving a clear and qualified recommendation.

  3. Reviewer #2 (Public Review):

    The main analysis performed in the paper is to determine causal associations of 118 highly correlated lipid metabolites with coronary heart disease (CHD), using summary data from two genome-wide association studies, with 148 genetic variants identified for the exposures. A standard multivariable MR analysis is problematic in this case, as the genetic variants are not simultaneously relevant for all exposures, as clearly indicated by very low values of the conditional F-statistics. In order to reduce the multicollinearity problem, the use of (sparse) principal components techniques is proposed. For the summary data used here, this entails determining the (sparse) principal components from the matrix of the estimated univariate associations between the exposures and the genetic markers. This implicitly constructs linear combinations of the exposures. In a simulation study, this approach is shown to work well for determining whether an exposure has a causal association with the outcome. A conditional F-statistic is developed to evaluate the strength of the genetic markers as instruments for the principal components. In the application, these F-statistics show that instruments are jointly relevant for the transformed exposures. For the sparse methods, the transformed exposures load on VLDL, LDL, and HDL traits, hence obtaining causal estimates for intervening on biologically meaningful pathways.

    The dimension reduction techniques and the results obtained are very interesting. As the analysis is performed on summary statistics, the univariate associations are treated as data, on which to perform the principal components analysis. This could be explained more and contrasted with a standard PCA when one has all the individual-level data available.
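    To make the contrast the reviewer asks for concrete, the toy example below (simulated individual-level data; all names illustrative) builds the matrix of univariate variant-exposure estimates and compares the leading PCA direction of that matrix with the one obtained from the individual-level exposures. The former summarises the structure of the genetic associations; the latter summarises the total phenotypic covariance; the two need not coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 2000, 4, 50          # individuals, exposures, variants

# Individual-level data: genotypes G and exposures X with genetic effects.
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
X = G @ rng.normal(0, 0.1, size=(m, p)) + rng.normal(size=(n, p))

# The "summary data": univariate variant-exposure regression slopes.
Gc = G - G.mean(axis=0)
Xc = X - X.mean(axis=0)
betas = (Gc.T @ Xc) / (Gc**2).sum(axis=0)[:, None]    # m x p

def leading_pc(M):
    """First right singular vector of the column-centred matrix."""
    M = M - M.mean(axis=0)
    return np.linalg.svd(M, full_matrices=False)[2][0]

pc_summary = leading_pc(betas)   # structure of the *genetic* associations
pc_individual = leading_pc(Xc)   # structure of total phenotypic variance
```

    In this setup the summary-level PCA is driven only by the genetic component of the exposure covariance, whereas the individual-level PCA also reflects the non-genetic noise.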

  4. Reviewer #3 (Public Review):

    To motivate the proposal, Karageorgiou et al. first identify a problem in applying current multivariable MR (MVMR) methods with many correlated exposures. I believe this problem can really be broken into two pieces. The first is that MVMR suffers from weak instrument bias. The second is that some traits may have nearly collinear genetic associations, making it hard to disentangle which trait is causal. These problems connect in that inclusion of collinear traits amplifies the problem of weak instrument bias: traits that are nearly collinear with another trait in the study will have no or very few conditionally strong instruments.
    The authors then propose a solution: Apply a dimension reduction technique (PCA or sparse PCA) to the matrix of GWAS effect estimates for the exposures. The identified new components can then be used in MVMR in place of the directly measured exposures.

    I think that the identified problem is timely and important. I also like the idea of applying dimension reduction techniques to GWAS effect estimates. However, I don't think that the manuscript in its current form achieves the goals that it has set out. Specifically, I will outline the weaknesses of the work in three categories:
    1. The causal effects measured using this method are poorly defined.
    2. The description of the method lacks important details.
    3. Applied and simulation results are unconvincing.
    I will describe each of these in more detail below.

    1. To me, the largest weakness of this paper is that it is not clear how to interpret the putatively causal effects being measured. The authors describe the method as measuring "the causal effect of the PC on outcome" but it is not obvious what this means.

    One possible implication of this statement is that the PC is a real biological variable (say some hidden regulator) that can be directly intervened on. If this is the intention it should be discussed. However, this situation would imply that there is one correct factorization and there is no guarantee that PCs (or sparse PCs) come close to capturing that.

    The counterfactual implied by estimating the effects of PCs in MVMR is that it is possible to intervene on and alter one PC while holding all other PCs constant.
    In the introduction, the authors note (and I agree) that one weakness of MR applied to correlated traits is that "MVMR models investigate causal effects for each individual exposure, under the assumption that it is possible to intervene and change each one whilst holding the others fixed." However, it is not obvious that altering one PC while holding the others constant is more reasonable.

    2. This section combines a few items that I found unclear in the methods section. The most critical one is the lack of specification on how to select instruments.
    For the lipids application, the authors state that instruments were selected from the GLGC results, however, these only include instruments for LDL, HDL, and TG, so 1) it would not be possible to include variants that were independently instruments for one of the component traits alone and 2) there would be no instruments for the amino acids. There is no discussion of how instruments should be selected in general.
    This choice could also have a dramatic impact on the PCs estimated. The first PC is optimized to explain the largest amount of variance of the input data, which, in this case, is GWAS effect estimates. This means that the number of instruments for each trait included will drive the resulting PCs. It also means that differences in scaling across traits could influence the resulting PCs.
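    The sensitivity to scaling can be demonstrated directly. In the toy example below (random effect estimates, purely illustrative), multiplying one trait's estimates by 10, as a change of units would, makes the first component load almost entirely on that trait.

```python
import numpy as np

rng = np.random.default_rng(2)
betas = rng.normal(size=(100, 5))   # toy variant-exposure effect estimates

def leading_loadings(B):
    """Absolute loadings of the first principal component."""
    B = B - B.mean(axis=0)
    return np.abs(np.linalg.svd(B, full_matrices=False)[2][0])

before = leading_loadings(betas)

rescaled = betas.copy()
rescaled[:, 0] *= 10.0              # e.g. the trait reported in other units
after = leading_loadings(rescaled)  # first PC now loads almost only on trait 0
```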

    The other detail that is either missing or which I missed is what is used as the variant-PC association in the MVMR analysis. Specifically, is it the PC loadings or is it a different value? Based on the computation of the F-statistic I suspect the former but it is not clear. If this is the case, what is the effect of using loadings that have been shrunk via one of the sparse methods? It would be nice to see a demonstration of the bias and variance of the resulting method, though it is not clear to me what the "truth" would be.
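    As one purely hypothetical reading of what the variant-PC association might be (this is the reviewer's open question, not the paper's documented choice), the summary estimates could be projected onto the loadings and a naive per-component F-statistic formed from the projected values. The sketch below does exactly that, with delta-method standard errors that assume independent per-trait estimates; the paper's adjusted F-statistic would differ.

```python
import numpy as np

rng = np.random.default_rng(4)
m, p, k = 120, 8, 2
betas = rng.normal(size=(m, p))     # variant-exposure effect estimates
ses = np.full((m, p), 0.1)          # their standard errors
loadings = rng.normal(size=(k, p))  # stand-in for (sparse) PC loadings

# Project the summary estimates onto the loadings...
beta_pc = betas @ loadings.T        # m x k variant-PC "associations"
# ...with delta-method SEs, assuming independent per-trait estimates.
se_pc = np.sqrt(ses**2 @ loadings.T**2)

# Naive per-component mean F-statistic (not the paper's adjusted version).
f_stat = np.mean((beta_pc / se_pc) ** 2, axis=0)
```

    Under this reading, shrinking loadings to zero via a sparse method simply drops the corresponding traits from each projected association, which is why the effect of that shrinkage on bias and variance is worth demonstrating.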

    3. In the lipids application, the fact that M.LDL.PL changes sign in the MVMR analysis is offered as evidence of multicollinearity. I would generally associate multicollinearity with large variance rather than bias. Perhaps the authors could offer some more insight into how multicollinearity would cause this observation.
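    The reviewer's distinction can be illustrated with a small simulation: with two nearly collinear regressors, ordinary least squares remains unbiased on average, but the sampling variance is inflated enough that individual estimates frequently flip sign. This is one way a sign change like the one observed could arise without systematic bias.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 500
sign_flips = []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1
    y = x1 + rng.normal(size=n)           # true effects: 1.0 for x1, 0.0 for x2
    X = np.column_stack([x1, x2])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    sign_flips.append(coef[0] < 0)

# OLS is unbiased here, but the inflated variance makes the estimated
# effect of x1 come out negative in a sizeable fraction of replications.
sign_flip_rate = np.mean(sign_flips)
```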
    A minor point of confusion: I was unable to interpret this pair of sentences "Although the method did not identify any of the exposures as significant at Bonferroni-adjusted significance level, the estimate for M.LDL.PL is still negative but closer to zero and not statistically significant. The only trait that retains statistical significance is ApoB." The first sentence says that none of the exposures were significant, while the second sentence says that ApoB is significant. The GRAPPLE results don't seem clearly bad; indeed, if only ApoB is significant, wouldn't we conclude that of the 118 exposures, only ApoB is causal for heart disease? It would help to discuss more how the conclusions from the PC-based MVMR analysis compare to the conclusions from GRAPPLE.

    It is a bit hard to interpret Table 4. I wasn't able to fully determine what "VLDL, LDL significance in MR" means here. From the text, it seems to mean that any PC with a non-zero loading on VLDL or LDL traits was significant; however, this seems like a trivial criterion for the PCA method, since all PCs will be dense. This would mean this indicator only tells us whether any PCs were found to "cause" heart disease.

    In simulations, I may be missing something about the definition of a true and false positive here. I think this is similar to my confusion in the previous paragraph. Wouldn't the true and false positive rates as computed using these metrics depend strongly on the sparsity of the components? It is not clear to me what ideal behavior would be here. However, it seems from the description that if the truth was as in Fig 7 and two methods each yielded one dense component that was found to be causal for Y, these two methods would get the same "score" for true positive and false positive rate regardless of the distribution of factor loadings. One method could produce a factor that loaded equally on all exposures while the other produced a factor that loaded mostly on X1 and X2 but this difference would not be captured in the results.