A conditional gene-based association framework integrating isoform-level eQTL data reveals new susceptibility genes for schizophrenia

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This manuscript describes an improved approach (MCGA) to identify risk genes for human traits and diseases using data from genome-wide association studies. The authors demonstrate the utility of their approach by analyzing data from patients with schizophrenia, and narrow in on meaningful biological processes and potential drug repurposing candidates. This approach will facilitate gene prioritization from large genetic datasets for downstream applications such as functional studies.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)

Abstract

Linkage disequilibrium and disease-associated variants in the non-coding regions make it difficult to distinguish the truly associated genes from the redundantly associated genes for complex diseases. In this study, we proposed a new conditional gene-based framework called eDESE that leveraged an improved effective chi-squared statistic to control the type I error rates and remove the redundant associations. eDESE initially performed the association analysis by mapping variants to genes according to their physical distance. We further demonstrated that the isoform-level eQTLs could be more powerful than the gene-level eQTLs in the association analysis using a simulation study. Then the eQTL-guided strategies, that is, mapping variants to genes according to their gene/isoform-level variant-gene cis-eQTL associations, were also integrated with eDESE. We then applied eDESE to predict the potential susceptibility genes of schizophrenia and found that the potential susceptibility genes were enriched with many neuronal or synaptic signaling-related terms in the Gene Ontology knowledgebase and antipsychotics-gene interaction terms in the drug-gene interaction database (DGIdb). More importantly, seven potential susceptibility genes identified by eDESE were the target genes of multiple antipsychotics in DrugBank. Comparing the potential susceptibility genes identified by eDESE and other benchmark approaches (i.e., MAGMA and S-PrediXcan) implied that the strategy based on the isoform-level eQTLs could be an important supplement for the other two strategies (physical distance and gene-level eQTLs). We have implemented eDESE in our integrative platform KGGSEE (http://pmglab.top/kggsee/#/) and hope that eDESE can facilitate the prediction of candidate susceptibility genes and isoforms for complex diseases in a multi-tissue context.

Article activity feed

  1. Author Response:

    Reviewer #2:

    In this study, the authors develop a novel method, called MCGA, extending from their previous gene-based methods, to detect gene-trait association removing redundant signal. They further leverage expression QTL into their model to improve the resolution of gene-trait association. The overall structure is clear, and data is presented well. I am concerned about the simulation methods, and would like the authors to present some clarifications.

    1. When comparing MCGA-eQTL and MCGA-sQTL, the authors simulate a single isoform-trait association, and the simulated gene expression is averaged among isoforms, which is somewhat unfair to the MCGA-eQTL model. Hormozdiari et al. reveal that sQTL contributes little to traits after conditioning on eQTL (Hormozdiari et al., 2018, doi: 10.1038/s41588-018-0148-2). I would suggest simulating a case in which the gene-trait association is mediated by overall expression, instead of a single isoform (transcript);

    We thank Reviewer #2 for the many insightful and helpful suggestions and comments, and for pointing out this problem. We agree with the reviewer that the gene-trait association can be mediated by the overall expression instead of a single isoform. However, we think that, mathematically, the two scenarios are equivalent. We also added a scenario in which the gene-trait association is mediated by the overall expression of multiple susceptibility isoforms; its power is similar to that of the single isoform-trait association scenario (see Table 1 in the revised manuscript). In the real data analysis, we did observe that MCGA based on the isoform-level eQTLs detected more significant genes than MCGA based on the gene-level eQTLs. In addition, we note that the sQTL (splicing QTL) in Hormozdiari et al. is different from the isoform-level eQTL used in our manuscript.

    2. When comparing MCGA-eQTL and MCGA-sQTL, only power is considered. The authors should include an analysis that demonstrates how well false positives are controlled;

    We thank the reviewer for this comment and suggestion. In the revised manuscript, we report the results on false-positive control. Please refer to Essential Revisions point 2 (see lines 261-262 in the revised manuscript).

    3. When choosing a favorable exponent value c (1.432 chosen in the study), the authors found that the c value is robust to trait type, sample size or variant size, but the authors didn't explain what factors affect the choice of c. Considering the potential application of the MCGA method in other studies, the authors should explain what factors affect the c value, and provide guidance on how to choose an optimal c;

    We thank the reviewer for this comment and suggestion. Please refer to Question A and B of Essential Revisions point 3.

    A: "Motivated by the boundary of chi-square correlation, we adopted simulation studies to empirically choose c for controlling the type I error of the effective chi-square test. Besides the correlation of chi-square statistics, the choice of c for the effective chi-square test may also be affected by the approximated non-negative solutions. However, the correlation of chi-square statistics is the major factor. Our simulation showed that the derived boundary and influence trend of LD on chi-square statistics were also applicable to the effective chi-square test. In the revised manuscript, we showed that the correlation of chi-square statistics is affected by the non-centrality parameter of chi-square statistics (see lines 640-655 in the revised manuscript)."

    B: "As the optimal c for controlling the type I error of the effective chi-square test would be affected by the non-centrality parameters of the chi-square statistics, which are generally unknown in practice, we have to resort to a grid search algorithm to explore an empirically optimal c. In our previous manuscript, we mixed the method of choosing the optimal c with the introduction of the new effective chi-squared statistic. We wrote a new subsection in Materials and Methods to describe the procedure of choosing the optimal c in the revised manuscript (see lines 610-628 in the revised manuscript)."

    4. The mediation analysis result in Yao et al. estimates that 11% of trait heritability is mediated by gene expression (Yao et al., 2020, doi: 10.1038/s41588-020-0625-2), while in the simulation section of this study, 100% of trait heritability is mediated by gene expression. Simulations mimicking real scenarios should be used;

    We thank the reviewer for this comment and suggestion and apologize for the confusion here. To our knowledge, the estimation by Yao et al. was for the entire genome. Note that many contributing variants of a trait may be far away from gene regions and beyond the scope of our approach. It is possible that some genes may have larger trait heritability (>11%) mediated by gene expression. Certainly, we agree with the reviewer that it is also necessary to mimic the scenario in which the gene expression mediates part of trait heritability. In the revised manuscript, we also added the scenario that part of trait heritability is mediated by the gene expression (see Table 1 in the revised manuscript). As expected, when the majority is mediated by other factors (except the gene expression), using all variants could be more powerful than only using eQTLs (see lines 247-279 in the revised manuscript).

    5. It is important to choose a background gene set when conducting GO enrichment analysis. It is not clear what kind of genes are used as control when evaluating significance;

    We thank the reviewer for this comment and apologize for the confusion here. We used the g:Profiler, a web server for functional enrichment analysis, to perform GO enrichment analyses. The conventional GO enrichment analysis took all annotated human protein-coding genes as a background in the present study (see lines 739-743 in the revised manuscript).

    6. GTEx v8 contains samples from diverse populations, and it is crucial to handle the issue of population structure. Based on the description on https://pmg-lab-docs.readthedocs.io/en/latest/KGGSEE_doc/KGGSEE.html#id18, it seems that eQTL/isoQTL were detected ignoring population structure. The authors should explain why they applied a pipeline like that, and show that their conclusion wouldn't be affected by the choice.

    We thank the reviewer for this comment. Indeed, in the original manuscript, we estimated the gene-level and isoform-level eQTLs without considering the population structure in GTEx v8. One reason is that although GTEx v8 contains samples from diverse populations, the majority (~85%) of the subjects are Europeans. Another reason is that the GTEx consortium article (https://www.science.org/doi/abs/10.1126/science.aaz1776) pointed out that only 178 population-biased cis-eQTLs (pb-eQTLs) for 141 unique eGenes (FDR ≤ 25%) were identified across 31 tissues, which suggests that pb-eQTLs are hard to find at current sample sizes.

    In the revised manuscript, to avoid the potential population structure issues, we only used the expression profiles and genotype data of the Europeans for the eQTLs identification (see lines 788-801 in the revised manuscript).
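
    Returning to response A under point 3: the statement that the correlation of chi-square statistics depends on the non-centrality parameter can be seen in a standard bivariate-normal calculation. The block below is a textbook-style sketch, not the derivation in the revised manuscript; the symbols z_1, z_2, r, \mu_1 and \mu_2 are introduced here purely for illustration (r plays the role of the LD correlation between two variants).

```latex
\begin{aligned}
&(z_1, z_2) \sim \mathcal{N}\!\left(
  \begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix},
  \begin{pmatrix}1 & r\\ r & 1\end{pmatrix}\right),
  \qquad X_i = z_i^2,\\[4pt]
&\operatorname{Var}(X_i) = 2 + 4\mu_i^2,
  \qquad \operatorname{Cov}(X_1, X_2) = 2r^2 + 4 r \mu_1 \mu_2,\\[4pt]
&\operatorname{Corr}(X_1, X_2)
  = \frac{r^2 + 2 r \mu_1 \mu_2}{\sqrt{(1 + 2\mu_1^2)(1 + 2\mu_2^2)}}.
\end{aligned}
```

    Under the null (\mu_1 = \mu_2 = 0) the correlation equals r^2. In the simplified case \mu_1 = \mu_2 = \mu with r > 0, it is a weighted average of r^2 and r that moves toward r as \mu grows, which is consistent with an exponent between 1 and 2 such as the c = 1.432 quoted by the reviewer.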

    Reviewer #3:

    The manuscript, "MCGA: a multi-strategy conditional gene-based association framework integrating with isoform-level expression profiles reveals new susceptible and druggable candidate genes of schizophrenia", describes an approach to conduct gene-level association testing in GWAS data with integration of gene expression data. The authors have conducted comprehensive simulation studies for main modules involved in this framework, demonstrating the advantages of the MCGA strategy compared to established similar work. The method has also been applied to the analysis of schizophrenia GWAS, with several interesting discoveries. All methods proposed are implemented in the KGGSEE package, a command tool written in Java with good documentation, data resource and examples for the type of analysis proposed in this work.

    Overall, the framework is solid and the analyses performed are thorough. In particular, the simulation study and real data demonstration of the advantages of isoQTL over conventional eQTL are novel and interesting. With the user-friendly software available, I can envisage that MCGA will receive interest from the community and be adopted in many projects.

    My major reservation about the methods is the component that uses conditional analysis to identify gene-specific signals. Even though the MCGA framework is as solid as the methods it is based on, alternative methods are available for gene-level association analysis that take into consideration the contribution from multiple SNPs and the LD without having to rely on conditional analysis. For example, a fine-mapping approach such as SuSiE (https://github.com/stephenslab/susieR) uses summary statistics and LD, and can produce gene-level evidence of association in terms of Bayes factors when a gene region is analyzed. Such an approach does not have a potential type I error issue and is efficient enough to analyze multiple genes in LD with each other. Most importantly, it provides inferences directly for multiple genes accounting for LD, without having to rely on conditional analysis. Conditional analysis, as a greedy algorithm, suffers from an obvious limitation: suppose genes A and B are two causal genes in weak LD with each other. A non-causal gene C, physically in between A and B, is correlated with both A and B. Then C may have a stronger marginal signal than either A or B. A conditional analysis may identify C, and conditional on C, the association signals of the true causal genes A and B will become weaker. I am therefore not convinced that a conditional analysis such as ECS is the best approach on which MCGA should be based.

    We thank Reviewer #3 for the many insightful and helpful suggestions. We are happy that the reviewer expects our work to receive interest from the community and be adopted in many projects. To the best of our knowledge, MCGA has different application scenarios from SuSiE. The former works with summary statistics, while the latter can only perform fine-mapping analysis with individual-level genotypes and phenotypes. Besides, MCGA can also handle the three-gene case described by the reviewer. For example, if A and B are two causal genes, they may have larger selective expression scores than gene C in the phenotype-associated tissue. In the conditional analysis, A and B will then enter the conditional procedure before gene C, so gene C will not be significant when conditioning on genes A and B.

  2. Evaluation Summary:

    This manuscript describes an improved approach (MCGA) to identify risk genes for human traits and diseases using data from genome-wide association studies. The authors demonstrate the utility of their approach by analyzing data from patients with schizophrenia, and narrow in on meaningful biological processes and potential drug repurposing candidates. This approach will facilitate gene prioritization from large genetic datasets for downstream applications such as functional studies.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)

  3. Reviewer #1 (Public Review):

    Xiangyi Li and colleagues developed a conditional gene-based method, MCGA, for the identification of risk genes from GWAS summary statistics. The authors performed extensive simulations to demonstrate the statistical power of MCGA, and the superiority of isoform QTLs in identifying risk genes. The authors used publicly available GWAS data from schizophrenia to demonstrate the ability of MCGA to identify risk genes, biological pathways, and drug repurposing candidates. The results suggest MCGA is likely to benefit the field with improved gene prioritisation for fine-mapping and functional studies.

    The conclusions are supported by the data, but additional work is needed to compare MCGA with existing gene-based results. This will help the community evaluate the added benefit of MCGA for the functional interpretation of GWAS results.

    1. Please directly compare MCGA with MAGMA at all levels of the analysis (i.e. risk gene identification, biological pathway analysis, and drug repurposing analysis). It is important to demonstrate the added benefit of MCGA over existing approaches.

    2. Similarly, MCGA_eQTL should be systematically compared with S-PrediXcan (or a similar TWAS approach); such approaches currently dominate expression-based GWAS secondary analyses.

    3. Given that MCGA_eQTL and MCGA_isoQTL are compatible with GTEx, the method would benefit from the dissection of tissue-specific effects, which play an important role in complex disease aetiology.

    4. Given the preservation of gene co-expression patterns across the brain structure (in GTEx), it might be worthwhile building consensus networks using WGCNA in order to simplify the results.

    5. The drug repositioning analysis could be extended to integrate drug annotations and biological pathway information using publicly available resources. This will further prioritise drug candidates for follow-up functional studies.

  4. Reviewer #2 (Public Review):

    In this study, the authors develop a novel method, called MCGA, extending from their previous gene-based methods, to detect gene-trait association removing redundant signal. They further leverage expression QTL into their model to improve the resolution of gene-trait association. The overall structure is clear, and data is presented well. I am concerned about the simulation methods, and would like the authors to present some clarifications.

    1. When comparing MCGA-eQTL and MCGA-sQTL, the authors simulate a single isoform-trait association, and the simulated gene expression is averaged among isoforms, which is somewhat unfair to the MCGA-eQTL model. Hormozdiari et al. reveal that sQTL contributes little to traits after conditioning on eQTL (Hormozdiari et al., 2018, doi: 10.1038/s41588-018-0148-2). I would suggest simulating a case in which the gene-trait association is mediated by overall expression, instead of a single isoform (transcript);

    2. When comparing MCGA-eQTL and MCGA-sQTL, only power is considered. The authors should include an analysis that demonstrates how well false positives are controlled;

    3. When choosing a favorable exponent value c (1.432 chosen in the study), the authors found that the c value is robust to trait type, sample size or variant size, but the authors didn't explain what factors affect the choice of c. Considering the potential application of the MCGA method in other studies, the authors should explain what factors affect the c value, and provide guidance on how to choose an optimal c;

    4. The mediation analysis result in Yao et al. estimates that 11% of trait heritability is mediated by gene expression (Yao et al., 2020, doi: 10.1038/s41588-020-0625-2), while in the simulation section of this study, 100% of trait heritability is mediated by gene expression. Simulations mimicking real scenarios should be used;

    5. It is important to choose a background gene set when conducting GO enrichment analysis. It is not clear what kind of genes are used as control when evaluating significance;

    6. GTEx v8 contains samples from diverse populations, and it is crucial to handle the issue of population structure. Based on the description on https://pmg-lab-docs.readthedocs.io/en/latest/KGGSEE_doc/KGGSEE.html#id18, it seems that eQTL/isoQTL were detected ignoring population structure. The authors should explain why they applied a pipeline like that, and show that their conclusion wouldn't be affected by the choice.
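
    Point 5 above, on the background gene set, can be made concrete with a small over-representation calculation. The sketch below uses a plain hypergeometric tail probability; it is not claimed to reproduce the enrichment tool used in the manuscript, and the function name, gene counts and background sizes are all invented for illustration.

```python
from math import comb


def enrichment_p_value(overlap, background, annotated, selected):
    """P(X >= overlap) under a hypergeometric model.

    background: number of genes in the background set
    annotated:  background genes carrying the GO term
    selected:   size of the candidate gene list
    overlap:    candidate genes carrying the GO term
    """
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, i) * comb(background - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(background, selected)


# Same candidate list and same overlap, two hypothetical backgrounds.
p_genome = enrichment_p_value(overlap=8, background=20000, annotated=200, selected=100)
p_narrow = enrichment_p_value(overlap=8, background=5000, annotated=200, selected=100)

print("background of 20000 genes:", p_genome)
print("background of 5000 genes: ", p_narrow)
```

    With the larger background, the term is expected to appear about once among 100 candidates, so an overlap of 8 looks far more surprising than with the narrower background, where about four hits are expected. This is why the reported significance cannot be interpreted without knowing which background was used.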

  5. Reviewer #3 (Public Review):

    The manuscript, "MCGA: a multi-strategy conditional gene-based association framework integrating with isoform-level expression profiles reveals new susceptible and druggable candidate genes of schizophrenia", describes an approach to conduct gene-level association testing in GWAS data with integration of gene expression data. The authors have conducted comprehensive simulation studies for main modules involved in this framework, demonstrating the advantages of the MCGA strategy compared to established similar work. The method has also been applied to the analysis of schizophrenia GWAS, with several interesting discoveries. All methods proposed are implemented in the KGGSEE package, a command tool written in Java with good documentation, data resource and examples for the type of analysis proposed in this work.

    Overall, the framework is solid and the analyses performed are thorough. In particular, the simulation study and real data demonstration of the advantages of isoQTL over conventional eQTL are novel and interesting. With the user-friendly software available, I can envisage that MCGA will receive interest from the community and be adopted in many projects.

    My major reservation about the methods is the component that uses conditional analysis to identify gene-specific signals. Even though the MCGA framework is as solid as the methods it is based on, alternative methods are available for gene-level association analysis that take into consideration the contribution from multiple SNPs and the LD without having to rely on conditional analysis. For example, a fine-mapping approach such as SuSiE (https://github.com/stephenslab/susieR) uses summary statistics and LD, and can produce gene-level evidence of association in terms of Bayes factors when a gene region is analyzed. Such an approach does not have a potential type I error issue and is efficient enough to analyze multiple genes in LD with each other. Most importantly, it provides inferences directly for multiple genes accounting for LD, without having to rely on conditional analysis. Conditional analysis, as a greedy algorithm, suffers from an obvious limitation: suppose genes A and B are two causal genes in weak LD with each other. A non-causal gene C, physically in between A and B, is correlated with both A and B. Then C may have a stronger marginal signal than either A or B. A conditional analysis may identify C, and conditional on C, the association signals of the true causal genes A and B will become weaker. I am therefore not convinced that a conditional analysis such as ECS is the best approach on which MCGA should be based.
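
    The three-gene scenario described above can be worked through numerically. The following is a toy sketch under assumed values, not an implementation of MCGA, ECS or SuSiE: each gene is reduced to a single standardized signal, the correlations and effect sizes are invented, and "signal" means the covariance with the trait rather than a test statistic.

```python
from math import sqrt

# Assumed pairwise correlations between standardized gene-level signals.
# A and B are in weak LD with each other; C sits between them and is
# moderately correlated with both. All numbers are illustrative.
CORR = {("A", "B"): 0.1, ("A", "C"): 0.7, ("B", "C"): 0.7}

# Assumed causal effects on the trait: only A and B matter, C has none.
EFFECT = {"A": 1.0, "B": 1.0, "C": 0.0}


def corr(g, h):
    """Correlation between two gene-level signals (1.0 on the diagonal)."""
    if g == h:
        return 1.0
    return CORR[(g, h)] if (g, h) in CORR else CORR[(h, g)]


def marginal_signal(g):
    """Cov(g, y) for y = sum_h EFFECT[h] * h + noise, unit-variance signals."""
    return sum(EFFECT[h] * corr(g, h) for h in EFFECT)


def conditional_signal(g, k):
    """Cov(g, y | k), scaled by the residual s.d. of g after regressing out k."""
    partial_cov = marginal_signal(g) - corr(g, k) * marginal_signal(k)
    return partial_cov / sqrt(1.0 - corr(g, k) ** 2)


marginal = {g: marginal_signal(g) for g in EFFECT}
# Greedy step: take the gene with the strongest marginal signal first.
first_pick = max(marginal, key=marginal.get)
remaining = {g: conditional_signal(g, first_pick) for g in EFFECT if g != first_pick}

print("marginal signals:", marginal)
print("greedy first pick:", first_pick)
print("signals conditional on", first_pick, ":", remaining)
```

    With these assumed values the non-causal gene C has the largest marginal signal (1.4 versus 1.1 for A and B), so a purely signal-ranked greedy procedure conditions on C first, after which the signals of A and B shrink to roughly 0.17. A different ordering rule would change which gene is conditioned on first.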