Heritability enrichment in context-specific regulatory networks improves phenotype-relevant tissue identification

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    This manuscript is of interest to scientists studying the genetics of complex human diseases. The approach introduced here is potentially useful for the identification of tissues linked to complex disease heritability. Currently, the key claims of the paper are not entirely supported by the data. The claims may become well supported once the authors improve statistical rigor and perform a more comprehensive comparison with other methods.

Abstract

Systems genetics holds the promise to decipher complex traits by interpreting their associated SNPs through gene regulatory networks derived from comprehensive multi-omics data of cell types, tissues, and organs. Here, we propose SpecVar to integrate paired chromatin accessibility and gene expression data into a context-specific regulatory network atlas and regulatory categories, conduct heritability enrichment analysis with genome-wide association studies (GWAS) summary statistics, identify relevant tissues, and estimate relevance correlation to depict common genetic factors acting in the shared regulatory networks between traits. Our method improves power upon existing approaches by associating SNPs with context-specific regulatory elements to assess heritability enrichments and by explicitly prioritizing gene regulations underlying relevant tissues. Ablation studies, independent data validation, and comparison experiments with existing methods on GWAS of six phenotypes show that SpecVar can improve heritability enrichment, accurately detect relevant tissues, and reveal causal regulations. Furthermore, SpecVar correlates the relevance patterns for pairs of phenotypes and better reveals shared SNP-associated regulations of phenotypes than existing methods. Studying GWAS of 206 phenotypes in UK Biobank demonstrates that SpecVar leverages the context-specific regulatory network atlas to prioritize phenotypes’ relevant tissues and shared heritability for biological and therapeutic insights. SpecVar provides a powerful way to interpret SNPs via context-specific regulatory networks and is available at https://github.com/AMSSwanglab/SpecVar, copy archived at swh:1:rev:cf27438d3f8245c34c357ec5f077528e6befe829.

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    The authors define regulatory networks across 77 tissue contexts using software they have previously published (PECA2, Duren et al. 2020). Each regulatory network is a set of nodes (transcription factors (TF), target genes (TG), and regulatory elements (RE)) and edges (regulatory scores connecting the nodes). For each context, the authors define context-specific REs, as those that do not overlap REs from any of the other 76 contexts, and context-specific regulatory networks as the collection of TFs, TGs, and REs connected to at least one context-specific RE. This approach essentially creates annotations that are aggregated across genes, elements, and specific contexts. For each tissue, the authors use linkage disequilibrium score regression (LDSC) to calculate enrichment for complex trait heritability within the set of all REs from the corresponding context-specific regulatory network. Heritability enrichments in context-specific regulatory network REs are compared with heritability enrichments in regions defined using other approaches.

    We thank the reviewer for the pertinent and precise summary of our paper.

    Reviewer #2 (Public Review):

    In this manuscript the authors develop a method, SpecVar, to perform heritability estimation from regulatory networks derived from gene expression and chromatin accessibility data. They apply this approach to public datasets available in ENCODE and Roadmap Epigenomics consortia as well as GWAS phenotype associations in UK Biobank. It promises to be a powerful method to interpret mechanisms from genetic associations. Below are some strengths and weaknesses of the paper.

    Strengths

    • The method performs heritability enrichment on two major genomic data types: gene expression and chromatin accessibility.
    • This method leverages gene regulatory networks to perform the heritability estimation, which may better capture complex disease architecture.
    • The authors perform an extensive comparison to other LDSC-based approaches using different tissue datasets.

    Weaknesses

    (1) This approach may represent a modest advance over existing LDSC methods when looking at other complex traits.

    (2) The authors only compare with LDSC using different functional annotations as input, which may not be appropriate. A more broad comparison with other heritability methods would be helpful.

    (3) The method seems to be applied to "paired" data, but this is still bulk profiles not paired single-cell RNA/ATAC data.

    The authors successfully applied a regulatory network approach to improving the heritability estimation of complex traits by using both gene expression and chromatin accessibility data. While the results could be further strengthened by comparing them to other network and non-network-based methods, it provides important insight into a few traits beyond the standard LDSC model with different functional annotations.

    Given that this method is based on the widely used LDSC approach it should be broadly applied in the field. However, the authors should consider adapting this to single-cell data as well as admixed human population genetic data.

    We thank the reviewer for the positive comments on our work, in particular for noting that SpecVar is a powerful method to interpret mechanisms from genetic associations. We appreciate that the reviewer’s “Strengths” section captures our major contributions: building an atlas of regulatory networks by integrating paired gene expression and chromatin accessibility data, leveraging regulatory networks to perform heritability enrichment, identifying relevant tissues, and estimating relevance correlation. We also thank the reviewer for pointing out weaknesses, which helped us strengthen our results. To address the comments, we (1) performed ablation studies and added more description to clarify the novelty of our method; (2) conducted an extensive comparison with another network-based method, CoCoNet, and a non-network-based method, RolyPoly; and (3) discussed the promising directions of identifying relevant contexts at the cell-type level by leveraging single-cell multi-omics profiles and of applying the method to admixed populations.

    Reviewer #3 (Public Review):

    Identifying the critical tissues and cell types in which genetic variants exert their effects on complex traits is an important question that has attracted increasing attention. Feng et al propose a new method, SpecVar, to first construct context-specific regulatory networks by integrating tissue-specific chromatin states and gene expression data, and then run stratified LD score regression (LDSC) to test if the constructed regulatory network in tissue is significantly associated with the trait, measured by a statistic called trait relevance score in this study. They apply their method to 6 traits for which there exists prior evidence on the most relevant tissues in the literature, and then further apply to 206 traits in the UK Biobank. They find that compared to LDSC using other sources of information to define context-specific annotations, their method can "improve heritability enrichment", "accurately detect relevant tissues", helps to "interpret SNPs" identified from GWAS, and "better reveals shared heritability and regulations of phenotypes" between traits.

    We thank the reviewer for the summary and appreciation of our efforts to address the important question: identifying the critical tissues and cell types in which genetic variants exert their effects on complex traits.

    However, I think it requires more work to understand where exactly the benefits come from and the statistical properties of their proposed test statistic (e.g., how to perform hypothesis tests with their relevance score and whether the false positive rate is under control). In addition, it's not clear to me what they can conclude about the shared heritability (which means genetic correlation) by comparing their relevance score correlation across tissues to the phenotypic correlation between traits.

    We thank the reviewer for the advice to strengthen the statistical rigor of SpecVar. In the revision, we added significance tests for heritability enrichment and for our proposed R score. We also clarified that SpecVar can use common relevant contexts and shared SNP-associated regulatory networks as a potential explanation for the correlation between traits.

    They show that SpecVar gives much higher heritability enrichment than the other methods in the trait-relevant tissues (Fig. 2). The fold enrichment from SpecVar is extremely high, e.g., more than 600x in the right lobe of the liver for LDL. First, I think a standard error should be given so that the significance of the differences can be assessed. Second, it is very rare (hence suspicious) to observe such a huge enrichment. Since SpecVar is based on LDSC, the same methodology that other methods in comparison depend on, the differences to the other methods must come from the set of SNPs annotated for each tissue. I think it is important to understand the difference between the SpecVar annotated SNPs and those from other methods. For example, is the extra heritability enrichment mainly from the SpecVar-specific annotation or from the intersection narrowed down by SpecVar?

    The reviewer has pinpointed a question about one important advantage of our method: improved heritability enrichment. We addressed this question first by providing standard errors, p values, and q values of heritability enrichment, and second by conducting an ablation analysis to study the source of the extra heritability enrichment. This question greatly helped us clarify the main contribution of our method.
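
    The response above mentions q values without saying how they are computed. As a minimal sketch, assuming they are Benjamini-Hochberg adjusted p values (an assumption, not something the response states), the adjustment over a set of tissue-level p values could look like this. The function name and the example p values are hypothetical.

```python
from typing import List, Sequence


def bh_qvalues(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p values (assumed definition of 'q value')."""
    m = len(pvalues)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical p values for four tissues tested against one trait.
example_q = bh_qvalues([0.01, 0.04, 0.03, 0.20])
```

    Adjusting across all tested tissues for a trait is one reasonable choice; whether the adjustment should also span traits depends on the analysis design.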

    They propose to use the relevance score (R score) to prioritise trait-relevant tissues. In Fig. 3, they show tissue-trait pairs with the highest R scores, and from there they prioritise several tissues for each trait (Table 1). I can see that some tissues have an outstanding R score; however, it is not clear to me where they draw the line to declare a positive result. The threshold doesn't even seem to be consistent across traits. For example, for LDL, only the right lobe of the liver is identified although other tissues have R scores greater than 100, whereas, for EA, Ammon's horn and adrenal gland are identified although their R scores are apparently smaller than 100. It seems to me they use some subjective criteria to pick the results. It leads to a serious question on how to apply their R score in a hypothesis test: how to measure the uncertainty of their R score? What significance threshold should be used? Is the false positive rate under control? Without knowing these statistical properties, readers won't be able to use this method with confidence in their own research.

    We thank the reviewer for raising the question about hypothesis testing with the R score. In our revision, we used the block jackknife strategy to estimate standard errors, p values, and q values. We added the new results to the main text, and they greatly enhance the statistical rigor of our method.
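
    To make the block jackknife idea concrete, here is a minimal sketch of a delete-one-block jackknife standard error for a generic statistic, followed by a two-sided p value under a normal approximation. It is an illustration only: the helper names, the contiguous equal-size blocking, and the normal approximation are assumptions, and the response does not describe how SpecVar applies the procedure to the R score.

```python
import math
from typing import Callable, List, Sequence


def block_jackknife_se(
    data: Sequence[float],
    n_blocks: int,
    estimator: Callable[[Sequence[float]], float],
) -> float:
    """Delete-one-block jackknife standard error of `estimator` on `data`."""
    n = len(data)
    if not 2 <= n_blocks <= n:
        raise ValueError("n_blocks must be between 2 and len(data)")
    # Contiguous, near-equal-size blocks (an assumed blocking scheme).
    bounds = [(i * n) // n_blocks for i in range(n_blocks + 1)]
    leave_one_out: List[float] = []
    for j in range(n_blocks):
        kept = list(data[: bounds[j]]) + list(data[bounds[j + 1]:])
        leave_one_out.append(estimator(kept))
    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (t - mean_loo) ** 2 for t in leave_one_out
    )
    return math.sqrt(variance)


def two_sided_p(estimate: float, se: float) -> float:
    """Two-sided p value for estimate / se under a standard normal (assumed)."""
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2.0))


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


# Toy data: with one observation per block, the jackknife SE of the mean
# equals the usual sample standard deviation divided by sqrt(n).
se_of_mean = block_jackknife_se([1.0, 2.0, 3.0, 4.0, 5.0], 5, mean)
```

    Blocking over contiguous stretches of the genome is what lets the jackknife respect correlation between nearby SNPs; the toy data here only checks the arithmetic.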

    Another related comment to the above is to investigate false positive associations, they should show the results for all tissues tested to see if SpecVar tends to give higher R scores even in tissues that are not relevant to the trait. It would also be useful to include some negative control traits, such as height for brain tissues.

    We agree that negative controls are important. The six phenotypes in our manuscript serve as negative controls for one another. For example, LDL is relevant to liver tissue but not to brain tissue, whereas educational attainment is relevant to brain tissue but not to liver tissue.

    Fig. 3 shows that tissues prioritised by LDSC-SAP and LDSC-SEG seem to make less sense than those from SpecVar. However, some of the results are not consistent with the LDSC-SEG paper (Finucane et al 2018). For example, LDL was significantly associated with the liver in Finucane et al (Fig. 2), but not in this study. How to explain the difference? (Question 3)

    We checked the results in Figure 3 and found that, even though the liver was not ranked among the top five tissues, it has a significant P-value for LDL in our implementation. There are indeed some differences in heritability enrichment and P-value between the LDSC-SEG paper and our implementation. These differences arise from the different sets of tissues used in the two applications (77 tissues in our paper and 53 tissues in the LDSC-SEG paper).

    The authors highlight an example where SpecVar facilitates the interpretation of GWAS signals near FOXC2. They find GWAS-significant SNPs located in a CNCC-specific RE downstream of FOXC2 and reason these SNPs affect brain shape by regulating the expression of FOXC2. I think more work can be done to consolidate the conclusion. For example, if the GWAS signals are colocalised with the eQTL for FOXC2 in the brain. Also, note that the top GWAS signal is actually on the left of the CNCC-specific RE (Fig. 4b). A deeper investigation should be warranted.

    We agree that more work should be done to consolidate the regulation of FOXC2. In our revision, we used the HiChIP loop in the brain to support the SNP-associated regulation of FOXC2. We also thank the reviewer for suggesting eQTL colocalization; we conducted eQTL colocalization analysis on the SNP-associated regulations revealed by our method to show that it can facilitate the fine mapping of GWAS signals. Lastly, brain shape is a complex trait and may be relevant to multiple tissues. Hence it is reasonable to suspect that the top GWAS signal may be active in the regulatory elements of other relevant tissues.

    They show that SpecVar's relevance score correlation across tissues can better approximate phenotypic correlation between traits. However, the estimation of the phenotypic correlation between traits is neither very interesting nor a thing difficult to do (it can be directly estimated from GWAS summary statistics). A more interesting question is to which extent the observed phenotypic correlation is due to common genetic factors acting in the shared tissues/cell types/pathways/regulatory networks between traits. Note that in their Abstract, they use words "depict shared heritability and regulations" but I don't seem to see results supporting that.

    We are sorry that we did not make clear how SpecVar can “depict shared heritability and regulations”. We added more results and one example from the UKBB application to show that SpecVar can use common relevant contexts and shared SNP-associated regulatory networks as a potential explanation for the correlation between traits.

    Line 396-402: "For example, ... heritability could select most relevant tissues ... but failed to get correct tissues for other phenotypes ... P-value could obtain correct tissues for CP ... but failed to get correct tissues for ... SpecVar could prioritize correct relevant tissues for all the six phenotypes." Honestly, I find it hard to judge which tissues are "correct" or "incorrect" for a trait in real life. It would be more straightforward to compare methods using simulation where we know which tissues are causal.

    We thank the reviewer for pinpointing the improper use of “correct”. It is difficult to find phenotypes with gold-standard relevant tissues, so we used six relatively well-studied phenotypes with prior knowledge of possible relevant tissues in our paper. We revised the statements that used “correct” in our revision.

  2. eLife assessment

    This manuscript is of interest to scientists studying the genetics of complex human diseases. The approach introduced here is potentially useful for the identification of tissues linked to complex disease heritability. Currently, the key claims of the paper are not entirely supported by the data. The claims may become well supported once the authors improve statistical rigor and perform a more comprehensive comparison with other methods.

  3. Reviewer #1 (Public Review):

    The authors define regulatory networks across 77 tissue contexts using software they have previously published (PECA2, Duren et al. 2020). Each regulatory network is a set of nodes (transcription factors (TF), target genes (TG), and regulatory elements (RE)) and edges (regulatory scores connecting the nodes). For each context, the authors define context-specific REs, as those that do not overlap REs from any of the other 76 contexts, and context-specific regulatory networks as the collection of TFs, TGs, and REs connected to at least one context-specific RE. This approach essentially creates annotations that are aggregated across genes, elements, and specific contexts. For each tissue, the authors use linkage disequilibrium score regression (LDSC) to calculate enrichment for complex trait heritability within the set of all REs from the corresponding context-specific regulatory network. Heritability enrichments in context-specific regulatory network REs are compared with heritability enrichments in regions defined using other approaches.
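
    The definition of a context-specific RE summarized above (an RE that overlaps no RE from any other context) can be sketched as a simple interval filter. This is an illustration under assumptions, not the SpecVar or PECA2 implementation: the interval representation, the half-open coordinates, and the toy contexts are all hypothetical.

```python
from typing import Dict, List, Tuple

# (chromosome, start, end) with half-open coordinates; an assumed representation.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def context_specific_res(
    res_by_context: Dict[str, List[Interval]], context: str
) -> List[Interval]:
    """REs of `context` that overlap no RE from any other context."""
    others = [
        elem
        for name, elems in res_by_context.items()
        if name != context
        for elem in elems
    ]
    return [
        elem
        for elem in res_by_context[context]
        if not any(overlaps(elem, other) for other in others)
    ]


# Toy example with three hypothetical contexts.
toy = {
    "liver": [("chr1", 100, 200), ("chr1", 500, 600)],
    "brain": [("chr1", 150, 250)],
    "heart": [("chr2", 500, 600)],
}
liver_specific = context_specific_res(toy, "liver")
```

    The pairwise scan is quadratic and only meant to show the logic; with 77 contexts and genome-scale peak sets, a sorted sweep or an interval tree would be the practical choice.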

  4. Reviewer #2 (Public Review):

    In this manuscript the authors develop a method, SpecVar, to perform heritability estimation from regulatory networks derived from gene expression and chromatin accessibility data. They apply this approach to public datasets available in ENCODE and Roadmap Epigenomics consortia as well as GWAS phenotype associations in UK Biobank. It promises to be a powerful method to interpret mechanisms from genetic associations. Below are some strengths and weaknesses of the paper.

    Strengths

    - The method performs heritability enrichment on two major genomic data types: gene expression and chromatin accessibility.
    - This method leverages gene regulatory networks to perform the heritability estimation, which may better capture complex disease architecture.
    - The authors perform an extensive comparison to other LDSC-based approaches using different tissue datasets.

    Weaknesses
    - This approach may represent a modest advance over existing LDSC methods when looking at other complex traits.
    - The authors only compare with LDSC using different functional annotations as input, which may not be appropriate. A more broad comparison with other heritability methods would be helpful.
    - The method seems to be applied to "paired" data, but this is still bulk profiles not paired single-cell RNA/ATAC data.

    The authors successfully applied a regulatory network approach to improving the heritability estimation of complex traits by using both gene expression and chromatin accessibility data. While the results could be further strengthened by comparing them to other network and non-network-based methods, it provides important insight into a few traits beyond the standard LDSC model with different functional annotations.

    Given that this method is based on the widely used LDSC approach it should be broadly applied in the field. However, the authors should consider adapting this to single-cell data as well as admixed human population genetic data.

  5. Reviewer #3 (Public Review):

    Identifying the critical tissues and cell types in which genetic variants exert their effects on complex traits is an important question that has attracted increasing attention. Feng et al propose a new method, SpecVar, to first construct context-specific regulatory networks by integrating tissue-specific chromatin states and gene expression data, and then run stratified LD score regression (LDSC) to test if the constructed regulatory network in tissue is significantly associated with the trait, measured by a statistic called trait relevance score in this study. They apply their method to 6 traits for which there exists prior evidence on the most relevant tissues in the literature, and then further apply to 206 traits in the UK Biobank. They find that compared to LDSC using other sources of information to define context-specific annotations, their method can "improve heritability enrichment", "accurately detect relevant tissues", helps to "interpret SNPs" identified from GWAS, and "better reveals shared heritability and regulations of phenotypes" between traits. However, I think it requires more work to understand where exactly the benefits come from and the statistical properties of their proposed test statistic (e.g., how to perform hypothesis tests with their relevance score and whether the false positive rate is under control). In addition, it's not clear to me what they can conclude about the shared heritability (which means genetic correlation) by comparing their relevance score correlation across tissues to the phenotypic correlation between traits.

    They show that SpecVar gives much higher heritability enrichment than the other methods in the trait-relevant tissues (Fig. 2). The fold enrichment from SpecVar is extremely high, e.g., more than 600x in the right lobe of the liver for LDL. First, I think a standard error should be given so that the significance of the differences can be assessed. Second, it is very rare (hence suspicious) to observe such a huge enrichment. Since SpecVar is based on LDSC, the same methodology that other methods in comparison depend on, the differences to the other methods must come from the set of SNPs annotated for each tissue. I think it is important to understand the difference between the SpecVar annotated SNPs and those from other methods. For example, is the extra heritability enrichment mainly from the SpecVar-specific annotation or from the intersection narrowed down by SpecVar?
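
    For readers less familiar with the quantity under discussion, stratified LDSC reports enrichment as the proportion of heritability in an annotation divided by the proportion of SNPs in that annotation. The sketch below works through that arithmetic with hypothetical numbers (they are not taken from the paper) to show how a very small annotation can produce a ratio of several hundred.

```python
def fold_enrichment(
    h2_annot: float, h2_total: float, snps_annot: int, snps_total: int
) -> float:
    """Proportion of heritability divided by proportion of SNPs."""
    prop_h2 = h2_annot / h2_total
    prop_snps = snps_annot / snps_total
    return prop_h2 / prop_snps


# Hypothetical values: an annotation holding 0.05% of SNPs that explains
# 30% of the SNP heritability gives roughly a 600-fold enrichment.
example = fold_enrichment(h2_annot=0.03, h2_total=0.10,
                          snps_annot=500, snps_total=1_000_000)
```

    Because the denominator shrinks with the annotation, narrowing an annotation can inflate the ratio even when the heritability it captures barely changes, which is why the reviewer asks for standard errors alongside the point estimates.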

    They propose to use the relevance score (R score) to prioritise trait-relevant tissues. In Fig. 3, they show tissue-trait pairs with the highest R scores, and from there they prioritise several tissues for each trait (Table 1). I can see that some tissues have an outstanding R score; however, it is not clear to me where they draw the line to declare a positive result. The threshold doesn't even seem to be consistent across traits. For example, for LDL, only the right lobe of the liver is identified although other tissues have R scores greater than 100, whereas, for EA, Ammon's horn and adrenal gland are identified although their R scores are apparently smaller than 100. It seems to me they use some subjective criteria to pick the results. It leads to a serious question on how to apply their R score in a hypothesis test: how to measure the uncertainty of their R score? What significance threshold should be used? Is the false positive rate under control? Without knowing these statistical properties, readers won't be able to use this method with confidence in their own research.

    Another related comment to the above is to investigate false positive associations, they should show the results for all tissues tested to see if SpecVar tends to give higher R scores even in tissues that are not relevant to the trait. It would also be useful to include some negative control traits, such as height for brain tissues.

    Fig. 3 shows that tissues prioritised by LDSC-SAP and LDSC-SEG seem to make less sense than those from SpecVar. However, some of the results are not consistent with the LDSC-SEG paper (Finucane et al 2018). For example, LDL was significantly associated with the liver in Finucane et al (Fig. 2), but not in this study. How to explain the difference?

    The authors highlight an example where SpecVar facilitates the interpretation of GWAS signals near FOXC2. They find GWAS-significant SNPs located in a CNCC-specific RE downstream of FOXC2 and reason these SNPs affect brain shape by regulating the expression of FOXC2. I think more work can be done to consolidate the conclusion. For example, if the GWAS signals are colocalised with the eQTL for FOXC2 in the brain. Also, note that the top GWAS signal is actually on the left of the CNCC-specific RE (Fig. 4b). A deeper investigation should be warranted.

    They show that SpecVar's relevance score correlation across tissues can better approximate phenotypic correlation between traits. However, the estimation of the phenotypic correlation between traits is neither very interesting nor a thing difficult to do (it can be directly estimated from GWAS summary statistics). A more interesting question is to which extent the observed phenotypic correlation is due to common genetic factors acting in the shared tissues/cell types/pathways/regulatory networks between traits. Note that in their Abstract, they use words "depict shared heritability and regulations" but I don't seem to see results supporting that.
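
    The "relevance score correlation across tissues" discussed above compares two traits through their per-tissue R score profiles. A minimal sketch follows, assuming a Pearson correlation over tissues listed in the same order; the review does not state which correlation coefficient is used, and the R score values below are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical R scores for two traits over the same four tissues.
trait_a = [120.0, 5.0, 3.0, 40.0]
trait_b = [90.0, 8.0, 2.0, 55.0]
relevance_correlation = pearson(trait_a, trait_b)
```

    A high value only says that the two traits rank tissues similarly; as the reviewer notes, it does not by itself show that the same variants or regulations drive both traits.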

    Line 396-402: "For example, ... heritability could select most relevant tissues ... but failed to get correct tissues for other phenotypes ... P-value could obtain correct tissues for CP ... but failed to get correct tissues for ... SpecVar could prioritize correct relevant tissues for all the six phenotypes." Honestly, I find it hard to judge which tissues are "correct" or "incorrect" for a trait in real life. It would be more straightforward to compare methods using simulation where we know which tissues are causal.