DNA-binding factor footprints and enhancer RNAs identify functional non-coding genetic variants

Abstract

Genome-wide association studies (GWAS) have revealed a multitude of candidate genetic variants affecting the risk of developing complex traits and diseases. However, these highlighted regions are typically in the non-coding genome, and uncovering the functional causative single nucleotide variants (SNVs) is challenging. Prioritisation of variants is commonly based on functional genomic annotation with markers of active regulatory elements, but current approaches still poorly predict functional variants. To address this, we systematically analyse six markers of active regulatory elements for their ability to identify functional variants. We benchmark against molecular quantitative trait loci (molQTL) from assays of regulatory element activity that identify allelic effects on DNA-binding factor occupancy, reporter assay expression, and chromatin accessibility. We identify the combination of DNase footprints and divergent enhancer RNA as markers for functional variants. This signature provides high precision, trading-off low recall, thus substantially reducing candidate variant sets to prioritise variants for functional validation. We present this as a framework called FINDER – Functional SNV IdeNtification using DNase footprints and Enhancer RNA, and demonstrate its utility to prioritise variants using leukocyte count trait and analyse variants in linkage disequilibrium with a lead variant to predict a functional variant in asthma. Our findings have implications for prioritising variants from GWAS, in development of predictive scoring algorithms, and for functionally informed fine mapping approaches.

Article activity feed

  1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

    Learn more at Review Commons


    Reply to the reviewers

    'The authors do not wish to provide a response at this time.'

  2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #2

    Evidence, reproducibility and clarity

    Biddie et al. address the important question of what genomic annotations are relevant to the fine-mapping of trait-associated variation. They assessed data identifying DNase I hypersensitive sites, footprints, H3K27ac and other ChIP-seq peaks, ATAC-seq peaks, and eRNA locations using a benchmark set of allelically imbalanced DNase I hypersensitive sites and ChIP-seq peaks, and MPRA data. They find that a combination of DNase footprints and eRNA locations gives high enrichment for functional variants (albeit with low sensitivity) and demonstrate their FINDER on 53 traits from the GWAS catalog.

    I have some questions to address before publication, all of which relate to clarifying the description in the manuscript.

    1. I found this line in the abstract unclear: "This signature provides high precision, trading-off low recall".
    2. Figure 1C is missing a y-axis label and a colour legend.
    3. The authors note a high genomic coverage "by ATAC-seq (75.2%), [and] H3K27ac (61.5%)", which they attribute in part to "too low a threshold used in peak calling". However, both the benchmark datasets and the genomic predictors are utilized without consideration of the effect of thresholding. To what extent are DNase footprints and eRNA specifically informative, vs. representing datasets processed with highly selective cutoffs?
    4. The benchmark datasets are described as molQTL, bQTL, and caQTL. "QTL" implies a regression of a trait (e.g. accessibility) on a genotype. This is not accurate here: the Vierstra and Abramov datasets investigate allelic imbalance (a related but orthogonal approach), while van Arensbergen is an MPRA.
    5. QTLbase is cited alongside the Vierstra paper. I had some trouble searching in QTLbase but did not find the Vierstra dataset. It should be clarified whether QTLbase was used to download the Vierstra results, or to supplement it with other studies.
    6. Fig. 4 discusses the effect of variant centrality in a DHS for prioritization, but it isn't included in the FINDER schematic in Fig. 7. How come it wasn't employed in FINDER?
    7. The Discussion notes "Firstly, the identification of DNase footprints may be related to residency time of TFs on DNA, where rapidly exchanging factors impart poor footprints (Sung et al., 2014). Variants associated with altering binding of dynamic factors may therefore be missed. To overcome this, detection of footprints could be improved by enzymatic digestion bias correction". I don't see how enzymatic digestion bias is related to sensitivity to detect rapidly exchanging TFs.
    8. It looks like the eRNA data were obtained from GRO-seq or PRO-seq data. It would be helpful to note key details like this directly rather than leaving it to the reader to try to figure out what is in the PINTS database.
    9. The github link https://github.com/sbiddie/FINDER gives a 404 not found error. Is FINDER an actual tool implementation, or more generally describing the approach?
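
The trade-off raised in questions 1 and 3 (high precision at the cost of low recall, and its dependence on how selectively an annotation is thresholded) can be illustrated with a minimal sketch. This is not the authors' FINDER implementation: the function name `overlap_stats`, the integer variant IDs, and the two toy annotations are hypothetical, and the +0.5 continuity correction on the odds ratio is an assumption made here so that the ratio stays finite when a cell is empty.

```python
from dataclasses import dataclass


@dataclass
class OverlapStats:
    precision: float
    recall: float
    odds_ratio: float


def overlap_stats(annotated: set, functional: set, tested: set) -> OverlapStats:
    """Score one annotation against a benchmark set of functional variants."""
    # Restrict both sets to the variants that were actually tested.
    annotated = annotated & tested
    functional = functional & tested
    tp = len(annotated & functional)   # annotated and functional
    fp = len(annotated - functional)   # annotated, not functional
    fn = len(functional - annotated)   # functional, missed by the annotation
    tn = len(tested) - tp - fp - fn    # neither
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    # Assumed +0.5 continuity correction on every cell of the 2x2 table.
    odds_ratio = ((tp + 0.5) * (tn + 0.5)) / ((fp + 0.5) * (fn + 0.5))
    return OverlapStats(precision, recall, odds_ratio)


# Hypothetical toy data: integer IDs stand in for variants.
tested = set(range(1000))
functional = set(range(100))                    # benchmark "functional" variants
strict = set(range(8)) | {100, 101}             # small, selective annotation
broad = set(range(60)) | set(range(100, 640))   # large, permissive annotation

strict_stats = overlap_stats(strict, functional, tested)
broad_stats = overlap_stats(broad, functional, tested)
print(strict_stats)
print(broad_stats)
```

In this toy example the selective annotation reaches a precision of 0.8 with a recall of 0.08, while the permissive one reaches a precision of 0.1 with a recall of 0.6. Repeating such a calculation across a range of peak-calling thresholds would be one way to separate the effect of the cutoff from the effect of the assay itself.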

    Minor comments:

    1. Some of the figure text is quite small or blurry (e.g. Fig. 1A/B, Fig. S2).
    2. Typo in Figure 1A legend: "Heatpmap".
    3. The author of the last entry in the References is "Zhen Z" but should be "Zheng Z".

    Significance

    This work is highly relevant and well-done and provides practical information to guide future fine-mapping studies. The authors partially address the tradeoff between enrichment and recall, which is frequently swept under the rug. Their approach ought to be of high interest to the broad genetics and gene regulation communities.

  3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #1

    Evidence, reproducibility and clarity

    Summary: Disease- and trait-associated genetic variation primarily localizes to non-coding regulatory DNA. A wide variety of methods and assays are routinely used to delineate putative regulatory elements. In this manuscript Biddie et al. sought to evaluate which of these assays are useful for identifying and prioritizing trait-associated regulatory variation. To this end, the authors perform enrichment analysis on genetic variation from different sources, encompassing both molecular phenotypes (QTLs) and trait-associated variation via GWA-studies. The authors show that genetic variants localizing to DNase I footprints and within elements associated with RNA production (enhancer RNA) are maximally enriched for both molQTLs and trait-associated variation vs. other markers of regulatory DNA. Overall, I find that this manuscript is technically sound, and is consistent with prior studies (namely the enrichment of GWAS variants within DNase I footprints -- Vierstra et al. 2020). I suggest only a few additional analyses and edits to the presentation.

    Major comments:

    1. The authors should expand the QTL studies and the GWAS variants via LD and recompute the enrichments. For example, I would take all variants in high LD (r2 > 0.8 or 0.9) with either a QTL or GWA-variant. This will likely increase total variants overlapping an annotation, but reduce overall enrichment (odds-score), and possibly provide some information about which chromatin marks are more associated with "causal" variants.
    2. Can the authors comment on why eRNAs seem to be such a strong marker of functional variation? Are these just "strongest" (most accessible) distal elements? I would assume that these peaks have high overlap with chromatin accessibility peaks.
    3. Would the ATAC-seq enrichment increase if the authors stratified regions by signal rather than aggregating all peaks? The vast majority of chromatin accessibility peaks are very weak and could be false-positives. Let's imagine that in each dataset 1% of peaks are FPs and that the FP peaks are mostly randomly distributed across the genome. As such, aggregating hundreds to thousands of samples would accumulate many FP peaks and greatly affect the enrichment analysis. Conversely, DNase I footprints are found in high-signal peaks that are less likely to be false-positives. One approach to deal with this is to select ATAC-seq peaks matched to the peak signal in DHS peaks with footprints.
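
The arithmetic behind major comment 3 can be made concrete with a rough sketch. It rests on simplifying assumptions that are not in the review: true peaks come from a fixed pool of sites shared across samples (so their union saturates), while false-positive peaks fall at random and never coincide (so their union grows linearly, which is an upper bound). The function name and all numbers other than the 1% rate are hypothetical.

```python
def false_positive_fraction(n_samples: int, peaks_per_sample: int,
                            fp_rate: float, shared_true_sites: int) -> float:
    """Estimate the share of an aggregated peak set that is false positive."""
    fp_per_sample = fp_rate * peaks_per_sample
    true_per_sample = peaks_per_sample - fp_per_sample
    # Random FPs rarely land on the same site twice, so they accumulate.
    union_fp = n_samples * fp_per_sample
    # True peaks recur across samples, so their union is capped by the pool.
    union_true = min(shared_true_sites, n_samples * true_per_sample)
    return union_fp / (union_fp + union_true)


# Hypothetical scenario: 100,000 peaks per sample, 1% FP rate (the figure
# used in the comment), and an assumed pool of 500,000 true sites.
for n in (1, 100, 1000):
    frac = false_positive_fraction(n, 100_000, 0.01, 500_000)
    print(f"{n:>5} samples: {frac:.1%} of aggregated peaks are FPs")
```

Under these assumptions a 1% per-sample error rate grows to roughly one sixth of the aggregated peak set at 100 samples and about two thirds at 1,000 samples, which is the dilution effect the comment describes. Real false positives overlap to some degree, so the true fraction would be lower than this bound.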

    Minor comments:

    1. Figure 1c -- no legend is provided specifying what the bar colors represent.
    2. Pg. 10 -- "To overcome this, detection of footprints could be improved by enzymatic digestion bias correction (Calviello et al., 2019)." The DNase I footprinting dataset used in this paper performs extensive bias correction using a 6mer statistical model. Nevertheless, I completely agree with the sentiment of the authors that low and variable sensitivity of footprinting is certainly driving a high false-negative rate with regard to comprehensively identifying functional variants.
    3. Figure 3 is a little too complicated for its purpose, which is to show the enrichment of bQTLs, caQTLs and raQTLs.

    Significance

    This manuscript provides a rigorous analysis characterizing how various chromatin markers aid in the interpretation of non-coding genetic variation. The findings are not entirely novel; however, the analyses and approaches described are nevertheless useful for variant prioritization. This manuscript is broadly applicable and useful to anyone interested in studying non-coding genetic variation.