Cell type-specific interpretation of noncoding variants using deep learning-based methods

Abstract

Interpretation of non-coding genomic variants is one of the most important challenges in human genetics. Machine learning methods have emerged recently as a powerful tool to solve this problem. State-of-the-art approaches allow prediction of transcriptional and epigenetic effects caused by non-coding mutations. However, these approaches require specific experimental data for training and cannot generalize across cell types where the required features were not experimentally measured. We show here that available epigenetic characteristics of human cell types are extremely sparse, limiting those approaches that rely on specific epigenetic input. We propose a new neural network architecture, DeepCT, which can learn complex interconnections of epigenetic features and infer unmeasured data from any available input. Furthermore, we show that DeepCT can learn cell type-specific properties, build biologically meaningful vector representations of cell types and utilize these representations to generate cell type-specific predictions of the effects of non-coding variations in the human genome.

Article activity feed

  1. n hum

    Reviewer 3: Borbala Mifsud

    GigaScience - Cell type-specific interpretation of noncoding variants using deep learning-based methods

    Sindeeva et al. have developed DeepCT, a convolutional neural network-based model that predicts sequence- and cell type-specific epigenetic profiles from available epigenetic data. The novelty of the approach is that the model can learn an unmeasured epigenetic profile in a given cell type if another cell type has the target feature measured and shares one or more other epigenetic data types with the cell type being predicted. The authors demonstrated that the framework works well and that the model learns both sequence context and cell type specificity. They used the model to predict which de novo variants identified in the Simons Simplex Collection have the highest effect in any of the cell types studied. Focusing on one variant with a high predicted effect in glial cells, they suggested a mechanism whereby the variant, located in a putative enhancer element within the SMG6 gene, reduces FOS binding, which might affect SMG6 expression in these cells. I have a few minor comments to clarify the applicability of this model.

    Minor comments:

    1. In Figure 2D the authors showed that adult heart and fetal heart cell state representations cluster together even though they did not share the measured epigenetic features. This is an interesting observation; however, one of them had ATAC-seq data while the other had DNase-seq data, and these two assays are highly correlated. It would be good to know how far this generalizes to other cases. What level of correlation between two epigenetic features is required for the cell states of two cell types that share no epigenetic features to cluster correctly?
    2. In both Figure 2C and Supplementary Figure 1, the 2D visualization of the cell state representations shows that some cell types cluster well together while others do not cluster at all. Even groups that cluster well, such as "Digestive", "Kidney" or "Muscle" cells, contain many cell types that do not cluster with the others. Apart from biological differences, could this also reflect cell types with lower-quality epigenetic tracks? How much does the quality of the tracks affect the model?
    3. Figure 3E shows that at some points the accuracy of the model is much higher when certain epigenetic tracks are left out of training. Is that also related to the quality of those data, or is there a specific epigenetic feature for which the model consistently shows higher accuracy when the feature is left out?
    4. The authors used 1000 bp to represent the sequence, but the target sequence that is checked for overlap with the epigenetic features is only 200 bp. Does the model learn from the additional 800 bp?
    5. For the cell state tail the chosen emb_length was 32. Based on Supplementary Figure 1, I assume this is due to the number of cell type groups expected, but it would be good to include the rationale in the methods.
    6. For the GO term enrichment, what background was used? I would expect the nearest genes of all de novo variants found in autism cases to show enrichment for similar GO terms.
    7. Pg.11 last line should be "FOS transcription factor binding" instead of "grinding".
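
Comment 6 above turns on how the background gene set changes an enrichment p-value. The following is a minimal sketch of that effect using a one-sided hypergeometric test. All counts are hypothetical and chosen only for illustration; they are not taken from the manuscript, and the manuscript's actual enrichment procedure is not described in this review.

```python
from math import comb


def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) when n genes are drawn without replacement from a
    background of N genes, K of which are annotated with the GO term."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical foreground: nearest genes of the top-scoring variants,
# and how many of them carry the GO term of interest.
n_top, k_top = 50, 10

# Background A (assumed): all annotated genes; the term is rare genome-wide.
p_genome = hypergeom_sf(k_top, N=20000, K=1000, n=n_top)

# Background B (assumed): nearest genes of all de novo variants in the cohort,
# where the term is already more common.
p_cohort = hypergeom_sf(k_top, N=5000, K=800, n=n_top)

print(f"genome-wide background: p = {p_genome:.3g}")
print(f"cohort background:      p = {p_cohort:.3g}")
```

With these toy numbers the same foreground looks far more enriched against the genome-wide background than against the cohort background, which is the situation the comment asks the authors to rule out.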
  2. Interpretation

    Reviewer 2: Yuwen Liu

    The manuscript entitled "Cell type-specific interpretation of noncoding variants using deep learning-based methods" interprets non-coding genomic variants by integrating single-cell epigenetic profiles with a convolutional neural network. The authors found that the CNN can capture cell type-specific properties and generate a biologically meaningful cell state representation by embedding cells into a latent space. In general, the architecture of the convolutional neural network is novel and, to a certain extent, the model may help improve our understanding of the effects of non-coding genomic variants at the single-cell level.

    Major comments:

    1. In Figure 1C the authors intended to quantify how often unmeasured epigenetic marks can be inferred from available profiles. It is true that epigenetic modifications are correlated and sometimes co-located in the genome (Ernst and Kellis 2015). However, the connected graph is not strong evidence for quantifying the predictive ability of the epigenetic marks. The authors should provide more compelling evidence or undertake further analysis.
    2. The authors used an empirical p-value threshold to detect peak positions along the genome. The definition of peaks for each epigenetic mark is crucial for the whole study. At the least, they should plot the distribution of p-values and explain in detail why they chose an empirical p-value threshold of 4.4. Furthermore, the test should be corrected for false positives.
    3. Some epigenetic marks cover broad regions of the genome, so a 150 bp DNA sequence may not contain all the sequence determinants of such a broad peak. That may be why the prediction performance is poor for most epigenetic marks.
    4. In Figure 3D and Supplementary Figure 2, the majority of epigenetic marks show very poor prediction performance. The authors should discuss the potential biological reasons for this result and perform analyses to rule out these confounding factors.
    5. The authors should scrutinize their data, because they also use some epigenetic profiles from heterogeneous tissues composed of different cell types. These heterogeneous profiles may weaken the predictive power of the convolutional neural network model and impair its interpretability.
    6. The authors only used SSC data to showcase the power of their predictions in pinpointing potential causal non-coding variants of ASD. I suggest using GWAS data from a wide variety of complex traits and diseases to evaluate the specificity of the predictions more thoroughly. Furthermore, the authors leveraged signals from 794 cell types when predicting non-coding causal variants for ASD. Including a large number of ASD-irrelevant cell types would likely introduce strong noise and make the results hard to interpret. I suggest the authors mask the epigenetic marks of ASD-relevant cell types (treating these cells as if they had no available epigenetic data), and then use epigenetic marks from other cell types to predict non-coding variants with high impact on epigenetic marks in ASD-relevant cells. This new prediction could then be used to rerun Fig. 4A and 4B. Good performance in this new analysis would better demonstrate the core advantage of the new model, i.e., predicting cell type-specific effects of non-coding variants using epigenetic information from other cell types.

    Minor comments:

    1. The authors defined peaks as 150 bp genomic intervals; however, they use a 200 bp DNA sequence as the center when preparing the input data for the CNN.
    2. The resolution of the figures should be greatly improved.
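
Major comment 2 asks that peak calls be corrected for false positives rather than relying on a fixed cutoff. One common way to do this, sketched below, is the Benjamini-Hochberg procedure. This is only an illustration: it assumes that the quoted threshold of 4.4 is on a -log10(p) scale, which the review does not state, and the signal values are hypothetical toy numbers rather than data from the manuscript.

```python
def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values are monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical -log10(p) signal values for eight genomic bins.
neg_log10_p = [6.0, 4.5, 4.4, 3.0, 1.0, 0.5, 0.2, 0.1]
pvals = [10 ** -x for x in neg_log10_p]

# Fixed cutoff, assuming 4.4 is a -log10(p) threshold.
raw_calls = [x >= 4.4 for x in neg_log10_p]

# Calls at a 5% false discovery rate.
qvals = benjamini_hochberg(pvals)
fdr_calls = [q <= 0.05 for q in qvals]

print("fixed cutoff:", raw_calls)
print("FDR 5%:      ", fdr_calls)
```

The two sets of calls need not agree, and the adjusted values depend on the number of bins tested, so plotting the p-value distribution alongside either cutoff would make the choice easier to assess.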
  3. Abstract

    Reviewer 1: Fangfang Yan

    In this manuscript, Sindeeva and colleagues describe a novel neural network-based algorithm, DeepCT, to cluster epigenetically similar cell types and infer unmeasured epigenetic features, which can then be used to interpret non-coding variants. The manuscript is well structured and well written, and is potentially interesting to a broad readership. Yet the description of the algorithm in the manuscript lacks rigor and thoroughness.

    Major points:

    1. A comparison with competing methods is lacking.
    2. As the authors state themselves in the results and discussion, the performance of DeepCT for some features is very low, such as H3K9 and H4K20 monomethylation. Could the authors add more discussion and explanation of this almost zero average precision?
    3. The authors write "statistically higher" or "outperforms" in many statements but report no statistical test results. For example, on page 8, the authors write: "This analysis confirmed that average cosine similarity for embeddings representing cell types from the same tissue was significantly higher than for embeddings of randomly selected cell types". On page 9, "we note that this baseline has performance metrics substantially higher than expected in random (baseline AP=0.417)."
    4. On page 8, the authors write "we show co-localization of muscle cells, as well as co-localization of digestive cells (Fig. 2C)". However, Figure 2C is not quite convincing.

    Minor points:

    1. Providing high-resolution, vector-friendly figures would help a lot. I can barely see the content of the figures in the current version.
    2. A Jupyter notebook tutorial in the GitHub repository would help users apply DeepCT quickly.
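
Major point 3 asks for a statistical test behind the claim that embeddings of cell types from the same tissue are more similar than those of randomly selected cell types. One possible test, sketched here, is a label permutation test on the mean within-tissue cosine similarity. The embeddings, tissue labels and function names below are hypothetical toy constructs for illustration; this is not the authors' analysis.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_within_tissue(embeddings, labels):
    """Mean cosine similarity over all pairs that share a tissue label."""
    sims = [cosine(embeddings[i], embeddings[j])
            for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]]
    return sum(sims) / len(sims)


def permutation_pvalue(embeddings, labels, n_perm=1000, seed=0):
    """One-sided p-value: how often shuffled labels give a mean within-tissue
    similarity at least as high as the observed one."""
    rng = random.Random(seed)
    observed = mean_within_tissue(embeddings, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_within_tissue(embeddings, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy embeddings: three hypothetical tissues, six cell types each,
# scattered around tissue-specific directions.
toy_rng = random.Random(1)
centers = {"muscle": [1.0, 0.0, 0.0],
           "digestive": [0.0, 1.0, 0.0],
           "kidney": [0.0, 0.0, 1.0]}
embeddings, labels = [], []
for tissue, center in centers.items():
    for _ in range(6):
        embeddings.append([c + toy_rng.gauss(0, 0.1) for c in center])
        labels.append(tissue)

observed, p_value = permutation_pvalue(embeddings, labels)
print(f"mean within-tissue cosine similarity = {observed:.3f}, p = {p_value:.4f}")
```

Shuffling the labels keeps the tissue group sizes fixed, so the null distribution reflects exactly the "randomly selected cell types" comparison quoted in the point above. Reporting the resulting p-value and the number of permutations would address the request.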