SCRaPL: A Bayesian hierarchical framework for detecting technical associates in single cell multiomics data
Abstract
Single-cell multi-omics assays offer unprecedented opportunities to explore epigenetic regulation at cellular level. However, high levels of technical noise and data sparsity frequently lead to a lack of statistical power in correlative analyses, identifying very few, if any, significant associations between different molecular layers. Here we propose SCRaPL, a novel computational tool that increases power by carefully modelling noise in the experimental systems. We show on real and simulated multi-omics single-cell data sets that SCRaPL achieves higher sensitivity and better robustness in identifying correlations, while maintaining a similar level of false positives as standard analyses based on Pearson and Spearman correlation.
Note: This rebuttal was posted by the corresponding author to Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Reply to the reviewers
Reviewer 1
This paper proposes a noise-aware approach SCRaPL for modelling the associations of single cell multi-omic data. For gene expression, it uses Poisson-lognormal model. For DNAm data, it uses Binomial noise model which explicitly takes into account the average within the region. The Bayesian hierarchical framework employed by SCRaPL could achieve higher sensitivity and better robustness in identifying correlations, and also offer a template for the application of more complex analysis techniques to multi-omics data. The symbols of this paper are a little bit confusing, and I suggest authors to carefully check them.
We thank the reviewer for his/her appreciation, and apologise for the confusion arising from the dense notation, which we will thoroughly revise.
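For orientation, the two noise layers summarised by the reviewer above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `simulate_feature`, the logistic link for methylation, the fixed coverage of 5 reads, and all parameter values are hypothetical choices made here. It only shows why a Pearson correlation computed on noisy observations tends to be weaker than the correlation of the underlying latent variables.

```python
import numpy as np

COVERAGE = 5  # hypothetical number of reads per region, same for every cell


def simulate_feature(n_cells=2000, rho=0.8, coverage=COVERAGE, seed=0):
    """Simulate one gene/region pair under an assumed two-layer noise model."""
    rng = np.random.default_rng(seed)
    # Latent bivariate Gaussian: column 0 drives expression, column 1 methylation.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(mean=[1.0, 0.0], cov=cov, size=n_cells)
    # Expression layer: Poisson counts with a log-normal rate (Poisson-lognormal).
    counts = rng.poisson(np.exp(z[:, 0]))
    # Methylation layer: Binomial number of methylated reads out of `coverage`.
    # The logistic link below is an assumption made for this sketch.
    p_meth = 1.0 / (1.0 + np.exp(-z[:, 1]))
    methylated = rng.binomial(coverage, p_meth)
    return z, counts, methylated


z, counts, methylated = simulate_feature()
# Correlation of the latent variables (what a noise-aware model targets).
latent_r = np.corrcoef(z[:, 0], z[:, 1])[0, 1]
# Pearson correlation computed directly on the noisy observations.
naive_r = np.corrcoef(np.log1p(counts), methylated / COVERAGE)[0, 1]
print(f"latent correlation: {latent_r:.2f}, naive Pearson: {naive_r:.2f}")
```

In this toy setting the naive estimate is attenuated relative to the latent correlation, which is the effect that explicit noise modelling is meant to counteract.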
The symbols used in this paper are messy. For example, "1" and "2" are subscripts in Eq.(2) but become superscripts in Figure 5. Besides, there are many unexplained symbols, such as mj, Hj, Ψ0, etc. Also, I don't know whether x_{j,i}^{(1)}, x_{j,i}^{(2)} in Figure 5 are the same as x_{ij1} and x_{ij2} in Eq.(3). There are many mismatches; the authors should check carefully.
Why are the equations in Fig.5 totally different from those in Section 4.2? For example, pj ∼Beta(αj ,βj ) in Fig.5 but ρj ∼ Beta[−1,1](d1, d2) in Eq.(8).
We apologise for the notational confusion; this will be fully revised.
The paper involves a lot of hyper-parameters but does not explain how they were selected. For example, c1, c2, d1, d2.
This is a good point. We will include a sensitivity analysis on the hyperparameters, justifying the choices on both simulated and real data.
In Section 4.8, I am confused about $ρ_j$ in experiments 2, 5, 8, 11. Why does $ρ_j$ represent both the ZI rate and the correlation?
We apologise for the notational oversight, which will be rectified.
In Section 4.5, it is difficult to understand the sentence "for me threshold u". Besides, what does $r$ represent in Section 4.5?
We apologise for the confusing sentence. $r$ is the Pearson correlation coefficient, as explained at the start of Section 4.5.
Why is there "(6a)Agreement between SCRaPL and Pearson" in Fig. 4?
This simply means that the panel shows a methylation/expression scatterplot for a gene where SCRaPL and Pearson both return a significant association. We will expand the caption to explain further.
For Fig.1, I cannot see the text in the rectangle.
Apologies, we will improve the readability of the figures.
I would like to see the efficiency analysis for SCRaPL.
As part of the re-implementation in a more accessible programming language, we have performed a preliminary efficiency analysis for MCMC, demonstrating linear scalability. Results will appear in the revised manuscript.
Reviewer 2
The authors present a Bayesian model to determine noise-corrected correlation coefficients for gene expression (RNA) and DNA-methylation data at single-cell resolution. The authors present a series of simulation data and an example of matched multi-omics data, and compare their results with Pearson correlation. Noise modelling allows the model to determine gene-methylation correlation patterns more accurately. While the authors demonstrate a neat application on accurate quantification of correlation coefficients, I see a limited use of the model for the broader single-cell community. The authors may therefore improve their manuscript on several aspects.
We thank the reviewer for the encouraging words and for the critical observations, which we have taken to heart, considerably broadening the scope of our paper to make it more attractive to a larger community.
- Abstract: please specify the omics layers that you are analyzing (RNA + DNA methylation) in the abstract
We acknowledge that, while SCRaPL is potentially general, in the first submission we focused only on RNA and DNA methylation. We have now decided to expand our analyses to include 10X data of simultaneous chromatin accessibility (ATAC-seq) and RNA.
- What is the benefit of using a Bayesian model formulation in this setting?
The benefit is twofold: a principled treatment of noise, and a quantification of the resulting uncertainty which allows for a meaningful way to compute Bayesian significance levels. We will expand the discussion of the relative merits of a Bayesian vs frequentist approach.
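As a rough illustration of what a Bayesian significance level can look like in practice, the sketch below turns posterior draws of a correlation into a tail probability. It is a sketch under assumptions, not the rule used in the manuscript: the function names, the threshold `u=0.2`, and the cut-off `min_prob=0.9` are all hypothetical, and the draws are toy numbers rather than sampler output.

```python
import numpy as np


def posterior_tail_probability(rho_samples, u):
    """Fraction of posterior draws with |rho| > u, computed per feature.

    rho_samples: array of shape (n_draws, n_features), e.g. MCMC draws.
    """
    rho_samples = np.asarray(rho_samples, dtype=float)
    return np.mean(np.abs(rho_samples) > u, axis=0)


def call_associations(rho_samples, u=0.2, min_prob=0.9):
    """Flag features whose posterior mass beyond the threshold u is large.

    Both u and min_prob are illustrative values chosen for this sketch.
    """
    prob = posterior_tail_probability(rho_samples, u)
    return prob, prob >= min_prob


# Toy draws for two features: the first is consistently far from zero,
# the second mostly hovers around zero.
draws = np.array([
    [0.5, 0.05],
    [0.6, -0.10],
    [0.4, 0.30],
    [0.7, 0.00],
])
prob, flagged = call_associations(draws, u=0.2, min_prob=0.9)
print(prob, flagged)
```

The point of the design is that the decision uses the whole posterior rather than a point estimate, so a feature with a large but very uncertain correlation is not flagged.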
- Does it also apply to unmatched data?
In principle, given measurements with the same number of cells in all modalities, it is possible to apply SCRaPL. However, unless there is a natural pairing between different cells, the scaling of this approach will be quadratic in the number of cells, hence potentially expensive (although largely parallelizable). We will now discuss this, particularly in the light of applying SCRaPL in conjunction with other suites such as Seurat.
Would SCRaPL allow for differential correlation testing?
At the moment, SCRaPL does not allow for differential correlation testing. Of course, one may run SCRaPL separately on two groups of cells and compare the resulting estimates, which would be informative. Nevertheless, extending SCRaPL to perform differential correlation testing (e.g. using Bayesian model selection) would be a non-trivial effort. We will add a comment on this issue to the discussion section.
- Figure 1: The graphical description of the model is rudimentary. I believe that the model description could profit from a graphical model representation of SCRaPL (as presented in figure 5).
We will redraw Fig. 1 and incorporate the graphical model from Fig 5.
- Simulated data: all experiments seem to have rather low cell numbers (max. 200) and genes (max. 300). Given that 10X Genomics is the most widely-used sequencing platform with approx. 10,000 cells and 3,000 (highly variable) genes per experiment, and given that the authors show a use-case with 9480 genes in 487 cells, it seems appropriate to extend the simulations and runtime estimates of the presented model to several thousands of cells and genes, respectively.
Thank you for this comment. The original simulation settings were designed with scMT data in mind, where indeed only a few hundred cells can be assayed at most. Partly because of this feedback, and also because of the request to implement SCRaPL in a different language, we are working on a more scalable TensorFlow implementation which will be able to handle thousands of cells and genes in a matter of tens of minutes. The new simulated data will therefore extend into this regime with larger data sets.
- Figure 4: Please revise the figure legend as I did not understand the plotted results based on the description.
We will do so.
- Results section 2.5: Please formulate your whole argument about epigenetic regulators. I do not think that "For further information please refer to supplementary figure XYZ." Is an appropriate closing statement for a paragraph, nor does it motivate the reader to look at the supplementary figures (I did look at them and I do not see how they support the point made in the paragraph). Please elaborate and consider a "take home message" for the paragraph such that the reader is able to understand the benefit of SCRaPL without revisiting the original data publication.
Thank you for this pointer, we will take it on board in the full revision.
- Conclusion: The authors mention that SCRaPL would further offer a "template for the application of more complex analysis techniques (such as clustering, dimensionality reduction and network inference)". If that was the case, the authors should consider a comparison to other tools, which offer exactly that (e.g. Seurat's CCA or non-negative matrix factorization in LIGER). Further, the authors should set their work into context with tools like bindSC.
Thank you for the suggestion. As far as we can tell, all of these methods are designed for unmatched data, rather than multi-omics assays performed in the same cells. Having said that, it is in principle possible to "preprocess" data with SCRaPL and then feed the latent means computed by SCRaPL to Seurat or other tools. We will include an example of how this may be done in the revision.
- Implementation: Matlab is used in about 6% of the single-cell RNAseq tools (according to scrna-tools.org). To reach a larger scientific community, do the authors plan to provide an R or Python implementation of their model?
We are now implementing SCRaPL in Python using TensorFlow Probability, hoping to achieve substantial speedups (see response to previous point).
Additional minor points about formatting by Reviewer 2 will all be addressed.
Reviewer 3
Maniatis et al propose a sound strategy to analyse single-cell multi-omic data sets. A key advance is to use bespoke error models for each of the omics data. These are integrated into a multivariate Gaussian model. This method is a novel and, in my opinion, a valuable addition to the analyses of the growing multi-omics single-cell data sets.
We thank this reviewer for his/her appreciation of our work.
- Authors make a convincing argument of the importance of principled methods and in particular to use noise models that are tailored to the data at hand. To further support this, can authors elaborate on how results would be different from using commonly applied methods? E.g. those embedded in the Seurat, OSCA, and scanpy 'suites'? Authors compare to Pearson correlation-based methods but it is not clear if that is the true state-of-the-art on those methods.
As far as we know, volcano plots of p-value versus Pearson correlation are the most commonly employed approaches to assess correlations amongst different molecular modalities in single-cell multi-omics (see e.g. Argelaguet et al, Nature 2020). Seurat and other methods normally do not deal with single-cell multi-omics (i.e., multiple omics measured in the same cell), rather with multiple single-cell omics (different molecular modalities assayed in different cells). Nevertheless, it is possible to pre-apply SCRaPL to non-matched data and then use another suite; as an illustration, we will perform an analysis on scMT data using SCRaPL followed by Seurat.
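The baseline described above, a per-feature Pearson correlation with an accompanying p-value, can be sketched as follows. This is an illustrative sketch only: the function name `pearson_volcano_inputs` is hypothetical, the p-value uses the Fisher z normal approximation rather than the exact t-test (an assumption made here to keep the example dependency-free), and the data are synthetic.

```python
import math

import numpy as np


def pearson_volcano_inputs(x, y):
    """Per-feature Pearson r and approximate two-sided p-value.

    x, y: arrays of shape (n_cells, n_features) with matched cells in rows.
    Features with zero variance are not handled in this sketch.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.shape[0]
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    r = (xc * yc).sum(axis=0) / np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum(axis=0))
    # Fisher z transform: atanh(r) * sqrt(n - 3) is approximately N(0, 1)
    # under the null hypothesis of zero correlation.
    r_clipped = np.clip(r, -0.999999, 0.999999)
    z = np.arctanh(r_clipped) * math.sqrt(n - 3)
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return r, p


rng = np.random.default_rng(1)
n_cells = 100
x = rng.normal(size=(n_cells, 2))
y = np.empty_like(x)
y[:, 0] = x[:, 0] + 0.1 * rng.normal(size=n_cells)  # strongly associated feature
y[:, 1] = rng.normal(size=n_cells)                  # unrelated feature
r, p = pearson_volcano_inputs(x, y)
# A volcano plot would show r on the x-axis and -log10(p) on the y-axis.
print(r, p)
```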
- In the case study on mouse embryonic stem cells, authors excluded the chromatin accessibility. Why not use it to more clearly show the value of the method?
We did also use SCRaPL on chromatin accessibility; however, the signal was weaker and we did not include it in the manuscript. We will now present these results as supplementary material.
- It would also be great if authors would use different single-cell multi-omic data sets, using other data modalities, e.g. CITE-Seq data. If this is not possible, at least they should elaborate on which omics SCRaPL can handle, what would be the noise models for different data types, etc.
We have started analysing a joint scATAC-scRNA-seq data set generated using the new 10X commercial platform, and will add the results of this analysis to the revised manuscript. We will also expand the description of the suitability for different data types.
- As the authors acknowledge, computational burden is high, which presumably limits scalability. Are authors able to further explore this (scalability on in silico data)? Or how complex is adopting the variational inference method suggested? I appreciate that the variational inference implementation might be out of the scope of this paper, though.
- It is a pity that the method is in Matlab. Nearly no-one in single-cell omics uses Matlab. Our own lab is largely invested in this topic and we do not even have Matlab licenses. I strongly encourage authors to implement their method in e.g. R or Python, ideally compatible with the broadly used 'suites' (Seurat, OSCA, and scanpy,...).
We are addressing these two comments jointly by re-implementing SCRaPL in TensorFlow Probability (Python based), which allows us to leverage powerful libraries for variational inference. We hope that this will lead to a substantial increase in scalability, providing the possibility of running on thousands of cells / genes in under one hour (results will appear in ).
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #3
Evidence, reproducibility and clarity
Maniatis et al propose a sound strategy to analyse single-cell multi-omic data sets. A key advance is to use bespoke error models for each of the omics data. These are integrated into a multivariate Gaussian model. This method is a novel and, in my opinion, a valuable addition to the analyses of the growing multi-omics single-cell data sets.
I have some comments below that I hope are helpful for the authors:
- Authors make a convincing argument of the importance of principled methods and in particular to use noise models that are tailored to the data at hand. To further support this, can authors elaborate on how results would be different from using commonly applied methods? E.g. those embedded in the Seurat, OSCA, and scanpy 'suites'? Authors compare to Pearson correlation-based methods but it is not clear if that is the true state-of-the-art on those methods.
- In the case study on mouse embryonic stem cells, authors excluded the chromatin accessibility. Why not use it to more clearly show the value of the method?
- It would also be great if authors would use different single-cell multi-omic data sets, using other data modalities, e.g. CITE-Seq data. If this is not possible, at least they should elaborate on which omics SCRaPL can handle, what would be the noise models for different data types, etc.
Minor:
- As the authors acknowledge, computational burden is high, which presumably limits scalability. Are authors able to further explore this (scalability on Insilico data)? Or how complex is adopting the variational inference method suggested? I appreciate that the variational inference implementation might be out of the scope of this paper, though.
- It is a pity that the method is in Matlab. Nearly no-one in single-cell omics use Matlab. Our own lab is largely invested in this topic and we do not even have Matlab licenses. I strongly encourage authors to implement their method in e.g. R or python, ideally compatible with the broadly used 'suites' (Seurat, OSCA, and scanpy,...).
This also precludes checking software and reproducibility of results.
Significance
I think this is an important methodological development for the analysis of single-cell multi-omic data.
To my knowledge, it goes beyond existing methods and does so in a principled manner.
The audience is mostly bioinformaticians dealing with the analysis of this type of data, ie single-cell multi-omics.
My expertise is in the computational analysis of omics data, though less on the statistical fundaments of it. Hence, my group members and I are probable users of this method (if implemented in free software, as mentioned above).
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #2
Evidence, reproducibility and clarity
Summary:
The authors present a Bayesian model to determine noise-corrected correlation coefficients for gene expression (RNA) and DNA-methylation data at single-cell resolution. The authors present a series of simulation data and an example of matched multi-omics data, and compare their results with Pearson correlation. Noise modelling allows the model to determine gene-methylation correlation patterns more accurately. While the authors demonstrate a neat application on accurate quantification of correlation coefficients, I see a limited use of the model for the broader single-cell community. The authors may therefore improve their manuscript on several aspects.
Major comments:
- Abstract: please specify the omics layers that you are analyzing (RNA + DNA methylation) in the abstract
- What is the benefit of using a Bayesian model formulation in this setting?
- Does it also apply to unmatched data?
- Would SCRaPL allow for differential correlation testing?
- Figure 1: The graphical description of the model is rudimentary. I believe that the model description could profit from a graphical model representation of SCRaPL (as presented in figure 5).
- Simulated data: all experiments seem to have rather low cell numbers (max. 200) and genes (max. 300). Given that 10X Genomics is the most widely-used sequencing platform with approx. 10,000 cells and 3,000 (highly variable) genes per experiment, and given that the authors show a use-case with 9480 genes in 487 cells, it seems appropriate to extend the simulations and runtime estimates of the presented model to several thousands of cells and genes, respectively.
- Figure 4: Please revise the figure legend as I did not understand the plotted results based on the description.
- Results section 2.5: Please formulate your whole argument about epigenetic regulators. I do not think that "For further information please refer to supplementary figure XYZ." Is an appropriate closing statement for a paragraph, nor does it motivate the reader to look at the supplementary figures (I did look at them and I do not see how they support the point made in the paragraph). Please elaborate and consider a "take home message" for the paragraph such that the reader is able to understand the benefit of SCRaPL without revisiting the original data publication.
- Conclusion: The authors mention that SCRaPL would further offer a "template for the application of more complex analysis techniques (such as clustering, dimensionality reduction and network inference)". If that was the case, the authors should consider a comparison to other tools, which offer exactly that (e.g. Seurat's CCA or non-negative matrix factorization in LIGER). Further, the authors should set their work into context with tools like bindSC.
Minor comments:
- Implementation: Matlab is used in about 6% of the single-cell RNAseq tools (according to scrna-tools.org). To reach a larger scientific community, do the authors plan to provide an R or Python implementation of their model?
- Fig. 2: Legends for mean, median and y=0 are hardly legible.
- Figure order: 6a is referenced before 4b and 4c (what about 4a?) - seems like a referencing issue as 6a is also listed in the figure legend of Figure 4.
- Figure 6: AIC histogram is difficult to make out behind the blue bars of the DIC histogram. Please adapt.
Reference:
Unbiased integration of single cell multi-omics data Jinzhuang Dou, Shaoheng Liang, Vakul Mohanty, Xuesen Cheng, Sangbae Kim, Jongsu Choi, Yumei Li, Katayoun Rezvani, Rui Chen, Ken Chen, bioRxiv, 2020 https://www.biorxiv.org/content/10.1101/2020.12.11.422014v1
Significance
Significance:
The use of a single-cell specific noise-model to infer accurate correlation coefficients for multi-omic analysis is a novel approach to assess information from DNA-methylation and RNA-sequencing data at single-cell resolution. As far as I am aware, methods like canonical correlation analysis (CCA), as used in Seurat, rely on the accuracy of Pearson correlation, yet, the authors of this manuscript made a convincing point on the devastating impact of noise from transcription and methylation levels on Pearson correlation.
Audience:
In order to address downstream analysis questions such as gene regulatory network inference, it is essential to have an accurate metric to assess the regulatory impact of methylation on gene expression at hand. However, an efficient implementation in a more common language (e.g. R, Python or C++) would be advisable to create a broader applicability of the model.
The reviewer's field of expertise: single-cell RNAsequencing, data analysis, data integration, Bayesian modelling
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #1
Evidence, reproducibility and clarity
This paper proposes a noise-aware approach SCRaPL for modelling the associations of single cell multi-omic data. For gene expression, it uses Poisson-lognormal model. For DNAm data, it uses Binomial noise model which explicitly takes into account the average within the region. The Bayesian hierarchical framework employed by SCRaPL could achieve higher sensitivity and better robustness in identifying correlations, and also offer a template for the application of more complex analysis techniques to multi-omics data. The symbols of this paper are a little bit confusing, and I suggest authors to carefully check them. My comments are as following:
- The symbols used in this paper are messy. For example, "1" and "2" are subscripts in Eq.(2) but become superscripts in Figure 5. Besides, there are many unexplained symbols, such as mj, Hj, Ψ0, etc. Also, I don't know whether x_{j,i}^{(1)}, x_{j,i}^{(2)} in Figure 5 are the same as x_{ij1} and x_{ij2} in Eq.(3). There are many mismatches; the authors should check carefully.
- Why are the equations in Fig.5 totally different from those in Section 4.2? For example, pj ∼Beta(αj ,βj ) in Fig.5 but ρj ∼ Beta[−1,1](d1, d2) in Eq.(8).
- The paper involves a lot of hyper-parameters but does not explain how they were selected. For example, c1, c2, d1, d2.
- In Section 4.8, I am confused about $ρ_j$ in experiments 2, 5, 8, 11. Why does $ρ_j$ represent both the ZI rate and the correlation?
- In Section 4.5, it is difficult to understand the sentence "for me threshold u". Besides, what does $r$ represent in Section 4.5?
- Why is there "(6a)Agreement between SCRaPL and Pearson" in Fig. 4?
- For Fig.1, I cannot see the text in the rectangle.
- I would like to see the efficiency analysis for SCRaPL.
Significance
Audiences interested in multi-omic data, single-cell RNA, and machine learning will be interested in this paper.
My field of expertise: machine learning, single-cell RNA
