scShapes: a statistical framework for identifying distribution shapes in single-cell RNA-sequencing data

Abstract

Background

Single-cell RNA sequencing (scRNA-seq) methods have been advantageous for quantifying cell-to-cell variation by profiling the transcriptomes of individual cells. For scRNA-seq data, variability in gene expression reflects the degree to which expression of a gene differs from one cell to another. Analyses that focus on cell–cell variability are therefore useful for going beyond changes in average expression and instead identifying genes with homogeneous expression versus those that vary widely from cell to cell.

Results

We present a novel statistical framework, scShapes, for identifying differential distributions in single-cell RNA-sequencing data using generalized linear models. Most approaches for differential gene expression detect shifts in the mean value. However, as single-cell data are driven by overdispersion and dropouts, moving beyond means and using distributions that can handle excess zeros is critical. scShapes quantifies gene-specific cell-to-cell variability by testing for differences in the expression distribution while flexibly adjusting for covariates if required. We demonstrate that scShapes identifies subtle variations that are independent of altered mean expression and detects biologically relevant genes that were not discovered through standard approaches.

Conclusions

This analysis also draws attention to genes that switch distribution shapes from a unimodal distribution to a zero-inflated distribution and raises open questions about the plausible biological mechanisms that may give rise to this, such as transcriptional bursting. Overall, the results from scShapes help to expand our understanding of the role that gene expression plays in the transcriptional regulation of a specific perturbation or cellular phenotype. Our framework scShapes is incorporated into a Bioconductor R package (https://www.bioconductor.org/packages/release/bioc/html/scShapes.html).

Background

This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac126), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer 1: Shiping Liu

How to model the statistical distribution of gene expression is a basic question in single-cell sequencing data mining. Dharmaratne and colleagues looked in detail at the distribution of every gene. Using generalized linear models (GLMs), the authors present a new program, scShapes, which matches each gene to one of four distribution shapes: Poisson, Negative Binomial (NB), Zero-inflated Poisson (ZIP), and Zero-inflated Negative Binomial (ZINB). As the authors show in this manuscript, not all genes fit a single distribution, whether NB or Poisson, and some genes are better fit by zero-inflated models because of the high dropout rate of modern single-cell sequencing, for example 3' tag sequencing. Employing GLMs in single-cell data mining has become popular recently, although it has drawn both praise and criticism. Fitting a specific model to each individual gene is therefore a good step forward. The downside is the computing cost, especially now that the number of sequenced cells reaches into the millions in current research, and datasets are expected to grow even larger in the future. This is a major obstacle to applying the method presented here. How can the calculation be sped up when using the mixed model or scShapes? The authors also applied scShapes to several datasets, including the metformin, human T cell, and PBMC datasets. They found candidate genes that changed distribution shape but were not easily identified by other methods. This demonstrates that scShapes can identify subtle changes in gene expression.
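As a rough illustration of what matching a gene to a distribution shape can look like, the sketch below fits only two of the four shapes (Poisson and ZIP) to one gene's counts by maximum likelihood, without covariates, and keeps the shape with the lower BIC. The use of BIC as the selection rule, the EM fitting loop, the function names, and the toy counts are all assumptions made for illustration. This is not the scShapes implementation, which is a GLM-based R package.

```python
import math

def poisson_loglik(counts, lam):
    """Log-likelihood of the counts under Poisson(lam)."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)

def zip_loglik(counts, pi, lam):
    """Log-likelihood under a zero-inflated Poisson with excess-zero weight pi."""
    total = 0.0
    for k in counts:
        if k == 0:
            total += math.log(pi + (1.0 - pi) * math.exp(-lam))
        else:
            total += math.log(1.0 - pi) + k * math.log(lam) - lam - math.lgamma(k + 1)
    return total

def fit_zip(counts, n_iter=200):
    """Estimate (pi, lam) with a basic EM loop (fixed iteration count, illustrative)."""
    n = len(counts)
    n_zero = sum(1 for k in counts if k == 0)
    total = sum(counts)
    pi = 0.5 * n_zero / n
    lam = total / (n - n_zero)  # mean of the non-zero counts as a starting value
    for _ in range(n_iter):
        # E-step: probability that an observed zero is an excess zero
        z = pi / (pi + (1.0 - pi) * math.exp(-lam))
        # M-step: update the mixing weight and the Poisson mean
        pi = n_zero * z / n
        lam = total / (n - n_zero * z)
    return pi, lam

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_obs) - 2.0 * loglik

def select_shape(counts):
    """Return 'Poisson' or 'ZIP', whichever has the lower BIC."""
    if sum(counts) == 0:
        raise ValueError("all counts are zero; nothing to fit")
    n = len(counts)
    lam_hat = sum(counts) / n  # the Poisson MLE is the sample mean
    bic_pois = bic(poisson_loglik(counts, lam_hat), 1, n)
    pi_hat, lam_zip = fit_zip(counts)
    bic_zip = bic(zip_loglik(counts, pi_hat, lam_zip), 2, n)
    return "Poisson" if bic_pois <= bic_zip else "ZIP"

# Toy counts for two hypothetical genes (made up, not real data).
poisson_like = [2, 3, 1, 4, 2, 3, 2, 1, 3, 2, 0, 2, 3, 4, 1, 2]
zero_inflated = [0] * 30 + [4, 5, 6, 5, 4, 7, 5, 6, 3, 5, 4, 6, 5, 5, 6, 4, 5, 7, 5, 4]
print(select_shape(poisson_like), select_shape(zero_inflated))
```

On these toy counts the first gene is assigned the Poisson shape and the second the ZIP shape. Extending the sketch to NB and ZINB would need a dispersion parameter and a numerical optimizer, which is part of why per-gene fitting becomes costly as cell numbers grow.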

Major points: (1) We did not see any details about the metformin dataset, such as the sequencing depth and quality, the number of genes/UMIs per cell, and so on. This makes it hard to evaluate the quality and reliability of the results generated by scShapes. If this dataset belongs to another manuscript that cannot be presented at the same time, I suggest the authors run scShapes on an alternative dataset, as many published single-cell datasets could be used in this study.

(2) Although the authors take cell type into account in the GLM, I wonder whether, for a specific gene, the distribution shape changes across cell types. If so, the problem becomes more complex, because the distribution shape of each gene would need to be modeled separately for every cell type.

(3) When identifying differential gene expression with scShapes, the authors did not consider the influence of differing cell numbers, or differing proportions of cells, across cell types. A possible way to evaluate or eliminate this bias is to downsample from a large dataset, instead of only simulating totals of 2k to 5k cells from the PBMC data. This would allow the influence of both the total number of cells and the cell type proportions to be evaluated.

(4) The authors should present a comparison of computational cost across methods, for example accuracy, run time, and memory consumption for different numbers of cells. I suggest the authors use a much larger dataset, because current single-cell studies may include millions of cells, and the ability to process big data is very important if the method is to be applied and widely adopted.
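The downsampling design suggested in point (3) could be sketched as follows: draw subsets of a large annotated dataset at a chosen total size, either keeping the observed cell type proportions or imposing new ones. The function name, its arguments, and the cell type labels are hypothetical, and the downstream scShapes run on each subset is not shown.

```python
import random
from collections import defaultdict

def downsample_cells(cell_types, total, proportions=None, seed=0):
    """Return sorted indices of a random subsample of roughly `total` cells.

    cell_types  : list with one cell type label per cell
    proportions : dict mapping label -> target fraction; defaults to the
                  proportions observed in `cell_types`
    Per-type counts are rounded, so the result can differ from `total`
    by a few cells when the fractions do not divide evenly.
    """
    rng = random.Random(seed)  # seeded for reproducible subsets
    by_type = defaultdict(list)
    for idx, label in enumerate(cell_types):
        by_type[label].append(idx)
    if proportions is None:
        n = len(cell_types)
        proportions = {label: len(ix) / n for label, ix in by_type.items()}
    chosen = []
    for label, frac in proportions.items():
        k = round(total * frac)
        if k > len(by_type[label]):
            raise ValueError(f"not enough {label!r} cells to draw {k}")
        # sample without replacement within each cell type
        chosen.extend(rng.sample(by_type[label], k))
    return sorted(chosen)

# Hypothetical annotated dataset of 1,000 cells.
labels = ["T"] * 600 + ["B"] * 300 + ["Mono"] * 100

same_mix = downsample_cells(labels, 100)  # keeps the 60/30/10 mix
balanced = downsample_cells(labels, 200, {"T": 0.5, "B": 0.5, "Mono": 0.0})
print(len(same_mix), len(balanced))
```

Running the same analysis on subsets of increasing size, and on subsets with altered proportions, separates the effect of total cell number from the effect of cell type composition. The same subsets could also be reused for the time and memory comparison requested in point (4).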

Minor points: (1) There are no figure legends for Fig. 2c and 2d.

(2) It is unclear whether 30% of all genes undergo a shape change, or only 30% of the genes remaining after the pipeline. Please clarify.

Reviewer 2: Yuchen Yang

In this manuscript, the authors present a novel statistical framework, scShapes, which uses a GLM approach to identify genes with differential distributions across scRNA-seq data from different conditions. scShapes quantifies gene-specific cell-to-cell variability by testing for differences in the expression distribution. scShapes was shown to identify biologically relevant switches in gene distribution shapes between conditions. However, several concerns still need to be addressed.

In this study, the authors compared scShapes to scDD and edgeR. However, besides these two, there are many other methods for calling DEGs from scRNA-seq data. Wang et al. (2019) systematically evaluated the performance of eight methods specifically designed for scRNA-seq data (SCDE, MAST, scDD, D3E, Monocle2, SINCERA, DEsingle, and SigEMD) and two methods for bulk RNA-seq (edgeR and DESeq2). It would therefore be worthwhile to compare scShapes to other methods, such as SigEMD, DEsingle, and DESeq2, which are thought to perform better than scDD or edgeR.

When comparing scShapes to scDD, the authors mainly focused on distribution shifts. For users, however, it would be better to present a Venn diagram showing the number of genes detected by both scShapes and scDD, and the numbers identified only by scShapes or only by scDD. In addition, the authors showed functional enrichment results for the DEGs identified by scShapes. It would also be worthwhile to perform enrichment analysis on the genes detected by both tools and on those identified specifically by scShapes or by scDD.

Since scShapes detects differential gene distributions between conditions, it would be better to show users how to interpret significant results biologically. For example, the authors mention that RXRA is differentially distributed between Old and Young and between Old and Treated. What does this result mean? Can this differential distribution be associated with differential expression?

In the Discussion, the authors mention that scRATE is another tool that can model droplet-based scRNA-seq data. It would be clearer to discuss why the authors developed their own algorithm rather than using scRATE to model the distribution.

In the Introduction, the authors discuss zero counts in scRNA-seq data, and they present evidence in the Results. Since 2020, several publications have also focused on this issue, such as Svensson 2020 and Cao 2021. These discussions should be included in the manuscript.
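The overlap comparison requested above for scShapes and scDD amounts to partitioning two gene lists into the three regions of a two-set Venn diagram, each of which can then be passed to an enrichment analysis. A minimal sketch follows; the function name is hypothetical and the gene identifiers are made-up placeholders, not results from either tool.

```python
def partition_gene_sets(genes_a, genes_b):
    """Split two gene lists into the three regions of a two-set Venn diagram."""
    a, b = set(genes_a), set(genes_b)
    return {
        "both": sorted(a & b),    # detected by both tools
        "only_a": sorted(a - b),  # detected only by the first tool
        "only_b": sorted(b - a),  # detected only by the second tool
    }

# Placeholder gene lists for illustration only.
scshapes_hits = ["GENE1", "GENE2", "GENE3", "GENE4"]
scdd_hits = ["GENE3", "GENE4", "GENE5"]

regions = partition_gene_sets(scshapes_hits, scdd_hits)
for name, genes in regions.items():
    print(name, len(genes), genes)
```

The three counts are what a Venn diagram would display, and the three gene lists are the inputs for the per-region enrichment analysis.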
