From Noise to Models to Numbers: Evaluating Negative Binomial Models and Parameter Estimations in Single-Cell RNA-seq
This article has been Reviewed by the following groups
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
- Evaluated articles (Review Commons)
Abstract
The Negative Binomial (NB) distribution effectively approximates the transcript count distribution in many single-cell RNA sequencing (scRNA-seq) datasets. This has led to its widespread use in various computational tools for scRNA-seq analysis. However, the underlying reasons for its ubiquitousness remain unclear. Here, we use a computationally efficient model selection technique to precisely map the relationship between the choice of the best-fit models — Beta-Poisson (Telegraph), NB and Poisson — and the kinetic parameters that control the stochasticity of gene expression. We find that the NB distribution is an excellent approximation to simulated data, that accounts for both biological and technical noise, in an intermediate range of an effective parameter — the sum of the gene activation and inactivation rates normalized by the mRNA degradation rate. The size of this range increases with decreasing mean expression, increasing technical noise, and increasing sample size (number of cells). These findings have important implications: (i) excellent NB fits span diverse parameter regimes and are not exclusive indicators of transcriptional bursting; (ii) for small sample sizes, biological noise generally becomes the primary factor shaping the NB characteristics of the count distribution, even when technical noise is significant; (iii) under the assumption of steady-state conditions, gene-specific parameters (burst size and frequency) estimated in regions where the NB model fits well, typically show large relative errors, even after corrections for technical noise; (iv) gene ranking by burst frequency remains accurate, indicating that burst parameter magnitudes are often only relatively informative.
Article activity feed
-
Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.
Learn more at Review Commons
Reply to the reviewers
The authors do not wish to provide a response at this time.
-
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #3
Evidence, reproducibility and clarity
Summary
In this paper, Wang and Shu et al. investigate the extent to which the negative binomial (NB) distribution captures the statistical properties of single-cell like count data and the effects of using this model to interpret biophysical parameters. Assuming an underlying telegraph model of transcription, they demonstrate how the NB can produce similar if not equivalent fits to simulated data from various parameter regimes, regimes which can, notably, fall outside of the bursty transcription limit in which the telegraph model is known to have a NB form. The authors then assess how model selection favors the NB or Poisson models …
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #3
Evidence, reproducibility and clarity
Summary
In this paper, Wang and Shu et al. investigate the extent to which the negative binomial (NB) distribution captures the statistical properties of single-cell like count data and the effects of using this model to interpret biophysical parameters. Assuming an underlying telegraph model of transcription, they demonstrate how the NB can produce similar if not equivalent fits to simulated data from various parameter regimes, regimes which can, notably, fall outside of the bursty transcription limit in which the telegraph model is known to have a NB form. The authors then assess how model selection favors the NB or Poisson models over the underlying telegraph model, and how technical noise can lead to greater selection/representation of the NB over the parameter regime. Finally, they demonstrate how the broader applicability of the NB impacts inference of burst size and frequency (commonly inferred from NB fits on single-cell data), preserving relative rather than absolute information.
The authors use both method of moments and MLE-based approaches to obtain and compare model fits over the same parameter regimes. They also develop the aeBIC metric which balances parametric complexity and distributional similarity to the desired, ground truth distribution, to more quickly approximate the BIC (used for model selection).
Major comments:
The likelihood of model fits is used as a main criteria for model selection and comparison (e.g., in the BIC/aeBIC metrics), however it is possible that analysis of the curvature of the likelihood may suggest greater uncertainty/less information about parameter estimates for the different statistical models across the transcriptional regimes tested. Since a major component of this study is to demonstrate to readers that nuanced model selection is important for interpreting single-cell data, it would support these efforts to see if the telegraph versus NB model fits, for example, demonstrate differences in their respective Hessian matrices for the MLE estimates. This would help determine, for those interested in comparing these fits on their data, if there is potential here to distinguish the more optimal/true model or not (i.e., what the extent of the limitations are). The authors describe how in the infinite limit of N_sigma the NB and telegraph models converge to the same distribution, which provides another biological scenario outside transcriptional bursting where the NB can be interpreted as a good statistical model. However, though many parameter regimes are possible not all are observed in real data. Thus for readers to understand how likely these regimes are to be present in the data it would be helpful to discuss in what biological scenarios such a limit may appear and if it is likely to be a common instance, etc (perhaps given the ranges of on/off times observed in the literature https://pmc.ncbi.nlm.nih.gov/articles/PMC10860890/). This would parallel the discussion in the study on the bursty transcription model, often described in the literature as a widespread phenomenon. The p_cap parameter is described as representing technical capture and affords the conclusion in the Discussion that the NB can improve capture of technical noise beyond the biological noise in the system. However, as mentioned later in the Discussion, this effect could also arise from cell to cell differences in transcription rate (extrinsic, biological noise), which cannot be distinguished in this model. This point should be made clearer earlier on, as without use of control genes/spike-ins/etc we cannot distinguish the biological and technical components encompassed by the p_cap term (i.e., whether or not a spread in total UMIs observed over droplets is due to biological or technical capture differences). Since the aeBIC is being presented as a new, faster method in this study, the timing and memory usage in performing these calculations, for each model, should be presented somewhere. The Methods should also have a more explicit description of the steps/tools used to calculate the aeBIC.
Minor comments:
Figure S4 mentioned comparison of scRNA-seq with smFISH data to approximate p_cap, however given that smFISH data would have its own technical biases it does not seem exactly clear how a map from smFISH to scRNA-seq would work such as to illuminate the gap incurred by technical bias/capture. Perhaps previous literature/methods doing this can be cited here, or this idea can be fleshed out in the Discussion text for readers interested in better estimating p_cap. In Figure 4 the pink color of the Poisson in c is hard to see, and it may be easier to write the names of the different models in the respective regions that they cover (similarly in Figure 5 c) For Figure 8, it may be easier for the reader to interpret the several plots in a row by repeating the x-axis labels under each set of plots and collating all the legend labels into one box somewhere near the first plots.
Significance
General assessment: Overall, the paper is a clear and concise view on the use of the NB in analysis of sparse, transcriptomic count data, the potential effects of technical and biological noise on the pertinence of the NB as the statistical representation, and the impacts on user interpretation of biophysical parameters from these model fits. This study is useful for both biologists and computational scientists looking to gain mechanistic insight from single-cell data.
The strength of the paper is that the methodology is straightforward and uses simple numerical experiments to demonstrate how and when several common distributions can describe the type of data we encounter in single-cell genomics. They additionally connect these results to common biological interpretations from single-cell measurements and outline regimes in which inferences are likely to be incorrect.
The paper could benefit from more discussion on the biological interpretations of the findings and regimes analyzed, particularly to help readers interested in how this impacts their data analysis. Supplemental analysis on whether other criteria could potentially distinguish the models in question would also help support the conclusions of model selection/identifiability and if other properties of these model fits can be used for selection or not.
Advance: The study builds on others in the field by not just fitting several common models to this type of sparse, transcriptomic count data but also describing why these overlapping fits arise and how that affects biological interpretation. Often the focus is more on choosing a sufficient statistical representation without the underlying, mechanistic connections between the models. The results here are thus more technical and mechanistic in nature, describing both the theoretical connections between common single-cell count models and their biophysical interpretations.
Audience: This result is likely to be of interest to scientists performing data analysis and method development in single-cell genomics, particularly with mechanistic insight in mind. This would be more of interest within the domain of transcriptomics, but it also presents a methodology for studying limitations of identifiability in noisy systems which could be of interest to other biological domains.
My expertise is in developing representation learning methods and stochastic models of transcription for single-cell biology, which covers the classical models described in this study.
-
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #2
Evidence, reproducibility and clarity
The study is generally well reasoned and thorough, and should be of interest to the community. My only critique relates to the treatment of extrinsic noise and the related discussion: Many studies have concluded that extrinsic noise (e.g., cell-to-cell variability in the transcription rate) is a larger contribution to noise in gene expression than intrinsic noise. (For example, see the seminal review by Raj and van Oudenaarden (PMID: 18957198) and early examples such as Raser and O'Shea: Raser JM, O'Shea EK. Science. 2004;304:1811. doi: 10.1126/science.1098641). For this reason, one must be careful in assuming that the telegraph …
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #2
Evidence, reproducibility and clarity
The study is generally well reasoned and thorough, and should be of interest to the community. My only critique relates to the treatment of extrinsic noise and the related discussion: Many studies have concluded that extrinsic noise (e.g., cell-to-cell variability in the transcription rate) is a larger contribution to noise in gene expression than intrinsic noise. (For example, see the seminal review by Raj and van Oudenaarden (PMID: 18957198) and early examples such as Raser and O'Shea: Raser JM, O'Shea EK. Science. 2004;304:1811. doi: 10.1126/science.1098641). For this reason, one must be careful in assuming that the telegraph model by itself fully captures biological variability. I believe this point could be more clearly made in the paper. Did the authors treat a case in which the gene undergoes state switching, but where there is also a significant contribution of extrinsic noise, for example, through variability in the transcription rate and/or other papers? I could not tell for sure if this was explicitly studied. This would be an important scenario to study, because it may be the most likely. I would have thought that this is the most biologically realistic scenario (i.e., strong contributions of both intrinsic and extrinsic noise, along with state switching). My prior assumption has been that the NB model is often empirically indicated because it somehow well captures this combination of intrinsic (including state switching) + extrinsic noise. Could the authors comment on whether this assumption is consistent with their findings? (Neither Case I or Case II in the manuscript captures this scenario). Related to the treatment of extrinsic noise, I was confused by this sentence: "Any variation in the effective transcription rate due to variability in the transcription rate (extrinsic noise on the transcription rate) between cells is indistinguishable from variability in the transcript capture probability and hence is automatically accounted for in our present method. " But doesn't the distribution of transcription rates vary significantly, depending on whether the variation comes from technical noise versus extrinsic biological variability? For example, one source of extrinsic biological variability is differences in RNA polymerase concentrations in different cells. Wouldn't one need to know what kind of distribution to use to capture these effects? In this case, I believe one would need to study various types of compound distributions, depending on the assumptions underlying the biological extrinsic variability.
Significance
This paper presents a thorough study of the conditions under which the negative binomial model of transcript distributions can map onto other widely used models, namely the telegraph model of stochastic gene expression. The study is generally well reasoned and thorough, and should be of interest to the community (namely: single cell transcriptomics community, bio mathematicians, biological noise community).
-
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #1
Evidence, reproducibility and clarity
The authors present an investigation into the surprising effectiveness of the negative-binomial distribution in modelling transcript counts in single cell RNA sequencing experiments. With experimentally motivated ground-truth models that incorporate transcriptional bursting, they show that when transcription activity is large compared to degradation these distributions coincide. With a novel model selection metric, they indicate the regions of parameter space in which the negative-binomial model is a good approximation to the underlying true model. With this procedure, they also indicate that transcriptional burst parameters are …
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #1
Evidence, reproducibility and clarity
The authors present an investigation into the surprising effectiveness of the negative-binomial distribution in modelling transcript counts in single cell RNA sequencing experiments. With experimentally motivated ground-truth models that incorporate transcriptional bursting, they show that when transcription activity is large compared to degradation these distributions coincide. With a novel model selection metric, they indicate the regions of parameter space in which the negative-binomial model is a good approximation to the underlying true model. With this procedure, they also indicate that transcriptional burst parameters are unlikely to be reconstructed by an effective negative-binomial function, but that nevertheless, relative rankings between genes can be identified robustly.
I would like to commend the authors on an interesting and fairly comprehensive investigation on a topic of considerable importance in the interpretation of single cell RNA sequencing experiments, and on a well written paper. I have no major comments on issues that affect the conclusions of the paper, although I have a few minor suggestions that might aid reader's understanding of the results and their applicability.
General
It would be nice to have a comparison with some real data for the burst frequency and size, just to indicate to the reader how important these regions are compared to what might be measured. For example, if most genes are outside of the region that does not accommodate the NB distribution, then the conclusion is quite different than if most real counts are unlikely to accommodated by the NB.
Inter-cellular variability of transcription dynamics is quite a significant point of interest, so it would be good to have stated earlier that this is not considered, with the mitigation that is noted later. This is particularly important given that in the introduction, the cases mentioned seem to imply that an NB distribution would be more likely with higher inter-cellular variability.
Introduction
It would be nice to have a bit more detail here, for example on what UMIs are, and what the parameters of the NB distribution represent in general.
For smFISH, I would have thought that the more simple explanation is that the NB is often the simplest distribution with some overdispersion that fits the data, and the parameters don't necessarily need to be biologically interpretable?
It's noted later that the capture probability of modern RNASeq protocols can be ~0.3, which doesn't seem very different compared to 0.7-0.9 of smFISH, so some context here would be good.
Results
Eq 1: I don't think you lose anything by giving the Pochammer symbol and Kummer confluent geometric function explicitly here, and it would make it it a lot easier to read. That said, this equation also seems to come out of nowhere, so a reference would be nice.
I think the moment matching is reasonably convincing, but it might require a little more explicit motivation for a more general audience.
Thm. 1: Do these converge at similar rates, and if not, does that have any implications for the interpretation of the comparisons (as these are evaluated with specific values)? This might be worth a short comment.
Fig 3. In the description for this in the text, it would be nice to have an expression of the KL divergence (and what order the arguments are in), for anyone unfamiliar.
The discussion of the aeBIC seems a bit circuitous. A reasonable prior intention might be to average (or apply a voting function) to individual BIC values, rather than the aeBIC constructed here. And in fact the text goes on to note after the description that this is a good estimate of the expectation of the BIC after all, with some computational advantages. So it might be better to have a more straightforward presentation where this is proposed as an approximation to the expectation of the BIC in the first place.
Section 2.4: The intro to this section could do with a bit more background of the capture, PCR, sequencing, etc, stages, and what exactly the data generated here represents. Otherwise the discussion of zero inflation and UMIs is a little confusing.
It would also be nice to have a comment here on the effect of sequencing depth, or similar (compared to capture probability), even if this wouldn't change the interpretation.
Significance
The paper provides novel arguments towards the support of the negative-binomial distribution in describing single cell RNA sequencing data, with particular relevance to transcriptional bursting observed in numerous datasets. The paper follows from some notable prior work in the field, and integrates these into a more consistent description, particularly in relation to newer techniques such as UMIs.
The ubiquity of the negative-binomial distribution means that these arguments will be of relevance to those that perform theoretical or statistical modelling of single cell RNA sequencing data, and theoretically justifies many widely held assumptions. However, the paper does not make any reference to specific reference datasets or commonly observed values, so where in the parameter space data likely lies would still need to be evaluated on a case-by-case basis.
My expertise is in mathematical modelling and statistics, with some experience of the analysis of single cell RNA sequencing data.
-
