GTestimate: Improving relative gene expression estimation in scRNA-seq using the Good-Turing estimator

Martin Fahrenberger
Christopher Esk
Arndt von Haeseler

This article has been Reviewed by the following groups


Listed in

Evaluated articles (GigaScience)

Abstract

Background

Single-cell RNA-seq suffers from unwanted technical variation between cells, caused by its complex experiments and shallow sequencing depths. Many conventional normalization methods try to remove this variation by calculating the relative gene expression per cell. However, their choice of the Maximum Likelihood estimator is not ideal for this application.

Results

We present GTestimate, a new normalization method based on the Good-Turing estimator, which improves upon conventional normalization methods by accounting for unobserved genes. To validate GTestimate we developed a novel cell targeted PCR-amplification approach (cta-seq), which enables ultra-deep sequencing of single cells. Based on this data we show that the Good-Turing estimator improves relative gene expression estimation and cell-cell distance estimation. Finally, we use GTestimate’s compatibility with Seurat workflows to explore three common example data-sets and show how it can improve downstream results.

Conclusion

By choosing a more suitable estimator for the relative gene expression per cell, we were able to improve scRNA-seq normalization, with potentially large implications for downstream results. GTestimate is available as an easy-to-use R-package and compatible with a variety of workflows, which should enable widespread adoption.

GigaScience
Oct 30, 2025


This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf084), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer 2: Amichai Painsky

This paper introduces a Good-Turing (GT) estimation scheme for relative gene expression estimation and cell-cell distance estimation. The proposed method, namely GTestimate, claims to improve upon conventional normalization methods by accounting for unobserved genes. The idea behind this contribution is fairly straightforward - since the relative gene expression is of large alphabet, a GT estimator is expected to perform better than a naive ML approach. However, I am not convinced that the authors applied it correctly. First, the proposed GT estimator (as it appears in (GT) in the text) assigns a zero estimate to unobserved genes (Cg = 0). This contradicts the entire essence of using a GT estimator. Second, it makes no sense to use this expression for every Cg > 0. In fact, any reasonable GT-based estimator applies GT for relatively small Cg, and the ML estimator for large Cg. See [1] for a thorough discussion. The choice of a threshold between "small" and "large" Cg's is the subject of many studies (for example [2], [1]), but it makes no sense to use the above expression for any Cg. Finally, notice that if N_{Cg} > 0 for some g but N_{Cg+1} = 0, the proposed estimator is not defined. There exist several smoothing solutions for such cases (for example [3]), but they need to be properly discussed. To conclude, I am not sure what the effect of these issues is on the experiments in the paper, which makes it difficult to assess the results.
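
The concerns raised in this review can be made concrete with a minimal Python sketch of the textbook, unsmoothed Good-Turing estimator. This is an illustration under stated assumptions, not the GTestimate implementation: the function name `good_turing_unsmoothed` and the toy counts are hypothetical, and the formula assumed is the standard form p_g = (c_g + 1) / N * N_{c_g + 1} / N_{c_g}, where N is the total count and N_c is the number of genes observed exactly c times.

```python
from collections import Counter

def good_turing_unsmoothed(counts):
    """Textbook Good-Turing estimate per gene, without any smoothing.

    counts: non-negative integer counts, one per gene (hypothetical toy input).
    Returns (per_gene_estimates, missing_mass).
    """
    total = sum(counts)
    # freq_of_freq[c] = number of genes observed exactly c times (N_c)
    freq_of_freq = Counter(c for c in counts if c > 0)
    estimates = []
    for c in counts:
        if c == 0:
            # Unobserved genes get no individual estimate here; their
            # combined probability is reported as the missing mass below.
            estimates.append(0.0)
        else:
            n_c = freq_of_freq[c]
            n_c_plus_1 = freq_of_freq.get(c + 1, 0)
            estimates.append((c + 1) / total * n_c_plus_1 / n_c)
    # Good-Turing missing mass: N_1 / N
    missing_mass = freq_of_freq.get(1, 0) / total
    return estimates, missing_mass

counts = [0, 1, 1, 1, 2, 2, 5]
estimates, missing_mass = good_turing_unsmoothed(counts)
```

In this toy input no gene is observed 3 or 6 times, so the genes with counts 2 and 5 receive an estimate of zero in this textbook form, and the estimates plus the missing mass no longer sum to one. Smoothing, or switching to the ML estimate for large counts, is meant to prevent this kind of breakdown.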

REFERENCES

[1] A. Painsky, "Convergence guarantees for the Good-Turing estimator," Journal of Machine Learning Research, vol. 23, no. 279, pp. 1-37, 2022.

[2] E. Drukh and Y. Mansour, "Concentration bounds for unigram language models," Journal of Machine Learning Research, vol. 6, no. 8, 2005.

[3] W. A. Gale and G. Sampson, "Good-Turing frequency estimation without tears," Journal of Quantitative Linguistics, vol. 2, no. 3, pp. 217-237, 1995.


Reviewer 1: Gregory Schwartz

In this manuscript, Fahrenberger et al. propose a new scRNA-seq normalization method to more accurately report UMI counts of individual cells. They specifically use a Good-Turing estimator, compared with a more commonly used Maximum Likelihood estimator, to adjust raw UMI counts. Using their own cta-seq, a cell targeted PCR-amplification strategy, as ground truth, they compare their estimator with a traditional size-corrected estimator. Furthermore, they illustrate downstream changes using their method, including changes to clustering results and spatial transcriptomic readouts. The manuscript was a clear read and presents an interesting alternative solution to an often overlooked, but important, problem. However, there are some aspects of the manuscript that need to be addressed. Some major content missing includes comparisons with more widely-used normalization methods throughout the manuscript, and better ground truth data sets in their downstream analysis. Specific comments are as follows:

l. 34: To my knowledge, most groups do not use a single division by total UMI count as the only normalization. Seurat has NormalizeData, but also heavily promotes scTransform, a completely different method. Many use log transform (as I believe was done here), some use quantile transform, others use regression techniques etc. It was odd to see these standard normalizations missing in comparisons. The authors should use such standard procedures to demonstrate the superiority of GT.
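
For reference, the division by total UMI count and the log transform mentioned in this comment can be sketched in a few lines of Python with NumPy. This is an illustrative sketch, not code from the manuscript: the function names and the toy matrix are hypothetical, and the scale factor of 10,000 is assumed to match the default log-normalization in Seurat's NormalizeData.

```python
import numpy as np

def relative_expression(counts):
    """ML estimate: each gene's count divided by the cell's total count.

    counts: 2D array with genes in rows and cells in columns (toy layout).
    Cells with a total count of zero are not handled in this sketch.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0, keepdims=True)
    return counts / totals

def log_normalize(counts, scale_factor=10_000):
    """Scale the relative expression and apply log(1 + x)."""
    return np.log1p(relative_expression(counts) * scale_factor)

# Three genes (rows) measured in two cells (columns)
toy = np.array([[3, 0],
                [1, 10],
                [0, 10]])
rel = relative_expression(toy)
lognorm = log_normalize(toy)
```

Any gene with a count of zero keeps a relative expression of exactly zero under this estimator, which is the behaviour the Good-Turing approach is intended to address.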

l. 42: Is there a justification for the successor function being applied within the frequency ((cg + 1) / total) instead of outside ((cg / total) + 1) as is expected with the Good-Turing estimation?

Furthermore, there is typically a smoothing function for erratic N_cg values, which I would expect with single-cell data. In the methods there is a brief mention of linear smoothing, but that would imply that the GT equation is misleading and oversimplified. The actual equation should be included in the main text to avoid confusion.
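
One common way to stabilise erratic N_cg values is the log-linear smoothing used in Simple Good-Turing (Gale and Sampson, 1995). The Python sketch below is a deliberately simplified variant, assuming a plain least-squares fit of log N_c against log c. It omits the averaging transform and the rule for switching between raw and smoothed values from the original method, and it is not claimed to be the smoothing used in GTestimate. The function name and toy counts are hypothetical.

```python
from collections import Counter

import numpy as np

def smoothed_good_turing(counts):
    """Good-Turing estimates using a log-linear fit S(c) in place of raw N_c.

    Simplified sketch: fits log N_c = a + b * log c by least squares.
    Needs at least two distinct non-zero count values.
    """
    counts = np.asarray(counts)
    total = counts.sum()
    freq_of_freq = Counter(int(c) for c in counts if c > 0)
    c_values = np.array(sorted(freq_of_freq))
    n_values = np.array([freq_of_freq[c] for c in c_values])
    slope, intercept = np.polyfit(np.log(c_values), np.log(n_values), 1)

    def smooth(c):
        # Smoothed frequency of frequency S(c); positive for every c > 0.
        return np.exp(intercept + slope * np.log(c))

    missing_mass = freq_of_freq.get(1, 0) / total
    observed = counts > 0
    estimates = np.zeros(len(counts), dtype=float)
    c_obs = counts[observed]
    estimates[observed] = (c_obs + 1) / total * smooth(c_obs + 1) / smooth(c_obs)
    # Renormalise so observed genes share the mass left after the missing mass.
    estimates[observed] *= (1.0 - missing_mass) / estimates[observed].sum()
    return estimates, missing_mass

counts = [0, 1, 1, 1, 2, 2, 5]
estimates, missing_mass = smoothed_good_turing(counts)
```

Because S(c) is positive for every c > 0, each observed gene receives a non-zero estimate even when no gene was observed exactly c + 1 times, and the estimates plus the missing mass sum to one after renormalisation.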

l. 58: Compared to an average of 16,965 reads per cell, what is the equivalent for the ultra-deep sequencing (not 23 million reads, as that is not a 7.4-fold increase)?

I am not entirely convinced of the use of cta-seq as a ground truth for the cells, especially in comparison with ML. The authors should show that cta-seq has similar UMI and gene count distributions to more popular scRNA-seq technologies (e.g. 10x Chromium) or the application may be specific to cta-seq only.

l. 110: Instead of using unknown classification data sets, there are existing cell-sorted data sets with ground truths (many even on the 10x website). The authors should use these data sets to compare downstream analysis.

l. 125: The spatial transcriptomic results were very subjective, with no statistical hypotheses. The entire manuscript is missing any sort of statistics when comparing methods, which is a major flaw and should be rectified. Here specifically, the color scale stops at 3, but does this carry over to the relative differential expression? The claim is that it is constant, but if they are all greater than 3 then they must be quite variable, so it is surprising to see such a constant value of 0. Maybe the complete color scale should be shown on all figures to clarify this.

From my understanding of the manuscript, the 18 cells for analysis and comparison were chosen based on a typical Seurat analysis. This technique introduces a range of biases into the comparison and makes the argument a bit circular.

For a bias example, the top 2000 most variable genes were used, suggesting that entire classes of genes may be ignored even when highly or lowly expressed, such as housekeeping genes.

There also appear to be many steps that were not entirely justified outside of a "typical analysis", for example excluding a cluster from the analysis (just because it was not that large?), selecting only 18 cells (why 6 from each cluster?), and removing cells with fewer than 1000 expressed genes or over 8% mitochondrial reads (this may be an issue, as it may remove specific cell types or proliferating cells; this should be a bivariate removal with justification). All of these filtering steps reduce the generalizability of GT.

References to Supplementary Figures in the text hyperlink to the main figures, which is confusing. More importantly, the captions of the Supplementary Figures read "Figure" rather than "Supplementary Figure".

Version published to 10.1101/2024.07.02.601501 on bioRxiv
Jul 3, 2024