A comparative study of statistical methods for identifying differentially expressed genes in spatial transcriptomics
Abstract
Spatial transcriptomics (ST) provides unprecedented insights into gene expression patterns while retaining spatial context, making it a valuable tool for understanding complex tissue architectures, such as those found in cancers. Seurat, by far the most popular tool for analyzing ST data, uses the Wilcoxon rank-sum test by default for differential expression analysis. However, as a nonparametric method that treats observations as independent and thus disregards spatial correlations, the Wilcoxon test can lead to inflated false positive rates and misleading findings. This limitation highlights the need for a more robust statistical approach that effectively incorporates spatial correlations. To this end, we propose a Generalized Estimating Equations (GEE) framework as a robust solution for differential gene expression analysis in ST. We conducted a comprehensive comparison of the GEE-based tests with existing methods, including the Wilcoxon rank-sum test and the z-test. Extensive simulations showed that, by appropriately accounting for spatial correlations, the GEE test with robust standard errors, referred to as the Independent GEE, achieved superior Type I error control and comparable power relative to other methods. Applications to ST datasets from breast and prostate cancer revealed poorly calibrated p-values and potential false positive findings from the Wilcoxon rank-sum test. Our comparative study, based on simulations and real data applications, suggests that the Independent GEE test is well suited for ST data, offering more accurate identification of biologically relevant gene expression changes and complementing the Wilcoxon rank-sum test. We have implemented the proposed method in the R package “SpatialGEE”, available on GitHub.
Author Summary
Spatial transcriptomics (ST) provides unprecedented insights into gene expression patterns while retaining spatial context, making it a valuable tool for studying complex tissue architectures and disease etiology. Seurat, a widely used software tool for analyzing ST data, relies on the Wilcoxon rank-sum test for differential expression analysis. However, this test ignores spatial correlations, which can lead to poorly controlled false positive rates and misleading findings. This limitation highlights the need for a more robust statistical approach that effectively incorporates spatial correlations. To this end, we propose a Generalized Estimating Equations (GEE) framework as a robust solution for differential gene expression analysis in ST. Extensive simulations showed that, by appropriately accounting for spatial correlations, the GEE-based test achieved superior false positive rate control and comparable power relative to other methods. Applications to ST datasets from breast and prostate cancer revealed potential false positive findings from the Wilcoxon rank-sum test. We recommend the GEE method as a useful complement to the Wilcoxon rank-sum test. We have implemented the proposed method in the R package “SpatialGEE”, available on GitHub.
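
To make the idea of a GEE test with a robust standard error concrete, the following is a minimal sketch of a two-group Wald test that uses an independence working correlation and a cluster-robust (sandwich) variance. It is not the SpatialGEE implementation or its API. The linear (identity-link) model, the function name `independence_gee_wald`, the grouping of spots into clusters, and the toy data are all assumptions made for illustration; the preprint's actual model and cluster definition may differ.

```python
import math
from collections import defaultdict


def independence_gee_wald(y, group, cluster):
    """Two-group Wald test from a linear GEE with independence working
    correlation and a cluster-robust (sandwich) standard error.

    y       : expression values for one gene, one per spot
    group   : 0/1 label per spot (the two groups being compared)
    cluster : cluster ID per spot (hypothetical spatial blocks)
    """
    n = float(len(y))
    n1 = float(sum(group))  # group is 0/1, so sum(g) == sum(g * g)
    det = n * n1 - n1 * n1
    if det == 0:
        raise ValueError("both groups must contain at least one spot")
    # Inverse of the "bread" X^T X for design rows x = (1, g).
    inv = [[n1 / det, -n1 / det], [-n1 / det, n / det]]

    # Under independence the estimating equations reduce to least squares.
    sy = sum(y)
    sgy = sum(g * v for g, v in zip(group, y))
    b0 = inv[0][0] * sy + inv[0][1] * sgy
    b1 = inv[1][0] * sy + inv[1][1] * sgy

    # Sum the score contributions x * residual within each cluster.
    scores = defaultdict(lambda: [0.0, 0.0])
    rss = 0.0
    for v, g, c in zip(y, group, cluster):
        r = v - (b0 + b1 * g)
        rss += r * r
        scores[c][0] += r
        scores[c][1] += g * r

    # "Meat": sum over clusters of the outer product of cluster scores.
    m00 = sum(s[0] * s[0] for s in scores.values())
    m01 = sum(s[0] * s[1] for s in scores.values())
    m11 = sum(s[1] * s[1] for s in scores.values())

    # Slope entry of the sandwich inv * meat * inv.
    a, b = inv[1][0], inv[1][1]
    robust_se = math.sqrt(a * a * m00 + 2 * a * b * m01 + b * b * m11)
    # Model-based SE that treats every spot as independent, for comparison.
    naive_se = math.sqrt(rss / (n - 2) * inv[1][1])

    z = b1 / robust_se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return {"effect": b1, "robust_se": robust_se,
            "naive_se": naive_se, "z": z, "p": p}


# Toy data (made up): four clusters of three spots, two clusters per group,
# with spots in the same cluster sharing a common shift.
y = [5.0, 5.1, 4.9, 7.0, 7.1, 6.9, 6.0, 6.1, 5.9, 8.0, 8.1, 7.9]
group = [0] * 6 + [1] * 6
cluster = ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3

result = independence_gee_wald(y, group, cluster)
print(result)
```

In this toy example the robust standard error is larger than the naive one because spots within a cluster move together, which is the situation in which a test that assumes independent spots tends to overstate significance. With only four clusters the sandwich estimate itself is unreliable, so the numbers are purely illustrative.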