Information Geometry Reconciles Discrete and Continuous Variation in Single-Cell and Spatial Transcriptomic Analysis
Listed in: Evaluated articles (Arcadia Science)
Abstract
Single-cell and spatial transcriptomics provide high-resolution cellular characterization, yet standard analytical approaches remain theoretically misaligned with the probabilistic nature of the data. After UMI normalization, current pipelines rely on Euclidean or log-transformed Euclidean distance for similarity measurement. Both are fundamentally ill-suited to modeling multinomial count data. Euclidean distance in normalized space overemphasizes high-variance genes, while log transformation inverts this bias at the cost of distorting subtle, continuous expression modulations. Neither approach naturally captures the dual nature of gene expression: discrete presence/absence transitions and continuous quantitative variation. To overcome these limitations, we introduce GAIA (Geometric Analysis from an Information Aspect), an information-geometric framework for cell representation learning and inter-cell similarity measurement. By anchoring analysis in the true probabilistic model, treating cells as multinomial distributions over genes and projecting them onto a statistical manifold, GAIA reconciles the presence/absence effect with the more continuous expression modulations. Mathematically, GAIA exploits the equivalence (up to a constant factor, after a square-root transformation) between Fisher-Rao distance in multinomial space and geodesic distance on the unit hypersphere, a property that enables both theoretical guarantees and computational efficiency. Experiments on synthetic and real scRNA-seq and spatial transcriptomic datasets demonstrate that GAIA preserves robust and consistent cell-to-cell relationships, delineates biologically nuanced sub-types, mitigates batch effects arising from sequencing depth variation, and eliminates the dependence on knowledge-restricted gene selection for learning meaningful cell representations.
Overall, GAIA offers a knowledge-lean, variance-stabilizing framework for analyzing single-cell and spatial transcriptomic data, enhancing discrimination between nuanced cell sub-types and states.
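The equivalence mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not GAIA's implementation, which is not shown on this page; the function names (`sqrt_embed`, `sphere_geodesic`, `fisher_rao`) and the toy count vectors are hypothetical. The sketch assumes the standard convention in which the Fisher-Rao distance between two categorical distributions equals twice the great-circle distance between their element-wise square roots.

```python
import numpy as np

def sqrt_embed(counts):
    """Map non-negative counts to the unit hypersphere via sqrt of proportions."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()   # multinomial proportions, sum to 1
    return np.sqrt(p)           # unit L2 norm because sum(p) == 1

def sphere_geodesic(u, v):
    """Great-circle distance between two unit vectors."""
    # Clip guards against dot products drifting slightly outside [-1, 1].
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def fisher_rao(counts_a, counts_b):
    """Fisher-Rao distance between the proportion vectors of two cells."""
    return 2.0 * sphere_geodesic(sqrt_embed(counts_a), sqrt_embed(counts_b))

# Hypothetical toy cells over four genes.
cell_a = np.array([10, 0, 5, 85])
cell_deep = 10 * cell_a                 # same profile at 10x sequencing depth
cell_b = np.array([0, 10, 5, 85])       # one gene switched off, another switched on

d_depth = fisher_rao(cell_a, cell_deep)  # numerically zero: depth cancels out
d_switch = fisher_rao(cell_a, cell_b)    # strictly positive
```

Because the distance depends only on proportions, rescaling a cell's counts leaves it unchanged, while two cells with disjoint gene support reach the maximum value of π.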
Article activity feed
Information Geometry Reconciles Discrete and Continuous
Dear Authors,
Congratulations on the excellent preprint!
I have a question with regard to the dimensionality reduction step on the square-root transformed sphere. The methodology employs Tangent PCA, which creates a local linearization by projecting points onto the tangent space at the global Fréchet mean. As noted in the text, the Euclidean distance in this tangent plane effectively approximates the geodesic distance for points that are close to the Fréchet mean.
Given this constraint, how does GAIA perform on highly heterogeneous datasets, such as whole-organism or cross-tissue atlases, where distinct cell populations might be located very far from a single, global Fréchet mean on the hypersphere? Does the tangent approximation begin to distort the macro-relationships between highly divergent lineages at the edges of the projection, and have you explored the possibility of using multiple local tangent spaces (or something more clever) to preserve global geometry in these extreme cases?
Thank you for sharing this with the community.
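The concern raised in this comment can be made concrete with a small geometric sketch. It is a toy illustration on the 2-sphere, not an evaluation of GAIA: the helper names (`log_map`, `geodesic`, `distortion`, `point`) and the chosen points are hypothetical, and the base point stands in for a Fréchet mean. It compares the Euclidean distance between two points after projection to the tangent space with their true geodesic distance.

```python
import numpy as np

def log_map(base, x):
    """Riemannian log map on the unit sphere: tangent vector at `base` toward `x`."""
    cos_theta = np.clip(np.dot(base, x), -1.0, 1.0)
    theta = np.arccos(cos_theta)            # geodesic distance from base to x
    residual = x - cos_theta * base         # component of x orthogonal to base
    norm = np.linalg.norm(residual)
    if norm < 1e-12:                        # x coincides with base (antipodes not handled)
        return np.zeros_like(base)
    return theta * residual / norm

def geodesic(x, y):
    """Great-circle distance between two unit vectors."""
    return float(np.arccos(np.clip(np.dot(x, y), -1.0, 1.0)))

def distortion(base, x, y):
    """Ratio of tangent-space Euclidean distance to true geodesic distance."""
    tangent = np.linalg.norm(log_map(base, x) - log_map(base, y))
    return float(tangent / geodesic(x, y))

def point(polar, azimuth):
    """Unit vector at a given polar angle from the z-axis and azimuth."""
    return np.array([np.sin(polar) * np.cos(azimuth),
                     np.sin(polar) * np.sin(azimuth),
                     np.cos(polar)])

base = point(0.0, 0.0)  # stand-in for the Fréchet mean

# Two points close to the base: the tangent plane is nearly exact.
near = distortion(base, point(0.1, 0.0), point(0.1, np.pi / 2))

# Two points a quarter circle from the base: distances are inflated by sqrt(2).
far = distortion(base, point(np.pi / 2, 0.0), point(np.pi / 2, np.pi / 2))
```

In this toy setting the ratio stays within 1% of 1 near the base point and grows to √2 ≈ 1.414 at a polar angle of π/2. Whether real atlases place populations that far from the mean is an empirical question that this sketch does not answer.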