Information Geometry Reconciles Discrete and Continuous Variation in Single-Cell and Spatial Transcriptomic Analysis

Jinpu Cai
Yuxuan Wang
Yunhao Qiao
Cheng Wang
Ziqi Rong
Luting Zhou
Haoyang Liu
Meng Jiang
Hongbin Shen
Jingyi Jessica Li
Hongyi Xin

This article has been Reviewed by the following groups

Listed in

Evaluated articles (Arcadia Science)

Abstract

Single-cell and spatial transcriptomics provide high-resolution cellular characterization, yet standard analytical approaches remain theoretically misaligned with the probabilistic nature of the data. After UMI normalization, current pipelines measure similarity with Euclidean distance on either normalized or log-transformed values. Both choices are fundamentally ill-suited to multinomial count data. Euclidean distance in normalized space overemphasizes high-variance genes, whereas log transformation inverts this bias at the cost of distorting subtle, continuous expression modulations. Neither approach naturally captures the dual nature of gene expression: discrete presence/absence transitions and continuous quantitative variation. To overcome these limitations, we introduce GAIA (Geometric Analysis from an Information Aspect), an information-geometric framework for cell representation learning and inter-cell similarity measurement. By anchoring analysis in the true probabilistic model, treating cells as multinomial distributions over genes and projecting cells onto a statistical manifold, GAIA organically reconciles the presence/absence effect with the more continuous expression modulations. Mathematically, GAIA exploits the equivalence between Fisher-Rao distance in multinomial space and geodesic distance on the unit hypersphere, a property that enables both theoretical guarantees and computational efficiency. Experiments on synthetic and real scRNA-seq and spatial transcriptomic datasets demonstrate that GAIA preserves robust and consistent cell-to-cell relationships, delineates biologically nuanced sub-types, mitigates batch effects arising from sequencing depth variation, and eliminates the dependence on knowledge-restricted gene selection for learning meaningful cell representations. Overall, GAIA offers a knowledge-lean, variance-stabilizing framework for analyzing single-cell and spatial transcriptomic data, enhancing discrimination between nuanced cell sub-types and states.
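
The equivalence mentioned in the abstract can be illustrated with a few lines of NumPy. This is a minimal sketch and not GAIA's implementation: the function names `sqrt_embed` and `fisher_rao_distance`, the toy count vectors, and the simple divide-by-total normalization are all assumptions made for illustration. It relies only on the standard closed form for the Fisher-Rao distance between two multinomial parameter vectors p and q, namely 2 * arccos(sum_i sqrt(p_i * q_i)), which is twice the great-circle distance between the square-root vectors on the unit hypersphere.

```python
import numpy as np


def sqrt_embed(counts):
    """Normalize a count vector to proportions and take element-wise square roots.

    The result has unit Euclidean norm, so it lies on the unit hypersphere.
    (Hypothetical helper; plain divide-by-total normalization is assumed.)
    """
    counts = np.asarray(counts, dtype=float)
    proportions = counts / counts.sum()
    return np.sqrt(proportions)


def fisher_rao_distance(counts_a, counts_b):
    """Fisher-Rao distance between two multinomial parameter vectors.

    Computed as 2 * arccos(<sqrt(p), sqrt(q)>), i.e. twice the geodesic
    (great-circle) distance between the embedded points.
    """
    u, v = sqrt_embed(counts_a), sqrt_embed(counts_b)
    cosine = np.clip(np.dot(u, v), -1.0, 1.0)  # guard against rounding error
    return float(2.0 * np.arccos(cosine))


# Toy cells over four genes (illustrative values, not real data).
cell_a = [10, 0, 5, 85]
cell_b = [20, 0, 10, 170]  # same proportions as cell_a at twice the depth
cell_c = [0, 100, 0, 0]    # expresses only a gene that cell_a lacks

d_ab = fisher_rao_distance(cell_a, cell_b)
d_ac = fisher_rao_distance(cell_a, cell_c)
print(f"d(a, b) = {d_ab:.6f}")
print(f"d(a, c) = {d_ac:.6f}")
```

In this toy setting, two cells with identical proportions but different total counts are at distance approximately zero, while two cells with disjoint sets of expressed genes reach the maximum distance of pi. Real data also carry sampling noise, which this sketch does not model.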

Arcadia Science
Feb 28, 2026

Information Geometry Reconciles Discrete and Continuous

Dear Authors,

Congratulations on the excellent preprint!

I have a question with regard to the dimensionality reduction step on the square-root transformed sphere. The methodology employs Tangent PCA, which creates a local linearization by projecting points onto the tangent space at the global Fréchet mean. As noted in the text, the Euclidean distance in this tangent plane effectively approximates the geodesic distance for points that are close to the Fréchet mean.

Given this constraint, how does GAIA perform on highly heterogeneous datasets, like whole-organism or maybe cross-tissue atlases, where distinct cell populations might be located very far from a single, global Fréchet mean on the hypersphere? Does the tangent approximation begin to distort the macro-relationships between highly divergent lineages at the edges of the projection, and have you explored the possibility of using multiple local tangent spaces (or something more clever) to preserve global geometry in these extreme cases?

Thank you for sharing this with the community.
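
The distortion raised in the question can be made concrete on a 2-sphere. The following is a minimal sketch under stated assumptions, not GAIA's Tangent PCA: the helpers `log_map`, `point_at`, and `distortion` are hypothetical names, and the base point is simply fixed at the north pole rather than computed as a Fréchet mean. It uses the standard logarithmic map of the unit sphere and compares the Euclidean distance between two projected points with their true geodesic distance, once for points near the base point and once for points a quarter-circle away.

```python
import numpy as np


def geodesic(x, y):
    """Great-circle distance between two unit vectors."""
    return float(np.arccos(np.clip(np.dot(x, y), -1.0, 1.0)))


def log_map(base, x):
    """Standard log map on the unit sphere: send x to the tangent space at base."""
    cos_t = np.clip(np.dot(base, x), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(base)
    return theta / np.sin(theta) * (x - cos_t * base)


def point_at(theta, phi):
    """Unit vector at polar angle theta from the north pole and azimuth phi."""
    return np.array([
        np.sin(theta) * np.cos(phi),
        np.sin(theta) * np.sin(phi),
        np.cos(theta),
    ])


# Assumed base point for the tangent space (stands in for a Frechet mean).
base = np.array([0.0, 0.0, 1.0])


def distortion(theta):
    """Ratio of tangent-space Euclidean distance to true geodesic distance
    for two points at polar angle theta, separated by 90 degrees of azimuth."""
    x, y = point_at(theta, 0.0), point_at(theta, np.pi / 2)
    tangent = np.linalg.norm(log_map(base, x) - log_map(base, y))
    return float(tangent / geodesic(x, y))


near = distortion(0.1)        # both points close to the base point
far = distortion(np.pi / 2)   # both points a quarter-circle from the base point
print(f"near ratio: {near:.4f}")
print(f"far ratio:  {far:.4f}")
```

Distances from the base point itself are preserved exactly by the log map; it is the distances between pairs of far-away points that inflate. In this configuration the ratio is close to 1 near the base point and reaches sqrt(2) at a quarter-circle, which is the largest angle possible between square-root-embedded cells.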

Version published to 10.64898/2026.02.25.707866 on bioRxiv
Feb 26, 2026

Microenvironment-aware transcriptome reconstruction in spatial transcriptomics

This article has 7 authors:
1. Shi-Tong Yang
2. Pai Peng
3. Hui-Feng He
4. Meng-Guo Wang
5. Bo-Han Si
6. Xiao-Fei Zhang
7. Luonan Chen
This article has no evaluations. Latest version: Jan 13, 2026

Ribosomal DNA copy number variation shapes human physiology and disease risk

This article has 12 authors:
1. Anil Raj
2. Jordan Brown
3. Nathaniel Thayer
4. Manuel Hotz
5. Irene Lam
6. Nicole Fong
7. Elena Sorokin
8. Marjola Thanaj
9. Daphna Rothschild
10. Jonathan Pritchard
11. Maria Barna
12. David Hendrickson
This article has no evaluations. Latest version: Jan 21, 2026

A sensitive and accurate framework for population-scale structural variant discovery and genotyping across sequence types

This article has 4 authors:
1. Xin Wang
2. Guangbao Luo
3. Li Xiao
4. Zhangjun Fei
This article has no evaluations. Latest version: Feb 18, 2026
