Uncovering Developmental Lineages from Single-cell Data with Contrastive Poincaré Maps
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Embeddings play a central role in single-cell RNA sequencing (scRNA-seq) data analysis by transforming complex gene expression profiles into interpretable, low-dimensional representations. While Euclidean embeddings distort hierarchical relationships in low dimensions, hyperbolic geometry can represent hierarchies accurately in low dimensions. However, existing hyperbolic methods, such as Poincaré Maps (PM), lose accuracy in deeper hierarchies and require extensive feature engineering and memory. We present Contrastive Poincaré Maps (CPM), a scalable approach that reliably preserves inherent hierarchical structures. On synthetic trees with up to five generations and 34,000 individuals, CPM reduces distortion by 99% (1.9 vs. 126.3) and requires 13-fold less memory than PM. We demonstrate CPM’s utility across three case studies: scalable analysis of 116,312 mouse gastrulation cells, accurate reconstruction of hierarchical structure in mouse hematopoiesis, and faithful representation of multi-lineage hierarchies in chicken cardiogenesis. By integrating hyperbolic geometry with contrastive learning, CPM enables scalable, structure-preserving embeddings for developmental scRNA-seq data. Code: https://github.com/NithyaBhasker/ContrastivePoincareMaps
Article activity feed
-
The first important observation is that state-of-the-art approaches, except CPM, fail to produce an embedding for the complete dataset (containing 100,000 cells), due to their reliance on pairwise distances for the computation of embeddings, which scales quadratically in the number of cells
This doesn't feel quite fair, as UMAP and tSNE were designed to handle datasets of this size and have been widely used to generate embeddings for single-cell datasets of this size and larger. Also, I believe at least UMAP is sub-quadratic in the number of samples, as it uses an approximate kNN algorithm that is n log n.
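To make the scaling argument concrete, here is a rough back-of-the-envelope sketch comparing the memory of a dense pairwise distance matrix with that of a k-nearest-neighbor distance list for the 100,000 cells quoted above. The float64 storage and k = 15 (UMAP's default `n_neighbors`) are assumptions for illustration, not figures from the preprint, and neighbor indices are not counted.

```python
# Back-of-the-envelope memory comparison; all constants are illustrative assumptions.
BYTES_PER_FLOAT64 = 8


def dense_pairwise_bytes(n_cells: int) -> int:
    """Memory for a full n x n float64 distance matrix."""
    return n_cells * n_cells * BYTES_PER_FLOAT64


def knn_graph_bytes(n_cells: int, k: int) -> int:
    """Memory for k neighbor distances per cell (indices not counted)."""
    return n_cells * k * BYTES_PER_FLOAT64


n = 100_000  # dataset size quoted in the preprint excerpt
k = 15       # assumed neighborhood size (UMAP's default n_neighbors)

dense_gb = dense_pairwise_bytes(n) / 1e9
knn_mb = knn_graph_bytes(n, k) / 1e6
print(f"dense pairwise matrix: {dense_gb:.0f} GB")  # 80 GB
print(f"kNN distances (k={k}): {knn_mb:.0f} MB")    # 12 MB
```

Under these assumptions the dense matrix is several thousand times larger, which is why methods built on approximate kNN graphs can usually handle datasets of this size.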
-
Figure 3: Space and time complexity analysis.
Minor comment: using a log-log scale for these plots would be helpful, as it would prevent the reference methods (UMAP, tSNE, PHATE) from appearing as a flat line.
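As a small illustration of why log-log axes help: a power law becomes a straight line whose slope equals its exponent, so slow-growing curves stay readable next to fast-growing ones. The sketch below uses hypothetical dataset sizes and idealized cost curves (not measurements from Figure 3) to estimate that slope.

```python
import math


def loglog_slope(ns, ys):
    """Least-squares slope of log(y) against log(n)."""
    xs = [math.log(n) for n in ns]
    ls = [math.log(y) for y in ys]
    mx = sum(xs) / len(xs)
    my = sum(ls) / len(ls)
    num = sum((x - mx) * (l - my) for x, l in zip(xs, ls))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


ns = [1_000, 10_000, 100_000]          # hypothetical dataset sizes
quadratic = [n ** 2 for n in ns]       # stand-in for an O(n^2) method
nlogn = [n * math.log(n) for n in ns]  # stand-in for an O(n log n) method

slope_quadratic = loglog_slope(ns, quadratic)  # 2, up to rounding error
slope_nlogn = loglog_slope(ns, nlogn)          # slightly above 1
print(slope_quadratic, slope_nlogn)
```

On linear axes the second curve would look flat beside the first; on log-log axes both appear as near-straight lines with clearly different slopes.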
-
On synthetic trees with up to 5 generations and 34,000 individuals, CPM cuts distortion by > 99%
It would be helpful to clarify what this claim is based on, as I can't see anything in Figure 2 that indicates a 99% change in any of the metrics between CPM and PM.
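The abstract quotes distortion values of 1.9 vs. 126.3, and the relative reduction those two numbers imply can be checked directly. This is a minimal sketch that assumes these values are the basis of the claim, which the text does not confirm.

```python
# Values as quoted in the abstract: distortion 1.9 (CPM) vs. 126.3 (PM).
cpm_distortion = 1.9
pm_distortion = 126.3

# Relative reduction = 1 - (new / old).
relative_reduction = 1 - cpm_distortion / pm_distortion
print(f"{relative_reduction:.1%}")  # 98.5%
```

On these numbers the reduction is about 98.5%, which is close to, but not strictly greater than, 99%.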
-
The dataset was normalized to 10000 counts per cell, Log1p transformed and filtered to contain 2000 highly variable genes. The first important observation is that state-of-the-art approaches, except CPM
Does marker‑gene expression change monotonically along the CPM geodesic from root to leaf?
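One way this question could be probed is to rank cells by their hyperbolic distance from the root and compute a rank correlation with marker expression. The sketch below is only an illustration: the coordinates and expression values are toy data, using distance from the root as a proxy for position along the geodesic is an assumption, and the Spearman helper ignores ties for brevity.

```python
import math


def _sq_norm(x):
    """Squared Euclidean norm of a coordinate tuple."""
    return sum(c * c for c in x)


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    diff = _sq_norm([a - b for a, b in zip(u, v)])
    return math.acosh(1 + 2 * diff / ((1 - _sq_norm(u)) * (1 - _sq_norm(v))))


def ranks(values):
    """Ranks 0..n-1 (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r


def spearman(xs, ys):
    """Spearman rank correlation without tie handling."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical 2D embedding coordinates and marker expression (toy data).
root = (0.0, 0.0)
cells = [(0.1, 0.0), (0.3, 0.1), (0.5, 0.2), (0.7, 0.3), (0.8, 0.4)]
marker = [0.2, 0.9, 1.5, 2.8, 3.1]

dist = [poincare_distance(root, c) for c in cells]
rho = spearman(dist, marker)
print(rho)  # 1.0 for this strictly increasing toy example
```

A correlation near +1 or -1 would suggest a monotonic trend along the lineage; values near 0 would suggest that expression does not track distance from the root.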
-