Uncovering Developmental Lineages from Single-cell Data with Contrastive Poincaré Maps

Abstract

Single-cell RNA-sequencing (scRNA-seq) enables the study of hierarchical and branching patterns in organismic development at high resolution. Analyzing such data requires visualization and analysis tools that faithfully represent the deep, tree-like structures formed by developmental lineages. Popular Euclidean embedding methods, such as UMAP and t-SNE, as well as domain-specific approaches like PHATE, distort hierarchical relationships in low dimensions, leading to a decrease in performance with growing tree depth. Hyperbolic geometry, which can represent trees with high accuracy in low dimensions, provides a natural remedy. However, existing hyperbolic methods, such as Poincaré Maps (PM), lose accuracy in deeper trees and require extensive feature engineering and memory. We present Contrastive Poincaré Maps (CPM), a self-supervised hyperbolic encoder that leverages contrastive learning in hyperbolic space to efficiently learn robust low-dimensional representations from scRNA-seq data. On synthetic trees with up to 5 generations and 34,000 individuals, CPM cuts distortion by > 99% and requires 13-fold less memory relative to PM. We further demonstrate CPM’s utility on three biological case studies. CPM uncovers accurate hierarchies across 9 developmental stages in the mouse gastrulation dataset comprising 116,312 cells, disentangles global multi-lineage hierarchies in the chicken cardiogenesis dataset while preserving intra-lineage developmental trends, and enables sampling-density-invariant hierarchical analysis in the mouse hematopoiesis dataset. By leveraging hyperbolic geometry in combination with contrastive learning, CPM delivers a scalable framework that preserves hierarchical dependencies in developmental lineages, accelerates exploratory data analysis and opens new avenues for biological insights into developmental processes using scRNA-seq data.

A preliminary version of a part of this work was presented at the ICLR Workshop on Machine Learning for Genomics Explorations (Bhasker et al., 2024).

Article activity feed

  1. The first important observation is that state-of-the-art approaches, except CPM, fail to produce an embedding for the complete dataset (containing 100,000 cells), due to their reliance on pairwise distances for the computation of embeddings, which scales quadratically in the number of cells

    This doesn't feel quite fair: UMAP and t-SNE were designed to handle datasets of this size and are widely used to embed single-cell datasets this large and larger. Also, I believe at least UMAP is sub-quadratic in the number of samples, as it uses an approximate kNN algorithm that scales as n log n.

  2. Figure 3: Space and time complexity analysis.

    Minor comment: using a log-log scale for these plots would be helpful, as it would prevent the reference methods (UMAP, t-SNE, PHATE) from appearing as a flat line.

  3. On synthetic trees with up to 5 generations and 34,000 individuals, CPM cuts distortion by > 99%

    It would be helpful to clarify what this claim is based on, as I can't see anything in Figure 2 that indicates a 99% change in any of the metrics between CPM and PM.

  4. The dataset was normalized to 10000 counts per cell, Log1p transformed and filtered to contain 2000 highly variable genes. The first important observation is that state-of-the-art approaches, except CPM

    Does marker‑gene expression change monotonically along the CPM geodesic from root to leaf?
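
One way to make the question in comment 4 concrete is to rank-correlate each cell's hyperbolic distance from the root with the expression of a marker gene. The sketch below is only an illustration under stated assumptions: it uses toy data rather than anything from the paper, places the root at the origin of the Poincaré disk, and the names `embedding`, `marker`, `poincare_distance` and `spearman` are hypothetical. It is not the authors' analysis.

```python
import numpy as np


def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between points of the open unit ball (Poincare model)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    sq_diff = np.sum((u - v) ** 2, axis=-1)
    denom = (1.0 - np.sum(u ** 2, axis=-1)) * (1.0 - np.sum(v ** 2, axis=-1))
    return np.arccosh(1.0 + 2.0 * sq_diff / np.maximum(denom, eps))


def rank(x):
    """Ranks 0..n-1 of a 1-D array (ties broken by position; enough for a sketch)."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x))
    return ranks


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    return float(np.corrcoef(rank(x), rank(y))[0, 1])


# Toy data (assumption): 50 cells lying on one ray from the root to a leaf.
rng = np.random.default_rng(0)
radii = np.sort(rng.uniform(0.0, 0.95, size=50))
angle = 0.3
embedding = np.stack([radii * np.cos(angle), radii * np.sin(angle)], axis=1)
root = np.zeros(2)  # assumption: the root cell sits at the disk origin

# Hyperbolic "depth" of each cell along the root-to-leaf geodesic.
depth = poincare_distance(embedding, root)

# Hypothetical marker gene whose expression rises with depth, plus small noise.
marker = np.log1p(10.0 * radii) + rng.normal(0.0, 0.01, size=50)

# A value near +1 or -1 indicates a monotonic trend along the lineage.
rho = spearman(depth, marker)
print(f"Spearman correlation between depth and marker expression: {rho:.3f}")
```

With a real embedding, `embedding` would hold the 2-D coordinates of the cells on one lineage and `marker` the normalized expression of a known stage marker. A rank correlation is used here because the question concerns monotonicity, not linearity.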