A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model

Abstract

Single-cell transcriptomics has revolutionized our understanding of cellular diversity, but integrating this knowledge across evolutionary distances remains challenging. Here we present TranscriptFormer, a family of generative foundation models representing a cross-species generative cell atlas trained on up to 112 million cells spanning 1.53 billion years of evolution across 12 species. TranscriptFormer jointly models genes and transcripts using a novel generative architecture, enabling it to function as a virtual instrument for probing cellular biology. In zero-shot settings, our models demonstrate superior performance on both in-distribution and out-of-distribution cell type classification, with robust performance even for species separated by over 685 million years of evolutionary distance. TranscriptFormer can also perform zero-shot disease state identification in human cells and accurately transfers cell type annotations across species boundaries. Being a generative model, TranscriptFormer can be prompted to predict cell type-specific transcription factors and gene-gene interactions that align with independent experimental observations. This work establishes a powerful framework for integrating and interrogating cellular diversity across species as well as offering a foundation for in silico experimentation with a generative single-cell atlas model.

Article activity feed

  1. where the count matrix $C \in \mathbb{R}^{(M+1) \times (M+1)}$ is constructed by repeating the count vector $c = (1, c_1, c_2, \ldots, c_M)$ across all rows.

    Employing the target 'counts' to define an attention bias introduces apparent circularity, since these 'counts' are precisely what the model aims to predict. This poses a challenge for inference: how would the model predict gene expression levels if the attention bias 'C' must be defined using these same, yet-to-be-predicted, expression levels?
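
The construction in the quoted passage can be sketched in NumPy as follows. This is only an illustration of the row-repetition step: the toy count values and the function name `build_count_matrix` are hypothetical, and the excerpt does not say how the model applies C as an attention bias or what it uses in place of unobserved counts at inference.

```python
import numpy as np

def build_count_matrix(counts):
    """Repeat c = (1, c_1, ..., c_M) across all rows to form an (M+1) x (M+1) matrix."""
    # Prepend the leading 1 from the quoted definition of the count vector c.
    c = np.concatenate(([1.0], np.asarray(counts, dtype=float)))
    # Stack one copy of c per row, giving a square matrix with identical rows.
    return np.tile(c, (c.size, 1))

# Toy counts for M = 3 genes (made-up values, not from the paper).
toy_counts = [3, 0, 5]
C = build_count_matrix(toy_counts)
print(C.shape)  # (4, 4)
print(C[0])     # [1. 3. 0. 5.]
```

Because every row of C is identical, column j carries the count of gene j for every query position, which is the dependence on the target counts that the comment above asks about.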

  2. Despite this technical variation, cell types cluster consistently across species (Fig. 2E), highlighting the biological relevance of the learned embeddings. TranscriptFormer learns to group cells in a biologically relevant fashion by species and cell types (Fig. 2E), without the model being trained or run with species or cell type labels.

    Given the emphasis on model generalizability, the most interesting signal would be embeddings that cluster not by species (which is driven by strong phylogenetic signal) but by other conserved, biologically meaningful differences such as cell type. An explicit quantification of how much of the clustering in UMAP (or another dimensionality reduction method) is explained by species identity versus cell type would make a more convincing case for model generalizability (and for the benefit of including multiple species in the training data). At a glance, the current plots suggest that species identity is the dominant driver of clustering, but it is hard to tell.
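
One possible way to make this comparison explicit is to compute the fraction of embedding variance explained by each label (between-group sum of squares over total sum of squares). The sketch below is a generic illustration under stated assumptions, not the authors' analysis: the function name `variance_explained` is hypothetical, and the embeddings and labels are synthetic toy data standing in for real model output.

```python
import numpy as np

def variance_explained(embeddings, labels):
    """Fraction of total embedding variance explained by a categorical label
    (between-group sum of squares divided by total sum of squares)."""
    X = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    total_ss = ((X - grand_mean) ** 2).sum()
    between_ss = 0.0
    for group in np.unique(labels):
        members = X[labels == group]
        # Each group contributes its size times the squared distance
        # between its centroid and the grand mean.
        between_ss += members.shape[0] * ((members.mean(axis=0) - grand_mean) ** 2).sum()
    return between_ss / total_ss

# Synthetic toy data (assumption): 2 species x 2 cell types, 50 cells each.
rng = np.random.default_rng(0)
species = np.repeat(["human", "mouse"], 100)
cell_type = np.tile(np.repeat(["T cell", "neuron"], 50), 2)

# Hypothetical 2-D embedding in which cell type shifts axis 0 strongly
# and species shifts axis 1 weakly.
X = rng.normal(size=(200, 2))
X[:, 0] += 4.0 * (cell_type == "neuron")
X[:, 1] += 1.0 * (species == "mouse")

r2_species = variance_explained(X, species)
r2_cell_type = variance_explained(X, cell_type)
print(f"species: {r2_species:.2f}, cell type: {r2_cell_type:.2f}")
```

In this toy setup the cell type label explains more variance than the species label by construction; on real embeddings, the two values would indicate which label dominates the structure seen in the plots.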

  3. Despite this ceiling effect, the multi-species variants (TF-Metazoa and TF-Exemplar) performed marginally better than the human-only model (TF-Sapiens) despite the same number of active parameters during inference and identical pretraining protocols,

    Is there some quantification of this? For the challenging cell types, for example, performance seems roughly equivalent between the Metazoa and Exemplar versions of the model. Overall, it is hard to see evidence of a benefit from adding diverged species to the training data.
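
As one way such a quantification could look, the sketch below computes a paired bootstrap confidence interval for the mean per-cell-type score difference between two models. It is a minimal illustration under stated assumptions: the function name `paired_bootstrap_ci` is hypothetical, the scores are made-up toy values rather than results from the paper, and the choice of metric is left open.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Mean paired difference (a - b) with a percentile bootstrap interval,
    resampling cell types with replacement."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    # Each bootstrap replicate draws diffs.size cell types with replacement.
    idx = rng.integers(0, diffs.size, size=(n_boot, diffs.size))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), lo, hi

# Made-up per-cell-type scores for two models (not from the paper).
multi_species = [0.91, 0.84, 0.77, 0.88, 0.69, 0.93]
human_only = [0.90, 0.85, 0.75, 0.88, 0.66, 0.92]

mean_diff, lo, hi = paired_bootstrap_ci(multi_species, human_only)
print(f"mean difference: {mean_diff:.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
```

An interval that includes zero would indicate that the data do not distinguish the two models on this metric; pairing by cell type keeps hard and easy cell types from inflating the variance of the comparison.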

  4. Evidence supporting the role of evolutionary diversity in enhancing model generalization is provided by the superior performance of TF-Metazoa, which was trained on twelve phylogenetically diverse species.

    What are your thoughts on the possibility of homology-based data leakage (e.g. https://www.biorxiv.org/content/10.1101/2025.01.22.634321v1.full)?

    Given the hierarchical relationships generated by evolution, many genes will share some degree of relatedness both among the "phylogenetically diverse species" and with the held-out species. For example, a deeply conserved metazoan gene with little sequence diversity will be a nonindependent data point; its sequence will be pseudoreplicated (possibly its expression too), allowing information leakage between pretraining and validation sets.

    It seems likely that data leakage will happen to varying degrees based on the conservation of each gene. If this is happening to a substantial degree (i.e., the model is learning high copy number/homologous/pseudoreplicated sequences best), then it wouldn't be surprising that performance would scale with evolutionary distance from humans; the amount of shared homology would predict model performance. It also wouldn't be surprising that TF-Metazoa outperforms other models; by having more pseudoreplicated sequences, it provides more opportunities for data leakage and, thus, overfitting.

    Is there a convincing way to show that this isn't the case?