A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model
This article has been Reviewed by the following groups
- Evaluated articles (Arcadia Science)
Abstract
Single-cell transcriptomics has revolutionized our understanding of cellular diversity, yet knowledge of transcriptional programs across the tree of life remains limited. Here we present TranscriptFormer, a family of generative foundation models trained on up to 112 million cells spanning 1.53 billion years of evolution across 12 species. By jointly modeling gene identities and expression levels using a novel generative architecture, TranscriptFormer encodes multi-scale biological structure, functioning as a queryable virtual cell atlas. We demonstrate state-of-the-art performance on both in-distribution and out-of-distribution cell type classification, with robust performance even for species separated by over 685 million years of evolution. TranscriptFormer can also perform zero-shot disease state identification in human cells and accurately transfers cell state annotations across species boundaries. As a generative model, TranscriptFormer can be prompted to predict cell type-specific transcription factors and gene-gene interactions that align with independent experimental observations. Developmental trajectories, phylogenetic relationships and cellular hierarchies emerge naturally in TranscriptFormer’s representations without any explicit training on these annotations. This work establishes a powerful framework for quantitative single-cell analysis and comparative cellular biology, demonstrating that universal principles of cellular organization can be learned and predicted across the tree of life.
Article activity feed
-
where the count matrix C ∈ ℝ^((M+1)×(M+1)) is constructed by repeating the count vector c = (1, c₁, c₂, …, c_M) across all rows.
Employing the target 'counts' to define an attention bias introduces an apparent circularity, since these 'counts' are precisely what the model aims to predict. This poses a challenge for inference: how can the model predict gene expression levels if the attention bias 'C' must be defined using those same, yet-to-be-predicted, expression levels?
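To make the construction (and the concern) concrete, here is a minimal numpy sketch under assumed, hypothetical variable names; it simply broadcasts the count vector across all rows as described in the quoted passage, and the comments note why that vector is unavailable when counts are the generation target:

```python
import numpy as np

# Hypothetical example: M = 4 expressed genes with raw counts c_1..c_M.
# The leading 1 corresponds to the extra token position in c = (1, c_1, ..., c_M).
counts = np.array([1, 7, 3, 12, 5], dtype=float)
M_plus_1 = counts.shape[0]

# C ∈ ℝ^((M+1)×(M+1)): the count vector repeated across all rows,
# so every query position receives the same per-key count bias.
C = np.tile(counts, (M_plus_1, 1))
assert C.shape == (M_plus_1, M_plus_1)

# The circularity concern: at inference time the counts are the quantities
# being predicted, so C cannot be formed this way unless the model either
# (a) conditions on observed counts (e.g. when encoding a real cell), or
# (b) fills C iteratively from its own previously generated counts.
```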
-
Despite this technical variation, cell types cluster consistently across species (Fig. 2E), highlighting the biological relevance of the learned embeddings. TranscriptFormer learns to group cells in a biologically relevant fashion by species and cell types (Fig. 2E), without the model being trained or run with species or cell type labels.
With an emphasis on model generalizability, the most interesting signal one could observe is that embeddings cluster not by species (which is driven by strong phylogenetic signal) but rather by other conserved, biologically meaningful differences such as cell type. An explicit quantification of how much of the clustering in UMAP (or another dimensionality reduction method) is explained by species identity versus cell type would be convincing evidence of model generalizability (and of the benefit of having multiple species in the training data). At a glance, the plots currently suggest that species identity is the dominant driver of clustering, but it is hard to tell.
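One way the suggested quantification could be done (a sketch only; the scikit-learn metrics and the function/argument names are assumptions, not the authors' protocol) is to score the same embedding against species labels and against cell type labels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

def label_vs_embedding_scores(embeddings, species_labels, cell_type_labels, seed=0):
    """Rough quantification of which annotation better explains embedding structure.

    embeddings:       (n_cells, d) array of cell embeddings
    species_labels:   (n_cells,) array of species identities
    cell_type_labels: (n_cells,) array of harmonized cell type annotations
    """
    # Silhouette: how tightly cells sharing a label group together in the embedding.
    sil_species = silhouette_score(embeddings, species_labels)
    sil_celltype = silhouette_score(embeddings, cell_type_labels)

    # ARI against an unsupervised clustering: which labeling the clusters recover.
    k = len(np.unique(cell_type_labels))
    clusters = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(embeddings)
    ari_species = adjusted_rand_score(species_labels, clusters)
    ari_celltype = adjusted_rand_score(cell_type_labels, clusters)

    return {
        "silhouette_species": sil_species,
        "silhouette_cell_type": sil_celltype,
        "ari_species": ari_species,
        "ari_cell_type": ari_celltype,
    }
```

Computing these on the full embedding rather than the 2D UMAP would avoid conflating projection artefacts with the structure the model actually learned.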
-
Despite this ceiling effect, the multi-species variants (TF-Metazoa and TF-Exemplar) performed marginally better than the human-only model (TF-Sapiens) despite the same number of active parameters during inference and identical pretraining protocols,
Is there some quantification of this? For the challenging cell types, for example, performance seems roughly equivalent between the Metazoa and Exemplar versions of the model. Overall it is hard to see evidence of a benefit from adding divergent species to the training data.
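A simple way the claimed margin could be tested (a sketch under the assumption that per-cell-type F1 scores are available for each model variant; the scipy-based test is one reasonable choice, not something reported in the paper) is a paired comparison across cell types:

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_variants(f1_multi_species, f1_human_only):
    """Paired comparison of per-cell-type F1 between two model variants.

    Both inputs are arrays of per-cell-type F1 scores aligned on the same
    cell types (e.g. TF-Metazoa vs TF-Sapiens on the same benchmark).
    """
    f1_multi_species = np.asarray(f1_multi_species, dtype=float)
    f1_human_only = np.asarray(f1_human_only, dtype=float)

    # Median paired difference: the effect size, not just significance.
    delta = np.median(f1_multi_species - f1_human_only)

    # Wilcoxon signed-rank test on the paired per-cell-type differences.
    stat, p_value = wilcoxon(f1_multi_species, f1_human_only)
    return {"median_delta_f1": float(delta), "wilcoxon_p": float(p_value)}
```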
-
Evidence supporting the role of evolutionary diversity in enhancing model generalization is provided by the superior performance of TF-Metazoa, which was trained on twelve phylogenetically diverse species.
What are your thoughts on the possibility of homology-based data leakage (e.g. https://www.biorxiv.org/content/10.1101/2025.01.22.634321v1.full)?
Given the hierarchical relationships generated by evolution, many genes will share some degree of relatedness both among the "phylogenetically diverse species" and with the held-out species. For example, a deeply conserved metazoan gene with little sequence diversity will be a non-independent data point; its sequence will be pseudoreplicated (possibly its expression too), allowing information leakage between pretraining and validation sets.
It seems likely that data leakage will happen to varying degrees based on the conservation of each gene. If this is happening to a substantial degree (i.e., the model is learning high copy number/homologous/pseudoreplicated sequences best), then it wouldn't be surprising that performance would scale with evolutionary distance from humans; the amount of shared homology would predict model performance. It also wouldn't be surprising that TF-Metazoa outperforms other models; by having more pseudoreplicated sequences, it provides more opportunities for data leakage and, thus, overfitting.
Is there a convincing way to show that this isn't the case?
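One hedged way to probe this (a sketch, not the authors' analysis; the orthology annotation, metric, and names below are assumptions) would be to stratify the held-out species' genes by how broadly conserved they are across the pretraining species and check whether performance is driven disproportionately by the deeply conserved, effectively pseudoreplicated fraction:

```python
import numpy as np

def conservation_stratified_performance(per_gene_scores,
                                        n_pretrain_species_with_ortholog,
                                        bins=(0, 4, 8, 13)):
    """Stratify a per-gene performance metric by orthology breadth.

    per_gene_scores:                  dict {gene_id: score}, e.g. per-gene
                                      prediction accuracy in the held-out species
    n_pretrain_species_with_ortholog: dict {gene_id: int}, number of the 12
                                      pretraining species with a detected ortholog
    """
    genes = sorted(per_gene_scores)
    scores = np.array([per_gene_scores[g] for g in genes])
    breadth = np.array([n_pretrain_species_with_ortholog.get(g, 0) for g in genes])

    results = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (breadth >= lo) & (breadth < hi)
        if mask.any():
            results[f"orthologs_in_{lo}_to_{hi - 1}_species"] = float(scores[mask].mean())

    # If scores are flat across strata, deep conservation (pseudoreplication) is
    # unlikely to be the main driver of transfer; if they rise sharply with
    # orthology breadth, homology-based leakage becomes a plausible explanation.
    return results
```

An alternative, stricter control would be to hold out entire ortholog groups rather than species, so that no close homolog of a validation gene appears in pretraining.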
-