A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model
This article has been Reviewed by the following groups
- Evaluated articles (Arcadia Science)
Abstract
Single-cell transcriptomics has revolutionized our understanding of cellular diversity, yet knowledge of transcriptional programs across the tree of life remains limited. Here we present TranscriptFormer, a family of generative foundation models trained on up to 112 million cells spanning 1.53 billion years of evolution across 12 species. By jointly modeling gene identities and expression levels using a novel generative architecture, TranscriptFormer encodes multi-scale biological structure, functioning as a queryable virtual cell atlas. We demonstrate state-of-the-art performance on both in-distribution and out-of-distribution cell type classification, with robust performance even for species separated by over 685 million years of evolution. TranscriptFormer can also perform zero-shot disease state identification in human cells and accurately transfers cell state annotations across species boundaries. As a generative model, TranscriptFormer can be prompted to predict cell type-specific transcription factors and gene-gene interactions that align with independent experimental observations. Developmental trajectories, phylogenetic relationships and cellular hierarchies emerge naturally in TranscriptFormer’s representations without any explicit training on these annotations. This work establishes a powerful framework for quantitative single-cell analysis and comparative cellular biology, demonstrating that universal principles of cellular organization can be learned and predicted across the tree of life.
Article activity feed
-
where the count matrix C ∈ ℝ^((M+1)×(M+1)) is constructed by repeating the count vector c = (1, c₁, c₂, …, c_M) across all rows.
Employing the target 'counts' to define an attention bias introduces an apparent circularity, since these 'counts' are precisely what the model aims to predict. This poses a challenge for inference: how can the model predict gene expression levels if the attention bias 'C' must be defined using those same, yet-to-be-predicted, expression levels?
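To make the construction (and the concern) concrete, here is a minimal numpy sketch under assumed, hypothetical variable names; it simply broadcasts the count vector across all rows as described in the quoted passage, and the comments note why that vector is unavailable when counts are the generation target:

```python
import numpy as np

# Hypothetical example: M = 4 expressed genes with raw counts c_1..c_M.
# The leading 1 corresponds to the extra token position in c = (1, c_1, ..., c_M).
counts = np.array([1, 7, 3, 12, 5], dtype=float)
M_plus_1 = counts.shape[0]

# C ∈ ℝ^((M+1)×(M+1)): the count vector repeated across all rows,
# so every query position receives the same per-key count bias.
C = np.tile(counts, (M_plus_1, 1))
assert C.shape == (M_plus_1, M_plus_1)

# The circularity concern: at inference time the counts are the quantities
# being predicted, so C cannot be formed this way unless the model either
# (a) conditions on observed counts (e.g. when encoding a real cell), or
# (b) fills C iteratively from its own previously generated counts.
```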
-
Despite this technical variation, cell types cluster consistently across species (Fig. 2E), highlighting the biological relevance of the learned embeddings. TranscriptFormer learns to group cells in a biologically relevant fashion by species and cell types (Fig. 2E), without the model being trained or run with species or cell type labels.
With an emphasis on model generalizability, the most interesting signal one could observe is that embeddings cluster not by species (which is driven by strong phylogenetic signal) but rather by other conserved, biologically meaningful differences such as cell type. An explicit quantification of how much of the clustering in UMAP (or another dimensionality reduction method) is explained by species identity versus cell type would be convincing evidence of model generalizability (and of the benefit of having multiple species in the training data). At a glance, the plots currently suggest that species identity is the dominant driver of clustering, but it is hard to tell.
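One way the suggested quantification could be done (a sketch only; the scikit-learn metrics and the function/argument names are assumptions, not the authors' protocol) is to score the same embedding against species labels and against cell type labels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

def label_vs_embedding_scores(embeddings, species_labels, cell_type_labels, seed=0):
    """Rough quantification of which annotation better explains embedding structure.

    embeddings:       (n_cells, d) array of cell embeddings
    species_labels:   (n_cells,) array of species identities
    cell_type_labels: (n_cells,) array of harmonized cell type annotations
    """
    # Silhouette: how tightly cells sharing a label group together in the embedding.
    sil_species = silhouette_score(embeddings, species_labels)
    sil_celltype = silhouette_score(embeddings, cell_type_labels)

    # ARI against an unsupervised clustering: which labeling the clusters recover.
    k = len(np.unique(cell_type_labels))
    clusters = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(embeddings)
    ari_species = adjusted_rand_score(species_labels, clusters)
    ari_celltype = adjusted_rand_score(cell_type_labels, clusters)

    return {
        "silhouette_species": sil_species,
        "silhouette_cell_type": sil_celltype,
        "ari_species": ari_species,
        "ari_cell_type": ari_celltype,
    }
```

Computing these on the full embedding rather than the 2D UMAP would avoid conflating projection artefacts with the structure the model actually learned.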
-
Despite this ceiling effect, the multi-species variants (TF-Metazoa and TF-Exemplar) performed marginally better than the human-only model (TF-Sapiens) despite the same number of active parameters during inference and identical pretraining protocols,
Is there some quantification of this? For the challenging cell types, for example, performance seems roughly equivalent between the Metazoa and Exemplar versions of the model. Overall it is hard to see evidence of a benefit from adding divergent species to the training data.
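A simple way the claimed margin could be tested (a sketch under the assumption that per-cell-type F1 scores are available for each model variant; the scipy-based test is one reasonable choice, not something reported in the paper) is a paired comparison across cell types:

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_variants(f1_multi_species, f1_human_only):
    """Paired comparison of per-cell-type F1 between two model variants.

    Both inputs are arrays of per-cell-type F1 scores aligned on the same
    cell types (e.g. TF-Metazoa vs TF-Sapiens on the same benchmark).
    """
    f1_multi_species = np.asarray(f1_multi_species, dtype=float)
    f1_human_only = np.asarray(f1_human_only, dtype=float)

    # Median paired difference: the effect size, not just significance.
    delta = np.median(f1_multi_species - f1_human_only)

    # Wilcoxon signed-rank test on the paired per-cell-type differences.
    stat, p_value = wilcoxon(f1_multi_species, f1_human_only)
    return {"median_delta_f1": float(delta), "wilcoxon_p": float(p_value)}
```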
-
Evidence supporting the role of evolutionary diversity in enhancing model generalization is provided by the superior performance of TF-Metazoa, which was trained on twelve phylogenetically diverse species.
What are your thoughts on the possibility of homology-based data leakage (e.g. https://www.biorxiv.org/content/10.1101/2025.01.22.634321v1.full)?
Given the hierarchical relationships generated by evolution, many genes will share some degree of relatedness both among the "phylogenetically diverse species" and with the held-out species. For example, a deeply conserved metazoan gene with little sequence diversity will be a non-independent data point; its sequence will be pseudoreplicated (possibly its expression too), allowing information leakage between pretraining and validation sets.
It seems likely that data leakage will happen to varying degrees based on the conservation of each gene. If this is happening to a substantial degree (i.e., the model is learning high copy number/homologous/pseudoreplicated sequences best), then it wouldn't be surprising that performance would scale with evolutionary distance from humans; the amount of shared homology would predict model performance. It also wouldn't be surprising that TF-Metazoa outperforms other models; by having more pseudoreplicated sequences, it provides more opportunities for data leakage and, thus, overfitting.
Is there a convincing way to show that this isn't the case?
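One hedged way to probe this (a sketch, not the authors' analysis; the orthology annotation, metric, and names below are assumptions) would be to stratify the held-out species' genes by how broadly conserved they are across the pretraining species and check whether performance is driven disproportionately by the deeply conserved, effectively pseudoreplicated fraction:

```python
import numpy as np

def conservation_stratified_performance(per_gene_scores,
                                        n_pretrain_species_with_ortholog,
                                        bins=(0, 4, 8, 13)):
    """Stratify a per-gene performance metric by orthology breadth.

    per_gene_scores:                  dict {gene_id: score}, e.g. per-gene
                                      prediction accuracy in the held-out species
    n_pretrain_species_with_ortholog: dict {gene_id: int}, number of the 12
                                      pretraining species with a detected ortholog
    """
    genes = sorted(per_gene_scores)
    scores = np.array([per_gene_scores[g] for g in genes])
    breadth = np.array([n_pretrain_species_with_ortholog.get(g, 0) for g in genes])

    results = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (breadth >= lo) & (breadth < hi)
        if mask.any():
            results[f"orthologs_in_{lo}_to_{hi - 1}_species"] = float(scores[mask].mean())

    # If scores are flat across strata, deep conservation (pseudoreplication) is
    # unlikely to be the main driver of transfer; if they rise sharply with
    # orthology breadth, homology-based leakage becomes a plausible explanation.
    return results
```

An alternative, stricter control would be to hold out entire ortholog groups rather than species, so that no close homolog of a validation gene appears in pretraining.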
-