A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Single-cell transcriptomics has revolutionized our understanding of cellular diversity, but integrating this knowledge across evolutionary distances remains challenging. Here we present TranscriptFormer, a family of generative foundation models representing a cross-species generative cell atlas trained on up to 112 million cells spanning 1.53 billion years of evolution across 12 species. TranscriptFormer jointly models genes and transcripts using a novel generative architecture, enabling it to function as a virtual instrument for probing cellular biology. In zero-shot settings, our models demonstrate superior performance on both in-distribution and out-of-distribution cell type classification, with robust performance even for species separated by over 685 million years of evolutionary distance. TranscriptFormer can also perform zero-shot disease state identification in human cells and accurately transfer cell type annotations across species boundaries. Being a generative model, TranscriptFormer can be prompted to predict cell type-specific transcription factors and gene-gene interactions that align with independent experimental observations. This work establishes a powerful framework for integrating and interrogating cellular diversity across species as well as offering a foundation for in silico experimentation with a generative single-cell atlas model.
Article activity feed
-
Evidence supporting the role of evolutionary diversity in enhancing model generalization is provided by the superior performance of TF-Metazoa, which was trained on twelve phylogenetically diverse species.
What are your thoughts on the possibility of homology-based data leakage (e.g. https://www.biorxiv.org/content/10.1101/2025.01.22.634321v1.full)?
Given the hierarchical relationships generated by evolution, many genes will share some degree of relatedness both among the "phylogenetically diverse species" and with the held-out species. For example, a deeply conserved metazoan gene with little sequence diversity will be a nonindependent data point; its sequence will be pseudoreplicated (possibly its expression too), allowing information leakage between pretraining and validation sets.
It seems likely that data leakage will happen to varying degrees based on the conservation of each gene. If this is happening to a substantial degree (i.e., the model is learning high copy number/homologous/pseudoreplicated sequences best), then it wouldn't be surprising that performance would scale with evolutionary distance from humans; the amount of shared homology would predict model performance. It also wouldn't be surprising that TF-Metazoa outperforms other models; by having more pseudoreplicated sequences, it provides more opportunities for data leakage and, thus, overfitting.
Is there a convincing way to show that this isn't the case?
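One check that might help, offered only as a rough sketch and not as the authors' method: hold out whole homology groups instead of whole species, so that no orthogroup has members on both sides of the split, and compare performance against the species-level split. The gene-to-orthogroup mapping, the orthogroup IDs, and the function names below are all hypothetical and exist only for illustration.

```python
import random
from collections import defaultdict


def split_by_orthogroup(gene_to_group, test_fraction=0.25, seed=0):
    """Assign whole orthogroups to train or test so homologs never straddle the split."""
    groups = defaultdict(list)
    for gene, group in gene_to_group.items():
        groups[group].append(gene)

    group_ids = sorted(groups)               # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(group_ids)
    n_test = max(1, round(len(group_ids) * test_fraction))

    test = sorted(g for gid in group_ids[:n_test] for g in groups[gid])
    train = sorted(g for gid in group_ids[n_test:] for g in groups[gid])
    return train, test


def leaked_groups(train, test, gene_to_group):
    """Return the orthogroups that have members on both sides of a split."""
    return {gene_to_group[g] for g in train} & {gene_to_group[g] for g in test}


# Toy, hypothetical mapping of species-prefixed genes to orthogroup IDs.
gene_to_group = {
    "human_TP53": "OG1", "mouse_Trp53": "OG1", "zebrafish_tp53": "OG1",
    "human_ACTB": "OG2", "mouse_Actb": "OG2",
    "human_GENE_X": "OG3",
    "fly_gene_y": "OG4",
}

# Homology-aware split: no orthogroup appears in both train and test.
train, test = split_by_orthogroup(gene_to_group)

# Species-level split (train on human and mouse, hold out the rest):
# OG1 has members on both sides, which is the leakage described above.
naive_train = [g for g in gene_to_group if g.startswith(("human", "mouse"))]
naive_test = [g for g in gene_to_group if not g.startswith(("human", "mouse"))]

print(leaked_groups(train, test, gene_to_group))
print(leaked_groups(naive_train, naive_test, gene_to_group))
```

If performance on held-out species holds up when evaluation is restricted to orthogroups absent from pretraining, that would argue against leakage as the main driver; a sharp drop would suggest shared homology is doing much of the work.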