Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters
Abstract
Foundation models have transformed natural language processing and computer vision, yet their potential in single-cell biology, particularly for complex diseases such as cancer, remains underexplored. We present Tahoe-x1 (Tx1), a family of perturbation-trained single-cell foundation models with up to 3 billion parameters. Tx1 is pretrained on large-scale single-cell transcriptomic datasets, including the Tahoe-100M perturbation compendium, and fine-tuned for cancer-relevant tasks. Through architectural optimizations, data loader refinements, and efficient training strategies, Tx1 achieves 3-30x higher compute efficiency than prior implementations of cell-state models. Tx1 jointly learns representations of genes, cells, and compounds using a masked-expression generative objective that incorporates a drug token, enabling flexible adaptation to diverse downstream applications. We evaluate Tx1 across four key disease-relevant benchmarks: (1) prediction of overall and context-specific gene essentiality, (2) identification of genes contributing to the hallmarks of cancer, (3) cell-type classification, and (4) prediction of perturbation responses in held-out cellular contexts. Tx1 achieves state-of-the-art performance across all tasks. We release pretrained checkpoints, training code, and evaluation workflows to accelerate the development of perturbation-trained single-cell foundation models for applications in precision oncology and beyond.
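To make the training objective concrete, the sketch below illustrates one plausible form of a masked-expression generative objective that conditions on a drug token. It is not the authors' implementation: the class name, dimensions, binning scheme, and the choice to prepend a single drug token per cell are all assumptions made for illustration.

```python
# Minimal sketch (not the Tx1 implementation) of a masked-expression objective
# with a drug token: gene tokens combine gene identity and binned expression,
# a per-cell drug token is prepended, and the model reconstructs masked bins.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class MaskedExpressionModel(nn.Module):
    def __init__(self, n_genes=256, n_drugs=100, n_bins=51, d_model=128):
        super().__init__()
        self.gene_emb = nn.Embedding(n_genes, d_model)      # gene identity embedding
        self.expr_emb = nn.Embedding(n_bins + 1, d_model)   # binned expression (+1 for [MASK])
        self.drug_emb = nn.Embedding(n_drugs, d_model)      # compound (drug) token
        self.mask_bin = n_bins                               # index reserved for [MASK]
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_bins)               # predict the expression bin

    def forward(self, gene_ids, expr_bins, drug_ids, mask):
        # Hide the expression of masked genes by swapping in the [MASK] bin.
        expr_in = expr_bins.masked_fill(mask, self.mask_bin)
        tokens = self.gene_emb(gene_ids) + self.expr_emb(expr_in)
        # Prepend one drug token per cell so reconstruction is perturbation-aware.
        drug_tok = self.drug_emb(drug_ids).unsqueeze(1)
        h = self.encoder(torch.cat([drug_tok, tokens], dim=1))
        return self.head(h[:, 1:])                           # logits for gene positions only


# Toy training step: mask ~15% of genes and reconstruct their expression bins.
model = MaskedExpressionModel()
B, G = 4, 256
gene_ids = torch.arange(G).expand(B, G)
expr_bins = torch.randint(0, 51, (B, G))
drug_ids = torch.randint(0, 100, (B,))
mask = torch.rand(B, G) < 0.15
logits = model(gene_ids, expr_bins, drug_ids, mask)
loss = nn.functional.cross_entropy(logits[mask], expr_bins[mask])
loss.backward()
```

In this framing, the drug token lets the same masked-reconstruction loss couple compound identity to the predicted expression state, which is what would allow a single pretrained model to be adapted to perturbation-response tasks downstream.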