Assessing Scale and Predictive Diversity in Models for Single-Cell Transcriptomics based on Geneformer
Abstract
Foundation models are increasingly applied to single-cell transcriptomics, where they promise to capture generalizable representations that support diverse downstream analyses. However, two central questions remain: does scaling pre-training data reliably improve performance, and do models trained on rank-ordered expression profiles confer advantages for mitigating batch effects? We addressed these questions by systematically assessing transformer-based models pre-trained on ranked single-cell profiles at varying data scales. The models were evaluated on masked gene prediction and on downstream tasks, including cell type classification, perturbation response prediction, and zero-shot batch integration. To complement prediction accuracy, we further quantified prediction repetition, uniqueness, and diversity. In addition, we evaluated architectural refinements that incorporate cumulative prediction adjustment and similarity-based regularization; however, the study focused mainly on comparative evaluation rather than the development of new models. Our results indicated that scaling pre-training corpora improved masked prediction accuracy but did not consistently enhance downstream performance. Smaller models often matched or exceeded larger ones, indicating diminishing returns to scale. Rank-based models offered limited robustness to batch effects and consistently underperformed domain-specific correction methods. Across all scales, high redundancy in predicted genes remained a major limitation. Together, these findings challenge the assumption that larger datasets or rank-order modeling automatically confer stronger generalization. Progress in single-cell foundation models may depend less on scale and more on pre-training objectives that enhance predictive diversity and biological plausibility.
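The abstract refers to quantifying prediction repetition, uniqueness, and diversity over the genes a model predicts at masked positions. The exact definitions used in the study are not given here; the following minimal Python sketch illustrates one plausible way such metrics could be computed from a flat list of predicted gene IDs (the function name and metric formulas are illustrative assumptions, not the paper's method).

```python
# Hypothetical sketch of repetition/uniqueness/diversity metrics over
# predicted gene IDs; definitions are assumptions, not the paper's own.
from collections import Counter
import math

def prediction_metrics(predicted_genes):
    """predicted_genes: list of gene IDs predicted at masked positions."""
    counts = Counter(predicted_genes)
    n = len(predicted_genes)

    # Uniqueness: fraction of predictions that are distinct genes.
    uniqueness = len(counts) / n

    # Repetition: share of predictions taken by the single most
    # frequently predicted gene (higher = more redundant output).
    repetition = counts.most_common(1)[0][1] / n

    # Diversity: Shannon entropy of the predicted-gene distribution,
    # normalized by the maximum entropy over the distinct genes seen.
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0
    diversity = entropy / max_entropy

    return {"uniqueness": uniqueness,
            "repetition": repetition,
            "diversity": diversity}

# Example: a highly redundant set of predictions
print(prediction_metrics(["MALAT1"] * 8 + ["ACTB", "GAPDH"]))
```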
Author summary
In recent years, large machine learning models have been adapted to study single cells, raising hopes that simply training on more data would automatically lead to better biological insights. Our study aimed to test this assumption by carefully evaluating models trained on small versus very large collections of single-cell data. We also asked whether models that learn from ranked gene expression profiles offer an advantage in correcting technical differences between experiments, a common challenge in biology.
We found that bigger is not always better. Although larger models improved performance in predicting missing genes, this advantage did not consistently translate to stronger results in tasks such as cell type classification or data integration across studies. In fact, smaller models often matched or outperformed their larger counterparts. Models trained on ranked data were also not sufficient to reliably correct batch effects, and traditional tools still worked better for this purpose.
Our findings suggest that future progress may come less from ever-larger models and more from rethinking how models learn from biological data. This work provides guidance for researchers designing new approaches and helps clarify realistic expectations for applying artificial intelligence to single-cell biology.