Assessing Scale and Predictive Diversity in Models for Single-Cell Transcriptomics based on Geneformer
Abstract
Foundation models are increasingly applied to single-cell transcriptomics, where they promise to capture generalizable representations that support diverse downstream analyses. However, two central questions remain: does scaling pre-training data reliably improve performance, and do models trained on rank-ordered expression profiles confer advantages for mitigating batch effects? We addressed these questions by systematically assessing transformer-based models pre-trained on ranked single-cell profiles at varying data scales. The models were evaluated on masked gene prediction and on downstream tasks, including cell type classification, perturbation response prediction, and zero-shot batch integration. To complement prediction accuracy, we further quantified prediction repetition, uniqueness, and diversity. In addition, we evaluated architectural refinements that incorporate cumulative prediction adjustment and similarity-based regularization; however, the study focused mainly on comparative evaluation rather than the development of new models. Our results indicated that scaling pre-training corpora improved masked prediction accuracy but did not consistently enhance downstream performance. Smaller models often matched or exceeded larger ones, indicating diminishing returns to scale. Rank-based models offered limited robustness to batch effects and consistently underperformed domain-specific correction methods. Across all scales, high redundancy in predicted genes remained a major limitation. Together, these findings challenge the assumption that larger datasets or rank-order modeling automatically confer stronger generalization. Progress in single-cell foundation models may depend less on scale and more on pre-training objectives that enhance predictive diversity and biological plausibility.
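The abstract refers to quantifying prediction repetition, uniqueness, and diversity over the genes a model predicts at masked positions. The exact definitions used in the study are not given here; the following minimal Python sketch illustrates one plausible way such metrics could be computed from a flat list of predicted gene IDs (the function name and metric formulas are illustrative assumptions, not the paper's method).

```python
# Hypothetical sketch of repetition/uniqueness/diversity metrics over
# predicted gene IDs; definitions are assumptions, not the paper's own.
from collections import Counter
import math

def prediction_metrics(predicted_genes):
    """predicted_genes: list of gene IDs predicted at masked positions."""
    counts = Counter(predicted_genes)
    n = len(predicted_genes)

    # Uniqueness: fraction of predictions that are distinct genes.
    uniqueness = len(counts) / n

    # Repetition: share of predictions taken by the single most
    # frequently predicted gene (higher = more redundant output).
    repetition = counts.most_common(1)[0][1] / n

    # Diversity: Shannon entropy of the predicted-gene distribution,
    # normalized by the maximum entropy over the distinct genes seen.
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0
    diversity = entropy / max_entropy

    return {"uniqueness": uniqueness,
            "repetition": repetition,
            "diversity": diversity}

# Example: a highly redundant set of predictions
print(prediction_metrics(["MALAT1"] * 8 + ["ACTB", "GAPDH"]))
```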
Author summary
In recent years, large machine learning models have been adapted to study single cells, raising hopes that simply training on more data would automatically lead to better biological insights. Our study aimed to test this assumption by carefully evaluating models trained on small versus very large collections of single-cell data. We also asked whether models that learn from ranked gene expression profiles offer an advantage in correcting technical differences between experiments, a common challenge in biology.
We found that bigger is not always better. Although larger models improved performance in predicting missing genes, this advantage did not consistently translate to stronger results in tasks such as cell type classification or data integration across studies. In fact, smaller models often matched or outperformed their larger counterparts. Models trained on ranked data were also not sufficient to reliably correct batch effects, and traditional tools still worked better for this purpose.
Our findings suggest that future progress may come less from ever-larger models and more from rethinking how models learn from biological data. This work provides guidance for researchers designing new approaches and helps clarify realistic expectations for applying artificial intelligence to single-cell biology.