Fundamental Limitations of Foundation Models in Single-Cell Transcriptomics
Abstract
Recent applications of foundation models in biology have focused on pretraining with large-scale single-cell datasets comprising millions of cells across diverse pathophysiological states. These models are then fine-tuned for downstream tasks such as cell-type classification. In this study, we evaluated three widely used biological foundation models (scGPT, SCMAMBA-2, and Geneformer) and a statistical baseline (Seurat v5) on cell-type classification under Gaussian noise perturbation, using two curated datasets referred to as Myeloid (13k cells) and hPancreas (15k cells). Surprisingly, the baseline performance of the foundation models was inferior to that of the statistical model even without any added perturbation. Although model size can affect performance, Geneformer outperformed scGPT and SCMAMBA-2 in accuracy by 5% on average across both datasets despite having 40% fewer trainable parameters. Nonetheless, the statistical baseline still outperformed Geneformer by 9% in accuracy.

Based on these findings, we hypothesized that the conventional training paradigm used by foundation models for single-cell tasks consistently underperforms statistical models because it lacks essential biological context. To investigate this, we evaluated whether the performance degradation stems from early data-embedding steps, such as binning or gene normalization during tokenization, and from sampling bias. First, to better understand tokenization-related artifacts, we introduced controlled Gaussian noise to gene expression values before tokenization, amplifying the downstream distortions introduced by the tokenization process (all models were trained for the same number of steps with identical hyperparameters). On the Myeloid dataset, after Gaussian noise was applied to 20% of cells, scGPT and SCMAMBA-2 each lost 11% in accuracy while Geneformer lost 8%. This gap may be explained by the models' different encoding methods: scGPT and SCMAMBA-2 use a bin-based tokenization strategy, whereas Geneformer uses rank-value encoding, which normalizes gene expression values with a predetermined encoding constant. Although binning captures general trends in count data, it fails to preserve relative expression at the gene level, resulting in significant information loss.

Applying the three models to the Myeloid dataset also revealed that scGPT's prediction distribution is biased toward cell types overrepresented in the training data while underrepresenting rarer classes: scGPT overpredicts CD14 cells, among the most abundant cell types, by 16%. Geneformer maintains a more stable prediction distribution, overpredicting CD14 cells by only 6%, and outperforms scGPT and SCMAMBA-2 by 26% in macro F1 score (an unweighted metric). Based on these findings, we assert that the lack of contextual encoding in bin-based tokenization contributes to this less nuanced learning. Recent work that integrated cellular ontology during training showed improved performance over both scGPT and Geneformer. Our results underscore a fundamental issue: foundation models lack the critical biological context needed to make the nuanced inferences required for complex biological analyses. The compression of single-cell data from raw counts to embedding vectors can span several orders of magnitude and lead to significant loss of information. As a result, methods must prioritize contextual integration during tokenization to ensure the model retains sufficient information.
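To make the perturbation protocol and the two encoding schemes concrete, a minimal Python sketch follows. It assumes a dense cells x genes count matrix; the noise scale, bin count, and per-gene normalization constants are illustrative placeholders, not the actual settings or implementations used by scGPT, SCMAMBA-2, or Geneformer.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_fraction(X, frac=0.2, sigma=1.0):
    """Add Gaussian noise to a random subset of cells (rows) before tokenization."""
    X = X.astype(float).copy()
    idx = rng.choice(X.shape[0], size=int(frac * X.shape[0]), replace=False)
    X[idx] += rng.normal(0.0, sigma, size=(len(idx), X.shape[1]))
    return np.clip(X, 0, None)  # keep expression values non-negative

def bin_tokenize(x, n_bins=51):
    """Bin-based encoding (scGPT-style sketch): map a cell's nonzero values to quantile bins."""
    tokens = np.zeros_like(x, dtype=int)
    nz = x > 0
    if nz.any():
        edges = np.unique(np.quantile(x[nz], np.linspace(0, 1, n_bins)))
        tokens[nz] = np.digitize(x[nz], edges[1:-1]) + 1  # token 0 reserved for zero counts
    return tokens

def rank_tokenize(x, gene_constants):
    """Rank-value encoding (Geneformer-style sketch): normalize by per-gene constants, then rank genes."""
    norm = x / gene_constants
    order = np.argsort(-norm)            # highest normalized expression first
    return order[norm[order] > 0]        # ranked gene indices, zeros dropped

# Toy example: 5 cells x 4 genes of simulated counts.
X = rng.poisson(3.0, size=(5, 4)).astype(float)
X_noisy = perturb_fraction(X, frac=0.2, sigma=1.0)
# Stand-in for the predetermined per-gene constants mentioned above.
gene_constants = np.maximum(np.median(X, axis=0), 1.0)
print(bin_tokenize(X_noisy[0]))
print(rank_tokenize(X_noisy[0], gene_constants))
```

Note how the binned tokens discard where a value falls within its bin, while the rank encoding preserves the relative ordering of genes within the cell; this is the information-loss contrast the abstract describes.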
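The bias analysis above reports macro F1, which weights every cell type equally regardless of abundance. A brief illustration of why that matters for imbalanced cell-type labels, using scikit-learn; the labels and counts here are invented for demonstration only.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical, deliberately imbalanced labels: many CD14 cells, few of a rare type.
y_true = ["CD14"] * 90 + ["rare"] * 10
y_pred = ["CD14"] * 100  # a classifier that always predicts the dominant class

print(accuracy_score(y_true, y_pred))                               # 0.90: inflated by the dominant class
print(f1_score(y_true, y_pred, average="macro", zero_division=0))   # ~0.47: penalized for missing the rare class
```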