Evaluating the role of pre-training dataset size and diversity on single-cell foundation model performance

Alan DenAdel
Madeline Hughes
Akshaya Thoutam
Anay Gupta
Andrew W. Navia
Nicolo Fusi
Srivatsan Raghavan
Peter S. Winter
Ava P. Amini
Lorin Crawford

Abstract

The success of transformer-based foundation models on natural language and images has motivated their use in single-cell biology. Single-cell foundation models have been trained on increasingly large transcriptomic datasets, scaling from initial studies with 1 million cells to newer atlases with over 100 million cells. This study investigates how pre-training dataset size and diversity affect the performance of single-cell foundation models on both zero-shot and fine-tuned tasks. Using a large corpus of 22.2 million cells, we pre-train 375 models, which we evaluate in 3,750 experiments. Our results show that current methods tend to plateau in performance with pre-training datasets that are only a fraction of the size of that corpus.

Version published to 10.1101/2024.12.13.628448v1 on bioRxiv
Dec 17, 2024

Related articles

A large-scale foundation model for bulk transcriptomes

This article has 5 authors:
1. Boming Kang
2. Rui Fan
3. Meizheng Yi
4. Chunmei Cui
5. Qinghua Cui
This article has no evaluations. Latest version: Jun 17, 2025

Benchmarking DNA Foundation Models for zero-shot variant effect prediction: the role of context, training, and architecture

This article has 4 authors:
1. Ilaria Alfisi
2. Francesca Ciapi
3. Marta Baragli
4. Alberto Magi
This article has no evaluations. Latest version: Jun 19, 2025

Biological Reasoning with Reinforcement Learning through Natural Language Enables Generalizable Zero-Shot Cell Type Annotations

This article has 4 authors:
1. Xi Wang
2. Runzi Tan
3. Bo Wang
4. Simona Cristea
This article has no evaluations. Latest version: Jun 24, 2025
