A systematic assessment of single-cell language model configurations
Abstract
Transformers pre-trained on single-cell transcriptomic data have recently been applied to a series of tasks, earning them the title of foundation models. As all currently published models in this class employ vastly different pre-training strategies, it is impossible to determine which practices drive their success (or failure). Here, we present a large-scale study of pre-training components for single-cell transcriptomic transformers: bento-sc (BENchmarking Transformer-Obtained Single-Cell representations). By isolating (and tuning) parts of the pre-training scheme one by one, we define best practices for single-cell language model (scLM) construction. While comparisons with baselines indicate that scLMs do not yet offer the generational leap in prediction performance promised by many foundation models, we identify key design choices that lead to improved performance. Namely, the best scLMs are obtained by: (1) minimally processing counts at the input level, (2) using reconstruction losses that exploit known count distributions, (3) masking (up to high rates), and (4) combining different pre-training tasks/losses. All code supporting this study is distributed on PyPI and is available at https://github.com/gdewael/bento-sc.
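To make the kind of pre-training components studied here concrete, the sketch below illustrates points (2) and (3): a reconstruction loss that scores raw counts under a negative binomial distribution, evaluated only at masked positions. It is a minimal PyTorch sketch under assumed conventions; the function name, parameterization, and toy tensors are illustrative and are not taken from the bento-sc codebase.

```python
import torch
from torch.distributions import NegativeBinomial

def masked_nb_loss(pred_mean, pred_theta, counts, mask, eps=1e-8):
    """Negative log-likelihood of raw counts under a negative binomial,
    averaged over masked positions only (hypothetical helper)."""
    # PyTorch's NegativeBinomial is parameterized by total_count (theta,
    # the inverse dispersion) and logits, with mean = theta * exp(logits);
    # solve for logits from the model-predicted mean and theta.
    logits = torch.log(pred_mean + eps) - torch.log(pred_theta + eps)
    nb = NegativeBinomial(total_count=pred_theta, logits=logits)
    nll = -nb.log_prob(counts)
    # Only masked entries contribute to the reconstruction objective.
    return (nll * mask).sum() / mask.sum().clamp(min=1)

# Toy example: mask 50% of genes per cell and score the reconstruction.
counts = torch.poisson(torch.full((4, 2000), 3.0))  # raw counts (cells x genes)
mask = (torch.rand_like(counts) < 0.5).float()      # high masking rate
pred_mean = torch.full_like(counts, 3.0)            # model-predicted means
pred_theta = torch.ones_like(counts)                # model-predicted inverse dispersions
loss = masked_nb_loss(pred_mean, pred_theta, counts, mask)
```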