A systematic assessment of single-cell language model configurations
Abstract
Transformers pre-trained on single-cell transcriptomic data have recently been applied to a series of tasks, earning them the title of foundation models. As all currently published models in this class employ vastly different pre-training strategies, it is impossible to determine which practices drive their success (or failure). Here, we present a large-scale study of pre-training components for single-cell transcriptomic transformers: bento-sc (BENchmarking Transformer-Obtained Single-Cell representations). By isolating (and tuning) parts of the pre-training scheme one by one, we define best practices for single-cell language model (scLM) construction. While comparisons with baselines indicate that scLMs do not yet offer the generational leap in prediction performance promised by many foundation models, we identify key design choices that lead to improved performance. Namely, the best scLMs are obtained by: (1) minimally processing counts at the input level, (2) using reconstruction losses that exploit known count distributions, (3) masking (up to high rates), and (4) combining different pre-training tasks/losses. All code supporting this study is distributed on PyPI and is available at https://github.com/gdewael/bento-sc.
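To make the kind of pre-training components studied here concrete, the sketch below illustrates points (2) and (3): a reconstruction loss that scores raw counts under a negative binomial distribution, evaluated only at masked positions. It is a minimal PyTorch sketch under assumed conventions; the function name, parameterization, and toy tensors are illustrative and are not taken from the bento-sc codebase.

```python
import torch
from torch.distributions import NegativeBinomial

def masked_nb_loss(pred_mean, pred_theta, counts, mask, eps=1e-8):
    """Negative log-likelihood of raw counts under a negative binomial,
    averaged over masked positions only (hypothetical helper)."""
    # PyTorch's NegativeBinomial is parameterized by total_count (theta,
    # the inverse dispersion) and logits, with mean = theta * exp(logits);
    # solve for logits from the model-predicted mean and theta.
    logits = torch.log(pred_mean + eps) - torch.log(pred_theta + eps)
    nb = NegativeBinomial(total_count=pred_theta, logits=logits)
    nll = -nb.log_prob(counts)
    # Only masked entries contribute to the reconstruction objective.
    return (nll * mask).sum() / mask.sum().clamp(min=1)

# Toy example: mask 50% of genes per cell and score the reconstruction.
counts = torch.poisson(torch.full((4, 2000), 3.0))  # raw counts (cells x genes)
mask = (torch.rand_like(counts) < 0.5).float()      # high masking rate
pred_mean = torch.full_like(counts, 3.0)            # model-predicted means
pred_theta = torch.ones_like(counts)                # model-predicted inverse dispersions
loss = masked_nb_loss(pred_mean, pred_theta, counts, mask)
```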