Heimdall: A Modular Framework for Tokenization in Single-Cell Foundation Models
Abstract
Foundation models trained on single-cell RNA-sequencing (scRNA-seq) data have rapidly become powerful tools for single-cell analysis. Their performance, however, depends critically on how cells are tokenized into model inputs – a design space that remains poorly understood. Here, we present Heimdall, a comprehensive framework and open-source toolkit for systematically evaluating tokenization strategies in single-cell foundation models (scFMs). Heimdall decomposes each scFM into modular components: a gene identity encoder (F_G), an expression encoder (F_E), and a “cell sentence” constructor (F_C) with submodules (order, sequence, and reduce) enabling fine-grained control and attribution. Using a transformer trained from scratch, we evaluate tokenization strategies for cell type classification across challenging transfer learning settings – cross-tissue, cross-species, and spatial gene-panel shifts – and separately assess reverse perturbation prediction. Tokenization choices show minimal impact in-distribution but are decisive under distribution shift, with F_G and order driving the largest gains and F_E providing additional improvements. Heimdall further shows how existing strategies can be recombined to enhance generalization. By standardizing evaluation and providing an extensive library, Heimdall establishes a foundation for reproducible, systematic exploration of single-cell tokenization and accelerates the development of next-generation scFMs.
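To make the decomposition concrete, the following is a minimal toy sketch of how a gene identity encoder (F_G), an expression encoder (F_E), and a cell sentence constructor (F_C) might fit together. It is not the Heimdall API: every function name, the embedding width `DIM`, the binning scheme, and the example gene values are assumptions chosen for illustration. The roles given to order (rank genes by expression), sequence (drop unexpressed genes and truncate), and reduce (sum the two embeddings per token) are one plausible reading of the submodule names in the abstract, not the toolkit's confirmed behavior.

```python
import random
from typing import Dict, List, Tuple

Vector = List[float]
DIM = 4  # toy embedding width (assumption, not a Heimdall setting)


def gene_encoder(genes: List[str], seed: int = 0) -> Dict[str, Vector]:
    """F_G stand-in: one fixed random vector per gene identity."""
    rng = random.Random(seed)
    return {g: [rng.uniform(-1.0, 1.0) for _ in range(DIM)] for g in genes}


def expression_encoder(value: float, max_value: float = 10.0) -> Vector:
    """F_E stand-in: one-hot vector over DIM equal-width expression bins."""
    clipped = min(max(value, 0.0), max_value)
    bin_index = min(int(clipped / max_value * DIM), DIM - 1)
    return [1.0 if i == bin_index else 0.0 for i in range(DIM)]


def order(cell: Dict[str, float]) -> List[Tuple[str, float]]:
    """order stand-in: decreasing expression, ties broken by gene name."""
    return sorted(cell.items(), key=lambda kv: (-kv[1], kv[0]))


def sequence(ranked: List[Tuple[str, float]], max_len: int) -> List[Tuple[str, float]]:
    """sequence stand-in: drop unexpressed genes, then truncate to max_len."""
    return [(g, x) for g, x in ranked if x > 0][:max_len]


def reduce_sum(gene_vec: Vector, expr_vec: Vector) -> Vector:
    """reduce stand-in: element-wise sum of identity and expression vectors."""
    return [a + b for a, b in zip(gene_vec, expr_vec)]


def cell_sentence(
    cell: Dict[str, float], gene_table: Dict[str, Vector], max_len: int = 3
) -> Tuple[List[str], List[Vector]]:
    """F_C stand-in: compose order, sequence, and reduce into model inputs."""
    tokens = sequence(order(cell), max_len)
    names = [g for g, _ in tokens]
    embeddings = [
        reduce_sum(gene_table[g], expression_encoder(x)) for g, x in tokens
    ]
    return names, embeddings


# Hypothetical cell: gene name -> normalized expression value.
cell = {"CD3E": 5.0, "MS4A1": 0.0, "LYZ": 9.0, "NKG7": 2.0, "GNLY": 2.0}
table = gene_encoder(sorted(cell))
names, embeddings = cell_sentence(cell, table)
print(names)  # ['LYZ', 'CD3E', 'GNLY']
```

Because each component is a separate function, any one of them can be swapped (for example, a different ordering rule or a concatenating reduce) while the others stay fixed, which is the kind of controlled comparison the abstract describes.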