Heimdall: A Modular Framework for Tokenization in Single-Cell Foundation Models

Ellie Haber
Shahul Alam
Nicholas Ho
Renming Liu
Evan Trop
Shaoheng Liang
Muyu Yang
Spencer Krieger
Jian Ma

Abstract

Foundation models trained on single-cell RNA-sequencing (scRNA-seq) data have rapidly become powerful tools for single-cell analysis. Their performance, however, depends critically on how cells are tokenized into model inputs, a design space that remains poorly understood. Here, we present Heimdall, a comprehensive framework and open-source toolkit for systematically evaluating tokenization strategies in single-cell foundation models (scFMs). Heimdall decomposes each scFM into modular components: a gene identity encoder (F_G), an expression encoder (F_E), and a “cell sentence” constructor (F_C) with submodules (order, sequence, and reduce), enabling fine-grained control and attribution. Using a transformer trained from scratch, we evaluate tokenization strategies for cell type classification across challenging transfer learning settings (cross-tissue, cross-species, and spatial gene-panel shifts) and separately assess reverse perturbation prediction. Tokenization choices show minimal impact in-distribution but are decisive under distribution shift, with F_G and order driving the largest gains and F_E providing additional improvements. Heimdall further shows how existing strategies can be recombined to enhance generalization. By standardizing evaluation and providing an extensive library, Heimdall establishes a foundation for reproducible, systematic exploration of single-cell tokenization and accelerates the development of next-generation scFMs.
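
To make the decomposition in the abstract concrete, the following is a minimal sketch of how a gene identity encoder, an expression encoder, and a cell sentence constructor with order, sequence, and reduce steps might fit together. It is not Heimdall's API. All function names, the random lookup tables, the binning scheme, and the reading of order as expression ranking, sequence as selection of expressed genes with truncation, and reduce as summation are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N_GENES, N_BINS, DIM, MAX_LEN = 6, 4, 8, 4  # toy sizes (assumed)

# Hypothetical lookup tables standing in for learned embeddings.
gene_table = rng.normal(size=(N_GENES, DIM))
bin_table = rng.normal(size=(N_BINS, DIM))


def f_g(gene_ids):
    """Gene identity encoder (assumed form): gene index -> embedding."""
    return gene_table[gene_ids]


def f_e(values):
    """Expression encoder (assumed form): log1p, equal-width bins, lookup."""
    logged = np.log1p(values)
    top = logged.max() if logged.size and logged.max() > 0 else 1.0
    bins = np.minimum((logged / top * N_BINS).astype(int), N_BINS - 1)
    return bin_table[bins]


def order(expr):
    """Assumed 'order' step: gene indices sorted by descending expression."""
    return np.argsort(-expr, kind="stable")


def sequence(gene_ids, expr, max_len=MAX_LEN):
    """Assumed 'sequence' step: keep expressed genes, truncate to max_len."""
    keep = gene_ids[expr[gene_ids] > 0]
    return keep[:max_len]


def reduce(gene_emb, expr_emb):
    """Assumed 'reduce' step: sum the two embeddings into one token each."""
    return gene_emb + expr_emb


def f_c(expr):
    """Cell sentence constructor: compose order, sequence, and reduce."""
    ids = sequence(order(expr), expr)
    return ids, reduce(f_g(ids), f_e(expr[ids]))


# One toy cell with raw counts for six genes.
cell = np.array([0.0, 5.0, 1.0, 0.0, 9.0, 2.0])
token_ids, tokens = f_c(cell)
print(token_ids, tokens.shape)
```

Because each step is a separate function in this sketch, swapping one choice (for example, replacing the ranking in order with a fixed gene order) leaves the others untouched, which is the kind of component-level attribution the abstract describes.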

Version published to 10.1101/2025.11.09.687403 on bioRxiv
Nov 10, 2025

Related articles

Accurate, scalable, and unified single-cell atlas integration with scBIOT

This article has 2 authors:
1. Haihui Zhang
2. Peiwu Qin
This article has no evaluations. Latest version: Jan 19, 2026

GENERator: A Long-Context Generative Genomic Foundation Model

This article has 18 authors:
1. Qiuyi Li
2. Wei Wu
3. Yuanyuan Zhang
4. Zhihao Zhan
5. Ruipu Chen
6. Mingyang Li
7. Kun Fu
8. Junyan Qi
9. Yongzhou Bao
10. Chao Wang
11. Yiheng Zhu
12. Zhiyun Zhang
13. Jian Tang
14. Fuli Feng
15. Jieping Ye
16. Liu Yuwen
17. Hui Xiong
18. Zheng Wang
This article has no evaluations. Latest version: Feb 4, 2026

A Survey on Efficient Protein Language Models

This article has 8 authors:
1. Shouren Wang
2. Debargha Ganguly
3. Vinooth Kulkarni
4. Wang Yang
5. Zhuoran Qiao
6. Daniel Blankenberg
7. Vipin Chaudhary
8. Xiaotian Han
This article has no evaluations. Latest version: Dec 24, 2025
