Cellfm-datasets: A Unified Data Infrastructure for Single-Cell and Spatial Transcriptomics Foundation Model Pretraining

Liluojing Zhang
Jiangshuan Pang
Jing Yan
Wangyang Tang
Yiting Deng
Youzhe He

Abstract

Large-scale cell foundation models are increasingly limited not only by model architecture, but also by the data infrastructure required to repeatedly sample sparse transcriptomic profiles from out-of-core cohorts. AnnData/H5AD has become a standard exchange format for single-cell and spatial omics analysis, yet its HDF5-backed layout is not designed for high-frequency random mini-batch loading under multi-worker and distributed pretraining. We present Cellfm-datasets, a data infrastructure artifact that converts H5AD cohorts into a self-describing compressed sparse row (CSR) memmap layout and exposes the resulting corpus through Hugging Face Dataset and IterableDataset interfaces. The artifact stores a shared gene vocabulary, per-sample metadata, optional spatial coordinates, observation metadata, manifests, and checksums, and reconstructs sparse cell or group records at runtime without dense expansion. A unified sampling abstraction supports random-cell groups, manifest-defined biological regions, and coordinate-based spatial blocks, with deterministic sharding across distributed ranks and data-loader workers. Spatial demonstrations on P14 mouse brain transcriptomics sections illustrate region- and block-level sampling over real anatomical structures. In controlled benchmarks on a public heterogeneous ModelScope scRNA-seq subset, Cellfm-datasets reached 60,571 ± 1,734 samples/s in single-core random loading, scaled to approximately 160,000 samples/s with eight workers, and maintained near-constant process-private memory while reading up to one million cells. By moving sparse single-cell and spatial corpora from model-specific loader code into reusable, validated, and framework-native dataset artifacts, this design may reduce the engineering burden of reproducible cell foundation model pretraining and make repeated training runs, model comparisons, and mixed-modality data reuse easier to standardize.
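To make the storage and sharding ideas in the abstract concrete, here is a minimal sketch of a CSR memmap store with deterministic index sharding. It is an illustration under stated assumptions, not the Cellfm-datasets on-disk format or API: the file names (`indptr.npy`, `indices.npy`, `data.npy`) and all function names are hypothetical, and the strided split over a seeded permutation is one common way to get reproducible, non-overlapping shards across ranks and workers.

```python
import os
import tempfile

import numpy as np


def write_csr_memmap(dirpath, rows):
    """Write rows of (gene_indices, counts) as three flat CSR arrays.

    Hypothetical layout: indptr.npy, indices.npy, data.npy in one directory.
    """
    indptr = np.zeros(len(rows) + 1, dtype=np.int64)
    for i, (idx, _) in enumerate(rows):
        indptr[i + 1] = indptr[i] + len(idx)
    indices = np.concatenate([np.asarray(idx, dtype=np.int32) for idx, _ in rows])
    data = np.concatenate([np.asarray(val, dtype=np.float32) for _, val in rows])
    for name, arr in (("indptr", indptr), ("indices", indices), ("data", data)):
        np.save(os.path.join(dirpath, name + ".npy"), arr)


def open_csr_memmap(dirpath):
    """Open the three arrays as read-only memory maps (no full load into RAM)."""
    names = ("indptr", "indices", "data")
    return tuple(
        np.load(os.path.join(dirpath, n + ".npy"), mmap_mode="r") for n in names
    )


def read_cell(store, i):
    """Return the sparse record of cell i without building a dense gene vector."""
    indptr, indices, data = store
    lo, hi = int(indptr[i]), int(indptr[i + 1])
    # Only the nonzero entries of this one row are copied out of the memmap.
    return np.asarray(indices[lo:hi]), np.asarray(data[lo:hi])


def shard_indices(n_cells, seed, epoch, rank, world_size, worker_id, num_workers):
    """Deterministic, disjoint slice of a shuffled index list per (rank, worker)."""
    # Every process derives the same permutation from (seed, epoch).
    perm = np.random.default_rng([seed, epoch]).permutation(n_cells)
    slot = rank * num_workers + worker_id
    return perm[slot :: world_size * num_workers]


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    # Three toy cells; the second has no detected genes.
    write_csr_memmap(workdir, [([0, 3], [1.0, 2.0]), ([], []), ([2], [5.0])])
    store = open_csr_memmap(workdir)
    for cell in shard_indices(3, seed=0, epoch=0, rank=0, world_size=1,
                              worker_id=0, num_workers=1):
        genes, counts = read_cell(store, int(cell))
        print(int(cell), genes.tolist(), counts.tolist())
```

Because each process recomputes the same permutation and takes a strided slice, no coordination between ranks or workers is needed, and rerunning with the same seed and epoch reproduces the same sample order.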

Code availability

https://github.com/PangJiangShuan/cellfm-datasets

Version published to 10.64898/2026.06.11.731508 on bioRxiv
Jun 14, 2026
