CellTok: Early-Fusion Multimodal Large Language Model for Single-Cell Transcriptomics via Tokenization

Chuxi Xiao
Haiyang Bian
Yixin Chen
Lei Wei
Xuegong Zhang


Abstract

Single-cell transcriptomic data provide a high-resolution view of cellular states and functions, offering critical insights into development, disease, and tissue heterogeneity. Existing foundation models for single-cell analysis typically embed cells as continuous vectors, which limits their generative flexibility and hinders integration with the knowledge accumulated in large language models (LLMs). In this work, we present CellTok, a multimodal LLM framework for unified analysis of scRNA-seq data and biological text. Each cell is tokenized into discrete codebook tokens via a VQ-VAE, and these tokens are integrated into the LLM’s vocabulary using early fusion. This allows CellTok to process biological and textual inputs autoregressively, leveraging the pretrained knowledge within LLMs to analyze cellular states and interactions. CellTok demonstrates strong performance in cell-level tasks such as annotation and generation, and enables large population-level tasks that were previously unattainable, including accurate prediction of intercellular communication networks, a key mechanism in tissue organization and disease. Furthermore, it retains interactive Q&A capabilities grounded in biomedical knowledge, offering a new paradigm for analyzing complex cellular systems. Our results suggest that the CellTok framework can extend the capabilities of LLMs to single-cell data and point to a direction for building more general mixed-modal models.
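
The tokenization and early-fusion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the codebook values, the text vocabulary size, the helper names (`nearest_code`, `quantize_cell`, `to_llm_token_ids`), and the placeholder text token IDs are all assumptions made for illustration. The sketch shows only the two generic steps: nearest-neighbor lookup in a VQ-VAE-style codebook, and mapping the resulting codes to new token IDs appended after the text vocabulary so that text and cell tokens share one sequence.

```python
# Illustrative sketch only: sizes, names, and values are assumptions,
# not details taken from the CellTok preprint.
from typing import List, Sequence

TEXT_VOCAB_SIZE = 32_000  # hypothetical size of the LLM's original text vocabulary


def nearest_code(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(code: Sequence[float]) -> float:
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def quantize_cell(latents: Sequence[Sequence[float]],
                  codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector of one cell to a discrete codebook index."""
    return [nearest_code(z, codebook) for z in latents]


def to_llm_token_ids(code_indices: Sequence[int],
                     text_vocab_size: int = TEXT_VOCAB_SIZE) -> List[int]:
    """Early fusion: cell codes take new IDs appended after the text vocabulary."""
    return [text_vocab_size + k for k in code_indices]


# Toy codebook with four 2-D entries; a trained VQ-VAE would learn these.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Stand-in for the encoder's latent vectors for a single cell.
cell_latents = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.0]]

codes = quantize_cell(cell_latents, codebook)
cell_token_ids = to_llm_token_ids(codes)
text_token_ids = [17, 256, 981]  # placeholder IDs for a tokenized text prompt
sequence = text_token_ids + cell_token_ids  # one mixed sequence for the LLM

print(codes)     # [1, 2, 0]
print(sequence)  # [17, 256, 981, 32001, 32002, 32000]
```

Because the cell tokens occupy their own ID range, a model with an extended embedding table could in principle both read and generate them with the same next-token objective it uses for text; how CellTok sizes its codebook and trains the extended vocabulary is specified in the preprint itself, not here.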

Version published to 10.1101/2025.10.22.684047 on bioRxiv
Oct 24, 2025

