CellTok: Early-Fusion Multimodal Large Language Model for Single-Cell Transcriptomics via Tokenization


Abstract

Single-cell transcriptomic data provide a high-resolution view of cellular states and functions, offering critical insights into development, disease, and tissue heterogeneity. Existing foundation models for single-cell analysis typically embed cells as continuous vectors, which limits their generative flexibility and hinders integration with the knowledge accumulated in large language models (LLMs). In this work, we present CellTok, a multimodal LLM framework for unified analysis of scRNA-seq data and biological text. Each cell is tokenized into discrete codebook tokens via a VQ-VAE, and these tokens are integrated into the LLM's vocabulary using early fusion. This allows CellTok to process biological and textual inputs autoregressively, leveraging the pretrained knowledge within LLMs to analyze cellular states and interactions. CellTok demonstrates strong performance in cell-level tasks such as annotation and generation, and enables previously unattainable large population-level tasks, including accurate prediction of intercellular communication networks, a key mechanism in tissue organization and disease. Furthermore, it retains interactive Q&A capabilities grounded in biomedical knowledge, offering a new paradigm for analyzing complex cellular systems. Our results suggest that the CellTok framework can extend the capabilities of LLMs to single-cell data, and they point to a direction for building more general mixed-modal models.
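
To make the tokenization and early-fusion idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a cell has already been encoded into a few latent vectors (the VQ-VAE encoder is omitted), maps each vector to its nearest codebook entry, and then shifts the resulting code indices past the text vocabulary so that text and cell tokens share one ID space. The codebook values, the vocabulary size, and the begin/end-of-cell marker IDs are all hypothetical choices made for illustration; the abstract does not specify them.

```python
# Illustrative sketch only: every constant below is an assumption.

TEXT_VOCAB_SIZE = 1000                 # hypothetical size of the LLM text vocabulary
CODEBOOK = [                           # hypothetical 4-entry, 2-D codebook
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
BOC_ID = TEXT_VOCAB_SIZE + len(CODEBOOK)   # hypothetical "begin of cell" marker
EOC_ID = BOC_ID + 1                        # hypothetical "end of cell" marker


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared Euclidean)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def quantize_cell(latents, codebook):
    """Vector-quantize each latent vector of a cell into a discrete code index."""
    return [nearest_code(v, codebook) for v in latents]


def fuse(text_ids, cell_codes):
    """Early fusion: offset cell codes into the shared vocabulary and
    append them to the text token IDs as one autoregressive sequence."""
    cell_ids = [TEXT_VOCAB_SIZE + c for c in cell_codes]
    return text_ids + [BOC_ID] + cell_ids + [EOC_ID]


# Stand-in for encoder output for one cell (hypothetical values).
latents = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]
cell_codes = quantize_cell(latents, CODEBOOK)

# Stand-in for a tokenized text prompt (hypothetical IDs).
text_ids = [17, 42]
sequence = fuse(text_ids, cell_codes)
print(cell_codes, sequence)
```

Because the cell tokens occupy ordinary vocabulary slots, a sequence like this could in principle be fed to a standard next-token predictor without a separate cell encoder at inference time, which is the property the abstract attributes to early fusion.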
