Multimodal learning of transcriptomes and text enables interactive single-cell RNA-seq data exploration with natural-language chats

Abstract

Single-cell RNA-seq characterizes biological samples at unprecedented scale and detail, but data interpretation remains challenging. Here we introduce CellWhisperer, a multimodal machine learning model and software that connects transcriptomes and text for interactive single-cell RNA-seq data analysis. CellWhisperer enables chat-based interrogation of transcriptome data in English. To train our model, we created an AI-curated dataset with over a million pairs of RNA-seq profiles and matched textual annotations across a broad range of human biology, and we established a multimodal embedding of matched transcriptomes and text using contrastive learning. Our model enables free-text search and annotation of transcriptome datasets by cell types, states, and other properties in a zero-shot manner and without the need for reference datasets. Moreover, CellWhisperer answers questions about cells and genes in natural-language chats, using a biologically fluent large language model that we fine-tuned to analyze bulk and single-cell transcriptome data across various biological applications. We integrated CellWhisperer with the widely used CELLxGENE browser, allowing users to interactively explore RNA-seq data through an integrated graphical and chat interface. Our method demonstrates a new way of working with transcriptome data, leveraging the power of natural language for single-cell data analysis and establishing an important building block for future AI-based bioinformatics research assistants.
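
To make the two ideas named in the abstract concrete (a shared embedding trained with contrastive learning, and zero-shot annotation by free-text labels), here is a minimal sketch in plain Python. It is not CellWhisperer's implementation: the abstract does not specify the encoders, the loss variant, or any hyperparameters. The sketch assumes a generic symmetric InfoNCE loss over cosine similarities, a temperature of 0.07, and tiny hand-made 2-D vectors standing in for real transcriptome and text embeddings. All function names and labels are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_matrix(cells, texts):
    """Pairwise cosine similarity: rows are cells, columns are texts."""
    cells = [normalize(c) for c in cells]
    texts = [normalize(t) for t in texts]
    return [[sum(a * b for a, b in zip(c, t)) for t in texts] for c in cells]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(cells, texts, temperature=0.07):
    """Symmetric InfoNCE loss over matched (cell, text) pairs.

    The temperature value is an assumption, not taken from the paper.
    """
    sim = cosine_matrix(cells, texts)
    logits = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

def zero_shot_annotate(cell_embeddings, label_embeddings, labels):
    """Assign each cell the label whose text embedding is most similar."""
    sim = cosine_matrix(cell_embeddings, label_embeddings)
    return [labels[max(range(len(labels)), key=lambda j: row[j])] for row in sim]

# Toy stand-ins for embeddings produced by trained encoders (hypothetical values).
toy_cells = [[1.0, 0.1], [0.1, 1.0]]
toy_texts = [[0.9, 0.2], [0.2, 0.9]]
toy_labels = ["T cell", "B cell"]

if __name__ == "__main__":
    print(zero_shot_annotate(toy_cells, toy_texts, toy_labels))
    print(contrastive_loss(toy_cells, toy_texts))
```

The loss is low when each transcriptome embedding lies closest to its own text embedding and high when the pairs are mismatched. That is the property that lets a free-text label be scored against cells without a reference dataset.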
