scGenePT: Is language all you need for modeling single-cell perturbations?

Ana-Maria Istrate
Donghui Li
Theofanis Karaletsos

Abstract

Modeling single-cell perturbations is a crucial task in single-cell biology. Predicting how up- or down-regulating a gene, or applying a drug treatment, changes the gene expression profile of a cell can open avenues for understanding biological mechanisms and potentially treating disease. Most foundation models for single-cell biology learn from scRNA-seq counts, using experimental data as the modality from which gene representations are generated. The scientific literature likewise holds a wealth of information that can be used to generate gene representations, with language rather than experimental data as the modality. In this work, we study the effect of using both language and experimental data in modeling genes for perturbation prediction. We show that textual representations of genes provide additive and complementary value to gene representations learned from experimental data alone in predicting perturbation outcomes for single-cell data. We find that textual representations alone are not as powerful as gene representations learned from experimental data, but can serve as useful prior information. We also show that different types of scientific knowledge, represented as language, induce different types of prior knowledge. For example, in the datasets we study, subcellular location helps the most for predicting the effect of single-gene perturbations, while protein information helps the most for modeling the interaction effects of perturbing combinations of genes. We validate our findings by extending the popular scGPT model, a foundation model trained on scRNA-seq counts, to incorporate language embeddings at the gene level. We start with the NCBI gene card and UniProt protein summaries used in the genePT approach and add gene function annotations from the Gene Ontology (GO). We name our model “scGenePT”, reflecting the combination of ideas from these two models. Our work sheds light on the value of integrating multiple sources of knowledge in modeling single-cell data, highlighting the role of language in enhancing biological representations learned from experimental data.
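
The abstract describes incorporating language embeddings "at the gene level" alongside representations learned from scRNA-seq counts, but it does not spell out the fusion mechanism. The following is a minimal sketch of one plausible reading, assuming a linear projection of the text embedding followed by an element-wise sum. The function names, dimensions, and random values are hypothetical and are not taken from scGPT, genePT, or scGenePT.

```python
import random


def project(text_emb, weights):
    """Linearly map a text embedding (length d_text) to the model dimension.

    `weights` is a d_model x d_text matrix stored as a list of rows.
    """
    return [sum(w * x for w, x in zip(row, text_emb)) for row in weights]


def fuse_gene_embedding(count_emb, text_emb, weights):
    """Element-wise sum of a count-derived gene embedding and a projected
    text embedding. Additive fusion is an assumption made for illustration."""
    projected = project(text_emb, weights)
    if len(projected) != len(count_emb):
        raise ValueError("projected text embedding must match the model dimension")
    return [c + p for c, p in zip(count_emb, projected)]


# Toy example with hypothetical dimensions and random values (not real embeddings).
rng = random.Random(0)
d_model, d_text = 4, 6

# Stands in for a gene embedding learned from scRNA-seq counts.
count_emb = [rng.gauss(0, 1) for _ in range(d_model)]
# Stands in for an embedded gene description (e.g. a gene summary or GO annotation).
text_emb = [rng.gauss(0, 1) for _ in range(d_text)]
# Stands in for a learned projection matrix.
weights = [[rng.gauss(0, 0.1) for _ in range(d_text)] for _ in range(d_model)]

fused = fuse_gene_embedding(count_emb, text_emb, weights)
print(len(fused))
```

Under this assumption the text embedding acts as an additive prior: if the projected text embedding is zero, the fused representation reduces to the count-derived one.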

Version published to 10.1101/2024.10.23.619972 on bioRxiv
Oct 28, 2024
