Adding layers of information to scRNA-seq data using pre-trained language models

Abstract

Single-cell technologies generate increasingly complex and multi-layered datasets, heightening the need for analysis workflows that incorporate additional biological information. Pretrained language models, with access to large corpora of biomedical literature, promise to provide such additional context to complement data-driven analyses, yet recent approaches largely focus on data-intrinsic tasks. Here we propose a framework for context-aware enrichment of single-cell RNA sequencing data by aligning data-derived and literature-derived representations in a shared embedding space. We represent cells as sentences derived from ranked gene expression and metadata, and construct structurally parallel datasets from PubMed titles and abstracts. Lightweight encoder-only language models are trained jointly on both sources to learn a common embedding space, thus integrating additional layers of information from the biomedical literature. Analyzing the joint embedding space, we show that biomedical literature can be meaningfully aligned with single-cell profiles to enrich standard analysis workflows. The trained models achieve robust annotation, capture functional states such as cytotoxicity, and reveal disease associations from literature-aligned embeddings. In developmental data, incorporating temporal metadata enables capturing temporal transitions consistent with cell lineage trajectories, demonstrating the potential of knowledge-augmented embeddings as a generalizable and interpretable strategy for extending single-cell analysis pipelines.
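
The cell-sentence construction described in the abstract can be illustrated with a minimal sketch: genes are ranked by expression, the top-ranked symbols (optionally prefixed with metadata tokens) form a text sentence, and the same encoder then embeds both cell sentences and PubMed title/abstract texts. The gene symbols, the top-k cutoff, the metadata format, and the placeholder embedding vectors below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code): build a "cell sentence"
# from ranked gene expression plus metadata, then compare it to a literature
# embedding in a shared space using cosine similarity.
import numpy as np

def cell_to_sentence(expression, gene_names, metadata=None, top_k=50):
    """Rank genes by expression and join the top_k expressed symbols into a sentence."""
    order = np.argsort(expression)[::-1]                      # highest expression first
    ranked_genes = [gene_names[i] for i in order[:top_k] if expression[i] > 0]
    tokens = ranked_genes
    if metadata:                                              # e.g. tissue or time point
        tokens = [f"{k}: {v}" for k, v in metadata.items()] + tokens
    return " ".join(tokens)

# Toy example with six genes and one cytotoxic-T-like cell
genes = ["CD3E", "CD8A", "GZMB", "PRF1", "MS4A1", "CD19"]
counts = np.array([5.0, 4.2, 3.7, 2.9, 0.0, 0.1])
sentence = cell_to_sentence(counts, genes, metadata={"tissue": "blood"}, top_k=4)
print(sentence)  # "tissue: blood CD3E CD8A GZMB PRF1"

# In the described framework, a lightweight encoder-only model would map both this
# sentence and a PubMed title/abstract into the same embedding space; placeholder
# vectors stand in for real encoder outputs here.
cell_emb = np.random.rand(384)
abstract_emb = np.random.rand(384)
cosine = cell_emb @ abstract_emb / (np.linalg.norm(cell_emb) * np.linalg.norm(abstract_emb))
print(f"cell-literature similarity: {cosine:.3f}")
```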
