Adding layers of information to scRNA-seq data using pre-trained language models

Abstract

Single-cell technologies generate increasingly complex and multi-layered datasets, heightening the need for analysis workflows that incorporate additional biological information. Pretrained language models, with access to large corpora of biomedical literature, promise to provide such additional context to complement data-driven analyses, yet recent approaches largely focus on data-intrinsic tasks. Here we propose a framework for context-aware enrichment of single-cell RNA sequencing data by aligning data-derived and literature-derived representations in a shared embedding space. We represent cells as sentences derived from ranked gene expression and metadata, and construct structurally parallel datasets from PubMed titles and abstracts. Lightweight encoder-only language models are trained jointly on both sources to learn a common embedding space, thus integrating additional layers of information from the biomedical literature. Analyzing the joint embedding space, we show that biomedical literature can be meaningfully aligned with single-cell profiles to enrich standard analysis workflows. The trained models achieve robust annotation, capture functional states such as cytotoxicity, and reveal disease associations from literature-aligned embeddings. In developmental data, incorporating temporal metadata enables capturing temporal transitions consistent with cell lineage trajectories, demonstrating the potential of knowledge-augmented embeddings as a generalizable and interpretable strategy for extending single-cell analysis pipelines.
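
The cell-sentence construction described in the abstract can be illustrated with a minimal sketch: genes are ranked by expression, the top-ranked symbols (optionally prefixed with metadata tokens) form a text sentence, and the same encoder then embeds both cell sentences and PubMed title/abstract texts. The gene symbols, the top-k cutoff, the metadata format, and the placeholder embedding vectors below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code): build a "cell sentence"
# from ranked gene expression plus metadata, then compare it to a literature
# embedding in a shared space using cosine similarity.
import numpy as np

def cell_to_sentence(expression, gene_names, metadata=None, top_k=50):
    """Rank genes by expression and join the top_k expressed symbols into a sentence."""
    order = np.argsort(expression)[::-1]                      # highest expression first
    ranked_genes = [gene_names[i] for i in order[:top_k] if expression[i] > 0]
    tokens = ranked_genes
    if metadata:                                              # e.g. tissue or time point
        tokens = [f"{k}: {v}" for k, v in metadata.items()] + tokens
    return " ".join(tokens)

# Toy example with six genes and one cytotoxic-T-like cell
genes = ["CD3E", "CD8A", "GZMB", "PRF1", "MS4A1", "CD19"]
counts = np.array([5.0, 4.2, 3.7, 2.9, 0.0, 0.1])
sentence = cell_to_sentence(counts, genes, metadata={"tissue": "blood"}, top_k=4)
print(sentence)  # "tissue: blood CD3E CD8A GZMB PRF1"

# In the described framework, a lightweight encoder-only model would map both this
# sentence and a PubMed title/abstract into the same embedding space; placeholder
# vectors stand in for real encoder outputs here.
cell_emb = np.random.rand(384)
abstract_emb = np.random.rand(384)
cosine = cell_emb @ abstract_emb / (np.linalg.norm(cell_emb) * np.linalg.norm(abstract_emb))
print(f"cell-literature similarity: {cosine:.3f}")
```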
