scEMB: Learning context representation of genes based on large-scale single-cell transcriptomics

Kang-Lin Hsieh
Yan Chu
Xiaoyang Li
Patrick G. Pilié
Yulin Dai


Abstract

Background

The rapid advancement of single-cell transcriptomic technologies has led to the curation of millions of cellular profiles, providing unprecedented insights into cellular heterogeneity across various tissues and developmental stages. This growing wealth of data presents an opportunity to uncover complex gene-gene relationships, yet also poses significant computational challenges.

Results

We present scEMB, a transformer-based deep learning model developed to capture context-aware gene embeddings from large-scale single-cell transcriptomics data. Trained on over 30 million single-cell transcriptomes, scEMB uses an innovative binning strategy that integrates data across multiple platforms while preserving both gene expression hierarchies and cell-type specificity. In downstream tasks such as batch integration, clustering, and cell type annotation, scEMB demonstrates superior performance compared to existing models such as scGPT and Geneformer. Notably, scEMB excels at in silico correlation analysis, accurately predicting gene perturbation effects in CRISPR-edited datasets and in microglia state transitions, and identifying a few known Alzheimer’s disease (AD) risk genes among its top-ranked genes. Additionally, scEMB offers robust fine-tuning capabilities for domain-specific applications, making it a versatile tool for tackling diverse biological problems such as therapeutic target discovery and disease modeling.
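The abstract does not specify how the binning strategy works. As a rough illustration of why binning can help integrate data across platforms, the sketch below uses per-cell rank-based binning, a generic technique assumed here and not necessarily the scheme scEMB uses. The function name `bin_expression` and the parameter `n_bins` are hypothetical.

```python
import bisect


def bin_expression(counts, n_bins=5):
    """Map one cell's expression values to integer bins.

    Illustrative assumption, not scEMB's documented method:
    zeros stay in bin 0, and each non-zero value is assigned a bin
    in 1..n_bins according to its rank among the cell's non-zero values.
    """
    nonzero = sorted(v for v in counts if v > 0)
    n = len(nonzero)
    binned = []
    for v in counts:
        if v <= 0:
            binned.append(0)
            continue
        # k = number of non-zero values in this cell that are <= v
        k = bisect.bisect_right(nonzero, v)
        # Integer ceiling of k * n_bins / n, avoiding float rounding
        binned.append((k * n_bins + n - 1) // n)
    return binned


# Two hypothetical cells with the same expression ordering but
# different overall scale, as might come from different platforms.
cell_a = [0, 1, 2, 4, 8, 16]
cell_b = [0, 10, 20, 40, 80, 160]
bins_a = bin_expression(cell_a)
bins_b = bin_expression(cell_b)
```

Because the bins depend only on the within-cell ordering of values, both toy cells receive identical bins despite the tenfold difference in scale, which is the kind of hierarchy-preserving behavior the abstract describes.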

Conclusions

scEMB represents a powerful tool for extracting biologically meaningful insights from complex gene expression data. Its ability to model in silico perturbation effects and conduct correlation analyses in the embedding space highlights its potential to accelerate discoveries in precision medicine and therapeutic development.
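As a minimal sketch of what a correlation analysis in the embedding space could look like, the example below ranks genes by cosine similarity to a query gene. Cosine similarity is an assumed stand-in for the similarity measure, and the gene names, vectors, and function names are all hypothetical toy values, not scEMB outputs or scEMB's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_neighbors(embeddings, query, top_k=3):
    """Return the top_k genes most similar to `query`, best first."""
    q = embeddings[query]
    scores = [(gene, cosine(q, vec))
              for gene, vec in embeddings.items() if gene != query]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Hypothetical 2-dimensional gene embeddings for illustration only.
toy_embeddings = {
    "GENE_A": [1.0, 0.0],
    "GENE_B": [0.9, 0.1],
    "GENE_C": [0.0, 1.0],
    "GENE_D": [-1.0, 0.0],
}
top = rank_neighbors(toy_embeddings, "GENE_A", top_k=2)
```

In a real analysis the vectors would come from a trained model, and the resulting ranked list would be compared against known gene sets, such as disease risk genes.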

Version published to 10.1101/2024.09.24.614685 on bioRxiv
Sep 26, 2024

