CellDiffusion: a generative model to annotate single-cell and spatial RNA-seq using bulk references
Abstract
Annotation of single-cell and spatial RNA-seq data can be greatly improved by leveraging bulk RNA-seq, which remains a cost-effective and well-established benchmark for characterising transcriptional activity in immune cell populations. However, a major technical hurdle lies in the contrasting properties of these data types: single-cell and spatial data are inherently sparse because of their cell-level sampling scheme, which yields much lower sequencing depth than bulk RNA-seq.
We developed CellDiffusion, a generative machine learning (ML) tool that bridges this gap. CellDiffusion generates realistic virtual cells to augment the sparse single-cell and spatial data, which strengthens the signal and improves the representation of rare cell types. The augmented data are more comparable to bulk references, which increases the accuracy of cell type annotation with bulk references and automated ML classifiers.
We benchmarked CellDiffusion on single-cell and spatial datasets from human peripheral blood samples, white adipose tissues, and breast tumours. Our method significantly outperforms state-of-the-art methods such as SingleR, Seurat, and scVI. In addition, CellDiffusion provides critical biological insights, including the identification of novel cell subtypes and their functions during cell state transitions; the discovery of new marker genes for tissue-resident immune cells, revealing functional shifts in myeloid populations; and the accurate characterisation of cell subtypes in spatial transcriptomics to decipher the tumour microenvironment.