A gene program dictionary of human cells
Abstract
Defining all human cell types and their roles in health and disease is a central goal of biology. Single-cell RNA sequencing has enabled the construction of organ-specific cell atlases, but building a comprehensive organism-wide atlas spanning multiple organs remains challenging because of batch effects, study biases, and inter-organ complexity. Here, we present the Gene Program Dictionary (GPD), a framework that overcomes these barriers by leveraging robust gene co-expression programs rather than direct cell integration. Using SpacGPA, a partial correlation-based network method, we analyzed 466 scRNA-seq datasets, generating 1,975 independent networks and 90,701 gene co-expression modules, which were consolidated into 1,534 consensus gene programs representing a wide range of human tissues and cell types. Each program serves as a composite marker, capturing both cell-type-specific and shared biological processes. We demonstrate the utility of these programs in three ways: mapping endothelial cell subtypes across tissues to reveal their heterogeneity, including tumor-specific programs; annotating colorectal cancer spatial transcriptomes; and linking programs and their corresponding cell types to disease loci, which reveals hotspots such as neuronal programs in psychiatric disorders and a proximal tubule program in kidney diseases. GPD provides an organism-wide reference for studying cellular diversity and disease mechanisms.
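The abstract describes SpacGPA only as a partial correlation-based network method. As a rough illustration of why partial correlation can be preferable to plain correlation when building gene networks, the sketch below computes a first-order partial correlation on toy data. It is not SpacGPA's implementation: the function names, the simulated "genes" `x`, `y`, and `z`, and the noise levels are all assumptions made for this example.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data (hypothetical): genes x and y are both driven by a shared
# regulator z and have no direct relationship with each other.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(2000)]
x = [v + 0.5 * rng.gauss(0, 1) for v in z]
y = [v + 0.5 * rng.gauss(0, 1) for v in z]

r_plain = pearson(x, y)            # high, because x and y both track z
r_partial = partial_corr(x, y, z)  # close to zero once z is controlled for
```

In this toy setting, a network built from plain correlation would draw an edge between `x` and `y`, whereas one built from partial correlation would not. That is the general motivation for partial correlation-based networks; how SpacGPA itself estimates and thresholds edges is not stated in the abstract.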