Cell type composition drives patient stratification in single-cell RNA-seq cohorts
Abstract
Early transcriptomic studies demonstrated that unsupervised analysis of bulk gene expression can reveal clinically meaningful patient subgroups. Single-cell RNA sequencing (scRNA-seq) provides high-resolution characterization of cellular heterogeneity and therefore enables more refined patient stratification. Several computational approaches have been proposed to summarize single-cell data into sample-level representations for cohort-level exploratory analyses. However, these methods generally do not explicitly account for the compositional nature of cell-type proportions. Based on eleven scRNA-seq cohorts across different biological conditions, we evaluated several state-of-the-art sample representation methods for their ability to recover known biological groupings in an unsupervised setting. Surprisingly, we found that baseline approaches based on cell-type composition and pseudobulk gene expression consistently matched or outperformed more complex methods while requiring orders of magnitude fewer computational resources. In particular, centered log-ratio-transformed cell-type proportions achieved the highest stratification performance and demonstrated robustness to batch effects. The stratification signal was frequently concentrated in a small subset of highly variable cell types, and performance was robust across diverse cell type annotation strategies. Altogether, these results suggest that clinically relevant inter-sample variation in scRNA-seq cohorts is largely driven by differences in cell-type composition. Importantly, compositional representations directly link cohort-level structure to specific cell populations, enabling mechanistic interpretation and facilitating clinical translation. 
We provide scECODA, an open-source R package for scalable and interpretable cohort-level Exploratory COmpositional Data Analysis of scRNA-seq data, and establish cell-type compositional representations as a powerful and interpretable baseline for unsupervised patient stratification.
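The centered log-ratio (CLR) representation described above can be illustrated with a minimal sketch. This is not the scECODA API (scECODA is an R package, and its interface is not shown here); the function names `clr_transform` and `highly_variable_cell_types`, the pseudocount of 0.5, and the toy count matrix are all assumptions made for illustration. The sketch takes a samples-by-cell-types count matrix, converts it to proportions, applies the CLR transform, and ranks cell types by their variance across samples.

```python
import numpy as np


def clr_transform(counts, pseudocount=0.5):
    """Centered log-ratio transform of a samples x cell-types count matrix.

    The pseudocount (an assumed value, not taken from the paper) avoids
    log(0) for cell types that are absent from a sample.
    """
    shifted = np.asarray(counts, dtype=float) + pseudocount
    # Convert each sample (row) to proportions that sum to 1.
    props = shifted / shifted.sum(axis=1, keepdims=True)
    log_props = np.log(props)
    # Subtracting the per-sample mean of the log-proportions is equivalent
    # to dividing each proportion by the sample's geometric mean.
    return log_props - log_props.mean(axis=1, keepdims=True)


def highly_variable_cell_types(clr, top_k=2):
    """Indices of the top_k cell types with the largest CLR variance."""
    variances = clr.var(axis=0)
    return np.argsort(variances)[::-1][:top_k]


# Hypothetical toy cohort: 4 samples (rows) x 4 cell types (columns).
counts = np.array([
    [120, 30, 50, 4],
    [100, 40, 55, 5],
    [20, 150, 25, 5],
    [25, 140, 30, 5],
])

clr = clr_transform(counts)
top_types = highly_variable_cell_types(clr, top_k=2)
print(clr.round(2))
print("Most variable cell-type columns:", top_types)
```

Each row of the CLR matrix sums to zero, and with a pseudocount of zero the transform is unchanged when a sample's counts are rescaled, so it depends only on relative abundances and not on how many cells were captured per sample. The resulting matrix can be passed to standard unsupervised tools such as PCA or hierarchical clustering.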