Compositional data modeling of high-dimensional single cell RNA-seq (CoDA-hd): its advantages over commonly used normalization approaches
Abstract
Compositional data analysis (CoDA) is an emerging statistical framework that has been extended to microbiome data, bulk RNA-seq, and cell-type proportions in single-cell RNA-seq (scRNA-seq) (with 50-200 components). Here, we explore the high-dimensional application of CoDA (CoDA-hd) and its various log-ratio (LR) transformations to raw scRNA-seq count matrices, which have over 20,000 components (e.g., protein-coding genes). scRNA-seq matrices are typically sparse and high-dimensional. Common normalization approaches such as log-normalization may lead to suspicious findings, as previously shown for trajectory inference. Although RNA-seq data are compositional by nature, the geometry of CoDA in a high-dimensional simplex is not compatible with most downstream scRNA-seq analyses, which are based on Euclidean space. In this study, we examined whether CoDA can be adapted to scRNA-seq in various downstream applications. Specifically, we studied: (1) the adaptability of CoDA to scRNA-seq; (2) the handling of zeros by prior log-normalization, imputation, or specific count addition; and (3) transformation to Euclidean space and compatibility with downstream analyses. Our results suggest that (1) innovative count-addition schemes (e.g., SGM) enable the application of CoDA to high-dimensional sparse data (i.e., scRNA-seq); (2) log-normalized data can be converted to a CoDA LR transformation as an approximation; and (3) CoDA LR transformations such as the count-added centered log-ratio (CLR) had some advantages in dimension-reduction visualization, clustering, and trajectory inference in the tested real and simulated datasets. CLR yielded more distinct, better-separated clusters in dimension reductions, improved Slingshot trajectory inference, and eliminated a suspicious trajectory that was probably caused by dropouts. We therefore conclude that CoDA may be a preferred scale-free model for handling scRNA-seq data in these downstream applications.
Additionally, we developed an R package, ‘CoDAhd’, for performing CoDA LR transformations on high-dimensional scRNA-seq data. The code implementing CoDA-hd and some example datasets are available at https://github.com/GO3295/CoDAhd.
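As a rough illustration of the count-added CLR transformation described above, the following Python sketch adds a uniform pseudocount to every gene in a cell and then centers the log counts on their mean (the log of the geometric mean). This is only a minimal sketch under stated assumptions: the uniform pseudocount is a simple stand-in and is not the SGM scheme or any other count-addition scheme from the study, the function name `clr_with_pseudocount` and the toy matrix are hypothetical, and none of this reflects the interface of the CoDAhd R package.

```python
import math


def clr_with_pseudocount(counts, pseudocount=1.0):
    """Centered log-ratio of one cell's gene counts.

    Assumption: a uniform pseudocount is added to every gene so that
    zeros have a defined logarithm. This is a generic stand-in, not the
    count-addition scheme used in the study.
    """
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive so every log is defined")
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


# Hypothetical toy matrix: rows are cells, columns are genes.
toy_counts = [
    [0, 3, 0, 12, 1],
    [5, 0, 0, 7, 2],
]

clr_matrix = [clr_with_pseudocount(cell) for cell in toy_counts]

for row in clr_matrix:
    print([round(value, 3) for value in row])
```

Each transformed cell sums to zero across genes, which is what places the data in ordinary real-valued coordinates where Euclidean-distance-based tools can operate. The choice of pseudocount affects the result on sparse data, which is why the handling of zeros is treated as a separate question in the abstract.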