Compositional data modeling of high-dimensional single cell RNA-seq (CoDA-hd): its advantages over commonly used normalization approaches

Jinghan Huang
Phillip Sheung Chi Yam
KS Leung
Minghua Deng
Nelson LS Tang

Abstract

Compositional data analysis (CoDA) is an emerging statistical framework that has been extended to microbiome data, bulk RNA-seq, and cell-type proportions in single-cell RNA-seq (scRNA-seq) (with 50-200 components). Here, we explore the high-dimensional application of CoDA (CoDA-hd) and its various log-ratio (LR) transformations to raw scRNA-seq count matrices, which have over 20,000 components (e.g., protein-coding genes). scRNA-seq matrices are typically sparse and high-dimensional. Common normalization approaches such as log-normalization may lead to suspicious findings, as previously shown for trajectory inference. Although RNA-seq data are compositional by nature, the geometry of CoDA in a high-dimensional simplex is not compatible with most downstream scRNA-seq analyses, which are based on Euclidean space. In this study, we explored whether CoDA can be adapted to scRNA-seq in various downstream applications. Specifically, we studied: (1) the adaptability of CoDA to scRNA-seq; (2) the handling of zeros: prior log-normalization, imputation, or specific count addition; (3) transformation to Euclidean space and compatibility with downstream analyses. Our results suggest that (1) innovative count addition schemes (e.g., SGM) enable the application of CoDA to high-dimensional sparse data (i.e., scRNA-seq); (2) log-normalized data can be converted to a CoDA LR transformation as an approximation; (3) CoDA LR transformations such as count-added centered-log-ratio (CLR) had some advantages in dimension reduction visualization, clustering, and trajectory inference in the tested real and simulated datasets. CLR provided more distinct and better-separated clusters in dimension reductions, improved Slingshot trajectory inference, and eliminated a suspicious trajectory that was probably caused by dropouts. We therefore conclude that CoDA may be a preferred scale-free model for handling scRNA-seq data in these downstream applications.
Additionally, an R package 'CoDAhd' was developed for conducting CoDA LR transformations on high-dimensional scRNA-seq data. The code for implementing CoDA-hd and some example datasets are available at https://github.com/GO3295/CoDAhd.
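
To make the count-added CLR idea concrete, here is a minimal Python sketch. It is not taken from the CoDAhd package and does not implement the SGM scheme named in the abstract. It assumes a uniform pseudocount added to every entry of a cells-by-genes matrix, and the function name `clr_with_pseudocount` and the toy matrix are hypothetical.

```python
import numpy as np

def clr_with_pseudocount(counts, pseudocount=1.0):
    """Centered log-ratio (CLR) transform of a cells x genes count matrix.

    Assumption: a uniform pseudocount is added to every entry so that zeros
    have a finite logarithm. This is an illustrative stand-in, not the SGM
    count addition scheme described in the abstract.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    log_x = np.log(x)
    # Subtracting each cell's mean log value divides by its geometric mean.
    return log_x - log_x.mean(axis=1, keepdims=True)

# Toy matrix: 3 cells (rows) x 4 genes (columns), with zeros as in sparse data.
counts = np.array([[10, 0, 5, 0],
                   [100, 0, 50, 0],
                   [0, 3, 0, 7]])
clr = clr_with_pseudocount(counts)
print(clr.shape)
```

Each row of the result sums to zero, which is the defining constraint of CLR coordinates. With strictly positive input and a pseudocount of zero, the transform is unchanged when a cell's counts are multiplied by a constant, which is the scale-free property the abstract refers to. Adding a nonzero pseudocount makes that invariance only approximate.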

Version published to 10.1101/2025.03.24.644852v1 on bioRxiv
Mar 26, 2025

Related articles

Worm Perturb-Seq: massively parallel whole-animal RNAi and RNA-seq

This article has 9 authors:
1. Hefei Zhang
2. Xuhang Li
3. Dongyuan Song
4. Onur Yukselen
5. Shivani Nanda
6. Alper Kucukural
7. Jingyi Jessica Li
8. Manuel Garber
9. Albertha J.M. Walhout
This article has no evaluations. Latest version: Feb 3, 2025

scBiopsy-seq: a platform for temporal single-cell RNA-seq analysis

This article has 12 authors:
1. Linfeng Cai
2. Shiyan Lin
3. Minghao Qiu
4. Li Lin
5. Fuyuan Li
6. Jiajia Liu
7. Yuning Zou
8. Xing Na
9. Shanshan Liang
10. Xing Xu
11. Chaoyong Yang
12. Jin Li
This article has no evaluations. Latest version: Mar 29, 2025

Evaluating discrepancies in dimensionality reduction for time-series single-cell RNA-sequencing data

This article has 4 authors:
1. Maren Hackenberg
2. Laia Canal Guitart
3. Rolf Backofen
4. Harald Binder
This article has no evaluations. Latest version: Feb 8, 2025
