scValue: value-based subsampling of large-scale single-cell transcriptomic data for machine and deep learning tasks

Li Huang
Weikang Gong
Dongsheng Chen

Abstract

Large single-cell ribonucleic acid-sequencing (scRNA-seq) datasets offer unprecedented biological insights but present substantial computational challenges for visualization and analysis. While existing subsampling methods can enhance efficiency, they may not ensure optimal performance in downstream machine learning and deep learning (ML/DL) tasks. Here, we introduce scValue, a novel approach that ranks individual cells by ‘data value’ using out-of-bag estimates from a random forest model. scValue prioritizes high-value cells and allocates greater representation to cell types with higher variability in data value, effectively preserving key biological signals within subsamples. We benchmarked scValue on automatic cell-type annotation tasks across four large datasets, paired with distinct ML/DL models. Our method consistently outperformed existing subsampling methods, closely matching full-data performance across all annotation tasks. In three additional case studies—label transfer learning, cross-study label harmonization, and bulk RNA-seq deconvolution—scValue more effectively preserved T-cell annotations across human gut-colon datasets, more accurately reproduced T-cell subtype relationships in a human spleen dataset, and constructed a more reliable single-cell immune reference for cell-type deconvolution in simulated bulk tissue samples. Finally, using 16 public datasets ranging from tens of thousands to millions of cells, we evaluated subsampling quality based on computational time, Gini coefficient, and Hausdorff distance. scValue demonstrated fast execution, well-balanced cell-type representation, and distributional properties akin to uniform sampling. Overall, scValue provides a robust and scalable solution for subsampling large scRNA-seq data in ML/DL workflows. It is available as an open-source Python package installable via pip, with source code at https://github.com/LHBCB/scvalue.
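The ranking-and-allocation idea described in the abstract can be illustrated with a minimal sketch. This is not the scvalue package's API or the authors' exact algorithm. It assumes that a cell's "data value" can be approximated by the out-of-bag probability a scikit-learn random forest assigns to the cell's true label, and that each cell type's quota is proportional to the standard deviation of its values. The function names `oob_data_values` and `value_based_subsample`, the synthetic data, and the quota rule are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def oob_data_values(X, y, n_trees=200, seed=0):
    """Score each cell by the out-of-bag probability of its true label.

    Assumption: this probability stands in for the paper's 'data value'.
    """
    rf = RandomForestClassifier(
        n_estimators=n_trees, bootstrap=True, oob_score=True, random_state=seed
    )
    rf.fit(X, y)
    # Rows follow the order of X; columns follow rf.classes_.
    # Cells that were never out-of-bag (rare) get NaN, mapped to 0 here.
    proba = np.nan_to_num(rf.oob_decision_function_)
    col = np.searchsorted(rf.classes_, y)
    return proba[np.arange(len(y)), col]


def value_based_subsample(values, y, size):
    """Pick `size` cells, giving larger quotas to types whose values vary more.

    Assumption: quota is proportional to the per-type standard deviation.
    """
    types, counts = np.unique(y, return_counts=True)
    spread = np.array([values[y == t].std() for t in types]) + 1e-8
    quota = np.minimum(np.floor(size * spread / spread.sum()).astype(int), counts)
    # Hand out any remainder one cell at a time to types with spare cells.
    while quota.sum() < min(size, counts.sum()):
        room = np.flatnonzero(quota < counts)
        best = room[np.argmax(spread[room] / (quota[room] + 1))]
        quota[best] += 1
    chosen = []
    for t, q in zip(types, quota):
        idx = np.flatnonzero(y == t)
        # Keep the q highest-value cells of this type.
        chosen.append(idx[np.argsort(values[idx])[::-1][:q]])
    return np.sort(np.concatenate(chosen))


# Synthetic stand-in for an expression matrix with three "cell types".
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
values = oob_data_values(X, y)
subset = value_based_subsample(values, y, size=120)
print(len(subset), np.bincount(y[subset]))
```

On real data, `X` would be a normalized or dimension-reduced expression matrix and `y` the cell-type labels. The published package should be consulted for the actual scoring and allocation rules.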

Version published to 10.1093/bib/bbaf279: May 1, 2025
Version published to 10.1101/2025.01.10.632338 on bioRxiv: Jan 14, 2025

