Benchmarking large-scale single-cell RNA-seq analysis

Ilaria Billato
Hervé Pagès
Vince Carey
Levi Waldron
Gabriele Sales
Chiara Romualdi
Davide Risso


Abstract

The increasing size of single-cell RNA sequencing (scRNA-seq) datasets poses major computational challenges. This work benchmarks the scalability, efficiency, and accuracy of five widely used analysis frameworks (Seurat, OSCA, scrapper, Scanpy, and rapids_singlecell), focusing on the impact of algorithmic and infrastructural choices on performance. We performed a systematic comparison of these workflows using representative datasets, including the 1.3M mouse brain dataset for scalability and three smaller datasets (BE1, sc_mixology, and cord blood CITE-seq) with ground truth labels to assess clustering accuracy. Principal Component Analysis (PCA) was used as a paradigmatic step to evaluate the computational performance of six SVD algorithms (exact, ARPACK, IRLBA, randomized, Jacobi, and incremental PCA) across multiple data representations (dense, sparse, HDF5) and hardware configurations (CPU vs GPU). All methods showed high concordance in PCA results, with negligible loss of accuracy in truncated approaches. GPU-based computation using rapids_singlecell provided a 15x speed-up over the best CPU methods, with moderate memory usage. On CPU, ARPACK and IRLBA were the most efficient for sparse matrices, while randomized SVD performed best for HDF5-backed data. Among full pipelines, rapids_singlecell was the fastest, whereas OSCA and scrapper achieved the highest clustering accuracy (ARI up to 0.97) in datasets with known cell identities. Performance differences were largely driven by the choice of highly variable genes (HVGs) and PCA implementation. The study highlights that scalability in scRNA-seq analysis depends critically on both algorithmic and infrastructural factors. GPU acceleration and optimized BLAS/LAPACK configurations markedly enhance performance, while Bioconductor-based pipelines remain robust in accuracy. The provided benchmarks offer practical guidelines for efficient and reliable analysis of large-scale single-cell datasets.
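The clustering accuracy figures quoted above are adjusted Rand index (ARI) values, which compare a predicted clustering against ground-truth labels and correct for chance agreement. As a minimal sketch of how such a score is computed (not the authors' benchmarking code; the function name and the toy labels are illustrative assumptions), the standard contingency-table formula can be written with the Python standard library alone:

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells.

    Hypothetical helper for illustration; it follows the standard
    pair-counting formula based on the contingency table.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same cells")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two cells: the labelings agree trivially.
        return 1.0

    # Contingency table counts plus its row and column marginals.
    contingency = Counter(zip(labels_true, labels_pred))
    row_sums = Counter(labels_true)
    col_sums = Counter(labels_pred)

    # Number of cell pairs grouped together in both labelings,
    # in the reference labeling, and in the predicted labeling.
    index = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in row_sums.values())
    sum_cols = sum(comb(c, 2) for c in col_sums.values())

    # Expected agreement under random labeling with fixed marginals.
    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. every cell in one cluster in both labelings.
        return 1.0
    return (index - expected) / (max_index - expected)


# Toy labels (assumed for illustration, not data from the study):
# the second cell type is split into two predicted clusters.
known_cell_types = [0, 0, 1, 1]
predicted_clusters = [0, 0, 1, 2]
score = adjusted_rand_index(known_cell_types, predicted_clusters)
```

A score of 1 means the predicted clusters reproduce the known cell identities exactly, regardless of how the clusters are named, while values near 0 indicate agreement no better than chance.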

Version published to 10.1101/2025.10.28.681564 on bioRxiv
Oct 29, 2025

Related articles

Persistent hindrances to data re-use in single-cell genomics

This article has 8 authors:
1. Sanja Rogic
2. Xinrui Xiang Yu
3. Brianna Xu
4. Alexandra Millett
5. Salva Sherif
6. Guillaume Poirier-Morency
7. Rachel Schwartz
8. Paul Pavlidis
This article has no evaluations. Latest version: Oct 3, 2025

Spectral Compression of Single-Cell Transcriptomes. A Proof-of-Concept FFT Framework for Scalable MRD Follow-up

This article has 1 author:
1. Solomon Tessega
This article has no evaluations. Latest version: Oct 9, 2025

Modtector: Ultra-Fast Modification Signal Mining on Mapped Sequencing Reads

This article has 6 authors:
1. Tong Zhou
2. Yifan Hong
3. Panfeng Li
4. Xitong Liu
5. Ang Li
6. Lei Sun
This article has no evaluations. Latest version: Oct 10, 2025
