Benchmarking large-scale single-cell RNA-seq analysis

Abstract

The increasing size of single-cell RNA sequencing (scRNA-seq) datasets poses major computational challenges. This work benchmarks the scalability, efficiency, and accuracy of five widely used analysis frameworks (Seurat, OSCA, scrapper, Scanpy, and rapids_singlecell), focusing on the impact of algorithmic and infrastructural choices on performance. We performed a systematic comparison of these workflows using representative datasets, including the 1.3M mouse brain dataset for scalability and three smaller datasets (BE1, sc_mixology, and cord blood CITE-seq) with ground truth labels to assess clustering accuracy. Principal Component Analysis (PCA) was used as a paradigmatic step to evaluate the computational performance of six SVD algorithms (exact, ARPACK, IRLBA, randomized, Jacobi, and incremental PCA) across multiple data representations (dense, sparse, HDF5) and hardware configurations (CPU vs GPU). All methods showed high concordance in PCA results, with negligible loss of accuracy in truncated approaches. GPU-based computation using rapids_singlecell provided a 15x speed-up over the best CPU methods, with moderate memory usage. On CPU, ARPACK and IRLBA were the most efficient for sparse matrices, while randomized SVD performed best for HDF5-backed data. Among full pipelines, rapids_singlecell was the fastest, whereas OSCA and scrapper achieved the highest clustering accuracy (ARI up to 0.97) in datasets with known cell identities. Performance differences were largely driven by the choice of highly variable genes (HVGs) and PCA implementation. The study highlights that scalability in scRNA-seq analysis depends critically on both algorithmic and infrastructural factors. GPU acceleration and optimized BLAS/LAPACK configurations markedly enhance performance, while Bioconductor-based pipelines remain robust in accuracy. The provided benchmarks offer practical guidelines for efficient and reliable analysis of large-scale single-cell datasets.
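Clustering accuracy is reported above as the adjusted Rand index (ARI), which compares predicted cluster assignments with ground-truth labels and corrects for chance agreement. The following is a minimal sketch of the standard ARI formula using only the Python standard library. The function name `adjusted_rand_index` and the toy labels are illustrative assumptions, not the authors' benchmarking code; in practice a library routine such as scikit-learn's `adjusted_rand_score` would normally be used.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells.

    Illustrative helper (hypothetical name), following the standard
    contingency-table formulation of the ARI.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two cells: the labelings agree trivially.
        return 1.0

    # Contingency counts n_ij, with row sums a_i and column sums b_j.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of cell pairs grouped together in both, in truth, in prediction.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2         # upper bound on agreement

    if max_index == expected:
        # Degenerate case, e.g. both labelings put every cell in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    truth = ["B", "B", "B", "T", "T", "T"]    # toy ground-truth cell types
    clusters = [0, 0, 1, 1, 1, 1]             # toy clustering, one cell misplaced
    print(round(adjusted_rand_index(truth, clusters), 3))
```

Because the ARI depends only on which cells are grouped together, renaming the clusters leaves the score unchanged: a clustering that matches the ground truth up to a permutation of labels scores exactly 1.0, while values near 0 indicate agreement no better than chance.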
