A dual-metric framework for quantifying the biological fidelity of scRNA-seq pipelines
Abstract
Single-cell RNA sequencing (scRNA-seq) enables high-resolution transcriptional profiling, but early-stage processing pipelines differ markedly in barcode recovery, UMI correction, and read assignment. These variations can propagate and bias downstream analyses. We present a reproducible, parameter-aware benchmarking framework to quantify the biological fidelity of four primary pipelines (STARsolo, Cell Ranger, Kallisto|Bustools, and Alevin-fry) across simulated ground-truth datasets and complex biological contexts, including a Huntington’s disease (HD) mouse model. Our approach introduces two complementary metrics: the Cluster Annotation Score (CAS), which assesses concordance between direct cell-level labels and cluster-level consensus labels, and the Marker Concordance Score (MCS), which measures the cohesion of de novo marker genes per cell type. By systematically varying the number of highly variable genes (n_HVG) and principal components (n_PC), we map how upstream quantification interacts with downstream parameter choice. In simulations, STARsolo and Kallisto|Bustools both deliver high technical accuracy, but only STARsolo consistently preserves stable cell identities and coherent marker expression across parameter regimes. In empirical datasets, alignment-based pipelines (STARsolo, Cell Ranger) yield higher CAS and MCS values and more biologically faithful annotations, while alignment-free methods show reduced signal fidelity despite faster runtimes. Differences introduced during primary processing persist after batch correction and integration, altering disease-associated cell type detection. Our open-source CAS-MCS-Scoring toolkit enables transparent evaluation of pipeline performance, providing a practical guide for selecting analysis strategies that maximize reproducibility and biological interpretability in scRNA-seq studies.
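To make the idea behind the Cluster Annotation Score concrete, the following is a minimal sketch of one way concordance between cell-level labels and cluster-level consensus labels could be computed. The abstract does not give the formula, so everything here is an assumption: the majority-vote consensus, the fraction-of-matching-cells score, and the function names `cluster_consensus` and `annotation_concordance` are hypothetical and are not taken from the CAS-MCS-Scoring toolkit.

```python
from collections import Counter, defaultdict


def cluster_consensus(cell_labels, cluster_ids):
    """Return an assumed consensus label per cluster: the most common cell-level label."""
    labels_by_cluster = defaultdict(list)
    for label, cluster in zip(cell_labels, cluster_ids):
        labels_by_cluster[cluster].append(label)
    # Majority vote within each cluster (an illustrative choice, not the paper's stated rule).
    return {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in labels_by_cluster.items()
    }


def annotation_concordance(cell_labels, cluster_ids):
    """Fraction of cells whose own label equals their cluster's consensus label."""
    if len(cell_labels) != len(cluster_ids):
        raise ValueError("cell_labels and cluster_ids must have the same length")
    if not cell_labels:
        raise ValueError("at least one cell is required")
    consensus = cluster_consensus(cell_labels, cluster_ids)
    matches = sum(
        label == consensus[cluster]
        for label, cluster in zip(cell_labels, cluster_ids)
    )
    return matches / len(cell_labels)


# Toy example: six cells in two clusters (labels are made up for illustration).
cell_labels = ["neuron", "neuron", "astro", "astro", "astro", "micro"]
cluster_ids = [0, 0, 0, 1, 1, 1]
score = annotation_concordance(cell_labels, cluster_ids)
print(f"concordance = {score:.3f}")
```

Under this assumed definition, a score near 1 would indicate that clusters are label-pure, while lower values would indicate that cell-level and cluster-level annotations disagree. Recomputing such a score over a grid of n_HVG and n_PC settings for each pipeline would correspond to the parameter-aware comparison described above.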