Error Correction Algorithms for Efficient Gene Expression Quantification in Single Cell Transcriptomics

Abstract

Technological advances in single-cell RNA sequencing (scRNA-seq) allow us to sequence the transcriptomes of thousands of single cells in parallel, resulting in massive amounts of raw sequence data that must be processed efficiently to obtain a genes × cells expression matrix. In droplet-based scRNA-seq protocols, each sequenced mRNA molecule is tagged with a cell-specific barcode and with a unique molecular identifier (UMI) that distinguishes molecules within a cell. Both barcodes and UMIs may contain errors from production, amplification or sequencing. Correcting and resolving such errors before further processing yields more reliable data and more accurate expression measurements. We propose algorithmic advancements for barcode correction, read-to-gene mapping and UMI resolution, which we combine into a new method called arcane for efficient gene expression quantification from scRNA-seq data. We additionally provide an implementation as a workflow-friendly command-line tool, also called arcane. This work builds on the recently published Fourway method for efficiently discovering DNA k-mers at Hamming distance 1, which speeds up barcode correction and UMI resolution and allows k-mers to be classified as weakly or strongly unique during read-to-gene mapping. As a side result of separate interest, we show that for the mapping step, it suffices to store three genes per k-mer in order to cover almost all of the genes almost completely, thus avoiding arbitrarily large color sets in the colored De Bruijn graph index. As a result, arcane is faster than existing methods while producing very similar results, as demonstrated in a comparison with CellRanger, Kallisto|bustools and Alevin-fry.
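To illustrate what barcode correction at Hamming distance 1 means, the following is a minimal Python sketch. It is not the Fourway method and not arcane's implementation: it simply enumerates all single-substitution neighbors of an observed barcode and looks them up in a set of known barcodes. The function names (`hamming1_neighbors`, `correct_barcodes`, `count_cells`), the whitelist, and the rule of discarding ambiguous barcodes are assumptions made for this example.

```python
from collections import Counter

ALPHABET = "ACGT"


def hamming1_neighbors(kmer):
    """Yield every sequence that differs from kmer in exactly one position."""
    for i, base in enumerate(kmer):
        for alt in ALPHABET:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def correct_barcodes(observed, whitelist):
    """Map each distinct observed barcode to a whitelisted barcode, or None.

    Assumed rule for this sketch: an exact match is kept; otherwise the
    barcode is corrected only if exactly one whitelisted barcode lies at
    Hamming distance 1. Ambiguous or unmatched barcodes map to None.
    """
    whitelist = set(whitelist)
    corrected = {}
    for bc in set(observed):
        if bc in whitelist:
            corrected[bc] = bc
            continue
        candidates = [n for n in hamming1_neighbors(bc) if n in whitelist]
        corrected[bc] = candidates[0] if len(candidates) == 1 else None
    return corrected


def count_cells(observed, whitelist):
    """Count reads per corrected barcode, dropping uncorrectable ones."""
    mapping = correct_barcodes(observed, whitelist)
    return Counter(mapping[bc] for bc in observed if mapping[bc] is not None)
```

This naive approach performs 3k lookups per barcode of length k. The same neighbor enumeration could in principle be applied to UMIs within a cell and gene, though how arcane resolves UMIs is not specified here.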
