Harvesting more reads from single-cell combinatorial barcoding data with scarecrow


Abstract

Summary

Combinatorial barcoding technologies for single-cell nucleotide sequencing, such as split-pool ligation protocols, involve sequential rounds of cell barcoding to uniquely tag individual cells. The rapid adoption of combinatorial barcoding in recent years is due in part to its scalability across cells and samples. However, small shifts in barcode positions within sequencing reads, caused by technical artifacts (e.g. during barcode incorporation or synthesis), can compromise the accurate assignment of reads to cell barcodes. Existing processing tools typically assume that barcodes are fixed-length nucleotide sequences located at fixed positions within reads, overlooking any positional variability. Consequently, reads containing truncated or mispositioned barcodes are discarded during initial data processing steps, leading to significant data loss. To address this limitation and maximise the retention of sequencing reads from single-cell combinatorial barcoding experiments, we introduce scarecrow. Our tool screens a subsample of reads to generate position-specific barcode profiles, which are then used to flexibly identify barcode sequences in each read whilst accounting for positional errors, a phenomenon we refer to as ‘jitter’. Barcode matches are prioritised to minimise nucleotide mismatches and the degree of jitter. The initial profiles are subsequently used to extract and error-correct barcode combinations in high-throughput sequencing libraries. By incorporating jitter into barcode error correction, scarecrow enables greater data recovery and improved downstream single-cell analyses. Scarecrow is fully open access, implemented in Python, and generates output files using standardised sequence file formats for maximal interoperability. A detailed explanation of the scarecrow workflow can be found in the supplementary materials.
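
To make the idea of jitter-aware matching concrete, the following is a minimal sketch and not the scarecrow implementation. It assumes a known list of valid barcodes and an expected start position for one barcoding round, then searches a small window of offsets around that position. The names `find_barcode`, `Match`, `max_jitter` and `max_mismatches` are hypothetical, and the tie-breaking order (fewest mismatches first, then smallest jitter) is an assumption, since the abstract states only that both are minimised.

```python
from typing import List, NamedTuple, Optional


class Match(NamedTuple):
    """A candidate barcode assignment (hypothetical structure)."""
    barcode: str      # barcode sequence from the list of valid barcodes
    start: int        # position in the read where the match begins
    mismatches: int   # Hamming distance between read window and barcode
    jitter: int       # absolute offset from the expected start position


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_barcode(
    read: str,
    valid_barcodes: List[str],
    expected_start: int,
    max_jitter: int = 2,
    max_mismatches: int = 1,
) -> Optional[Match]:
    """Return the best barcode match near expected_start, or None.

    Assumption: candidates are ranked by (mismatches, jitter), so an
    exact match that is shifted beats an in-place match with an error.
    """
    best: Optional[Match] = None
    for offset in range(-max_jitter, max_jitter + 1):
        start = expected_start + offset
        for barcode in valid_barcodes:
            end = start + len(barcode)
            # Skip windows that fall outside the read.
            if start < 0 or end > len(read):
                continue
            mm = hamming(read[start:end], barcode)
            if mm > max_mismatches:
                continue
            cand = Match(barcode, start, mm, abs(offset))
            if best is None or (cand.mismatches, cand.jitter) < (
                best.mismatches,
                best.jitter,
            ):
                best = cand
    return best


if __name__ == "__main__":
    barcodes = ["ACGTACGT", "TTGGCCAA"]
    # Barcode shifted one base downstream of the expected position 4.
    shifted_read = "CCCCC" + "ACGTACGT" + "GGGG"
    print(find_barcode(shifted_read, barcodes, expected_start=4))
```

A fixed-position matcher would reject `shifted_read` because the window at position 4 differs from every valid barcode at more than one base, whereas the windowed search recovers the exact match one base downstream. In a multi-round protocol the same search would be repeated once per barcode position.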
