Benchmarking DNA barcode decoding strategies under high error rates


Abstract

Background: DNA barcoding enables multiplexed identification of biomolecules in pooled sequencing experiments, with broad applications including spatial transcriptomics. Photolithographic synthesis of high-density barcode arrays achieves library sizes exceeding $10^5$ unique sequences but introduces error rates of 10--20\% per nucleotide through substitutions, insertions, and deletions. Classical error-correcting codes cannot scale to such library sizes while maintaining robust error correction under these conditions.
Methods: We benchmarked three computational barcode decoding approaches: Columba (FM-index-based lossless alignment), QUIK (k-mer filtering with GPU acceleration), and RandomBarcodes (trimer-based triage with GPU parallelization). All three were evaluated on simulated and empirical datasets. Simulations spanned barcode lengths of 28--36 nt, library sizes of 21,000--85,000 barcodes, and error rates of 9--32\%. Real sequencing data were generated from photolithographically synthesized arrays at three printing density levels.
Results: Under medium error rates (\textasciitilde23\%), QUIK achieved the highest recall (87--89\%) while maintaining precision $>99.5\%$, outperforming RandomBarcodes (recall 56\%, precision $>99.8\%$) and Columba (recall 35\%, precision 98--100\%). QUIK also demonstrated superior scalability, processing 59,620 reads/second on a single GPU, compared with 68 reads/second for RandomBarcodes and 1,550 reads/second for Columba (8 CPU threads). Barcode length strongly influenced accuracy: 34-nt barcodes enabled 75\% recall at 99.97\% precision with QUIK, compared with 60\% recall for 32-nt barcodes. On real data from a 42,000-spot subarray with 36-nt barcodes, QUIK achieved a 57\% assignment rate at 100\% precision, versus 52\% for Columba (precision 99.96\%) and 50\% for RandomBarcodes (precision 99.82\%).
Conclusions: Of the three tools benchmarked, QUIK provides the best balance of speed, accuracy, and scalability for high-density spatial transcriptomics applications under realistic synthesis error conditions. Barcode lengths $\geq 34$ nt are recommended for applications requiring $>75\%$ read recovery at $>99.9\%$ precision.
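
To make the reported metrics concrete, the following is a minimal Python sketch of a generic filter-then-verify decoder together with a precision and recall calculation. It is not the implementation of QUIK, Columba, or RandomBarcodes. The toy barcodes, the k-mer size, the edit-distance threshold, the tie-breaking rule, and every function name are assumptions made for illustration. The metric definitions are also assumed, since the abstract does not spell them out: precision is taken as correct assignments over all assigned reads, and recall as correct assignments over all reads.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(barcodes, k):
    """Map each k-mer to the ids of the barcodes that contain it."""
    index = defaultdict(set)
    for bc_id, seq in barcodes.items():
        for km in kmers(seq, k):
            index[km].add(bc_id)
    return index


def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def decode(read, barcodes, index, k, max_dist, top_n=3):
    """Filter candidates by shared k-mers, then verify by edit distance.

    Returns a barcode id, or None when the read is left unassigned.
    """
    votes = Counter()
    for km in kmers(read, k):
        for bc_id in index.get(km, ()):
            votes[bc_id] += 1
    scored = sorted((edit_distance(read, barcodes[b]), b)
                    for b, _ in votes.most_common(top_n))
    if not scored or scored[0][0] > max_dist:
        return None  # no candidate, or best candidate too distant
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # ambiguous: two candidates at the same distance
    return scored[0][1]


def precision_recall(truth, assigned):
    """Assumed definitions: precision = correct / assigned, recall = correct / all reads."""
    n_assigned = sum(a is not None for a in assigned)
    n_correct = sum(a is not None and a == t for t, a in zip(truth, assigned))
    precision = n_correct / n_assigned if n_assigned else 0.0
    recall = n_correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy library and reads (far shorter than the 28-36 nt barcodes studied).
barcodes = {"b1": "ACGTACGTAC", "b2": "TTGCAAGGCT", "b3": "GGATCCTTAA"}
index = build_index(barcodes, k=4)
reads = ["ACGTTCGTAC",   # b1 with one substitution
         "TTGCAGGCT",    # b2 with one deletion
         "GCATGCTAAA"]   # b3 corrupted beyond any shared 4-mer
truth = ["b1", "b2", "b3"]
assigned = [decode(r, barcodes, index, k=4, max_dist=2) for r in reads]
precision, recall = precision_recall(truth, assigned)
print(assigned, precision, recall)
```

In this toy run the heavily corrupted read shares no 4-mer with any barcode and is left unassigned, which lowers recall without affecting precision. This is the same pattern as the high-precision, lower-recall results reported in the abstract.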
