Benchmarking DNA barcode decoding strategies under high error rates
Abstract
Background: DNA barcoding enables multiplexed identification of biomolecules in pooled sequencing experiments, with broad applications including spatial transcriptomics. Photolithographic synthesis of high-density barcode arrays achieves library sizes exceeding $10^5$ unique sequences but introduces error rates of 10--20\% per nucleotide through substitutions, insertions, and deletions. Classical error-correcting codes cannot scale to such library sizes while maintaining robust error correction under these conditions.
Methods: We benchmarked three computational barcode decoding approaches---Columba (FM-index-based lossless alignment), QUIK (k-mer filtering with GPU acceleration), and RandomBarcodes (trimer-based triage with GPU parallelization)---across simulated and empirical datasets. Simulations spanned barcode lengths of 28--36 nt, library sizes of 21,000--85,000 barcodes, and error rates of 9--32\%. Real sequencing data were generated from photolithographically synthesized arrays at three printing density levels.
Results: Under medium error rates (\textasciitilde23\%), QUIK achieved the highest recall (87--89\%) while maintaining precision $>99.5\%$, outperforming RandomBarcodes (recall 56\%, precision $>99.8\%$) and Columba (recall 35\%, precision 98--100\%). QUIK also had the highest throughput, processing 59,620 reads/second on a single GPU compared to 68 reads/second for RandomBarcodes and 1,550 reads/second for Columba with 8 CPU threads. Barcode length strongly influenced accuracy: with QUIK, 34-nt barcodes enabled 75\% recall at 99.97\% precision, compared to 60\% recall with 32-nt barcodes. On real data from a 42,000-spot subarray with 36-nt barcodes, QUIK achieved a 57\% assignment rate at 100\% precision, versus 52\% for Columba (precision 99.96\%) and 50\% for RandomBarcodes (precision 99.82\%).
Conclusions: Of the three tools benchmarked, QUIK provides the best balance of speed, accuracy, and scalability for high-density spatial transcriptomics applications under realistic synthesis error conditions. Barcode lengths $\geq 34$ nt are recommended for applications requiring $>75\%$ read recovery at $>99.9\%$ precision.
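The recall and precision figures quoted above can be read with the following minimal sketch in mind. It is an illustration only, not the authors' evaluation code, and the abstract does not state the exact metric definitions. The sketch assumes that recall is the fraction of all reads assigned to their true barcode and that precision is the fraction of assigned reads whose assignment is correct. The function name `precision_recall` and the convention of marking an unassigned read with `None` are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def precision_recall(
    truth: Sequence[str],
    decoded: Sequence[Optional[str]],
) -> Tuple[float, float]:
    """Return (precision, recall) for a set of decoded reads.

    Assumed definitions (not taken from the preprint):
      precision = correct assignments / reads that received any assignment
      recall    = correct assignments / all reads
    A decoder that declines to assign a read reports None for it.
    """
    if len(truth) != len(decoded):
        raise ValueError("truth and decoded must have the same length")

    # Reads for which the decoder committed to some barcode.
    assigned = sum(d is not None for d in decoded)
    # Reads assigned to the barcode they actually originated from.
    correct = sum(d is not None and d == t for t, d in zip(truth, decoded))

    precision = correct / assigned if assigned else 0.0
    recall = correct / len(truth) if len(truth) else 0.0
    return precision, recall


# Toy example: four reads, one left unassigned, one assigned wrongly.
truth = ["bc1", "bc2", "bc3", "bc4"]
decoded = ["bc1", None, "bc3", "bc9"]
p, r = precision_recall(truth, decoded)
```

Under these assumed definitions, a decoder can trade recall for precision by leaving ambiguous reads unassigned, which is consistent with the reported pattern of near-perfect precision at moderate recall.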