Benchmarking DNA barcode decoding strategies under high error rates

Franco Poma-Soto
Hanne Van Droogenbroeck
Brecht Soulliaert
Maya Giridhar
Jürgen Behr
Hamed Sabzalipoor
Mark Somoza
Pieter Mestdagh
Jo Vandesompele

Abstract

Background: DNA barcoding enables multiplexed identification of biomolecules in pooled sequencing experiments, with broad applications including spatial transcriptomics. Photolithographic synthesis of high-density barcode arrays achieves library sizes exceeding $10^5$ unique sequences but introduces error rates of 10--20\% per nucleotide through substitutions, insertions, and deletions. Classical error-correcting codes cannot scale to such library sizes while maintaining robust error correction under these conditions.
Methods: We benchmarked three computational barcode decoding approaches across simulated and empirical datasets: Columba (FM-index-based lossless alignment), QUIK (k-mer filtering with GPU acceleration), and RandomBarcodes (trimer-based triage with GPU parallelization). Simulations spanned barcode lengths of 28--36 nt, library sizes of 21,000--85,000 barcodes, and error rates of 9--32\%. Real sequencing data were generated from photolithographically synthesized arrays at three printing density levels.
Results: Under medium error rates (\textasciitilde23\%), QUIK achieved the highest recall (87--89\%) while maintaining precision $>99.5\%$, outperforming RandomBarcodes (recall 56\%, precision $>99.8\%$) and Columba (recall 35\%, precision 98--100\%). QUIK also scaled best, processing 59,620 reads/second on a single GPU, compared with 68 reads/second for RandomBarcodes and 1,550 reads/second for Columba with 8 CPU threads. Barcode length strongly influenced accuracy: 34-nt barcodes enabled 75\% recall at 99.97\% precision with QUIK, compared to 60\% recall with 32-nt barcodes. On real data from a 42,000-spot subarray with 36-nt barcodes, QUIK achieved a 57\% assignment rate with perfect precision, versus 52\% for Columba (precision 99.96\%) and 50\% for RandomBarcodes (precision 99.82\%).
Conclusions: QUIK provides the optimal balance of speed, accuracy, and scalability for high-density spatial transcriptomics applications under realistic synthesis error conditions. Barcode lengths $\geq 34$ nt are recommended for applications requiring $>75\%$ read recovery at $>99.9\%$ precision.
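
As a rough illustration of the kind of evaluation summarized above, the following Python sketch simulates per-nucleotide substitution, insertion, and deletion errors, decodes reads with a simple k-mer filter followed by edit-distance verification, and reports precision and recall. It is not the implementation of QUIK, Columba, or RandomBarcodes. The function names, the error model, the parameter values, and the metric definitions (precision as correct assignments over all assignments, recall as correct assignments over all reads) are assumptions made for illustration only.

```python
import random
from collections import defaultdict

ALPHABET = "ACGT"

def mutate(seq, p_sub, p_ins, p_del, rng):
    """Apply independent per-nucleotide errors (assumed error model)."""
    out = []
    for base in seq:
        r = rng.random()
        if r < p_del:
            continue  # deletion: drop this base
        if r < p_del + p_sub:
            # substitution: replace with a different base
            base = rng.choice([b for b in ALPHABET if b != base])
        out.append(base)
        if rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))  # insertion after this base
    return "".join(out)

def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(barcodes, k):
    """Map each k-mer to the indices of the barcodes that contain it."""
    index = defaultdict(set)
    for i, bc in enumerate(barcodes):
        for km in kmers(bc, k):
            index[km].add(i)
    return index

def edit_distance(a, b):
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def decode(read, barcodes, index, k, max_dist, top=5):
    """Filter candidates by shared k-mers, then verify by edit distance."""
    votes = defaultdict(int)
    for km in kmers(read, k):
        for i in index.get(km, ()):
            votes[i] += 1
    candidates = sorted(votes, key=lambda i: (-votes[i], i))[:top]
    scored = sorted((edit_distance(read, barcodes[i]), i) for i in candidates)
    if not scored or scored[0][0] > max_dist:
        return None  # no candidate close enough: leave the read unassigned
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # ambiguous best hit: leave the read unassigned
    return scored[0][1]

def precision_recall(truth, calls):
    """Precision = correct / assigned; recall = correct / all reads (assumed)."""
    assigned = [(t, c) for t, c in zip(truth, calls) if c is not None]
    correct = sum(t == c for t, c in assigned)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

def demo(seed=0):
    """Toy run on a small random library; all values are illustrative."""
    rng = random.Random(seed)
    barcodes = ["".join(rng.choice(ALPHABET) for _ in range(34)) for _ in range(200)]
    index = build_index(barcodes, k=5)
    truth = [rng.randrange(len(barcodes)) for _ in range(500)]
    reads = [mutate(barcodes[t], 0.05, 0.02, 0.02, rng) for t in truth]
    calls = [decode(r, barcodes, index, k=5, max_dist=8) for r in reads]
    return precision_recall(truth, calls)

if __name__ == "__main__":
    print(demo())
```

Leaving ambiguous or distant reads unassigned trades recall for precision, which is the same trade-off the reported figures describe. The toy library here is far smaller than the 21,000--85,000 barcodes benchmarked, so its numbers are not comparable to the results above.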

Version published to 10.21203/rs.3.rs-8850174/v1 on Research Square
Mar 2, 2026

