FastDedup A fast and memory-efficient tool for read deduplication

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

PCR duplicate removal is a critical first step in high-throughput sequencing pipelines, yet existing tools struggle with speed, memory, or correctness at modern dataset scales. We present FastDedup , a Rust-based FASTX deduplicator that transforms each read or read pair to a compact xxh3 hash fingerprint, drastically reducing memory usage and binding most of the execution time to disk I/ O. Benchmarked against six competing tools on synthetic human WGS datasets up to 300 million reads, FastDedup consistently leads on paired-end data, running more than 10 times faster than fastp . It also outperforms all tools on uncompressed single-end data, deduplicating a million reads in a second. We additionally report correctness failures in prinseq++ and clumpify . FastDedup is available under the MIT License via GitHub, Bioconda, and Cargo.

Article activity feed