Deacon: fast sequence filtering and contaminant depletion

Abstract

Motivation

Realising the value of large DNA sequence collections demands efficient search and extraction of sequences of interest. Search queries may vary in size from short gene sequences to multiple whole genomes that are too large to fit in computer memory. In microbial genomics, a routine search application involving both large queries and large collections is the removal of contaminating host genome sequences from microbial (meta)genomes. Where the host is human, sensitive classification and excision of host sequences is usually necessary to protect host genetic information. Precise classification is also critical in order to retain microbial sequences and permit accurate microbial genomic analysis. While human pangenomes have been shown to increase sensitivity of human sequence classification, existing bioinformatic host depletion approaches have either limited precision when used with metagenomes or large computing resource requirements.

Results

We present Deacon, an efficient and versatile sequence filter for raw sequence files and streams. We demonstrate its leading accuracy for the task of host depletion while using fewer computing resources than existing approaches. By querying a human pangenome index for minimizers contained in each input sequence, Deacon accurately classifies and discards diverse human sequences from long reads at over 250 Mbp/s on a commodity laptop. We validate classification sensitivity, specificity and speed against existing methods using simulated short and long reads from diverse catalogues of human, bacterial and viral genomes. Beyond host depletion, Deacon is well suited to common sequence search and filtering applications, particularly those involving large queries. Capable of indexing a human genome in under 30 s, Deacon can rapidly compose custom minimizer indexes using set operations, facilitating efficient search and filtering of massive sequence datasets using gigabase queries.
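
The general idea of minimizer-based filtering described above can be illustrated with a minimal Python sketch. This is not Deacon's implementation (Deacon is written in Rust), and every name and parameter here (`minimizers`, `build_index`, `deplete`, `k`, `w`, `min_hits`, the BLAKE2b hash) is a hypothetical choice made for illustration. The sketch also ignores reverse complements, which a practical tool would need to handle.

```python
import hashlib


def kmer_hash(kmer):
    """Stable 64-bit hash of a k-mer (illustrative choice of hash function)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k=5, w=4):
    """Return the set of minimizer hashes of seq.

    A minimizer is the smallest k-mer hash within each window of
    w consecutive k-mers.
    """
    seq = seq.upper()
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) < w:
        # Sequence shorter than one full window: keep its single smallest hash.
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def build_index(sequences, k=5, w=4):
    """Build an index as the union of minimizers over reference sequences."""
    index = set()
    for s in sequences:
        index |= minimizers(s, k, w)
    return index


def deplete(reads, index, k=5, w=4, min_hits=1):
    """Yield only reads sharing fewer than min_hits minimizers with the index."""
    for read in reads:
        hits = len(minimizers(read, k, w) & index)
        if hits < min_hits:
            yield read
```

Because each index in this sketch is a plain set of hashes, composing a custom index reduces to ordinary set operations, for example `build_index(host_seqs) - build_index(microbial_seqs)` to drop minimizers shared with sequences that should be retained.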

Availability and implementation

Deacon is implemented as an MIT-licensed command line tool written in Rust and packaged with Bioconda. Code is available from https://github.com/bede/deacon .
