Seqwin: Ultrafast identification of signature sequences in microbial genomes
Abstract
Motivation
Polymerase chain reaction (PCR) enables rapid, cost-effective diagnostics, but it requires prior identification of genomic regions that allow sensitive and specific detection of target microbial groups, herein referred to as microbial signature sequences. Tens of thousands of microbial genomes are now available, a scale that limits the applicability of existing manual and automated approaches for identifying signatures. Approaches capable of leveraging all available microbial genomes will ensure sensitive and accurate DNA signature identification and enable robust pathogen detection for clinical, environmental, and public health applications. We introduce Seqwin, an open-source framework designed to automate microbial genome signature discovery.
Results
Seqwin builds weighted pan-genome minimizer graphs and uses a traversal algorithm to identify signature sequences that occur frequently in target genomes but remain rare in non-targets. Unlike earlier tools that depend on strict presence or absence of sequences, Seqwin accommodates natural sequence variation and scales to very large genome collections. When applied to genomes from C. difficile, M. tuberculosis, and S. enterica, Seqwin recovered more high-quality signatures than alternative methods at a lower computational cost. Seqwin analysis of nearly 15,000 S. enterica genomes yielded over 200 candidate signatures in less than 10 minutes. Seqwin thus provides an open-source solution to the long-standing need for scalable microbial signature discovery and diagnostic assay design.
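To make the idea of "frequent in targets, rare in non-targets" concrete, the following toy sketch computes minimizers for a few short sequences and weights each one by the difference between the fraction of target genomes and the fraction of non-target genomes that contain it. This is not Seqwin's implementation: the function names, the lexicographic minimizer ordering, the parameters `k` and `w`, and the scoring formula are all assumptions chosen for illustration, and the graph construction and traversal steps are omitted entirely.

```python
from collections import Counter


def minimizers(seq, k=5, w=4):
    """Return the set of (w, k)-minimizers of seq.

    Illustrative only: the minimizer of a window is taken to be the
    lexicographically smallest of its w consecutive k-mers. Production
    tools typically order k-mers by a hash and use canonical k-mers.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) < w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def score_minimizers(targets, non_targets, k=5, w=4):
    """Weight each minimizer by (target fraction) - (non-target fraction).

    A score near 1 means the minimizer is in most targets and few
    non-targets; a score near -1 means the opposite. Hypothetical
    scoring, not the weighting used by Seqwin.
    """
    if not targets or not non_targets:
        raise ValueError("need at least one target and one non-target")
    t_counts = Counter()
    for genome in targets:
        t_counts.update(minimizers(genome, k, w))
    n_counts = Counter()
    for genome in non_targets:
        n_counts.update(minimizers(genome, k, w))
    return {
        m: t_counts[m] / len(targets) - n_counts[m] / len(non_targets)
        for m in set(t_counts) | set(n_counts)
    }


# Toy data: two near-identical "target" genomes and one unrelated non-target.
targets = ["AAACCCGGGTTTACGT", "AAACCCGGGTTTACGA"]
non_targets = ["TTTTTTTTTTTTTTTT"]
scores = score_minimizers(targets, non_targets)
```

Because the score is a fraction rather than a strict presence/absence test, a minimizer missing from a handful of target genomes still ranks highly, which is the kind of tolerance to natural variation the abstract describes.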
Availability
Seqwin is freely available for academic use (https://github.com/treangenlab/Seqwin) and can be installed via Bioconda.