Integrated population clustering and genomic epidemiology with PopPIPE
Abstract
Genetic distances between bacterial DNA sequences can be used to cluster populations into closely related subpopulations, and as an additional source of information when detecting possible transmission events. Because bacterial genomes vary in gene content and order, reference-free methods offer more sensitive detection of genetic differences, especially among closely related samples found in outbreaks. However, across longer genetic distances, frequent recombination can make calculation and interpretation of these differences more challenging, requiring significant bioinformatic expertise and manual intervention during the analysis process. Here we present a Population analysis PIPEline (PopPIPE) which combines rapid reference-free genome analysis methods to analyse bacterial genomes across these two scales, splitting whole populations into subclusters and detecting plausible transmission events within closely related clusters. We use k-mer sketching to split populations into strains, followed by split k-mer analysis and recombination removal to create alignments and subclusters within these strains. We first show that this approach creates high-quality subclusters on a population-wide dataset of Streptococcus pneumoniae. When applied to nosocomial vancomycin-resistant Enterococcus faecium samples, PopPIPE finds transmission clusters which are more epidemiologically plausible than those from core genome or MLST-based approaches. Our pipeline is rapid and reproducible, creates interactive visualisations, and can easily be reconfigured and re-run on new datasets. PopPIPE therefore provides a user-friendly pipeline for analyses spanning species-wide clustering to outbreak investigations.
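To make the idea of k-mer sketching concrete, the following is a minimal illustrative sketch of a bottom-s MinHash comparison between two sequences. It is not PopPIPE's implementation: the function names, the k-mer length, the sketch size and the use of SHA-1 are all assumptions chosen for brevity, and, unlike production tools, it ignores the reverse strand.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer in seq (forward strand only, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def sketch(seq, k=5, size=16):
    """Bottom-s MinHash sketch: the `size` smallest hashes of the distinct k-mers.

    k and size are illustrative values, far smaller than those used on real genomes.
    """
    hashes = {int(hashlib.sha1(km.encode()).hexdigest(), 16) for km in kmers(seq, k)}
    return sorted(hashes)[:size]


def jaccard_estimate(a, b, size=16):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    # Take the `size` smallest hashes of the combined sketches ...
    merged = sorted(set(a) | set(b))[:size]
    if not merged:
        return 0.0
    # ... and count how many of them occur in both sketches.
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

A similarity close to 1 suggests that two genomes share most of their k-mers, while lower values indicate greater divergence. Clustering samples on distances derived from such estimates is one way a population could be divided into strains without a reference genome.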
Impact statement
As time passes, bacterial genomes accumulate small changes in their sequence due to mutations, or larger changes in their content due to horizontal gene transfer. Using genome sequences, it is possible to use phylogenetics to work out the most likely order in which these changes happened, and how long they took to happen. One can then estimate the time that separates any two bacterial samples: if it is short, they may have been directly transmitted or acquired from the same source; if it is long, they must have been acquired separately. This information can be used to determine transmission chains, in conjunction with dates and locations of infections. Understanding transmission chains enables targeted infection control measures. However, correctly calculating the genetic evidence for transmission is made difficult by the need to distinguish different types of sequence changes, to handle large amounts of genome data, and to use multiple complex bioinformatic tools. We addressed this gap by creating a computational workflow, PopPIPE, which automates the process of detecting possible transmissions using genome sequences. PopPIPE applies state-of-the-art tools and is fast and easy to run, making this technology available to a wider audience of researchers.
Data summary
The code for this pipeline is available at https://github.com/bacpop/PopPIPE and as a Docker image at https://hub.docker.com/r/poppunk/poppipe. Raw sequencing reads for Enterococcus faecium isolates have been deposited at the NCBI under BioProject accession number PRJNA997588.