Verticall: A fast and robust tool for recombination detection in large-scale bacterial genomic datasets
Abstract
The inference and removal of horizontally acquired genomic regions is a crucial step in phylogenomic analyses for evolutionary studies. Existing tools perform well on clonal lineage-focused datasets on the scale of hundreds of genomes, but are limited in their ability to analyse larger or more diverse datasets. Here we present Verticall, a tool to identify recombinant regions in bacterial assemblies and generate recombination-free phylogenies, which scales to thousands of genomes from clonal to genus-level diversity. Verticall uses a non-parametric approach to assign genomic regions as horizontally or vertically related based on the distribution of pairwise genetic distances between genomes. Recombination-free phylogenetic trees may be inferred by either calculating a pairwise genetic distance matrix from vertical-only regions (distance-tree approach) or by pairwise comparisons of all genomes to a reference and then masking horizontally acquired regions in a pseudo-alignment to the reference (alignment-tree approach). We demonstrate Verticall’s performance using four publicly available whole-genome sequence datasets of varying sample sizes (range: 154–4,857 genomes) and evolutionary scales (ranging from within-lineage to genus-wide diversity). Across all four datasets, Verticall showed comparable or superior performance to the established tools Gubbins and ClonalFrameML in terms of computational efficiency, plausibility of inferred phylogenetic trees, and recovery of temporal signal for molecular dating. Our results show that Verticall is a useful tool for detecting recombination more efficiently and accurately, particularly when applied to datasets for which existing tools are limited, including large datasets with hundreds to thousands of genomes and those that span entire species or genera. Verticall is available free and open source at https://github.com/rrwick/Verticall
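The general idea of labelling regions from a distribution of pairwise distances can be illustrated with a toy sketch. This is not Verticall's implementation: the function names, the fixed window size, the median-absolute-deviation rule, the `tolerance` value, and the use of `N` as a mask character are all assumptions chosen for illustration. The sketch takes per-window difference counts for one pair of genomes, labels windows that sit far from the bulk of the distribution as horizontal, and then shows how a vertical-only distance (distance-tree style) or a masked sequence (alignment-tree style) might be derived from those labels.

```python
from statistics import median


def classify_windows(window_diffs, tolerance=3.0):
    """Label each window 'vertical' or 'horizontal' (illustrative rule only).

    window_diffs: number of differences observed in each fixed-size window
    of a pairwise genome comparison. Windows whose count lies more than
    `tolerance` median absolute deviations from the median are labelled
    'horizontal'; this threshold rule is an assumption, not Verticall's.
    """
    centre = median(window_diffs)
    mad = median(abs(d - centre) for d in window_diffs)
    spread = mad if mad > 0 else 1e-9  # avoid a zero-width acceptance band
    return [
        "horizontal" if abs(d - centre) > tolerance * spread else "vertical"
        for d in window_diffs
    ]


def vertical_distance(window_diffs, labels, window_size):
    """Differences per site, using only windows labelled 'vertical'."""
    vertical = [d for d, lab in zip(window_diffs, labels) if lab == "vertical"]
    if not vertical:
        raise ValueError("no vertical windows to estimate a distance from")
    return sum(vertical) / (len(vertical) * window_size)


def mask_horizontal(sequence, labels, window_size):
    """Replace windows labelled 'horizontal' with 'N' (hypothetical mask)."""
    chunks = []
    for i, label in enumerate(labels):
        chunk = sequence[i * window_size:(i + 1) * window_size]
        chunks.append(chunk if label == "vertical" else "N" * len(chunk))
    return "".join(chunks)


# Hypothetical difference counts per 1,000-bp window for one genome pair.
diffs = [1, 2, 1, 1, 30, 28, 1, 2]
labels = classify_windows(diffs)
print(labels)
print(vertical_distance(diffs, labels, window_size=1000))
```

In this toy input, the two windows with roughly thirty differences stand out from the remaining windows and are excluded from the distance estimate, which is the kind of inflation by horizontally acquired variation that the abstract describes removing.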
Impact Statement
Many bacterial species can acquire genetic material from external sources and stably incorporate it into their own genomes through homologous recombination. During phylogenomic analyses to investigate outbreaks or for evolutionary studies, a core objective is often to reconstruct the evolutionary history of the studied organisms independent of these horizontally acquired genomic regions. This is particularly desirable when the aim is to construct dated phylogenies, as horizontally acquired variation can interfere with the molecular clock signal on which dating relies. Existing recombination detection programs perform well in certain contexts, but their algorithms are not suitable for datasets with very high diversity or thousands of genomes. We addressed this gap by developing the software package Verticall. We show that this approach produces comparable results to existing software for smaller, more clonal datasets, but also performs well on datasets that the existing packages cannot handle.
Data Summary
Verticall is available free and open source at https://github.com/rrwick/Verticall. We used published whole-genome sequence data deposited in public databases (Pathogenwatch [https://pathogen.watch/]; European Nucleotide Archive [https://www.ebi.ac.uk/ena/]; Sequence Read Archive [https://www.ncbi.nlm.nih.gov/sra/]). Accession numbers for the raw whole-genome sequences are presented in Tables S2–S6. All data, code, and analysis commands used to generate the results and figures presented in this paper are available on figshare (DOI: 10.6084/m9.figshare.31930821) and GitHub (https://github.com/erkison/verticall_paper).