cgDist: An Enhanced Algorithm for Efficient Calculation of pairwise SNP and InDel differences from Core Genome Multilocus Sequence Typing
Abstract
Bacterial genomic surveillance requires balancing computational efficiency with genetic resolution for effective outbreak investigation. Traditional cgMLST distance calculations treat all allelic differences as equivalent units, potentially obscuring nucleotide-level variation critical for source attribution. While SNP-based methods provide enhanced resolution, their computational requirements limit routine deployment in surveillance laboratories. We present cgDist, an algorithm that bridges this resolution gap by calculating nucleotide-level distances directly from cgMLST allelic profiles. Its unified cache architecture stores comprehensive alignment statistics, enabling multiple distance calculation modes without redundant computation and supporting both dataset-specific and schema-complete cache generation. This design transforms genomic surveillance from batch processing to continuous streaming analysis, with cumulative performance benefits as laboratories accumulate alignment data. cgDist works best as a precision "zoom lens" for the detailed investigation of clusters identified through initial cgMLST screening. Rather than restructuring entire population relationships, this targeted approach maximizes epidemiological insight precisely where enhanced resolution is most valuable. The algorithm ensures that cgDist distances are always greater than or equal to the corresponding cgMLST distances, preserving epidemiological interpretability while adding critical genetic discrimination. The system also includes integrated recombination detection, which leverages cached alignment statistics to identify potential horizontal gene transfer events through mutation density analysis. This multi-scale analytical framework, spanning population screening, cluster zoom analysis, and recombination detection, provides comprehensive surveillance capabilities within the computational constraints of routine public health practice.
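The core idea of the abstract can be illustrated with a minimal Python sketch. It is not the cgDist implementation. The function names (`align_stats`, `cgdist`), the cache key layout, the unit-cost alignment, and the choice to skip loci with missing calls are all assumptions made for illustration. The sketch counts allelic differences as in cgMLST, and for each differing locus it looks up (or computes once and caches) the SNP and indel counts between the two allele sequences.

```python
def align_stats(a, b):
    """Unit-cost global alignment of two allele sequences.

    Returns (snps, indel_bases) along one minimum-cost alignment.
    A stand-in for a real aligner; the scoring scheme is an assumption.
    """
    # Each cell holds (cost, snps, indel_bases); min() compares cost first.
    prev = [(j, 0, j) for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [(i, 0, i)]
        for j in range(1, len(b) + 1):
            mismatch = int(a[i - 1] != b[j - 1])
            c, s, g = prev[j - 1]
            diag = (c + mismatch, s + mismatch, g)   # match or substitution
            c, s, g = prev[j]
            up = (c + 1, s, g + 1)                   # base deleted from a
            c, s, g = cur[j - 1]
            left = (c + 1, s, g + 1)                 # base inserted from b
            cur.append(min(diag, up, left))
        prev = cur
    _, snps, indels = prev[-1]
    return snps, indels


def cgdist(profile_a, profile_b, alleles, cache, mode="snps+indels"):
    """Return (cgmlst_distance, nucleotide_distance) for two profiles.

    profile_*: dict locus -> allele id (None marks a missing call).
    alleles:   dict locus -> {allele id: sequence}.
    cache:     dict shared across calls, keyed by (locus, low id, high id).
    """
    cgmlst = 0
    nucleotide = 0
    for locus, x in profile_a.items():
        y = profile_b.get(locus)
        if x is None or y is None or x == y:
            continue  # assumption: missing calls are ignored
        cgmlst += 1
        key = (locus, min(x, y), max(x, y))
        if key not in cache:  # align each allele pair only once
            cache[key] = align_stats(alleles[locus][x], alleles[locus][y])
        snps, indels = cache[key]
        nucleotide += snps if mode == "snps" else snps + indels
    return cgmlst, nucleotide


# Hypothetical toy schema with two loci.
alleles = {
    "locusA": {1: "ACGT", 2: "AGGT", 3: "ACT"},
    "locusB": {1: "TTTT", 2: "TTAA"},
}
cache = {}
sample_1 = {"locusA": 1, "locusB": 1}
sample_2 = {"locusA": 2, "locusB": 2}
print(cgdist(sample_1, sample_2, alleles, cache))  # (2, 3)
```

In this sketch the combined `snps+indels` mode can never fall below the cgMLST distance, because two distinct allele sequences differ by at least one edit. Both modes read the same cached tuple, so switching modes or adding new samples triggers no realignment of allele pairs already seen.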