Rapidly and reproducibly building a comprehensive catalogue of resistance-associated variants for M. tuberculosis
Abstract
Background
Catalogues of genetic variants associated with resistance underpin whole-genome sequencing (WGS)-based predictions of drug susceptibility in Mycobacterium tuberculosis, and are essential for molecular diagnostics and surveillance. The current gold-standard catalogues are those released by the WHO, but the underlying data are not fully released and the catalogues are difficult to interpret. Open and reproducible methods would help address these problems, extending the important work already done.
Methods
We have developed an automated method, catomatic, that uses a binomial test to associate informative isolates with resistance or susceptibility, and built a catalogue (catomatic-1) from the same 39,358 samples used to construct the first edition of the WHO catalogue (WHOv1). We performed a sensitivity analysis to optimise statistical and bioinformatic parameters for each drug, and benchmarked catomatic-1 against WHOv1 using an independent Validation Dataset of 14,380 isolates.
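As a rough illustration of the binomial-test idea, the sketch below classifies a single variant from the number of informative isolates carrying it and the number of those that are phenotypically resistant. It is not the catomatic implementation: the function names, the background rate of 0.1, the significance level of 0.05, and the R/S/U labels are all assumptions made for this example.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_two_sided_p(k: int, n: int, p: float) -> float:
    """Exact two-sided p-value: total probability of outcomes that are
    no more likely than the observed one under the null rate p."""
    observed = binom_pmf(k, n, p)
    threshold = observed * (1 + 1e-7)  # tolerance for floating-point ties
    total = sum(pm for pm in (binom_pmf(i, n, p) for i in range(n + 1))
                if pm <= threshold)
    return min(1.0, total)


def classify_variant(n_resistant: int, n_total: int,
                     background: float = 0.1, alpha: float = 0.05) -> str:
    """Return 'R', 'S' or 'U' (undetermined) for one variant.

    background and alpha are hypothetical defaults chosen for illustration.
    """
    if n_total == 0:
        return "U"
    p_value = binom_two_sided_p(n_resistant, n_total, background)
    if p_value >= alpha:
        return "U"  # not enough evidence in either direction
    return "R" if n_resistant / n_total > background else "S"
```

For example, under these assumed defaults a variant seen in 20 informative isolates, 18 of them resistant, would be labelled R, while one seen in only five isolates with a single resistant phenotype would remain undetermined.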
Findings
By using simpler statistics, catomatic-1 algorithmically classified 1,329 genetic variants, ranging from five for linezolid to 440 for pyrazinamide. WHOv1 included generalisable rules added by a panel of experts, increasing its predictive coverage, but at the cost of reproducibility. Despite not including such expert rules, catomatic-1 achieves comparable performance for all drugs, with sensitivities for first-line agents above 88% on the independent Validation Dataset. The automated process allowed us to efficiently explore parameter space; for instance, detecting resistant variants with low read support improved the sensitivity for all drugs.
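The performance figures quoted here can be understood through a short sketch of how sensitivity, specificity, and a definitive prediction rate might be computed from paired predictions and phenotypes. This is an illustration only: the label scheme (R, S, and U for no definitive prediction) and the choice to exclude U predictions from sensitivity and specificity are assumptions, not a description of the authors' benchmarking code.

```python
from typing import Dict, Optional, Sequence


def performance(predicted: Sequence[str],
                phenotype: Sequence[str]) -> Dict[str, Optional[float]]:
    """Compare catalogue predictions ('R', 'S', 'U') with phenotypes ('R', 'S').

    Isolates with a 'U' prediction are counted only in the definitive rate
    (an assumption made for this sketch).
    """
    pairs = list(zip(predicted, phenotype))
    definitive = [(p, t) for p, t in pairs if p in ("R", "S")]

    tp = sum(1 for p, t in definitive if p == "R" and t == "R")
    fn = sum(1 for p, t in definitive if p == "S" and t == "R")
    tn = sum(1 for p, t in definitive if p == "S" and t == "S")
    fp = sum(1 for p, t in definitive if p == "R" and t == "S")

    def ratio(num: int, den: int) -> Optional[float]:
        # None signals that the metric is undefined for this dataset
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "definitive_rate": ratio(len(definitive), len(pairs)),
    }
```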
Interpretation
Performant resistance catalogues for M. tuberculosis can be built automatically using transparent and reproducible statistical methods. As more data are collected, catalogue content and performance will evolve, highlighting the need for proper versioning, machine/human readability, and open access. This approach demonstrates that resistance catalogues used in surveillance and diagnostics can be rapidly and reproducibly updated.
Funding
The National Institute for Health and Care Research (NIHR), Engineering and Physical Sciences Research Council (EPSRC) and ORACLE Corporation.
Research in context
Evidence before this study
We searched PubMed, preprint servers (bioRxiv, medRxiv), and publicly available mutation catalogues for studies linking Mycobacterium tuberculosis genomic variants with drug resistance using whole-genome or targeted sequencing and phenotypic drug-susceptibility testing (pDST). Search terms combined “Mycobacterium tuberculosis”, “genome sequencing”, “mutation catalogue”, “mutation effects”, “drug resistance”, and individual drug names, with no language or date restrictions. We included studies providing paired clinical genomic and pDST or MIC data, excluding purely in-silico or case-only reports. This work directly builds on methodologies and data published by five prior studies, and makes primary comparisons with the First (WHOv1) and Second (WHOv2) Editions of the WHO Catalogue of mutations in Mycobacterium tuberculosis.
Added value of this study
We developed catomatic, a transparent, reproducible tool for building catalogues of resistance- and susceptibility-associated genetic variants. Trained on the same samples used to build WHOv1 and benchmarked on an independent Validation Dataset, catomatic achieves comparable sensitivity, specificity, and definitive prediction rates to WHOv1 without expert-rule augmentation and despite using simpler statistics. It optimises parameters per drug, produces machine-readable outputs (CSV/JSON), and demonstrates that adjusting read-support thresholds can improve detection of minor resistance subpopulations.
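To make the idea of versioned, machine-readable catalogue output concrete, the following sketch serialises a handful of catalogue rows to JSON and CSV using only the Python standard library. The field names, the version string, and the placeholder drug and variant identifiers are hypothetical and do not reflect the actual catomatic schema or any real catalogue entry.

```python
import csv
import io
import json
from typing import Dict, List

# Placeholder rows for illustration only; not real catalogue entries.
EXAMPLE_ROWS: List[Dict[str, object]] = [
    {"drug": "DRUG_X", "variant": "geneA@A10V", "prediction": "R", "p_value": 0.001},
    {"drug": "DRUG_X", "variant": "geneB@T25C", "prediction": "U", "p_value": 0.4},
]

FIELDS = ["drug", "variant", "prediction", "p_value"]  # assumed column set


def to_json(rows: List[Dict[str, object]], version: str) -> str:
    """Serialise rows with an explicit catalogue version for traceability."""
    return json.dumps({"version": version, "variants": rows},
                      indent=2, sort_keys=True)


def to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialise rows as CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping the version inside the JSON payload, rather than only in a filename, is one way to let downstream tools check which catalogue release produced a given prediction.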
Implications of all the available evidence
Catalogues of resistance-associated variants for M. tuberculosis can be rapidly and transparently constructed. Making catalogues available in human/machine-readable formats with uncertainty estimates will improve uptake of WGS for M. tuberculosis surveillance and diagnostics; using a reproducible process permits diagnostic test manufacturers, researchers, and clinical and public health laboratories to select the level of statistical support necessitated by their specific use case. Policymakers should balance the benefits of expert rules against loss of reproducibility. Future work will expand the size of the datasets used, integrate minimum inhibitory concentration data, and establish consensus workflows for routine, transparent catalogue updates.