Defining the Mycobacterium tuberculosis Pangenome and Suggestions for a New Composite Reference Sequence
Abstract
Mycobacterium tuberculosis (Mtb) causes tuberculosis (TB), a global disease with diverse clinical and microbiological manifestations. Studies into the biological causes of this phenotypic diversity have been largely limited to a few reference strains. A pangenome approach is likely to provide new insights. Pangenomic tuberculosis studies have been limited by the availability of only fragmented genome sequences and error-prone reference genomes. We used a de novo assembly pipeline that generates highly complete and accurate whole-genome sequences to produce 50 closed Mtb genomes across all seven major lineages. We identified 3,377 core gene clusters and 379 accessory clusters. Analysis showed that multi-copy core clusters were largely due to gene fragmentation (76%), paralogs (12%), nearly identical gene duplications (4%), or combinations of these (8%). Sixteen hypervariable regions (HVRs) were identified, including novel paralogs and variable PE/PPE genes. We consolidated these findings into a Pangenome Gene Reference Resource (PGRR) for precision alignment. Our study demonstrates the closed nature of the Mtb pangenome, with most variation in accessory genes and HVRs. The PGRR provides a foundation for improved drug/vaccine target discovery and highlights the need to move beyond the commonly used H37Rv strain to study Mtb genetic and phenotypic diversity.
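The core/accessory distinction used above can be illustrated with a minimal sketch. It assumes a cluster counts as "core" when it is present in every genome; the abstract does not state the threshold the authors applied, so `core_fraction` is an illustrative parameter, and the strain and cluster names are hypothetical toy data rather than results from the study.

```python
from typing import Dict, Set, Tuple


def split_core_accessory(
    presence: Dict[str, Set[str]],
    genomes: Set[str],
    core_fraction: float = 1.0,  # assumption: core = present in all genomes
) -> Tuple[Set[str], Set[str]]:
    """Partition gene clusters into core and accessory sets.

    presence maps each cluster name to the set of genomes carrying it.
    """
    core: Set[str] = set()
    accessory: Set[str] = set()
    for cluster, members in presence.items():
        # Fraction of the analysed genomes that contain this cluster.
        fraction = len(members & genomes) / len(genomes)
        if fraction >= core_fraction:
            core.add(cluster)
        else:
            accessory.add(cluster)
    return core, accessory


# Hypothetical toy data: three genomes, three gene clusters.
genomes = {"strainA", "strainB", "strainC"}
presence = {
    "cluster_1": {"strainA", "strainB", "strainC"},  # in every genome
    "cluster_2": {"strainA", "strainC"},             # missing from one
    "cluster_3": {"strainB"},                        # strain-specific
}

core, accessory = split_core_accessory(presence, genomes)
print(sorted(core), sorted(accessory))
```

Lowering `core_fraction` (for example to 0.95) gives a "soft core" definition, which some pangenome tools use to tolerate assembly or annotation gaps; with closed genomes a strict threshold is a natural starting point.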
IMPORTANCE
Tuberculosis (TB), caused by Mycobacterium tuberculosis (Mtb), affects millions globally. Genetic differences among Mtb strains have been difficult to resolve due to incomplete genome references. We sequenced and analyzed complete genomes of 50 Mtb strains from all lineages, identifying 16 hypervariable regions and 3,498 core gene clusters whose diversity mostly stemmed from gene fragmentation, paralog duplication and deletion events, and differences in PE/PPE gene family representation. These differences may explain many of the varied clinical manifestations of TB. We created the Pangenome Gene Reference Resource (PGRR) to unify genetic data for precise comparative studies and to aid in developing new drugs, vaccines, and other interventions against this disease.