The information content of species: formal definitions of pangenome complexity track with bacterial lifestyle
Abstract
The information content encoded in the genomes of species varies across the microbial tree of life. Bacterial lifestyle has been shown to drive this information diversity. Challenging environments are linked to information accrual. Pangenome fluidity is often invoked to provide a measure of this genomic diversity. Fluid pangenomes contain genes found only in subsets of a species' strains. Tighter pangenomes contain more genes that define a shared core among strains. In any global comparative framework, pangenomes must be calculated across all known species. However, defining pangenomes is fraught with computational and biological challenges, requiring assembly, annotation, alignment, and phylogenetics of millions of orthologs. Here, we introduce an alternative view that employs agile complexity metrics to quantify the information density of pangenomes. In our framework, ensembles of free-living, motile, and non-pathogenic species have high genomic complexity. Ensemble complexity decreases in species bound to specific hosts. Because we eliminate annotation and alignment, our method is fast enough to evaluate existing species classifications across all known bacterial genomes. The approach democratizes classification, and our results highlight how broad the term “species” has become.
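The abstract does not name the complexity metrics it uses. To illustrate the general idea of an annotation-free, alignment-free measure, the sketch below uses a generic stand-in: the compressed size of a set of strain sequences per input base. The function names, the choice of zlib, and the synthetic sequences are all assumptions made for illustration, not the authors' method or data.

```python
import random
import zlib


def compressed_size(seq: str) -> int:
    """Bytes zlib (level 9) needs to store the sequence."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ensemble_complexity(strains: list[str]) -> float:
    """Compressed bytes per input base for the concatenated strains.

    Hypothetical proxy: low values mean the strains share most of their
    sequence (a tight ensemble); high values mean each strain
    contributes sequence the others lack (a fluid ensemble).
    """
    joined = "".join(strains)
    return compressed_size(joined) / len(joined)


def random_genome(length: int, rng: random.Random) -> str:
    """Synthetic toy sequence; not real genomic data."""
    return "".join(rng.choice("ACGT") for _ in range(length))


rng = random.Random(0)
base = random_genome(2000, rng)

# Five identical strains versus five unrelated strains.
tight = [base] * 5
fluid = [random_genome(2000, rng) for _ in range(5)]

tight_score = ensemble_complexity(tight)
fluid_score = ensemble_complexity(fluid)
print(f"tight: {tight_score:.3f}  fluid: {fluid_score:.3f}")
```

The tight ensemble scores lower because the compressor encodes the repeated strains as back-references. This toy works only because the sequences fit inside zlib's 32 KB window; real genomes are far larger, so a practical version would need a compressor or estimator that captures long-range redundancy.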