A contextualised protein language model reveals the functional syntax of bacterial evolution
Abstract
Bacteria have evolved a vast diversity of functions and behaviours which are currently incompletely understood and poorly predicted from DNA sequence alone. To understand the syntax of bacterial evolution and discover genome-to-phenotype relationships, we curated over 1.3 million genomes spanning bacterial phylogenetic space, representing each as an ordered sequence of proteins which collectively were used to train a transformer-based, contextualised protein language model, Bacformer. By pretraining the model to learn genome-wide evolutionary patterns, Bacformer captures the compositional and positional relationships of proteins and can accurately: predict protein-protein interactions, operon structure (which we validated experimentally), and protein function; infer phenotypic traits and identify likely causal genes; and design template synthetic genomes with desired properties. Thus, Bacformer represents a new foundation model for bacterial genomics that provides biological insights and a framework for prediction, inference, and generative tasks.