A data-driven rediscovery of the specificity-conferring code of adenylation domains in nonribosomal peptide synthetases
Abstract
Nonribosomal peptide synthetases (NRPSs) are large modular enzymes that assemble structurally diverse peptides, many of pharmacological importance, including antibiotics and immunosuppressants. Within each NRPS module, the adenylation (A) domain selects the substrate to be incorporated, a choice governed by a small set of residues lining the binding pocket. For two decades, computational prediction of A-domain substrate specificity has relied on residue sets—most prominently the Stachelhaus code and the 34-residue “8 Å code”—that were defined by spatial proximity to the substrate rather than by demonstrated predictive value. Here we revisit which residues govern substrate specificity from a purely data-driven perspective. We assembled a non-redundant dataset of 5,366 A-domain sequences (4,693 bacterial and 673 fungal) and used information-theoretic measures to rank alignment positions by their statistical association with substrate identity, without restricting candidate positions to any predefined structural shell. This procedure yielded two compact, kingdom-specific codes: IG15B (15 positions) for bacterial and IG13F (13 positions) for fungal A-domains. Both match or exceed the predictive accuracy of the 34-residue 8 Å code while using fewer than half its positions, and both independently recover the majority of the classical Stachelhaus positions. Notably, our analysis identifies four positions (242, 280, 281, and 284) that lie outside all conventional codes yet carry non-redundant specificity information and co-localize with classical determinants on two helices flanking the binding pocket. These positions provide new candidate sites for the rational engineering of A-domain specificity.
Author summary
Many clinically important drugs, including antibiotics such as vancomycin and immunosuppressants such as cyclosporin, are nonribosomal peptides, assembled by large enzymes known as nonribosomal peptide synthetases. These enzymes contain adenylation domains that act as molecular gatekeepers, each selecting one chemical building block to add to a growing peptide. Identifying which amino acids within a domain determine this choice is central both to predicting what an enzyme produces and to re-engineering it to make new compounds. For over twenty years, researchers have approached this question by selecting the amino acids that sit physically closest to the substrate. However, being close to the substrate does not guarantee that a residue actually influences substrate selection. In this work, we instead let the data decide: using thousands of adenylation domain sequences, we measured which positions are statistically most informative about the substrate, using information gain, mutual information, and the χ² statistic. We found that far fewer positions than conventionally used are sufficient to predict specificity, and, importantly, we identified several influential positions that earlier approaches had overlooked because they lie just beyond the conventional distance cutoff. These positions offer promising new targets for engineering these enzymes to produce novel peptide-based drugs.
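To make the idea of ranking alignment positions by information gain concrete, the following is a minimal sketch and not the authors' pipeline. It assumes equal-length, pre-aligned sequences with one substrate label each, treats a gap character as just another symbol, and uses hypothetical function names (`entropy`, `information_gain`, `rank_positions`) and a made-up toy alignment.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(column, substrates):
    """H(substrate) minus H(substrate | residue observed in this column)."""
    groups = defaultdict(list)
    for residue, substrate in zip(column, substrates):
        groups[residue].append(substrate)
    total = len(substrates)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(substrates) - conditional


def rank_positions(aligned_seqs, substrates):
    """Return (column_index, information_gain) pairs, most informative first."""
    n_cols = len(aligned_seqs[0])
    scores = [
        (i, information_gain([seq[i] for seq in aligned_seqs], substrates))
        for i in range(n_cols)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data (invented for illustration): column 0 is constant, column 1
# separates the two substrates perfectly, column 2 is uninformative.
toy_seqs = ["ADK", "ADR", "AGK", "AGR"]
toy_substrates = ["Phe", "Phe", "Val", "Val"]
ranking = rank_positions(toy_seqs, toy_substrates)
print(ranking)
```

In this toy case, column 1 ranks first with an information gain of 1 bit, while the other two columns score 0. A compact code in the spirit of IG15B or IG13F would correspond to keeping only the top-ranked columns, although the selection criteria actually used in the study are not reproduced here.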