Interpretable adenylation domain specificity prediction using protein language models
Abstract
Natural products have long been a rich source of diverse and clinically effective drug candidates. Non-ribosomal peptides (NRPs), polyketides (PKs), and NRP-PK hybrids are three classes of natural products that display a broad range of bioactivities, including antibiotic, antifungal, anticancer, and immunosuppressant activities. However, discovering these compounds through traditional bioactivity-guided techniques is costly and time-consuming, and often results in the rediscovery of known molecules. Consequently, genome mining has emerged as a high-throughput strategy for screening hundreds of thousands of microbial genomes for their potential to produce novel natural products. Adenylation (A) domains play a key role in the biosynthesis of NRPs and NRP-PK hybrids by recruiting the substrates that are incrementally assembled into the final structure. We propose MASPR, a machine learning method that leverages protein language models for accurate and interpretable prediction of A-domain substrate specificities. MASPR demonstrates superior accuracy and generalization over existing methods and can predict substrates that are not present in its training data, a setting known as zero-shot classification. We use MASPR to develop Seq2Hybrid, an efficient algorithm for predicting the structures of hybrid NRP-PK natural products from microbial genomes. Using Seq2Hybrid, we propose putative biosynthetic gene clusters for the orphan natural products Octaminomycin A, Dityromycin, SW-163B, and JBIR-39.