DeepSeMS: a large language model reveals hidden biosynthetic potential of the global ocean microbiome
Abstract
Microbial biosynthetic diversity holds immense potential for discovering natural products with therapeutic applications, yet a substantial fraction of the natural products derived from uncultivated microorganisms remains uncharacterized. The complexity of biosynthetic enzymes makes it difficult for current rule-based methods to accurately predict the chemical structures of secondary metabolites from genome sequences alone. Here, we present DeepSeMS, a large language model designed to predict the chemical structures of secondary metabolites from various microbial biosynthetic gene clusters. Built on the Transformer architecture, DeepSeMS derives sequence features from the functional domains of biosynthetic enzymes and incorporates feature-aligned chemical structure enumeration for training data augmentation. External evaluation shows that DeepSeMS predicts more accurate chemical structures of secondary metabolites, reaching a Tanimoto coefficient of up to 0.6 against the ground truth and significantly outperforming antiSMASH and PRISM, which reach coefficients of only 0.14 and 0.45, respectively. Moreover, DeepSeMS successfully predicted secondary metabolites for 96.60% of cryptic biosynthetic gene clusters, whereas existing methods achieve success rates below 50%. Leveraging DeepSeMS, we characterized over 65,000 novel secondary metabolites from the global ocean microbiome with previously undocumented structural types, ecological distribution, and biomedical applications, particularly as antibiotics. A login-free and user-friendly web server for DeepSeMS ( https://biochemai.cstspace.cn/deepsems/ ) has been launched, featuring an integrated global ocean microbial secondary metabolites repository to expedite the discovery of novel natural products. Collectively, this study underscores the capacity of large language model-driven methods to reveal the hidden biosynthetic potential of the global ocean microbiome.
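For readers unfamiliar with the Tanimoto coefficient used in the evaluation, the following is a minimal sketch of how it is commonly computed between two molecular fingerprints, each represented as the set of its "on" bit positions. The abstract does not state which fingerprint type was used, so the fingerprints below are hypothetical placeholders, and the function name `tanimoto` is an illustrative choice rather than part of DeepSeMS.

```python
from typing import AbstractSet


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of on-bit positions."""
    if not a and not b:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    shared = len(a & b)
    # |A intersect B| / |A union B|, with the union size computed
    # as |A| + |B| - |A intersect B|.
    return shared / (len(a) + len(b) - shared)


# Hypothetical fingerprints for a predicted and a reference structure.
predicted = {1, 4, 7, 9, 12}
reference = {1, 4, 7, 10, 12, 15}

score = tanimoto(predicted, reference)
print(round(score, 2))  # 4 shared bits out of 7 distinct bits -> 0.57
```

A value of 1.0 means the two fingerprints are identical and 0.0 means they share no bits, so a coefficient of 0.6 indicates that a prediction shares most, but not all, of its fingerprint features with the ground-truth structure.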