Deciphering Biosynthetic Gene Clusters with a Context-aware Protein Language Model

Abstract

Microbial secondary metabolites, synthesized by biosynthetic gene clusters (BGCs), play critical roles in ecological interactions and offer vast potential for biotechnological and pharmaceutical applications. Despite advances in computational BGC detection, current methods rely on time-consuming sequence alignments, known homologs, and manually defined rules, which limits their robustness and generalizability. To address these limitations, we present CoreFinder, a deep learning framework that integrates protein language models (pLMs) and genomic context to predict product class and decipher gene functions within BGCs without alignment. CoreFinder achieved higher precision (0.945; 842/891) and recall (0.821; 842/1,025) than antiSMASH for core gene annotation in over 700 experimentally validated fungal BGCs. Building on CoreFinder, we introduce an end-to-end, scalable workflow for BGC screening and deciphering that is about 240 times faster than antiSMASH. Applied to 256 genomes spanning 197 taxa, CoreFinder identified 6,414 core genes within 4,585 BGCs. Further analysis indicates that a non-ribosomal peptide synthetase (NRPS) family likely existed prior to the divergence of Fusarium and Aspergillus and evolved into function-specific gene clusters. These findings highlight the potential of CoreFinder as a tool for accelerating natural product discovery and for advancing synthetic biology by uncovering novel biosynthetic pathways with biotechnological and pharmaceutical applications.

Highlights

  • CoreFinder is a context-aware deep learning framework that leverages a protein language model to predict gene functions and associated metabolite classes within biosynthetic gene clusters (BGCs).

  • CoreFinder deciphers BGCs without sequence alignment or manually defined rules. It achieved a precision of 0.945 (842/891) and a recall of 0.821 (842/1,025) for core gene annotation in over 700 experimentally validated fungal BGCs.

  • Building on CoreFinder, we introduced an end-to-end, deep learning-based workflow for BGC screening and deciphering that is two orders of magnitude faster than current methods. Applied to 256 fungal genomes spanning 197 taxa, CoreFinder identified 6,414 core genes within 4,585 BGCs.

  • CoreFinder uncovered an ancient NRPS gene family that existed before the divergence of Fusarium and Aspergillus, underscoring the evolutionary retention of biosynthetic tools in fungi for ecological adaptation.
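
The precision and recall figures reported for core gene annotation follow directly from the stated counts. The short sketch below reproduces that arithmetic. The function and variable names are illustrative assumptions rather than part of CoreFinder, and reading 842 as correct predictions, 891 as all predicted core genes, and 1,025 as all annotated core genes is inferred from the ratios given in the text.

```python
def precision_recall(true_positives, predicted_total, actual_total):
    """Return (precision, recall) from raw counts.

    precision = correct predictions / all predictions made
    recall    = correct predictions / all true items in the reference set
    """
    precision = true_positives / predicted_total
    recall = true_positives / actual_total
    return precision, recall


# Counts taken from the text: 842 correctly predicted core genes,
# 891 predicted core genes in total, 1,025 core genes in the reference set.
TRUE_POSITIVES = 842
PREDICTED_TOTAL = 891
ACTUAL_TOTAL = 1025

precision, recall = precision_recall(TRUE_POSITIVES, PREDICTED_TOTAL, ACTUAL_TOTAL)

# Implied error counts (derived here, not reported in the text).
false_positives = PREDICTED_TOTAL - TRUE_POSITIVES  # predicted but not in the reference set
false_negatives = ACTUAL_TOTAL - TRUE_POSITIVES     # in the reference set but missed

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"false positives = {false_positives}, false negatives = {false_negatives}")
```

Under this reading, the counts imply 49 predicted core genes that are not in the reference set and 183 reference core genes that were missed.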
