Uncovering Microbial Biosynthetic Potential with Genomic Context-aware Protein Language Model
Abstract
Microbial secondary metabolites, synthesized by biosynthetic gene clusters (BGCs), offer vast potential for biotechnological applications. Among BGC profiling techniques, computational detection methods face challenges, including time-consuming alignment and reliance on predefined profiles. To address these challenges, we present BGC-Finder, an end-to-end pipeline that uses protein language models to detect and annotate BGCs in microbial genomes and metagenomes. This approach increases profiling speed by up to 100-fold and employs genomic context-aware modeling to facilitate interpretable assessment of gene essentiality and large-scale BGC clustering. BGC-Finder outperformed traditional methods, detecting 9.49% more biosynthetic-core genes and 27.70% more cytochrome P450s in 742 experimentally validated BGCs. Notably, it retrieved 31 remote biosynthetic homologs from 210 polar marine metagenomes and identified 4,585 BGCs with 6,388 core genes from 256 fungal genomes. These findings highlight BGC-Finder’s capability to illuminate “microbial biosynthesis dark matter” (sequence-unrelated, function-similar biosynthetic enzymes) and expedite natural product discovery.
Highlights
- BGC-Finder is an accurate and ultrafast pipeline leveraging protein language models (pLMs) to predict and annotate biosynthetic gene clusters (BGCs) from microbial genomes and metagenomes.
- The genomic context-aware model enables interpretable analysis: attention-driven identification of essential biosynthetic genes and embedding-guided BGC clustering.
- BGC-Finder sensitively retrieves remote homologous BGCs from both bacterial and fungal genomes, uncovering hidden ‘microbial biosynthesis dark matter’.
- We discovered a non-ribosomal peptide synthetase (NRPS) family that is involved in function-specific BGCs in two evolutionarily distant fungi.
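To make the idea of embedding-guided BGC clustering in the highlights more concrete, the following is a minimal sketch and not BGC-Finder's actual procedure, which is not described on this page. It assumes each gene already has a fixed-length embedding vector, averages those vectors into one vector per BGC, and groups BGCs greedily by cosine similarity. The function names, the toy vectors, and the 0.9 threshold are all hypothetical and chosen only for illustration.

```python
import math


def mean_pool(gene_embeddings):
    """Average per-gene vectors into a single BGC-level vector."""
    n = len(gene_embeddings)
    return [sum(column) / n for column in zip(*gene_embeddings)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_cluster(bgc_vectors, threshold=0.9):
    """Assign each BGC to the first cluster whose representative
    (its first member) has cosine similarity >= threshold."""
    representatives = []
    labels = {}
    for name, vec in bgc_vectors.items():
        for idx, rep in enumerate(representatives):
            if cosine(vec, rep) >= threshold:
                labels[name] = idx
                break
        else:
            # No sufficiently similar cluster found: start a new one.
            representatives.append(vec)
            labels[name] = len(representatives) - 1
    return labels


# Toy, made-up 3-dimensional gene embeddings (real pLM embeddings are much larger).
toy_bgcs = {
    "bgc_a": [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]],
    "bgc_b": [[0.95, 0.05, 0.05], [1.0, 0.0, 0.0]],
    "bgc_c": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]],
}
pooled = {name: mean_pool(genes) for name, genes in toy_bgcs.items()}
labels = greedy_cluster(pooled, threshold=0.9)
print(labels)
```

In this toy input, bgc_a and bgc_b point in nearly the same direction and share a cluster, while bgc_c falls into its own. A production pipeline would more likely use a dedicated clustering library and a threshold tuned on validated BGC families.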