Uncovering Microbial Biosynthetic Potential with Genomic Context-aware Protein Language Model

Zixin Kang
Haohong Zhang
Chaoqin Liang
Ronghua Yang
Ying Ye
Hong Bai
Yonghui Zhang
Kang Ning

Abstract

Microbial secondary metabolites, synthesized by biosynthetic gene clusters (BGCs), offer vast potential for biotechnological applications. Among BGC profiling techniques, computational detection methods face challenges, including time-consuming alignment and reliance on predefined profiles. To address these challenges, we present BGC-Finder, an end-to-end pipeline that uses protein language models to detect and annotate BGCs in microbial genomes and metagenomes. The approach increases profiling speed by up to 100-fold and employs genomic context-aware modeling to support interpretable assessment of gene essentiality and large-scale BGC clustering. BGC-Finder outperformed traditional methods, detecting 9.49% more biosynthetic-core genes and 27.70% more cytochrome P450s in 742 experimentally validated BGCs. Notably, it retrieved 31 remote biosynthetic homologs from 210 polar marine metagenomes and identified 4,585 BGCs with 6,388 core genes from 256 fungal genomes. These findings highlight BGC-Finder’s capability to illuminate “microbial biosynthesis dark matter” (biosynthetic enzymes that are unrelated in sequence but similar in function) and expedite natural product discovery.

Highlights

BGC-Finder is an accurate and ultrafast pipeline leveraging protein language models (pLMs) to predict and annotate biosynthetic gene clusters (BGCs) from microbial genomes and metagenomes.
The genomic context-aware model enables interpretable analysis: attention-driven identification of essential biosynthetic genes and embedding-guided BGC clustering.
BGC-Finder sensitively retrieves remote homologous BGCs from both bacterial and fungal genomes, uncovering hidden ‘microbial biosynthesis dark matter’.
We discovered a non-ribosomal peptide synthetase (NRPS) family whose members are involved in function-specific BGCs in two evolutionarily distant fungi.
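
The embedding-guided clustering mentioned in the highlights can be illustrated with a minimal sketch. The abstract does not say how BGC-Finder pools protein embeddings or which clustering algorithm it uses, so everything here is an assumption made for illustration: mean pooling of per-protein vectors, cosine similarity, a greedy grouping rule, the 0.9 threshold, and the toy three-dimensional vectors (real protein language model embeddings have hundreds of dimensions or more). All function and variable names are hypothetical.

```python
import math

def mean_pool(protein_embeddings):
    """Average per-protein vectors into one vector for the whole cluster (assumed pooling)."""
    n = len(protein_embeddings)
    dim = len(protein_embeddings[0])
    return [sum(vec[i] for vec in protein_embeddings) / n for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def group_by_similarity(bgc_vectors, threshold=0.9):
    """Greedy grouping (illustrative rule, not the paper's method): a BGC joins
    the first group whose first member is at least `threshold` similar to it;
    otherwise it starts a new group."""
    groups = []
    for name, vec in bgc_vectors.items():
        for group in groups:
            representative = bgc_vectors[group[0]]
            if cosine_similarity(vec, representative) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Toy data: each BGC is a list of made-up per-protein embedding vectors.
toy_bgcs = {
    "bgc_a": [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]],
    "bgc_b": [[0.95, 0.05, 0.05], [1.0, 0.0, 0.0]],
    "bgc_c": [[0.0, 1.0, 0.9], [0.1, 0.9, 1.0]],
}
bgc_vectors = {name: mean_pool(vecs) for name, vecs in toy_bgcs.items()}
groups = group_by_similarity(bgc_vectors)
print(groups)
```

In this toy input, `bgc_a` and `bgc_b` point in nearly the same direction and end up in one group, while `bgc_c` forms its own. A real workflow would likely replace the greedy rule with an established clustering method and tune the threshold on known BGC families.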

Version published to 10.1101/2025.04.29.651206 on bioRxiv
May 3, 2025

Related articles

META-DIFF: a k-mer-based pipeline that detects differentially abundant sequences in metagenomics whole genome sequencing

This article has 8 authors:
1. Louis-Maël Guéguen
2. Alban Mathieu
3. Simon Pelletier
4. Anthony Woo
5. Namita Misra
6. Magali Moreau
7. Olivier Perin
8. Arnaud Droit
This article has no evaluations. Latest version: Jan 29, 2026

Understanding Pathways in Bioinformatics, Genomics, and Health Applications

This article has 1 author:
1. Diptarup Mallick
This article has no evaluations. Latest version: Jan 19, 2026

Unlocking the genomic landscape for antimicrobial domain discovery with a two-stage progressive residue-level annotation model

This article has 13 authors:
1. Peilin Xie
2. Xingchen Liu
3. Lantian Yao
4. Zhihao Zhao
5. Anming Yang
6. Jiahui Guan
7. Zijun Jiao
8. Zhihong Liu
9. Junwen Wang
10. Tzong-Yi Lee
11. Zigang Li
12. Bingyu Cui
13. Ying-Chih Chiang
This article has no evaluations. Latest version: Dec 11, 2025
