NPannotator: a genome- and chemistry-constrained automation for type I polyketide synthase pathway elucidation

Yash Chainani
Andre Cornman
Yunha Hwang

Abstract

Natural products (NPs) are structurally diverse bioactive compounds whose biosynthesis is encoded within biosynthetic gene clusters (BGCs). Although databases such as the Minimum Information about a Biosynthetic Gene Cluster (MIBiG) repository now catalog thousands of experimentally validated NP structures, the full biosynthetic pathway connecting individual domain sequences to specific chemical features of the final NP structures remains largely unannotated. This gap is especially pronounced for type I polyketide synthases (PKSs), modular assembly lines in which multiple enzymatic domains work in concert to condense acyl-CoA building blocks into complex polyketide scaffolds. Within these systems, acyltransferase (AT) domains govern which starter and extender units are incorporated at each elongation step, yet the substrate specificities of AT domains are known for only a fraction of cataloged clusters. Moreover, the catalytic order of the genes encoding PKS modules is not immediately apparent from existing database entries, leaving the correct module ordering for observed product structures uncertain. Here, we present NPannotator, an automated, genomic context-aware cheminformatics pipeline that infers both the catalytic ordering of PKS domains and the substrate specificities of a given PKS's AT domains. NPannotator loads a precomputed database of synthetically generated polyketide backbones, iteratively replaces default malonyl-CoA substrates with candidate starter and extender units via SMARTS-based substructure matching against the target NP, and selects the arrangement that maximizes chemical similarity. When benchmarked on the type I PKSs annotated in the expert-reviewed ClusterCAD dataset, NPannotator recovered both the correct gene ordering and the correct AT substrate annotations in 62.0% of cases, and achieved 80.0% accuracy on gene ordering alone. By bridging gene-level architecture with chemical outcomes, NPannotator represents a step toward systematically decoding how protein sequence and genomic organization encode chemical structure in the world of natural products.
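
The search described in the abstract (try candidate gene orderings, start from an all-malonyl-CoA backbone, swap in candidate extender units, keep the arrangement most similar to the target) can be illustrated with a toy sketch. This is not NPannotator's implementation: the function `annotate`, the unit labels in `CANDIDATES`, the gene names, and the reduction-state strings are all hypothetical, and a positional set fingerprint with Tanimoto similarity stands in for the SMARTS-based substructure matching and chemical similarity scoring that the real pipeline performs on molecular structures.

```python
from itertools import permutations

# Hypothetical extender-unit labels; "mal" (malonyl-CoA) is the default substrate.
CANDIDATES = ("mal", "mmal", "emal")


def fingerprint(chain):
    """Toy fingerprint: one (position, substrate, reduction) feature per module."""
    return {(i, sub, red) for i, (sub, red) in enumerate(chain)}


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def annotate(genes, target):
    """Pick the gene order and AT substrates that best reproduce the target.

    genes:  dict mapping a gene name to the reduction state of each of its
            modules (assumed known from domain content).
    target: list of (substrate, reduction) tuples describing the product.
    Returns (similarity, gene_order, substrates).
    """
    target_fp = fingerprint(target)
    best = (-1.0, None, None)
    for order in permutations(sorted(genes)):
        reductions = [r for gene in order for r in genes[gene]]
        # Start from a backbone built entirely from malonyl-CoA.
        chain = [("mal", r) for r in reductions]
        for i in range(len(chain)):
            # Replace the substrate at position i with the candidate that
            # maximizes similarity to the target (ties keep the first candidate).
            chain[i] = max(
                ((c, chain[i][1]) for c in CANDIDATES),
                key=lambda m: tanimoto(
                    fingerprint(chain[:i] + [m] + chain[i + 1:]), target_fp
                ),
            )
        score = tanimoto(fingerprint(chain), target_fp)
        if score > best[0]:
            best = (score, order, [sub for sub, _ in chain])
    return best


# Made-up example: three genes whose catalytic order is unknown.
GENES = {"pksA": ["OH", "keto"], "pksB": ["CH2"], "pksC": ["OH", "OH"]}
TARGET = [("mmal", "CH2"), ("mal", "OH"), ("mmal", "keto"),
          ("emal", "OH"), ("mal", "OH")]

score, order, substrates = annotate(GENES, TARGET)
print(score, order, substrates)
```

Exhaustive permutation is only workable for a handful of genes, and the greedy per-position replacement is a simplification chosen to keep the sketch short. The abstract does not specify how the actual pipeline bounds or orders its search.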

Version published to 10.64898/2026.04.06.712324 on bioRxiv
Apr 8, 2026
