NPannotator: a genome- and chemistry-constrained automation for type I polyketide synthase pathway elucidation
Abstract
Natural products (NPs) are structurally diverse bioactive compounds whose biosynthesis is encoded within biosynthetic gene clusters (BGCs). Although databases such as the Minimum Information about a Biosynthetic Gene Cluster (MIBiG) repository now catalog thousands of experimentally validated NP structures, the full biosynthetic pathway connecting individual domain sequences to specific chemical features of the final NP structures remains largely unannotated. This gap is especially pronounced for type I polyketide synthases (PKSs), modular assembly lines in which multiple enzymatic domains work in concert to condense acyl-CoA building blocks into complex polyketide scaffolds. Within these systems, acyltransferase (AT) domains govern which starter and extender units are incorporated at each elongation step, yet the substrate specificities of AT domains are known for only a fraction of cataloged clusters. Moreover, the catalytic order of the genes encoding PKS modules is not immediately apparent from existing database entries, leaving the correct module ordering for observed product structures uncertain. Here, we present NPannotator, an automated, genomic-context-aware cheminformatics pipeline that infers both the catalytic ordering of PKS domains and the substrate specificities of a given PKS’s AT domains. NPannotator loads a precomputed database of synthetically generated polyketide backbones, iteratively replaces default malonyl-CoA substrates with candidate starter and extender units via SMARTS-based substructure matching against the target NP, and selects the arrangement that maximizes chemical similarity to that target. When benchmarked on the type I PKSs annotated within the expert-reviewed ClusterCAD dataset, NPannotator recovered both the correct gene ordering and the correct AT substrate annotations in 62.0% of cases, and achieved 80.0% accuracy on gene ordering alone.
By bridging gene-level architecture with chemical outcomes, NPannotator represents a step toward systematically decoding how protein sequence and genomic organization encode chemical structure in the world of natural products.
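
The search described in the abstract (try candidate gene orderings and AT substrate assignments, build the corresponding backbone, keep the arrangement most similar to the target NP) can be illustrated with a minimal Python sketch. This is not NPannotator's code, and every name in it is a hypothetical chosen for illustration: the three-gene cluster, the SMILES-like fragment tables, and the functions `build_backbone`, `similarity`, and `annotate`. Two simplifications are assumed. First, the sketch enumerates all arrangements exhaustively rather than iteratively replacing malonyl-CoA defaults. Second, it scores candidates with a plain string-similarity ratio from the standard library as a stand-in for SMARTS matching and chemical similarity, which a real pipeline would compute with a cheminformatics toolkit.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import permutations, product

# Toy SMILES-like fragments (illustrative assumptions, not NPannotator data).
STARTERS = {"acetyl-CoA": "C", "propionyl-CoA": "CC"}
EXTENDERS = {
    "malonyl-CoA": "C",
    "methylmalonyl-CoA": "C(C)",
    "ethylmalonyl-CoA": "C(CC)",
}
BETA_STATE = {"keto": "C(=O)", "hydroxy": "C(O)"}

# Hypothetical cluster: gene name -> beta-carbon state left by each module
# ("hydroxy" if the module has an active ketoreductase, else "keto").
GENES = {
    "geneA": ("hydroxy", "keto"),
    "geneB": ("keto",),
    "geneC": ("hydroxy", "hydroxy"),
}


@dataclass(frozen=True)
class Candidate:
    gene_order: tuple
    starter: str
    extenders: tuple
    backbone: str
    score: float


def build_backbone(gene_order, starter, extenders):
    """Concatenate the starter, one fragment per module, and a terminal acid."""
    states = [state for gene in gene_order for state in GENES[gene]]
    parts = [STARTERS[starter]]
    for state, unit in zip(states, extenders):
        # Beta-carbon state of the module, then the alpha carbon it adds.
        parts.append(BETA_STATE[state] + EXTENDERS[unit])
    parts.append("C(=O)O")
    return "".join(parts)


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for fingerprint similarity."""
    return SequenceMatcher(None, a, b).ratio()


def annotate(target):
    """Score every gene order and substrate assignment; return the best one."""
    n_modules = sum(len(modules) for modules in GENES.values())
    best = None
    for order in permutations(GENES):
        for starter in STARTERS:
            for units in product(EXTENDERS, repeat=n_modules):
                backbone = build_backbone(order, starter, units)
                score = similarity(backbone, target)
                if best is None or score > best.score:
                    best = Candidate(order, starter, units, backbone, score)
    return best


# Hypothetical target backbone written with the same toy fragment grammar.
TARGET = "CCC(O)C(C)C(O)CC(O)C(C)C(=O)C(C)C(=O)CC(=O)O"
best = annotate(TARGET)
print(best.gene_order, best.starter, best.extenders, round(best.score, 3))
```

In this toy cluster the best-scoring arrangement places geneC before geneA and geneB, so the catalytic order need not follow the order in which the genes are listed. Exhaustive enumeration is workable only because the example is tiny (6 gene orders, 2 starters, and 3^5 extender assignments); the number of arrangements grows factorially with gene count and exponentially with module count.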