Machine learning inference of natural product chemistry across biosynthetic gene cluster types
Abstract
With ever-increasing volumes of sequencing data for biosynthetic gene clusters (BGCs), computational methods for the prediction of resulting secondary metabolites are critically needed. Here, we present CHAMOIS, a machine learning tool inferring metabolite properties from protein domains in BGCs. Out of 539 relevant chemical properties from the ChemOnt ontology, CHAMOIS predicts 120 with an AUPRC > 0.5. Although entirely data-driven, CHAMOIS infers many protein-metabolite links that are consistent with the scientific literature and suggests interesting novel biosynthetic functions of uncharacterized proteins. Finally, to guide experimental BGC characterisation, CHAMOIS can pinpoint which BGC within a given genome produces a pre-specified metabolite.
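
To make the evaluation criterion concrete, the following minimal sketch shows how a count such as "120 of 539 properties with AUPRC > 0.5" could be obtained in a multi-label setting. It is not the CHAMOIS implementation: the function names, the toy property names, and the toy scores are hypothetical, and the step-wise average-precision estimator used here is an assumption, since the abstract does not state which AUPRC estimator was applied.

```python
from typing import Dict, List, Sequence


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve for one binary property.

    Assumption: AUPRC is estimated as average precision, i.e. the mean of the
    precision values at the ranks where a true positive is retrieved.
    Tied scores are kept in input order (a simplification).
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("AUPRC is undefined for a property with no positives")
    # Rank examples (here: BGCs) from highest to lowest predicted score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def well_predicted(
    labels: Dict[str, List[int]],
    scores: Dict[str, List[float]],
    threshold: float = 0.5,
) -> List[str]:
    """Return the properties whose AUPRC exceeds the threshold."""
    return [
        name
        for name in labels
        if average_precision(labels[name], scores[name]) > threshold
    ]


# Hypothetical toy data: two properties scored over four BGCs.
toy_labels = {"property_A": [1, 0, 1, 0], "property_B": [0, 1, 0, 0]}
toy_scores = {
    "property_A": [0.9, 0.8, 0.7, 0.1],
    "property_B": [0.9, 0.2, 0.8, 0.7],
}
selected = well_predicted(toy_labels, toy_scores)
```

In this toy case only `property_A` passes the threshold, because its two positive BGCs are ranked first and third, whereas the single positive for `property_B` is ranked last. In practice the scores would come from held-out predictions (for example, cross-validation) rather than from the training data.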