A linked Genomics Sequencing and Mass Spectrometry multi-modal dataset and models for Streamlined Natural Products Discovery in Microbial Strain Libraries


Abstract

Natural products (NPs) are instrumental in drug development, but their discovery and validation remain challenging and laborious despite advances in both genomic and analytical technologies. In this study, we demonstrate the use of an integrated multi-modal characterization of a microbial strain library for enhanced natural product discovery. This characterization uses language- and transformer-based models, integrated through a cross-validate and rank approach, to search a mass spectrometry (MS)-genome multi-modal dataset with high confidence. MS data are analysed using WISE, an in-house workflow for structural elucidation from tandem mass spectra (MS/MS), which combines molecular language and transformer-based models to predict the corresponding molecular structures. In parallel, the linked genomic data are pre-processed using the protein language model ESM2 to extract meaningful embeddings. As a proof of concept, these models and the pre-processed linked MS-genome datasets were applied and validated for the rapid identification of microbial strains capable of producing three diverse natural product compounds, with precision ranging from 75% to 100%. Our findings demonstrate the transformative potential of strain-level linked MS-genome datasets to accelerate natural product discovery. This approach can expand the range of biotechnological innovations beyond what is currently known and curated, while also greatly reducing the resources and effort needed for discovery.
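The abstract does not say how the cross-validate and rank step is computed, so the following is only a hypothetical reading, not the authors' method. It assumes that each strain already has an MS-derived match score for the target compound (for example, from comparing a predicted structure to the query) and a genome-derived embedding (for example, a pooled ESM2 vector). A strain is kept only when both modalities agree, and the survivors are ranked by a combined score. All names, cutoffs, the product scoring rule, and the toy data are assumptions made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_strains(ms_scores, genome_embeddings, query_embedding,
                 ms_cutoff=0.5, genome_cutoff=0.5):
    """Keep strains supported by BOTH modalities, then rank by combined score.

    ms_scores: strain -> MS-derived match score in [0, 1] (hypothetical input).
    genome_embeddings: strain -> embedding vector (hypothetical input).
    query_embedding: embedding of the biosynthetic target (hypothetical input).
    The cutoffs and the product used as combined score are assumptions.
    """
    ranked = []
    for strain, ms_score in ms_scores.items():
        embedding = genome_embeddings.get(strain)
        if embedding is None:
            continue  # no linked genome data for this strain
        genome_score = cosine(embedding, query_embedding)
        # Cross-validation: each modality must independently support the strain.
        if ms_score >= ms_cutoff and genome_score >= genome_cutoff:
            ranked.append((strain, ms_score * genome_score))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


def precision(predicted, confirmed):
    """Fraction of predicted strains that were experimentally confirmed."""
    if not predicted:
        return 0.0
    return sum(1 for strain in predicted if strain in confirmed) / len(predicted)


# Toy data, invented purely to exercise the functions above.
ms_scores = {"strain_A": 0.9, "strain_B": 0.8, "strain_C": 0.2, "strain_D": 0.7}
genome_embeddings = {
    "strain_A": [1.0, 0.0],
    "strain_B": [0.0, 1.0],
    "strain_C": [1.0, 0.0],
    "strain_D": [1.0, 1.0],
}
ranked = rank_strains(ms_scores, genome_embeddings, query_embedding=[1.0, 0.0])
print([strain for strain, _ in ranked])
```

In this toy run, strain_B is dropped because only the MS score supports it, and strain_C is dropped because only the genome score supports it. Requiring agreement between modalities before ranking is what the sketch means by cross-validation. The `precision` helper mirrors the kind of metric quoted in the abstract: confirmed producers divided by predicted producers.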
