BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters

Abstract

Background

Genome mining for biosynthetic gene clusters (BGCs) has become an integral part of natural product discovery. The >200,000 microbial genomes now publicly available hold information on abundant novel chemistry. One way to navigate this vast genomic diversity is through comparative analysis of homologous BGCs, which allows identification of cross-species patterns that can be matched to the presence of metabolites or biological activities. However, current tools are hindered by a bottleneck caused by the expensive network-based approach used to group these BGCs into gene cluster families (GCFs).

Results

Here, we introduce BiG-SLiCE, a tool designed to cluster massive numbers of BGCs. By representing them in Euclidean space, BiG-SLiCE can group BGCs into GCFs in a non-pairwise, near-linear fashion. We used BiG-SLiCE to analyze 1,225,071 BGCs collected from 209,206 publicly available microbial genomes and metagenome-assembled genomes within 10 days on a typical 36-core CPU server. We demonstrate the utility of such analyses by reconstructing a global map of secondary metabolic diversity across taxonomy to identify uncharted biosynthetic potential. BiG-SLiCE also provides a “query mode” that can efficiently place newly sequenced BGCs into previously computed GCFs, plus a powerful output visualization engine that facilitates user-friendly data exploration.
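To see why a non-pairwise grouping matters at this scale: an all-versus-all comparison of 1,225,071 BGCs would need n(n-1)/2, or roughly 7.5 × 10^11, distance computations. The following is a minimal sketch of the general idea only, not the BiG-SLiCE implementation. It assumes each BGC has already been converted to a fixed-length numeric feature vector (the toy vectors, the threshold value, and the function names `cluster_one_pass` and `query` are all hypothetical), and it swaps in a simple one-pass "leader" clustering in place of whatever algorithm the tool itself uses. Each vector is compared only against the current GCF centroids, so the cost grows with the number of BGCs times the number of centroids rather than with the square of the number of BGCs. The `query` function illustrates the idea behind a query mode: a new BGC is placed by a nearest-centroid lookup without re-clustering.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(vec, centroids):
    """Return (index, distance) of the closest centroid, or (None, inf) if there are none."""
    best_idx, best_dist = None, math.inf
    for i, c in enumerate(centroids):
        d = euclidean(vec, c)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist


def cluster_one_pass(vectors, threshold):
    """Hypothetical one-pass leader clustering (an illustrative stand-in).

    Each vector joins the nearest centroid if it lies within `threshold`;
    otherwise it seeds a new centroid. No vector-to-vector comparisons are made.
    """
    centroids, counts, labels = [], [], []
    for vec in vectors:
        idx, dist = nearest_centroid(vec, centroids)
        if idx is None or dist > threshold:
            # Start a new family with this vector as its centroid.
            centroids.append(list(vec))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the running mean of the matched centroid.
            counts[idx] += 1
            n = counts[idx]
            centroids[idx] = [c + (x - c) / n for c, x in zip(centroids[idx], vec)]
            labels.append(idx)
    return centroids, labels


def query(vec, centroids, threshold):
    """Place a new vector into an existing family, or return None if none is close enough."""
    idx, dist = nearest_centroid(vec, centroids)
    return idx if idx is not None and dist <= threshold else None


# Toy feature vectors (assumed for illustration; real BGC features are far larger).
bgc_features = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.9, 1.1],
]

centroids, labels = cluster_one_pass(bgc_features, threshold=0.5)
print(labels)
print(query([0.0, 1.0, 1.0], centroids, threshold=0.5))
```

A one-pass scheme like this is order-dependent and sensitive to the chosen threshold, which is one reason production tools typically rely on more robust hierarchical or tree-based variants of the same centroid-comparison idea.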

Conclusions

BiG-SLiCE opens up new possibilities to accelerate natural product discovery and offers a first step towards constructing a global and searchable interconnected network of BGCs. As more genomes are sequenced from understudied taxa, more information can be mined to highlight their potentially novel chemistry. BiG-SLiCE is available via https://github.com/medema-group/bigslice.

Article activity feed

  1. Now published in GigaScience doi: 10.1093/gigascience/giaa154

    Satria A. Kautsar, Justin J. J. van der Hooft, Dick de Ridder, Marnix H. Medema (Bioinformatics Group, Wageningen University, the Netherlands). For correspondence: marnix.medema@wur.nl

    A version of this preprint has been published in the Open Access journal GigaScience (see https://doi.org/10.1093/gigascience/giaa154), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

    These peer reviews were as follows:

    Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102605
    Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102606
    Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102607