MetaPathPredict: A machine learning-based tool for predicting metabolic modules in incomplete bacterial genomes

Curation statements for this article:
  • Curated by eLife

This article has been reviewed by the following groups

Abstract

The reconstruction of complete microbial metabolic pathways using ‘omics data from environmental samples remains challenging. Computational pipelines for pathway reconstruction that utilize machine learning methods to predict the presence or absence of KEGG modules in incomplete genomes are lacking. Here, we present MetaPathPredict, a software tool that incorporates machine learning models to predict the presence of complete KEGG modules within bacterial genomic datasets. Using gene annotation data and information from KEGG module databases, MetaPathPredict employs neural network and XGBoost stacked ensemble models to reconstruct and predict the presence of KEGG modules in a genome. MetaPathPredict can be used as a command line tool or as an R package, and both options are designed to be run locally or on a compute cluster. In our benchmarks, MetaPathPredict makes robust predictions of KEGG module presence within highly incomplete genomes.

Article activity feed

  1. eLife assessment

    This landmark study presents MetaPathPredict, a method that uses a stacked ensemble of neural networks to predict the presence or absence of KEGG modules based on annotated features in the genome. The evidence supporting the conclusions is compelling, with a tool that allows for prediction of KEGG modules in sparse gene sequence datasets.

  2. Reviewer #1 (Public Review):

    The authors present a new algorithm that applies machine learning to determine the presence or absence of KEGG metabolic modules in microbial genomes. Specifically, they aim to make these predictions in incomplete genomes, such as those obtained from the assembly and binning of metagenomic reads. This is a significant challenge in the bioinformatics and computational biology community, and this work is a substantial step forward. A key aspect, which the authors aptly demonstrate in their results, is the ability of machine learning to judge the likelihood of a KEGG module being present based on all gene annotations and not just the genes in the module. This yields significantly better results than approaches that rely solely on genes within the pathway.
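
To make the contrast concrete, the kind of baseline that relies solely on genes within the pathway can be sketched as a simple completeness rule. This is an illustrative sketch only, not MetaPathPredict's code or KEGG's actual module definitions: the module layout, the KO identifiers, and the function names `step_coverage` and `naive_module_present` are all hypothetical.

```python
# Hypothetical toy module: each step is a set of alternative KO identifiers,
# and a step counts as covered if the genome carries at least one of them.
TOY_MODULE = [{"K00001"}, {"K00002", "K00003"}, {"K00004"}]


def step_coverage(module_steps, genome_kos):
    """Return the fraction of module steps with at least one annotated KO."""
    covered = sum(1 for step in module_steps if step & genome_kos)
    return covered / len(module_steps)


def naive_module_present(module_steps, genome_kos):
    """Rule-based call: the module is present only if every step is covered."""
    return step_coverage(module_steps, genome_kos) == 1.0


# A complete genome carries one KO per step, plus an unrelated annotation.
complete_genome = {"K00001", "K00003", "K00004", "K09999"}
# An incomplete assembly of the same organism has lost the last step's gene.
incomplete_genome = complete_genome - {"K00004"}

print(naive_module_present(TOY_MODULE, complete_genome))
print(naive_module_present(TOY_MODULE, incomplete_genome))
print(step_coverage(TOY_MODULE, incomplete_genome))
```

Under such a rule, a single gene lost to incomplete assembly flips the call to absent, and the unrelated annotation `K09999` is never consulted. A classifier trained on all annotations can use that kind of out-of-module evidence.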

  3. Reviewer #2 (Public Review):

    The authors introduce MetaPathPredict, a method that infers the presence of functional units of gene sets, such as a set of genes coding enzymes for a common metabolic pathway, from a pool of genes or genetic sequences. MetaPathPredict employs a stacked ensemble of neural networks, each trained for a specific pathway, to consider mutual information between pathways.

    In predicting the presence of metabolic pathways in incomplete genomes, MetaPathPredict outperforms alternative naive classifiers and single neural network methods. These results demonstrate the effectiveness of a stacked ensemble of neural networks in exploiting mutual information between metabolic pathways.
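
The stacking idea described in these reviews can be illustrated in miniature. The sketch below is a toy under stated assumptions, not the authors' architecture: it swaps the neural network and XGBoost base models for two small logistic regressions, uses an invented eight-genome dataset with four hypothetical gene columns, and trains the meta-learner on in-sample base predictions (a real stacked ensemble would normally use out-of-fold predictions to limit leakage).

```python
import math


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(model, x):
    """Probability of the positive class under a (weights, bias) model."""
    weights, bias = model
    return sigmoid(bias + sum(w * xj for w, xj in zip(weights, x)))


def fit_logistic(rows, labels, epochs=2000, lr=0.5):
    """Fit logistic regression by batch gradient descent."""
    n = len(rows)
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        errors = [predict_proba((weights, bias), x) - y
                  for x, y in zip(rows, labels)]
        for j in range(len(weights)):
            weights[j] -= lr * sum(e * x[j] for e, x in zip(errors, rows)) / n
        bias -= lr * sum(errors) / n
    return weights, bias


# Invented data. Columns: two in-module genes, then two out-of-module
# "context" genes. Label 1 means the module is truly present.
GENOMES = [
    [1, 1, 1, 1], [1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 1],
    [0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0],
]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]
IN_MODULE, CONTEXT = slice(0, 2), slice(2, 4)

# Level 0: one base model per feature group.
base_in = fit_logistic([g[IN_MODULE] for g in GENOMES], LABELS)
base_ctx = fit_logistic([g[CONTEXT] for g in GENOMES], LABELS)


def base_outputs(genome):
    """Base-model probabilities, used as input features for the meta-learner."""
    return [predict_proba(base_in, genome[IN_MODULE]),
            predict_proba(base_ctx, genome[CONTEXT])]


# Level 1: the meta-learner is trained on the base models' outputs.
meta = fit_logistic([base_outputs(g) for g in GENOMES], LABELS)


def stacked_predict(genome):
    return predict_proba(meta, base_outputs(genome))


# A genome missing one in-module gene but carrying both context genes.
print(stacked_predict([0, 1, 1, 1]))
# A genome with none of the four genes.
print(stacked_predict([0, 0, 0, 0]))
```

The point of the design is that the meta-learner weighs evidence from both base models, so a gap in the in-module genes can be offset by strong contextual evidence.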