iPHoP: an integrated machine-learning framework to maximize host prediction for metagenome-assembled virus genomes

Abstract

The extraordinary diversity of viruses infecting bacteria and archaea is now primarily studied through metagenomics. While metagenomes enable high-throughput exploration of the viral sequence space, metagenome-derived genomes lack key information compared to isolated viruses, in particular host association. Different computational approaches are available to predict the host(s) of uncultivated viruses based on their genome sequences, but thus far individual approaches are limited either in precision or in recall, i.e., for a number of viruses they yield erroneous predictions or no prediction at all. Here we describe iPHoP, a two-step framework that integrates multiple methods to provide host predictions for a broad range of viruses while retaining a low (<10%) false-discovery rate. Based on a large database of metagenome-derived virus genomes, we illustrate how iPHoP can provide extensive host prediction and guide further characterization of uncultivated viruses. iPHoP is available at https://bitbucket.org/srouxjgi/iphop, through a Bioconda recipe, and as a Docker container.

Article activity feed

  1. through a Bioconda recipe

    I saw that noarch was specified on conda, but when I tried to install it via conda on an M1 Mac, I encountered issues both when running natively on arm64 and under Rosetta.

  2. Meanwhile, in the same benchmark, alignment-free methods appeared to contain a genuine and strong phage-host signal for a broader range of phages, but more complex to parse as the highest scoring host was often (>50% of the time) yielding an incorrect prediction at the species, genus, and family level

    I think this sentence has a missing word

  3. The main exception to this pattern was the unexpectedly high number of host predictions to the Bacteroides genus for marine

    This is interesting. I'm curious if this could also stem from contamination in the databases, as mentioned in the next sentence. Is there a way to systematically evaluate this? (e.g. potential for kit contamination in isolates vs assembly/binning contamination in MAGs)

  4. Random Forest Classifiers were built using the TensorFlow Decision Forests v0.2.1 package within the Keras 2.7.0 python library, with parameters optimized with the Optuna v2.5.0 framework. Parameters to be optimized included maximum tree depth (between 4 and 32), minimum number of examples in a node (between 2 and 10) and number of trees (between 100 and 1,000). A total of 100 trials were performed, each was evaluated on the test dataset, the 5 classifiers with the highest accuracy were selected as the best candidates, and the candidate with the highest recall at 5% FDR was then selected as the final iPHoP-RF classifier

    If I'm reading this correctly, this design wouldn't allow you to estimate overfitting. Have you brainstormed any ways to make a validation set for this model?
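    For reference, the selection procedure quoted above can be sketched as follows. This is a hypothetical illustration only: it uses scikit-learn's RandomForestClassifier in place of TensorFlow Decision Forests, plain random sampling in place of Optuna's optimizer, a toy dataset in place of the paper's virus-host features, and a reduced number of trials and trees for speed.

    ```python
    import random
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Toy stand-in for the virus/host feature table used in the paper.
    X, y = make_classification(n_samples=400, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    random.seed(0)
    trials = []
    for _ in range(8):  # the paper runs 100 Optuna trials
        params = {
            "max_depth": random.randint(4, 32),         # paper: 4 to 32
            "min_samples_leaf": random.randint(2, 10),  # paper: 2 to 10 examples per node
            "n_estimators": random.randint(100, 300),   # paper: 100 to 1,000 trees (capped here)
        }
        clf = RandomForestClassifier(random_state=0, **params).fit(X_train, y_train)
        trials.append((clf.score(X_test, y_test), params))

    # Keep the 5 most accurate candidates; the paper then re-ranks these by
    # recall at 5% FDR to pick the final classifier (omitted here).
    top5 = sorted(trials, key=lambda t: t[0], reverse=True)[:5]
    best_acc, best_params = top5[0]
    print(best_acc, best_params)
    ```

    Note that, as in the quoted design, the same test set both drives trial selection and reports final performance, which is the overfitting concern raised above; a separate held-out validation split would be needed to estimate it.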

  5. Based on a large database of metagenome-derived virus genomes

    Would it be possible to add context to the abstract about whether this is a new database, a combination of existing databases, or simply an existing database? I think that would be useful information to have up front

  6. As host references, we opted to use all genomes included in the GTDB database, supplemented by additional publicly available genomes from the IMG isolate database and the GEM catalog.

    How does this set of genomes compare to e.g. all bacteria and archaea in GenBank? Was there a reason for excluding GenBank?
