iPHoP: an integrated machine-learning framework to maximize host prediction for metagenome-assembled virus genomes

Abstract

The extraordinary diversity of viruses infecting bacteria and archaea is now primarily studied through metagenomics. While metagenomes enable high-throughput exploration of the viral sequence space, metagenome-derived genomes lack key information compared to isolated viruses, in particular host association. Different computational approaches are available to predict the host(s) of uncultivated viruses based on their genome sequences, but thus far individual approaches are limited either in precision or in recall, i.e., for a number of viruses they yield erroneous predictions or no prediction at all. Here we describe iPHoP, a two-step framework that integrates multiple methods to provide host predictions for a broad range of viruses while retaining a low (<10%) false-discovery rate. Based on a large database of metagenome-derived virus genomes, we illustrate how iPHoP can provide extensive host prediction and guide further characterization of uncultivated viruses. iPHoP is available at https://bitbucket.org/srouxjgi/iphop, through a Bioconda recipe, and as a Docker container.

Article activity feed

  1. through a Bioconda recipe

    I saw that noarch was specified on conda, but when I tried to install it via conda on an M1 Mac, I encountered issues both when running natively on arm64 and under Rosetta.

  2. Meanwhile, in the same benchmark, alignment-free methods appeared to contain a genuine and strong phage-host signal for a broader range of phages, but more complex to parse as the highest scoring host was often (>50% of the time) yielding an incorrect prediction at the species, genus, and family level

    I think this sentence has a missing word

  3. The main exception to this pattern was the unexpectedly high number of host predictions to the Bacteroides genus for marine

    This is interesting. I'm curious if this could also stem from contamination in the databases, as mentioned in the next sentence. Is there a way to systematically evaluate this? (e.g. potential for kit contamination in isolates vs assembly/binning contamination in MAGs)

  4. Random Forest Classifiers were built using the TensorFlow Decision Forests v0.2.1 package within the Keras 2.7.0 python library, with parameters optimized with the Optuna v2.5.0 framework. Parameters to be optimized included maximum tree depth (between 4 and 32), minimum number of examples in a node (between 2 and 10) and number of trees (between 100 and 1,000). A total of 100 trials were performed, each was evaluated on the test dataset, the 5 classifiers with the highest accuracy were selected as the best candidates, and the candidate with the highest recall at 5% FDR was then selected as the final iPHoP-RF classifier

    If I'm reading this correctly, this design wouldn't allow you to estimate overfitting. Have you brainstormed any ways to make a validation set for this model?
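    For reference, the selection procedure quoted above can be sketched as follows. This is a hypothetical illustration only: it uses scikit-learn's RandomForestClassifier in place of TensorFlow Decision Forests, plain random sampling in place of Optuna's optimizer, a toy dataset in place of the paper's virus-host features, and a reduced number of trials and trees for speed.

    ```python
    import random
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Toy stand-in for the virus/host feature table used in the paper.
    X, y = make_classification(n_samples=400, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    random.seed(0)
    trials = []
    for _ in range(8):  # the paper runs 100 Optuna trials
        params = {
            "max_depth": random.randint(4, 32),         # paper: 4 to 32
            "min_samples_leaf": random.randint(2, 10),  # paper: 2 to 10 examples per node
            "n_estimators": random.randint(100, 300),   # paper: 100 to 1,000 trees (capped here)
        }
        clf = RandomForestClassifier(random_state=0, **params).fit(X_train, y_train)
        trials.append((clf.score(X_test, y_test), params))

    # Keep the 5 most accurate candidates; the paper then re-ranks these by
    # recall at 5% FDR to pick the final classifier (omitted here).
    top5 = sorted(trials, key=lambda t: t[0], reverse=True)[:5]
    best_acc, best_params = top5[0]
    print(best_acc, best_params)
    ```

    Note that, as in the quoted design, the same test set both drives trial selection and reports final performance, which is the overfitting concern raised above; a separate held-out validation split would be needed to estimate it.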

  5. Based on a large database of metagenome-derived virus genomes

    Would it be possible to add context to the abstract about whether this is a new database, a combination of existing databases, or simply an existing database? I think that would be useful information to have up front

  6. As host references, we opted to use all genomes included in the GTDB database, supplemented by additional publicly available genomes from the IMG isolate database and the GEM catalog.

    How does this set of genomes compare to e.g. all bacteria and archaea in GenBank? Was there a reason for excluding GenBank?
