IPEV: Identification of Prokaryotic and Eukaryotic Virus-derived sequences in virome using deep learning

Abstract

Background

The virome obtained through virus-like particle enrichment contains a mixture of prokaryotic and eukaryotic virus-derived fragments. Accurate identification and classification of these elements are crucial for understanding their roles and functions in microbial communities. However, the rapid mutation rates of viral genomes pose challenges in developing high-performance tools for classification, potentially limiting downstream analyses.

Findings

We present IPEV, a novel method that combines trinucleotide pair relative distance and frequency with a 2D convolutional neural network for distinguishing prokaryotic and eukaryotic viruses in viromes. Cross-validation assessments of IPEV demonstrate its state-of-the-art precision, significantly improving the F1-score by approximately 22% on an independent test set compared to existing methods when query viruses share less than 30% sequence similarity with known viruses. Furthermore, IPEV outperforms other methods in terms of accuracy on most real virome samples when using sequence alignments as annotations. Notably, IPEV reduces runtime by 50 times compared to existing methods under the same computing configuration. We utilized IPEV to reanalyze longitudinal samples and found that the gut virome exhibits a higher degree of temporal stability than previously observed in persistent personal viromes, providing novel insights into the resilience of the gut virome in individuals.

Conclusions

IPEV is a high-performance, user-friendly tool that assists biologists in identifying and classifying prokaryotic and eukaryotic viruses within viromes. The tool is available at https://github.com/basehc/IPEV.

Article activity feed

  1. Competing Interest Statement: The authors have declared no competing interest. Footnotes: Repair the typos of the title.

    Reviewer 2. Mohammadali Khan Mirzaei

    Yin et al. have developed a new tool to differentiate eukaryotic and prokaryotic viruses. The tool offers a potential benefit to the community, but there are several issues with the contribution in its current form, as discussed below.

    Major issues: The authors should separate their training and testing databases. Ideally, their testing dataset should include a set of previously unseen viruses that have their host experimentally confirmed. In addition, the performance of IPEV should be compared with tools commonly used in the field, including vcontact2: https://doi.org/10.1038/s41587-019-0100-8 and iPHoP: https://doi.org/10.1371/journal.pbio.3002083. Although neither of these tools was developed to directly differentiate eukaryotic and prokaryotic viruses, identification of viral taxonomy or host range could lead to the identification of viral type. Moreover, the authors have used multiple approaches to assess the type of viruses, yet it is not clear how they combined the results generated by these approaches in their decisions.

    Minor issues: Please use either phageome or phages instead of phage virome. There are some typos in the text that need to be fixed.

  2. This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae018), and the reviews have been published under the same license. These are as follows.

    Reviewer 1: Guillermo Andres Rangel-Pineros

    Yin et al. described the development and testing of IPEV, a deep-learning-based model that detects and discriminates sequences derived from prokaryotic and eukaryotic viruses in virome datasets. The model was developed using a set of reference viral sequences with known host information. The sequences were represented as sequence pattern matrices that contained values derived from the frequency and order of trinucleotide pairs. These matrices were subsequently used to train a 2D convolutional neural network that generates a 2-value vector for each input sequence, indicating the probability that the sequence corresponds to a prokaryotic or eukaryotic virus. The model was trained and tested using 5-fold cross-validation on the reference set, and the authors assessed the robustness of the method using input datasets covering a range of homology and mutation rate values. Finally, the authors applied their model to a gut virome dataset from Shkoporov et al. 2019.
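To make the idea of a sequence pattern matrix concrete, the following is a minimal sketch and not the authors' implementation. It assumes one plausible encoding: for every ordered pair of trinucleotides (a, b), count how often b is the nearest downstream occurrence after an occurrence of a, and record the mean distance between them relative to the sequence length. The names `pattern_matrices`, `TRINUCS`, and `INDEX` are hypothetical, and the exact definition used by IPEV may differ.

```python
from itertools import product

BASES = "ACGT"
# All 64 trinucleotides in a fixed order (hypothetical layout, not IPEV's).
TRINUCS = ["".join(p) for p in product(BASES, repeat=3)]
INDEX = {t: i for i, t in enumerate(TRINUCS)}


def pattern_matrices(seq):
    """Return (freq, dist), two 64x64 matrices for one nucleotide sequence.

    freq[a][b]: share of all (a, nearest downstream b) pairs in the sequence.
    dist[a][b]: mean distance from a to its nearest downstream b, divided by
                the sequence length (an assumed notion of "relative distance").
    """
    seq = seq.upper()
    n = len(TRINUCS)
    count = [[0] * n for _ in range(n)]
    dist_sum = [[0] * n for _ in range(n)]
    next_pos = {}  # trinucleotide index -> nearest position to the right

    # Scan right to left so next_pos always holds the nearest downstream hit.
    for p in range(len(seq) - 3, -1, -1):
        a = INDEX.get(seq[p:p + 3])
        if a is None:  # skip windows containing N or other ambiguity codes
            continue
        for b, q in next_pos.items():
            count[a][b] += 1
            dist_sum[a][b] += q - p
        next_pos[a] = p

    total = sum(map(sum, count))
    freq = [[c / total if total else 0.0 for c in row] for row in count]
    dist = [
        [(d / c) / len(seq) if c else 0.0 for d, c in zip(drow, crow)]
        for drow, crow in zip(dist_sum, count)
    ]
    return freq, dist


if __name__ == "__main__":
    freq, dist = pattern_matrices("ACGACG")
    i, j = INDEX["ACG"], INDEX["CGA"]
    print(freq[i][j], dist[i][i])
```

Two such matrices could be stacked as channels of an image-like input for a 2D convolutional network; how IPEV actually combines frequency and distance is not specified here.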

    Indeed, IPEV represents a novel method that classifies viral sequences based on the type of host they target (prokaryotic or eukaryotic), and the results presented indicate that it efficiently covers a wide range of sequence lengths (from 100 bp). A model like IPEV provides a focus on eukaryotic viruses, which has so far been relatively shallow in comparison with phages, for which a wide range of prediction tools have been developed to date. Nevertheless, there are a few points that the authors need to address, particularly in relation to the robustness of the model:

    Major

    1. I am concerned about the number of reference sequences that were employed to train the model, and it makes me question its general applicability to viromes from any kind of biome. It would be great if the authors incorporated more sequences into their training and validation. Sources of viral sequences such as IMG/VR (https://img.jgi.doe.gov/cgi-bin/vr/main.cgi) and RVDB (https://rvdb.dbi.udel.edu/) could be useful for identifying further sequences and generating a set that covers a much wider range of viral diversity. Perhaps this could also lead to an improved performance for the gut datasets.

    2. Even though viral enrichment methods increase the concentration of viral DNA, the presence of contaminant DNA from other microbes in the enriched viral samples is common. Currently, the results do not indicate what the performance of the model would be in the presence of contaminating sequences. I suggest that the authors carry out tests that demonstrate the performance of IPEV when analysing a sample containing microbial contamination (ideally from both prokaryotes and eukaryotes) and demonstrate that IPEV is not prone to wrongly reporting these sequences as viruses.

    3. I find the results of the gut samples interesting and appropriate for the scope of IPEV. However, if IPEV is meant to be a general-purpose tool for virome analysis, it would be ideal if the authors provided results demonstrating the performance of the tool with samples from other biomes. For example, the authors could analyse datasets from the TARA Oceans project (e.g., 10.1016/j.cell.2019.03.040), some of which have already been assembled (https://www.ebi.ac.uk/ena/browser/view/PRJEB22493).

    4. There are several instances in the manuscript where the authors indicate the existence of significant differences between metrics measured to compare the performance of tools (e.g., line 326: “which was significantly higher than the mean AUC values of …”), but there is no mention of statistical analyses conducted to reach those conclusions (except for the Wilcoxon rank-sum test in line 305). Please provide information on statistical tests conducted to identify the significant differences.
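As an illustration of the kind of test requested in point 4, the sketch below implements a two-sided Wilcoxon rank-sum test with the normal approximation and no tie correction, using only the Python standard library. The function name `rank_sum_test` is hypothetical, and the AUC values in the usage example are made-up placeholders, not results from the manuscript.

```python
from math import sqrt
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction).

    Returns (z, p), where z is the standardised rank sum of sample x.
    """
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)

    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    n1, n2 = len(x), len(y)
    expected = n1 * (n1 + n2 + 1) / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - expected) / sd
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


if __name__ == "__main__":
    # Placeholder per-sample AUC values for two tools (illustrative only).
    tool_a = [0.91, 0.93, 0.95, 0.94, 0.92]
    tool_b = [0.80, 0.82, 0.85, 0.81, 0.83]
    z, p = rank_sum_test(tool_a, tool_b)
    print(f"z = {z:.3f}, p = {p:.4f}")
```

When the same samples are scored by both tools, a paired test such as the Wilcoxon signed-rank test may be more appropriate; the rank-sum test is shown here only because the manuscript already uses it in one place.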

    Minor

    1. There is a reference missing in line 37.
    2. In the sentence between lines 41-44, it is not clear what you are referring to with “identification of viral sequences”. Are you referring to viral vs non-viral, or to host identification?
    3. Line 50: you mean “identification” or “differentiation”?
    4. The two sentences between lines 49 – 52 seem redundant. I would suggest rewriting these into a single sentence.
    5. Line 65: the latest version of ICTV taxonomy has 11,273 species. Please update this number.
    6. Line 67: there is a newer version of VirSorter (VirSorter2), which has an expanded scope in comparison with the older version. Please modify the text to include the most up-to-date version of this tool.
    7. There are some more tools with a varied range of strategies for viral prediction that are widely known among the community, which I feel should be mentioned in the introduction (e.g., VIBRANT, DeepVirFinder, PPR-Meta, etc). Even though none of these were explicitly designed for prediction of eukaryotic viruses, it’d be worth commenting on them.
    8. Indicate the version of Virus-Host DB used, and the version or date when the viral data was retrieved from NCBI.
    9. Line 124: do you mean 10 samples or 10 adults? If it’s the latter, please correct the sentence.
    10. Line 130: by “genome sequences” are you referring to the assembled viral contigs? In that case, please clarify as it is currently ambiguous.
    11. Tables 1 and 2, perhaps consider presenting these results as plots? I feel that the tables are rather hard to process.
    12. Line 274: This is a rather old reference, are you sure the error rate for PacBio is still this high? I would suggest looking at more up-to-date references.
    13. Line 279: replace “base insert or delete” with “insertions or deletions”.
    14. Table 3: Indicate the length range of the analysed sequences in the header.
    15. The section regarding the performance on functional proteins seems to include information that should be split between methods and results. Please modify accordingly.
    16. Please italicise names of viral taxa wherever they are mentioned in the manuscript (e.g., Tubulavirales and Timlovirales in Line 300).
    17. Line 320: This sounds as if the authors had conducted the experiments to collect the gut virome data. Rewrite to make it clear that these data were retrieved from a previous study.
    18. Line 331: Based on which observation did you reach this conclusion?
    19. Line 368: Wasn’t HTP developed for addressing a similar question? Please clarify.
    20. Line 409-410: The way the sentence is written seems to indicate that plant viruses can also infect human cells and microorganisms. Please rewrite to make it clearer.
    21. Regarding the tool’s text output, I would suggest modifying it to make it easier to parse (for example, leaving it as a tabular .csv file), and currently the header does not seem to accurately describe the contents of the file.
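The tabular output suggested in point 21 could look like the sketch below, which writes one row per input sequence with a header that names every column. This is a hypothetical layout: the function `write_predictions` and the column names are assumptions for illustration and do not describe IPEV's actual output format.

```python
import csv
import io

# Hypothetical column names; IPEV's real output may use different fields.
FIELDS = ["sequence_id", "prokaryotic_score", "eukaryotic_score", "prediction"]


def write_predictions(rows, handle):
    """Write (sequence_id, p_prokaryotic, p_eukaryotic) tuples as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for seq_id, p_prok, p_euk in rows:
        writer.writerow({
            "sequence_id": seq_id,
            "prokaryotic_score": f"{p_prok:.4f}",
            "eukaryotic_score": f"{p_euk:.4f}",
            # Label each sequence by the larger of the two scores.
            "prediction": "prokaryotic" if p_prok >= p_euk else "eukaryotic",
        })


if __name__ == "__main__":
    buffer = io.StringIO()
    write_predictions([("contig_1", 0.9, 0.1), ("contig_2", 0.2, 0.8)], buffer)
    print(buffer.getvalue())
```

A file in this shape can be read directly by spreadsheet software or by standard CSV parsers without custom parsing code.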

    Re-review: Yin et al. described the development and testing of IPEV, a deep-learning-based model that detects and discriminates sequences derived from prokaryotic and eukaryotic viruses in virome datasets. The model was developed using a set of reference viral sequences with known host information. The sequences were represented as sequence pattern matrices that contained values derived from the frequency and order of trinucleotide pairs. These matrices were subsequently used to train a 2D convolutional neural network that generates a 2-value vector for each input sequence, indicating the probability that the sequence corresponds to a prokaryotic or eukaryotic virus. The model was trained and tested using 5-fold cross-validation on the reference set, and the authors assessed the robustness of the method using input datasets covering a range of homology and mutation rate values. Finally, the authors applied their model to a gut virome dataset from Shkoporov et al. 2019, and marine virome datasets from Gregory et al. 2019.

    Indeed, IPEV represents a novel method that classifies viral sequences based on the type of host they target (prokaryotic or eukaryotic), and the results presented indicate that it efficiently covers a wide range of sequence lengths (from 100 bp). A model like IPEV provides a focus on eukaryotic viruses, which has so far been relatively shallow in comparison with phages, for which a wide range of prediction tools have been developed to date.

    In my opinion, the authors satisfactorily addressed the comments and suggestions made in the first round of review. I only have a few final suggestions to finalise the manuscript and have it ready for publication:

    1. The authors include some text in the Discussion section (paragraph from line 423 to line 436, and paragraph from line 437 to 448) that, in my opinion, would fit better in the Results section. I suggest the authors include these in the Results section, and then in the Discussion comment how those results compare to other methods and what are their implications.
    2. I would suggest modifying the sentence in line 42 like this: "Nonetheless, it is essential to note that enriched sample approaches carry the risk of losing valuable host or environmental information [8], potentially leading to inaccurate virus host identification and constraining subsequent analyses."
    3. In the sentence starting in line 392, instead of "During" use "For".