What the Phage: A scalable workflow for the identification and analysis of phage sequences

Abstract

Phages are among the most abundant and diverse biological entities on earth. Phage prediction from sequence data is a crucial first step to understanding their impact on the environment. A variety of bacteriophage prediction tools have been developed over the years. They differ in algorithmic approach, results, and ease of use. We, therefore, developed “What the Phage” (WtP), an easy-to-use and parallel multitool approach for phage prediction combined with an annotation and classification downstream strategy, thus, supporting the user’s decision-making process by summarizing the results of the different prediction tools in charts and tables. WtP is reproducible and scales to thousands of datasets through a workflow manager (Nextflow). WtP is freely available under a GPL-3.0 license ( https://github.com/replikation/What_the_Phage ).

Article activity feed

  1. Abstract

    Reviewer1: Satoshi Hiraoka

    In this manuscript, the authors developed a new tool, What the Phage (WtP), for comparing the output of multiple bioinformatics tools that predict phage sequences from genomic or metagenomic datasets. The purpose of this study is more or less meaningful. As the authors describe in the Introduction section, it is currently difficult to predict reliable viral genomes precisely, especially from culture-independent metagenomic datasets, because of the lack of knowledge about viral genomes in current protein/genome databases. Many bioinformatics tools have already been proposed and some of them are widely used in microbiology; however, the outputs of these tools frequently vary and conflict with one another, and there is no good integrative platform to compare them. The proposed tool easily generates a well-summarized output derived from multiple tools and thus might facilitate the analysis of phage prediction in the field of microbiology. Indeed, the authors conducted one case study (though only one) using real phage genomes and reported reasonable performance. I feel the tool has some potential to contribute to the wide field of viral genomics. However, users of this tool should keep in mind that it only summarizes the output of multiple phage-prediction tools and does not evaluate the reliability of that output, as described in the Discussion section. I therefore feel the tool may sometimes lead to misunderstandings or confuse users rather than help them. It should be emphasized that a majority decision among the multiple tools does not always give the best result. Users may need further detailed analysis for the precise prediction of viral genomes from metagenomes. Also, because the development of bioinformatics tools is quite rapid, I feel that integrated platforms like WtP will become outdated very soon without a continuous effort to maintain and upgrade them and to incorporate future novel tools.
    I understand that the 'sustainability' of the tool is outside the journal's scope, but it would be better to describe the perspective on this point in the manuscript or on the GitHub page. I have some suggestions that would increase the clarity and impact of this manuscript if addressed.

    [Background] Some tools (e.g., VirSorter2) can be used to predict viruses other than common bacteriophages, e.g., NCLDV and virophages (see the original article on VirSorter2). Those kinds of viruses should be described briefly in this section alongside common dsDNA phages. Assembly-free long-read analysis is described here, but I think this is a bit far from the scope of this manuscript. Indeed, the dataset used in this study (ERR575692) is derived from Illumina HiSeq, and the performance on assembly-free long-read datasets was not analyzed in this study. I think these descriptions could be moved to the Discussion section rather than the Introduction. Instead, it would be better to add more engaging descriptions of studies of phage genomes identified from short-read metagenomes to emphasize the importance of phage prediction and the value of the proposed tool, WtP, e.g., the history of viral genomics using metagenomic datasets, recent technical improvements in metagenomics, the phylogenetic diversity of phages, the discovery of novel phage lineages from environmental metagenomes, etc. Only 5 of the 11 tools used in WtP are introduced here. The remaining 6 tools should also be cited here, with a brief explanation of their strategies for virus prediction. Also, MARVEL is cited here but not used in WtP.

    [Design and Implementation] Figure 1 is different from the one on the GitHub page ( https://mult1fractal.github.io/wtpdocumentation/figures/wtp-flowchart-simple.png ), which seems to be better than Figure 1. What does 'DAG' mean?

    [Prediction and Visualization] 'a metagenome assembly' could be rephrased as 'metagenomic assembled contigs'. MetaPhinder and Seeker are listed here with 'no release version'.
    I understand the situation, but I feel this description is not sufficient for reproducing the analysis. To specify the version of a tool even when it lacks an official release, mentioning the last commit date (for MetaPhinder, Aug 10, 2021) or the GitHub commit ID ( bebc447d00ec9ff9f4960f38b627d8651262ff72 ) is likely a good way.

    [Functional annotation & Taxonomy] In this manuscript, Prodigal was used for gene prediction. However, accurate gene prediction from phage genomes is still difficult (see https://academic.oup.com/bioinformatics/article/35/22/4537/5480131). This has affected both phage prediction and functional gene annotation in the field of virology. I think the difficulty of gene prediction from phage genomes and the potential room for improvement should be noted in the Discussion section.

    [Result report] The sentence ' ~ IMG/VR, iVirus, or VERVE-NET' here should be accompanied by appropriate citations or URLs. I found a paper on iVirus: https://www.nature.com/articles/s43705-021-00083-3

    [Other features] WTP -> WtP

    [Analysis] Figure 3. X-axis title of the bottom-left bar plot and Y-axis title of the top-right bar plot: viral -> phage. What do 'prediction values' mean? Are these scores generated by each prediction tool? Figure 4. X-axis texts: unify the format to either NodeID:assignment (e.g., NODE_5:unknown) or assignment:NodeID (T3:NODE_14). Regarding the sentence ' The sequences matched with 100% identity to Salmonella enterica (Salmonella enterica strain FDAARGOS_768 chromosome, complete genome), but not to prophage sequences. ': does it mean that the contigs NODE_5 and NODE_8 were mispredicted as prophages by CheckV? Table 1. completeness -> completeness (%)

    [Discussion & potential implications] Add a citation in the line ' At least one multitool approach was implemented on a smaller scale by Ann C. Gregory et al. (comprising only VirFinder and VirSorter). '

    [References]

    1. Lacks DOI.
    2. Lacks DOI.
    3. Lacks DOI.
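
Editorial illustration of the reviewer's point that a majority decision among tools is a summary, not a measure of reliability. The sketch below is a minimal example under stated assumptions: the tool names, contig IDs, and the 0.5 threshold are hypothetical, and the functions are not part of WtP or of any prediction tool.

```python
def tally_support(calls):
    """Map each contig ID to the sorted list of tools that flagged it.

    `calls` maps a tool name to the set of contig IDs that tool reported
    as phage. All names used here are hypothetical.
    """
    support = {}
    for tool, contigs in calls.items():
        for contig in contigs:
            support.setdefault(contig, []).append(tool)
    return {contig: sorted(tools) for contig, tools in support.items()}


def split_by_agreement(support, n_tools, min_fraction=0.5):
    """Separate contigs flagged by more than `min_fraction` of tools from the rest."""
    majority, minority = [], []
    for contig in sorted(support):
        fraction = len(support[contig]) / n_tools
        (majority if fraction > min_fraction else minority).append(contig)
    return majority, minority


# Hypothetical calls: tool names and contig IDs are made up for illustration.
example_calls = {
    "tool_a": {"contig_1", "contig_2"},
    "tool_b": {"contig_1"},
    "tool_c": {"contig_1", "contig_3"},
}
example_support = tally_support(example_calls)
majority, minority = split_by_agreement(example_support, n_tools=len(example_calls))
```

Contigs in the minority list are not necessarily false positives: a single tool built on a different principle may be the only one able to detect a novel phage, which is why such a tally should prompt further analysis rather than replace it.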

    Reviewer2: Huaiqiu Zhu

    In this manuscript, the authors developed an integrated workflow, WtP, for the identification, annotation and taxonomy of phage sequences. Based on Docker and Nextflow, WtP integrates 11 phage sequence identification tools (including 14 approaches), two functional annotation and taxonomy tools (Prodigal and HMMER), and a visualization tool (chromoMap). When using WtP, it is convenient that users do not need to install each tool and can avoid conflicts between installation packages and between operating systems. The WtP tool was also applied to an artificial microbiome. The threshold of each phage sequence prediction tool can be manually adjusted and is reported in the output. Annotation and taxonomy results of phage sequences can be further visualized by CheckV and by the chromoMap tool. However, there are some limitations in this manuscript. For the annotation and taxonomy stage, only the Prodigal tool was used for gene prediction, and no other gene prediction tools (especially phage-specific ones) were included. It is necessary for an integrated workflow to include other similar tools. WtP needs at least 4 GB of memory and 75 GB of storage, so the authors should develop a web version or at least a graphical interface version of WtP to encourage wider use.

    Major comment:

    1. Besides sequence identification, host prediction (e.g., HoPhage, PHP, and VirHostMatcher-Net) and lifestyle prediction (e.g., DeePhage, PhagePred) of phage sequences are also important in microbial communities. However, WtP does not include those functions.
    2. In addition to the web version or graphical interface version of WtP, the author can also consider a video demo or usage illustration. To clarify the purpose of this study, I think it would be better to add the phrase 'a web server of ...' or 'a GUI platform of ...' into the title.
    3. In the 'Analysis' section (Page 12), only four phage contigs could be annotated in the artificial data: P22 (NODE_12), T3 (NODE_14), T7 (NODE_13) and phiX174 (NODE_30). The 'predicted_organism_name' of the remaining 102 phage contigs is 'no match found'. Can WtP improve its databases or add more of them to annotate more contigs?
    4. In the 'Analysis' section (Page 14), the authors mention 'No specialized phage assembly strategy or any cleanup step was included during the assembly step'. I think this is unreasonable, and the downstream analysis will inevitably be affected by contaminating sequences.
    5. In Figure 2, it is possible to export results in the form of 'csv', 'pdf' or 'excel'. Can WtP export all the predicted phage sequences in the form of 'fasta'? The authors should describe how to change or add databases during the annotation and classification phases.

    Minor comment:

    6. In 'Functional annotation & Taxonomy' Section (Page 8), 'Figure 3' in the sentence 'All annotations are summarized in an interactive HTML file via chromoMap (see Figure 3)' should be 'Figure 4'.
    7. The 'Completeness' column in Table 1 is missing its unit, and the authors could add an outer border to Table 1.
    8. Figure 2 and Figure 3 need to be clearer.
    9. Page 5. 'approach to gain' should be 'approach to gaining'.
    10. Page 13. 'In addition to' should be 'In addition to'.
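
Editorial illustration of major comment 5 (exporting predicted phage sequences as FASTA). This is a minimal sketch assuming the user already has the assembly as a FASTA file and a set of contig IDs reported as phage; the function names and the example IDs are hypothetical, and this is not WtP's own export mechanism.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_selected(in_handle, out_handle, wanted_ids, width=60):
    """Copy records whose ID (first word of the header) is in `wanted_ids`.

    Returns the number of records written.
    """
    written = 0
    for header, seq in iter_fasta(in_handle):
        fields = header.split()
        contig_id = fields[0] if fields else ""
        if contig_id in wanted_ids:
            out_handle.write(f">{header}\n")
            for start in range(0, len(seq), width):
                out_handle.write(seq[start:start + width] + "\n")
            written += 1
    return written


# Hypothetical input: two made-up contigs, one of which is selected.
example_in = io.StringIO(">contig_1 len=8\nACGT\nACGT\n>contig_2\nTTTT\n")
example_out = io.StringIO()
n_written = write_selected(example_in, example_out, {"contig_1"})
```

With real files, the two `io.StringIO` objects would be replaced by handles from `open(...)`, and `wanted_ids` would be read from whichever result table lists the predicted contigs.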