DeePVP: Identification and classification of phage virion proteins using deep learning

Abstract

Background

Many biological properties of phages are determined by phage virion proteins (PVPs), and the poor annotation of PVPs is a bottleneck for many areas of viral research, such as viral phylogenetic analysis, viral host identification, and antibacterial drug design. Because of the high diversity of PVP sequences, the PVP annotation of a phage genome remains a particularly challenging bioinformatic task.

Findings

Based on deep learning, we developed DeePVP. The main module of DeePVP aims to discriminate PVPs from non-PVPs within a phage genome, while the extended module of DeePVP can further classify predicted PVPs into the 10 major classes of PVPs. Compared with the present state-of-the-art tools, the main module of DeePVP performs better, with a 9.05% higher F1-score in the PVP identification task. Moreover, the overall accuracy of the extended module of DeePVP in the PVP classification task is approximately 3.72% higher than that of PhANNs. Two application cases show that the predictions of DeePVP are more reliable and can better reveal the compact PVP-enriched region than the current state-of-the-art tools. Particularly, in the Escherichia phage phiEC1 genome, a novel PVP-enriched region that is conserved in many other Escherichia phage genomes was identified, indicating that DeePVP will be a useful tool for the analysis of phage genomic structures.
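The F1-score used for the comparison above is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation, not the authors' evaluation code; the function name and the counts in the example are hypothetical and chosen only for illustration.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score (harmonic mean of precision and recall).

    tp, fp, fn are counts of true positives, false positives, and
    false negatives for the positive class (here: proteins called PVP).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts, for illustration only (not results from the paper).
example_f1 = f1_score(tp=80, fp=20, fn=20)
print(f"F1 = {example_f1:.3f}")
```

Because F1 depends on both false positives and false negatives, a tool can raise it either by missing fewer PVPs or by producing fewer spurious PVP calls.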

Conclusions

DeePVP outperforms state-of-the-art tools. The program is provided both as a virtual machine with a graphical user interface and as a Docker container so that the tool can be easily run by users who are not computer professionals. DeePVP is freely available at https://github.com/fangzcbio/DeePVP/.

Article activity feed

  1. The

    This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac076), and the journal has published the reviews under the same license.

    **Reviewer 1. Satoshi Hiraoka**

    In this manuscript, the authors developed a new tool, DeePVP, for predicting phage virion proteins (PVPs) using a deep learning approach. The purpose of this study is meaningful. As the authors describe in the Introduction section, it is currently difficult to annotate the functions of viral genes precisely because of their huge sequence diversity and the existence of many unknown functions, and there is still much room to improve the in silico annotation of phage genes, including PVPs. Although I'm not an expert in machine learning, the newly proposed method based on deep learning seems appropriate. The proposed tool clearly outperformed the previously proposed tools and thus might be valuable for deeper analysis of many viral genomes. Indeed, the authors conducted two case studies using real phage genomes and reported novel findings that may offer insight into the genomics of the phages. Overall, the manuscript is well written, and I feel the tool has good potential to contribute to the wide field of viral genomics. Unfortunately, I have concerns, including the openness of the source code. I also have some suggestions that would increase the clarity and impact of this manuscript if addressed.

    Major:

    I did not find the DeePVP source code on the GitHub page. Is the tool not open source? I strongly recommend that the authors disclose all scripts of the tool for further validation and secondary use by other scientists or, at least, clearly state why the source code needs to be kept private. I was also confused by the GitHub page because the uploaded files are not well structured. The scripts and data used for performance evaluation are included in the 'data.zip' file, which should be given a more appropriate name. The 'Source code' button on the Releases page strangely links to the 'Supporting_data.zip' file, which contains only the installation manual and no source code. The authors should organize the GitHub page appropriately: for example, upload all source code to the 'main' branch rather than including it in a zip file, and make sure the 'source code' file in Releases contains actual source code files rather than a manual PDF.

    According to the Materials and Methods section, 1) using a deep learning approach and 2) using the large dataset retrieved from PhANNs as the training dataset are two of the important improvements over other studies in the PVP identification task. Someone may suspect that the better performance of DeePVP is mostly due to the larger training dataset rather than the classification method used. Is it possible that the previously proposed tools (especially those other than PhANNs), if retrained on the large PhANNs dataset, could reach better performance than DeePVP?

    The name 'Reliability index' (L249) is inaccurate. The score does not support the 'reliability' of a prediction (i.e., whether the predicted genes are truly PVPs or not); it only reflects the fact that the gene is predicted as a PVP by many tools, without considering whether that prediction is correct or incorrect. The sentence 'A higher n indicates that this protein is predicted as PVP by more tools at the same time, and therefore, the prediction may be more reliable.' in L252 is not logical.

    I do not fully agree with the discussion in L294-302 that the tool will facilitate viral host prediction. It is very natural that phages that are phylogenetically close and possess similar genomic structures, including PVP-enriched regions, will infect the same microbial lineage as a host. However, this has not been evaluated systematically across a wide range of phage lineages. In general, almost all phage-host relationships in nature are unknown, except for a small number of specific viruses such as E. coli phages. Further detailed studies are needed on whether, and to what degree, conservation of the PVP-enriched region could be a good feature for predicting phage-host relationships. I think phage-host prediction is beyond the scope of this tool, and thus the analysis could be deleted from this manuscript or just briefly mentioned in the Discussion section as a future perspective.
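To make the reviewer's point about the 'Reliability index' concrete, here is a minimal sketch of an index n defined as the number of tools that call a protein a PVP. It is an illustration under stated assumptions, not the manuscript's implementation: the tool names, protein identifiers, and function name are all hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def agreement_counts(predictions: Dict[str, Set[str]]) -> Counter:
    """Count, for each protein, how many tools predicted it as a PVP.

    predictions maps a tool name to the set of protein IDs that tool
    called PVP. No ground-truth labels are involved at any point.
    """
    counts: Counter = Counter()
    for predicted_pvps in predictions.values():
        counts.update(predicted_pvps)
    return counts


# Hypothetical tools and protein IDs, for illustration only.
predictions = {
    "tool_a": {"gp01", "gp02"},
    "tool_b": {"gp01", "gp03"},
    "tool_c": {"gp01", "gp02"},
}
n = agreement_counts(predictions)
for protein_id in sorted(n):
    print(protein_id, n[protein_id])
```

The function never sees the true labels, so n measures agreement between tools rather than correctness: several tools that share the same bias would still yield a high n for a wrong call.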

    Minor:

    The URL of the GitHub page would be better stated at the end of the Abstract or within the main text, in addition to the 'Availability of supporting source code and requirements' section. This will make it easier for readers to access the homepage and use the tool.

    Figs 2 and 3: I think it would be better to change the x-axis labels to 0 kb, 20 kb, 40 kb, ..., and 180 kb. This will make it easier to understand that the horizontal bar represents the viral genome.

    Re-review:

    I read the revised manuscript and acknowledge that the authors made efforts to take the reviewers' comments into account. My previous points have been addressed, and I feel the manuscript has improved. I think the term 'incomplete proteins' in L391-396 should be rephrased as something like 'partial genes', because here we should consider protein-encoding genes (or protein sequences), not the proteins themselves, and the word 'incomplete' is a bit ambiguous.

  2. ABSTRACT

    **Reviewer 2. Deyvid Amgarten**

    The manuscript presents DeePVP, a new tool for PVP annotation of a phage genome. The tool implements two separate modules: the main module aims to discriminate PVPs from non-PVPs within a phage genome, while the extended module can further classify predicted PVPs into the ten major classes of PVPs. Compared with the present state-of-the-art tools, the main module of DeePVP performs better, with a 9.05% higher F1-score in the PVP identification task. Moreover, the overall accuracy of the extended module of DeePVP in the PVP classification task is approximately 3.72% higher than that of PhANNs, a known tool in the area. Overall, the manuscript is well written and clear, and I could not identify any serious methodological inconsistency. I was not sure whether to consider the performance metrics shown as significant improvements, since PhANNs already does a similar job in that regard and is even better for some types of PVPs. But I would rather leave this judgment to readers and other researchers in the area. Specifically, I enjoyed the discussion about how one-hot encoded features may be more suitable for predictions than k-mer-based ones and, consequently, that convolutional networks may have an advantage over simple multilayer perceptron networks. This manuscript makes an important contribution to the phage genomics and machine learning fields. I am certain that DeePVP will be helpful to many researchers.

    I have a major question about the composition of the dataset used to train the main module: among the PVP proteins, do the authors know whether only the ten types of PVP are present? There is a brief mention in the discussion (line 340) of the keywords used to assemble the PhANNs dataset, but that is not clear to me. This will help me understand the following. Line 124: The CNN in the extended module has an output softmax layer, which outputs likelihood scores for 10 types of virion proteins. I wonder if only proteins from these 10 types were included in the datasets used to train the CNNs. I mean, is it possible that a different type of virion protein is predicted by the main module as a PVP? And if so, how would the extended module classify this protein, since it is a PVP but none of the ten types?

    Minor:

    Line 121: By default, a protein with a PVP score higher than 0.5 is regarded as a PVP. How was this cutoff chosen? Was this part of the k-fold cross-validation process?

    Line 157 and other places in the manuscript: I would suggest that the authors avoid sentences like "F1-score is 9.05% much higher than that of PhANNs", because a difference of 9% may not be large enough to justify the adverb "much". The same applies to "much better" and its variations.

    About the comparisons between DeePVP and PhANNs: did the authors make sure that instances of the test set were not used to train the PhANNs model being used?

    Line 221: What do the authors mean by "more authentic prediction"?

    Looking at the GitHub repository, I found it rather unusual that the authors chose to upload only a PDF with instructions on how to install and use the tool. It is very detailed, which I appreciate. The virtual machine and Docker containers are also nice resources to help less experienced users. However, I noticed that the GitHub repository has no clear mention of the source code of the tool. I found it through a mention in the Availability of supporting data section, where the authors created a release with the datasets and the scripts. Again, this is very unusual, but I suppose the authors chose this approach because of GitHub's limitations on large files.

    Table 2: I would like to ask the authors what might be the reason for such low performance metrics for some types of PVP (for example, minor capsid).

    Figure 5 states: "Host genus composition of the subject sequences". But there is a "Myoviridae" category, which is a family of phages and not anything related to bacterial hosts. Please verify why this is in the figure.
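The question about the 0.5 cutoff and the ten-way softmax can be illustrated with a short sketch. It assumes a two-stage flow as described in the review (a PVP score thresholded at 0.5, then a softmax over ten classes); it is not DeePVP's code, and the function names, the placeholder class names, and the all-zero logits are hypothetical.

```python
import math
from typing import List, Optional, Tuple

PVP_THRESHOLD = 0.5  # default cutoff quoted in the review (Line 121)


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; the outputs always sum to 1."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pvp_score: float,
             class_logits: List[float],
             class_names: List[str]) -> Optional[Tuple[str, float]]:
    """Return (class, probability) for predicted PVPs, else None."""
    if pvp_score <= PVP_THRESHOLD:
        return None  # not regarded as a PVP, so no class is assigned
    probs = softmax(class_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Placeholder class names; the real ten classes are not listed here.
names = [f"class_{i}" for i in range(10)]

# A PVP that fits none of the ten classes might yield flat logits,
# yet the softmax + argmax step still has to pick one of the ten.
result = classify(0.9, [0.0] * 10, names)
print(result)
```

Because a softmax has no "none of the above" output, a virion protein outside the ten trained classes would still be assigned to one of them; inspecting the winning probability (here close to 0.1) is one way a user could spot such low-confidence assignments.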

    Re-review:

    Thank you for the authors' responses. Most of my concerns were addressed. I have to say, though, that the GitHub page is not yet up to the usual standards for a bioinformatics tool. I appreciate the source code upload, but I noticed that not a single comment line was present in the code I checked. The README file is also not very clear. I do not consider this an impediment to publication (since there is detailed information in GigaScience DB), but it may hinder usage of the authors' tool, as most users will only look at the GitHub repository. I suggest some improvements, should the authors judge that my comment makes sense. Below I list three examples just to give the authors an idea:

    https://github.com/fenderglass/Flye
    https://github.com/LaboratorioBioinformatica/MARVEL
    https://github.com/vrmarcelino/CCMetagen

    One last concern is about the authors' response to the Myoviridae mistake in Figure 5. The authors stated that the genus of a phage's host is in its name (for example, Escherichia phage XX). But this is a dangerous assumption, since many phage names do not follow this rule. For example, there are many phages named Enterobacteria phage XXX (for instance, NC_054905.1), meaning that they infect some Enterobacteria. Again, Enterobacteria is not a genus. Phage nomenclature can be a mess sometimes, so be careful.
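The risk in reading the host genus from the phage name can be shown with a small sketch. It assumes a naive rule (take the word before "phage") and a hypothetical one-entry list of accepted genera; neither is taken from the authors' scripts, and a real check would need a curated taxonomy.

```python
from typing import Optional

# Hypothetical, deliberately tiny list of accepted bacterial genera.
KNOWN_GENERA = {"Escherichia"}


def naive_host_genus(phage_name: str) -> Optional[str]:
    """Return the word preceding 'phage', assuming it is the host genus."""
    words = phage_name.split()
    if len(words) >= 2 and words[1].lower() == "phage":
        return words[0]
    return None


def checked_host_genus(phage_name: str) -> Optional[str]:
    """Keep the naive guess only if it is a recognized genus."""
    candidate = naive_host_genus(phage_name)
    return candidate if candidate in KNOWN_GENERA else None


for name in ["Escherichia phage phiEC1", "Enterobacteria phage XXX"]:
    print(name, "->", naive_host_genus(name), "/", checked_host_genus(name))
```

The naive rule returns "Enterobacteria" for the second name even though it is not a genus, whereas the checked version returns nothing, which is the safer outcome when building a host-genus summary.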