Comparative evaluation of computational methods for reconstruction of human viral genomes
This article has been reviewed by the following groups
Listed in
- Evaluated articles (GigaScience)
Abstract
The increasing availability of viral sequences has led to the emergence of many optimized viral genome reconstruction tools. Given that the number of new tools is steadily increasing, it is complex to identify functional and optimized tools that offer an equilibrium between accuracy and computational resources, as well as the features that each tool provides. In this paper, we surveyed open-source computational tools (including pipelines) used for human viral genome reconstruction, identifying specific characteristics, features, similarities, and dissimilarities between these tools. For quantitative comparison, we created an open-source reconstruction benchmark based on viral data. The benchmark was executed using both synthetic and real datasets. With the former, we evaluated the effects on the reconstruction process of using different human viruses with simulated mutation rates, contamination and mitochondrial DNA inclusion, and various coverage depths. Each reconstruction program was also evaluated using real datasets, demonstrating its performance in real-life scenarios. The evaluation measures include identity, a Normalized Compression Semi-Distance, and the Normalized Relative Compression between the genomes before and after reconstruction, as well as metrics regarding the length of the reconstructed genomes and the computational time and resources spent by each tool. The benchmark is fully reproducible and freely available at https://github.com/viromelab/HVRS.
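For orientation, compression-based measures of this kind generally follow the pattern sketched below (a minimal Python example using gzip as a stand-in compressor; the paper's measures rely on specialized DNA compressors, and the exact definitions of the Normalized Compression Semi-Distance and Normalized Relative Compression are those given in the article, with the NRC following a similar relative-compression idea):

```python
import gzip

def c(data: bytes) -> int:
    """Approximate the information content of data by its gzip-compressed size."""
    return len(gzip.compress(data, compresslevel=9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: ~0 for near-identical sequences,
    ~1 for unrelated ones. A generic stand-in for the paper's
    compression-based semi-distance, whose exact definition is in the article."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy usage: compare an "original" genome against a slightly perturbed copy.
original = b"ACGT" * 1000
reconstructed = b"ACGT" * 990 + b"ACGA" * 10
print(f"NCD = {ncd(original, reconstructed):.3f}")
```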
Article activity feed
This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf159), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
Reviewer 3: Serghei Mangul
The paper is well written and provides a valuable contribution to the field. My only concern pertains to the real data utilized, which lacks a gold standard. Consequently, I question whether the real data adds significant value to the analysis, given the absence of a gold standard.
Major comments:
1. What are the types of data used in the manuscript? Is it solely metagenomics data? If so, it would be beneficial to clarify this in the abstract and potentially in the title.
2. Was the real data comprised of metagenomics? It would be advantageous to include some text explaining the nature of the data.
3. In the section titled "Performance in real datasets," it is unclear why the results of FALCON-meta are regarded as the gold standard.
Minor comments:
1. The phrase "availability of viral sequences" seemingly suggests that the author intends to reference viral sequencing data or metagenomics data. Currently, it reads as though it refers to viral reference genomes.
Reviewer 2: Anton Korobeynikov
Sousa et al. in their article provide an attempt to review available computational methods to assemble human viruses from real and simulated data. While the review itself seems to be valuable, we believe that the exposition contains several methodological issues that render it only somewhat useful. We will try to summarize these high-level issues instead of going through small details here and there.
To our surprise, the authors (who are also authors of two tools under evaluation) somehow do not distinguish between different kinds of input data, which is quite important as it effectively determines the choice of the tool. Clearly, there is no silver bullet here, and there is no single push-button solution that could universally handle all kinds of input data. It is very strange that, e.g., the authors do not distinguish between DNA and RNA viruses. The sequencing approaches for these kinds of data are very different, they have entirely different internal structure and organization, and the challenges associated with the assembly process are different. This is summarized well in, e.g., (Grabherr et al., 2011), (Bushmanova et al., 2019), and (Meleshko et al., 2022), among others. To add a second dimension here: we can have a more or less "pure" viral culture, or a metagenome / metavirome, or some highly divergent metavirome (e.g., in the case of HIV or other viruses that undergo reverse transcription). Host contamination is more pronounced for DNA viruses, etc. So, to summarize: all (very complex!) variations of input data were somehow folded into a single "human viruses" title, which is really misleading. It is the properties of the input data that should guide the choice of the appropriate tool.
Next, the choice of tools is also somewhat questionable. Some well-known tools like PRICE or VICUNA were omitted. OK, IVA is here, and this might be enough for "classical" viral assemblies. But then the general-purpose metagenome assembler metaSPAdes is considered without other choices. What about MEGAHIT? For RNA viral data, what about Trinity or rnaSPAdes? It was strange to see coronaSPAdes mentioned, while it is essentially rnaviralSPAdes plus a set of SARS-CoV-2 HMMs. Why not just rnaviralSPAdes if we already know we are not going to reconstruct coronaviral data? Another thing is that the majority of tools are tuned for particular tasks: there are tools for quasispecies assembly, so they would aim to preserve all the variation present. Metagenomic assemblers aim to provide a backbone consensus of a metagenome. Assemblers for RNA data usually aim to reconstruct as many transcripts as possible (so their "duplication rate" might be misleading). metaviralSPAdes aims to reconstruct full-length circular and linear viruses from complex contaminated metagenomes, so it could be very conservative, etc. It feels like the benchmarking compares something warm with something soft, giving misleading guidance to the reader.
Finally, it is the year 2025, but the pipeline is just a huge pile of shell scripts that install tools (sometimes outdated, as far as I can see; e.g., it uses SPAdes 3.13.0, which was released more than 5 years ago), often globally, sometimes only via conda. It could hardly be called a "reproducible" pipeline: error handling is practically non-existent, and if something fails in between, the user might end up in some partially resolved state. There are lots of frameworks and approaches developed recently that provide all the necessary facilities like job isolation, installation, restart & checkpointing, data acquisition, etc. To put things simply: why is everything done manually via hand-written shell scripts and not based on, say, Nextflow? There are lots of ready modules from nf-core that one could just reuse. Likely some ideas could be taken from https://github.com/nf-core/viralrecon/ and other pipelines available there.
Reviewer 1: Levente Laczkó
I reviewed the manuscript titled "An evaluation of computational methods for reconstruction of human viral genomes" by Sousa et al. The authors reviewed different tools for the reconstruction of viral genomes and developed a benchmarking framework to measure the performance of the different tools. The benchmarking was performed with both synthetic and real sequencing data, and the authors provide recommendations for different scenarios. The benchmarking framework, developed with Bash, is also made available on GitHub, providing the scientific community with a good example to increase reproducibility. The analysis steps are also clearly described in the manuscript. Independent benchmarks, such as the one presented in the manuscript, are valuable contributions to the scientific literature and help to select the right tool for different tasks. The manuscript is clearly structured and well written, and the results are appropriately presented with rich supplementary material. I definitely recommend the publication of the manuscript in GigaScience. However, I have some questions that I think should be addressed before publishing the final version to further improve the manuscript.
The authors describe that multiple strains may be present within a single infection. Indeed, the variability of strains within a single infection may be particularly important for some viruses. QuRe, ViSpA, SAVAGE and ViQUF are explicitly designed to find quasispecies. Are there any other tools in the benchmark that can predict whether samples are heterogeneous (or whose results can be used to infer this)?
The authors have used the human mitochondrion as a source of contamination to test whether the tools are sensitive to it. Is there a reason why only the mitochondrion was used for this test and other, perhaps random, human DNA fragments were not?
The error rate can strongly influence the accuracy of reference-based genome reconstructions. Has the effect of the error rate been tested, or could it affect the results? E.g., are there any tools in the benchmark that are less sensitive to higher error rates?
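To make the question concrete, a fixed per-base substitution error rate can be injected into simulated reads along the following lines (a minimal sketch; the rate, seed, and function name are illustrative, not the benchmark's actual read simulator):

```python
import random

def add_substitution_errors(read: str, error_rate: float, rng: random.Random) -> str:
    """Substitute each base independently with probability error_rate,
    drawing uniformly from the three alternative bases."""
    out = []
    for b in read:
        if b in "ACGT" and rng.random() < error_rate:
            out.append(rng.choice([x for x in "ACGT" if x != b]))
        else:
            out.append(b)
    return "".join(out)

# Toy usage: ~1% substitution errors on a 40 bp read.
rng = random.Random(42)
print(add_substitution_errors("ACGT" * 10, 0.01, rng))
```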
In the synthetic dataset, the coverage ranged from 2× to 40×. This range represents scenarios where the viral copy number is low, but especially if the viral DNA was enriched before sequencing, the coverage could be much higher. Is there a reason to specifically choose 40× coverage as the highest coverage value? I agree that low coverage is a difficult challenge, but checking the performance of different tools at high read depth can help readers choose the right tool for these use cases if there is a difference in the performance of the tools at, e.g., >100× coverage.
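For scale, the expected number of reads for a target depth follows the Lander-Waterman relation c = N·ℓ/L, so high-depth experiments are cheap to specify (a minimal sketch; the genome and read lengths are illustrative):

```python
def reads_for_coverage(depth: float, genome_len: int, read_len: int) -> int:
    """Lander-Waterman: expected coverage c = N * read_len / genome_len,
    so on average N = c * genome_len / read_len reads are required."""
    return round(depth * genome_len / read_len)

# Illustrative values: a ~30 kb viral genome sequenced with 150 bp reads.
for depth in (2, 40, 100, 500):
    print(f"{depth:>3}x coverage -> {reads_for_coverage(depth, 30_000, 150):>7,} reads")
```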
The authors correctly describe that the complexity of genomes can be a challenge for accurate genome reconstruction. Assessing the complexity (e.g. repetitive content ratio, GC ratio) of the genomes used in the synthetic dataset can add additional value to the results by showing how different tools perform on genomes of different complexity.
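As a concrete starting point for such an assessment, a simple proxy like the GC ratio can be computed directly from the FASTA records (a minimal sketch with a hypothetical input file; repetitive-content measures would require additional tooling):

```python
def read_fasta(path: str) -> dict[str, str]:
    """Minimal FASTA parser: maps record name to its concatenated sequence."""
    records, name = {}, None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}

def gc_ratio(seq: str) -> float:
    """GC fraction among unambiguous bases, a crude complexity proxy."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

# Hypothetical usage on a reference FASTA file.
for name, seq in read_fasta("references.fa").items():
    print(f"{name}\tlength={len(seq)}\tGC={gc_ratio(seq):.3f}")
```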
Some reference-based tools (QVG, TRACESPipe, TRACESPipeLite and V-pipe) produced results with many gaps. Could their different approach to dealing with low-coverage regions be the reason? QVG, for example, masks positions with low sequencing depth to increase the specificity of the search for polymorphisms. Can the gaps be explained by the variation in sequencing depth, i.e., could the gaps be linked to genomic regions with very low or very high sequencing depth?
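To illustrate the masking behaviour in question, a consensus can be masked from per-position depths in the style of `samtools depth` output (a minimal sketch; the threshold and inputs are illustrative and not QVG's actual implementation):

```python
def mask_low_depth(consensus: str, depths: list[int], min_depth: int = 10) -> str:
    """Replace bases whose per-position depth is below min_depth with 'N';
    positions without a depth record are treated as depth 0."""
    return "".join(
        base if i < len(depths) and depths[i] >= min_depth else "N"
        for i, base in enumerate(consensus)
    )

# Toy example: a coverage dip in the middle produces a run of Ns.
consensus = "ACGTACGTAC"
depths = [30, 25, 12, 3, 0, 2, 15, 40, 9, 50]
print(mask_low_depth(consensus, depths))  # prints ACGNNNGTNC
```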
I agree that benchmarking real datasets without the correct original sequence is a difficult task. I believe that showing the coverage and completeness (e.g., the ratio of the reconstructed length to the length of the reference genome) can be additional and useful information for the reader to choose the right tool for different tasks. The expected length of the viral genomes could be determined by the length of the reference genomes used, based on the classification of FALCON-meta, and in the case of de novo pipelines, the scaffolds that do not match the references could be classified using, e.g., kraken2. This could show how complete the reconstructed genomes are and whether there are other viral genomes in the samples that FALCON-meta missed but still represent valuable information. Supplementary Figures S143-S146 show the number of reconstructed bases with and without gaps, but I think that this experiment should be emphasised more in the main text and that the ratio of reconstructed bases to the expected genome sizes might be more informative than just the total number of reconstructed base pairs.
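The suggested completeness ratio is straightforward to compute once the expected reference length is known, both with and without gap characters (a minimal sketch; names and toy values are illustrative):

```python
def completeness(reconstructed: str, reference_len: int) -> tuple[float, float]:
    """Return (ratio counting all reconstructed bases, ratio of called
    bases only), both relative to the expected reference length; 'N'
    and '-' are treated as uncalled."""
    called = sum(1 for b in reconstructed.upper() if b not in "N-")
    return len(reconstructed) / reference_len, called / reference_len

# Toy example: a 9.5 kb reconstruction of a 10 kb reference, 500 bases masked.
seq = "A" * 9_000 + "N" * 500
with_gaps, called_only = completeness(seq, 10_000)
print(f"with gaps: {with_gaps:.1%}, called only: {called_only:.1%}")
```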
Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? Yes
Are the conclusions adequately supported by the data shown? Yes
Please indicate the quality of language in the manuscript. Does it require heavy editing for language and clarity? The language is well understandable
Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used? Yes
