Hecatomb: An End-to-End Research Platform for Viral Metagenomics

Michael J. Roach
Sarah J. Beecroft
Kathie A. Mihindukulasuriya
Leran Wang
Anne Paredes
Kara Henry-Cocks
Lais Farias Oliveira Lima
Elizabeth A. Dinsdale
Robert A. Edwards
Scott A. Handley

This article has been Reviewed by the following groups


Listed in

Evaluated articles (GigaScience)

Abstract

Background

Analysis of viral diversity using modern sequencing technologies offers extraordinary opportunities for discovery. However, these analyses present a number of bioinformatic challenges due to viral genetic diversity and virome complexity. Due to the lack of conserved marker sequences, metagenomic detection of viral sequences requires a non-targeted, random (shotgun) approach. Annotation and enumeration of viral sequences rely on rigorous quality control and effective search strategies against appropriate reference databases. Virome analysis also benefits from the analysis of both individual metagenomic sequences and assembled contigs. Combined, virome analysis results in large amounts of data requiring sophisticated visualization and statistical tools.

Results

Here we introduce Hecatomb, a bioinformatics platform enabling both read- and contig-based analysis. Hecatomb integrates query information from both amino acid and nucleotide reference sequence databases. Hecatomb integrates data collected throughout the workflow, enabling analyst-driven virome analysis and discovery. Hecatomb is available on GitHub at https://github.com/shandley/hecatomb.

Conclusions

Hecatomb provides a single, modular software solution to the complex tasks required of many virome analyses. We demonstrate the value of the approach by applying Hecatomb to both a host-associated (enteric) and an environmental (marine) virome data set. Hecatomb provided data to determine true- or false-positive viral sequences in both data sets and revealed complex virome structure at distinct marine reef sites.

GigaScience
Jul 1, 2024


Reviewer 2: Satoshi Hiraoka

In this manuscript, the authors developed a novel pipeline, Hecatomb, for viral genome analysis using metagenome and virome data that accepts both short- and long-read sequencing data. Using the pipeline, the authors analyzed one virome and one metagenome dataset from different environments (stool and coral reef, respectively). The analyses showed reasonable results according to the original studies, and moreover they discovered candidate novel phages and new findings that may offer great insight into microbial ecology. The manuscript is overall informative and well written. Hecatomb incorporates well-known bioinformatics tools that are frequently used in viral genome analyses today, allowing many researchers, including beginners, to examine virome datasets easily and effectively. Thus the pipeline is likely valuable and would contribute to a wide range of studies of viruses, most of which are uncultured and whose characteristics are unknown. Notably, there is an informative documentation page (https://hecatomb.readthedocs.io/en/latest/) including tutorials, which are very helpful for many users. I think this point could be emphasized more in the manuscript. Unfortunately, however, the lack of analysis of a mock dataset makes it hard to estimate the accuracy of the pipeline. I think adding this kind of analysis to evaluate performance would greatly improve the study.

I have some suggestions that would increase the clarity and impact of this manuscript if addressed.

Major:

In general, to clearly evaluate the efficiency of novel bioinformatic tools and pipelines, benchmarking using ground-truth datasets is important before application to real datasets. To achieve this, in this case, artificial datasets composed of known viral and prokaryotic genomes with defined composition, library types (single- and paired-end), and read lengths (current short and long reads) could be designed as mock metagenome data. By analyzing the mock datasets, the accuracy of the pipeline can be evaluated. It would be appreciated if the authors performed such benchmarking tests as well as the real data applications.

According to the GitHub page, Hecatomb is designed to generate results that reduce false-positive and enrich for true-positive viral read detection. This point is important for understanding the purpose of developing the pipeline and for differentiating it from other tools. The efficiency of the false-positive reduction achieved by this pipeline should be shown more clearly in this manuscript. Therefore the mock dataset analyses are expected.

When I read the manuscript, I was confused about which datasets the pipeline is aimed at. Is Hecatomb designed to analyze common prokaryotic shotgun metagenomic data to detect viruses? In other words, is the pipeline not limited to analyzing viral metagenomes (viromes), in which viral particles are specifically enriched from the samples before sequencing (e.g., density centrifugation to concentrate viral particles)? The stool samples were likely virome datasets (viral particles were enriched via 0.45-μm-pore-size membrane filtration according to the article), whereas the coral reef data are metagenome datasets. I would suggest that the terms "viral metagenome" (or virome, specifically targeting only viruses) and common "metagenome" (mainly focusing on prokaryotes) should be clearly distinguished throughout the manuscript, including the title.

I'm wondering about the sequence clustering step in Module 1. In my understanding, in metagenomic settings, genomic regions are randomly sequenced, and thus most of the sequenced reads will not be clustered together using the criteria described in the manuscript, and not many sequences are removed in this step. Is this step truly needed? Please add more explanation of this step and its importance. For example, what proportion of the reads was removed in this step in the tests of the two real datasets (stool and coral reef)?

Minor:

The introduction section is informative but a bit long. The section could be shortened.

Some viruses were newly found using the pipeline (e.g., Fig1A). Which virus type (dsDNA, ssDNA, dsRNA, ssRNA) is each one? It would be better to show this information clearly in the figure.

I think sequences derived from RNA viruses are generally not abundant in typical metagenomic datasets unless specific experimental techniques are used. I think the potential for detecting RNA viruses from typical metagenomic DNA sequencing reads should be discussed in the Introduction section.

L103. Please describe where the name "Hecatomb" is derived from in this article, though this is shown on the GitHub page.

L119. "round A/B libraries" here, but I have not heard of this term and could not find it in the articles cited here. Please add more explanation of what "round A/B libraries" are.

L130 up to 2 insertions and deletions?

L131. BBmap included in BBtools [73]?

L181. A brief explanation of the "Baltimore classification" here would improve the readability for readers who are not familiar with it.

L239. There is no explanation of what "SIV" means before this point.

L253-L268 & Figure 4B. According to Figure 2A, there are two paths (1,2,5: aa and 1,3,4,5: nt) for detecting viral reads. I'm interested in which path is major and which is minor. Could the authors provide the proportion of reads predicted using aa or nt in each dataset examined (stool and coral)?

L431, L436. Not only the BioProject but also the SRA accession IDs should be provided.

L479. There is no LACC here. What is his main contribution? Just reviewing and editing the manuscript is insufficient for listing as an author: see https://www.icmje.org/recommendations/browse/roles-and-responsibilities/defining-the-role-of-authors-and-contributors.html#two

Figure 1. There are some DBs newly created and used in the pipeline (e.g., Viral AA DB, Multi-kingdom AA DB, Virus NT DB, and Polymicrobial MT DB). I think it would be better to add how the DBs were made in this or other figures. This would help readers understand how to construct the DBs and why they are used in this pipeline.

Figure 1. Please specify (1)-(4) in the legend, not just color.

Figure 4A. Please provide the total number of sequencing reads in addition to the read count assigned to each virus.

Figure 4C. CPM was not explained in the manuscript and is not listed in L460.

L490. Some references are incomplete, e.g., lacking an article ID or page number (49, 79, 90, 94, 95, 96, 100, 101, 102), or retaining unnecessary words ("academic.oup.com" in 90, 91), etc. Please check the reference list carefully.

Figure S5. Alignment length (bp)

Table S2. For calculating the best hit identity, what database was used?
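Two quantities raised in this review, the accuracy of viral read detection on a mock dataset and the CPM values queried for Figure 4C, can be illustrated with a minimal Python sketch. This is not Hecatomb's implementation: the function names, the read IDs, and the taxon label are hypothetical, and CPM is assumed here to mean the standard "counts per million" scaling against the total number of reads in a sample, which the manuscript may define differently.

```python
from typing import Dict, Set, Tuple


def precision_recall(predicted_viral: Set[str], true_viral: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for a set of read IDs called viral.

    predicted_viral: read IDs the pipeline classified as viral (hypothetical input).
    true_viral: read IDs known to be viral in the mock dataset (ground truth).
    """
    true_positives = len(predicted_viral & true_viral)
    # Precision: fraction of viral calls that are correct (penalizes false positives).
    precision = true_positives / len(predicted_viral) if predicted_viral else 0.0
    # Recall: fraction of truly viral reads that were recovered.
    recall = true_positives / len(true_viral) if true_viral else 0.0
    return precision, recall


def counts_per_million(read_counts: Dict[str, int], total_reads: int) -> Dict[str, float]:
    """Scale per-taxon read counts to counts per million (CPM) of total reads.

    Assumes CPM = count * 1e6 / total reads in the sample.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {taxon: count * 1_000_000 / total_reads for taxon, count in read_counts.items()}


if __name__ == "__main__":
    # Hypothetical mock data: four reads are truly viral, three were called viral.
    truth = {"r1", "r2", "r3", "r4"}
    called = {"r1", "r2", "r5"}
    print(precision_recall(called, truth))

    # Hypothetical taxon with 250 assigned reads in a sample of one million reads.
    print(counts_per_million({"taxon_A": 250}, total_reads=1_000_000))
```

Reporting precision alongside recall on a mock dataset would show directly how far false-positive calls are reduced, and stating the denominator used for CPM would address the request for total read counts in Figure 4A.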

GigaScience
Jul 1, 2024


Reviewer 1: Arvind Varsani

The MS titled "Hecatomb: An Integrated Software Platform for Viral Metagenomics" addresses the development of a toolkit for viral metagenomics analysis that assembles a variety of tools into a workflow.

Overall, I do not have any issue with this MS or the toolkit.

I have some minor points to help improve the MS and make it as current as possible.

1. Line 40: I would include Cenote-Taker 2 (PMID: 33505708) and geNomad (https://www.biorxiv.org/content/10.1101/2023.03.05.531206v1).
2. Line 40: I would probably not cite the preprint of this current paper - see ref 21.
3. Line 80: Actually, Cenote-Taker (both versions 1 and 2) uses HMMs and, as far as I know, so does geNomad.
4. Line 248: Please note that Siphoviridae, Podoviridae and Myoviridae are not currently family names. See PMID: 36683075.
5. This means you will likely need to edit your figure to collapse these to Caudovirales.
6. Line 250-251: Picornaviridae and Adenoviridae should be in italics.
7. Line 270: Here and elsewhere, please note that taxa do not infect a host; it is a virus that infects a host. "Mimiviridae, that infect Acanthamoeba, and Phycodnaviridae, that infect algae, are both dsDNA viruses with large genomes" should ideally be written as "Viruses in the family Mimiviridae infect Acanthamoeba and those in the family Phycodnaviridae infect algae; both are dsDNA viruses with large genomes."
8. Figure 6: The name tags of the CDS/ORFs are truncated, e.g. replication initiate…, heat maturation prot…
9. Figure 6: Major head protein should be major capsid protein.
10. One thing that I would highlight is that none of the workflows or toolkits developed account for spliced CDS. This is a major issue in the automation of virus genome annotation at the moment, and because of it there will be some degree of misidentification in taxon assignment.

Version published to 10.1101/2022.05.15.492003 on bioRxiv
May 16, 2022