ViReMa: A Virus Recombination Mapper of Next-Generation Sequencing data characterizes diverse recombinant viral nucleic acids



Abstract

Genetic recombination is a tremendous source of intra-host diversity in viruses and is critical for their ability to rapidly adapt to new environments or fitness challenges. While viruses are routinely characterized using high-throughput sequencing techniques, characterizing the genetic products of recombination in next-generation sequencing data remains a challenge. Viral recombination events can be highly diverse and variable in nature, including simple duplications and deletions, or more complex events such as copy/snap-back recombination, inter-virus or inter-segment recombination, and insertions of host nucleic acids. Due to the variable mechanisms driving virus recombination and the different selection pressures acting on the progeny, recombination junctions rarely adhere to simple canonical sites or sequences. Furthermore, numerous different events may be present simultaneously in a viral population, yielding a complex mutational landscape. We have previously developed an algorithm called ViReMa (Virus Recombination Mapper) that bootstraps the bowtie short-read aligner to capture and annotate a wide range of recombinant species found within virus populations. Here, we have updated ViReMa to provide an ‘error-density’ function designed to accurately detect recombination events in the longer reads now routinely generated by the Illumina platforms and to provide output reports for multiple types of recombinant species using standardized formats. We demonstrate the utility and flexibility of ViReMa in different settings to report deletion events in simulated data from Flock House virus, copy-back RNA species in Sendai viruses, short duplication events in HIV, and virus-to-host recombination in an archaeal DNA virus.
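
As a rough illustration of what an 'error-density' style check could look like, the sketch below accepts an alignment only if no stretch of `window` consecutive bases contains more than `max_errors` mismatches. This is an assumption made for illustration and is not taken from ViReMa's code: the function name, its parameters, and the default values are all hypothetical.

```python
def passes_error_density(mismatch_positions, read_length, max_errors=1, window=25):
    """Return True if no `window` consecutive bases hold more than `max_errors` mismatches.

    Hypothetical helper for illustration only; not ViReMa's implementation.
    `mismatch_positions` are 0-based positions of mismatches within the read.
    """
    positions = sorted(mismatch_positions)
    if any(p < 0 or p >= read_length for p in positions):
        raise ValueError("mismatch position outside the read")

    left = 0  # index of the leftmost mismatch still inside the current window
    for right, pos in enumerate(positions):
        # Drop mismatches that lie `window` or more bases to the left of `pos`.
        while pos - positions[left] >= window:
            left += 1
        # Count mismatches in the window ending at `pos`.
        if right - left + 1 > max_errors:
            return False
    return True
```

With these hypothetical defaults, a 150-nt read with mismatches at positions 10, 60 and 110 would pass, whereas a read with mismatches at positions 10 and 20 would be rejected. A per-window limit of this kind scales with read length, unlike a fixed per-read mismatch cap.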

Article activity feed

  1. Genetic recombinat

    Reviewer2-Fadi G Alnaji

    In this work, Sotcheff et al. provide a comprehensive and nicely written report on using the Virus Recombination Mapper (ViReMa) algorithm to identify and characterize different kinds of recombination events in different viruses. ViReMa was first reported by the same group in a separate paper (Routh et al., NAR, 2013) as a Python-based algorithm that, by accounting for the high diversity of virus populations, can efficiently detect a wide range of virus recombination junctions within virus-derived Next-Generation Sequencing (NGS) datasets. In this paper, the authors describe a couple of important updates to the original algorithm that enable ViReMa to cope with recent technological advances in NGS, including longer reads and the significant increase in NGS library size and in the number of NGS-based experiments. Notably, the authors implemented a powerful validation approach by challenging the algorithm with different types of NGS-based data containing various types of junctions from different viruses, highlighting the contextual computational and biological connotations. Overall, the paper uses a robust analysis method and sufficient controls to clearly demonstrate the capacity of ViReMa to detect different types of recombinant molecules in different NGS datasets and viruses with high sensitivity and specificity. I have only a few minor comments.

    Minor comments

    1) Since Fig 2E shows the gradual effect of the permissibility imposed by the error-density values, transforming the tables into figures (e.g. bar or scatter plots) would make the effect easier to see.

    2) At lines 500-501, the authors found that the majority of reads mapped directly to the virus genome. Judging by the number of aligned reads, this dataset seems fairly large. Could the newly added --Chunk function be used in this scenario to speed up the analysis? If so, mentioning it would be valuable.

    3) At line 478, the authors state: "The 'Reads' columns describe the number of reads at each particular nucleotide position". Is this the average read number?

    4) Typos at line 206 "red", and at line 397 "(NL4-3)"

  2. Abstract

    Reviewer1-Diogo Pratas

    This article describes a pipeline (coded in Python) to detect and analyze recombination events in viral genomes using short-read FASTQ data. The paper presents some level of work accomplished by the authors. Usually, these types of articles hide numerous hours of coding and experimentation. Moreover, the authors present actual accomplishments, typically unique architectural designs and important alternative approaches for the area, including several results. However, many points require attention, namely:

    1) This pipeline expects one specific virus and therefore uses a specific reference. However, this reference might not be the most representative because of the recombination events. Although it may be appropriate for smaller recombination events, detecting large-scale recombinations may face substantial difficulties. Moreover, since it is not prepared to deal with more significant variations (it has no de-novo support), it is exclusively a targeted approach. Therefore, the article could be more descriptive about this specificity.

    2) The article states that the improvement is also motivated by the increase in read length that NGS is bringing. Also, the reported depth coverages are very high. So, why not use de-novo assembly? For example, de-novo assembly can be used to create scaffolds that generate a reference sequence to be used afterwards by the aligners. Please comment on this.

    3) Regarding the use of artificial poly(A) tails to allow the mapper to align the reads, what happens when the read size is smaller than the k-mer hash of the aligner? Usually, repetitive A-sequence content appears in almost all samples because such sequences have lower entropy and a higher probability of being generated. Wouldn't this create ambiguity, especially when there are very high depth coverages? Please comment on this matter.

    4) What is the minimum read size allowed for a read to be considered valid for downstream analyses? Are the reads collapsed (in the case of paired-end data) or considered split? Although less probable, trimming is fundamental for excluding "events" generated at the tips of the reads that very rarely overlap, depending on the nucleotide distribution.

    5) Are the reads clipped above a particular depth coverage? This feature is especially critical in repetitive viral content, such as hairpins or poly(A) tails, because it removes the mountains that become the most significant factor in sequence depth coverage.

    6) Have some of these viruses been enriched by targeted capture? Please provide this information in the manuscript. In some parts of the article, the coverage depth is very high: 300'000 - is this 300000? The simulated data used this coverage, which may not be entirely similar to reality. Also, allowing lower depth coverage would help to understand how the pipeline behaves. Moreover, older versions of some aligners may have problems with these depth values.

    7) It was unclear which types of duplications were flagged and whether the pipeline covers them.

    8) How does the pipeline deal with contaminants?

    9) The article states that the pipeline works for viral sequences. However, the tests used do not include large genomes. What about larger genomes? Some larger genomes contain repetitive content that poses additional reconstruction challenges. Therefore, the benchmark could include an example of this nature.

    10) When looking for recombination events, especially fusions with the host, what are the differences between sequenced viral integrations and fusion events at the analysis level? How do we distinguish the two using this pipeline? Please comment on this.

    11) The authors state that the pipeline provides accurate results. Regarding the calculation of accuracy values, several good practices are recommended by many experts in the field:
        a) https://www.sciencedirect.com/science/article/pii/S1386653220304339
        b) https://www.sciencedirect.com/science/article/pii/S1386653221000792

    12) A discussion of existing pipelines in the area could guide the reader to other, sometimes complementary, solutions. See, for example:
        a) ASPIRE: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08649-8
        b) TRACESPipe: https://academic.oup.com/gigascience/article/9/8/giaa086/5894824
        c) V-pipe: https://academic.oup.com/bioinformatics/article/37/12/1673/6104816

    13) Line 113: "in range a of plant" - please correct.

    14) Lines 120-121: Please rephrase.

    15) There are several acronyms; perhaps an abbreviation list would improve the readability of the article.

    16) Line 394: ART is defined as "antiretroviral treated," but this acronym overlaps with the ART simulator. Perhaps, in this case, adding another letter or changing it would remove the ambiguity.

    17) Lines 753-754: Reference 27 is missing at least the title, journal, and year.

    18) Please consider adding ViReMa to Bioconda.

    19) I tried to clone the repository from SourceForge, and it came out empty. I had to download the package manually. I faced some problems, perhaps because the instructions were not easy to follow. Users may face the same difficulties, which may be an obstacle to using the software. Please consider providing an elementary example for running ViReMa (including a tiny read sample and reference along with the code and a command description, including how to run the GUI). Please consider using GitHub in the future.
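
    The low-entropy argument in point 3 can be made concrete with a small sketch. It assumes Shannon entropy over single-nucleotide frequencies as the complexity measure, which is one common choice and not necessarily what the reviewer or ViReMa has in mind. The function names and the 1.0-bit threshold are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits per base) of the single-nucleotide composition of `seq`."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum the per-symbol contributions -p * log2(p).
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def is_low_complexity(seq, threshold=1.0):
    """Flag reads whose entropy falls below a hypothetical threshold (in bits)."""
    return shannon_entropy(seq) < threshold
```

    A homopolymer such as a run of A has an entropy of 0 bits, while a sequence with equal proportions of A, C, G and T reaches the maximum of 2 bits, so poly(A)-rich reads would be flagged by a filter of this kind before alignment.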