Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR

Abstract

In 2020 and 2021, more than 1.5 million SARS-CoV-2 sequences were submitted to GenBank. The initial version (v1.0) of the VADR (Viral Annotation DefineR) software package, which GenBank uses to automatically validate and annotate incoming viral sequences, is too slow and memory intensive to process many thousands of SARS-CoV-2 sequences in a reasonable amount of time. Additionally, long stretches of ambiguous N nucleotides, which are common in many SARS-CoV-2 sequences, prevent VADR from validating and annotating those sequences accurately. VADR has been updated to annotate SARS-CoV-2 sequences more accurately and rapidly. Stretches of consecutive Ns are now identified and temporarily replaced with expected nucleotides to facilitate processing, and the slowest steps have been overhauled using blastn and glsearch, increasing speed, reducing the memory requirement from 64 GB to 2 GB per thread, and allowing simple, coarse-grained parallelization on multiple processors per host. VADR is now nearly 1000 times faster at processing SARS-CoV-2 sequences than it was in early 2020. It has been used to screen and annotate more than 1.5 million SARS-CoV-2 sequences since June 2020, and it is now efficient enough to cope with the current rate of hundreds of thousands of submitted sequences per month.
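The N-replacement idea described in the abstract can be illustrated with a minimal Python sketch. This is not VADR's implementation: the function names `replace_n_runs` and `restore_n_runs` and the `min_run` threshold are hypothetical, and the sketch assumes the input sequence already shares coordinates with the reference (same length, no insertions or deletions), which real submissions generally do not.

```python
import re


def replace_n_runs(seq, reference, min_run=5):
    """Temporarily fill runs of N with the reference bases at the same positions.

    Assumption (for illustration only): seq and reference are the same length
    and share coordinates. Returns the filled sequence and the list of
    (start, end) intervals that were replaced, so they can be restored later.
    """
    if len(seq) != len(reference):
        raise ValueError("sketch assumes seq and reference share coordinates")

    pieces = []    # fragments of the filled sequence, joined at the end
    replaced = []  # half-open intervals [start, end) that held Ns
    last = 0
    # Find every run of at least min_run consecutive N characters.
    for match in re.finditer(r"[Nn]{%d,}" % min_run, seq):
        start, end = match.span()
        pieces.append(seq[last:start])        # keep the unambiguous part
        pieces.append(reference[start:end])   # substitute expected bases
        replaced.append((start, end))
        last = end
    pieces.append(seq[last:])
    return "".join(pieces), replaced


def restore_n_runs(seq, replaced):
    """Put the Ns back once downstream processing has finished."""
    chars = list(seq)
    for start, end in replaced:
        chars[start:end] = "N" * (end - start)
    return "".join(chars)


reference = "ACGTACGTACGTACGT"
submitted = "ACGTNNNNNNGTACNT"  # one long N run and one isolated N
filled, intervals = replace_n_runs(submitted, reference)
restored = restore_n_runs(filled, intervals)
```

In this sketch only runs of at least `min_run` Ns are filled, so the isolated N near the end is left untouched, and the recorded intervals allow the original Ns to be restored after processing.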

Article activity feed

SciScore for 10.1101/2022.04.25.489427:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

Ethics	not detected.
Sex as a biological variable	not detected.
Randomization	not detected.
Blinding	not detected.
Power Analysis	not detected.

Table 2: Resources

No key resources detected.

Results from OddPub: Thank you for sharing your code and data.

Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.

Related articles

A universal pipeline MosaicProt enables large-scale modeling and detection of chimeric protein sequences for studies on programmed ribosomal frameshifting

Snekmer Learn/Apply: A kmer-based vector similarity approach to protein classification suitable for metagenomic datasets

Lazypipe3: Customizable Virome Analysis Pipeline Enabling Fast and Sensitive Virus Discovery from NGS data