Optimized Quantification of Intrahost Viral Diversity in SARS-CoV-2 and Influenza Virus Sequence Data

This article has been reviewed by the following groups


Abstract

High error rates of viral RNA-dependent RNA polymerases lead to diverse intrahost viral populations during infection. Errors made during replication that are not strongly deleterious to the virus can lead to the generation of minority variants. However, accurate detection of minority variants in viral sequence data is complicated by errors introduced during sample preparation and data analysis. We used synthetic RNA controls and simulated data to test seven variant calling tools across a range of allele frequencies and simulated coverages. We show that the choice of variant caller and the use of replicate sequencing have the most significant impact on single nucleotide variant (SNV) discovery, and we demonstrate how both allele frequency and coverage thresholds affect false discovery and false negative rates. We use these parameters to identify minority variants in sequencing data from SARS-CoV-2 clinical specimens and provide guidance for studies of intrahost viral diversity using either single-replicate data or data from technical replicates. Our study provides a framework for rigorous assessment of technical factors that impact SNV identification in viral samples and establishes heuristics that will inform and improve future studies of intrahost variation, viral diversity, and viral evolution.

IMPORTANCE

When viruses replicate inside a host, the viral replication machinery makes mistakes. Over time, these mistakes create mutations that result in a diverse population of viruses inside the host. Mutations that are neither lethal to the virus nor strongly beneficial can lead to minority variants, which are minor members of the virus population. However, preparing samples for sequencing can also introduce errors that resemble minority variants, resulting in the inclusion of false positive data if not filtered correctly. In this study, we aimed to determine the best methods for identification and quantification of these minority variants by testing the performance of seven commonly used variant calling tools. We used simulated and synthetic data to test their performance against a true set of variants, and then used these studies to inform variant identification in data from clinical SARS-CoV-2 specimens. Together, analyses of our data provide extensive guidance for future studies of viral diversity and evolution.

Article activity feed

  1. SciScore for 10.1101/2021.05.05.442873:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Software and Algorithms
    Sentences | Resources
    Each pair of downsampled fastq files, along with the original, was quality and adapter trimmed using trimmomatic v0.36 with the following parameters: ILLUMINACLIP:adapters.fa:2:30:10:8:true LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:20 (47).
    trimmomatic
    suggested: (Trimmomatic, RRID:SCR_011848)
    The trimmed reads were aligned to the Wuhan-Hu-1 SARS-CoV-2 reference genome (NC_045512.2) using BWA mem v0.7.17 with the -K parameter set to 100000000 for reproducibility and -Y to use soft clipping for supplementary alignments (48).
    BWA
    suggested: (BWA, RRID:SCR_010910)
    Variants were called using six separate methods. Intersections between the workflow VCF files (produced by Mutect2, Freebayes, timo, VarScan, iVar, and HaplotypeCaller) and the golden VCF file were generated using bcftools isec v1.9 (48).
    Mutect2
    suggested: None
    VarScan
    suggested: (VARSCAN, RRID:SCR_006849)
    Assembly of genomes and consensus sequences: Reads were base-called with Picard Tools IlluminaBasecallsToFastq v2.17.11 and demultiplexed using Pheniqs allowing for 1 mismatch in sample index sequences (49, 50).
    Picard
    suggested: (Picard, RRID:SCR_006525)
    Duplicates were marked using GATK MarkDuplicatesSpark v4.1.3.0 (https://gatk.broadinstitute.org/hc/en-us/articles/360037224932-MarkDuplicatesSpark).
    GATK
    suggested: (GATK, RRID:SCR_001876)
    Predicted SNV effects were called using SnpEff v4.3i (51).
    SnpEff
    suggested: (SnpEff, RRID:SCR_005191)
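The individual commands quoted in Table 2 can be collected into a single dry-run sketch of the pipeline. Only the tool names, versions, and parameters come from the quoted sentences; all file names (sample_R1.fastq.gz, adapters.fa, caller.vcf, the SnpEff database name, and so on) are illustrative placeholders, not taken from the paper.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the variant-calling pipeline described in Table 2.
# Each step is built as a command string so it can be inspected (or
# executed with eval) one at a time. File names are placeholders.
set -u

REF=NC_045512.2.fasta   # Wuhan-Hu-1 SARS-CoV-2 reference genome

# 1. Quality/adapter trimming (trimmomatic v0.36, parameters as quoted)
TRIM_CMD="trimmomatic PE sample_R1.fastq.gz sample_R2.fastq.gz \
  trimmed_R1.fastq.gz unpaired_R1.fastq.gz \
  trimmed_R2.fastq.gz unpaired_R2.fastq.gz \
  ILLUMINACLIP:adapters.fa:2:30:10:8:true LEADING:20 TRAILING:20 \
  SLIDINGWINDOW:4:20 MINLEN:20"

# 2. Alignment to Wuhan-Hu-1 (BWA-MEM v0.7.17; -K fixes the input chunk
#    size for reproducibility, -Y soft-clips supplementary alignments)
ALIGN_CMD="bwa mem -K 100000000 -Y $REF trimmed_R1.fastq.gz trimmed_R2.fastq.gz"

# 3. Duplicate marking (GATK v4.1.3.0)
DEDUP_CMD="gatk MarkDuplicatesSpark -I aligned.bam -O dedup.bam"

# 4. Intersect a caller's VCF with the golden VCF (bcftools isec v1.9)
ISEC_CMD="bcftools isec -p isec_out caller.vcf.gz golden.vcf.gz"

# 5. Annotate predicted SNV effects (SnpEff v4.3i; database name assumed)
SNPEFF_CMD="snpEff NC_045512.2 caller.vcf"

# Dry run: print each step; replace echo with eval to execute for real.
for cmd in "$TRIM_CMD" "$ALIGN_CMD" "$DEDUP_CMD" "$ISEC_CMD" "$SNPEFF_CMD"; do
  echo "+ $cmd"
done
```

Keeping each step as a string makes the pipeline easy to audit against the methods text before anything is executed, which matters here because small parameter differences (e.g. the trimming window or the -K chunk size) change which minority variants survive filtering.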

    Results from OddPub: Thank you for sharing your code and data.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • No conflict of interest statement was detected. If there are no conflicts, we encourage authors to explicitly state so.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy-to-digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.