Strategy and Performance Evaluation of Low-Frequency Variant Calling for SARS-CoV-2 Using Targeted Deep Illumina Sequencing

Laura A. E. Van Poelvoorde
Thomas Delcourt
Wim Coucke
Philippe Herman
Sigrid C. J. De Keersmaecker
Xavier Saelens
Nancy H. C. Roosens
Kevin Vanneste

Abstract

The ongoing COVID-19 pandemic, caused by SARS-CoV-2, constitutes a tremendous global health issue. Continuous monitoring of the virus has become a cornerstone of rational decision-making on societal and sanitary measures to curtail its spread. Additionally, emerging SARS-CoV-2 variants have increased the need for genomic surveillance to detect particular strains because of their potentially increased transmissibility, pathogenicity and immune escape. Targeted SARS-CoV-2 sequencing of diagnostic and wastewater samples has been explored as an epidemiological surveillance method for the competent authorities. Currently, only the consensus genome sequence of the most abundant strain is taken into consideration for analysis, but multiple variant strains are now circulating in the population. Consequently, in diagnostic samples, co-infection by several different variants can occur, or quasispecies can develop during an infection in an individual. In wastewater samples, multiple variant strains will often be simultaneously present. Quality criteria are mainly available for constructing the consensus genome sequence, and some guidelines exist for the detection of co-infections and quasispecies in diagnostic samples. However, the performance of detection and quantification of low-frequency variants using whole genome sequencing (WGS) of SARS-CoV-2 remains largely unknown. Here, we evaluated the detection and quantification of mutations present at low abundances using the mutations defining the SARS-CoV-2 lineage B.1.1.7 (alpha variant) as a case study. Real sequencing data were modified in silico, either by introducing mutations of interest into raw wild-type sequencing data or by mixing wild-type and mutant raw sequencing data, to construct mixed samples subjected to WGS using a tiling amplicon-based targeted metagenomics approach and Illumina sequencing.
As anticipated, higher variation and lower sensitivity were observed at lower coverages and allelic frequencies. We found that detection of all low-frequency variants at an abundance of 10, 5, 3, and 1% requires a sequencing coverage of at least 250, 500, 1,500, and 10,000×, respectively. Although the variability of estimated allelic frequencies increased at lower coverages and lower allelic frequencies, its impact on reliable quantification was limited. This study provides a highly sensitive low-frequency variant detection approach, which is publicly available at https://galaxy.sciensano.be, and specific recommendations for minimum sequencing coverages to detect clade-defining mutations at certain allelic frequencies. This approach will be useful to detect and quantify low-frequency variants in both diagnostic (e.g., co-infections and quasispecies) and wastewater [e.g., multiple variants of concern (VOCs)] samples.
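
The coverage recommendations in the abstract can be read as a small lookup table. The following Python sketch is illustrative only: the function names are hypothetical, and the handling of allelic frequencies that fall between the four reported values (taking the next lower reported frequency, which is the more conservative choice) is an assumption, not something the study states.

```python
# Minimum coverages reported in the abstract, as (allelic frequency, coverage) pairs.
# Only these four points come from the source; nothing is interpolated between them.
REPORTED_THRESHOLDS = [(0.10, 250), (0.05, 500), (0.03, 1500), (0.01, 10000)]


def min_coverage_for(allelic_frequency):
    """Return the reported minimum coverage for a target allelic frequency.

    Assumption: a frequency between two reported values is treated like the
    next LOWER reported frequency (e.g. 7% is treated as 5%), which errs on
    the side of more coverage. Returns None below 1%, where the abstract
    gives no recommendation.
    """
    for af, coverage in sorted(REPORTED_THRESHOLDS, reverse=True):
        if allelic_frequency >= af:
            return coverage
    return None


def lowest_reported_af_at(coverage):
    """Return the lowest reported allelic frequency covered by a given coverage.

    Returns None when the coverage is below the smallest reported minimum (250x).
    """
    covered = [af for af, needed in REPORTED_THRESHOLDS if coverage >= needed]
    return min(covered) if covered else None


if __name__ == "__main__":
    for target in (0.10, 0.07, 0.03, 0.01, 0.005):
        print(target, min_coverage_for(target))
    for depth in (100, 250, 600, 10000):
        print(depth, lowest_reported_af_at(depth))
```

Such a table could serve as a quick pre-check of whether a sample's coverage is sufficient before low-frequency calls at a given abundance are interpreted.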

Version published to 10.3389/fmicb.2021.747458
Oct 13, 2021

SciScore for 10.1101/2021.07.02.21259923: (What is this?)

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
2.1 Employed sequencing data and generation of consensus genome sequences: SARS-CoV-2 raw sequencing data from 316 samples was downloaded from the Sequence Read Archive (SRA) [42].	Sequence Read Archive suggested: (DDBJ Sequence Read Archive, RRID:SCR_001370)
Next, the re-paired paired-end reads were trimmed using Trimmomatic v0.38 [45] setting the following options: ‘LEADING:10’, ‘TRAILING:10’, ‘SLIDINGWINDOW:4:20’, and ‘MINLEN:40’.	Trimmomatic suggested: (Trimmomatic, RRID:SCR_011848)
Trimmed reads were aligned to their respective reference genomes using Bowtie2 v2.3.4.3 [46] using default parameters.	Bowtie2 suggested: (Bowtie 2, RRID:SCR_016368)
The resulting SAM files were converted to BAM files using Samtools view v1.9 [47] and sorted and indexed using the default settings of respectively Samtools sort and Samtools index v1.9 [47].	Samtools suggested: (SAMTOOLS, RRID:SCR_002105)
Python 3.6.9 was used with the packages pysam 0.16.0.1 [54] and numpy 1.19.5 [55].	Python suggested: (IPython, RRID:SCR_001658) numpy suggested: (NumPy, RRID:SCR_008633)
Next, reads were sorted using Picard SortSam v2.18.14 (https://github.com/broadinstitute/picard) with the option “SORT_ORDER=coordinate” and Picard CreateSequenceDictionary v2.18.14 [56] was used to generate a dictionary of the reference FASTA file.	Picard suggested: (Picard, RRID:SCR_006525)
The resulting BAM files were indexed using Samtools index v1.9 and used as input for GATK RealignerTargetCreator 3.7 [57], which was followed by indel realignment using GATK IndelRealigner v3.7 [57].	GATK suggested: (GATK, RRID:SCR_001876)
Additionally, the workflow is also available at the public Galaxy instance of our institute at https://galaxy.sciensano.be as a free resource for academic and non-profit usage.	Galaxy suggested: (Galaxy, RRID:SCR_006281)
Samples in BAM format were then converted back to FASTQ format using bedtools bamtofastq v2.27.1 [60].	bedtools suggested: (BEDTools, RRID:SCR_006646)
Finally the LFV detection workflow (Figure 1: Step 3) described in section 2.2 was used on these 10 samples for all 364 conditions using the FASTA file of the wild-type sample as reference with LoFreq.	LoFreq suggested: (LoFreq, RRID:SCR_013054)
2.2.2 Dataset 2: Introduction of mutations of interest by mixing wild-type and mutant raw sequencing read datasets: For the second dataset (Figure 1: Step 5), the coverage of all 20 samples (Table 2) was normalized to 5000X using BBMap v38.89 bbnorm.sh [43] with the options “target=5000”, “mindepth=5”, “fixspikes=f”, “passes=3”, “uselowerdepth=t”.	BBMap suggested: (BBmap, RRID:SCR_016965)
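
The read-processing steps quoted in the table above (trimming, alignment, BAM conversion, coverage normalization) can be sketched as command lines assembled in Python. This is a minimal sketch under stated assumptions, not the authors' workflow: the option values are those quoted in the table, but the executable names (`trimmomatic` as a wrapper script), all file names, the output naming scheme, and the `-b`/`-o` flags for samtools are assumptions introduced for illustration. The functions only build argument lists and do not run anything.

```python
from typing import List

# Option values quoted in the table above.
TRIM_STEPS = ["LEADING:10", "TRAILING:10", "SLIDINGWINDOW:4:20", "MINLEN:40"]
BBNORM_OPTIONS = ["target=5000", "mindepth=5", "fixspikes=f",
                  "passes=3", "uselowerdepth=t"]


def trimmomatic_cmd(r1: str, r2: str, prefix: str) -> List[str]:
    """Paired-end Trimmomatic call; output file names are hypothetical."""
    return ["trimmomatic", "PE", r1, r2,
            f"{prefix}_1P.fastq.gz", f"{prefix}_1U.fastq.gz",
            f"{prefix}_2P.fastq.gz", f"{prefix}_2U.fastq.gz"] + TRIM_STEPS


def bowtie2_cmd(index_prefix: str, r1: str, r2: str, sam: str) -> List[str]:
    """Bowtie2 paired-end alignment with default parameters."""
    return ["bowtie2", "-x", index_prefix, "-1", r1, "-2", r2, "-S", sam]


def samtools_cmds(sam: str, bam: str, sorted_bam: str) -> List[List[str]]:
    """SAM to BAM conversion, then coordinate sort and index."""
    return [
        ["samtools", "view", "-b", "-o", bam, sam],
        ["samtools", "sort", "-o", sorted_bam, bam],
        ["samtools", "index", sorted_bam],
    ]


def bbnorm_cmd(r1: str, r2: str, out1: str, out2: str) -> List[str]:
    """BBMap bbnorm.sh call with the normalization options from the table."""
    return ["bbnorm.sh", f"in={r1}", f"in2={r2}",
            f"out={out1}", f"out2={out2}"] + BBNORM_OPTIONS


if __name__ == "__main__":
    # Hypothetical sample and reference names, for illustration only.
    commands = [trimmomatic_cmd("s_1.fastq.gz", "s_2.fastq.gz", "s_trim"),
                bowtie2_cmd("ref_index", "s_trim_1P.fastq.gz",
                            "s_trim_2P.fastq.gz", "s.sam")]
    commands += samtools_cmds("s.sam", "s.bam", "s.sorted.bam")
    commands.append(bbnorm_cmd("s_1.fastq.gz", "s_2.fastq.gz",
                               "s_norm_1.fastq.gz", "s_norm_2.fastq.gz"))
    for cmd in commands:
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed; keeping the commands as lists avoids shell quoting issues with file names.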

Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).

Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a protocol registration statement.

Results from scite Reference Check: We found no unreliable references.

Version published to 10.1101/2021.07.02.21259923 on medRxiv
Jul 7, 2021

