From partial to whole genome imputation of SARS-CoV-2 for epidemiological surveillance

Francisco M Ortuño
Carlos Loucera
Carlos S. Casimiro-Soriguer
Jose A. Lepe
Pedro Camacho Martinez
Laura Merino Diaz
Adolfo de Salazar
Natalia Chueca
Federico García
Javier Perez-Florido
Joaquin Dopazo

Abstract

Background

The current SARS-CoV-2 pandemic has emphasized the utility of viral whole genome sequencing in the surveillance and control of the pathogen. An unprecedented ongoing global initiative is producing hundreds of thousands of sequences worldwide. However, the complex circumstances in which viruses are sequenced, along with the demand for urgent results, cause a high rate of incomplete, and therefore useless, sequences. Nevertheless, viral sequences evolve in the context of a complex phylogeny, so different positions along the genome are in linkage disequilibrium. An imputation method should therefore be able to predict missing positions from the available sequencing data.
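To make the idea of predicting missing positions from linked, observed ones concrete, here is a minimal toy sketch. It is not the method used by impuSARS or Minimac (which relies on a haplotype-copying hidden Markov model); it simply copies missing bases from the reference panel sequence that best matches the observed positions. The function name `impute_from_panel` and the short example sequences are hypothetical.

```python
from typing import List

MISSING = "N"  # symbol marking an unobserved position


def impute_from_panel(query: str, panel: List[str]) -> str:
    """Fill N positions in `query` from the closest sequence in `panel`.

    Toy nearest-neighbour imputation: all sequences are assumed to be
    aligned and of equal length. This is an illustration only, not the
    Minimac algorithm.
    """
    if not panel:
        raise ValueError("reference panel is empty")
    if any(len(ref) != len(query) for ref in panel):
        raise ValueError("all sequences must have the same length")

    def mismatches(ref: str) -> int:
        # Count disagreements at observed (non-N) positions only.
        return sum(1 for q, r in zip(query, ref) if q != MISSING and q != r)

    best = min(panel, key=mismatches)
    # Keep observed bases; copy the best match's base where the query has N.
    return "".join(r if q == MISSING else q for q, r in zip(query, best))


panel = ["ACGTACGT", "ACGTTCGA", "GCGTACGA"]
query = "GCNNACNA"
imputed = impute_from_panel(query, panel)
print(imputed)
```

Because the observed bases of the query match the third panel sequence exactly, its bases are used to fill the three missing positions.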

Results

We developed impuSARS, an application that includes Minimac, the most widely used strategy for genomic data imputation. Taking advantage of the enormous number of SARS-CoV-2 whole genome sequences available, we built a reference panel containing 239,301 sequences. The impuSARS application was tested in a wide range of conditions (continuous fragments, amplicons or sparse individual positions missing), showing great fidelity when reconstructing the original sequences. It is also able to impute whole genomes from commercial kits covering less than 20% of the genome, or from the Spike protein alone, with a precision of 0.96. It also recovers the lineage with 100% precision for almost all lineages, even in very poorly covered genomes (< 20%).

Conclusions

Imputation can improve the pace of SARS-CoV-2 sequence production by recovering many incomplete or low-quality sequences that would otherwise be discarded. impuSARS can be incorporated into any primary data processing pipeline for SARS-CoV-2 whole genome sequencing.

GigaScience
Mar 14, 2022

This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab078), which carries out open, named peer-review.

These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer 2: Stephen Nayfach

In their manuscript, Ortuno et al. develop a procedure for imputing missing genotypes of SARS-CoV-2. Missing genotypes can arise from fragmented whole genome assemblies, targeted sequencing (e.g. spike protein), or incomplete genotype panels. I really like this idea and thought the paper was conducted quite carefully. I was impressed by the high level of precision across all experiments. I have a few minor comments, questions, and suggestions below:

Major comments: My understanding is that only SNPs are imputed by the program. Is this correct? If this is the case, can the authors comment on the frequency of other types of variants in the SARS-CoV-2 genome? How common are small indels, large indels, or rearrangements? Can the authors include code for building their reference panel? This would enable the same pipeline to be applied to updated SARS-CoV-2 references or to other kinds of viruses entirely. For example, metagenomic DNA sequencing often yields partial viral genomes, and it would be great to use this same pipeline to impute these genomes (where sufficient references exist). I noticed that several of the PANGOLIN lineages seem especially hard to impute. Can the authors comment on why this might be the case? Regarding the PANGOLIN lineages, how do these correspond to specific variants of interest (e.g. delta variant)? Is this information provided to users? A visual could really help here showing the phylogenetic relationships between PANGOLIN lineages and how they relate to variants of interest. The authors indicate that missing regions of partial genome assemblies must be indicated by Ns. This seems like an artificial constraint that may be a pain point for users. Can the authors modify their program to detect missing regions from FASTA files and automatically fill these regions with Ns prior to imputation?
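The last request above (filling missing regions with Ns automatically) could look roughly like the following sketch. It assumes the fragments have already been aligned to the reference, so that each one comes with a 0-based start coordinate; the function name `pad_with_n` and the toy coordinates are hypothetical and do not describe how impuSARS handles its input.

```python
from typing import List, Tuple


def pad_with_n(fragments: List[Tuple[int, str]], genome_length: int) -> str:
    """Build a genome-length sequence with N wherever no fragment covers.

    `fragments` holds (start, bases) pairs with 0-based start coordinates
    on the reference. Later fragments overwrite earlier ones on overlap.
    """
    seq = ["N"] * genome_length
    for start, bases in fragments:
        end = start + len(bases)
        if start < 0 or end > genome_length:
            raise ValueError(f"fragment at {start} falls outside the genome")
        seq[start:end] = bases  # slice assignment copies base by base
    return "".join(seq)


# Toy example on a 12-base "genome"; a real call would pass the length
# of the reference genome used for alignment.
padded = pad_with_n([(2, "ACG"), (8, "TT")], genome_length=12)
print(padded)
```

The resulting string marks every uncovered position with N, which is the input convention the reviewer describes.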

Minor comments: For the installation options, please provide an alternative to docker. Would it be feasible to add an installation option using conda? In their methods, could the authors clearly define true positives, true negatives, false positives, and false negatives in the context of their validation experiments? Related to this point, I noticed that the precision is consistently high in the validation experiments, but recall can be quite low. I assume this means that the program will not impute a genotype where there is insufficient evidence, leaving it as an "N". In this case, users should have high confidence in all imputed genotypes. Is this correct? All the figures in the manuscript were of low resolution and difficult to read. The authors should use a consistent tense (present or past) throughout the manuscript. In some places future tense was even used: "Once we have validated the robustness of our imputation against different missing regions scenarios, the validation will focus on the imputation of variants".
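The reviewer's reading of high precision with low recall can be illustrated with a small sketch. The definitions used here are assumptions for illustration, not the ones in the manuscript: a true positive is a non-reference call that matches the truth, a false positive is a non-reference call that does not, and a false negative is a true variant that was left as "N", called as reference, or called with the wrong allele. The function name `precision_recall` and the sequences are hypothetical.

```python
from typing import Tuple

NO_CALL = "N"  # position the imputation left unresolved


def precision_recall(reference: str, truth: str, imputed: str) -> Tuple[float, float]:
    """Score imputed variant calls against the truth, position by position.

    Assumed definitions (illustrative only):
      TP: imputed base differs from the reference and equals the truth
      FP: imputed base differs from the reference but not equal to the truth
      FN: truth differs from the reference and was not correctly imputed
    """
    tp = fp = fn = 0
    for ref, tru, imp in zip(reference, truth, imputed):
        true_variant = tru != ref
        called_variant = imp not in (NO_CALL, ref)
        if called_variant and imp == tru:
            tp += 1
        else:
            if called_variant:
                fp += 1
            if true_variant:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


reference = "ACGTACGT"
truth = "ATGTACCA"    # three true variants (positions 1, 6 and 7)
imputed = "ATGTACNN"  # one variant recovered, two left as N
precision, recall = precision_recall(reference, truth, imputed)
print(precision, recall)
```

In this toy case every call that was made is correct, so precision is 1.0, while two of the three true variants were left as "N", so recall is one third. That is the pattern the reviewer describes.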

GigaScience
Mar 14, 2022

This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab078), which carries out open, named peer-review.

These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer 1: Siyang Liu

The authors have introduced an imputation pipeline that integrates the software tools Minimac3, Minimac4 and PANGOLIN to impute the variants in the missing regions of SARS-CoV-2 sequencing data. The accuracy of the imputation for genotyping assay kits is around 0.9. The idea is interesting and may be helpful in a few limited scenarios. However, given the high mutation rate of SARS-CoV-2, and given that most studies can generate a high-quality (reference-based) SARS-CoV-2 genome assembly, I don't think the method will be widely used in SARS-CoV-2 studies. In addition, it somewhat lacks genuine creativity in terms of the mathematics behind the method. I think the authors' study may be more suitable for a journal like Bioinformatics.

ScreenIT
Apr 16, 2021
SciScore for 10.1101/2021.04.13.439668:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
Institutional Review Board Statement not detected.
Randomization not detected.
Blinding not detected.
Power Analysis not detected.
Sex as a biological variable not detected.
Table 2: Resources
No key resources detected.
Results from OddPub: Thank you for sharing your code and data.
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.
About SciScore
SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.
Version published to 10.1101/2021.04.13.439668 on bioRxiv
Apr 13, 2021
