From partial to whole genome imputation of SARS-CoV-2 for epidemiological surveillance

Abstract

Background

The current SARS-CoV-2 pandemic has emphasized the utility of viral whole genome sequencing in the surveillance and control of the pathogen. An unprecedented ongoing global initiative is producing hundreds of thousands of sequences worldwide. However, the complex circumstances in which viruses are sequenced, along with the demand for urgent results, cause a high rate of incomplete, and therefore useless, sequences. Because viral sequences evolve in the context of a complex phylogeny, different positions along the genome are in linkage disequilibrium. An imputation method would therefore be able to predict missing positions from the available sequencing data.

Results

We developed impuSARS, an application that includes Minimac, the most widely used strategy for genomic data imputation. Taking advantage of the enormous number of SARS-CoV-2 whole genome sequences available, we built a reference panel containing 239,301 sequences. impuSARS was tested under a wide range of conditions (missing continuous fragments, amplicons, or sparse individual positions) and showed great fidelity when reconstructing the original sequences. It can also impute whole genomes from commercial kits covering less than 20% of the genome, or from the Spike protein alone, with a precision of 0.96. It also recovers the lineage with 100% precision for almost all lineages, even in very poorly covered genomes (< 20%).
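
As a rough illustration of what "covering less than 20% of the genome" means, the sketch below computes the fraction of called positions in a sequence. It assumes the sequence is already aligned to the reference and that missing positions are marked with N. The function name `genome_coverage` is hypothetical and is not part of impuSARS.

```python
def genome_coverage(sequence: str) -> float:
    """Return the fraction of positions that carry a called base.

    Assumption (not taken from impuSARS): any position that is not
    'N' or 'n' counts as called; everything else counts as missing.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    called = sum(1 for base in sequence if base not in "Nn")
    return called / len(sequence)


if __name__ == "__main__":
    # Toy 10-base sequence with 4 called positions and 6 missing ones.
    toy = "ACGTNNNNNN"
    print(f"coverage = {genome_coverage(toy):.2f}")
```

A real SARS-CoV-2 consensus is roughly 30 kb long, so a sequence would fall into the "very poorly covered" category described above when this value drops below 0.2.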

Conclusions

Imputation can improve the pace of SARS-CoV-2 sequence production by recovering many incomplete or low-quality sequences that would otherwise be discarded. impuSARS can be incorporated into any primary data processing pipeline for SARS-CoV-2 whole genome sequencing.

Article activity feed

  1. current

    This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab078), which carries out open, named peer-review.

    These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 2: Stephen Nayfach

    In their manuscript, Ortuno et al. develop a procedure for imputing missing genotypes of SARS-CoV-2. Missing genotypes can arise from fragmented whole genome assemblies, targeted sequencing (e.g. spike protein), or incomplete genotype panels. I really like this idea and thought the paper was conducted quite carefully. I was impressed by the high level of precision across all experiments. I have a few minor comments, questions, and suggestions below:

    Major comments:
    • My understanding is that only SNPs are imputed by the program. Is this correct? If so, can the authors comment on the frequency of other types of variants in the SARS-CoV-2 genome? How common are small indels, large indels, or rearrangements?
    • Can the authors include code for building their reference panel? This would enable the same pipeline to be applied to updated SARS-CoV-2 references or to other kinds of viruses entirely. For example, metagenomic DNA sequencing often yields partial viral genomes, and it would be great to use this same pipeline to impute these genomes (where sufficient references exist).
    • I noticed that several of the PANGOLIN lineages seem especially hard to impute. Can the authors comment on why this might be the case?
    • Regarding the PANGOLIN lineages, how do these correspond to specific variants of interest (e.g. the delta variant)? Is this information provided to users? A visual could really help here, showing the phylogenetic relationships between PANGOLIN lineages and how they relate to variants of interest.
    • The authors indicate that missing regions of partial genome assemblies must be indicated by Ns. This seems like an artificial constraint that may be a pain point for users. Can the authors modify their program to detect missing regions from FASTA files and automatically fill these regions with Ns prior to imputation?

    Minor comments:
    • For the installation options, please provide an alternative to Docker. Would it be feasible to add an installation option using conda?
    • In their methods, could the authors clearly define true positives, true negatives, false positives, and false negatives in the context of their validation experiments? Related to this point, I noticed that the precision is consistently high in the validation experiments, but recall can be quite low. I assume this means that the program will not impute a genotype where there is insufficient evidence, leaving it as an "N". In this case, users should have high confidence in all imputed genotypes. Is this correct?
    • All the figures in the manuscript were of low resolution and difficult to read.
    • The authors should use a consistent tense (present or past) throughout the manuscript. In some places future tense was even used: "Once we have validated the robustness of our imputation against different missing regions scenarios, the validation will focus on the imputation of variants"
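
The reviewer's reading of high precision with low recall can be made concrete with a toy calculation. The sketch below uses one possible convention, which is an assumption and not necessarily the definition used in the manuscript: a masked position is a true positive if it is imputed and matches the original base, a false positive if it is imputed but wrong, and a false negative if it is left as N. The function name `precision_recall` is hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall(
    truth: str, imputed: str, masked: Sequence[int]
) -> Tuple[float, float]:
    """Score imputed bases at the masked positions only.

    Assumed convention (not taken from the manuscript):
      precision = correct calls / positions that received a call
      recall    = correct calls / all masked positions
    A position left as 'N' received no call.
    """
    if len(truth) != len(imputed):
        raise ValueError("truth and imputed must have the same length")
    called = [i for i in masked if imputed[i].upper() != "N"]
    correct = sum(1 for i in called if imputed[i].upper() == truth[i].upper())
    precision = correct / len(called) if called else float("nan")
    recall = correct / len(masked) if masked else float("nan")
    return precision, recall


if __name__ == "__main__":
    truth = "ACGTACGT"
    # Positions 4-7 were masked; the imputer filled in 4 and 7 and left 5 and 6 as N.
    imputed = "ACGTANNT"
    p, r = precision_recall(truth, imputed, masked=[4, 5, 6, 7])
    print(f"precision = {p:.2f}, recall = {r:.2f}")
```

Under this convention, an imputer that declines to call uncertain positions keeps precision high while recall falls, which is the behaviour the reviewer describes.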

  2. the

    Reviewer 1: Siyang Liu

    The authors have introduced an imputation pipeline that integrates the software tools Minimac3, Minimac4 and PANGOLIN to impute variants in the missing regions of SARS-CoV-2 sequencing data. The accuracy of the imputation for genotyping assay kits is around 0.9. The idea is interesting and may be helpful in a few limited scenarios. However, given the high mutation rate of SARS-CoV-2, and given that most studies can generate a high-quality (reference-based) SARS-CoV-2 genome assembly, I don't think the method will be widely used in SARS-CoV-2 studies. In addition, it lacks genuine creativity in the mathematics behind the method. I think the authors' study may be more suitable for a journal like Bioinformatics.

  3. SciScore for 10.1101/2021.04.13.439668:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Institutional Review Board Statement: not detected.
    Randomization: not detected.
    Blinding: not detected.
    Power Analysis: not detected.
    Sex as a biological variable: not detected.

    Table 2: Resources

    No key resources detected.


    Results from OddPub: Thank you for sharing your code and data.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding.