Variance in Variants: Propagating Genome Sequence Uncertainty into Phylogenetic Lineage Assignment

David Champredon
Devan Becker
Connor Chato
Gopi Gugan
Art Poon

This article has been Reviewed by the following groups

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

Evaluated articles (ScreenIT)

Abstract

Genetic sequencing is subject to many different types of errors, but most analyses treat the resultant sequences as if they are known without error. Next generation sequencing methods rely on significantly larger numbers of reads than previous sequencing methods in exchange for a loss of accuracy in each individual read. Still, the coverage of such machines is imperfect and leaves uncertainty in many of the base calls. On top of this machine-level uncertainty, there is uncertainty induced by human error, such as errors in data entry or incorrect parameter settings. In this work, we demonstrate that the uncertainty in sequencing techniques will affect downstream analysis and propose a straightforward method to propagate the uncertainty.

Our method uses a probabilistic matrix representation of individual sequences which incorporates base quality scores as a measure of uncertainty that naturally lead to resampling and replication as a framework for uncertainty propagation. With the matrix representation, resampling possible base calls according to quality scores provides a bootstrap- or prior distribution-like first step towards genetic analysis. Analyses based on these re-sampled sequences will include a more complete evaluation of the error involved in such analyses.

We demonstrate our resampling method on SARS-CoV-2 data. The resampling procedures adds a linear computational cost to the analyses, but the large impact on the variance in downstream estimates makes it clear that ignoring this uncertainty may lead to overly confident conclusions. We show that SARS-CoV-2 lineage designations via Pangolin are much less certain than the bootstrap support reported by Pangolin would imply and the clock rate estimates for SARS-CoV-2 are much more variable than reported.

ScreenIT
Dec 6, 2021
SciScore for 10.1101/2021.11.30.470642: (What is this?)
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
Software and Algorithms
Sentences Resources
Deleted bases (‘D’ operations) are not assigned Phred scores, so we assume them to have 0 error probability.
Phred
suggested: (Phred, RRID:SCR_001017)
Results from OddPub: Thank you for sharing your code.
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues …
SciScore for 10.1101/2021.11.30.470642: (What is this?)
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
Software and Algorithms
Sentences Resources
Deleted bases (‘D’ operations) are not assigned Phred scores, so we assume them to have 0 error probability.
Phred
suggested: (Phred, RRID:SCR_001017)
Results from OddPub: Thank you for sharing your code.
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
No funding statement was detected.
No protocol registration statement was detected.
Results from scite Reference Check: We found no unreliable references.
About SciScore
SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.
Read the original source
Version published to 10.1101/2021.11.30.470642 on bioRxiv
Dec 2, 2021

Software and Algorithms
Sentences	Resources
Deleted bases (‘D’ operations) are not assigned Phred scores, so we assume them to have 0 error probability.	Phred suggested: (Phred, RRID:SCR_001017)

Software and Algorithms
Sentences	Resources
Deleted bases (‘D’ operations) are not assigned Phred scores, so we assume them to have 0 error probability.	Phred suggested: (Phred, RRID:SCR_001017)

Genetic estimates of relatedness: Established practices and new opportunities through low coverage whole genome sequencing

This article has 8 authors:
1. Annika Freudiger
2. Natalie Kestel
3. Vladimir Jovanovic
4. Mariana Madruga de Brito
5. Angelina Ruiz-Lambides
6. Katja Nowick
7. Anja Widdig
8. Harald Ringbauer
This article has no evaluationsLatest version Jan 23, 2026
Decoding Complex Genotype-Phenotype Interactions by Discretizing the Genome

This article has 6 authors:
1. Jędrzej Kubica
2. Hetvi Jethwani
3. Krzysztof H. Banecki
4. Mauricio Moldes
5. Dariusz Plewczynski
6. Ben Busby
This article has no evaluationsLatest version Dec 17, 2025
An Advanced Entropy Approach for Minimizing False Discoveries in Imputation-Based Association Analyses

This article has 4 authors:
1. Zhihui Zhang
2. Dakai Zhu
3. Xiangjun Xiao
4. Christopher I. Amos
This article has no evaluationsLatest version Dec 17, 2025

This article has been Reviewed by the following groups

Discuss this preprint

Listed in

Abstract

Article activity feed

Related articles

Genetic estimates of relatedness: Established practices and new opportunities through low coverage whole genome sequencing

Decoding Complex Genotype-Phenotype Interactions by Discretizing the Genome

An Advanced Entropy Approach for Minimizing False Discoveries in Imputation-Based Association Analyses