Theoretical estimates on the expected number of mutations for reconstructing clonal lineage trees
Abstract
Phylogenetics, like many subdisciplines of computational biology, faces a growing challenge in dealing with the increasingly large and complicated data sets enabled by ever-improving sequencing technologies. The issue is particularly acute for studies of somatic evolution, such as those of single-cell populations in cancers, where vast single-cell data sets may now identify hundreds of thousands of genetically distinct cells, distinguished by mutations on a similar scale. At the same time, the complexity of the biology of somatic evolution has led to complex phylogeny methods that struggle to scale to even modest data sizes. In this paper, we explore the theoretical and empirical basis for one strategy for managing these large data sets: subsampling mutations to solve the computationally challenging phylogeny problem, followed by faster methods for placing mutations on a putatively known guide tree. We focus specifically on the fundamental question of how many mutations suffice to recover the true phylogenetic tree at some level of resolution with high probability. We theoretically analyze variants of several common models that underlie popular tools for building clonal lineage trees. We further evaluate the robustness of these theoretical bounds through simulations of these models, extensions of them, and real biological data sets. The results suggest that modest numbers of mutations suffice to reconstruct clonal trees for typical numbers of clones, supporting subsampling as a general strategy for managing the challenges of ever-growing data sets.