larch: mapping the parsimony-optimal landscape of trees for directed exploration
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Phylogenetic inference algorithms for large data sets typically return a single tree. However, there are often many optimal trees, especially when sequence data is closely related. We develop a compact representation of large collections of maximally parsimonious histories —trees with mutations mapped onto tree edges. Our C++ implementation, larch , leverages this representation for a highly parallel search algorithm. The storage component uses our history DAG structure to compactly represent large families of optimal trees. The search algorithm integrates this storage with matOptimize for rapid tree optimization; the DAG structure allows us to accept thousands of conflicting tree rearrangements in parallel. The integration enables a new type of tree search: one that systematically maps out the collection of good trees, enabling moves that are directed away from the current set of optimal trees to cross valleys and increase the diversity of the set of optimal trees. It is able to identify more parsimonious trees than are found by other methods. We find diverse optimality landscapes for viral datasets, including many distinct plateaux. We also find that our implementation produces similar results whether using a variety of single starting trees or an ensemble of starting trees, indicating effective global optimization.