Pythia 2.0: New Data, New Prediction Model, New Features
Abstract
Maximum Likelihood (ML) based phylogenetic inference is time- and resource-intensive, especially when initiating multiple independent inferences from distinct comprehensive tree topologies. Performing multiple independent inferences is often required to sufficiently explore the vast search space of possible unrooted binary tree topologies. Yet these independent inferences do not necessarily converge to a single phylogeny, or even to topologically highly similar trees. For easy-to-analyze multiple sequence alignments (MSAs), one is likely to obtain a single, conclusive phylogeny, whereas difficult-to-analyze MSAs yield topologically highly distinct, yet statistically indistinguishable, tree topologies. In 2022, we proposed a compute-intensive approach to quantify the inherent difficulty of a phylogenetic analysis for a given MSA, and also trained a machine-learning-based prediction model called Pythia to substantially reduce the computational cost of determining this difficulty. Pythia predicts the difficulty of a given MSA on a scale from 0 (easy) to 1 (difficult) with high accuracy, while being substantially faster than even a single ML tree inference. Here, we present all improvements to Pythia that we have introduced since our initial publication in 2022. We trained a new prediction model using approximately three times more MSAs and a new type of machine learning model. We improved the runtime of two feature computations, and we also introduced two additional prediction features. Our latest version, Pythia 2.0, is slightly more accurate than our initial version and approximately twice as fast. Finally, we present and make available the novel, easy-to-use command line tool PyDLG, which computes the ground-truth difficulty for a given MSA. This ground-truth difficulty can be used, for instance, as a prediction target for training a new Pythia model.