Sparse learning for scalable phylogenetic network inference
Abstract
Phylogenetic networks account for signals of hybridization, reticulation, and gene flow, and allow species evolution to be analysed from a more complex perspective than bifurcating phylogenetic trees. However, even the fastest algorithms for inferring these networks face scalability challenges as the number of species increases. This limitation arises because these methods take as input a large concordance factors (CFs) table, in which each row summarizes the observed CFs of one four-species combination, with one row for every possible combination. The size of this table scales with the fourth power of the number of species, creating computational bottlenecks and highlighting the need for more efficient solutions. Sparse learning has been shown to reduce the dimensionality of large-scale datasets while producing results of comparable quality to those obtained using the full dataset. In this study, we adapted two sparse machine learning models, Elastic Net and Ensemble Learning + Elastic Net, to guide the subsampling of an optimal number of rows from the CFs table required to accurately predict the overall phylogenetic network pseudolikelihood. Both methods account for the inherent correlation among rows, which arises because rows overlap in species information. We call this method Qsin. In two simulated datasets, Qsin reduced the dataset by approximately half without compromising accuracy. For the Xiphophorus fishes dataset, which contains 10,626 rows in the CFs table, we recovered the same topology as with the full CFs table using only 763 rows. Using these subsamples also reduced running times by up to 60% with no loss of accuracy. These gains are expected to persist as species numbers increase. Qsin contributes to ongoing efforts to make phylogenetic network inference more efficient and opens the door to analyses of more complex evolutionary histories. The source code for Qsin is freely available at: https://github.com/ulises-rosas/qsin.
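The scaling argument in the abstract can be made concrete with a short calculation. The sketch below is not part of Qsin and does not use its API; the function names `n_quartets` and `quartet_rows` are hypothetical and chosen only for illustration. It assumes the full CFs table has exactly one row per unordered four-species combination, so that a table for n species has C(n, 4) rows. Under that assumption, the 10,626 rows reported for the Xiphophorus dataset equal C(24, 4).

```python
from itertools import combinations
from math import comb


def n_quartets(n_species: int) -> int:
    """Rows in a full CFs table, assuming one row per four-species subset."""
    return comb(n_species, 4)


def quartet_rows(species):
    """Enumerate the four-species combinations that would index the rows."""
    return list(combinations(sorted(species), 4))


# Row counts grow roughly as n**4 / 24 (illustrative species numbers).
for n in (10, 24, 50, 100):
    print(n, n_quartets(n))

# Share of rows retained in the Xiphophorus example from the abstract.
kept, total = 763, 10_626
print(f"{kept / total:.1%} of rows retained")
```

Doubling the number of species multiplies the row count by roughly sixteen, which is why selecting a subset of rows matters more as datasets grow.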