STELAR-X: Scaling Coalescent-Based Species Tree Inference to 100,000 Species and Beyond
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Summary methods reconstruct species trees from collections of gene trees by accounting for gene tree discordance, and offer a statistically consistent framework for phylogenomic inference under the multispecies coalescent model. While existing triplet- and quartet-based approaches such as ASTRAL and STELAR have provable statistical consistency, their running time and memory usage restrict their applicability to ultra-large datasets. We introduce STELAR-X, a highly scalable triplet-based phylogenetic inference algorithm that achieves an asymptotically optimal memory complexity of O ( nk ) for n species and k gene trees–essentially matching the input size and allowing analyses to remain feasible as long as the input trees fit in memory–while also substantially reducing running time. STELAR-X achieves this by a comprehensive re-engineering of the underlying data structures and algorithms. We introduce a novel, compact integer tuple-based encoding of tree bipartitions and efficient procedures for rapid pre-computation of bipartition weights. We further leverage GPU parallelism for fast pre-computation of necessary weights. This improved and redesigned computational framework underpins a dynamic programming algorithm with substantially reduced computational overhead. Extensive experiments demonstrate that STELAR-X achieves unprecedented scalability. On simulated datasets with 10,000 taxa and 1,000 gene trees, STELAR-X runs 712× faster than ASTRAL-MP (the most scalable variant of ASTRAL) while using 7.5× less CPU memory. Most significantly, STELAR-X analyzed a dataset of 100,000 taxa and 1,000 genes in 8.5 hours using 86 GB RAM, and a 100,000-gene dataset with 1000 taxa in just 4 minutes using 106 GB RAM – scales that were previously intractable for statistically consistent summary methods. STELAR-X is publicly available at https://github.com/aaniksahaa/STELAR-X .