PsiPartition: Improved Site Partitioning for Genomic Data by Parameterized Sorting Indices and Bayesian Optimization

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Phylogenetics has been widely used in molecular biology to infer the evolutionary relationships among species. With the rapid development of sequencing technology, genomic data with thousands of sites becomes increasingly common in phylogenetic analysis, while heterogeneity among sites arises as one of the major challenges. A single homogeneous model is not sufficient to describe the evolution of all sites and partitioned models are often employed to model the evolution of heterogeneous sites by partitioning them into distinct groups and utilizing distinct evolutionary models for each group. It is crucial to determine the best partitioning, which greatly affects the reconstruction correctness of phylogeny. However, the best partitioning is usually intractable to obtain in practice. Traditional partitioning methods rely on heuristic algorithms or greedy search to determine the best ones in their solution space, are usually time-consuming, and with no guarantee of optimality. In this study, we propose a novel partitioning approach, termed PsiPartition, based on the parameterized sorting indices of sites and Bayesian optimization. We apply our method to empirical data sets and it performs significantly better compared to existing methods, in terms of Bayesian information criterion (BIC) and the corrected Akaike information criterion (AICc). We test PsiPartition on the simulated data sets with different site heterogeneity, alignment lengths, and number of loci. It is demonstrated that PsiPartition evidently and stably outperforms other methods in terms of the Robinson-Foulds (RF) distance between the true simulated trees and the reconstructed trees, especially on the data with more site heterogeneity. More importantly, our proposed Bayesian optimization-based method, for the first time, provides a new general framework to efficiently determine the optimal number of partitions. The corresponding reproducible source code and data are available at http://github.com/xu-shi-jie/PsiPartition .

Article activity feed