Impact of Data Error on Phylogenetic Network Inference from Gene Trees Under the Multispecies Network Coalescent
Abstract
Phylogenetic network inference has become an essential tool in evolutionary biology, offering a framework to model complex evolutionary events such as hybridization and horizontal gene transfer. However, a critical but often overlooked challenge is the presence of error in empirical datasets, including sequencing errors, misalignments, and inaccuracies in gene tree estimation. This issue is particularly pressing in the context of phylogenetic networks, which can contain an arbitrary number of parameters and are thus highly susceptible to overfitting. Errors in the input data can lead to artificially inflated network complexity, misrepresenting evolutionary history with non-biological reticulations.
In this study, we systematically examine how different sources of data error influence network inference and show that many widely used methods are vulnerable to these distortions. We find that inaccuracies in gene tree estimation and sequence alignment degrade the reliability of inferred networks. These issues are exacerbated when the number of reticulations that an algorithm can infer exceeds the true number of reticulations in the phylogenetic network. Our analysis underscores the importance of accounting for data error when applying network inference methods. By highlighting the vulnerabilities of different approaches and demonstrating how errors propagate through the inference process, we offer practical recommendations for minimizing the impact of error and optimizing data processing pipelines. Our findings emphasize the necessity of integrating realistic error models into species network inference methods to enhance their reliability and applicability to real-world biological datasets.