Impact of Data Error on Phylogenetic Network Inference from Gene Trees Under the Multispecies Network Coalescent

Mehrdad Tamiji
Nicolae Sapoval
Luay Nakhleh

Abstract

Phylogenetic network inference has become an essential tool in evolutionary biology, offering a framework to model complex evolutionary events such as hybridization and horizontal gene transfer. However, a critical but often overlooked challenge is the presence of error in empirical datasets, including sequencing errors, misalignments, and inaccuracies in gene tree estimation. This issue is particularly pressing in the context of phylogenetic networks, which can contain an arbitrary number of parameters and are thus highly susceptible to overfitting. Errors in the input data can lead to artificially inflated network complexity, misrepresenting evolutionary history with non-biological reticulations.

In this study, we systematically examine how different sources of data error influence network inference and show that many widely used methods are vulnerable to these distortions. We find that inaccuracies in gene tree estimation and sequence alignment degrade the reliability of inferred networks. These issues are exacerbated when the number of reticulations that an algorithm can infer exceeds the true number of reticulations in the phylogenetic network. Our analysis underscores the importance of accounting for data error when applying network inference methods. By highlighting the vulnerabilities of different approaches and demonstrating how errors propagate through the inference process, we offer practical recommendations for minimizing the impact of data error and optimizing data processing pipelines. Our findings emphasize the necessity of integrating realistic error models into species network inference methods to enhance their reliability and applicability to real-world biological datasets.

Version published to 10.1101/2025.05.18.654708 on bioRxiv
May 23, 2025

