Rehabilitating the benefits of gene tree correction in the presence of incomplete lineage sorting
Listed in
- Evaluated articles (Peer Community in Evolutionary Biology)
Abstract
Gene trees play an important role in various areas of phylogenomics. However, their reconstruction often relies on limited-length sequences and may not account for complex evolutionary events, such as gene duplications, losses, or incomplete lineage sorting (ILS), which are not modeled by standard phylogenetic methods. To address these challenges, it is common to first infer gene trees using fast algorithms for conventional models, then refine them through species tree-aware correction methods. Recently, it has been argued that such corrections can lead to overfitting and force gene trees to resemble the species tree, thereby obscuring genuine gene-level variation caused by ILS. In this paper, we challenge and refute this hypothesis, and we demonstrate that, when applied carefully, correction methods can offer significant benefits, even in the presence of ILS.
Article activity feed
Gene trees inferred from limited sequence data are central to many phylogenomic analyses, including orthology and paralogy assignment, reconstruction of gene family evolution, and downstream functional inference. However, short alignments, heterogeneous evolutionary processes, and incomplete lineage sorting (ILS) combine to make gene tree estimation error pervasive. A common practice is therefore to infer gene trees with fast maximum-likelihood (ML) methods and then apply species-tree-aware correction procedures that collapse low-support branches and refine the resulting polytomies using reconciliation criteria.
Yan et al. (2023) challenged this practice in the presence of ILS. Using simulations, they reported that correcting gene trees toward a species tree, with tools such as TreeFix (Wu et al. 2013) and TRACTION (Christensen et al. 2020), frequently increases topological error. They argued that such corrections can overfit gene trees to the species tree and thereby obscure genuine discordance due to ILS. Given the widespread use of correction methods in phylogenomics, this claim has direct methodological consequences.
In the preprint I am recommending here, Lafond and Scornavacca (2025) re-examine this question by analysing the same simulated data used by Yan et al. The study focuses on ecceTERA (Jacox et al. 2016), a parsimony-based reconciliation tool that minimizes duplication–loss (and optionally duplication–transfer–loss) cost while modifying only branches below a user-specified bootstrap threshold. The key question is whether such correction can reduce gene tree topological error relative to uncorrected ML trees, even when ILS is present. The authors reuse the original simulation design of Yan et al., which is based on 11 taxa and on gene trees generated under a coalescent process with varying branch lengths, and which yields multiple datasets covering a grid of ILS intensity, sequence length, and mutation rate. On these datasets, Lafond and Scornavacca apply ecceTERA in two settings: duplication–loss (DL) and duplication–transfer–loss (DTL). In both cases, branches with bootstrap support below a threshold are collapsed, and ecceTERA searches for refinements minimizing the corresponding reconciliation cost.
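To make the collapsing step concrete, here is a minimal Python sketch of how branches with support below a threshold can be contracted into polytomies. It is an illustration only, not ecceTERA's implementation: the `Node` class, `collapse_low_support`, and `to_newick` are hypothetical helpers, the toy tree and its support values are invented, and the subsequent reconciliation-based refinement of the polytomies is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy rooted tree node (hypothetical structure, not an ecceTERA type)."""
    name: Optional[str] = None        # leaf label; None for internal nodes
    support: Optional[float] = None   # bootstrap support of the branch above this node
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def collapse_low_support(node: Node, threshold: float, is_root: bool = True) -> List[Node]:
    """Return the node(s) that replace `node` once weak branches are contracted."""
    if node.is_leaf():
        return [node]
    new_children: List[Node] = []
    for child in node.children:
        new_children.extend(collapse_low_support(child, threshold, is_root=False))
    weak = node.support is not None and node.support < threshold
    if weak and not is_root:
        # Contract the branch: the children are attached to the parent instead,
        # which creates (or enlarges) a polytomy there.
        return new_children
    return [Node(name=node.name, support=node.support, children=new_children)]


def to_newick(node: Node) -> str:
    """Serialize the toy tree, writing support values as internal labels."""
    if node.is_leaf():
        return node.name
    inner = ",".join(to_newick(c) for c in node.children)
    label = "" if node.support is None else str(int(node.support))
    return f"({inner}){label}"


# Invented example: the (D,E) branch has support 30, below the 50% threshold.
tree = Node(children=[
    Node(support=90, children=[Node("A"), Node("B")]),
    Node(support=80, children=[
        Node("C"),
        Node(support=30, children=[Node("D"), Node("E")]),
    ]),
])

collapsed = collapse_low_support(tree, threshold=50)[0]
print(to_newick(collapsed))  # ((A,B)90,(C,D,E)80)
```

In this toy case only the weakly supported (D,E) branch is contracted, while the well-supported branches are left untouched; a correction tool would then choose among the possible resolutions of the (C,D,E) polytomy.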
The principal result is that ecceTERA-based correction, when restricted to branches below a 50% bootstrap support threshold, generally reduces or maintains topological error compared to uncorrected gene trees. The study highlights the central role of branch support thresholds. Branches reflecting true ILS-driven discordance are often well-supported; collapsing only very weakly supported branches reduces the risk of erasing genuine signal while addressing regions likely to represent reconstruction artefacts. The choice of a 50% threshold aligns with the majority-rule criterion previously advocated in Bayesian consensus tree reporting (Holder et al. 2008), providing both empirical and theoretical justification.
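For readers unfamiliar with how topological error is typically quantified, the sketch below computes the Robinson–Foulds (RF) distance between two trees by comparing their bipartitions. It assumes RF distance as the error measure, which is a common choice but is not stated in this recommendation. The nested-tuple tree encoding, the function names, and the toy trees are hypothetical and do not come from the study.

```python
from typing import FrozenSet, Set, Union

# A tree is either a leaf name or a tuple of subtrees (hypothetical encoding).
Tree = Union[str, tuple]


def leaf_set(tree: Tree) -> FrozenSet[str]:
    """Collect all leaf names below `tree`."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree: Tree) -> Set[FrozenSet[str]]:
    """Non-trivial splits of the tree, viewed as unrooted."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)  # used to pick one canonical side per split
    splits: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        if len(below) > 1 and len(other) > 1:
            # Store the side that does not contain the reference leaf, so the
            # same split is always represented by the same set.
            splits.add(other if reference in below else below)
        return below

    visit(tree)
    return splits


def rf_distance(t1: Tree, t2: Tree) -> int:
    """Number of splits found in exactly one of the two trees."""
    return len(bipartitions(t1) ^ bipartitions(t2))


# Invented toy trees on five taxa.
true_tree = (("A", "B"), ("C", ("D", "E")))
ml_tree = (("A", "C"), ("B", ("D", "E")))        # one wrong split
collapsed_tree = ("A", "B", "C", ("D", "E"))     # weak branch contracted

print(rf_distance(true_tree, ml_tree))         # 2
print(rf_distance(true_tree, collapsed_tree))  # 1
```

In this toy example, contracting the incorrect branch removes a false split without recovering the true one. Whether error then falls further depends on how the polytomy is resolved, which is the step where a conservative support threshold matters.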
Following feedback from peer review, the authors explicitly qualified the scope of their conclusions. Likelihood and Bayesian methods that explicitly model ILS and other processes, such as StarBEAST2 (Ogilvie et al. 2017), BPP (Flouri et al. 2023), PHYLDOG (Boussau et al. 2013), or AleRax (Morel et al. 2024), typically provide higher accuracy and remain preferable whenever they are computationally feasible. The present work instead addresses the practical question of whether fast correction procedures can still be safely used when only ML trees are available at genome scale. In that domain, the results support the continued use of species-tree-aware correction, under conservative thresholds and with appropriate reconciliation models.
The study's limitations are clearly stated. The simulated data contain ILS but no true duplications, losses, or transfers, even though these events are the focus of reconciliation-based models. Additional simulations that jointly model ILS and gene-level events would be needed to fully characterize performance in more complex evolutionary scenarios. Moreover, the work does not aim to re-benchmark correction methods against fully probabilistic co-estimation approaches; it focuses on relative improvements over ML-only pipelines.
Despite these restrictions, the article provides a clear and quantitatively supported answer to a practically relevant question. It shows that gene tree correction is not intrinsically detrimental under ILS and that previous negative conclusions arose in part from the use of a high bootstrap threshold and particular correction tools. The work therefore refines current understanding of when and how species-tree-aware correction should be applied in phylogenomics workflows.
References
Boussau B, Szöllősi GJ, Duret L, Gouy M, Tannier E, Daubin V. (2013). Genome-scale coestimation of species and gene trees. Genome Research 23:323–330. https://doi.org/10.1101/gr.141978.112
Christensen S, Molloy EK, Vachaspati P, Yammanuru A, Warnow T. (2020). Non-parametric correction of estimated gene trees using TRACTION. Algorithms for Molecular Biology 15:1. https://doi.org/10.1186/s13015-019-0161-8
Flouri T, Jiao X, Huang J, Rannala B, Yang Z. (2023). Efficient Bayesian inference under the multispecies coalescent with migration. Proc. Natl. Acad. Sci. U.S.A. 120(44):e2310708120. https://doi.org/10.1073/pnas.2310708120
Holder MT, Sukumaran J, Lewis PO. (2008). A justification for reporting the majority-rule consensus tree in Bayesian phylogenetics. Systematic Biology 57(5):814–821. https://doi.org/10.1080/10635150802422308
Jacox E, Chauve C, Szöllősi GJ, Ponty Y, Scornavacca C. (2016). ecceTERA: comprehensive gene tree–species tree reconciliation using parsimony. Bioinformatics 32(13):2056–2058. https://doi.org/10.1093/bioinformatics/btw105
Lafond M, Scornavacca C. (2025). Rehabilitating the benefits of gene tree correction in the presence of incomplete lineage sorting. bioRxiv, ver.3 peer-reviewed and recommended by PCI Evolutionary Biology. https://doi.org/10.1101/2025.07.09.663893
Morel B, Williams TA, Stamatakis A, Szöllősi GJ. (2024). AleRax: a tool for gene and species tree co-estimation and reconciliation under a probabilistic model of gene duplication, transfer, and loss. Bioinformatics 40:btae162. https://doi.org/10.1093/bioinformatics/btae162
Ogilvie HA, Bouckaert RR, Drummond AJ. (2017). StarBEAST2 brings faster species tree inference and accurate estimates of substitution rates. Molecular Biology and Evolution 34(8):2101–2114. https://doi.org/10.1093/molbev/msx126
Wu Y-C, Rasmussen MD, Bansal MS, Kellis M. (2013). TreeFix: statistically informed gene tree error correction using species trees. Systematic Biology 62:110–120. https://doi.org/10.1093/sysbio/sys076
Yan Z, Ogilvie HA, Nakhleh L. (2023). “Correcting” gene trees to be more like species trees frequently increases topological error. Genome Biology and Evolution 15(6):evad094. https://doi.org/10.1093/gbe/evad094