A Systematic Investigation of Overfitting in Maximum Likelihood Phylogenetic Inference

Abstract

Maximum Likelihood (ML) tree inference reconstructs phylogenies from Multiple Sequence Alignments (MSAs). Since MSAs are inherently noisy, ML tools may overfit, meaning that the inferred topology models noise alongside the true phylogenetic signal. We statistically assess overfitting in ML tools using log-likelihood scores on unseen sites as the primary metric. We deploy a 10-fold Monte Carlo cross-validation approach, partitioning the sites of 9,062 empirical and 6,342 simulated MSAs into training (80%) and testing (20%) sets. We conduct inferences on the training MSAs using RAxML-NG, IQ-TREE, FastTree, and RAxML-NG ES (a recently released Early Stopping version). We store all intermediate improved topologies and subsequently evaluate them on the testing sites. We fit a linear regression to the final segments of the resulting testing curves and statistically evaluate the line slopes via the sign test. Our results indicate that ML tools do not overfit. For RAxML-NG (standard and ES) and IQ-TREE, the overall trend is non-significant for 86-98% of the empirical MSAs, while the tools exhibit overfitting for less than 1%. FastTree shows more positive trends, suggesting premature termination, especially on protein MSAs. Topological accuracy curves on simulated data further confirm that the tools do not systematically diverge from the true topology. To complement these findings, we test whether strategies to mitigate overfitting can benefit ML inferences. To this end, we also benchmark a site-based holdout validation (HV) version of RAxML-NG. The results confirm that overfitting is absent and also indicate that excluding MSA sites substantially reduces phylogenetic signal as well as accuracy.
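The evaluation pipeline described in the abstract (random 80/20 site splits, a regression slope on the tail of each testing curve, and a sign test on the slopes) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `tail_fraction` parameter that defines the "final segment", and the choice to pool slopes into a single sign test are all assumptions made for illustration; the abstract does not specify them. Under this reading, a negative slope of the testing log-likelihood would point to overfitting and a positive slope to premature termination.

```python
import math
import random


def monte_carlo_splits(n_sites, repeats=10, test_fraction=0.2, seed=0):
    """Draw `repeats` independent random train/test partitions of MSA columns.

    Hypothetical helper: returns a list of (train_sites, test_sites) index lists.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        sites = list(range(n_sites))
        rng.shuffle(sites)
        n_test = round(n_sites * test_fraction)
        splits.append((sorted(sites[n_test:]), sorted(sites[:n_test])))
    return splits


def final_segment_slope(test_scores, tail_fraction=0.5):
    """Least-squares slope over the last part of a testing curve.

    `test_scores` holds the testing log-likelihood of each stored intermediate
    topology, in order. `tail_fraction` is an assumed definition of the
    "final segment".
    """
    if len(test_scores) < 2:
        raise ValueError("need at least two points to fit a line")
    k = max(2, math.ceil(len(test_scores) * tail_fraction))
    ys = test_scores[-k:]
    xs = range(len(ys))
    mean_x = sum(xs) / len(ys)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def sign_test(slopes):
    """Two-sided sign test p-value for a median slope of zero.

    Zero slopes are dropped, as is conventional for the sign test.
    """
    pos = sum(s > 0 for s in slopes)
    neg = sum(s < 0 for s in slopes)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

In use, one would call `monte_carlo_splits` once per MSA, run the inference tool on each training partition, score the stored topologies on the matching testing sites, and pass the resulting slopes to `sign_test`. The inference and scoring steps are left out here because they depend on the external tools.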
