A Systematic Investigation of Overfitting in Maximum Likelihood Phylogenetic Inference

Anastasis Togkousidis
Olivier Gascuel
Alexandros Stamatakis

Abstract

Maximum Likelihood (ML) tree inference reconstructs phylogenies from Multiple Sequence Alignments (MSAs). Since MSAs are inherently noisy, ML tools may overfit, whereby the inferred topology incorrectly models noise alongside the true phylogenetic signal. We statistically assess overfitting in ML tools using log-likelihood scores on unseen sites as the primary metric. We employ a 10-fold Monte Carlo cross-validation approach, partitioning the sites of 9,062 empirical and 6,342 simulated MSAs into training (80%) and testing (20%) sets. We conduct inferences using RAxML-NG, IQ-TREE, FastTree, and RAxML-NG ES (a recently released Early Stopping version) on the training MSAs. We store all intermediate improved topologies and subsequently evaluate them on the testing sites. We fit a linear regression to the final segments of the resulting testing curves (using four distinct segment-window configurations) and statistically evaluate the slopes via the sign test. Our results indicate that ML tools do not overfit. For RAxML-NG (standard and ES) and IQ-TREE, the overall trend is non-significant for 86-98% of the empirical MSAs (across all four segment-window configurations), while the tools exhibit overfitting for less than 1%. FastTree shows more positive trends, suggesting premature termination, especially on protein MSAs. Topological accuracy curves on simulated data further confirm that the tools do not systematically diverge from the true topology. To complement these findings, we test whether strategies to mitigate overfitting can benefit ML inferences. To this end, we also benchmark a site-based holdout validation (HV) version of RAxML-NG. The results confirm that overfitting is absent and also indicate that excluding MSA sites substantially reduces phylogenetic signal as well as accuracy.
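The evaluation procedure described in the abstract (random site split, regression on the tail of the testing curve, sign test on the slopes) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the `window` parameter, and the choice to apply the sign test to a list of per-fold slopes are assumptions made for illustration, and the abstract does not specify the four segment-window configurations. The log-likelihood values on the testing sites are assumed to have been computed elsewhere by an ML tool.

```python
import random
from math import comb


def split_sites(n_sites, test_fraction=0.2, rng=None):
    """Randomly partition alignment column indices into training and testing sets.

    Hypothetical helper: one call corresponds to one Monte Carlo fold.
    """
    rng = rng or random.Random()
    sites = list(range(n_sites))
    rng.shuffle(sites)
    n_test = round(n_sites * test_fraction)
    return sorted(sites[n_test:]), sorted(sites[:n_test])


def final_segment_slope(test_scores, window):
    """Least-squares slope over the last `window` points of a testing curve.

    `test_scores` holds the testing-site log-likelihood of each intermediate
    topology, in the order the topologies were found. `window` must be >= 2.
    """
    ys = test_scores[-window:]
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    return sxy / sxx


def sign_test_p(slopes):
    """Two-sided sign test p-value for the null 'positive and negative equally likely'."""
    pos = sum(s > 0 for s in slopes)
    neg = sum(s < 0 for s in slopes)
    n = pos + neg  # zero slopes (ties) are dropped
    if n == 0:
        return 1.0
    k = min(pos, neg)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Under this reading, a significantly negative trend in the final segment would indicate overfitting (testing log-likelihood decreasing while the search continues), whereas a significantly positive trend would be consistent with premature termination.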

Version published to 10.1101/2025.10.07.680876 on bioRxiv
Oct 7, 2025
