Biased estimates of phylogenetic branch lengths resulting from the discretised Gamma model of site rate heterogeneity


Abstract

A standard method in phylogenetic reconstruction for representing variation in substitution rates between sites in the genome is the discrete Gamma model (DGM). Relative rates are assumed to follow a discretised Gamma distribution in which every rate class has the same probability of containing a given site. Here, we identify a serious bias in the branch lengths of reconstructed phylogenies when the DGM is used: lengths are usually, and often substantially, overestimated, and the magnitude of this effect varies (usually increasing) with the number of sequences in the alignment. We show that the alternative “FreeRate” model, which assumes no parametric distribution and allows the class probabilities to vary, is not subject to the issue. We further establish that the cause of the behaviour is the equal class probabilities, not the discretisation itself. We also explore the mathematical reasons for the phenomenon, showing that the effect is determined by the bias in the mean relative rate of evolution amongst the observed sites, and that branch lengths will be overestimated where there is a long tail of fast-evolving sites in the true rate distribution, which is the usual situation in real datasets. We recommend that the DGM be retired from general use. While FreeRate is an immediately available replacement, it is highly parameterised and known to be difficult to fit, and thus there is scope for innovation in rate heterogeneity models.
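
The equal-probability discretisation described in the abstract can be illustrated with a minimal sketch. It is not the article's code: the function name `discrete_gamma_rates`, its parameters, and the values of `alpha` and `k` are hypothetical choices made for illustration. The sketch also approximates the class rates by Monte Carlo sampling, not by the analytical quantile calculation that phylogenetic software would normally be expected to use. It assumes a Gamma distribution with shape `alpha` and mean 1, splits the sorted samples into `k` equally sized classes, and takes the mean rate of each class.

```python
import random


def discrete_gamma_rates(alpha, k=4, n_samples=200_000, seed=0):
    """Approximate the k class rates of an equal-probability discrete Gamma model.

    Hypothetical helper for illustration only: samples from Gamma(shape=alpha,
    scale=1/alpha), which has mean 1, and uses the mean of each equal-sized
    slice of the sorted samples as that class's relative rate.
    """
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); scale = 1/alpha gives mean 1.
    samples = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_samples))

    # Equal class probabilities: every class receives the same number of samples.
    # Any remainder samples beyond k * size are ignored.
    size = n_samples // k
    class_means = [sum(samples[i * size:(i + 1) * size]) / size for i in range(k)]

    # Rescale so the equally weighted mean relative rate is exactly 1.
    overall = sum(class_means) / k
    return [m / overall for m in class_means]


# Illustrative values only: a small alpha gives a long tail of fast sites.
rates = discrete_gamma_rates(alpha=0.5, k=4)
print(rates)
```

Because every class carries probability 1/k, the whole long tail of fast-evolving sites is summarised by a single rate in the top class. A FreeRate-style model would instead treat both the class rates and the class probabilities as free parameters.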
