Phyloformer: Fast, accurate and versatile phylogenetic reconstruction with deep neural networks

Abstract

Phylogenetic inference aims at reconstructing the tree describing the evolution of a set of sequences descending from a common ancestor. The high computational cost of state-of-the-art Maximum likelihood and Bayesian inference methods limits their usability under realistic evolutionary models. Harnessing recent advances in likelihood-free inference and geometric deep learning, we introduce Phyloformer, a fast and accurate method for evolutionary distance estimation and phylogenetic reconstruction. Sampling many trees and sequences under an evolutionary model, we train the network to learn a function that enables predicting the former from the latter. Under a commonly used model of protein sequence evolution and exploiting GPU acceleration, it outpaces fast distance methods while matching maximum likelihood accuracy on simulated and empirical data. Under more complex models, some of which include dependencies between sites, it outperforms other methods. Our results pave the way for the adoption of sophisticated realistic models for phylogenetic inference.

Article activity feed

  1. Phyloformer starts (bottom left) from a one-hot encoded MSA

    Below I point out that PF+FastME seems to do well when it comes to inferring branch lengths (as indicated by low KF distances), but it struggles with topological accuracy (as indicated by RF distances), particularly for larger trees.

    I wonder if you might be able to recover some topological accuracy by supplementing the one-hot encoded MSA with some complementary, or even alternative, feature representation. One possibility that comes to mind is obtaining protein feature vectors derived from a Position-Specific Scoring Matrix (PSSM) constructed from the MSA.

    This would be a 20 AA x 20 AA = 400-long feature vector for each sequence (as implemented, for example, by PSSMCOOL - https://doi.org/10.1093/biomethods/bpac008). Including this either as a substitute for the one-hot encoded MSA or as a complementary layer may provide additional relevant information for PF to learn, resulting in improved topological accuracy while still using FastME.
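To make the suggestion concrete, here is a rough Python sketch of one PSSM-style 400-dimensional descriptor (for each residue type in a sequence, the mean column frequency profile over the positions carrying that residue). This is only an illustration of the general idea, not PSSMCOOL's actual feature definition:

```python
from collections import Counter

AAS = "ACDEFGHIKLMNPQRSTVWY"
AA_IDX = {a: i for i, a in enumerate(AAS)}

def column_frequencies(msa):
    """Per-column amino-acid frequencies (a simple stand-in for a PSSM)."""
    n_cols = len(msa[0])
    freqs = []
    for j in range(n_cols):
        counts = Counter(seq[j] for seq in msa if seq[j] in AA_IDX)
        total = sum(counts.values()) or 1
        freqs.append([counts.get(a, 0) / total for a in AAS])
    return freqs

def pssm_features(seq, freqs):
    """400-dim vector: for each residue type a (20 rows), the mean column
    profile (20 values) over the positions where seq carries a."""
    sums = [[0.0] * 20 for _ in range(20)]
    counts = [0] * 20
    for j, a in enumerate(seq):
        if a in AA_IDX:
            i = AA_IDX[a]
            counts[i] += 1
            for k in range(20):
                sums[i][k] += freqs[j][k]
    return [sums[i][k] / counts[i] if counts[i] else 0.0
            for i in range(20) for k in range(20)]
```

Each sequence then contributes one fixed-length vector regardless of alignment length, which is what would let it act as a complementary per-sequence input channel.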

  2. Performance measures for different tree reconstruction methods. a) Kuhner-Felsenstein (KF) distance, which takes into account both topology and branch lengths of the compared trees; b) mean absolute error (MAE) on pairwise distances, which ignores topology; c) normalized Robinson-Foulds (RF) distance, which only takes into account tree topology. The alignments for which trees are inferred were simulated under the LG+GC sequence model and are all 500 amino acids long. For each measure, we show 95% confidence intervals estimated with 1000 bootstrap samples.

    These results paired with the runtimes are really quite impressive! But the contrast between the results for KF distances and those for RF distances is interesting, and seems worth unpacking.

    In particular, it's notable that the RF distances at greater tree sizes for PF+FastME seem to converge with those of FastME, exceeding those of IQTree/FastTree, with the gap widening as tree size grows.

    As you say, RF is just the sum of differences in bipartitions between two trees, whereas KF considers differences in both topology and branch length. You find that PF+FastME consistently infers trees with KF distances lower than or equivalent to those of IQTree and FastTree. But as tree size increases, RF distances for PF+FastME grow at a high rate, exceeding those of FastTree and IQTree starting at relatively small trees (~20 tips).

    Together, these results suggest that PF+FastME estimates branch lengths well. This is perhaps expected, but a great thing to see, since PF is effectively trained to infer the evolutionary distances that FastME uses to infer branch lengths! However, despite accurately inferring branch lengths, the larger trees inferred by PF+FastME seem to contain more topological errors than those of the other methods.

    Do you have any intuition as to why this discrepancy arises? Or any thoughts on how you might modify the model/model architecture to better account for and mitigate this effect?
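    For readers unpacking the RF numbers above: a minimal sketch of the bipartition-based computation this comment refers to, with trees written as nested tuples of leaf names (an illustrative representation, not the paper's code):

```python
def splits(tree, all_leaves):
    """Non-trivial bipartitions of a tree given as nested tuples of leaf
    names; each split is stored as a frozenset {side, complement} so the
    representation does not depend on where the tree happens to be rooted."""
    out = set()
    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        s = frozenset().union(*(leaves(c) for c in node))
        if 1 < len(s) < len(all_leaves) - 1:  # skip trivial splits
            out.add(frozenset([s, all_leaves - s]))
        return s
    leaves(tree)
    return out

def normalized_rf(t1, t2, leaf_names):
    """RF = size of the symmetric difference of the bipartition sets,
    divided by 2*(n-3), its maximum for two fully resolved unrooted trees."""
    all_leaves = frozenset(leaf_names)
    s1, s2 = splits(t1, all_leaves), splits(t2, all_leaves)
    n = len(all_leaves)
    return len(s1 ^ s2) / (2 * (n - 3)) if n > 3 else 0.0
```

Because every conflicting bipartition counts equally here, each misplaced short internal branch adds a full unit to RF, which is one reason RF can climb with tree size even when branch-length-aware metrics stay low.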

  3. Normalized Robinson-Foulds distance (above) and Kuhner-Felsenstein distance (below) for different tree reconstruction methods on the Cherry (left) and SelReg (right) test sets (alignment length=500).

    We see a similar pattern as before (in Fig. 4) in the SelReg dataset, where KF distances show FastME and PF+FastME inferring trees with better branch lengths, but (particularly for larger trees) greater topological errors. Additionally, with increasing tree size, most methods show RF distances that decrease, whereas for PF+FastME they are still increasing (on both the Cherry and SelReg datasets). I think it would be worth expanding the datasets if possible to see what happens at even larger tree sizes (e.g. 250, 500, 1000 leaves). Do these patterns continue or plateau? If so, where?
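    Since the KF/RF contrast drives much of this discussion, a minimal sketch of the Kuhner-Felsenstein (branch-score) distance may also help. Here each tree is abstracted as a mapping from bipartition to branch length; the string split identifiers are purely illustrative:

```python
from math import sqrt

def kf_distance(splits1, splits2):
    """Kuhner-Felsenstein (branch-score) distance: square root of the sum of
    squared branch-length differences over all bipartitions; a split missing
    from one tree is treated as having branch length zero there."""
    total = 0.0
    for s in splits1.keys() | splits2.keys():
        d = splits1.get(s, 0.0) - splits2.get(s, 0.0)
        total += d * d
    return sqrt(total)
```

A conflicting split contributes only its (often short) branch length squared, so many topological errors on short internal branches can leave KF nearly unchanged while RF, which counts every conflict equally, keeps growing; that is consistent with the pattern described above.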

  4. Comparison of topology reconstruction accuracy between Phyloformer and other methods on empirical data. In both panels, we show the normalized Robinson-Foulds distance between reconstructed gene trees and the corresponding concatenate tree.

    Given that you report results for both RF and KF distances in figure 5 above, I'd suggest doing the same here, as it seems clear that RF and KF capture quite different aspects of tree differences, particularly for IQTree vs. FastME and PF+FastME.

  5. Tree comparison metrics for different tree reconstruction methods on the LG+GC+indels test set (alignment length=500). Legend as in Fig. 2, with Phyloformer finetuned on alignments with gaps named PFIndel+FastME and in cyan.

    It seems we see the same pattern as in Fig. 2 here, this time with RF distances for PF+FastME converging with those of FastME at slightly larger tree sizes (~50-60 tips).