Phyloformer: Fast, accurate and versatile phylogenetic reconstruction with deep neural networks
This article has been reviewed by the following groups
Listed in
 Evaluated articles (Arcadia Science)
Abstract
Phylogenetic inference aims at reconstructing the tree describing the evolution of a set of sequences descending from a common ancestor. The high computational cost of state-of-the-art maximum likelihood and Bayesian inference methods limits their usability under realistic evolutionary models. Harnessing recent advances in likelihood-free inference and geometric deep learning, we introduce Phyloformer, a fast and accurate method for evolutionary distance estimation and phylogenetic reconstruction. Sampling many trees and sequences under an evolutionary model, we train the network to learn a function that enables predicting the former from the latter. Under a commonly used model of protein sequence evolution and exploiting GPU acceleration, it outpaces fast distance methods while matching maximum likelihood accuracy on simulated and empirical data. Under more complex models, some of which include dependencies between sites, it outperforms other methods. Our results pave the way for the adoption of sophisticated realistic models for phylogenetic inference.
Article activity feed

Phyloformer starts (bottom left) from a one-hot encoded MSA
Below I point out that PF+FastME seems to do well when it comes to inferring branch lengths (as indicated by low KF distances), but it struggles with topological accuracy (as indicated by RF distances), particularly for larger trees.
I wonder if you might be able to recover some topological accuracy by supplementing the one-hot encoded MSA with a complementary, or even alternative, feature representation. One option that comes to mind is protein feature vectors derived from a Position-Specific Scoring Matrix (PSSM) constructed from the MSA.
This would be a 20 AA x 20 AA = 400-long feature vector for each sequence (such as implemented by PSSMCOOL, https://doi.org/10.1093/biomethods/bpac008). Including this either as a substitute for the one-hot encoded MSA or as a complementary layer may provide additional relevant information for PF to learn, resulting in improved topological accuracy while still using FastME.
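To make the suggestion concrete, here is a minimal sketch of one PSSM-derived 400-dimensional descriptor. It is an illustration under stated assumptions, not PSSMCOOL's implementation and not part of Phyloformer. The PSSM is approximated by per-column log-odds scores computed from the MSA itself with a uniform background and a pseudocount (real PSSMs are usually produced by PSI-BLAST), and the function names `column_log_odds` and `pssm400` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def column_log_odds(msa, pseudocount=1.0):
    """Return an (alignment length x 20) matrix of per-column log-odds scores.

    Assumption: uniform background frequencies (1/20) and an additive
    pseudocount; gaps and ambiguous characters are ignored.
    """
    n_cols = len(msa[0])
    pssm = []
    for pos in range(n_cols):
        counts = [pseudocount] * 20
        for seq in msa:
            idx = AA_INDEX.get(seq[pos])
            if idx is not None:
                counts[idx] += 1.0
        total = sum(counts)
        pssm.append([math.log2(20.0 * c / total) for c in counts])
    return pssm


def pssm400(seq, pssm):
    """Sum the PSSM rows over the positions where each residue type occurs.

    The result is a 20 x 20 matrix (residue type x PSSM column),
    flattened to a list of 400 values.
    """
    feat = [[0.0] * 20 for _ in range(20)]
    for pos, aa in enumerate(seq):
        row = AA_INDEX.get(aa)
        if row is not None:
            for col in range(20):
                feat[row][col] += pssm[pos][col]
    return [value for row in feat for value in row]


# Toy alignment, hypothetical data for illustration only.
msa = ["MKVLA", "MKILA", "MRVLG", "M-VLA"]
pssm = column_log_odds(msa)
features = [pssm400(seq, pssm) for seq in msa]
print(len(features), len(features[0]))  # 4 400
```

Note that a descriptor of this kind is one fixed-length vector per sequence, so it discards per-site information; how it would be combined with the per-site one-hot input is left open here.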

Performance measures for different tree reconstruction methods. a) Kuhner-Felsenstein (KF) distance, which takes into account both topology and branch lengths of the compared trees; b) mean absolute error (MAE) on pairwise distances, which ignores topology; c) normalized Robinson-Foulds (RF) distance, which only takes into account tree topology. The alignments for which trees are inferred were simulated under the LG+GC sequence model and are all 500 amino acids long. For each measure, we show 95% confidence intervals estimated with 1000 bootstrap samples.
These results paired with the runtimes are really quite impressive! But the contrast between the results for KF distances and those for RF distances is interesting, and seems like it may be worth unpacking.
In particular, it's notable that the RF distances at greater tree sizes for PF+FastME seem to converge with those of FastME, being greater than those seen for IQTree/FastTree, with the difference increasing along with tree size.
As you say, RF simply counts the bipartitions that differ between two trees, whereas KF considers differences in both topology and branch length. You find that PF+FastME consistently infers trees with lower or equivalent KF distances compared to IQTree and FastTree. But, as tree size increases, RF distances increase for PF+FastME at a high rate, exceeding those of FastTree and IQTree starting at relatively small trees (~20 tips).
Together, these results would suggest that PF+FastME estimates branch lengths well. This is maybe expected but a great thing to see, since PF is effectively trained to infer the evolutionary distances that FastME uses to infer branch lengths! However, despite accurately inferred branch lengths, there seems to be a larger number of topological errors in the larger trees inferred by PF+FastME as compared to the other methods.
Do you have any intuition as to why this discrepancy arises? Or any thoughts on how you might modify the model/model architecture to better account for and mitigate this effect?
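As a small illustration of how the two metrics can diverge, the sketch below computes normalized RF and KF distances for two hand-written five-tip trees that differ only in one short internal branch. The trees, branch lengths, and helper names are hypothetical and chosen for illustration; each tree is represented as a mapping from bipartition to branch length rather than parsed from a file.

```python
import math

TIPS = frozenset("ABCDE")


def canon(side):
    """Represent a bipartition by the side that contains the smallest tip."""
    side = frozenset(side)
    return side if min(TIPS) in side else TIPS - side


def make_tree(branches):
    """Map each canonical bipartition to the length of the branch inducing it."""
    return {canon(side): length for side, length in branches.items()}


def internal_splits(tree):
    """Keep only bipartitions induced by internal branches."""
    return {s for s in tree if 1 < len(s) < len(TIPS) - 1}


def normalized_rf(t1, t2):
    """Count bipartitions found in only one tree, divided by 2(n - 3),
    the maximum for two unrooted binary trees on n tips."""
    diff = internal_splits(t1) ^ internal_splits(t2)
    return len(diff) / (2 * (len(TIPS) - 3))


def kf_distance(t1, t2):
    """Square root of the summed squared branch-length differences over all
    bipartitions; a bipartition absent from a tree counts as length 0."""
    splits = set(t1) | set(t2)
    return math.sqrt(sum((t1.get(s, 0.0) - t2.get(s, 0.0)) ** 2 for s in splits))


# Reference tree ((A,B),C,(D,E)) and an estimate ((A,C),B,(D,E)):
# identical terminal branches, one short internal branch resolved differently.
true_tree = make_tree({"A": 0.1, "B": 0.1, "C": 0.2, "D": 0.1, "E": 0.1,
                       "AB": 0.05, "DE": 0.3})
inferred = make_tree({"A": 0.1, "B": 0.1, "C": 0.2, "D": 0.1, "E": 0.1,
                      "AC": 0.05, "DE": 0.3})

print(normalized_rf(true_tree, inferred))          # 0.5
print(round(kf_distance(true_tree, inferred), 4))  # 0.0707
```

In this toy case half of the internal bipartitions are wrong, yet the KF distance stays small because the misplaced branch is short. Whether that is what drives the pattern in the figure is a question for the authors, not something this sketch establishes.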

Normalized Robinson-Foulds distance (above) and Kuhner-Felsenstein distance (below) for different tree reconstruction methods on the Cherry (left) and SelReg (right) test sets (alignment length=500).
We see a similar pattern as before (in fig 4) in the SelReg dataset, where KF distances show FastME and PF+FastME inferring trees with better branch lengths, but (particularly for larger trees) greater topological errors. Additionally, with increasing tree size, most methods seem to show a trajectory wherein RF distance decreases with increasing tree size, whereas PF+FastME is still increasing (for both the Cherry and SelReg datasets). I think it would be worth expanding the datasets if possible to see what happens at even larger tree sizes (e.g. 250, 500, 1000 leaves). Do these patterns continue or plateau? If so, where?

Comparison of topology reconstruction accuracy between Phyloformer and other methods on empirical data. In both panels, we show the normalized Robinson-Foulds distance between reconstructed gene trees and the corresponding concatenate tree.
Given that you report results for both RF and KF distances above in figure 5, I'd suggest doing the same here, as it seems clear that RF and KF capture quite different aspects of the differences between trees, particularly when comparing IQTree with FastME and PF+FastME.

Tree comparison metrics for different tree reconstruction methods on the LG+GC+indels test set (alignment length=500). Legend as in Fig. 2, with Phyloformer fine-tuned on alignments with gaps named PFIndel+FastME and in cyan.
It seems we see the same pattern as in Fig 2 here, this time with RF distances for PF+FastME converging with FastME at slightly larger tree sizes (~50-60 tips).
