Performance and Robustness of Parameter Estimation from Phylogenetic Trees Using Neural Networks


Abstract

Species diversification is characterized by speciation and extinction, the rates of which can, under some assumptions, be estimated from time-calibrated phylogenies. However, maximum likelihood estimation (MLE) methods for inferring rates are limited to simpler models and can show bias, particularly in small phylogenies. Likelihood-free methods that estimate the parameters of diversification models using deep learning have started to emerge, but how robust neural network methods are at handling the intricate nature of phylogenetic data remains an open question. Here we present a new ensemble neural network approach to estimate diversification parameters from phylogenetic trees that leverages different classes of neural networks (dense neural network, graph neural network, and long short-term memory recurrent network) and simultaneously learns from graph representations of phylogenies, their branching times and their summary statistics. Our best-performing ensemble neural network (which corrects the graph neural network's results using a recurrent neural network) can compute estimates faster than MLE and is less affected by tree size. Our analysis suggests that the primary limitation to accurate parameter estimation is the amount of information contained within a phylogeny, as indicated by its size and the strength of the effects shaping it. In cases where MLE is unavailable, our neural network method provides a promising alternative for estimating phylogenetic tree parameters. When detectable phylogenetic signal is present, our approach delivers results comparable to MLE, but without its inherent biases.

Article activity feed

  1. Performance Analysis

    I would like to see figures depicting the performance (error) of these different methods for each parameter estimate, plotted as a function of tree size (i.e. a heat map of error as a function of parameter value and tree size). I can't help but wonder whether part of the pattern of increasing error rates with increasing diversification parameter values is simply a result of there being greater variability in simulated tree shape/size at these larger parameter values.

    Additionally, I think it could be quite informative to plot a heat map of error with speciation and extinction rates as the x and y axes. I suspect this would highlight a clear, predictable pattern, with increasing error rates in particular being characteristic of parameter combinations where both speciation and extinction rates are high, leading to high species turnover and thus greater "volatility" of diversification outcomes.
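
    For concreteness, here is a minimal sketch of that second plot; `lambda_true`, `mu_true`, and `abs_error` are hypothetical per-tree arrays standing in for the actual simulation output:

    ```python
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import binned_statistic_2d

    # Hypothetical per-tree values; the real ones would come from the simulations.
    rng = np.random.default_rng(0)
    lambda_true = rng.uniform(0.1, 1.0, 5000)  # true speciation rates
    mu_true = rng.uniform(0.0, 0.9, 5000)      # true extinction rates
    abs_error = rng.exponential(0.1, 5000)     # |estimate - truth| per tree

    # Mean absolute error within each (speciation, extinction) bin.
    stat, lam_edges, mu_edges, _ = binned_statistic_2d(
        lambda_true, mu_true, abs_error, statistic="mean", bins=20
    )

    plt.pcolormesh(lam_edges, mu_edges, stat.T, shading="auto")
    plt.colorbar(label="mean absolute error")
    plt.xlabel("true speciation rate")
    plt.ylabel("true extinction rate")
    plt.show()
    ```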

  2. GNN: Predictions obtained by the graph neural network using the phylogenies.

    Again, I can't help but suspect that the performance of the GNN here is limited by the fact that GraphSAGE cannot actually leverage edge weights/branch lengths.

    Relatedly, the performance of the GNN may be improved further if node positional information in the tree is encoded as node features using one of the positional encodings implemented in PyTorch Geometric (e.g. the Laplacian eigenvector positional encoding - https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.transforms.AddLaplacianEigenvectorPE.html#torch_geometric.transforms.AddLaplacianEigenvectorPE).
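
    A minimal sketch of that suggestion, using a hypothetical four-tip toy tree in place of a real phylogeny:

    ```python
    import torch
    from torch_geometric.data import Data
    from torch_geometric.transforms import AddLaplacianEigenvectorPE
    from torch_geometric.utils import to_undirected

    # Toy phylogeny: tips 0-3, internal nodes 4-5, root 6 (parent -> child edges).
    edge_index = torch.tensor([[6, 6, 4, 4, 5, 5],
                               [4, 5, 0, 1, 2, 3]], dtype=torch.long)
    x = torch.ones(7, 1)  # placeholder node features

    # Treat the tree as undirected for the spectral encoding.
    data = Data(x=x, edge_index=to_undirected(edge_index))

    # Append the k leading Laplacian eigenvectors to each node's feature vector.
    transform = AddLaplacianEigenvectorPE(k=3, attr_name=None, is_undirected=True)
    data = transform(data)
    print(data.x.shape)  # torch.Size([7, 4]): original feature plus 3 encodings
    ```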

  3. We use the AdamW (Adaptive Moment Estimation with decoupled weight decay) optimizer (Loshchilov and Hutter, 2017) to iteratively update the neural networks’ parameters to minimize the loss function. We used default AdamW argument settings.

    It's nice to see that you used AdamW over the original Adam optimizer, as it correctly decouples weight decay from the adaptive gradient update. That said, it might be worth considering adaptive learning rate schedulers that react to a loss plateau, to improve how well each of these three models learns and generalizes during training. It could also be worth exploring the impact of varying weight_decay (regularization strength), as the default value for this parameter (and realistically for the learning rate too) is unlikely to be optimal for all three models, which differ in complexity.
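
    A minimal sketch of both suggestions in PyTorch; the model, data, and hyperparameter values below are placeholders, not the settings used in the paper:

    ```python
    import torch
    from torch.optim.lr_scheduler import ReduceLROnPlateau

    # Stand-in model and data; in practice this would be the GNN, DNN, or LSTM.
    model = torch.nn.Linear(10, 2)
    x, y = torch.randn(64, 10), torch.randn(64, 2)
    loss_fn = torch.nn.MSELoss()

    # lr and weight_decay likely need per-model tuning rather than the defaults.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

    # Halve the learning rate whenever the loss plateaus.
    scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=5)

    for epoch in range(50):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step(loss.item())  # in practice, step on the validation loss
    ```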

  4. We started with a GNN to make initial predictions and explored the effectiveness of both DNN and LSTM for correcting residuals, either individually or in sequence.

    Did you explore the impact of the ordering of the residual-correcting models on the outcomes? If so, what sort of variability in outcome did you find?
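
    To make the question concrete, the comparison I have in mind is between the two corrector orderings, roughly as sketched below (`gnn`, `dnn`, and `lstm` are hypothetical fitted models, each exposing a `predict` method):

    ```python
    def correct_in_sequence(trees, base, first, second):
        """Stacked residual correction with an explicit corrector ordering."""
        pred = base.predict(trees)           # initial GNN estimates
        pred = pred + first.predict(trees)   # corrector fitted to the GNN residuals
        pred = pred + second.predict(trees)  # corrector fitted to what remains
        return pred

    # Ordering A: GNN -> DNN -> LSTM
    # est_a = correct_in_sequence(trees, gnn, dnn, lstm)
    # Ordering B: GNN -> LSTM -> DNN
    # est_b = correct_in_sequence(trees, gnn, lstm, dnn)
    ```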

  5. In the GNN, the full phylogeny was interpreted as a graph and could in that form be used as input data

    I comment on this in Appendix C, but did you explore the use of different GNN architectures? GraphSAGE does quite well on inductive learning tasks and at aggregating node information from multiple hops, but it does not take into account edge weights or edge attributes, which here could be interpreted as branch lengths. I suspect this information would be incredibly useful to a GNN model capable of leveraging it (see the sketch below).
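
    A minimal sketch of what an edge-weight-aware layer could look like, assuming branch lengths are stored as one scalar weight per edge (all names and shapes here are illustrative):

    ```python
    import torch
    from torch_geometric.nn import GraphConv

    # Toy phylogeny: tips 0-3, internal nodes 4-5, root 6 (parent -> child edges).
    edge_index = torch.tensor([[6, 6, 4, 4, 5, 5],
                               [4, 5, 0, 1, 2, 3]], dtype=torch.long)
    branch_lengths = torch.tensor([0.8, 0.5, 1.2, 1.2, 0.9, 0.9])  # illustrative
    x = torch.ones(7, 8)  # placeholder node features

    # Unlike SAGEConv, GraphConv accepts a scalar weight per edge,
    # so branch lengths can modulate message passing directly.
    conv = GraphConv(in_channels=8, out_channels=16)
    h = conv(x, edge_index, edge_weight=branch_lengths)
    print(h.shape)  # torch.Size([7, 16])
    ```

    Layers such as GINEConv or TransformerConv additionally accept multi-dimensional edge_attr, so richer per-branch features than a single length would also be possible.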

  6. For the graph neural network, we used GraphSAGE

    Is there a particular reason why you chose to use GraphSAGE? I understand it does well at aggregating information from nodes that are multiple hops away and thus performs well on large graphs, but it is unable to take advantage of edge weights (i.e. branch lengths), which could be incredibly information-rich for diversification parameter estimation. Additionally, although you capped the tree size at 1500 tips (3000 nodes), these graphs really are not terribly large in the context of GNN applications - they are also quite sparse, as they are fully bifurcating.

    Mostly just interested in your thought process here! Did you explore the use of alternative GNN convolutional layers or models?

  7. R package “eveGNN” (a codebase of phylogeny simulation, data transformation, neural network training and MLE computation for our study)

    Is this meant to be EvoNN? Or is this a separate package/repo? If the latter, please be sure to include a link to the repo!

  8. Our analyses encompass three different diversification scenarios for which likelihood-based inference approaches already exist:

    I think it certainly makes sense to evaluate these different model architectures on these commonly studied diversification scenarios, for which comparable PCMs already exist. However, I do think it will be incredibly important to evaluate how they perform under increasingly complex scenarios, such as those where diversification rates are time- and branch-heterogeneous and either depend on some other continuously (or discretely) varying trait(s) or themselves evolve according to some model of evolution (e.g. Martin et al., 2023 - https://doi.org/10.1093/sysbio/syac068).

    I say this because, although these scenarios may require even larger, more information-rich phylogenies to accurately infer diversification parameters and make phylogenetic predictions, they are also likely to be the scenarios where more classical MLE-based approaches exhibit a high degree of model inadequacy, and thus exactly where these new approaches could shine.

  9. Phylogenetic trees can also be viewed as graphs, suggesting that graph neural networks have potential applicability in phylogenetics.

    It's great to see this recognition of the potential of GNNs in the context of phylogenetic inference - there is a huge range of applications here!
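
    As a concrete illustration of the tree-as-graph view, here is a minimal sketch of converting a (hypothetical) parent/child edge list with branch lengths into a PyTorch Geometric Data object:

    ```python
    import torch
    from torch_geometric.data import Data

    # Hypothetical rooted, bifurcating tree as (parent, child, branch_length) triples;
    # tips are 0-3, internal nodes 4-5, root 6.
    edges = [(4, 0, 1.0), (4, 1, 1.0), (5, 4, 0.5), (5, 2, 1.5), (6, 5, 0.3), (6, 3, 1.8)]

    parents, children, lengths = zip(*edges)
    edge_index = torch.tensor([parents, children], dtype=torch.long)
    edge_attr = torch.tensor(lengths).unsqueeze(1)  # branch lengths as edge features

    # One indicator feature per node: is it a tip (i.e. never a parent)?
    num_nodes = 7
    is_tip = torch.tensor([[float(i not in parents)] for i in range(num_nodes)])

    tree_graph = Data(x=is_tip, edge_index=edge_index, edge_attr=edge_attr)
    print(tree_graph)  # Data(x=[7, 1], edge_index=[2, 6], edge_attr=[6, 1])
    ```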