Exploring genome gene content and morphological analysis to test recalcitrant nodes in the animal phylogeny

Abstract

An accurate phylogeny of animals is needed to clarify their evolution, ecology, and impact on shaping the biosphere. Although datasets of several hundred thousand amino acids are now routinely used to test phylogenetic hypotheses, key deep nodes in the metazoan tree remain unresolved: the root of animals, the root of Bilateria, and the monophyly of Deuterostomia. Instead of using the standard approach of amino acid datasets, we performed analyses of newly assembled genome gene content and morphological datasets to investigate these recalcitrant nodes in the phylogeny of animals. We extensively explored the choices for assembling the genome gene content dataset and the model choices for the morphological analyses. Our results are robust to these choices and provide additional insights into the early evolution of animals. They are consistent with sponges as the sister group of all the other animals and with the worm-like bilaterian lineage Xenacoelomorpha as the sister group of the other Bilateria, and they tentatively support monophyletic Deuterostomia.

Article activity feed

  1. Modelling morphological evolution by using stochastic processes is more intricate than modelling molecular sequence evolution because it cannot be assumed that the same evolutionary process is acting on all characters identically.

    Could this not be said about gene content evolution as well? Patterns and dynamics of gene family gain/loss are certainly expected to be heterogeneous - does the reversible binary substitution model allow for among-site heterogeneity?
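To make the question above concrete, here is a minimal Python sketch of what among-site rate heterogeneity would mean for a reversible two-state (absent/present) model. It is an illustration only, not the model or software used in the paper: the function names, the default stationary frequency `pi1`, and the four rate multipliers are assumptions chosen for the example. The likelihood of a two-taxon presence/absence pattern is averaged over rate categories, which is a common way to accommodate variation across characters.

```python
import math

def transition_probs(t, pi1=0.5, rate=1.0):
    """2x2 transition matrix P[x][y] for a reversible binary model.

    States: 0 = gene family absent, 1 = present. pi1 is the stationary
    frequency of state 1; the rate matrix is scaled so that one unit of
    branch length equals one expected change per character.
    """
    pi0 = 1.0 - pi1
    mu = 1.0 / (2.0 * pi0 * pi1)          # normalising constant
    decay = math.exp(-mu * rate * t)
    return [
        [pi0 + pi1 * decay, pi1 * (1.0 - decay)],
        [pi0 * (1.0 - decay), pi1 + pi0 * decay],
    ]

def pair_likelihood(x, y, t, pi1=0.5, rates=(1.0,)):
    """Likelihood of observing states x and y in two taxa separated by t,
    averaged over equally weighted rate categories."""
    pi = (1.0 - pi1, pi1)
    per_category = [pi[x] * transition_probs(t, pi1, r)[x][y] for r in rates]
    return sum(per_category) / len(rates)

# Hypothetical rate multipliers with mean 1.0 (stand-ins for discrete-gamma categories).
HETEROGENEOUS_RATES = (0.1, 0.5, 1.0, 2.4)

homogeneous = pair_likelihood(0, 1, t=0.5)
heterogeneous = pair_likelihood(0, 1, t=0.5, rates=HETEROGENEOUS_RATES)
print(f"single rate: {homogeneous:.4f}, four rate categories: {heterogeneous:.4f}")
```

Even with the same mean rate, the two likelihoods differ, which is why it matters whether the binary model used for gene content lets rates vary among gene families.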

  2. The phylogenetic analysis of gene content data utilises genome-derived proteomes and converts the presence or absence of gene families in the genomes of the terminals into a binary data matrix [9,25,37,38].

    This is the same type of gene-content encoding used by Ryan et al. (2013) as their second dataset, which coincidentally supported the hypothesis of ctenophores as the sister group to all other animals.

    I'm curious as to why an alternative recoding of these gene families, gene copy number, was not considered. Given the analyses conducted here to infer orthogroups using OrthoFinder, this encoding would be fairly straightforward to accomplish, and could perhaps be even more information rich than presence/absence. Additionally, this would be an entirely distinct way of encoding these data compared to the approach employed by Ryan et al. (2013).

    Additionally, there are now several methods capable of inferring species trees from multi-copy gene family trees, which are by design more robust to, and can even benefit from, the presence of paralogs (e.g. Asteroid - https://doi.org/10.1093/bioinformatics/btac832, SpeciesRax - https://doi.org/10.1093/molbev/msab365, ASTRAL-Pro 2 - https://doi.org/10.1093/bioinformatics/btac620).
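The two encodings contrasted in this comment can be sketched in a few lines of Python. This is a hedged illustration, not the pipeline used in the paper: it assumes per-gene orthogroup assignments are available as (species, orthogroup) pairs, and the toy data, the function name `encode`, the `cap` parameter, and the cap of 3 are all hypothetical. Setting `cap=1` gives the binary presence/absence matrix; a larger cap gives an ordered copy-number character with states 0 to `cap`.

```python
from collections import Counter

def encode(assignments, species, orthogroups, cap=1):
    """Build a species-by-orthogroup character matrix.

    assignments: iterable of (species, orthogroup) pairs, one per gene.
    cap=1 yields presence/absence; cap=k yields ordered states 0..k,
    with every copy number above k collapsed into state k.
    """
    counts = Counter(assignments)
    return {
        sp: [min(counts.get((sp, og), 0), cap) for og in orthogroups]
        for sp in species
    }

# Toy, made-up assignments: one tuple per gene.
genes = [
    ("sponge", "OG1"), ("sponge", "OG1"), ("sponge", "OG2"),
    ("ctenophore", "OG1"),
    ("bilaterian", "OG2"), ("bilaterian", "OG2"), ("bilaterian", "OG2"),
    ("bilaterian", "OG2"), ("bilaterian", "OG3"),
]
species = ["sponge", "ctenophore", "bilaterian"]
orthogroups = ["OG1", "OG2", "OG3"]

presence_absence = encode(genes, species, orthogroups, cap=1)
copy_number = encode(genes, species, orthogroups, cap=3)

for sp in species:
    print(sp, presence_absence[sp], copy_number[sp])
```

In the toy data the sponge and the bilaterian are indistinguishable for OG2 under presence/absence but differ under the copy-number coding, which is the extra information the comment refers to.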

  3. This is an impressive paper containing a significant amount of work, and I love the amount of effort focused on exploring how sensitive the results are to methodological approaches and parameter specifications. In particular, I'm delighted to see the exploration of how the choice of inflation parameter impacts downstream phylogenetic inference - given that this parameter has profound influence on all downstream analyses, I think all studies using these types of approaches should be similarly cautious/considerate.

    In general, I feel that the choice to encode 'gene content' as gene presence makes distinguishing these results from past efforts, particularly that of Pett et al. (2019), more challenging. That is, gene content is encoded in the same way across the two studies, and both the present study and that of Pett et al. come to the same conclusion that the Porifera-sister hypothesis is most strongly supported. Together, this makes these new phylogenetic analyses of gene content less compelling as corroborating evidence.

    I feel that an alternative encoding of gene content, as gene copy number treated as an ordered discrete character (e.g. 0, 1, 2, 3, ...), would have been valuable to explore. This type of encoding would be more information rich, and would be inherently distinct from past efforts. Alternatively, the authors could have used one of a growing number of methods that are capable of inferring species trees from multi-copy gene families (see later comment). I would hope to see some discussion of these alternatives, though I suspect implementation of either would be non-trivial.

  4. We assembled a large number of new gene content datasets (see Methods, Fig. 1) to extensively test the effect of different parameter combinations when identifying homogroups and orthogroups, because this crucial step remains a challenge [40,41] and may influence the outcome of the downstream phylogenetic analysis [42]. For example, state-of-the-art methods provide two parameters (the E-value [similarity] and I-value [granulation or inflation]) which have a direct impact on the inferred gene family assignment (E-value) and splitting of gene families into orthogroups (I-value).

    I am delighted to see such a thorough exploration of the sensitivity of phylogenomic inference to these parameters, as defaults are typically used without consideration but can, as you say, have profound influence on downstream outcomes.

  5. Considering that previous amino-acid alignment-based phylogenomic analyses showed model- and data dependency [e.g., 9,18], which therefore did not lead to conclusive results, alternative approaches might help to select between phylogenetic hypotheses.

    It is unclear a priori why model and data dependency would not also be expected when the data are encoded in other ways. Fundamentally, these are still evolutionary models.

  6. The second coded the presence/absence of orthogroups. When this second coding strategy is used, individual orthogroups within each protein family are treated as individual characters. This is the same strategy introduced and justified by Pett et al. [37]

    I think a graphical depiction of this may be useful. As written, it's unclear how you are encoding orthogroups within homogroups.

    Based on your description below, it appears to me that the only difference is in how MCL clustering is applied to sequence similarity scores: OrthoFinder2 is used to infer orthogroups (i.e. normalizing similarity scores to account for sequence length), and homomcl is used for homogroups. Your statement that orthogroups are nested within protein families doesn't appear to be consistent with this. Am I mistaken in my interpretation?
