Viral genome sequence datasets display pervasive evidence of strand-specific substitution biases that are best described using non-reversible nucleotide substitution models

Curation statements for this article:
  • Curated by eLife


Abstract

The vast majority of phylogenetic trees are inferred from molecular sequence data (nucleotides or amino acids) using time-reversible evolutionary models which assume that, for any pair of nucleotide or amino acid characters, the relative rate of X to Y substitution is the same as the relative rate of Y to X substitution. However, this reversibility assumption is unlikely to accurately reflect the actual underlying biochemical and/or evolutionary processes that lead to the fixation of substitutions. Here, we use empirical viral genome sequence data to reveal that evolutionary non-reversibility is pervasive among most groups of viruses. Specifically, we consider two non-reversible nucleotide substitution models: (1) a 6-rate non-reversible model (NREV6) in which Watson-Crick complementary substitutions occur at identical relative rates and which might therefore be most applicable to analyzing the evolution of genomes where both complementary strands are subject to the same mutational processes (such as might be expected for double-stranded (ds) RNA or dsDNA genomes); and (2) a 12-rate non-reversible model (NREV12) in which all relative substitution types are free to occur at different rates and which might therefore be applicable to analyzing the evolution of genomes where the complementary genome strands are subject to different mutational processes (such as might be expected for viruses with single-stranded (ss) RNA or ssDNA genomes).
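As a rough illustration of the constraint that separates the two models, the sketch below expands six free rates into the twelve directed substitution rates by tying each substitution to its Watson-Crick complement (for example, A to C with T to G), and then frees one rate to obtain an NREV12-style parameterisation. All rate values and function names are hypothetical and chosen only for illustration; they are not estimates or code from the study.

```python
# Illustrative sketch: rate values and names are hypothetical, not from the study.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def expand_nrev6(six_rates):
    """Expand 6 free rates into 12 directed rates so that
    rate(X->Y) equals rate(comp(X)->comp(Y))."""
    rates = {}
    for (x, y), r in six_rates.items():
        rates[(x, y)] = r
        rates[(COMPLEMENT[x], COMPLEMENT[y])] = r
    return rates


def is_strand_symmetric(rates):
    """True if every substitution has the same rate as its complement."""
    return all(r == rates[(COMPLEMENT[x], COMPLEMENT[y])]
               for (x, y), r in rates.items())


def has_reversible_relative_rates(rates):
    """True if rate(X->Y) equals rate(Y->X) for every pair of bases."""
    return all(r == rates[(y, x)] for (x, y), r in rates.items())


# One representative substitution from each complementary pair.
six = {
    ("A", "C"): 1.0, ("A", "G"): 4.0, ("A", "T"): 0.5,
    ("C", "A"): 1.5, ("C", "G"): 0.8, ("C", "T"): 6.0,
}
nrev6 = expand_nrev6(six)

# NREV12-style: all 12 rates are free, so C->T may differ from G->A.
nrev12 = dict(nrev6)
nrev12[("C", "T")] = 9.0

print(len(nrev6), is_strand_symmetric(nrev6), has_reversible_relative_rates(nrev6))
print(is_strand_symmetric(nrev12))
```

In this toy setup the NREV6 rates are strand-symmetric but not reversible (A to C differs from C to A), while the modified NREV12 rates are neither.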

Using likelihood ratio and Akaike Information Criterion-based model tests, we show that, surprisingly, NREV12 provided a significantly better fit to 21/31 dsRNA and 20/30 dsDNA datasets than did the general time reversible (GTR) and NREV6 models, with NREV6 providing a better fit than NREV12 and GTR in only 5/30 dsDNA and 2/31 dsRNA datasets. As expected, NREV12 provided a significantly better fit to 24/33 ssDNA and 40/47 ssRNA datasets. Next, we used simulations to show that increasing degrees of strand-specific substitution bias decrease the accuracy of phylogenetic inference irrespective of whether GTR or NREV12 is used to describe mutational processes. However, in cases where strand-specific substitution biases are extreme (such as in SARS-CoV-2 and Torque teno sus virus datasets), NREV12 tends to yield more accurate phylogenetic trees than those obtained using GTR.
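The two kinds of model test mentioned above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the log-likelihoods are invented, and the parameter counts (8 for GTR, counting 5 relative rates plus 3 base frequencies, and 11 for NREV12) and the resulting 3 degrees of freedom are assumptions made here for the example.

```python
import math


def aic(log_likelihood, n_free_params):
    """Akaike Information Criterion: lower values indicate a better trade-off."""
    return 2 * n_free_params - 2 * log_likelihood


def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood ratio test statistic for a null model nested in the alternative."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom
    (closed form; valid only for df = 3)."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    return math.erfc(math.sqrt(z)) + 2.0 * math.sqrt(z / math.pi) * math.exp(-z)


# Hypothetical maximised log-likelihoods for one alignment (assumed values).
lnl_gtr, k_gtr = -10250.0, 8
lnl_nrev12, k_nrev12 = -10240.0, 11

aic_gtr = aic(lnl_gtr, k_gtr)
aic_nrev12 = aic(lnl_nrev12, k_nrev12)
stat = lrt_statistic(lnl_gtr, lnl_nrev12)
p_value = chi2_sf_df3(stat)  # df = k_nrev12 - k_gtr = 3 under the assumed counts

print(aic_gtr, aic_nrev12, stat, p_value)
```

With these invented numbers NREV12 has the lower AIC and the likelihood ratio test rejects GTR; with real data the outcome depends on the fitted likelihoods and on how the parameters of each model are counted.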

We show that NREV12 should be seriously considered during the model selection phase of phylogenetic analyses involving viral genomic sequences.

Article activity feed

  1. eLife assessment

    This valuable study revisits the effects of substitution model selection on phylogenetics by comparing reversible and non-reversible DNA substitution models. The authors provide evidence that 1) non time-reversible models sometimes perform better than general time-reversible models when inferring phylogenetic trees out of simulated viral genome sequence data sets, and that 2) non time-reversible models can fit the real data better than the reversible substitution models commonly used in phylogenetics, a finding consistent with previous work. However, the methods are incomplete in supporting the main conclusion of the manuscript, that is that non time-reversible models should be incorporated in the model selection process for these data sets.

  2. Reviewer #1 (Public Review):

    The study by Sianga-Mete et al. revisits the effects of substitution model selection on phylogenetics by comparing reversible and non-reversible DNA substitution models. This topic is not new: previous work has already shown that non-reversible, and also covarion, substitution models can fit real data better than the reversible substitution models commonly used in phylogenetics. In this regard, the results of the present study are not surprising. Specific comments follow.

    Major comments

    It is well known that non-reversible models can fit the real data better than the commonly used reversible substitution models, see for example,
    https://academic.oup.com/sysbio/article/71/5/1110/6525257
    https://onlinelibrary.wiley.com/doi/10.1111/jeb.14147?af=R
    The manuscript describes the results (non-reversible models fitting better than reversible models) as surprising, but I do not think they are; the results would be surprising if the reversible models provided the better fit.
    I think the introduction should be expanded with more information about non-reversible models and the previous studies that have already evaluated them. The manuscript should also either state that the results are not surprising or justify more clearly why they are.

    The introduction and/or discussion lack a treatment of recent work on the influence of substitution model selection on phylogenetic tree reconstruction. Some studies indicated that substitution model selection is not necessary for phylogenetic tree reconstruction,
    https://academic.oup.com/mbe/article/37/7/2110/5810088
    https://www.nature.com/articles/s41467-019-08822-w
    https://academic.oup.com/mbe/article/35/9/2307/5040133
    While others indicated that substitution model selection is recommended for phylogenetic tree reconstruction,
    https://www.sciencedirect.com/science/article/pii/S0378111923001774
    https://academic.oup.com/sysbio/article/53/2/278/1690801
    https://academic.oup.com/mbe/article/33/1/255/2579471
    The results of the present study seem to support this second view. I think the study could be improved by discussing this point, including what the study specifically contributes to the debate.

    The real data were downloaded from the Los Alamos HIV database. I wonder whether any criteria were used to select the sequences, or whether all sequences in the database for each studied virus category were analysed. Also, was any quality filter applied? How were gaps and ambiguous nucleotides handled? Note that these aspects could affect how well the models fit the data.

    How are the non-reversible model and the data compared, given that the substitution process is non-reversible? In particular, given an input MSA, how can one know whether a nucleotide substitution in the real data goes from state x to state y or from state y to state x if there is no reference (i.e., wild-type) sequence? All the sequences are mutants, and one may not have a reference with which to identify the direction of the mutation, which the non-reversible model requires. One could assume that the most abundant state is the wild-type state, but that may not be the case in reality. I think this is a major problem for the practical application of non-reversible substitution models in phylogenetics.

  3. Reviewer #2 (Public Review):

    The authors evaluate whether non time-reversible models provide a better fit than time-reversible models to data presenting strand-specific substitution biases. Specifically, the authors consider what they call NREV6 and NREV12 as candidate non time-reversible models. On the one hand, they show that AIC tends to select NREV12 more often than GTR on real virus data sets. On the other hand, they show using simulated data that NREV12 leads to inferred trees that are closer to the true generating tree when the data incorporate a certain degree of non time-reversibility. Based on these two experimental results, the authors conclude that "We show that non-reversible models such as NREV12 should be evaluated during the model selection phase of phylogenetic analyses involving viral genomic sequences". This is a valuable finding, and I agree that this is potentially good practice. However, the manuscript lacks an experiment that links the two findings to support the conclusion; in particular, one that answers the following question: does the best-fit model also lead to better tree topologies?

    On simulated data, the significance of the difference between GTR and NREV12 inferences is evaluated using a paired t test. The manuscript lacks a rationale or a reference supporting the suitability of a paired t test for measuring the significance of differences in wRF distance. Also, the results show that on average NREV12 performs better than GTR, but a pairwise comparison would be more informative: for how many sequence alignments does NREV12 perform better than GTR?
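    To make the two summaries contrasted in this comment concrete, the sketch below computes a paired t statistic on per-alignment wRF distances and also counts, alignment by alignment, which model produced the tree closer to the true one. The distances are invented for illustration and the function names are hypothetical; nothing here reproduces the manuscript's data or analysis.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Paired t statistic for the per-alignment differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def pairwise_counts(a, b):
    """Count alignments where b has the smaller distance, ties, and where a does."""
    b_better = sum(1 for x, y in zip(a, b) if y < x)
    ties = sum(1 for x, y in zip(a, b) if y == x)
    a_better = sum(1 for x, y in zip(a, b) if y > x)
    return b_better, ties, a_better


# Hypothetical wRF distances to the true tree, one value per simulated alignment.
wrf_gtr = [0.30, 0.25, 0.40, 0.35, 0.28]
wrf_nrev12 = [0.27, 0.25, 0.33, 0.36, 0.24]

t_stat = paired_t_statistic(wrf_gtr, wrf_nrev12)
counts = pairwise_counts(wrf_gtr, wrf_nrev12)

print(t_stat, counts)
```

    A positive statistic here means GTR distances are larger on average, while the counts show how that average is distributed across individual alignments, which is the additional information the comment asks for.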