The effect of gene tree dependence on summary methods for species tree inference
This article has been reviewed and is listed in:
- Evaluated articles (Peer Community in Evolutionary Biology)
Abstract
When inferring the evolutionary history of species and the genes they contain, the phylogenetic trees of genes can be different from those of the species and to each other, due to a variety of causes, including incomplete lineage sorting. We often wish to infer the species tree, but only reconstruct the gene trees from sequences. We then combine the gene trees to produce a species tree; methods to do this are known as summary methods, of which ASTRAL is currently among the most popular. ASTRAL has been shown to be accurate in many practical scenarios through extensive simulations. However, these simulations generally assume that the input gene trees are independent of each other (infinite recombination between loci). This is known to be unrealistic, as genes that are close to each other on the chromosome (or are co-evolving) have dependent phylogenies. In this paper, we develop a model for generating dependent gene trees within a species tree, based on the coalescent with recombination. We then use these trees as input to ASTRAL to reassess its accuracy for dependent gene trees. Our results allow us to evaluate the impact of any level of dependence on the accuracy of ASTRAL, both when gene trees are known and estimated from sequences. We find that a fixed amount of dependence reduces the effective sample size by a constant factor. In current phylogenomic datasets, loci are generally sampled at large genomic distances to reduce gene tree dependence, thereby limiting the number of genes available for inference. However, full independence between genes is not required for accurate species tree estimation, and excluding gene trees may reduce inference accuracy. This creates a trade-off between the number of genes used and the degree of gene tree dependence. We therefore propose a method to identify the minimum genomic separation required to maintain satisfactory inference accuracy.
Article activity feed
-
Dear Dr. He,
Thank you for submitting your manuscript to PCI Evol Biol. We have received the comments from three reviewers. They all appreciated the work but raised concerns, which are generally substantial. I would like to ask the authors to revise the manuscript and address each of the comments satisfactorily by attaching the point-by-point response to the comments of the reviewers.
We note that the RF distance, which measures the deviation of a tree topology from the true topology, corresponds to the MSE of an estimate of a continuous trait. In this sense, the paper shows that the variance of the estimate decreases with the sample size (number of genes), but the rate of decrease is not so large because of the autocorrelation between successive gene trees. As long as intragenic recombination is rare, the species tree estimate is consistent. It is important to remember the concern that genes can often undergo intragenic recombination, because there are long introns between the exons. The gene trees estimated by concatenating the exons may lead to biased inferences about the variability of the gene trees. It would be worthwhile to simulate genes as long as the real genes with introns and to show the limit of the information content in the genomes. To adequately measure the limit, I would prefer the branch score distance (Kuhner and Felsenstein (1994)), which measures the amount of deviation from the true speciation tree in units of the effective sizes of the ancestral populations. Evaluating with this measure would be ideal, but it can be considered in the discussion section if it is too difficult.
Kuhner, M. K. and Felsenstein, J. (1994) A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol., 11, 459–468.
Sincerely,
Hirohisa Kishino
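The editor's remark on autocorrelation between successive gene trees can be illustrated with a standard effective-sample-size calculation. The sketch below assumes, purely for illustration, that a per-locus statistic behaves like a stationary AR(1) series with lag-one correlation `rho`; the AR(1) form, the function name, and the example values are assumptions and are not taken from the manuscript.

```python
def effective_sample_size(n: int, rho: float) -> float:
    """Large-n approximation to the effective sample size of n observations
    from a stationary AR(1) process with lag-one correlation rho.

    Hypothetical helper: the AR(1) assumption is illustrative only.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    # The variance of the sample mean is inflated by (1 + rho) / (1 - rho),
    # so n correlated observations carry as much information as this many
    # independent ones.
    return n * (1.0 - rho) / (1.0 + rho)


# With a fixed rho, the reduction is the same constant factor for any n.
for n in (100, 1000):
    print(n, effective_sample_size(n, 0.5))
```

Under this assumption, a fixed level of dependence shrinks the sample size by a constant factor, which matches the qualitative statement in the abstract.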
-
This paper addresses the important question of whether the species tree inference method ASTRAL, which is a summary statistic method (i.e., treating inferred gene trees as observed data), is sensitive to low levels of interlocus recombination. Although several studies exist that examine the effects of intralocus recombination on species tree inference, most do not consider interlocus recombination or summary methods. One exception is the paper by Wang and Liu (2016) (cited by the authors) that examines the impact of interlocus recombination on ASTRAL but does so by using different recombination detection methods to choose loci that have greater numbers of intervening recombinations and therefore have less correlated gene trees. Recombination detection methods miss many recombinations, and the Wang and Liu paper is more about which method for choosing loci (linkage disequilibrium-based versus breakpoint detection) produces better performance, rather than the overall effects of interlocus recombination on accuracy of species tree inference. One reason that prior studies have been less concerned with interlocus recombination is that if the assumption is violated and gene trees are correlated, the likelihood simply becomes a composite likelihood (CL), and using a CL normally has more innocent effects than using an incorrect model (as when intralocus recombination causes chimeric loci to be used to infer a gene tree). The authors' results seem to support this CL interpretation. They really only see an association between accuracy (as measured by RF distance) and recombination rate when using fewer than 1600 loci. These days most analyses use several thousand loci, so the results in this paper appear largely irrelevant in practical analyses.
The findings also support the idea that the CL obtained by assuming independence among loci may cause overestimates of confidence, as is typically the case, but inferences seem to be trending towards an asymptote where inferences are consistent. Since ASTRAL is typically used to obtain point estimates, the bottom-line message seems to be to use several thousand loci and not worry about the CL approximation.
I have some minor concerns about the methodology. The authors use a heuristic method they have developed for simulating the coalescent with recombination that is somewhat similar to the Markov approximation of Wiuf and Hein. They state that exact methods such as ms/msprime are too slow. However, the computational expense is a function of the size of a region, and it is practical to simulate smaller regions, in which gene tree correlations are strong, with these methods quite rapidly. It would have been nice to see a subset of exact simulations for some parameter ranges to confirm that the results are not an artifact of the authors' heuristic simulation method. In general, it would be nice to see more extensive tests of the correctness of the simulated data.
Overall, I think this is an important contribution that explores the behavior of a widely used species tree inference method, ASTRAL, when linkage disequilibrium is significant. However, I would like to see more practical discussion of what these results mean for working phylogeneticists. For example, the authors could recommend that using more loci widely distributed across genomes can improve accuracy and nullify the effects of linkage disequilibrium when using a program such as ASTRAL.
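Since accuracy in this review is discussed in terms of RF distance, a small sketch of how that distance can be computed may help readers unfamiliar with it. The code below is a minimal illustration, assuming trees are written as nested Python tuples with string leaf names and share the same leaf set; the representation and function names are hypothetical and are not the tooling used by the authors.

```python
def leaf_sets(tree, out):
    """Return the leaves under `tree`; append each internal node's leaf set to `out`."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def bipartitions(tree):
    """Non-trivial splits of the unrooted version of `tree`.

    Each split is stored as the side that does not contain a fixed
    reference leaf, so the same split always has the same representation.
    """
    clades = []
    all_leaves = leaf_sets(tree, clades)
    ref = min(all_leaves)
    splits = set()
    for clade in clades:
        side = all_leaves - clade if ref in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
    return splits


def rf_distance(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


species = ((("A", "B"), "C"), ("D", "E"))
estimate = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(species, estimate))
```

An RF distance of zero means the two topologies agree on every split; larger values mean more disagreement.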
Title and abstract
Does the title clearly reflect the content of the article? [ ] Yes, [X] No (please explain), [ ] I don't know
The title is too vague. Most people will not know that "summary methods" are for species tree inference, and "affects" is also quite uninformative. Perhaps something like: "Gene tree correlations, interlocus recombination, and the accuracy of summary based species tree inference"
Does the abstract present the main findings of the study? [X] Yes, [ ] No (please explain), [ ] I don’t know
Introduction
Are the research questions/hypotheses/predictions clearly presented? [X] Yes, [ ] No (please explain), [ ] I don’t know
Does the introduction build on relevant research in the field? [X] Yes, [ ] No (please explain), [ ] I don’t know
Materials and methods
Are the methods and analyses sufficiently detailed to allow replication by other researchers? [ ] Yes, [X] No (please explain), [ ] I don’t know
The description of the authors' novel simulation method is incomplete.
Are the methods and statistical analyses appropriate and well described? [X] Yes, [ ] No (please explain), [ ] I don’t know
Results
In the case of negative results, is there a statistical power analysis (or an adequate Bayesian analysis or equivalence testing)? [X] Yes, [ ] No (please explain), [ ] I don’t know
Are the results described and interpreted correctly? [ ] Yes, [X] No (please explain), [ ] I don’t know
I think the authors should focus more on the large-sample results; many of the trends in the graphs are only pronounced for very small numbers of loci, far fewer than are used in many genomic analyses.
Discussion
Have the authors appropriately emphasized the strengths and limitations of their study/theory/methods/argument? [X] Yes, [ ] No (please explain), [ ] I don’t know
Are the conclusions adequately supported by the results (without overstating the implications of the findings)? [X] Yes, [ ] No (please explain), [ ] I don’t know
-
This study “Gene tree dependence from finite recombination affects summary methods” conducts a large-scale series of simulation-based analyses to dissect the impacts of recombination (and/or linkage dependency) on species tree inference with ASTRAL. The study spans new findings and methods, including a new simulation-based model for gene tree dependency, comparisons with varying numbers of gene trees, locus lengths, amount of ILS, recombination rates, and target trees (human, fungal). By varying these conditions, the authors seek to gain insight into the role of gene tree dependency among loci while contributing to broader knowledge of the complexity of species tree inference at scale. All of the evolutionary and experimental factors targeted by this study are important considerations for both theory and empirical phylogenetics. It is well known that current species tree methods sometimes struggle with complex and realistic processes, such as recombination and/or lack thereof between loci (i.e., genealogical dependency). Their results contribute to a growing appreciation of the complexity of phylogenomic problems, and the challenges of model realism vs. analytical practice.
Altogether, I enjoyed reading your manuscript and thinking about the problems targeted by the study. Also, I commend the authors on the scale of the study, which includes varying a large number of evolutionary and experimental factors throughout. Indeed, the thinning experiment was perhaps one of the more interesting angles of the study, as this remains a popular strategy in attempts to address linkage. That said, I have a few suggestions for improvement in your study, primarily focused on the clarity of the analyses and interpretation.
Title and abstract
Does the title clearly reflect the content of the article? [X] Yes
Does the abstract present the main findings of the study? [X] Yes
Introduction
Are the research questions/hypotheses/predictions clearly presented? [X] Yes
Does the introduction build on relevant research in the field? [X] Yes
The introduction clearly explains the motivation, however, I have just a couple suggestions:
Lines 3-9: please provide relevant citations for statements
Lines 45-46: please provide citations for the mentioned Bayesian species tree methods.
Materials and methods
Are the methods and analyses sufficiently detailed to allow replication by other researchers? [X] No
Are the methods and statistical analyses appropriate and well described? [X] Yes
Sufficient details are generally provided, however more clarity could be helpful with the following suggestions:
Figure 2: the figure caption is insufficient in detail relative to the in-text mentions. For example, is the black tree considered the “guide tree”? Also, under this model, are these gene samples occurring within a species tree? It would be helpful to adjust this figure for clarity.
Line 143: the justification for the new approach could benefit from a little more clarity. The authors state that “ms” is too slow to use. Just how slow? (predicted times?)
Line 148: please reference (from the Supplementary?) the section that discusses the comparison with msprime. For example “(see Supplementary Materials: comparison with msprime)”. This is an important comparison of the paper to provide confidence in the new simulation model by benchmarking with a standard in the field.
Line 148 (and the discussion): Importantly: the authors state “we conduct some tests on an ultrametric tree to determine that the results from our model do not differ significantly from those produced using gene trees generated by msprime.” Yet, I did not find the results of the statistical tests that evaluate whether their model and msprime are indeed equivalent. Perhaps I am missing something here. Please provide these results as they are important to the arguments of the study. Figure 24b is only a visual comparison; can the authors consider providing a more quantitative comparison to bolster their new method? There is minimal discussion on this point as well.
Line 180: Please consider providing the model parameters used in the simulation that were estimated from the biological dataset (and how were they estimated?).
Results
Are the results described and interpreted correctly? [X] No.
(Discussion/results): please provide a more quantitative comparison of the proposed simulation model vs. the field standard of msprime (see comment above in Methods).
Lines 269-270: please consider adding citations for “common practice to thin…” for examples in the literature.
Lines 317-322: The strategy used to estimate recombination rate from the empirical data is a somewhat weaker point of the study. Are there citations perhaps for the performance of this approach for R estimation that matches simulation vs. observed RF averages? There are many other reasons besides recombination that can cause RF distances to differ (poor model fit, etc.), and this approach of matching average pairwise RF seems uncommon (or very rare at least). At least, this can be discussed as a potential limitation in the discussion, as more model-based approaches for estimating recombination rate exist (LDhelmet, etc. and many more).
The thinning experiments (lines 267-28)) are among the more interesting components of the study. However, if I understand correctly, the number of gene trees differs between unthinned and thinned datasets, meaning that the number of gene trees, rather than linkage, explains their results. If I could suggest one analysis to the authors: run an experiment that holds the number of gene trees constant. For example, pick 1000 unthinned trees vs. 1000 thinned trees, which can be accomplished in the simulation experiment. This would more clearly target the question at hand. However, this is only a suggestion for the authors that would bolster their arguments, and at least should be mentioned as a limitation in the discussion.
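The equal-count comparison suggested above can be sketched as follows. This is a minimal illustration, assuming the simulated gene trees are held in a list ordered by genomic position; the function name and the choice to take the first `k` trees as the unthinned set are hypothetical and are not the authors' procedure.

```python
def equal_count_sets(trees, n, k):
    """Return (unthinned, thinned), each containing exactly k trees.

    `trees` is assumed to be ordered along the genome. The unthinned set
    is k adjacent trees; the thinned set takes every n-th tree, so its
    members are n positions apart and less strongly linked.
    """
    needed = (k - 1) * n + 1
    if len(trees) < needed:
        raise ValueError(f"need at least {needed} trees, got {len(trees)}")
    unthinned = trees[:k]
    thinned = trees[::n][:k]
    return unthinned, thinned


# Integers stand in for gene trees so the positions are easy to see.
positions = list(range(50))
adjacent, spaced = equal_count_sets(positions, n=5, k=10)
print(adjacent)
print(spaced)
```

Passing both sets to the same species tree method would then isolate the effect of spacing, because the number of input trees is identical.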
Discussion
Have the authors appropriately emphasized the strengths and limitations of their study/theory/methods/argument? [X ] No (please explain).
Are the conclusions adequately supported by the results (without overstating the implications of the findings)? [X] Yes
Overall, the discussion does a good job of interpretation. However, I have a few suggestions:
- Mention and cite other methods for recombination rate inferences from genomes
- Comparison of the current method vs. msprime (the results of the tests mentioned).
- More appropriately mention and cite other summary methods (lines 433-438)
- Mention limitations of focusing on only two phylogeny case studies (mammals, fungi), as the results of this study could differ depending on the specific target phylogeny.
- Mention that the (current) thinning experiment is likely driven by the number of gene trees, rather than the ad-hoc strategy of thinning itself. Simulations that hold the number of gene trees constant (but did/didn’t thin) would be needed to address this more clearly (see comment in results).
-
The authors present a very interesting and highly topical set of analyses on the effects of recombination on phylogenomic analyses.
Title and abstract
Does the title clearly reflect the content of the article? [X] Yes, [ ] No (please explain), [ ] I don't know
Does the abstract present the main findings of the study? [X] Yes, [ ] No (please explain), [ ] I don’t know
Introduction
Are the research questions/hypotheses/predictions clearly presented? [ ] Yes, [ ] No (please explain), [ ] I don’t know
Does the introduction build on relevant research in the field? [X] Yes, [ ] No (please explain), [ ] I don’t know
Materials and methods
Are the methods and analyses sufficiently detailed to allow replication by other researchers? [ ] Yes, [X] No (please explain), [ ] I don’t know. Could be improved. Please refer to my comments below.
Are the methods and statistical analyses appropriate and well described? [ ] Yes, [ ] No (please explain), [X] I don’t know. In some cases, I think the authors' text was ambiguous/hard to follow and requires more clarification and explanation.
Results
In the case of negative results, is there a statistical power analysis (or an adequate Bayesian analysis or equivalence testing)? [ ] Yes, [ ] No (please explain), [ ] I don’t know. I think not relevant to this study?
Are the results described and interpreted correctly? [ ] Yes, [ ] No (please explain), [X] I don’t know. I cannot tell whether results are interpreted correctly, as some results are counterintuitive to me, and the methods need clarification, at least to me.
Discussion
Have the authors appropriately emphasized the strengths and limitations of their study/theory/methods/argument? [ ] Yes, [ ] No (please explain), [X] I don’t know. See above. I think wording is ambiguous/confusing in spots and until cleaned up and clarified, I am not able to discern whether things are valid or not.
Are the conclusions adequately supported by the results (without overstating the implications of the findings)? [ ] Yes, [ ] No (please explain), [X] I don’t know. Same here. My main concern is the connection between the simulation results and real data results. This connection is key, and for me, as a reader, I could not easily make this connection, but perhaps with further clarification on methods, all would be well?
The authors might consider the following points in revising/improving their manuscript.
1) In the abstract, where the authors say “ASTRAL has been shown to be practically accurate in many scenarios through extensive simulations. However, these simulations generally assume that the input gene trees are independent of each other”, they should probably also mention that all or nearly all simulations have assumed no recombination within loci, an assumption that is never known to hold, is rarely tested, and unrealistically biases summary coalescent methods toward greater accuracy. In particular, when loci are extremely long, as in the mammalian dataset referred to in the current paper (in the Abstract), the assumption of no recombination within loci is surely violated. This will bias individual loci, because each locus will be a mini-concatenation of multiple different ‘c-genes’, and may cause failure in anomaly zone conditions where general concatenation also fails under the multispecies coalescent (MSC).
2) In the abstract, where the authors say “Our results show that ASTRAL performs more poorly with greater dependence, both when gene trees are known and estimated from sequences. Indeed, the effect of dependence between gene trees is comparable to (if not greater than) the effect of gene tree estimation error”, if they can demonstrate this point in the remainder of the paper, this is a very interesting and important conclusion.
3) In Figure 1, as is, in this example, the derived ‘black’ allele is an autapomorphy and has no phylogenetic information for the three alleles sampled at the tips of the gene tree. The hypothetical scenario would make more impact if the derived black allele evolved in the common ancestor of A, B, and C and was then sampled in taxa A and B (thus being a synapomorphy in the gene tree that groups A+B) while the ancestral white allele was sampled in C. This way, it would make more clear to the reader how a derived character state (black) was ‘mis-sorted’ amongst ancestral polymorphism at an internode in the species tree, resulting in a gene tree that groups A+B while the species tree supports B+C?
4) On p. 3, paragraph 1, where the authors say “A number of other paradigms are also available, including full-likelihood, Bayesian, and co-estimation methods”, they should provide a citation or two that references these types of methods, or use “e.g.” and cite one as an example of this type.
5) On p. 3, line 42, where the authors say “Furthermore, extensive simulations have been performed studying its accuracy under practical conditions, showing that ASTRAL is highly accurate under the MSC model”, it could be argued that nearly NONE of the simulations done to date have been ‘realistic’, as nearly all have assumed no recombination within loci, including all citations given here. This is noted in the next paragraph, but why say these prior simulations were realistic in one paragraph and contradict this in the next paragraph?
6) On p. 3, line 54, where the authors say “The former assumption has been tested (Lanier and Knowles, 2012; Zhu et al., 2022) and found to have relatively little practical effect on the accuracy of species tree inference”, I think this is true for the very limited simulations that were done in these studies. But very few challenging situations were simulated. Most of the prominent disagreements among phylogenetic methods generally focus on a few nodes in trees where branch lengths are super short and divergences among taxa are high with much rate variation among taxa. In such situations, one would expect ‘concatenation’ within loci (i.e., merging multiple c-genes in a locus as a single locus = concatenation) to cause failure. But such situations have, I think, not been simulated, so researchers continue to cite the few papers that have examined recombination within loci as not problematic. But surely, such concatenation within loci (due to recombination within a locus) will impact results; it just hasn’t been simulated, and these are the types of situations that give different results for different methods in the real world (e.g., conflicts among different coalescent methods or conflicts between concatenation and coalescence)?
7) On p. 3, line 57, change “genes are located near to each other on the same chromosome” to “some genes are located near to each other on the same chromosome”?
8) On p. 6, line 144, where the authors say “Although our model lacks the long-range dependence structure of the full coalescent with recombination, it is faster to simulate and retains more flexibility”, I trust the authors, and more importantly other reviewers or the editors, that the mathematics and logic of the simulation procedure is valid.
9) On p. 6, line 166, the authors note that “The species tree, shown in Figure 3, was previously estimated with MP-EST (Liu et al., 2010) on the biological dataset from Song et al. (2012), containing 447 genes with average length 3099 bp”, but this statement is somewhat misleading or lacks citation to lots of prior work on this dataset that is relevant here. Multiple papers have critiqued this dataset because the “average length of genes” is not 3099 bp but instead is ~140,000 bp (Springer and Gatesy, 2016 MPE; also see Gatesy and Springer, 2014 PNAS)! As authors on this paper realize, 140,000 bp of DNA is a long stretch of DNA to be not recombining. Also, many technical errors in this dataset have been noted by multiple authors (mixed up terminals and homology errors) and several studies have analyzed the dataset using ASTRAL in the past (and getting a different phylogenetic result from MP-EST used in the original paper). Is any of this problematic in how gene trees were simulated in the current study, as described in the following paragraph?
10) On p. 6, line 182, the authors say “We then use IQ-TREE (Minh et al., 2020; Nguyen et al., 2015) to estimate gene trees from these sequences, and use the estimated gene trees as input to ASTRAL”. Were gene tree nodes with zero support (according to approximate likelihood ratio test, very low bootstrap, or near zero-length branches) collapsed or were gene trees fully resolved ML trees? IQ-TREE generally spits out one ML tree, but for such short sequence lengths (500-1000 bp) there are usually multiple zero length internal branches in gene trees. Most modern workers collapse such low/no support branches as this is known to greatly increase accuracy of species tree inference when ASTRAL is used. Not doing so makes the simulations difficult to interpret within a modern context wherein this is basically common practice (or should be). It would be best to collapse branches or use weighted ASTRAL to infer species trees using bootstrap or some other support scores mapped on optimal ML gene trees as suggested by Mirarab and many others.
11) On p. 6, line 185, where the authors say “From initial results, we found that accuracy varies the most when the recombination rate R lies between 0 and 1, so we use this range for our recombination rate parameter”, it might be good to relate right here what the implications are of recombination rates between 0 and 1. For example, if, as in the Song et al. mammal dataset, 447 genes are randomly sampled across a mammal-sized genome, how much dependence between these 447 loci would there be? I assume that for a mammalian genome, which is quite large, 447 randomly positioned loci would be completely or nearly completely independent of each other? Is this the case, or would much dependence among loci be expected given realistic rates of recombination (as estimated from real mammalian DNA)? I guess my point here is that the dependence among loci simulated here should be directly related to realistic biological conditions for this study to have impact and for its conclusions to be valid. So, even here, early in the methods section, it would be good to relate to the reader the commonsense interpretation of these simulated conditions relative to real data. Even better would be to ask, given that the genomic positions of the 447 loci used in Song et al. are known in species with well assembled genomes such as H. sapiens, would the authors expect any dependence among loci in this empirical dataset given the known distances among loci in H. sapiens (or in other mammals in the dataset with well assembled genomes) and the authors’ best estimates of average recombination rates in human, mouse, or other mammals? It would be good for the reader to have a better feel early on for how these artificially generated data relate to the real data so that the importance (or not) of the simulations can be easily assessed by the reader.
12) On p. 7, line 7, where the authors say that “It has been shown that ASTRAL performs worse with an increased amount of ILS (Mirarab et al., 2014), so we are interested in the performance of ASTRAL with dependent gene trees in this case. We multiply the branch lengths of the species tree by 0.2 (denoted by 0.2×), which is equivalent to multiplying the effective population size by 5”, could it be clarified whether all branch lengths are multiplied by 0.2 (internal and terminal branches) or whether just the internal branch lengths are multiplied by 0.2? When internal branch lengths are reduced, there is more ILS which makes phylogenetic inference more challenging and there are fewer informative substitutions on shorter branches that makes inference more challenging, but when terminal branch lengths are reduced by 0.2, there are fewer multiple hits, less homoplasy, and less gene tree reconstruction error, which makes reconstruction of gene trees more accurate. So, if all branch lengths are reduced, this could increase or decrease accuracy of summary coalescent methods due to the opposed effects of increased ILS/reduced numbers of synapomorphies on shorter internal branches but also a reduction in multiple hits at sites on long terminal branches (and less homoplasy).
13) On p. 8, line 211, the authors say “To compare the performance of our model with msprime (Baumdicker et al., 2022; Kelleher et al., 2016), which is restricted to ultrametric species trees”. Would it be possible to do the same for the mammalian dataset here and not just the fungus one? Or is there a daunting time constraint, or did the mammal dataset give a result that was not good and did not support the authors’ view? I am not saying this latter point is the case, but a critical reader of the final paper might think it odd not to compare with msprime for the mammal dataset as well, and might think the authors are hiding something.
14) On p. 8, Figure 4. Most phylogenomic datasets these days include thousands of loci. If I am interpreting this figure correctly, it seems there is very little error, no matter how much dependence among loci when sampling 800, 1600, or 3200 loci? Or is this not correct?
15) On p. 9, line 232, the authors say “Figure 4b shows the results for estimated gene trees from mixed sequence lengths, with detailed results for 500 bp and 1000 bp …”. Are these the same simulated gene lengths used in Figure 4a? Maybe this can be stated in the Figure 4 legend?
16) On p. 9, line 238, the authors say “In Figure 5, we compare the relative effects of these factors with recombination rate fixed to two realistic values (estimated in the Estimating the recombination rate section) of 0.317 and 0.055.” The ‘estimating the recombination rate section’ should be put in the Methods section of the paper. It is awkward for the reader to have to jump ahead in the text in the Results section to try to understand methods that are not in the Methods section but instead later in the Results section.
17) On p. 11, line 273, the authors say “To study this effect, we simulate 1000 dependent gene trees from the 37-taxon mammalian species tree for several recombination rates (0.2, 1, and 5). We then thin our dataset by subsampling every nth tree, for n ranging from 1 to 50, and use the resulting trees as input into ASTRAL. This is replicated 1000 times, with the results shown in Figure 10”. But for a typical mammalian genome of ~3 giga base pairs (3,000,000,000 bp), if one distributes 1000 loci evenly across this span, there would be an average of 3,000,000 bp separating each locus from its nearest neighbor. Wouldn’t this long distance between loci guarantee, assuming typical mammalian rates of recombination, that loci are basically completely independent of one another? Or is this not correct? If correct, this would seem to contradict the strongest conclusions of this manuscript. Even if one distributes 10,000 loci along a genome that is 3,000,000,000 bp long, the genes would be on average 300,000 bp apart from each other, which is still a large genomic distance. So, a question related to the ‘realistic rates of recombination’ used in the paper is: ‘is 3 million bp between loci really not enough distance for loci to act as completely independent, given average recombination rates for mammalian genomes?’ I think a connection between the above and the dependencies between loci simulated in the paper would assure the reader that the dependencies simulated in the manuscript are relevant to real genome sizes, real recombination rates, and the numbers of loci sampled in most modern phylogenomic studies (usually a couple of thousand loci). From my perspective, the ‘thinning’ results presented here seemed suspect, unless I am not understanding exactly what was done (if so, I apologize to the authors).
18) On p. 13, line 307, the authors say “This dataset contains 424 effective genes (after 21 genes with mislabelled sequences and outliers were removed from the original 447-gene dataset) with average length 3157 bp”. There are several issues here. First, were the same mislabeled sequences and outliers those reported in previous publications? If so, this work should be cited. Second, as was noted earlier in this review, the average length of loci in this dataset was not 3157 bp and was more than an order of magnitude longer (due to long introns that were excluded from each gene). So, as argued previously (and not refuted or even contested by Song et al.), surely there has been extensive recombination within each of these often very long loci analyzed by Song et al. (longest gene was >1,000,000 bp in the human genome!). Some discussion of this point related to past discussion on this topic should be included since the current manuscript focuses on the effects of recombination and it has been shown that different exons in these very long genes do not yield the same gene trees due perhaps to their great distance from each other in the genome. Again, it would be best to collapse zero or near zero length internal branch lengths in the ML gene trees, as this has been shown over and over to increase accuracy in simulations and congruence in empirical phylogenomic studies (and is common practice in recent papers).
19) On p. 13, line 315, the authors should define what they mean here by ‘consecutive genes’ and ‘adjacent’. To me, the wording is too ambiguous and confusing as is.
20) On p. 13, line 317, the authors say “To estimate the recombination rate, we match the average consecutive pairwise normalized RF distance between the real trees to those from simulations. The average consecutive pairwise normalised RF distance of the real gene trees is 0.359 (pairs of genes that are in different chromosomes are not taken into account); this is significantly different from the value for simulated independent gene trees on the same species tree (0.585; a t-test of the difference gives p-value < 2.2 × 10−16). This indicates that the real gene trees are not independent.” This is a surprising result to me. Did the authors estimate gene trees for the set of simulated genes, or did they use the ‘true’ simulated gene tree topologies? How far apart from each other are the genes in real mammalian genomes? If they are a million bp apart (or more) on average, what explains the dependence observed here? If the comparison was not to ‘true’ simulated gene trees and gene trees were instead estimated from simulated data, how long was each gene in the simulation? A bit more explanation in this section would be good, as this is a critical section (relating the simulation results to real-world data), to ensure that the reader understands exactly what has been done and its implications. To start, a clear definition of what is meant by “the average consecutive pairwise normalized RF distance” should be given. To me, this is too ambiguous as currently written to be sure about exactly what the authors are doing, and I think it will be confusing to other readers of the manuscript. Perhaps a diagram might help, or more text explanation, or both.
21) On p. 13, line 323 and following paragraph. I do not understand this text; it seems shifty to me, and I think that for most readers to understand what was done here, more explanation/clarification needs to be given as to how these analyses were done and why in this way. Again, this is a very critical part of the manuscript where the authors attempt to convince the reader that dependence among gene trees is an important factor, but given the size of mammalian genomes, a sample of just a few hundred loci, and reasonable recombination rates, it does not seem immediately credible that such dependence among loci exists. In line 325, where the authors say they used both true and estimated gene trees from the simulations, how many bp were simulated for each gene? Was this the same length that was used in the empirical mammalian dataset, or is this not relevant, or am I completely confused here?
22) On p. 16, line 406, the authors note that “In this paper, we developed a model to simulate dependent gene trees within a species tree under a realistic process of incomplete lineage sorting and recombination”. A key question, to me, is whether ‘a realistic process’ was simulated in this paper, and whether dependence among loci is a huge problem when, say, 1000 loci are randomly sampled from large (say 3,000,000,000 bp) genomes. Even with 1000 loci randomly sampled, most genes/loci would be expected to be very far apart, by >1 million bp. So, I think a key thing for the authors to address is the connection of their simulations to reality for the mammalian dataset, where much information is freely available (e.g., the actual length of the genes, which is quite long and too long for credible coalescent analysis, in my view, as well as the actual distance between genes in this dataset, which is known for many species with good genome assemblies, such as H. sapiens). Given the text as is, the authors use a procedure (to me, murky) to make a connection between simulations and reality. But after simply mapping the genes to the Homo genome, noting the distances between them, and then applying estimated recombination rates for Homo (or other mammals), it would seem hard to argue that the loci are not mostly independent of each other, given how far apart they are. Or not? I do not know, but this should be explored.
-
Estimating the species tree is a fundamental first step in evolutionary biology. Gene trees reflect the species tree, but the two are not necessarily identical, particularly because of incomplete lineage sorting (ILS). Because of recombination, gene tree topologies vary among loci. The divergence times in gene trees are generally older than those in the species tree because of ancestral polymorphism. The extent and variability of coalescences in gene trees relative to speciation in the species tree depend on effective population sizes at ancestral nodes. Therefore, multi-locus gene trees contain information about speciation times and effective population sizes. Powerful methods have been developed to reconstruct the species tree from a set of gene trees. Summary methods and coalescent model-based methods account for the variability of gene trees. These methods have been proven to be statistically consistent, meaning that the reconstructed species tree converges to the topology of the true species tree as the number of gene trees analyzed grows. The question is how many genes are required to achieve sufficient accuracy.
He et al. (2026) examined the accuracy of ASTRAL by measuring the normalized Robinson-Foulds (RF) distance between the estimated tree and the true tree. This was the first time that the accuracy had been examined while taking into account dependency among the gene trees due to insufficient recombination. They generalized the two-locus, three-taxon model of Slatkin and Pollack (2006) to simulate dependent trees at loci separated by a specified recombination rate. By simulating dependent gene trees using a previously reported 37-taxon mammalian species tree, they demonstrated the impact of insufficient recombination rates (R) between neighboring loci on accuracy and the number of loci required for analysis. They also found that the effective sample size is proportional to the number of loci analyzed, given the value of R.
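As a rough illustration of the accuracy measure mentioned above, the following minimal sketch computes a normalized RF distance between two small trees. It assumes trees are written as nested Python tuples of taxon names, that they are treated as unrooted and binary, and that the distance is normalized by 2(n − 3) for n taxa; the function names and the toy trees are hypothetical and are not taken from He et al. (2026).

```python
def collect_clades(tree):
    """Return (leaf set, list of leaf sets below each internal node)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_found = collect_clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def bipartitions(tree):
    """Non-trivial splits of the unrooted tree, each stored as the side
    that does not contain a fixed reference taxon."""
    all_leaves, found = collect_clades(tree)
    reference = min(all_leaves)
    splits = set()
    for clade in found:
        other = all_leaves - clade
        if len(clade) >= 2 and len(other) >= 2:
            splits.add(other if reference in clade else clade)
    return splits


def normalized_rf(tree_a, tree_b):
    """Symmetric difference of split sets divided by 2(n - 3).
    Assumes both trees are binary, share the same taxa, and n >= 4."""
    n_taxa = len(collect_clades(tree_a)[0])
    differing = bipartitions(tree_a) ^ bipartitions(tree_b)
    return len(differing) / (2 * (n_taxa - 3))


# Hypothetical five-taxon example trees.
true_tree = ((("A", "B"), "C"), ("D", "E"))
estimated_tree = ((("A", "C"), "B"), ("D", "E"))
distance = normalized_rf(true_tree, estimated_tree)
```

A value of 0 means the two trees share every split and a value of 1 means they share none; in the toy example one of the two non-trivial splits differs.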
Here, R = 2Ne × rd is the recombination rate per individual per coalescent unit, where Ne is the effective population size, r is the recombination rate per site per generation and d is the genomic distance between loci. Accuracy decreases significantly when loci are distributed densely and R is less than 1. For example, for Ne = 1e4 and r = 1e-8, R = 1 corresponds to d = 5,000 bp. The length of intergenic regions varies greatly, from a few kb to more than a Mb, with a median of around 50 kb. Under these example values, the median spacing is about ten times the distance at which R = 1, so most neighboring genes fall outside the range in which accuracy is strongly affected. Therefore, filtering the genes to validate the assumed independence between loci is generally unnecessary.
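The arithmetic behind this example can be checked with a short sketch. It uses only the definition R = 2Ne × rd and the example values Ne = 1e4 and r = 1e-8 given above; the function and variable names are hypothetical, and real recombination rates vary along the genome and between species, so the numbers are illustrative only.

```python
def scaled_recombination_rate(ne, r, d):
    """R = 2 * Ne * r * d, for genomic distance d in bp."""
    return 2 * ne * r * d


def distance_for_rate(ne, r, target_rate):
    """Genomic distance d (bp) at which R equals target_rate."""
    return target_rate / (2 * ne * r)


NE = 1e4          # effective population size (example value from the text)
R_PER_SITE = 1e-8  # recombination rate per site per generation (example value)

# Distance at which R = 1, expected to be about 5,000 bp.
d_at_r1 = distance_for_rate(NE, R_PER_SITE, 1.0)

# R at the median intergenic spacing of about 50 kb quoted in the text.
r_at_median_spacing = scaled_recombination_rate(NE, R_PER_SITE, 50_000)

# R at the 3,000,000 bp average spacing raised in reviewer comment 17.
r_at_3mb_spacing = scaled_recombination_rate(NE, R_PER_SITE, 3_000_000)

print(f"d at R = 1: {d_at_r1:.0f} bp")
print(f"R at 50 kb spacing: {r_at_median_spacing:.1f}")
print(f"R at 3 Mb spacing: {r_at_3mb_spacing:.1f}")
```

With these assumed values, a 50 kb spacing gives R of about 10 and a 3 Mb spacing gives R of about 600, both well above the range from 0 to 1 in which accuracy was reported to vary the most.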
Furthermore, the median total gene length is 10–100 kb, whereas the median total exon length per gene is around 1.2 kb. Intron sequences are not used for determining orthologues in large-scale databases such as OrthoDB and OrthoMaM, partly because of their low sequence conservation and the large variation in sequence lengths between species. Intron lengths range from ~1 kb to ~50 kb, which suggests that there is recombination between exons within a gene. Concatenation of exon sequences may preserve the mean difference between coalescent and speciation times. However, it brings the variation between genes closer to the mean. This may affect the inference of effective population sizes, and it is difficult to predict its effect on the inference of species tree topology, particularly the order of speciation events that occurred over short periods of time. Although an exon-tree-based inference seems like an option, each exon tree contains a lot of uncertainty because exons are short (averaging 150–250 bp). He et al. (2026) focused on assessing the effect of inter-locus dependence. Nevertheless, their findings could also shed light on species tree reconstruction and phylogenetics.
References
He W, Scornavacca C, Chan Y-B (2026). The effect of gene tree dependence on summary methods for species tree inference. bioRxiv, ver. 3, peer-reviewed and recommended by PCI Evolutionary Biology. https://doi.org/10.1101/2024.06.06.597697
Slatkin M, Pollack JL (2006). The concordance of gene trees and species trees at two linked loci. Genetics 172, 1979–1984.
-
