The population frequency of human mitochondrial DNA variants is highly dependent upon mutational bias
Abstract
Next-generation sequencing can quickly reveal genetic variation potentially linked to heritable disease. As databases encompassing human variation continue to expand, rare variants have been of high interest, since the frequency of a variant is expected to be low if the genetic change leads to a loss of fitness or fecundity. However, the use of variant frequency when seeking genomic changes linked to disease remains very challenging. Here, I explored the role of selection in controlling human variant frequency using the HelixMT database, which encompasses hundreds of thousands of mitochondrial DNA (mtDNA) samples. I found that a substantial number of synonymous substitutions, which have no effect on protein sequence, were never encountered in this large study, while many other synonymous changes are found at very low frequencies. Further analyses of human and mammalian mtDNA datasets indicate that the population frequency of synonymous variants is predominantly determined by mutational biases rather than by strong selection acting upon nucleotide choice. My work has important implications that extend to the interpretation of variant frequency for non-synonymous substitutions.
Note: This rebuttal was posted by the corresponding author to Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Reply to the reviewers
Reviewer 1 (Evidence, reproducibility and clarity):
The main message of this paper, as far as I understood since I am not a molecular bioinformatician but I am certainly interested in mtDNA variations especially related to disease, is that there is a very obvious bias among synonymous changes in the ORF of human mtDNA, more frequent for amino acids with 4 variants, more frequent in P position, and much more frequently characterized by transversion rather than transition substitutions. This survey is well written and, although edited in a rather technical language, the message is reachable and interesting. I also agree on the conclusions of the Author concerning the considerations that this set of new data should prompt one to draw also considering non-synonymous, potentially pathogenic mutations. The only contribution I feel I can provide to this manuscript is to invite the Authors to consider the possibility that the selection may be due to a preferred codon bias, linked to the higher or lower compliance of different codons to be translated by the translational in situ machinery of mitochondria. I am not sure that this applies also for mitochondrial ribosomes and related factors (you may want to ask Aleksey Amunts in Stockholm or Bob Lightowlers or Zoscha Lightowlers in Newcastle on this matter). I do know that this is certainly a problem for recombinant proteins containing, for instance, a mammalian MTS fused with a bacterial restriction enzyme; in most cases the bacterial sequence has to be recoded using the preferred codons for the mammalian system in order to increase translation by a eukaryotic (mammalian) translation machinery. I wonder whether you could discuss this possibility in your paper and maybe perform some further comparative measurement to test it.
I appreciate the supportive comments of Reviewer 1 regarding the accessibility of my manuscript, and I address comments related to codon bias below.
Reviewer 1 (Significance):
The paper provides novel information on the structure and constraints of mtDNA variants in humans, opens an area of investigation which is new and potentially relevant, with some possible implications also on pathogenic mtDNA mutations in humans.
I thank Reviewer 1 for their positive comments about the novelty of this work and the important implications of my study.
Reviewer 1 (Referee Cross-commenting):
I said in my first comment that I am not a bioinformatician, but Referee 2 did a great job in identifying some critical points and suggesting to the Authors how to address them. I maintain my opinion, which I think is shared by Referee 2, that the paper conveys an interesting and rather unexpected message, and that if the Authors are able to respond properly to the points raised by Referee 2, the paper should be published.
I am quite glad to hear that Reviewer 1 would like to see this manuscript published, provided that the items noted by the reviewers are properly addressed.
Response to Reviewer 1:
R1Q1 (Continuation from Referee Cross-commenting): I confirm that the only contribution I feel I can provide to this manuscript is to invite the Authors to consider the possibility that the selection may be due to a preferred codon bias, linked to the higher or lower compliance of different codons to be translated by the translational in situ machinery of mitochondria. I wonder whether the Authors could consider this possibility in the Discussion and possibly perform some further comparative measurement to test it.
R1A1: My manuscript takes into consideration the possibility that codon-specific preferences would determine the frequency of mtDNA variants. Findings that argue against codon bias as a strong source of selection include:
At two-fold degenerate P3s, nearly every site (> 97%) harbored at least one HelixMTdb sample associated with a non-reference base. It is worth noting that HelixMTdb is not enriched for known mitochondrial disease variants.
SSNEs are very tightly associated with transversions from the human reference sequence, implicating mutational biases as a cause of any limited diversity in the HelixMTdb.
Every possible base can be found at 99% of >500 analyzed I-P3 positions (those P3s at which the base at codon positions one and two is identical throughout the alignment), arguing against the idea that codon bias plays a significant role in controlling variant frequency across mammals. The only exception that I identified in my extensive analysis is the P3 found within the first methionine codon of COX3.
Earlier, more limited studies of mitochondrial codon choice (citations of these earlier studies can be found in the manuscript) also argue against substantial selection based upon codon choice.
Finally, I would note that the set of tRNAs encoded by vertebrate mtDNAs is quite limited, with only one tRNA linked to each codon family defined by codon positions P1 and P2. There is no evidence, to my knowledge, that nucleus-encoded tRNAs enter human mitochondria. Therefore, the scope of potential selection linked to, for example, translation speed and protein folding seems particularly limited at vertebrate mitochondria.
While most evidence does not support strong selection on mtDNA codon choice in vertebrates, I do report divergence in TSS distributions obtained from the I-P3s of different amino acids within the same degeneracy class (e.g., two-fold purine, two-fold pyrimidine, four-fold), hinting at some minimal role for codon preferences at P3. However, on the whole, mutational propensities are likely to be the predominant factor controlling synonymous variation.
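The I-P3 criterion used in the points above (third codon positions at which codon positions one and two are identical throughout the alignment) can be illustrated with a minimal Python sketch. The function name `ip3_base_sets` and the toy alignment are hypothetical and are not taken from the manuscript's scripts; the sketch assumes a gap-free, in-frame coding alignment and simply reports which bases occur at each qualifying P3.

```python
from typing import Dict, List, Set

BASES = set("ACGT")


def ip3_base_sets(aligned_cds: List[str]) -> Dict[int, Set[str]]:
    """Map each I-P3 codon index to the set of bases seen at its third position.

    A codon column counts as an I-P3 here when codon positions one and two
    are identical in every sequence of the alignment.
    """
    if not aligned_cds:
        return {}
    length = len(aligned_cds[0])
    if length % 3 != 0 or any(len(seq) != length for seq in aligned_cds):
        raise ValueError("sequences must be codon-aligned and of equal length")

    result: Dict[int, Set[str]] = {}
    for codon_index in range(length // 3):
        start = codon_index * 3
        prefixes = {seq[start:start + 2] for seq in aligned_cds}
        # Skip columns where P1/P2 vary or contain gaps or ambiguity codes.
        if len(prefixes) != 1 or not set(next(iter(prefixes))) <= BASES:
            continue
        result[codon_index] = {seq[start + 2] for seq in aligned_cds} & BASES
    return result


# Toy alignment (hypothetical): three codons from four sequences.
toy_alignment = [
    "ATGGCACTA",
    "ATGGCCCTG",
    "ATGGCGTTA",
    "ATGGCTCTC",
]
sets = ip3_base_sets(toy_alignment)
# Codon indices at which all four bases are observed at P3.
fully_variable = [index for index, bases in sets.items() if bases == BASES]
print(sets, fully_variable)
```

In this toy case the third codon is excluded because its first position varies (CT versus TT), so it is not an I-P3 under the definition above.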
Reviewer 2 (Evidence, reproducibility and clarity):
The manuscript explores a large database of human mtDNA sequences and performs some comparative analysis across mammals to characterise the profile of mtDNA mutations. It finds that some variants are surprisingly poorly represented in human mtDNA and suggests that mutational bias rather than selection is the dominant driver of this heterogeneity.
This is an interesting message and an efficient and interpretable analysis of a large-scale dataset to shed light on biological mechanisms, which is a highly desirable philosophy. The factors shaping human mtDNA heterogeneity are of immense interest for several fields from population genetics to medicine, making this a valuable perspective. My comments are mainly quite fine-grained and reflect instances where I think the argument could be tighter, rather than fundamental flaws in the approach. In the cases where these points are due to my own naivety, I apologise and suggest that more explanation of these points could help other readers like me!
I am happy to read that Reviewer 2 (Dr. Iain Johnston) finds my approach to be fundamentally sound, and I certainly appreciate the insightful comments and suggestions that he has provided.
Reviewer 2 (Significance):
I wrote the above review without realising the reviewer interface would be categorised in this way. Here's a repeat of my "significance" comments.
The manuscript explores a large database of human mtDNA sequences and performs some comparative analysis across mammals to characterise the profile of mtDNA mutations. It finds that some variants are surprisingly poorly represented in human mtDNA and suggests that mutational bias rather than selection is the dominant driver of this heterogeneity.
This is an interesting message and an efficient and interpretable analysis of a large-scale dataset to shed light on biological mechanisms, which is a highly desirable philosophy. The factors shaping human mtDNA heterogeneity are of immense interest for several fields from population genetics to medicine, making this a valuable perspective.
I am very pleased that the reviewer appreciates the importance and potential impact of my analysis. I agree that mtDNA heterogeneity is likely to be of high medical relevance.
Response to Reviewer 2:
R2Q1: The first paragraph is focused on humans without explicitly saying so; missing heritability is less of an issue in, for example, plants [Brachi et al., 2011. Genome biology, 12(10), pp.1-8]. This focus should be clearer (or the differences across kingdoms mentioned!). It's also worth noting that the argument about pathogenic variants being infrequent because of selection can only address missing heritability in pathogenic variants, and cannot (directly) inform the missing heritability in traits like height etc. Also, the whole motivation with respect to missing heritability currently comes across as a bit of a non sequitur. An introduction section could be used to help describe how the analysis of the provenance of mtDNA mutations contributes to the missing heritability question.
R2A1: I agree that beginning the manuscript with a discussion of genome-wide association studies may distract the reader from the main topic at hand: the utility of variant frequency when predicting pathogenicity in humans. I have changed the Introduction accordingly.
R2Q2: I also suggest that such an introduction section introduces the (later cited) previous work from Reyes and others on mutational profiles in mtDNA to set the scene.
R2A2: I now provide these citations in the second paragraph of the Introduction. However, I do not expand further upon mutational propensities in that section, with an eye toward minimizing manuscript length toward publication as a short report.
R2Q3: An early result, that 35% of possible synonymous mutations do not appear in a dataset, lacks a null hypothesis. Depending on the size of the dataset this may be very surprising or very unsurprising: an order of magnitude estimate of what proportion would be expected under uniform mutation and zero selection would help comparison here. I guess this can be as simple as 16k/3*4 << 200k. Also the ancestry of the dataset is important here: if all samples are highly related then a more homogenous mutational profile is unsurprising. Perhaps one could assign a quantity like an effective population size to the database and compare this to 16k/3*4?
R2A3: The reviewer raises an excellent point regarding how 'surprising' it should be to the reader, prior to downstream analyses revealing transition/transversion biases, that so many synonymous substitutions are lacking within this dataset. While the authors of the HelixMT study removed mtDNA from highly related individuals from the analysis, the vast majority of the mtDNAs analyzed (91.2%) were from haplogroup N and of inferred European ancestry (doi.org/10.1101/798264). The authors of the HelixMTdb study do note that nearly all mtDNA lineages were present in the study, presumably encompassing roughly 100,000 years of human mtDNA evolution. That said, how this information alone may be used to quantitatively model expectations under zero selection is unclear.
To address this question of whether sample diversity might be very limited in the HelixMTdb study, I have carried out additional analyses on this dataset. I now assess, for third codon positions allowing two-fold synonymous change (serine and leucine not included, due to their decoding by two different tRNAs), how often only one nucleotide was found at that position. For two-fold degenerate P3s, > 97% (n=1604) harbored both nucleotide possibilities within the database. This result strongly suggests that mtDNA diversity was well sampled in the HelixMTdb study, since a database consisting of highly related samples would presumably be characterized by a greater number of sites showing total identity. Moreover, when considering analyzed four-fold degenerate P3s (again, leucine and serine codons were omitted), only a very small number of sites showed no diversity (1%), with more than half of sites harboring at least three different bases. My interpretation is that the HelixMTdb authors have successfully sampled a very diverse set of human mitochondrial genomes. I have added these new analyses to the manuscript as Fig. 2a and 2b.
I have also changed the word 'surprising' to 'noteworthy' within the relevant portion of my manuscript text.
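The diversity check described in R2A3 (how often a two-fold degenerate P3 shows both synonymous nucleotides in the database) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `fraction_with_both_bases`, the table layout, the site positions, and the sample counts are all hypothetical and do not come from HelixMTdb or from the manuscript's scripts.

```python
from typing import Dict, Tuple

# site position -> (sample counts per base, the two synonymous bases at that P3)
SiteTable = Dict[int, Tuple[Dict[str, int], str]]


def fraction_with_both_bases(sites: SiteTable) -> float:
    """Fraction of two-fold degenerate P3 sites at which both bases were seen."""
    if not sites:
        raise ValueError("no sites supplied")
    diverse = 0
    for counts, allowed in sites.values():
        # A site is 'diverse' when every allowed base has at least one sample.
        if all(counts.get(base, 0) > 0 for base in allowed):
            diverse += 1
    return diverse / len(sites)


# Hypothetical toy table: four two-fold degenerate P3 sites.
toy_sites: SiteTable = {
    101: ({"A": 1900, "G": 100}, "AG"),
    104: ({"C": 2000, "T": 0}, "CT"),
    107: ({"C": 5, "T": 1995}, "CT"),
    110: ({"A": 1999, "G": 1}, "AG"),
}
fraction = fraction_with_both_bases(toy_sites)
print(f"{100 * fraction:.1f}% of toy sites carry both bases")
```

A database built from closely related samples would be expected to give a low fraction here, which is the reasoning applied to the > 97% figure above.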
R2Q4: I think some comments and additional framing of the diversity in the central database would be valuable and important for interpretation. I believe it has, for example, rather more European rows than African ones, thus (to take a very basic view) sampling a less diverse population more than a more diverse one.
R2A4: I now state explicitly that the vast majority of the mtDNAs analyzed (91.2%) were from haplogroup N and of inferred European ancestry. Also, please see point R2A3 for further discussion of the human mtDNA diversity reflected within HelixMTdb.
R2Q5: Another rhetorically important number lacking a comparison with a null is that guanine was detected at >3000 P3 positions accepting synonymous purine substitutions. This is cited as evidence that nucleotide frequencies at P3s don't reflect selection inherent to translation. But this link isn't clear -- if such selection was present, how different from 3000 would we expect this number to be? Isn't there a continuum of possibilities? Is the key idea that 3000 is greater than some other number, and if so, what is that?
R2A5: The purpose of this figure is simply to demonstrate that no nucleotide is ruled out when considering silent substitutions at the P3 of any amino acid. This is consistent with (although does not prove, and I believe that the I-P3 analysis provides stronger evidence on this point) a minimal role for mitochondrial codon preference in mtDNA evolution. To reflect that my point is more general, and not to be taken as a quantitative comparison, I changed my text to: 'However, even considering the relative depletion of guanine from all four-fold degenerate P3s and two-fold degenerate purine P3s, guanine was nonetheless detected at thousands of P3 positions (Fig. 3b)'.
R2Q6: I also wasn't clear whether/how the finding that little selection inherent to translation was implicitly extended to suggest little general selection overall. The following section only considers selection acting at specific P3 sites, thus implicitly discarding other hypotheses about general selection based on nucleotide content but not inherent to translation. Perhaps I am misunderstanding this translation link, but selection based on general nucleotide profiles (for example, due to thermodynamic stability [Samuels, Mech. Ageing Dev. 2005; 126: 1123-1129] or availability of nucleotides [Aalto & Raivio, Mech. Ageing Dev. 2005; 126: 1123-1129; Ott et al., Apoptosis. 2007; 12: 913-922]) would seem to still be on the table?
R2A6: I would argue against selection upon nucleotide choice linked to local changes to mtDNA thermodynamic stability. Most prominently, when considering two-fold degenerate sites, nucleotide differences from the reference sequence were identified within the HelixMTdb at almost every analyzed position (Fig. 2a), even though hydrogen bond strength between opposing bases would be affected in every case (AT>GC or vice versa). Of course, my argument here applies generally, and there may be a small subset of sites for which nucleotide substitutions can cause a pronounced functional defect because of a change to local mtDNA structure.
I would also argue against mitochondrial nucleotide availability as a source of selective pressure within the human population. When considering the entire L-strand sequence (NC_012920.1), nucleotide counts are as follows:
A 5124
C 5181
G 2169
T 4094
And when considering both strands, nucleotide counts and frequencies are as follows:
A 9218 (27.8%)
C 7350 (22.2%)
G 7350 (22.2%)
T 9218 (27.8%)
One nucleotide substitution would lead to a change in nucleotide frequencies by less than 0.02%. While the formal possibility exists that mitochondrial nucleotide availability lies exquisitely close to an important threshold, there is no current evidence to support this proposition. And here again, the diversity of P3 nucleotide choice found among the HelixMTdb samples would argue against this possibility.
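The arithmetic behind these counts can be reproduced with a short Python sketch. It starts from the L-strand counts listed above, derives the both-strand totals by base pairing, and computes the frequency shift caused by a single substitution; the variable names are illustrative only.

```python
# L-strand nucleotide counts for NC_012920.1, as listed above.
l_strand = {"A": 5124, "C": 5181, "G": 2169, "T": 4094}
complement = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Each base on the L-strand pairs with its complement on the other strand,
# so the both-strand count of a base is its own count plus its complement's.
both_strands = {b: l_strand[b] + l_strand[complement[b]] for b in "ACGT"}
total_both = sum(both_strands.values())
freq_pct = {b: 100 * both_strands[b] / total_both for b in "ACGT"}

# A single substitution changes the count of a given base by one.
shift_single_strand_pct = 100 / sum(l_strand.values())
shift_both_strands_pct = 100 / total_both

for base in "ACGT":
    print(f"{base} {both_strands[base]} ({freq_pct[base]:.1f}%)")
print(f"shift per substitution: {shift_single_strand_pct:.4f}% (one strand), "
      f"{shift_both_strands_pct:.4f}% (both strands)")
```

Whether the shift is expressed relative to one strand or to both, it falls well below the 0.02% bound stated above.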
That said, it is worth noting that nucleotide frequencies, and mtDNA mutation rates relative to nuclear mutation rates, do appear to differ among clades (PMID: 8524045 and 28981721). Therefore, while selection related to nucleotide availability seems an unlikely explanation for the variant frequencies that I have recovered at degenerate sites among human samples, I certainly would not rule out taxon-specific dietary, environmental, or physiological factors that, over longer evolutionary timescales, might shape mtDNA nucleotide frequencies.
I would like to raise the possibility of another source of selection upon nucleotide choice. Specifically, one might propose that synonymous mtDNA substitutions could affect the binding of proteins controlling the replication, compaction, or expression of mtDNA. Indeed, an intriguing study has reported that human cells manifest a mtDNA footprinting pattern (PMID: 30002158), suggestive of regulatory sites bound to protein or sites of transcriptional pausing. However, Blumberg et al. found no statistically significant difference in human synonymous change at footprinted sites, arguing against a strong selective pressure on nucleotide choice at footprinted P3s. Moreover, footprinting sites identified in the above-mentioned study are conserved in mouse and human, but I have shown that all four nucleotides are acceptable at all four-fold degenerate sites (n=252), all two-fold degenerate pyrimidine sites (n=157), and 99% of two-fold degenerate purine sites (n=152) within the mammalian I-P3 set, again arguing against general limitations on nucleotide choice caused by protein association. These analyses cannot, however, totally rule out the possibility that a subset of individual P3s are under some selection due to their role in binding or traversal of proteins.
R2Q7: A reptile is chosen as an outgroup for a comparative analysis of mammals. As always when a choice is made, the question arises: what if that choice was different? Perhaps the corresponding figures can be presented for two other choices of outgroup to demonstrate that there's nothing particularly unrepresentative about this reptile?
R2A7: While preparing this revised manuscript, I have performed an updated analysis using the most current mammalian mtDNA dataset available on RefSeq. For these new tests, I used Iguana iguana, rather than Anolis punctatus, as an outgroup. The new results are essentially indistinguishable from my previous findings. Importantly, when old TSS values and new TSS values for I-P3 sites were compared by linear regression, the R-squared value is 0.9955, with a p-value of
R2Q8: Another analysis involves classifying variant frequency into discrete groups based on percentage appearance, then seeking links with the TSS statistic. First, it is not clear why discretisation is needed here. A statistical model embracing the continuous nature of variant frequency requires fewer arbitrary choices (e.g. of numbers and boundaries of classes).
R2A8: A primary audience of this manuscript will certainly be the human genetics community, which commonly speaks in terms of variant classes (e.g., 'common', 'rare', 'ultra-rare'). Therefore, I prefer to also use such classifications when analyzing the relationship between TSS and mtDNA variant frequency. I took advantage of the following references when generating frequency classifications:
Bomba L, Walter K, Soranzo N. 2017. The impact of rare and low-frequency genetic variants in common disease. Genome Biol 18:77.
McInnes G, Sharo AG, Koleske ML, Brown JEH, Norstad M, Adhikari AN, Wang S, Brenner SE, Halpern J, Koenig BA, Magnus DC, Gallagher RC, Giacomini KM, Altman RB. 2021. Opportunities and challenges for the computational interpretation of rare variation in clinically important genes. Am J Hum Genet 108:535–548.
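As an illustration of the kind of discretisation discussed in R2Q8 and R2A8, the Python sketch below assigns a variant to a frequency class from its sample count. The class boundaries in `BOUNDARIES`, the 'low-frequency' and 'not observed' labels, and the sample size of 200,000 are placeholders chosen for this example; they are assumptions and are not the cut-offs used in the manuscript or in the references above.

```python
from bisect import bisect_right

# Hypothetical class boundaries, as fractions of samples carrying the variant.
BOUNDARIES = [0.0001, 0.001, 0.01]
LABELS = ["ultra-rare", "rare", "low-frequency", "common"]


def frequency_class(count: int, n_samples: int) -> str:
    """Return a frequency class label for a variant seen in `count` samples."""
    if n_samples <= 0 or not 0 <= count <= n_samples:
        raise ValueError("count must lie between 0 and n_samples")
    if count == 0:
        # Variants never encountered get their own class.
        return "not observed"
    frequency = count / n_samples
    # bisect_right finds how many boundaries the frequency has reached.
    return LABELS[bisect_right(BOUNDARIES, frequency)]


n = 200_000  # illustrative sample size only
classes = {count: frequency_class(count, n) for count in (0, 1, 100, 1000, 5000)}
print(classes)
```

Changing the entries of `BOUNDARIES` is enough to test how sensitive a downstream comparison is to the choice of class edges, which is the concern raised in R2Q8.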
R2Q9: Second, an interpretation point here is in danger of equating absence of evidence with evidence of absence. Without an estimate of statistical power, an absence of a significant relationship cannot suggest that anything is likely or unlikely, only that there may not be sufficient power to detect an effect.
R2A9: To address this point, I have changed my text as follows:
Old: 'However, I detected no significant relationship between TSS and variant frequency for four-fold degenerate I-P3s (Fig. 2d), indicating that the highly elevated SSNE abundance at four-fold degenerate P3s is unlikely to be due to selection.'
New: 'However, I detected no significant relationship between TSS and variant frequency for four-fold degenerate I-P3s (Fig. 2d), consistent with the idea that the highly elevated SSNE abundance at four-fold degenerate P3s is unlikely to be due to selection.'
R2Q10: Figs 1a and 1e have a log vertical axis but I think the lowest points actually correspond to zero? This is not compatible with a log axis and the zero position should be explicitly labelled with its own tick (perhaps in parentheses to highlight the discontinuity).
R2A10: Quite correct, and I had neglected to clarify those details in the previous version of the manuscript. I now designate the samples with zero counts in the population using a smaller dot size, and I describe this approach in the figure legend.
R2Q11: The methods are presented in an interesting way, with specific filenames for the code associated with each part of the pipeline explicitly provided. This is (very!) nice but it would also be good to describe in words what each piece of code does (e.g. "this was used as input for x.py, which counts the mutations and outputs a profile" or some such). This is indeed sometimes written but some parts lack an explanation.
R2A11: I have now expanded my description of several scripts within the Methodology section.
R2Q12: I could do with an additional sentence or two on the statistical analysis. As Kolmogorov-Smirnov tests examine differences between distributions, it's not immediately unambiguous how they are applied to total count statistics. Are count distributions with respect to variant frequency analysed for each amino acid separately? Or are the amino acids somehow ordered and the distributions across them compared? Or something else?
R2A12: TSS distributions are compiled for each individual amino acid, and these are then compared by Kolmogorov-Smirnov testing only within a given degeneracy category (four-fold degenerate, two-fold degenerate purine, two-fold degenerate pyrimidine). I have now elaborated upon this statistical test selection, and other details of the analysis, in the Methodology section.
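The comparison scheme described in R2A12 can be sketched in Python as below: one TSS distribution per amino acid, with two-sample Kolmogorov-Smirnov comparisons made only between amino acids of the same degeneracy category. The TSS values and the choice of amino acids are invented for illustration, and only the KS statistic D is computed; a library routine such as SciPy's `scipy.stats.ks_2samp` would be needed for p-values. This is not the manuscript's own code.

```python
from bisect import bisect_right
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def ks_statistic(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for value in sorted(set(xs) | set(ys)):
        cdf_x = bisect_right(xs, value) / len(xs)
        cdf_y = bisect_right(ys, value) / len(ys)
        d = max(d, abs(cdf_x - cdf_y))
    return d


# Hypothetical TSS values per amino acid, grouped by degeneracy category.
tss: Dict[str, Dict[str, List[float]]] = {
    "four-fold": {"Ala": [3, 5, 8, 9], "Gly": [2, 5, 7, 9], "Pro": [10, 12, 15, 18]},
    "two-fold pyrimidine": {"Phe": [1, 2, 2, 4], "Tyr": [1, 3, 3, 4]},
}

# Compare amino acids pairwise, never across degeneracy categories.
results: Dict[Tuple[str, str, str], float] = {}
for category, by_amino_acid in tss.items():
    for first, second in combinations(sorted(by_amino_acid), 2):
        results[(category, first, second)] = ks_statistic(
            by_amino_acid[first], by_amino_acid[second]
        )

for key, d in results.items():
    print(key, round(d, 3))
```

Restricting pairs to a single category keeps differences in degeneracy itself from being read as differences between amino acids.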
Reviewer 2 (Referee Cross-commenting):
I agree that codon bias is an interesting potential axis of selection. Even if the analysis rejects the hypothesis of selective effects inherent to translation, it is conceivable that codon bias could be shaped by selection in other indirect ways (depending on how "inherent" is defined, these could include tRNA/nucleotide availability, GC content and thermodynamic stability, etc). I think this aligns with my suggestion that modes of selection that are not directly linked to translation could be explored in more depth before discounting selective effects overall. IJ
I hope that I have now successfully addressed points related to codon bias, GC content, and thermodynamic stability in the manuscript, as well as here in this response to the reviewers.
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #2
Evidence, reproducibility and clarity
The manuscript explores a large database of human mtDNA sequences and performs some comparative analysis across mammals to characterise the profile of mtDNA mutations. It finds that some variants are surprisingly poorly represented in human mtDNA and suggests that mutational bias rather than selection is the dominant driver of this heterogeneity.
This is an interesting message and an efficient and interpretable analysis of a large-scale dataset to shed light on biological mechanisms, which is a highly desirable philosophy. The factors shaping human mtDNA heterogeneity are of immense interest for several fields from population genetics to medicine, making this a valuable perspective. My comments are mainly quite fine-grained and reflect instances where I think the argument could be tighter, rather than fundamental flaws in the approach. In the cases where these points are due to my own naivety, I apologise and suggest that more explanation of these points could help other readers like me!
The first paragraph is focused on humans without explicitly saying so; missing heritability is less of an issue in, for example, plants [Brachi et al., 2011. Genome biology, 12(10), pp.1-8]. This focus should be clearer (or the differences across kingdoms mentioned!). It's also worth noting that the argument about pathogenic variants being infrequent because of selection can only address missing heritability in pathogenic variants, and cannot (directly) inform the missing heritability in traits like height etc. Also, the whole motivation with respect to missing heritability currently comes across as a bit of a non sequitur. An introduction section could be used to help describe how the analysis of the provenance of mtDNA mutations contributes to the missing heritability question. I also suggest that such an introduction section introduces the (later cited) previous work from Reyes and others on mutational profiles in mtDNA to set the scene.
An early result, that 35% of possible synonymous mutations do not appear in a dataset, lacks a null hypothesis. Depending on the size of the dataset this may be very surprising or very unsurprising: an order of magnitude estimate of what proportion would be expected under uniform mutation and zero selection would help comparison here. I guess this can be as simple as 16k/3*4 << 200k. Also the ancestry of the dataset is important here: if all samples are highly related then a more homogenous mutational profile is unsurprising. Perhaps one could assign a quantity like an effective population size to the database and compare this to 16k/3*4? I think some comments and additional framing of the diversity in the central database would be valuable and important for interpretation. I believe it has, for example, rather more European rows than African ones, thus (to take a very basic view) sampling a less diverse population more than a more diverse one.
Another rhetorically important number lacking a comparison with a null is that guanine was detected at >3000 P3 positions accepting synonymous purine substitutions. This is cited as evidence that nucleotide frequencies at P3s don't reflect selection inherent to translation. But this link isn't clear -- if such selection was present, how different from 3000 would we expect this number to be? Isn't there a continuum of possibilities? Is the key idea that 3000 is greater than some other number, and if so, what is that?
I also wasn't clear whether/how the finding that little selection inherent to translation was implicitly extended to suggest little general selection overall. The following section only considers selection acting at specific P3 sites, thus implicitly discarding other hypotheses about general selection based on nucleotide content but not inherent to translation. Perhaps I am misunderstanding this translation link, but selection based on general nucleotide profiles (for example, due to thermodynamic stability [Samuels, Mech. Ageing Dev. 2005; 126: 1123-1129] or availability of nucleotides [Aalto & Raivio, Mech. Ageing Dev. 2005; 126: 1123-1129; Ott et al., Apoptosis. 2007; 12: 913-922]) would seem to still be on the table?
A reptile is chosen as an outgroup for a comparative analysis of mammals. As always when a choice is made, the question arises: what if that choice was different? Perhaps the corresponding figures can be presented for two other choices of outgroup to demonstrate that there's nothing particularly unrepresentative about this reptile?
Another analysis involves classifying variant frequency into discrete groups based on percentage appearance, then seeking links with the TSS statistic. First, it is not clear why discretisation is needed here. A statistical model embracing the continuous nature of variant frequency requires fewer arbitrary choices (e.g. of numbers and boundaries of classes). Second, an interpretation point here is in danger of equating absence of evidence with evidence of absence. Without an estimate of statistical power, an absence of a significant relationship cannot suggest that anything is likely or unlikely, only that there may not be sufficient power to detect an effect.
Figs 1a and 1e have a log vertical axis but I think the lowest points actually correspond to zero? This is not compatible with a log axis and the zero position should be explicitly labelled with its own tick (perhaps in parentheses to highlight the discontinuity).
The methods are presented in an interesting way, with specific filenames for the code associated with each part of the pipeline explicitly provided. This is (very!) nice but it would also be good to describe in words what each piece of code does (e.g. "this was used as input for x.py, which counts the mutations and outputs a profile" or some such). This is indeed sometimes written but some parts lack an explanation.
I could do with an additional sentence or two on the statistical analysis. As Kolmogorov-Smirnov tests examine differences between distributions, it's not immediately clear how they are applied to total count statistics. Are count distributions with respect to variant frequency analysed for each amino acid separately? Or are the amino acids somehow ordered and the distributions across them compared? Or something else?
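To make the question concrete, here is a minimal sketch of one possible reading: a two-sample Kolmogorov-Smirnov test comparing the per-variant sample counts of two classes of substitution. This is an assumption made for illustration and is not necessarily how the manuscript applies the test. The function names are hypothetical, and the p-value uses the standard asymptotic series, which is only approximate for small samples or heavily tied data.

```python
import math


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(a[i], b[j])
        # Step both empirical CDFs past every observation equal to v.
        while i < n and a[i] == v:
            i += 1
        while j < m and b[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_asymptotic_p(d, n, m, terms=100):
    """Asymptotic two-sided p-value for a two-sample KS statistic d."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam == 0:
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    # Clamp, since the truncated series can stray slightly outside [0, 1].
    return min(1.0, max(0.0, 2 * total))
```

Under this reading, each input list would hold one number per variant (for example, the number of samples carrying that variant), so the test compares how those counts are distributed across two variant classes rather than comparing two totals.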
Iain Johnston
Significance
I wrote the above review without realising the reviewer interface would be categorised in this way. Here's a repeat of my "significance" comments:
The manuscript explores a large database of human mtDNA sequences and performs some comparative analysis across mammals to characterise the profile of mtDNA mutations. It finds that some variants are surprisingly poorly represented in human mtDNA and suggests that mutational bias rather than selection is the dominant driver of this heterogeneity.
This is an interesting message and an efficient and interpretable analysis of a large-scale dataset to shed light on biological mechanisms, which is a highly desirable philosophy. The factors shaping human mtDNA heterogeneity are of immense interest for several fields from population genetics to medicine, making this a valuable perspective.
Referee Cross-commenting
I agree that codon bias is an interesting potential axis of selection. Even if the analysis rejects the hypothesis of selective effects inherent to translation, it is conceivable that codon bias could be shaped by selection in other indirect ways (depending on how "inherent" is defined, these could include tRNA/nucleotide availability, GC content and thermodynamic stability, etc). I think this aligns with my suggestion that modes of selection that are not directly linked to translation could be explored in more depth before discounting selective effects overall. IJ
-
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #1
Evidence, reproducibility and clarity
The main message of this paper, as far as I understood since I am not a molecular bioinformatician but I am certainly interested in mtDNA variations especially related to disease, is that there is a very obvious bias among synonymous changes in the ORF of human mtDNA, more frequent for amino acids with 4 variants, more frequent in P position, and much more frequently characterized by transversion rather than transition substitutions. This survey is well written and, although edited in a rather technical language, the message is reachable and interesting. I also agree on the conclusions of the Author concerning the considerations that this set of new data should prompt one to draw also considering non-synonymous, potentially pathogenic mutations. The only contribution I feel I can provide to this manuscript is to invite the Authors to consider the possibility that the selection may be due to a preferred codon bias, linked to the higher or lower compliance of different codons to be translated by the translational in situ machinery of mitochondria. I am not sure that this applies also for mitochondria and related factors (you may want to ask Aleksey Amunts in Stockholm or Bob Lightowlers or Zoscha Lightowlers in Newcastle on this matter). I do know that this is certainly a problem for recombinant proteins containing, for instance, mammalian MTS fused with a bacterial restriction enzyme; in most of the cases the bacterial sequence has to be recoded using the preferred codon for mammalian system in order to increase translation by a eukaryotic (mammalian) translation machinery. I wonder whether you could discuss this possibility in your paper and maybe perform some further comparative measurement to test it.
Significance
The paper provides novel information on the structure and constraints of mtDNA variants in humans, opens an area of investigation which is new and potentially relevant, with some possible implications also on pathogenic mtDNA mutations in humans.
Referee Cross-commenting
I said in my first comment that I am not a bioinformatician, but Referee 2 did a great job in identifying some critical points and suggesting to the Authors how to address them. I maintain my opinion, which I think is shared by Referee 2, that the paper conveys an interesting and rather unexpected message, and that if the Authors are able to properly address the points raised by Referee 2 the paper should be published. I confirm that the only contribution I feel I can provide to this manuscript is to invite the Authors to consider the possibility that the selection may be due to a preferred codon bias, linked to the higher or lower compliance of different codons to be translated by the translational in situ machinery of mitochondria. I wonder whether the Authors could consider this possibility in the Discussion and possibly perform some further comparative measurement to test it.
-
