A catalog of homoplasmic and heteroplasmic mitochondrial DNA variants in humans

Alexandre Bolze
Fernando Mendez
Simon White
Francisco Tanudjaja
Magnus Isaksson
Ruomu Jiang
Andrew Dei Rossi
Elizabeth T. Cirulli
Misha Rashkin
William J. Metcalf
Joseph J. Grzymski
William Lee
James T. Lu
Nicole L. Washington

Abstract

High-quality population allele frequencies of DNA variants can be used to discover new biology and to study rare disorders. Here, we created a public catalog of mitochondrial DNA variants based on a population of 195,983 individuals. We focused on 3 criteria: (i) the population is not enriched for mitochondrial disorders or other clinical phenotypes, (ii) all genomes are sequenced and analyzed in the same clinical laboratory, and (iii) both homoplasmic and heteroplasmic variants are reported. We found that 47% of the mitochondrial genome was invariant in this population, including large stretches in the 2 rRNA genes. This information could be used to annotate the mitochondrial genome in future studies. We also showed how to use this resource for the interpretation of pathogenic variants for rare mitochondrial disorders. For example, 42% of variants previously reported to be pathogenic for Leber Hereditary Optic Neuropathy (LHON) should be reclassified.

eLife
Sep 1, 2020
###Reviewer #3:

Bolze and colleagues describe a new database of mitochondrial variation that consists of a greater number of samples than existing databases. To overcome some of the limitations of existing databases, they use the same sequencing pipeline for all samples, do not select for any particular phenotypes, and report both heteroplasmic and homoplasmic calls. They demonstrate the utility of their database by defining intervals of invariable regions, which may indicate mutational constraint and could aid in interpreting candidate variants in disease patients. The authors also calculate the filtering allele frequency for LHON variants and suggest that the allele frequencies for many LHON variants in their database and UKB are too high for the variants to be considered pathogenic and that they should be reclassified. The main limitations of this database, as stated by the authors, are the lack of diverse haplogroups and the relatively low depth of coverage considering the variable heteroplasmy of the mitochondria. The technical aspects of the data aggregation and database are solid, and the scientific analyses are sound. I have only a few comments that would strengthen the paper.

There is no discussion of how to distinguish heteroplasmy from sequencing errors. While some filtering was done akin to germline variant filtering (in particular, calls at positions with fewer than 10 reads were removed), this could still result in a single alternate read out of 11 (~9%) being called as a heteroplasmic variant. The spike in Figure 3F (final panel) around 90% ARF suggests that something like this could be happening (homoplasmic variants with sequencing errors reverting to another base). Was a minimum heteroplasmy level used for this analysis? Perhaps showing these plots filtered to a minimum of 2, 5, etc. reads supporting the same alternate allele would reveal a sensible cutoff that could then be used for the whole paper.
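The filtering idea above can be illustrated with a minimal sketch. It assumes each call is summarized by an alternate read count and a total depth; the `Call` class, the `filter_calls` helper, and all positions and counts are hypothetical and are not taken from the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One variant call at one position in one sample (hypothetical structure)."""
    position: int
    alt_reads: int
    total_reads: int

    @property
    def arf(self) -> float:
        # Alternate read fraction: share of reads supporting the alternate allele.
        return self.alt_reads / self.total_reads


def filter_calls(calls, min_depth=10, min_alt_reads=1):
    """Keep calls with enough total depth and enough alternate-allele reads."""
    return [
        c for c in calls
        if c.total_reads >= min_depth and c.alt_reads >= min_alt_reads
    ]


# Made-up calls, chosen only to show how the thresholds behave.
calls = [
    Call(position=100, alt_reads=1, total_reads=11),   # single alt read, ARF ~9%
    Call(position=200, alt_reads=5, total_reads=50),   # ARF 10%, five alt reads
    Call(position=300, alt_reads=10, total_reads=11),  # near-homoplasmic, ARF ~91%
    Call(position=400, alt_reads=3, total_reads=9),    # below the depth filter
]

for threshold in (1, 2, 5):
    kept = filter_calls(calls, min_alt_reads=threshold)
    print(threshold, [(c.position, round(c.arf, 2)) for c in kept])
```

With a depth filter alone, the single-read call at the hypothetical position 100 survives at roughly 9% ARF; requiring at least 2 alternate reads removes it while keeping the call at position 200, which has a similar ARF but more supporting reads.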

Line 484: This is the only mention of NUMTs in the paper, but the authors do not detail the complications that can arise from them. Considering the mitochondrial coverage, how confident are the authors that their low-level heteroplasmic calls are not false positives resulting from NUMTs?

Along the same lines, the authors use HaplotypeCaller, which is a standard tool for germline variation but not optimized for mitochondrial calling. Was this run in haploid or diploid mode? It would be useful to state the limitations of using this tool to call mitochondrial variants as it is designed for diploids.

The suggestion that "all protein-coding genes in the mitochondrial genome were highly intolerant to LoF variants" is certainly plausible, but not definitive from the current data. While 0 LoFs are observed, how many would be expected? If these genes are small (which they must be since they are on a very small chromosome), the number of expected variants based on a mutational model (akin to [Samocha et al., 2014]) would likely be <1, and thus 0 would not necessarily be remarkable. Given that, you may not be quite powered to do this at a per-gene level, but pooling all the genes may provide enough power to make a broader statement. The same goes for the % of bases invariable analysis (Figure 5) - it would be good to make this more quantitative, perhaps comparing these proportions to autosomes, or within each other (are the tRNA and rRNA ones significantly different from the protein-coding? Would it be possible to split protein-coding by synonymous, missense, LoF?).
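The power argument above can be made concrete with a short sketch. It assumes LoF counts follow a Poisson distribution, so the probability of observing zero variants is exp(-expected). The per-gene expectation of 0.5 is a hypothetical placeholder, not an estimate from the manuscript; 13 is the number of protein-coding genes in the human mitochondrial genome.

```python
import math


def prob_zero_observed(expected: float) -> float:
    """P(X = 0) for X ~ Poisson(expected)."""
    return math.exp(-expected)


# Hypothetical expected number of LoF variants in a single gene.
expected_per_gene = 0.5
# The human mitochondrial genome encodes 13 proteins.
n_genes = 13

per_gene = prob_zero_observed(expected_per_gene)
pooled = prob_zero_observed(expected_per_gene * n_genes)

print(f"P(0 LoF | one gene):      {per_gene:.3f}")
print(f"P(0 LoF | 13 genes pooled): {pooled:.4f}")
```

Under these assumed numbers, seeing zero LoF variants in one gene is unremarkable (probability about 0.61), whereas seeing zero across all 13 genes pooled would be unlikely by chance (about 0.0015), which is why pooling could support a broader statement.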

"Indeed, we found that no haplogroup markers -- even those from haplogroups not represented in our dataset -- were mapped to these highly constrained regions" - is this not circular? Markers that delineate haplogroups are found as homoplasmic calls that were used to determine the constrained regions, so it stands to reason that these would not be found in them, no? But perhaps I'm missing something.
###Reviewer #2:

The authors present a resource of human mtDNA variants and heteroplasmies from 195,983 individuals, scoring 14,324 mutations. The resource is of value. It may be possible to criticize the European ancestry-heavy data set, and the American specificity of it, but the authors fully acknowledge and disclose this in their manuscript, and make the data available to others to continue the work. Other high-depth human papers are out there (the Wei 2019 reference and others), but the data are often not available due to patient confidentiality issues, as in Wei 2019. Having this dataset available is of great intrinsic value.

I only have a few comments that would require looking into the data for a few small things, or changing the writing of the manuscript.

Comments:

My biggest concern is that the authors use a read-aligning method where they take in all calls where there was at least 1 read mapping to mtDNA. The logic seems to be that they do not want to discard reads that may "mis-map" to the NUMTs, but this leads to another, potentially larger problem of including NUMTs as heteroplasmic variants (see PMID: 23972387). For instance, the recent claim of paternal mtDNA transmission appears to be the result of a complex NUMT that was able to amplify in the strategies used in the original study (PMID: 32269217). More details on how the authors exclude the possibility of NUMT incorporation are needed, especially in light of the 1+ alignment parameters used.

Lines 340 - 357, regarding LHON. The problem with choosing LHON for this analysis is that it has a complicated clinical manifestation, which may not support the handling of the 14484T>C allele in the manner presented. The 8:1 male-to-female ratio in becoming afflicted (with homoplasmic LHON), the fact that many people with the homoplasmic allele will not become afflicted, and the fact that onset can occur late in life (after having children) could all contribute to its allele being more represented in a random sampling of the population.

The authors are correct that the allele on its own may not be pathogenic in specific haplogroup backgrounds (Howell 2003 reference), or may require co-occurrence with secondary "affector" mtDNA mutations (e.g. PMID: 25342614; alleles including 3397A>G, 3497C>T, 3571C>T, 3745G>A, and other "helper" mutations in MITOMAP). However, the paper needs a bit more support for the 14484 conclusion given all of these issues. Perhaps finding linkage (or lack thereof) to these helper alleles would strengthen this section sufficiently.

Lines 206 - 207. How did the authors handle AGG / AGA codons? In 2010 a lab published evidence that AGA and AGG may not be true stop codons, but are simply not coded in the human mtDNA genome (PMID: 20075246). While this finding remains not universally accepted, it does explain the lack of an AGA/AGG-binding translational termination factor in the mitochondria. It is possible that the authors are in a position to comment on the behaviour of AGA or AGG codons, relevant to their section on PCG-truncating mutations.

The work, especially the discussion of the control region, overlaps a bit more with Wei et al. 2019 than the manuscript lets on. A more direct acknowledgement of this overlap and of the similar findings should be added to the discussion.

###Reviewer #1:

Bolze et al. report their effort to sequence the mitochondrial genomes of ~200,000 individuals. The authors generated a large, unified database that can be used for the investigation of mitochondrial mutations and the prediction of pathogenic alleles. Importantly, it addresses key limitations of other currently available sources, namely: it is not biased for mitochondrial diseases, all analyses were done in the same lab using the same bioinformatics tools, and heteroplasmic alleles are reported. The authors then use their resource to draw conclusions on the nature of mitochondrial mutations and their distribution across the mt-genome, and to challenge previously annotated pathogenic mutations, specifically for LHON.

For example, Figure 3A, which carries one of the main take-home messages of the paper, shows hardly any "interesting" alleles. The vast majority of the >14,000 discovered variants cannot be seen on the plot. Unfortunately, many of the plots display the same data in similar and unnecessary formats, making the figures dense and confusing. Examples include Figure 3F (mean and max ARF distribution) and Figure 5A, B & C.

Another, more concerning issue is the quality of the heteroplasmic variants. The authors mention only very briefly in the Methods section what was done to account for NUMTs (nuclear copies of mtDNA), which may be mutated and thus bias SNV calling. From their short description, it seems that NUMTs could be a source of errors. Furthermore, Figure 2E shows that the vast majority of individuals had ≤1 heteroplasmic variant. This observation cannot be reconciled with the basis underlying current methods to infer cellular lineages based on heteroplasmy in a cellular population (PMID: 30827679).

These issues are particularly critical when using the data to draw conclusions on the pathogenicity of mutations, which is the focus of the last part of the manuscript. When considering the effect of the m.14484T>C mutation on LHON, the authors argue that this mutation should be reclassified as non-pathogenic because it satisfies the "Benign Strong 1" criterion. Given the above limitations, this conclusion is certainly too strong. Stronger evidence for this claim is required, especially since all subjects carrying this mutation are from the same haplogroup.

Lastly, to assess the probability that m.14484T>C is indeed non-pathogenic, the authors use previously published estimates of the "maximum credible population allele frequency". Despite the abundance of papers that estimate these parameters, the authors provide only one number, with no error or range estimates, and show that the frequency of m.14484T>C is higher than expected. It is important to understand how certain this claim is, and ideally to reflect that uncertainty as a range around the dashed lines in Figure 6.
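One way to express such a range is sketched below. It assumes a simplified form of the maximum credible allele frequency for a homoplasmic variant, namely prevalence × allelic contribution / penetrance. This form, the `max_credible_af` helper, and every parameter value are illustrative assumptions, not the authors' calculation or published LHON estimates.

```python
from itertools import product


def max_credible_af(prevalence: float, allelic_contribution: float,
                    penetrance: float) -> float:
    """Simplified maximum credible allele frequency (assumed form, haploid case).

    prevalence: fraction of the population affected by the disorder.
    allelic_contribution: largest share of cases attributable to one variant.
    penetrance: probability that a carrier develops the disorder.
    """
    if not 0 < penetrance <= 1:
        raise ValueError("penetrance must be in (0, 1]")
    return prevalence * allelic_contribution / penetrance


# Hypothetical low/high values for each parameter, for illustration only.
prevalences = (1 / 50_000, 1 / 30_000)
contributions = (0.1, 0.2)
penetrances = (0.1, 0.5)

# Evaluate every combination to see how parameter uncertainty propagates.
values = [
    max_credible_af(p, c, q)
    for p, c, q in product(prevalences, contributions, penetrances)
]
low, high = min(values), max(values)
print(f"maximum credible AF range: {low:.2e} to {high:.2e}")
```

Even with these modest assumed ranges, the threshold spans more than an order of magnitude, which illustrates why a single dashed line may overstate the certainty of the comparison.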

##Preprint Review

This preprint was reviewed using eLife’s Preprint Review service, which provides public peer reviews of manuscripts posted on bioRxiv for the benefit of the authors, readers, potential readers, and others interested in our assessment of the work. This review applies only to version 3 of the manuscript.

###Summary:

Bolze and colleagues describe a new database of mitochondrial variation that consists of a greater number of samples than existing databases. To overcome some of the limitations of existing databases, they use the same sequencing pipeline for all samples, do not select for any particular phenotypes, and report both heteroplasmic and homoplasmic calls. They demonstrate the utility of their database by defining intervals of invariable regions, which may indicate mutational constraint and could aid in interpreting candidate variants in disease patients. The authors also calculate the filtering allele frequency for LHON variants and suggest that the allele frequencies for many LHON variants in their database and UKB are too high for the variants to be considered pathogenic and that they should be reclassified. The main limitations of this database, as stated by the authors, are the lack of diverse haplogroups and the relatively low depth of coverage considering the variable heteroplasmy of the mitochondria.

While the database is indeed unique and will likely be very valuable for the community, on the whole, the computational analyses are in several places superficial, in some cases even flawed and overall not as well presented as they could be.

Version published to 10.1101/798264 on bioRxiv
Oct 8, 2019
