Identifying single origin rare variants in population genomic data


Abstract

Genomic analyses have shown that some mutations in large population genomic datasets may be the result of repeated, independent events at the same locus. However, the possibility of recurrent mutation is often ignored, even when it has the potential to introduce errors, such as when co-ancestry is assumed for demographic analysis. Even rare variants such as doubletons, which should be particularly informative about recent demography, may have multiple origins despite arising relatively recently in the population. Here, we develop methods to (1) estimate the frequency of recurrent doubletons in a population genomic dataset from the occurrence of tri-allelic sites with two different singleton mutations, and (2) identify a subset of high-confidence single-origin doubletons based on the presence of a linked rare variant on the surrounding shared haplotype. Applying these methods to data for the malaria mosquito Anopheles gambiae sampled from across Africa, we estimate that ∼16% of doubletons had independent origins. We then identify a subset of doubletons highly likely (∼99%) to have a single origin, which comprises ∼68% of all the expected single-origin doubletons (and ∼57% of all observed doubletons). The effectiveness of our methods is demonstrated by both further data analyses and coalescent simulations, and these doubletons are then used to test population genetic hypotheses about recombination, selection, and isolation by distance. The methods developed here should be useful for demographic inference when populations or sample sizes are large enough that recurrent mutation cannot be ignored.
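
The percentages quoted above are mutually consistent, which a short calculation makes explicit. The following is a minimal sketch of that arithmetic only, not the authors' estimation method: the function name and its parameters are hypothetical, and it assumes the high-confidence subset can be treated as essentially all single origin (ignoring the roughly 1% that may not be).

```python
def high_confidence_share_of_all(recurrent_fraction: float,
                                 recovered_fraction: float) -> float:
    """Share of all observed doubletons that fall in the high-confidence subset.

    recurrent_fraction: estimated fraction of doubletons with independent origins.
    recovered_fraction: fraction of expected single-origin doubletons that the
        high-confidence subset captures.
    Both names are illustrative assumptions, not taken from the article.
    """
    # Doubletons that are not recurrent are expected to have a single origin.
    single_origin_fraction = 1.0 - recurrent_fraction
    # The subset captures only part of those single-origin doubletons.
    return single_origin_fraction * recovered_fraction


# Values quoted in the abstract: ~16% recurrent, ~68% recovered.
share = high_confidence_share_of_all(0.16, 0.68)
print(f"{share:.3f}")
```

With the abstract's rounded inputs, the product of 0.84 and 0.68 is about 0.57, matching the reported ∼57% of all observed doubletons.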
