Population labels can be generated directly from targeted next-generation sequencing data
Abstract
The validity of genetic studies relies on the selection of appropriately matched population controls to prevent erroneous associations between population-specific genetic variants and disease. Such studies have traditionally relied on self-declared ethnicity, which is likely to produce inaccurate predictions and is ethically problematic. More recently, ancestry informative markers (AIMs) have been used to determine the genetic similarity of an individual to ancestry reference populations. These AIMs, however, mostly reside in non-coding DNA, making it difficult to determine ancestry from sequencing data that do not cover the whole genome. To address this, we implemented an empirical methodology that uses Procrustes analysis and random forest classification to select genetically similar gnomAD control populations for study samples. This approach avoids the problems associated with using ethnicity as a substitute for genetic similarity and can be used to select suitable controls for studies that rely on exome or targeted sequencing data.
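As a rough illustration of the Procrustes step mentioned above, the following sketch aligns one two-dimensional coordinate space onto another using translation, rotation and uniform scaling, then assigns a label to an aligned point. It is not the authors' implementation. The function names, the restriction to two dimensions, the omission of reflections and the toy coordinates are all assumptions made for illustration, and a simple nearest-centroid rule stands in for the random forest classifier so that the example needs only the Python standard library.

```python
import math


def fit_procrustes_2d(source, target):
    """Least-squares fit of translation, rotation and uniform scale
    mapping paired 2D points in `source` onto `target`.
    Assumption: reflections are not considered."""
    n = len(source)
    sx = sum(p[0] for p in source) / n
    sy = sum(p[1] for p in source) / n
    tx = sum(p[0] for p in target) / n
    ty = sum(p[1] for p in target) / n
    a = b = norm = 0.0
    for (x, y), (u, v) in zip(source, target):
        # Centre both point sets before estimating rotation and scale.
        x, y, u, v = x - sx, y - sy, u - tx, v - ty
        a += x * u + y * v   # dot-product term
        b += x * v - y * u   # cross-product term
        norm += x * x + y * y
    theta = math.atan2(b, a)          # optimal rotation angle
    scale = math.hypot(a, b) / norm   # optimal uniform scale
    return (sx, sy), (tx, ty), theta, scale


def apply_procrustes_2d(points, fit):
    """Map points from the source space into the target space."""
    (sx, sy), (tx, ty), theta, scale = fit
    c, s = math.cos(theta), math.sin(theta)
    mapped = []
    for x, y in points:
        x, y = x - sx, y - sy
        mapped.append((tx + scale * (c * x - s * y),
                       ty + scale * (s * x + c * y)))
    return mapped


def nearest_centroid_label(point, labelled_reference):
    """Hypothetical stand-in for the random forest: return the label
    whose reference centroid is closest to `point`."""
    centroids = {}
    for label, pts in labelled_reference.items():
        centroids[label] = (sum(p[0] for p in pts) / len(pts),
                            sum(p[1] for p in pts) / len(pts))
    return min(centroids, key=lambda lab: math.dist(point, centroids[lab]))
```

In practice the coordinates would presumably be principal components computed from genotypes, with more than two dimensions, and the classifier would be trained on reference samples with known population labels; none of those details are specified here.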