ParaMask, a new method to identify multicopy genomic regions, corrects major biases in whole-genome sequencing data
Abstract
Multicopy genomic regions are repeated sequences that can bias genomics analyses. Here, we present a method to identify and filter multicopy regions in population-level genomic data of any species. The broad applicability of this method stems from a flexible Expectation-Maximization framework to detect excess heterozygosity while simultaneously fitting inbreeding levels. By combining this signature with read ratio deviations, excess sequencing coverage, and a clustering technique, our method attains high power. We show that multicopy regions create biases that confound evolutionary genomics analyses, and that by identifying these regions with our method and filtering them, we can correct these biases.
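
To make the idea of detecting excess heterozygosity while fitting inbreeding more concrete, the following is a minimal sketch of a two-component Expectation-Maximization fit. It is not ParaMask's implementation: the function names, the binomial mixture, the grid search for the inbreeding coefficient, and the starting values are all assumptions made for illustration. It assumes that single-copy sites have a heterozygote frequency of 2p(1-p)(1-F), and that multicopy sites share one elevated heterozygote frequency q. It covers only the heterozygosity signature, not read ratio deviations, sequencing coverage, or clustering.

```python
import numpy as np


def binom_loglik(h, n, theta):
    """Binomial log-likelihood of h heterozygotes among n individuals.

    The combinatorial constant is omitted because it is identical for both
    mixture components and cancels in the E-step.
    """
    theta = np.clip(theta, 1e-9, 1 - 1e-9)
    return h * np.log(theta) + (n - h) * np.log1p(-theta)


def em_excess_het(h, n, p, n_iter=50):
    """Hypothetical two-component EM (illustrative, not the published method).

    h : heterozygote count per site
    n : number of individuals genotyped at every site
    p : allele frequency per site
    Returns (pi, F, q, w): multicopy proportion, inbreeding coefficient,
    multicopy heterozygote rate, and per-site multicopy posterior.
    """
    h = np.asarray(h, dtype=float)
    p = np.asarray(p, dtype=float)
    c = 2 * p * (1 - p)                  # heterozygote frequency without inbreeding
    f_grid = np.linspace(0.0, 0.95, 96)  # candidate F values (assumed range)
    pi, F, q = 0.1, 0.0, 0.9             # starting values (assumptions)

    for _ in range(n_iter):
        # E-step: posterior probability that each site is multicopy.
        log_single = np.log(1 - pi) + binom_loglik(h, n, c * (1 - F))
        log_multi = np.log(pi) + binom_loglik(h, n, q)
        diff = np.clip(log_single - log_multi, -700, 700)
        w = 1.0 / (1.0 + np.exp(diff))

        # M-step: update the mixture weight and the multicopy het rate.
        pi = float(np.clip(w.mean(), 1e-6, 1 - 1e-6))
        q = float(np.clip((w * h).sum() / (w.sum() * n + 1e-12), 1e-6, 1 - 1e-6))

        # M-step for F: grid search over the weighted single-copy likelihood,
        # so the update is exact only up to the grid resolution.
        scores = [((1 - w) * binom_loglik(h, n, c * (1 - f))).sum() for f in f_grid]
        F = float(f_grid[int(np.argmax(scores))])

    return pi, F, q, w
```

Sites with a posterior `w` close to 1 would be candidates for filtering. Fitting F jointly matters in this sketch because inbreeding lowers the expected heterozygosity at single-copy sites, so a fixed F of zero would set the baseline too high.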