ParaMask: a new method to identify multicopy genomic regions, corrects major biases in whole-genome sequencing data


Abstract

Multicopy genomic regions are repeated sequences that can bias genomic analyses. Here, we present a method, ParaMask, to identify and filter multicopy regions in population-level genomic data of any species. The broad applicability of this method stems from a flexible Expectation-Maximization framework to detect excess heterozygosity while simultaneously fitting inbreeding levels. By combining this signature with read-ratio deviations, excess sequencing depth, and a clustering technique, our method attains high recall. We show that multicopy regions create biases that confound evolutionary genomic analyses and that by identifying these regions with our method and filtering them, we can correct these biases.
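To illustrate the excess-heterozygosity signature mentioned above, here is a minimal sketch in Python. It is not ParaMask's Expectation-Maximization procedure. It only uses the standard population-genetic expectation that, at a biallelic site with allele frequency p and inbreeding coefficient F, the heterozygote frequency is 2p(1-p)(1-F), and flags a site when the observed heterozygote count is improbably high under a one-sided binomial test. The function names, the binomial test, and the `alpha` threshold are all assumptions made for illustration.

```python
from math import comb


def expected_het_freq(p: float, f_inbreeding: float) -> float:
    """Expected heterozygote frequency at a biallelic site: 2p(1-p)(1-F)."""
    return 2.0 * p * (1.0 - p) * (1.0 - f_inbreeding)


def binom_upper_tail(k: int, n: int, prob: float) -> float:
    """P(X >= k) for X ~ Binomial(n, prob), computed by direct summation."""
    return sum(
        comb(n, i) * prob**i * (1.0 - prob) ** (n - i) for i in range(k, n + 1)
    )


def has_excess_het(
    n_het: int,
    n_samples: int,
    p: float,
    f_inbreeding: float = 0.0,
    alpha: float = 0.001,  # hypothetical significance threshold
) -> bool:
    """Flag a site whose heterozygote count exceeds the inbreeding-adjusted
    expectation (illustrative test only, not the published method)."""
    h_exp = expected_het_freq(p, f_inbreeding)
    return binom_upper_tail(n_het, n_samples, h_exp) < alpha


if __name__ == "__main__":
    # Every one of 50 samples heterozygous at a site with p = 0.5.
    print(has_excess_het(n_het=50, n_samples=50, p=0.5))
    # Heterozygote count matching the expectation under random mating.
    print(has_excess_het(n_het=25, n_samples=50, p=0.5))
```

In this sketch, a site where every sample is called heterozygous is flagged, while a site with the expected number of heterozygotes is not. Fixing F in advance is a simplification: the abstract describes fitting inbreeding levels jointly with the detection of excess heterozygosity, which this example does not attempt.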
