Detection and characterization of low and high genome coverage regions using an efficient running median and a double threshold approach

Abstract

Motivation

Next Generation Sequencing (NGS) provides researchers with powerful tools to investigate both prokaryotic and eukaryotic genetics. An accurate assessment of reads mapped to a specific genome requires inspecting the genome coverage, defined as the number of reads mapped to each genome location. Most current methods use the average genome coverage (sequencing depth) to summarize the overall coverage. This metric quickly assesses the sequencing quality but ignores valuable biological information such as the presence of repetitive regions or deleted genes. Detecting such information may be challenging due to a wide spectrum of heterogeneous coverage regions, a mixture of underlying models, or the presence of a non-constant trend along the genome. Robust statistics are needed to systematically identify genomic regions with unusual coverage and to characterize them more precisely.

Results

We implemented an efficient running median algorithm to estimate the genome coverage trend. The distribution of the normalized genome coverage is then estimated using a Gaussian mixture model. A z-score is then assigned to each base position and used to separate the central distribution from the regions of interest (ROIs), i.e., under- and over-covered regions. Finally, a double threshold mechanism is used to cluster the genomic ROIs. HTML reports provide a summary with interactive visual representations of the genomic ROIs.
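The pipeline described above can be illustrated with a minimal, standard-library-only sketch. It is not the Sequana implementation: the function names (running_median, zscores, cluster_rois), the window size, and the thresholds (4 and 2) are hypothetical choices made for illustration. In particular, the center and spread of the normalized coverage are estimated here with the median and the median absolute deviation as a simple stand-in for the Gaussian mixture model, and the double threshold is interpreted as hysteresis: a region is seeded where |z| exceeds the high threshold and extended while |z| stays above the low one.

```python
from bisect import bisect_left, insort
from statistics import median


def running_median(values, window):
    """Centered running median (odd window); edges use a truncated window."""
    half = window // 2
    n = len(values)
    win = sorted(values[:half])  # sorted copy of the current window
    out = []
    for i in range(n):
        if i + half < n:  # value entering on the right
            insort(win, values[i + half])
        if i - half - 1 >= 0:  # value leaving on the left
            del win[bisect_left(win, values[i - half - 1])]
        m = len(win)
        out.append(win[m // 2] if m % 2 else (win[m // 2 - 1] + win[m // 2]) / 2)
    return out


def zscores(coverage, window):
    """Normalize coverage by its trend, then compute robust z-scores.

    Assumption: median/MAD replace the mixture-model fit of the paper.
    """
    trend = running_median(coverage, window)
    # A zero trend (very long uncovered stretch) is mapped to 0, i.e. "low".
    norm = [c / t if t > 0 else 0.0 for c, t in zip(coverage, trend)]
    center = median(norm)
    mad = median(abs(x - center) for x in norm)
    sigma = 1.4826 * mad if mad > 0 else 1.0  # guard against zero spread
    return [(x - center) / sigma for x in norm]


def cluster_rois(z, high=4.0, low=2.0):
    """Return sorted (start, end, kind) intervals, end exclusive."""
    rois = []
    for sign, kind in ((1, "high"), (-1, "low")):
        start, seeded = None, False
        for i, v in enumerate(list(z) + [0.0]):  # sentinel closes the last run
            s = sign * v
            if s > low:
                if start is None:
                    start, seeded = i, False
                seeded = seeded or s > high
            elif start is not None:
                if seeded:  # keep only runs that crossed the high threshold
                    rois.append((start, i, kind))
                start = None
    return sorted(rois)


# Synthetic coverage: ~100x depth, a deletion at 50..59, a duplication at 120..129.
coverage = [98 + (i % 5) for i in range(200)]
coverage[50:60] = [0] * 10
coverage[120:130] = [200] * 10
rois = cluster_rois(zscores(coverage, window=41))
print(rois)
```

The window is deliberately chosen to be more than twice the length of the simulated events, so that the running median follows the background trend rather than the events themselves. Runs that exceed only the low threshold are discarded, which is what keeps mild fluctuations from being reported as ROIs.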

Availability

An implementation of the genome coverage characterization is available within the Sequana project. The standalone application is called sequana_coverage. The source code is available on GitHub (http://github.com/sequana/sequana), and the documentation on ReadTheDocs (http://sequana.readthedocs.org). An example HTML report is provided at http://sequana.github.io.

Contact

dimitri.desvillechabrol@pasteur.fr, thomas.cokelaer@pasteur.fr

Article activity feed

  1. Abstract

    A revised and updated version of this preprint was published in GigaScience on 6 September 2018 under the title:

    Sequana coverage: detection and characterization of genomic variations using running median and mixture models https://doi.org/10.1093/gigascience/giy110

    Because GigaScience is an open-access journal with open peer review, the peer reviews of this paper are available here:

    Review 1. http://dx.doi.org/10.5524/REVIEW.101353
    Review 2. http://dx.doi.org/10.5524/REVIEW.101350
    Review 3. http://dx.doi.org/10.5524/REVIEW.101351