NCBoost v2: a classifier for non-coding variants in Mendelian diseases


Abstract

Motivation

The current diagnostic rate of rare diseases through whole-genome sequencing has stabilized at around 30% on average, highlighting the need for improved computational scores to identify pathogenic variants. In 2019, we developed NCBoost, a supervised-learning approach that mined a comprehensive set of sequence constraint features and proved particularly well suited to identifying high-effect pathogenic non-coding variants in genetic diseases. Since its first release, the substantial increase in the number of variants available for training, together with the enhanced capacity to detect purifying selection signals from large-scale genome sequencing projects, has motivated an update of NCBoost.

Results

We implemented NCBoost v2, a pathogenicity score for non-coding single-nucleotide variants, trained on the largest set of curated pathogenic variants in monogenic Mendelian diseases available to date. It leverages conservation features computed from recent large-scale genomic consortia such as Zoonomia and gnomAD, and incorporates recent splice-altering predictive scores. NCBoost v2 outperformed alternative state-of-the-art methods in a variety of scenarios, providing more consistent scores across non-coding genomic regions and fine-tuning the scoring of pathogenic splice-altering variants in Mendelian disease genes.

Availability

NCBoost v2 software is implemented in Python 3.10 and is freely available under the GNU General Public License Version 3 at https://doi.org/10.5281/zenodo.16029049 and https://github.com/RausellLab/NCBoost-2, together with precomputed scores for the human genome assembly GRCh38.
