varCADD: large sets of standing genetic variation enable genome-wide pathogenicity prediction
Abstract
Machine learning and artificial intelligence are increasingly being applied to identify phenotypically causal genetic variation. These data-driven methods require comprehensive training sets to deliver reliable results. However, large unbiased datasets for variant prioritization and effect prediction are rare, as most of the available databases do not represent a broad ensemble of variant effects and are often biased towards the protein-coding genome, or even towards a few well-studied genes. To overcome these issues, we propose several alternative training sets derived from subsets of human standing variation. Specifically, we use variants identified from whole-genome sequences of 71,156 individuals contained in gnomAD v3.0 and approximate the benign set with frequent and the deleterious set with rare standing variation. We apply the Combined Annotation Dependent Depletion (CADD) framework and train several alternative models using CADD v1.6. Using the NCBI ClinVar validation set, we demonstrate that the alternative models have state-of-the-art accuracy, outperforming the widely used pathogenicity score CADD v1.6 in certain genomic regions.
Being larger than conventional databases, including the evolutionary-derived training dataset of about 30 million variants in CADD, standing variation datasets cover a broader range of genomic regions and rare instances of the applied annotations. For example, they cover more recent evolutionary changes common in gene regulatory regions, which are more challenging to assess with conventional tools. Finally, datasets derived from standing variation better represent allelic changes in the human genome and do not require extensive simulations and adaptations to annotations of the evolutionary-derived sequence alterations used for CADD training. We provide datasets as well as trained models to the community for further development and application.
Suggestion for a Graphical Abstract
Author's Summary
Here, we present the varCADD approach for predicting variant deleteriousness. Over time, pathogenic allelic changes are removed by purifying selection, while neutral or beneficial changes can be passed on to subsequent generations. Consequently, the frequencies of pathogenic variants decrease, those of beneficial alleles increase, and the frequencies of neutral variants are subject to drift. Allele frequencies in standing variation can therefore be used as a proxy for deleteriousness. To train a machine learning model for variant prioritization, frequent variants from gnomAD v3.0 were used as the proxy-benign set and rare variants as the proxy-deleterious set. The resulting training set exceeds existing datasets in size and allows for genome-wide coverage of molecular effects. The training set was annotated with sequence conservation, epigenetic, sequence-based and other features using the CADD v1.6 framework, after which a logistic regression model was trained. The output of the model can be interpreted as the probability that a variant has a deleterious effect on genetic function.
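The labelling and training idea summarized above can be illustrated with a small, self-contained sketch. It is not the varCADD implementation: the frequency cutoffs (`COMMON_AF`, `RARE_AF`), the single made-up conservation feature, the toy variants, and the plain gradient-descent fit are all assumptions chosen for illustration. The actual models use the full CADD v1.6 annotation set and gnomAD v3.0 variants.

```python
import math

# Hypothetical allele-frequency cutoffs, for illustration only.
COMMON_AF = 0.01    # at or above: proxy-benign (label 0)
RARE_AF = 0.0001    # at or below: proxy-deleterious (label 1)


def proxy_label(allele_frequency):
    """Return 0 (proxy-benign), 1 (proxy-deleterious) or None (excluded)."""
    if allele_frequency >= COMMON_AF:
        return 0
    if allele_frequency <= RARE_AF:
        return 1
    return None  # intermediate frequencies are left out of this toy set


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Model output: estimated probability of the proxy-deleterious class."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy variants: (allele frequency, [made-up conservation score in 0..1]).
toy_variants = [
    (0.25, [0.10]), (0.12, [0.20]), (0.05, [0.15]), (0.30, [0.30]),
    (0.00002, [0.90]), (0.00005, [0.80]), (0.00001, [0.70]), (0.00008, [0.95]),
    (0.002, [0.50]),  # intermediate frequency, excluded by proxy_label
]

labelled = [(x, proxy_label(af)) for af, x in toy_variants]
X = [x for x, lab in labelled if lab is not None]
y = [lab for _, lab in labelled if lab is not None]

weights, bias = train_logistic_regression(X, y)
print(predict_proba(weights, bias, [0.9]), predict_proba(weights, bias, [0.1]))
```

In this toy setting the model assigns a higher score to the highly conserved site than to the weakly conserved one, which mirrors how the model output is read as a probability of deleterious effect. The labels remain frequency-based proxies, not confirmed pathogenicity.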