Gene Specific Pathogenicity Predictor for Chromatin-Remodeling BAF Complex-Associated Neurodevelopmental Disorders
Abstract
Advances in whole-genome sequencing have increased the number of variants of uncertain significance (VUS) identified in patient genomes. This has created a diagnostic bottleneck for genetic counselors, who must sift through these variants and determine which are most likely to cause a patient's clinical presentation. Machine learning (ML) tools can help distinguish pathogenic variants among VUS, but there is a need for gene-specific algorithms that predict pathogenic variants with high accuracy. To address this need, we present a workflow for developing gene-specific, ensemble-learning ML tools that combine outputs from other algorithms, the locations of variants within the gene, and evolutionary conservation data to predict pathogenicity. Variants in SMARCA2 and SMARCA4 that are associated with rare neurodevelopmental diseases were used to screen 15 ML algorithms. A random forest learner was tuned to yield a final accuracy of 0.93 on holdout data. Generalizing this predictor to other BAF complex proteins resulted in a sharp decline in performance. We therefore trained a final predictor on all genes in the study; it identifies pathogenic variants in these BAF subunits with an accuracy of 0.91 on holdout data. This BAF-complex-specific predictor achieves higher accuracy and area under the receiver operating characteristic curve (AUROC) than any other predictor. The decline in performance when the predictor is generalized to other proteins emphasizes the need for gene-specific calibration of predictors. Our workflow for developing such models provides a quick, computationally inexpensive route to improving the ML tools available to genetic counselors.
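The abstract reports predictor quality as accuracy and AUROC on holdout data. As a rough illustration of what those two numbers measure, the sketch below computes both for a small set of variants using only the Python standard library. The labels, scores, and the 0.5 decision threshold are hypothetical values invented for this example (1 = pathogenic, 0 = benign); they are not data or settings from the study, and the function names are illustrative rather than part of the authors' workflow.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of variants whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a randomly chosen pathogenic variant receives a
    higher score than a randomly chosen benign one (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one variant of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical holdout set: true labels and predicted pathogenicity scores.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.92, 0.81, 0.40, 0.35, 0.10, 0.55, 0.77, 0.05]

# Assumed decision threshold of 0.5 to turn scores into class labels.
predicted = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(labels, predicted))  # 0.75
print(auroc(labels, scores))        # 0.9375
```

Accuracy depends on the chosen threshold, whereas AUROC summarizes how well the scores rank pathogenic above benign variants across all thresholds, which is why the two are usually reported together.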