MobiDeep: an AI-based meta-score for scoring non-coding DNA variations


Abstract

Background The interpretation of non-coding variants (NCVs) from genome sequencing represents a major bottleneck in the diagnosis of rare diseases. Existing variant effect predictors (VEPs) show variable performance across different genomic contexts, and a lack of region-specific clinical guidance hinders accurate variant prioritization. This study aimed to rigorously benchmark state-of-the-art VEPs to define region-specific thresholds and to develop MobiDeep, a novel meta-score designed to improve NCV prioritization.

Methods We curated a high-confidence dataset of 448 pathogenic NCVs (ClinVar, HGMD, literature) and 38,146 presumed benign NCVs. Critically, variants affecting splicing were excluded to focus on strictly regulatory mechanisms. We benchmarked the performance of ReMM, CADD, GPN-MSA, Cactus241way, and phyloP, both globally and stratified by genomic region (e.g., 5'UTR, 3'UTR). Subsequently, we developed MobiDeep, a neural network integrating these five scores, optimized using Optuna and validated on an independent holdout set of pathogenic NCVs.

Results Benchmarking confirmed that no single tool is universally optimal, with performance varying significantly by genomic context: ReMM excelled in non-coding exons (AUROC = 0.987), whereas GPN-MSA demonstrated superior performance for 3'UTRs (AUROC = 0.901). We established data-driven clinical thresholds, identifying an optimal global cutoff of 10.37 for CADD v1.7, which supports the CADD ≥ 10 threshold proposed in previous work for regulatory variants, and 0.80 for ReMM. Building on these insights, MobiDeep significantly outperformed all individual predictors on an independent test set, achieving an AUROC of 0.973 and an AUPRC of 0.888. In large-scale simulations mimicking a diagnostic setting, MobiDeep prioritized causal variants effectively, placing 52.0% and 75% within the top 5 and top 20 ranks, respectively. Furthermore, the model correctly prioritized all ClinVar pathogenic variants in the recently discovered RNU4-2 non-coding gene.
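To make the benchmarking step concrete, the following is a minimal Python sketch of region-stratified AUROC and a data-driven cutoff. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how thresholds were derived, so Youden's J statistic is used here as one common choice, and the function names (`auroc`, `youden_threshold`, `auroc_by_region`) and the toy records are hypothetical.

```python
from collections import defaultdict


def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U), using average ranks for tied scores."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(y for _, y in pairs[i:j])
        i = j
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def youden_threshold(scores, labels):
    """Cutoff t maximizing sensitivity + specificity - 1 for the rule score >= t.

    ASSUMPTION: Youden's J is an illustrative criterion, not necessarily the
    one used in the preprint. Quadratic time, which is fine for a sketch.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def auroc_by_region(records):
    """records: iterable of (region, score, label), label 1 = pathogenic."""
    grouped = defaultdict(lambda: ([], []))
    for region, score, label in records:
        grouped[region][0].append(score)
        grouped[region][1].append(label)
    return {region: auroc(s, y) for region, (s, y) in grouped.items()}


# Toy, made-up records purely for illustration (not data from the study).
toy = [
    ("5'UTR", 0.9, 1), ("5'UTR", 0.2, 0), ("5'UTR", 0.4, 0), ("5'UTR", 0.7, 1),
    ("3'UTR", 0.6, 1), ("3'UTR", 0.8, 0), ("3'UTR", 0.3, 0), ("3'UTR", 0.9, 1),
]

if __name__ == "__main__":
    print(auroc_by_region(toy))
    print(youden_threshold([s for _, s, _ in toy], [y for _, _, y in toy]))
```

Running the same two functions per predictor and per region would give the kind of table the Results paragraph summarizes. With only 448 pathogenic variants against 38,146 presumed benign ones, a per-region cutoff can rest on very few positives, so any threshold obtained this way should be reported alongside the number of pathogenic variants in that region.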
Conclusions Our findings confirm that individual predictors and uniform thresholds are insufficient for interpreting the diverse landscape of non-coding variants. We demonstrate that region-specific calibration is essential for accurate prioritization. Our meta-score, MobiDeep, improves classification performance compared to existing tools. It serves as a robust filter to streamline the identification of high-confidence variants, thereby facilitating focused manual review and subsequent biological validation in diagnostic settings.
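The top-5 and top-20 figures reported for the diagnostic simulations can be read as a top-k hit rate over simulated cases. The sketch below shows one way such a metric might be computed; the abstract does not describe the simulation design, so the case structure (one causal variant scored against a list of background variants), the pessimistic tie handling, and the names `causal_rank` and `top_k_rate` are all assumptions made for illustration.

```python
def causal_rank(causal_score, background_scores):
    """1-based rank of the causal variant when variants are sorted by descending score.

    ASSUMPTION: ties are resolved pessimistically, i.e. a background variant
    with an equal score is ranked ahead of the causal variant.
    """
    return 1 + sum(1 for s in background_scores if s >= causal_score)


def top_k_rate(cases, k):
    """Fraction of cases whose causal variant ranks within the top k.

    cases: list of (causal_score, background_scores) tuples, one per
    simulated individual (hypothetical structure).
    """
    if not cases:
        raise ValueError("no cases provided")
    hits = sum(1 for causal, background in cases if causal_rank(causal, background) <= k)
    return hits / len(cases)


# Toy, made-up cases purely for illustration (not data from the study).
cases = [
    (0.95, [0.10, 0.20, 0.99]),                    # causal variant ranks 2nd
    (0.50, [0.60, 0.70, 0.80, 0.90, 0.95, 0.10]),  # causal variant ranks 6th
    (0.80, [0.10, 0.20]),                          # causal variant ranks 1st
]

if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, top_k_rate(cases, k))
```

In a realistic setting each background list would hold the many non-coding variants of one individual, so the rank reflects how far down a reviewer would need to read before reaching the causal variant.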
