Leveraging Synthetic Genomics and AI to Predict Pathogenic Variants in Hereditary Hearing Loss

Abstract

Consanguineous marriages heighten the risk of recessive genetic disorders, including congenital sensorineural hearing loss (CSHL), by increasing the prevalence of homozygous pathogenic variants. However, the limited availability of large-scale annotated variant datasets remains a major obstacle to developing accurate machine learning (ML) classifiers for variant pathogenicity prediction in such contexts. In this study, we propose a novel approach that uses synthetic whole exome sequencing (WES) data to develop a robust ML model for distinguishing pathogenic from non-pathogenic variants in consanguineous hearing loss cases. We first applied a rule-based probabilistic framework to simulate clinically realistic distributions of variant features, including chromosomal position, allele frequency, CADD, SIFT, and PolyPhen scores, and ClinVar-like pathogenicity labels. To enhance data diversity and mitigate bias from manually defined rules, we employed rule-based probabilistic simulation to generate complex, high-fidelity synthetic variants that preserved conditional dependencies across multiple annotation layers. A balanced dataset of 5,000 variants was generated and used to train and evaluate several ML models, including XGBoost, random forest, and logistic regression. The models achieved high accuracy and strong discriminative power, as measured by ROC-AUC and F1-score, supporting the feasibility of synthetic data in precision genomics. Our results demonstrate the potential of combining rule-based priors with generative models to overcome data scarcity in rare genetic disorders and to enable ML-based variant classification in consanguineous populations. This synthetic-data-driven pipeline offers a scalable and ethical alternative for training predictive models in underrepresented genetic conditions, ultimately facilitating early diagnosis and personalized healthcare interventions for inherited hearing loss.
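As an illustration of what a rule-based probabilistic simulation of this kind could look like, the following is a minimal sketch rather than the authors' pipeline. Every distribution, parameter value, and function name (`simulate_variant`, `make_dataset`, `roc_auc`) is an assumption chosen for illustration; the abstract does not specify them. The sketch draws each annotation conditionally on a ClinVar-like label, builds a balanced set of 5,000 variants, and computes a rank-based ROC-AUC for a single feature as a sanity check in place of the trained classifiers.

```python
import random

CHROMS = [str(c) for c in range(1, 23)] + ["X"]

def simulate_variant(pathogenic, rng):
    """Draw one synthetic variant whose annotations depend on its label.

    All distribution parameters below are illustrative assumptions.
    """
    if pathogenic:
        allele_freq = rng.betavariate(0.5, 200)  # assumed: mostly very rare
        cadd = rng.gauss(27.0, 5.0)              # assumed: high CADD
        sift = rng.betavariate(1, 8)             # low SIFT = deleterious
        polyphen = rng.betavariate(8, 2)         # high PolyPhen = damaging
    else:
        allele_freq = rng.betavariate(1, 20)
        cadd = rng.gauss(12.0, 6.0)
        sift = rng.betavariate(4, 2)
        polyphen = rng.betavariate(2, 6)
    return {
        "chrom": rng.choice(CHROMS),
        "pos": rng.randint(1, 40_000_000),       # placeholder coordinate
        "allele_freq": allele_freq,
        "cadd": min(max(cadd, 0.0), 60.0),       # clip to a plausible range
        "sift": sift,
        "polyphen": polyphen,
        "label": int(pathogenic),
    }

def make_dataset(n=5000, seed=0):
    """Build a balanced, shuffled dataset: half pathogenic, half benign."""
    rng = random.Random(seed)
    data = [simulate_variant(i < n // 2, rng) for i in range(n)]
    rng.shuffle(data)
    return data

def roc_auc(scores, labels):
    """Rank-based (Mann-Whitney) ROC-AUC; tied scores get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

if __name__ == "__main__":
    variants = make_dataset()
    labels = [v["label"] for v in variants]
    auc = roc_auc([v["cadd"] for v in variants], labels)
    print(f"{len(variants)} variants, CADD-only ROC-AUC = {auc:.3f}")
```

Because the labels drive the feature distributions here, any classifier trained on such data largely recovers the hand-written rules, so a high ROC-AUC on synthetic variants reflects the separation built into the simulator and would still need confirmation on real annotated variants.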
