Genome-Wide Association Study of Dyslexia: A Comprehensive Machine Learning Pipeline Achieving Over 98% Accuracy

Abstract

Dyslexia affects approximately 10% of children worldwide, hindering the development of critical reading and writing skills. Although heritability estimates for dyslexia reach up to 70%, identifying robust genetic markers has proven challenging. Recent advances in large-scale genomic data generation and in machine learning (ML) algorithms have enabled deeper exploration of genotype–phenotype relationships. In this study, we investigated a curated dataset of the top 10,000 single nucleotide polymorphisms (SNPs) associated with dyslexia in a genome-wide association study (GWAS) performed by 23andMe. We aimed to classify SNPs as either reaching genome-wide significance (p < 5 × 10⁻⁸) or not meeting this threshold. Our pipeline combined three supervised ML algorithms (Logistic Regression, XGBoost, and CatBoost) with hyperparameter tuning. The best model, XGBoost, achieved a test-set accuracy of 98.5% and an Area Under the ROC Curve (AUC) of 0.9987. We further applied unsupervised Agglomerative Clustering and dimensionality reduction with Uniform Manifold Approximation and Projection (UMAP) to assess the structure of the data, obtaining a silhouette score of 0.2069, which indicates only weakly separated clusters. These findings suggest that machine learning approaches can enhance the identification of genetic architectures in dyslexia-related datasets.
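
The two quantities the abstract relies on, the genome-wide significance label and the silhouette score, can be illustrated with a minimal sketch. This is not the authors' code: the function names `label_snps` and `silhouette_score` and the toy inputs are assumptions made for illustration, Euclidean distance is assumed for the silhouette, and only the threshold of 5 × 10⁻⁸ comes from the text.

```python
from math import dist

# Genome-wide significance threshold stated in the abstract.
GENOME_WIDE_THRESHOLD = 5e-8


def label_snps(p_values, threshold=GENOME_WIDE_THRESHOLD):
    """Return 1 for SNPs with p strictly below the threshold, else 0."""
    return [1 if p < threshold else 0 for p in p_values]


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    For each point, a is the mean distance to the other members of its own
    cluster and b is the smallest mean distance to any other cluster; the
    coefficient is (b - a) / max(a, b), which lies in [-1, 1].
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Common convention: a singleton cluster contributes 0.
            scores.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical p-values, not taken from the study's dataset.
    print(label_snps([1e-9, 5e-8, 3e-6]))

    # Toy 2-D points forming two well-separated groups.
    toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    toy_labels = [0, 0, 1, 1]
    print(round(silhouette_score(toy_points, toy_labels), 4))
```

Because the silhouette coefficient ranges from -1 to 1, with values near 1 indicating compact, well-separated clusters, a value such as the reported 0.2069 sits toward the weak end of the scale. In practice a library implementation, for example the one in scikit-learn, would normally be used instead of this hand-rolled version.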
