Genome-Wide Association Study of Dyslexia: A Comprehensive Machine Learning Pipeline Achieving Over 98% Accuracy

Nora Alice Fink
Michael Fink

Abstract

Dyslexia affects approximately 10% of children worldwide, hindering the development of critical reading and writing skills. Although heritability estimates for dyslexia reach up to 70%, the identification of robust genetic markers has proven challenging. Recent advances in large-scale genomic data generation and sophisticated machine learning (ML) algorithms have enabled deeper exploration of genotype–phenotype relationships. In this study, we investigated a curated dataset of the top 10,000 single nucleotide polymorphisms (SNPs) associated with dyslexia from a genome-wide association study (GWAS) performed by 23andMe. We aimed to classify SNPs into those reaching genome-wide significance (p < 5×10^(-8)) versus those not meeting this threshold. Our novel pipeline combined three supervised ML algorithms—Logistic Regression, XGBoost, and CatBoost—augmented by robust hyperparameter tuning. We achieved a test-set accuracy of up to 98.5%, with an accompanying Area Under the ROC Curve (AUC) of 0.9987 using XGBoost. We further integrated unsupervised clustering via Agglomerative Clustering and dimensionality reduction through Uniform Manifold Approximation and Projection (UMAP) to assess the structure of the data, revealing a moderate silhouette score of 0.2069. These findings suggest that machine learning approaches can significantly enhance the identification of genetic architectures in dyslexia-related datasets.
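Two quantities in the abstract can be made concrete with a small sketch: the binary target (whether a SNP's p-value falls below the genome-wide threshold of 5×10^(-8)) and the silhouette score used to judge the clustering. The code below is illustrative only and is not the authors' pipeline. The function names `label_significant` and `silhouette_score`, the toy p-values, and the toy 2-D points are all assumptions made for this example, and Euclidean distance is assumed for the silhouette.

```python
from math import dist
from statistics import mean

GENOME_WIDE_P = 5e-8  # significance threshold quoted in the abstract


def label_significant(p_values):
    """Return 1 for SNPs with p < 5e-8 and 0 otherwise (hypothetical helper)."""
    return [1 if p < GENOME_WIDE_P else 0 for p in p_values]


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself is in the sum, hence len - 1)
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean([dist(point, q) for q in members])
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Toy inputs, invented for illustration (not data from the study).
toy_p_values = [1e-9, 3e-8, 5e-8, 2e-6, 0.04]
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

if __name__ == "__main__":
    print(label_significant(toy_p_values))
    print(silhouette_score(toy_points, [0, 0, 1, 1]))  # well-separated clusters
    print(silhouette_score(toy_points, [0, 1, 0, 1]))  # clusters that cut across the groups
```

The silhouette coefficient ranges from -1 to 1: values near 1 indicate compact, well-separated clusters, while values near 0 indicate overlapping clusters, which gives some context for the reported 0.2069. Note also that the comparison is strict, so a SNP with p exactly equal to 5×10^(-8) is labeled as not significant.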

Version published to 10.20944/preprints202502.1834.v1
Feb 25, 2025

Related articles

Deep-learning-derived glaucoma-related endophenotypes enable novel genome-wide genetic and functional discovery

This article has 9 authors:
1. Liyin Chen
2. Yan Zhao
3. Saber Kazeminasab Hashemabad
4. Tobias Elze
5. Mohammad Eslami
6. Mengyu WANG
7. Janey Wiggs
8. Ayellet Segre
9. Nazlee Zebardast
This article has no evaluations. Latest version: Jan 22, 2026

Understanding Pathways in Bioinformatics, Genomics, and Health Applications

This article has 1 author:
1. Diptarup Mallick
This article has no evaluations. Latest version: Jan 19, 2026

Protein Language Models Rescue Variant Pathogenicity Prediction in Intrinsically Disordered Regions Through Synergistic Integration with Structure-Based Methods

This article has 1 author:
1. Hayden Farquhar
This article has no evaluations. Latest version: Feb 4, 2026
