A High-Accuracy Machine-Learning Approach for Dyslexia Screening Based on Gamified Interaction Data
Abstract
Dyslexia is a learning difficulty of neurobiological origin that affects over 10% of the global population, yet it is often underdiagnosed in transparent orthographies such as Spanish. This paper proposes a machine-learning pipeline to detect dyslexia risk using data from a 15-minute gamified online test. The test captures fine-grained interaction and linguistic performance metrics (Clicks, Hits, Misses, Scores, Accuracy, Missrate) across 32 targeted exercises. We incorporate hyperparameter tuning, class balancing (SMOTE and scale_pos_weight), and a range of classifiers, both ensemble-based (XGBoost, CatBoost, Random Forest, LightGBM, Gradient Boosting) and non-ensemble (Logistic Regression, SVM, and MLP). Benchmarking indicates that our approach outperforms a prior published model on a dataset of 3,644 Spanish-speaking children and adolescents (7–17 years old), of whom 392 have professionally diagnosed dyslexia. Our final ensemble achieves an accuracy of 88.34% with an F1-score of 0.48 for the minority class. While not a formal diagnostic tool, this screening pipeline can provide early-stage risk identification for dyslexia in a scalable, self-administered format with minimal hardware requirements. We present full methodological details, feature-importance analysis, confusion matrices, and ensemble performance, alongside a careful comparison to previous studies in transparent orthographies.
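To make the class imbalance behind these figures concrete, the short sketch below works through the arithmetic implied by the reported counts (3,644 participants, 392 with diagnosed dyslexia). The variable names are illustrative, and the negative-to-positive ratio used for `scale_pos_weight` is the common heuristic suggested in the XGBoost documentation; it is an assumption here, not necessarily the value the authors selected after tuning.

```python
# Counts reported in the abstract.
n_total = 3644
n_dyslexia = 392                   # positive (minority) class
n_control = n_total - n_dyslexia   # negative (majority) class

# Share of participants with diagnosed dyslexia in the full dataset.
prevalence = n_dyslexia / n_total

# Accuracy of a trivial classifier that always predicts "no dyslexia",
# computed on the full dataset (a held-out split could differ slightly).
majority_baseline_accuracy = n_control / n_total

# Common heuristic for XGBoost's scale_pos_weight: negatives / positives.
# Assumption: the paper's tuned value may differ from this ratio.
scale_pos_weight = n_control / n_dyslexia

print(f"controls:                   {n_control}")
print(f"prevalence:                 {prevalence:.4f}")
print(f"majority-baseline accuracy: {majority_baseline_accuracy:.4f}")
print(f"scale_pos_weight heuristic: {scale_pos_weight:.2f}")
```

Because roughly nine in ten participants belong to the majority class, accuracy alone says little about how well dyslexia cases are detected, which is why the minority-class F1-score is reported alongside it.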