A High-Accuracy Machine-Learning Approach for Dyslexia Screening Based on Gamified Interaction Data

Abstract

Dyslexia is a learning difficulty of neurobiological origin that affects over 10% of the global population, yet it is often underdiagnosed in languages with transparent orthographies such as Spanish. This paper proposes a novel high-accuracy machine-learning pipeline to detect dyslexia risk using data from a 15-minute gamified online test. The test captures fine-grained interaction and linguistic performance metrics (Clicks, Hits, Misses, Scores, Accuracy, Missrate) across 32 targeted exercises. We incorporate hyperparameter tuning, class balancing (SMOTE and scale_pos_weight), and a range of classifiers: tree-based ensemble methods (XGBoost, CatBoost, Random Forest, LightGBM, Gradient Boosting) alongside Logistic Regression, SVM, and MLP. Benchmarking indicates that our approach outperforms a prior published model on a dataset of 3,644 Spanish-speaking children and adolescents (7–17 years old), of whom 392 have professionally diagnosed dyslexia. Our final ensemble achieves an accuracy of 88.34% with an F1-score of 0.48 for the minority class. While not a formal diagnostic tool, this machine-learning screening pipeline can provide early-stage risk identification for dyslexia in a scalable, self-administered format with minimal hardware requirements. We present full methodological details, feature-importance analysis, confusion matrices, and ensemble performance, alongside a careful comparison to previous studies in transparent orthographies.
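
The class-balancing step mentioned in the abstract can be illustrated with a short arithmetic sketch. It uses only the two counts reported above (3,644 participants, 392 with a diagnosis). The function name `class_balance_summary` is hypothetical, and the negatives-to-positives ratio is a commonly used starting value for a `scale_pos_weight` setting, not necessarily the value the authors selected after tuning. The majority-class baseline is computed on the full dataset and may differ from the prevalence in the authors' evaluation split.

```python
# Class counts as reported in the abstract.
N_TOTAL = 3644      # all participants
N_POSITIVE = 392    # participants with professionally diagnosed dyslexia


def class_balance_summary(n_total: int, n_positive: int) -> dict:
    """Summarise class imbalance from raw counts (illustrative helper)."""
    if not 0 < n_positive < n_total:
        raise ValueError("need at least one positive and one negative example")
    n_negative = n_total - n_positive
    return {
        "n_negative": n_negative,
        # Share of the minority (dyslexia) class in the dataset.
        "prevalence": n_positive / n_total,
        # Common heuristic for a positive-class weight: negatives / positives.
        # Assumption: the authors' tuned value may differ.
        "scale_pos_weight": n_negative / n_positive,
        # Accuracy of a trivial model that always predicts the majority class.
        "majority_baseline_accuracy": n_negative / n_total,
    }


summary = class_balance_summary(N_TOTAL, N_POSITIVE)
for name, value in summary.items():
    print(f"{name}: {value:.4f}" if isinstance(value, float) else f"{name}: {value}")
```

With these counts the minority class makes up roughly 10.8% of the data and the heuristic weight is roughly 8.3. Because a model that always predicts the majority class would already score about 89% accuracy on such data, the minority-class F1-score is the more informative of the two reported metrics.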
