Disparities and Predictive Modeling of Foundational Learning in Somaliland: A Gender-, Location-, and School-Type-Based Analysis Using Machine Learning and Regression Approaches
Abstract
This study aimed to develop predictive models that identify the key factors driving foundational learning outcomes and to explore gender and contextual disparities among Grade 2–3 students in Somaliland. Using data from the 2022 Somaliland National Learning Assessment (N = 47,269 students from 1,112 schools), the research integrated student-level Early Grade Reading Assessment (EGRA) and Early Grade Mathematics Assessment (EGMA) scores with school-level details. A cross-sectional, quantitative design was employed, and the data were analyzed using descriptive statistics, two-way ANOVA, binary logistic regression, and supervised machine learning classifiers (Logistic Regression, Decision Tree, Random Forest, XGBoost) to predict low performance, defined as a score in the bottom 25% of the distribution. A significant learning crisis was evident, with 25.6% of students (12,102) identified as low performers in literacy and 25.0% (11,838) in numeracy; 8.8% (4,144 students) were low performers in both. Gender disparities varied by subject: males had slightly higher mean EGRA scores (M = 398.08 vs. M = 392.79 for females), while females had higher mean EGMA scores (M = 694.60 vs. M = 684.39 for males). Logistic regression confirmed that males had lower odds of low literacy performance (OR = 0.894, p < .001) but higher odds of low numeracy performance (OR = 1.132, p < .001). Although private school students had higher mean scores, public school attendance was associated with lower odds of low literacy (OR = 0.740, p < .001) and low numeracy (OR = 0.940, p = .040). School location was the strongest predictor: urban students consistently outperformed their rural counterparts (e.g., EGRA M = 414.45 urban vs. M = 380.69 rural) and had substantially lower odds of low performance in literacy (OR = 0.494, p < .001) and numeracy (OR = 0.500, p < .001). Random Forest feature importance analysis underscored the dominance of location, which accounted for 87.4% (Low_EGRA) and 84.1% (Low_EGMA) of total feature importance.
Tree-based ML models (Decision Tree, Random Forest, XGBoost) achieved marginally better, albeit still modest, F1-scores (≈ 0.412) in identifying low performers than standard logistic regression (F1-score ≈ 0.396 for Low_EGRA). The findings call for urgent policy attention to equitable resource distribution and support for rural schools. Gender-responsive pedagogical strategies are needed to address subject-specific learning needs. The nuanced performance of public versus private schools suggests focusing on quality improvement and on identifying effective practices in public schools that support struggling learners. The modest predictive performance of the ML models indicates that they should complement, rather than replace, teacher assessments in student evaluation frameworks. Future research should prioritize longitudinal studies to establish causality, incorporate more granular data (e.g., teacher quality, household factors), employ qualitative methods to understand contextual nuances, and advance the development of fair, transparent, and more accurate ML models for identifying at-risk students in resource-constrained settings like Somaliland.
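The arithmetic behind several of the reported figures can be checked with a short script. The Python sketch below is illustrative only and is not the authors' analysis code: it recomputes the low-performer shares from the counts given in the abstract, derives the number of students low in at least one subject by inclusion-exclusion (a figure not stated in the abstract), and converts the reported odds ratios into percent changes in odds. The function names are hypothetical, and the confusion-matrix counts passed to `f1_from_counts` are invented purely to show how an F1-score of roughly 0.41 can arise; they are not taken from the study.

```python
# Counts reported in the abstract (2022 Somaliland National Learning Assessment).
N_STUDENTS = 47_269
LOW_EGRA = 12_102   # low performers in literacy
LOW_EGMA = 11_838   # low performers in numeracy
LOW_BOTH = 4_144    # low performers in both subjects


def share(count, total=N_STUDENTS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


def odds_change_pct(odds_ratio):
    """Percent change in odds implied by an odds ratio (negative means lower odds)."""
    return round(100 * (odds_ratio - 1), 1)


def f1_from_counts(tp, fp, fn):
    """F1-score from true positives, false positives, and false negatives."""
    return 2 * tp / (2 * tp + fp + fn)


# Inclusion-exclusion: students low in at least one subject (derived, not reported).
low_either = LOW_EGRA + LOW_EGMA - LOW_BOTH

# Odds ratios as reported in the abstract, keyed by (predictor, outcome).
reported_odds_ratios = {
    ("male", "Low_EGRA"): 0.894,
    ("male", "Low_EGMA"): 1.132,
    ("public", "Low_EGRA"): 0.740,
    ("public", "Low_EGMA"): 0.940,
    ("urban", "Low_EGRA"): 0.494,
    ("urban", "Low_EGMA"): 0.500,
}

if __name__ == "__main__":
    print("Low literacy share (%):", share(LOW_EGRA))
    print("Low numeracy share (%):", share(LOW_EGMA))
    print("Low in both (%):", share(LOW_BOTH))
    print("Low in at least one subject:", low_either, f"({share(low_either)}%)")
    for (predictor, outcome), odds_ratio in reported_odds_ratios.items():
        print(f"{predictor} -> {outcome}: {odds_change_pct(odds_ratio)}% change in odds")
    # Hypothetical confusion-matrix counts, for illustration only.
    print("Example F1:", round(f1_from_counts(tp=40, fp=60, fn=54), 3))
```

Read this way, an odds ratio of 0.494 corresponds to roughly 50.6% lower odds of low literacy performance for urban students, which is a statement about odds rather than about probabilities. An F1-score near 0.41 means that precision, recall, or both remain well below one half.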