Developing a School-level Dropout Prediction Model Using Educational Administrative Data and Regression Tree Algorithms
Abstract
Student dropout poses long-standing educational and social challenges, yet predictive studies that use large-scale administrative data at the school level remain limited. This study proposes a school-level dropout prediction model for high schools using 2023 educational datasets from NEIS and the School Information Disclosure System. After the full dataset was integrated and preprocessed, three tree-based regression models (Decision Tree, Random Forest, and XGBoost) were evaluated, with XGBoost showing the strongest performance. To make the model suitable for early-semester prediction, features that are unavailable before mid-semester were removed, and SHAP-based feature selection was used to determine an optimal feature subset. The final XGBoost model, built with 33 features, achieved an Adjusted R² of approximately 0.429. SHAP analysis indicated that class count, department type count, number of teachers, and general classroom count were the most influential predictors, while several resource-related variables showed negative associations with dropout rates. The findings highlight the potential of early-semester administrative data for identifying schools at elevated risk of dropout and offer practical implications for targeted intervention and resource allocation.
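For readers unfamiliar with the reported metric, the following is a minimal sketch of how an Adjusted R² can be computed from a model's predictions. It uses the standard textbook definition, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations (here, schools) and p is the number of features. The function names are illustrative and are not taken from the study, and the abstract does not state the sample size, so no attempt is made to reproduce the reported value.

```python
def r_squared(y_true, y_pred):
    """Plain coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R²: penalizes R² for the number of features used.

    n_samples  -- number of observations (assumed here to be schools)
    n_features -- number of predictors in the model (33 in the abstract)
    """
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof
```

Because the penalty grows with the number of features, removing uninformative features (as the SHAP-based selection in the abstract aims to do) can raise Adjusted R² even when plain R² drops slightly.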