Developing a School-level Dropout Prediction Model Using Educational Administrative Data and Regression Tree Algorithms

Abstract

Student dropout poses long-standing educational and social challenges, yet predictive studies using large-scale administrative data at the school level remain limited. This study proposes a dropout prediction model for high schools by leveraging 2023 educational datasets from NEIS and the School Information Disclosure System. After integrating and preprocessing the full dataset, three tree-based regression models (Decision Tree, Random Forest, and XGBoost) were evaluated, with XGBoost demonstrating the strongest performance. To ensure suitability for early-semester prediction, features unavailable before mid-semester were removed, and SHAP-based feature selection was conducted to determine an optimal feature subset. The final XGBoost model, constructed with 33 features, achieved an Adjusted R² of approximately 0.429. SHAP analysis revealed that class count, department type count, number of teachers, and general classroom count were the most influential predictors, while several resource-related variables showed negative associations with dropout rates. The findings highlight the potential of early-semester administrative data for identifying schools at elevated risk of dropout and offer practical implications for targeted intervention and resource allocation.
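
The abstract reports Adjusted R² rather than plain R², which matters when a model uses many features (33 here), because the adjustment penalizes each added predictor. The sketch below shows the standard formula, Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), in plain Python. The function names are hypothetical and are not taken from the study's code, and the abstract does not report the number of schools n, so any sample size passed in is an assumption for illustration only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Standard adjusted R^2 for n_samples observations and n_features predictors.

    n_samples is an input the caller must supply; the abstract does not
    state it, so any value used here is an assumption.
    """
    dof = n_samples - n_features - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


if __name__ == "__main__":
    # Toy dropout rates (made-up numbers, not data from the study).
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.5, 2.0, 2.5, 4.0]
    r2 = r_squared(observed, predicted)
    print(adjusted_r_squared(r2, n_samples=len(observed), n_features=1))
```

With a fixed R², the adjusted value falls as the number of features grows relative to the number of schools, which is one reason reducing the feature set through SHAP-based selection can improve the reported score even when raw fit changes little.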
