Developing a School-level Dropout Prediction Model Using Educational Administrative Data and Regression Tree Algorithms

Gun-woo Choi

Abstract

Student dropout poses long-standing educational and social challenges, yet predictive studies using large-scale administrative data at the school level remain limited. This study proposes a dropout prediction model for high schools by leveraging 2023 educational datasets from NEIS and the School Information Disclosure System. After integrating and preprocessing the full dataset, three regression-tree models (Decision Tree, Random Forest, and XGBoost) were evaluated, with XGBoost demonstrating the strongest performance. To ensure suitability for early-semester prediction, features unavailable before mid-semester were removed, and SHAP-based feature selection was conducted to determine an optimal feature subset. The final XGBoost model, constructed with 33 features, achieved an Adjusted R² of approximately 0.429. SHAP analysis revealed that class count, department type count, number of teachers, and general classroom count were the most influential predictors, while several resource-related variables showed negative associations with dropout rates. The findings highlight the potential of early-semester administrative data for identifying schools at elevated risk of dropout and offer practical implications for targeted intervention and resource allocation.
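
The Adjusted R² reported above penalizes plain R² for the number of features in the model. The following is a minimal sketch of the standard definition, not the author's code: the function names are hypothetical, and the sample size and raw R² in the usage example are made-up values, since the abstract states neither. Only the feature count of 33 comes from the abstract.

```python
def r_squared(y_true, y_pred):
    """Plain coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


if __name__ == "__main__":
    # Hypothetical inputs: 2000 schools and a raw R^2 of 0.45 are
    # illustrative assumptions; 33 is the feature count from the abstract.
    adj = adjusted_r_squared(r2=0.45, n_samples=2000, n_features=33)
    print(f"adjusted R^2 = {adj:.3f}")
```

Because the penalty grows with the number of features, the adjusted value is always below the raw R² for an imperfect model with at least one feature, which makes it a reasonable yardstick when comparing feature subsets of different sizes.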

Version published to 10.35542/osf.io/2tg4h_v1 on OSF Preprints
Dec 8, 2025

Related articles

Predictive Modeling of Graduate Vocational Mobility Using Multivariate Attributes

This article has 1 author:
1. Irshad Ahmed Abbasi
This article has no evaluations. Latest version: Jan 5, 2026

Enhancing Student Retention in Higher Education Institutions (HEIs): Machine Learning Approach

This article has 4 authors:
1. Emeka Umendu
2. Mustansar Ghazanfar
3. Aaron Kans
4. Md Atiqur Rahman Ahad
This article has no evaluations. Latest version: Jan 13, 2026

Decoding SAT Scores: A Multifaceted Analysis of Socioeconomic and Educational Influences Across Diverse Regions

This article has 2 authors:
1. Margaret Liu
2. Wei Lu
This article has no evaluations. Latest version: Jan 5, 2026
