Machine Learning Models for Predicting Student Dropout, Enrollment, and Graduation
Abstract
Student dropout in higher education carries high individual and institutional costs and is usually a long-term process shaped by academic, social, and financial factors. This paper examines whether machine learning models trained only on information available at the time of enrollment can predict three student outcomes: dropout, continued enrollment, and graduation. Using administrative data on 4424 degree-seeking students at one university, we construct a feature set of 36 enrollment-time variables covering demographic traits, pre-university academic preparation, program-related facts, and selected financial measures. The prediction task is formulated as a three-class supervised learning problem. We compare three common algorithms, namely multinomial logistic regression, random forest, and extreme gradient boosting (XGBoost), within a single pipeline that incorporates median imputation, feature scaling where necessary, a stratified 60/20/20 train-validation-test split, and Bayesian hyperparameter optimization using Optuna. Model performance is evaluated across five random seeds, with the macro-averaged F1-score as the primary metric, supplemented by accuracy, class-specific precision and recall, confusion matrices, and multiclass receiver operating characteristic (ROC) curves. The findings indicate that enrollment-time data alone can support practically useful predictions. The tuned XGBoost models perform best and most consistently, with a macro-F1 of about 0.70-0.72 and a macro ROC-AUC of about 0.88-0.90 across seeds; the tuned random forests follow closely behind, while multinomial logistic regression performs worse but is more readily interpretable. All models predict graduates most accurately, dropouts moderately well, and still-enrolled students least accurately.
Feature-importance analyses reveal that academic preparation and early academic indicators, together with program and financial variables, are the most important predictors of outcomes. The results show that machine learning models applied at enrollment time can serve as an early-warning layer for institutional retention work, allowing at-risk students to be identified before they begin their coursework. The paper concludes with recommendations for incorporating such models into the advising process and with directions for future work on cross-institutional validation, richer behavioral features, and explainable and equitable predictive analytics.
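To make the evaluation protocol concrete, the following minimal Python sketch illustrates two of the steps named in the abstract: a stratified 60/20/20 train-validation-test split and the macro-averaged F1-score. It is not the authors' code. The function names `stratified_split` and `macro_f1`, the rounding rule for split sizes, and the toy label counts are all assumptions made for illustration, and the sketch uses only the Python standard library rather than the tooling a full pipeline would rely on.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) index lists that preserve class proportions.

    Hypothetical helper: each class is shuffled and split separately, so
    every subset keeps roughly the same class mix as the full data set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores (each class counts equally)."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (made-up counts, not the study's class distribution).
labels = ["dropout"] * 30 + ["enrolled"] * 20 + ["graduate"] * 50
train_idx, val_idx, test_idx = stratified_split(labels, seed=42)
print(len(train_idx), len(val_idx), len(test_idx))  # prints: 60 20 20

# Tiny worked example of macro-F1 on four predictions.
y_true = ["dropout", "dropout", "enrolled", "graduate"]
y_pred = ["dropout", "enrolled", "enrolled", "graduate"]
print(round(macro_f1(y_true, y_pred, ["dropout", "enrolled", "graduate"]), 4))  # prints: 0.7778
```

Because macro-F1 weights the three classes equally, weak performance on the hard-to-predict still-enrolled class lowers the score as much as weak performance on graduates would, which is one reason it suits an imbalanced three-class task. In practice, scikit-learn's `train_test_split` (with its `stratify` argument) and `f1_score(average="macro")` provide equivalent functionality.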