Development and Validation of Machine Learning-Based Prediction of Depression Progression Using EHR Data: A Multi-Institutional Retrospective Cohort Study
Abstract
Background
Depression is a leading cause of disability worldwide, and timely identification of patients at risk for clinical worsening remains a major challenge. Electronic health records (EHRs) enable large-scale, real-world analyses of disease trajectories; however, standardized symptom-scale data such as the Patient Health Questionnaire-9 (PHQ-9) are often unavailable or recorded only as unstructured text. In this context, severity progression defined by International Classification of Diseases, Tenth Revision (ICD-10) diagnostic codes provides a pragmatic alternative for developing predictive tools that identify worsening depression.
Objective
We aimed to develop and evaluate machine-learning and deep-learning models for predicting ICD-10-defined progression from mild to moderate or severe depression using EHR data curated by the MedStar Health Research Institute (MHRI).
Methods
We conducted a multi-institutional retrospective cohort analysis using the MHRI EHR database, which integrates data from 10 hospitals and 300 outpatient sites across the mid-Atlantic. Adults (≥18 years) with an initial ICD-10 diagnosis of mild depression between 2017 and 2023 were included (N=2131). Nonprogressors were patients whose major depressive disorder remained mild for 24 months (N=270); progressors were patients who received a moderate or severe ICD-10 depression diagnosis within 24 months of the index diagnosis (N=533). Data were stratified and split into training (60%), validation (20%), and test (20%) subsets. A heterogeneous feature set spanning demographics, healthcare utilization, socioeconomic indices, diagnostic context, and laboratory measurements was available. Logistic regression used elastic net regularization with fivefold cross-validation, and random forest hyperparameters were tuned by grid search. XGBoost, CatBoost, and a deep neural network (DNN) were trained with standard learning-rate, depth, and class-weighting settings and early stopping. A deterministic model-selection framework applied prespecified thresholds of sensitivity ≥0.70 and area under the receiver operating characteristic curve (AUC) ≥0.70, and composite rankings integrated accuracy, sensitivity, specificity, and the overfitting gap.
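The eligibility thresholds and composite ranking can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the equal-weight composite formula below is an assumption (the abstract does not specify the exact weighting, so the resulting order need not match the paper's), and the DNN entry carries only its reported AUC, so eligibility is checked defensively. Metrics are the held-out test values reported in Results.

```python
# Sketch of the deterministic model-selection framework: prespecified
# eligibility thresholds followed by a composite ranking of eligible models.
# Metric values are those reported in Results; the DNN dict holds only its
# reported AUC, and missing metrics default to 0.0 in the eligibility check.
candidates = {
    "xgboost":             dict(acc=0.72, auc=0.776, sens=0.77, spec=0.63, gap=0.112),
    "logistic_regression": dict(acc=0.71, auc=0.797, sens=0.79, spec=0.61, gap=0.052),
    "dnn":                 dict(auc=0.671),  # only AUC reported; fails threshold
}

SENS_MIN, AUC_MIN = 0.70, 0.70  # prespecified eligibility thresholds

def eligible(m):
    """A model must clear both thresholds before it can be ranked."""
    return m.get("sens", 0.0) >= SENS_MIN and m.get("auc", 0.0) >= AUC_MIN

def composite(m):
    """Assumed equal-weight composite: reward accuracy, sensitivity, and
    specificity; penalize the train-test overfitting gap."""
    return m["acc"] + m["sens"] + m["spec"] - m["gap"]

ranked = sorted(
    (name for name, m in candidates.items() if eligible(m)),
    key=lambda name: composite(candidates[name]),
    reverse=True,
)
print(ranked)  # the DNN is excluded before ranking (AUC 0.671 < 0.70)
```

With equal weights the two surviving models score within 0.05 of each other, which illustrates why the choice of composite weighting (and the overfitting-gap penalty) materially affects which model is selected.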
Results
The analytic cohort included 803 patients with complete two-year follow-up. Under the selection criteria, the DNN failed to meet the AUC threshold (0.671) and was excluded. Among the remaining models, XGBoost achieved the top composite score (accuracy = 0.72; AUC = 0.776; sensitivity = 0.77; specificity = 0.63; overfitting gap = 0.112). Logistic regression ranked second (accuracy = 0.71; AUC = 0.797; sensitivity = 0.79; specificity = 0.61; overfitting gap = 0.052), followed by CatBoost and random forest, the latter penalized for overfitting (gap = 0.278). An audit note generated locally with TinyLlama through a Hugging Face pipeline corroborated XGBoost as the most balanced model.
Conclusions
Using EHR data from a multi-institutional regional health system, we developed and validated machine-learning models that predict progression of depression from mild to moderate or severe disease. XGBoost demonstrated the most reliable composite performance. These findings support the feasibility of leveraging socioeconomic and EHR data to predict worsening depression and underscore the importance of transparent model-selection frameworks for trustworthy clinical artificial intelligence.