Generalizability of Risk Models for Treatment-Resistant Depression Across Three Health Systems

Colin G. Walsh
Michael Ripperger
Thomas H. McCoy
Victor Castro
Yirui Hu
H. Lester Kirchner
Douglas Ruderfer
Roy H. Perlis

Abstract

Background

As multiple strategies have emerged for managing treatment-resistant major depressive disorder, efficient identification of individuals at elevated risk for this outcome earlier in their illness course remains essential.

Method

We extracted electronic health record (EHR) data for all individuals with a diagnosis of major depressive disorder who received an index antidepressant prescription in the clinical networks of three geographically distinct health systems, Mass General-Brigham (MGB), Vanderbilt University Medical Center (VUMC), and Geisinger Clinic (GC), between April 1, 2004, and March 30, 2022. The primary outcome, treatment-resistant depression, was defined as provision of electroconvulsive therapy, transcranial magnetic stimulation, or vagus nerve stimulation; prescription of ketamine, esketamine, or monoamine oxidase inhibitors (MAOIs); or failed trials of more than two antidepressants. We applied L1-regularized regression to sociodemographic features, medications, and ICD-10 diagnostic code counts to fit a model of treatment resistance in each of the three cohorts. For each model, we then estimated generalizability (external validity) in the other two cohorts. Concordance between models was measured with concordance correlation coefficients (CCCs), and random forest regression was used to estimate the importance of features predicting discordance.
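
As an illustration of the concordance measure named above, the following is a minimal sketch of Lin's concordance correlation coefficient, written in Python using only the standard library. The abstract does not state which estimator the authors used; this sketch assumes the common form with population (1/n) moments, and the function name and the example risk scores are hypothetical.

```python
from statistics import fmean


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between two paired sequences.

    Assumption: population (divide-by-n) variance and covariance are used;
    the source does not specify the estimator.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        raise ValueError("at least two paired observations are required")

    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)

    # Population variances and covariance of the paired values.
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

    # The squared mean difference penalizes systematic shifts between the
    # two sets of values, which Pearson correlation alone would ignore.
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined when both inputs are constant and equal")

    return 2 * cov_xy / denominator


# Hypothetical predicted risks from two site-specific models, scored on the
# same held-out individuals (illustrative values, not data from the study).
risk_model_a = [0.05, 0.10, 0.20, 0.40]
risk_model_b = [0.07, 0.09, 0.25, 0.30]
print(concordance_correlation(risk_model_a, risk_model_b))
```

Unlike Pearson correlation, this coefficient reaches 1 only when the two sets of predictions agree in value, not merely in rank or linear trend, which is why it suits a comparison of individual-level predictions from different models.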

Results

Across sites, the area under the receiver operating characteristic curve (AUROC) ranged from 0.58 to 0.64 on internal validation and from 0.51 to 0.58 on external validation. The area under the precision-recall curve (AUPRC) ranged from 0.1 to 0.13 on internal validation and averaged 0.07 to 0.13 on external validation, evaluated on the same test sets held out at each site. On the same testing set, CCCs were 0.13 for the VUMC and MGB models, 0.18 for the VUMC and GC models, and 0.38 for the MGB and GC models. The MGB and GC models were thus the best-correlated pair, but no pair was well correlated. Age was the most important feature predicting discordance, followed by coded sex.

Conclusion

These linear models demonstrated consistent aggregate performance but discordant individual-level performance across three disparate major health systems. The inclusion of large and heterogeneous samples suggests that further improvement may require incorporation of data types beyond those readily available in EHRs. Close attention to performance in key subgroups is indicated to ensure models do not perform disparately or unfairly. Prospective studies to evaluate the extent to which clinical models might improve early identification and outcomes are warranted.

Version published to 10.1101/2025.05.21.25328089 on medRxiv
May 27, 2025