Investigating Algorithmic Bias in Machine Learning Prediction Models of Suicide Attempts in Multiple Clinical Settings by Race/Ethnicity and Gender
Abstract
Importance: Machine learning models reflect their training data and may thus learn and perpetuate healthcare disparities.

Objective: To evaluate whether the performance of a validated machine learning model predicting suicide attempts from electronic health records (EHRs) varies by race/ethnicity or gender.

Design: In this prognostic study, we re-analyzed previously validated landmark prediction models predicting suicide attempts within 18 months after a healthcare visit. Prediction models were estimated with regularized Cox regression in three cohorts: (1) general outpatient; (2) psychiatric emergency department (ED); and (3) psychiatric inpatient. Model performance (area under the curve [AUC], sensitivity, and positive predictive value [PPV]) was evaluated separately by race/ethnicity and by gender in all three cohorts, and at the intersection of race/ethnicity and gender in the general outpatient cohort.

Setting: EHR data were drawn from the Research Patient Data Registry at Mass General Brigham.

Participants: Individuals ages 15–85 years seen in at least one of three clinical settings between Jan 1, 2016 and Dec 31, 2018: general outpatient (N=1,210,222), psychiatric ED (N=13,098), and psychiatric inpatient (N=7,825).

Main Outcomes and Measures: The primary outcome was suicide attempt, ascertained with validated ICD codes during the 18 months after a randomly sampled "landmark visit" in one of the three settings.

Results: When considering gender alone, models performed consistently better for male than for female patients. When considering race/ethnicity alone, results were equivocal: in the general outpatient cohort, models had a higher AUC for White than for Hispanic patients, whereas in the psychiatric ED, AUC was highest for Asian patients. When considering the intersection of race/ethnicity and gender in the general outpatient cohort, models performed better for White men than for Hispanic and White women across all metrics.
There were also gender differences within racial/ethnic groups, with higher PPV for Black men than for Black women and for Hispanic men than for Hispanic women, suggesting that gender differences largely drove these disparities.

Conclusions and Relevance: We observed modest evidence of disparities in suicide prediction models by gender and limited evidence of disparities by race/ethnicity alone. More consistent patterns of bias emerged at the intersection of race/ethnicity and gender. Future work should replicate these findings in larger, more diverse samples to ensure fair deployment of such models.
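The subgroup evaluation described in the abstract (AUC, sensitivity, and PPV computed within each demographic stratum) can be sketched as follows. This is an illustrative sketch, not the authors' code: the column names, decision threshold, and toy data are all hypothetical assumptions.

```python
# Illustrative sketch of per-subgroup model evaluation (AUC, sensitivity, PPV).
# Assumes a DataFrame with a predicted-risk column and a binary outcome column;
# column names and the 0.5 decision threshold are hypothetical.
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_metrics(df, group_col, score_col="risk", label_col="attempt",
                     threshold=0.5):
    """Compute AUC, sensitivity, and PPV within each level of group_col."""
    rows = []
    for group, sub in df.groupby(group_col):
        y, s = sub[label_col], sub[score_col]
        pred = (s >= threshold).astype(int)
        tp = int(((pred == 1) & (y == 1)).sum())
        fp = int(((pred == 1) & (y == 0)).sum())
        fn = int(((pred == 0) & (y == 1)).sum())
        rows.append({
            "group": group,
            # AUC is undefined if a subgroup has only one outcome class.
            "auc": roc_auc_score(y, s) if y.nunique() > 1 else float("nan"),
            "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
            "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
        })
    return pd.DataFrame(rows)

# Toy example with hypothetical data:
df = pd.DataFrame({
    "gender":  ["F", "F", "F", "M", "M", "M"],
    "risk":    [0.2, 0.7, 0.6, 0.1, 0.8, 0.9],
    "attempt": [0, 1, 0, 0, 1, 1],
})
print(subgroup_metrics(df, "gender"))
```

Comparing the resulting rows across strata (or across intersectional strata, by grouping on a combined race/ethnicity-gender column) is the kind of comparison the abstract reports.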