Whose Truth Is Ground Truth?: Consequences of Label Choice on ML Models
Abstract
Objective: Machine learning (ML) models developed using electronic health record (EHR) data frequently rely on provider-documented diagnoses as ground truth, despite substantial variation in how mental health conditions are diagnosed and recorded. We aimed to examine how the choice of depression label (patient-reported, provider-coded, or patient–provider concordant) affected the performance and feature importance of tree-based ML models.

Methods: We analyzed EHR data from 644,387 adults (2012–2019) in the OCHIN ADVANCE network. Depression outcomes were defined using (1) provider-coded ICD diagnoses, (2) patient-reported PHQ-9 scores in the “Most Severe” range (26–27), and (3) concordant indications from both sources. Predictors included demographics, vital signs, chronic conditions, and social determinants of health. Classification and regression trees (CART) were trained using 70/30 train–test splits and 10-fold cross-validation. Performance was assessed using sensitivity, specificity, AUC, and F1 score. SHapley Additive exPlanations (SHAP) quantified feature importance.

Results: Model performance was modest (AUC = 0.62–0.64); sensitivity was highest for provider-indicated depression (0.33). Feature importance differed substantially by labeling method: anxiety was the only consistently influential predictor across label strategies. Demographic features (e.g., gender, race, marital status) were highly influential in provider-labeled models but not in models using patient-reported or concordant labels.

Conclusions: We empirically demonstrated that depression label choice meaningfully alters ML model performance and feature interpretation. Higher AUC values from provider-derived labels were offset by amplified demographic feature importance, raising concerns about bias. These findings underscore the importance of transparent reporting, testing multiple labeling strategies, and including patient-reported outcomes in model development to support trustworthy and equitable mental health AI.
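To make the three labeling strategies concrete, the following is a minimal sketch of how such labels might be derived from a per-patient EHR extract. The column names (`provider_icd_depression`, `max_phq9`) are hypothetical placeholders, not the authors' variables; the PHQ-9 threshold (26–27) follows the abstract.

```python
import pandas as pd

# Hypothetical per-patient EHR extract: a flag for any provider-coded
# ICD depression diagnosis and the patient's maximum recorded PHQ-9 score.
ehr = pd.DataFrame({
    "provider_icd_depression": [1, 0, 1, 0],
    "max_phq9": [27, 22, 10, 26],
})

labels = pd.DataFrame({
    # (1) provider-coded ICD diagnosis
    "provider": ehr["provider_icd_depression"] == 1,
    # (2) patient-reported PHQ-9 in the "Most Severe" range (26-27, inclusive)
    "patient": ehr["max_phq9"].between(26, 27),
})
# (3) concordant: both the provider code and the patient report indicate depression
labels["concordant"] = labels["provider"] & labels["patient"]
```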
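The abstract does not include code, so the sketch below only illustrates the evaluation pipeline it describes: a CART (here scikit-learn's `DecisionTreeClassifier` as a stand-in) trained on a 70/30 split with 10-fold cross-validation, evaluated by sensitivity, specificity, AUC, and F1, with mean-|SHAP| feature importance. The feature matrix `X`, binary label vector `y`, and hyperparameters such as `max_depth` are assumptions, not the authors' settings.

```python
import numpy as np
import shap
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, f1_score

def fit_and_evaluate(X, y, seed=0):
    """Train a CART on a 70/30 split and report the abstract's metrics."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    model = DecisionTreeClassifier(max_depth=5, random_state=seed)  # depth is a placeholder

    # 10-fold cross-validation on the training portion
    cv_auc = cross_val_score(model, X_tr, y_tr, cv=10, scoring="roc_auc")

    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    prob = model.predict_proba(X_te)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    metrics = {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "auc": roc_auc_score(y_te, prob),
        "f1": f1_score(y_te, pred),
        "cv_auc_mean": cv_auc.mean(),
    }

    # SHAP feature importance: mean |SHAP value| per feature on the test set
    sv = shap.TreeExplainer(model).shap_values(X_te)
    if isinstance(sv, list):      # older SHAP versions: one array per class
        sv = sv[1]
    elif sv.ndim == 3:            # newer SHAP versions: (n_samples, n_features, n_classes)
        sv = sv[:, :, 1]
    importance = np.abs(sv).mean(axis=0)
    return metrics, importance
```

Running this once per label strategy (provider, patient, concordant) and comparing the returned `importance` vectors is the kind of side-by-side analysis the abstract reports, e.g., checking whether demographic features dominate under provider-derived labels.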