Common Pitfalls and Recommendations for Use of Machine Learning in Depression Severity Estimation: DAIC-WOZ Study

Abstract

The DAIC-WOZ dataset is a widely used benchmark for the task of depression severity estimation from multimodal behavioral data. Yet the reliability, reproducibility, and methodological rigor of published machine learning models remain uncertain. In this systematic review, we examined all works published through September 2025 that mention the DAIC-WOZ dataset and report mean absolute error (MAE) as an evaluation metric. Our search identified 536 papers, of which 414 remained after deduplication. Following title and abstract screening, 132 records were selected for full-text review. After applying eligibility criteria, 66 papers were included in the quality assessment stage. Of these, only five met minimal reproducibility standards (such as clear data partitioning, model description, and training protocol documentation) and were included in this review. We found that published models suffer from poor documentation and methodology, and, inter alia, identified subject leakage as a critical methodological flaw. To illustrate its impact, we conducted experiments on the DAIC-WOZ dataset, comparing the performance of models trained with and without subject leakage. Our results indicate that leakage produces significant overestimation of the validation performance; however, our evidence is limited to the audio, text, and combined modalities of the DAIC-WOZ dataset. Without leakage, the model consistently performed worse than a simple mean predictor. Aside from poor methodological rigor, we found that the predictive accuracy of the included models is poor: reported MAEs on DAIC-WOZ are of the same magnitude as the dataset's own PHQ-8 variability, and are comparable to or larger than the variability typically observed in general population samples. We conclude with specific recommendations aimed at improving the methodology, reproducibility, and documentation of manuscripts. Code for our experiments is publicly available.
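To make the subject-leakage flaw concrete: leakage occurs when segments (e.g., audio clips or utterances) from the same interviewee end up in both the training and validation partitions, so the model can memorize subject identity rather than generalize. A minimal sketch of the difference between a leaky segment-level split and a subject-aware split, using synthetic data (not the actual DAIC-WOZ pipeline or the paper's code) and scikit-learn's `GroupShuffleSplit`:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Synthetic stand-in: 20 hypothetical subjects, 10 segments each.
rng = np.random.default_rng(0)
n_subjects, segs_per_subject = 20, 10
subjects = np.repeat(np.arange(n_subjects), segs_per_subject)
X = rng.normal(size=(len(subjects), 5))  # dummy features

# Leaky split: segments are shuffled independently of subject identity,
# so the same subject's data lands on both sides of the split.
tr_idx, va_idx = train_test_split(
    np.arange(len(subjects)), test_size=0.2, random_state=0
)
leaked = set(subjects[tr_idx]) & set(subjects[va_idx])

# Subject-aware split: GroupShuffleSplit keeps every segment of a given
# subject entirely in train OR validation, never both.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
tr_g, va_g = next(gss.split(X, groups=subjects))
shared = set(subjects[tr_g]) & set(subjects[va_g])

print(f"subjects overlapping under random segment split: {len(leaked)}")
print(f"subjects overlapping under group split:          {len(shared)}")  # 0
```

Under the group split the overlap is zero by construction, which is the partitioning property the review's reproducibility criteria call for.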
