Evaluating Open-Weight Large Language Models for Structured Depression Assessment from Clinical Interviews

Abstract

The administration of semi-structured clinical interviews for depression assessment is resource-intensive and susceptible to rater drift. While large language models (LLMs) offer potential for automated quality assurance, reliance on proprietary models creates data privacy barriers incompatible with secure clinical workflows. This study evaluates a privacy-preserving pipeline using open-weight LLMs to perform item-level scoring of the Montgomery-Åsberg Depression Rating Scale (MADRS) directly from interview transcripts. The sample included 541 video-recorded English-language interviews with 277 psychiatric inpatients diagnosed with diverse affective and psychotic disorders. Benchmarking 25 architectures, we found that an indirect scoring strategy, where models predict individual item scores that are subsequently summed, significantly outperformed direct total score prediction. This suggests that decomposing the assessment into intermediate reasoning steps improves performance. The best-performing models achieved “good” error rates and “excellent” correlations with trusted human labels, approaching established benchmarks for human inter-rater reliability. To determine the active ingredients of effective prompting, we conducted an ablation study. Results revealed that descriptive cues (scoring criteria) were the primary driver of performance, whereas few-shot examples provided negligible additional benefit, suggesting that zero-shot prompting is a viable, efficient strategy. However, error analysis identified a multimodal gap, where text-only models struggled more with items dependent on nonverbal behavior. We conclude that open-weight LLMs demonstrate strong potential to serve as secure “digital second opinions,” representing a promising step toward augmenting clinical decision-making without compromising patient privacy.
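The indirect scoring strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the item identifiers, and the toy numbers are hypothetical, and the model's item-level predictions are assumed to be already parsed into integers. The sketch relies only on the standard MADRS structure of ten items, each rated 0 to 6, for a total between 0 and 60.

```python
# Hypothetical identifiers for the ten MADRS items (names are illustrative).
MADRS_ITEMS = (
    "apparent_sadness",
    "reported_sadness",
    "inner_tension",
    "reduced_sleep",
    "reduced_appetite",
    "concentration_difficulties",
    "lassitude",
    "inability_to_feel",
    "pessimistic_thoughts",
    "suicidal_thoughts",
)


def indirect_total(item_scores: dict[str, int]) -> int:
    """Sum item-level predictions into a MADRS total (0 to 60).

    Assumes one integer score per item, each in the range 0 to 6.
    """
    missing = [name for name in MADRS_ITEMS if name not in item_scores]
    if missing:
        raise ValueError(f"missing item scores: {missing}")
    for name in MADRS_ITEMS:
        score = item_scores[name]
        if not 0 <= score <= 6:
            raise ValueError(f"{name} score {score} is outside 0..6")
    return sum(item_scores[name] for name in MADRS_ITEMS)


def mean_absolute_error(predicted: list[float], reference: list[float]) -> float:
    """Average absolute difference between predicted and reference totals."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Toy usage with made-up scores, not data from the study.
example_scores = {name: 2 for name in MADRS_ITEMS}
example_total = indirect_total(example_scores)
```

Under the direct strategy, the model would instead be asked for the 0 to 60 total in a single step. Either kind of prediction can be compared against human ratings with an error measure such as the one above.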
