Evaluating Open-Weight Large Language Models for Structured Depression Assessment from Clinical Interviews

Jeffrey M. Girard
Gaossou Youssouf Kebe
Louis-Philippe Morency
Fernando De la Torre
Einat Liebenthal
Justin T. Baker

Abstract

The administration of semi-structured clinical interviews for depression assessment is resource-intensive and susceptible to rater drift. While large language models (LLMs) offer potential for automated quality assurance, reliance on proprietary models creates data privacy barriers incompatible with secure clinical workflows. This study evaluates a privacy-preserving pipeline using open-weight LLMs to perform item-level scoring of the Montgomery-Åsberg Depression Rating Scale (MADRS) directly from interview transcripts. The sample included 541 video-recorded English-language interviews with 277 psychiatric inpatients diagnosed with diverse affective and psychotic disorders. Benchmarking 25 architectures, we found that an indirect scoring strategy, where models predict individual item scores that are subsequently summed, significantly outperformed direct total score prediction. This suggests that decomposing the assessment into intermediate reasoning steps improves performance. The best-performing models achieved “good” error rates and “excellent” correlations with trusted human labels, approaching established benchmarks for human inter-rater reliability. To determine the active ingredients of effective prompting, we conducted an ablation study. Results revealed that descriptive cues (scoring criteria) were the primary driver of performance, whereas few-shot examples provided negligible additional benefit, suggesting that zero-shot prompting is a viable, efficient strategy. However, error analysis identified a multimodal gap, where text-only models struggled more with items dependent on nonverbal behavior. We conclude that open-weight LLMs demonstrate strong potential to serve as secure “digital second opinions,” representing a promising step toward augmenting clinical decision-making without compromising patient privacy.
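The indirect scoring strategy described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline. It relies on the standard MADRS layout of ten items rated 0 to 6 (total 0 to 60). The abstract does not name the exact error and correlation metrics, so mean absolute error and Pearson correlation are assumptions here. All function names and numbers are hypothetical and made up for illustration.

```python
from statistics import mean

MADRS_ITEMS = 10            # the MADRS has ten items
ITEM_MIN, ITEM_MAX = 0, 6   # each item is rated from 0 to 6


def indirect_total(item_scores):
    """Sum ten item-level predictions into a total score (0 to 60)."""
    if len(item_scores) != MADRS_ITEMS:
        raise ValueError(f"expected {MADRS_ITEMS} item scores")
    for score in item_scores:
        if not ITEM_MIN <= score <= ITEM_MAX:
            raise ValueError(f"item score out of range: {score}")
    return sum(item_scores)


def mean_absolute_error(predicted, reference):
    """Average absolute difference between predicted and reference totals."""
    return mean(abs(p - r) for p, r in zip(predicted, reference))


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical item-level predictions for three interviews (illustrative only).
predicted_items = [
    [2, 3, 1, 4, 2, 3, 2, 1, 2, 0],
    [0, 1, 0, 1, 0, 0, 1, 0, 0, 0],
    [4, 5, 3, 4, 4, 5, 4, 3, 4, 2],
]
# Hypothetical human-rated totals for the same interviews.
human_totals = [22, 3, 35]

predicted_totals = [indirect_total(items) for items in predicted_items]
mae = mean_absolute_error(predicted_totals, human_totals)
r = pearson_r(predicted_totals, human_totals)
print(predicted_totals, round(mae, 2), round(r, 3))
```

A direct strategy would instead ask the model for a single 0 to 60 number per transcript. The indirect version also keeps the item-level predictions available, which is what makes an item-by-item error analysis possible.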

Version published to 10.31234/osf.io/63sw4_v1 on OSF Preprints
Apr 18, 2026

Related articles

Prompt Architecture as a High-Impact Design Factor in Expert-Rated Clinical Documentation Quality: A Controlled Comparative Study in Inpatient Rehabilitation

This article has 18 authors:
1. Idoia Eceizabarrena-Matxinandiarena
2. Emilio-Javier Frutos-Reoyo
3. José Ignacio Guerrero-Rojas
4. Clara Vidal-Millet
5. Pedro Ignacio Tejada Ezquerro
6. Elena Roldan-Arcelus
7. Irene De-Torres
8. Judith Sanchez-Raya
9. Lourdes Gil-Fraguas
10. María Hernandez-Manada
11. Carolina de Miguel-Benadiba
12. Josep Maria Monguet-Fierro
13. Alejandro Trejo-Omeñaca
14. Michelle Catta-Preta
15. Astrid Teixeira-Taborda
16. Natalia Álvarez-Bandrés
17. Raquel Cutillas-Ruiz
18. Helena Bascuñana-Ambrós
This article has no evaluations.
Latest version: Apr 1, 2026

Can Large Language Models Emulate Human Performance on Educational Assessments?

This article has 4 authors:
1. Xiuxiu Tang
2. Yikai Lu
3. John T. Behrens
4. Ying Cheng
This article has no evaluations.
Latest version: Apr 23, 2026

Establishing Objective Ground-Truth for Pediatric ADHD Engagement: A Methodological Framework and Benchmark Dataset

This article has 3 authors:
1. Somayeh Malekshahi
2. Reza Rostami
3. Hamid Soltanian-Zadeh
This article has no evaluations.
Latest version: Mar 26, 2026
