Large Language Models for Depression Assessment: Simulating Patients and Clinicians in MADRS Administration


Abstract

Depression assessment faces challenges of resource limitations and inter-rater variability. We evaluated four large language models' abilities to simulate depressed patients and conduct Montgomery-Åsberg Depression Rating Scale (MADRS) assessments using 139 synthetic patient profiles. Four configurations were tested: bidirectional role-playing between different LLMs (Grok 4 and Claude 4.1), dual-role assessments using single LLM instances (GPT-5 and Gemini 2.5), and variations in patient response patterns. Results demonstrated high concordance between true and LLM-estimated MADRS scores (ICCs = 0.79-0.93, p < 0.0005). The GPT-5 dual-role and Claude-as-patient/Grok-as-clinician configurations showed the strongest performance. However, systematic biases emerged: Grok-as-clinician overestimated 89% of scores, while most configurations underestimated severe presentations. Item-level analysis revealed excellent accuracy for concrete symptoms (Inner Tension r = 0.91, Reduced Sleep r = 0.88) but concerning underestimation of severe suicidality: 30% of maximum suicide-risk ratings were underestimated by two or more severity levels. These findings suggest potential training applications but caution against clinical deployment without human oversight.
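The concordance statistic reported above, the intraclass correlation coefficient (ICC), can be computed from a two-way ANOVA decomposition of a subjects-by-raters matrix. The sketch below implements ICC(2,1) (two-way random effects, absolute agreement, single rater) in plain NumPy; the MADRS totals shown are illustrative values invented for the example, not data from the study.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random-effects, absolute-agreement, single-rater.

    `ratings` is an (n_subjects, n_raters) array of scores.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means

    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()  # between-subject
    ss_cols = n * ((col_means - grand) ** 2).sum()  # between-rater (bias)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical example: true MADRS totals vs. an LLM clinician's estimates
true_scores = np.array([12, 25, 8, 33, 19, 27, 5, 40], dtype=float)
llm_scores = np.array([14, 24, 9, 30, 20, 26, 7, 35], dtype=float)
print(round(icc2_1(np.column_stack([true_scores, llm_scores])), 3))
```

Because ICC(2,1) penalizes absolute disagreement, a rater with a systematic offset (like the overestimation bias noted for Grok-as-clinician) lowers the coefficient even when the rank ordering of patients is preserved.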