Large Language Models for Depression Assessment: Simulating Patients and Clinicians in MADRS Administration


Abstract

Depression assessment faces challenges of resource limitations and inter-rater variability. We evaluated four large language models' abilities to simulate depressed patients and conduct Montgomery-Åsberg Depression Rating Scale (MADRS) assessments using 139 synthetic patient profiles. Four configurations were tested: bidirectional role-playing between different LLMs (Grok 4 and Claude 4.1), dual-role assessments using single LLM instances (GPT-5 and Gemini 2.5), and variations in patient response patterns. Results demonstrated high concordance between true and LLM-estimated MADRS scores (ICCs = 0.79-0.93, p < 0.0005). The GPT-5 dual-role and Claude-as-patient/Grok-as-clinician configurations showed the strongest performance. However, systematic biases emerged: Grok-as-clinician overestimated 89% of scores, while most configurations underestimated severe presentations. Item-level analysis revealed excellent accuracy for concrete symptoms (Inner Tension r = 0.91, Reduced Sleep r = 0.88) but concerning underestimation of severe suicidality: 30% of maximum suicide-risk ratings were underestimated by two or more severity levels. These findings suggest potential training applications but caution against clinical deployment without human oversight.
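The concordance statistic reported above, the intraclass correlation coefficient (ICC), can be computed from a two-way ANOVA decomposition of a subjects-by-raters matrix. The sketch below implements ICC(2,1) (two-way random effects, absolute agreement, single rater) in plain NumPy; the MADRS totals shown are illustrative values invented for the example, not data from the study.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random-effects, absolute-agreement, single-rater.

    `ratings` is an (n_subjects, n_raters) array of scores.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means

    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()  # between-subject
    ss_cols = n * ((col_means - grand) ** 2).sum()  # between-rater (bias)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical example: true MADRS totals vs. an LLM clinician's estimates
true_scores = np.array([12, 25, 8, 33, 19, 27, 5, 40], dtype=float)
llm_scores = np.array([14, 24, 9, 30, 20, 26, 7, 35], dtype=float)
print(round(icc2_1(np.column_stack([true_scores, llm_scores])), 3))
```

Because ICC(2,1) penalizes absolute disagreement, a rater with a systematic offset (like the overestimation bias noted for Grok-as-clinician) lowers the coefficient even when the rank ordering of patients is preserved.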