Large language model performance versus human expert ratings in automated suicide risk assessment
Abstract
The potential of Large Language Models (LLMs) for psychological diagnostics requires systematic evaluation. We investigated the conditions under which LLMs produce reliable and valid psychological assessments, focusing on suicide risk evaluation in clinical data by comparing LLM-generated ratings with human expert ratings across model configurations. We analyzed 100 youth crisis text line conversation transcripts rated by four experts using the Nurses’ Global Assessment of Suicide Risk (NGASR). Using Mixtral-8x7B-Instruct, we generated ratings across three temperature settings and three prompting styles (zero-shot, few-shot, chain-of-thought). Across configurations, we compared (a) inter-rating reliability of AI-generated NGASR risk categories and sum scores, (b) LLM-to-human observer agreement on sum scores, risk categories, and individual items, using Krippendorff’s α, and (c) classification metrics for risk categories and individual items against human ratings. LLM configuration strongly influenced assessment reliability. Zero-shot prompting at temperature 0 yielded perfect inter-rating reliability (α = 1.00, 95% CI [1–1] for high and very high risk), while few-shot prompting showed the best human-AI agreement for very high risk (α = 0.78, 95% CI [0.67–0.89]) and the strongest classification performance (balanced accuracy 0.54–0.71). Lower temperatures consistently improved reliability and accuracy. However, critical clinical items showed poor validity. Our findings establish optimal conditions (zero temperature, task-specific prompting) for LLM-based psychological assessment. Nonetheless, inconsistent performance on clinical items and only moderate LLM-to-human observer agreement limit LLMs to initial screening rather than detailed assessment, and require careful parameter control and validation.
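The abstract does not include analysis code; as a rough illustration of the agreement analysis it describes, the sketch below (assuming the open-source Python `krippendorff` package and purely illustrative placeholder ratings, not the authors' data) computes Krippendorff’s α both across repeated LLM runs (inter-rating reliability) and between LLM and human expert ratings for ordinal NGASR risk categories.

```python
# Minimal sketch (not the authors' code): Krippendorff's alpha for ordinal
# risk-category ratings, using the open-source `krippendorff` package.
# All rating values below are illustrative placeholders.
import numpy as np
import krippendorff

# Rows = raters (repeated LLM runs or human experts), columns = transcripts.
# Risk categories coded ordinally, e.g. 0 = low, 1 = moderate, 2 = high, 3 = very high.
llm_runs = np.array([
    [2, 3, 1, 0, 2],   # run 1
    [2, 3, 1, 0, 2],   # run 2 (temperature 0 tends to reproduce identical ratings)
    [2, 3, 1, 0, 2],   # run 3
])

human_and_llm = np.array([
    [2, 3, 1, 1, 2],   # human expert rating (placeholder)
    [2, 3, 1, 0, 2],   # LLM rating (placeholder)
])

# Inter-rating reliability across repeated LLM runs
alpha_llm = krippendorff.alpha(reliability_data=llm_runs,
                               level_of_measurement="ordinal")

# LLM-to-human agreement on risk category
alpha_llm_human = krippendorff.alpha(reliability_data=human_and_llm,
                                     level_of_measurement="ordinal")

print(f"LLM inter-rating reliability: {alpha_llm:.2f}")
print(f"LLM-to-human agreement:       {alpha_llm_human:.2f}")
```

In practice the reliability matrices would hold one column per transcript (100 in the study) and rows for each LLM run or expert, with `np.nan` marking missing ratings.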