Automated Suicide Risk Factor Monitoring in Crisis Text Line Users: Comparative Study of AI and Human Ratings Using Large Language Models


Abstract

Background: The potential of large language models (LLMs) for psychological diagnostics requires systematic evaluation. Objective: To investigate the conditions under which LLMs produce reliable and valid psychological assessments, focusing on suicide risk evaluation in clinical data, by comparing LLM-generated ratings with human expert ratings across configurations. Methods: We analyzed 100 youth crisis conversation transcripts rated by four experts using the Nurses' Global Assessment of Suicide Risk (NGASR). Using Mixtral-8x7B-Instruct, we generated ratings across three temperature settings and three prompting styles (zero-shot, few-shot, chain-of-thought). Across configurations we compared (a) inter-rating reliability of AI-generated NGASR risk categories and sum scores, (b) LLM-to-human agreement on sum score, risk category, and individual items, using Krippendorff's α, and (c) classification metrics for risk categories and individual items against human ratings. Results: LLM configuration strongly influenced assessment reliability. Zero-shot prompting at temperature 0 yielded perfect inter-rating reliability (α=1.00, 95% CI: [1.00-1.00] for high and very high risk), while few-shot prompting showed the best human-AI agreement for very high risk (α=0.78, 95% CI: [0.67-0.89]) and the strongest classification performance (balanced accuracy 0.54-0.71). Lower temperatures consistently improved reliability and accuracy. However, critical clinical items showed poor validity. Discussion: Our findings establish optimal conditions (zero temperature, task-specific prompting) for LLM-based psychological assessment. However, inconsistent performance on clinical items and only moderate LLM-to-human agreement limit LLMs to initial screening rather than detailed assessment, and their use requires careful parameter control and validation.
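For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how Krippendorff's α and balanced accuracy can be computed. It is not the authors' analysis code. It assumes nominal-level α with no missing ratings (the abstract does not state which measurement level was used), and the function names and toy labels are hypothetical illustrations, not study data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha (assumed level; the study may differ).

    `units` is a list of lists; each inner list holds all ratings given
    to one transcript. Units with fewer than two ratings are skipped.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # Every ordered pair of distinct rating positions in a unit
        # contributes 1 / (m - 1) to the coincidence matrix.
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        # Only one category was ever used; alpha is undefined.
        return float("nan")
    return 1.0 - observed / expected


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels for illustration only (not taken from the study).
repeated_llm_runs = [["high", "high"], ["low", "low"]]
human = ["low", "low", "high", "high"]
llm = ["low", "high", "high", "high"]

alpha = krippendorff_alpha_nominal(repeated_llm_runs)
bal_acc = balanced_accuracy(human, llm)
```

In this toy case the repeated runs agree on every unit, so α is 1; a temperature of 0 tends to make repeated runs identical, which is one way such a value can arise. Balanced accuracy averages recall per class, so a rare category such as very high risk counts as much as the common ones.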
