Automated Suicide Risk Factor Monitoring in Crisis Text Line Users: Comparative Study of AI and Human Ratings Using Large Language Models

Julia Thomas
Zohar Elyoseph
Lars Kuchinke
Gunther Meinlschmidt

Abstract

Background: The potential of large language models (LLMs) for psychological diagnostics requires systematic evaluation.

Objective: To investigate the conditions for reliable and valid psychological assessments, focusing on suicide risk evaluation in clinical data, by comparing LLM-generated ratings with human expert ratings across configurations.

Methods: We analyzed 100 youth crisis conversation transcripts rated by four experts using the Nurses' Global Assessment of Suicide Risk (NGASR). Using Mixtral-8x7B-Instruct, we generated ratings across three temperature settings and three prompting styles (zero-shot, few-shot, chain-of-thought). Across configurations, we compared (a) inter-rating reliability for AI-generated NGASR risk and sum scores; (b) LLM-to-human agreement on sum score, risk category, and individual items, using Krippendorff's α; and (c) classification metrics for risk categories and individual items against human ratings.

Results: LLM configuration strongly influenced assessment reliability. Zero-shot prompting at temperature 0 yielded perfect inter-rating reliability (α=1.00, 95% CI: [1.00-1.00] for high and very high risk), while few-shot prompting showed the best human-AI agreement for very high risk (α=0.78, 95% CI: [0.67-0.89]) and the strongest classification performance (balanced accuracy 0.54-0.71). Lower temperatures consistently improved reliability and accuracy. However, critical clinical items showed poor validity.

Discussion: Our findings establish optimal conditions (zero temperature, task-specific prompting) for LLM-based psychological assessment. However, inconsistent performance on clinical items and only moderate LLM-to-human agreement limit LLMs to initial screening rather than detailed assessment, and their use requires careful parameter control and validation.
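For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how Krippendorff's α and balanced accuracy can be computed. It is not the authors' analysis code. It assumes nominal (unordered) categories for α, whereas the study may have used an ordinal or interval variant for sum scores, and it does not compute the bootstrap confidence intervals reported above. The function names and the toy ratings are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data (assumed variant, see lead-in).

    units: list of lists; each inner list holds the ratings that one
    transcript received from the different raters (None = missing).
    """
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two ratings cannot be paired
        # Every ordered pair of ratings from different raters counts 1/(m-1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Disagreement counts only pairs whose two categories differ.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # undefined when only one category occurs
    return 1.0 - observed / expected


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, with y_true taken as the reference rating."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Hypothetical toy data: two raters, four transcripts, binary item.
    toy_units = [[0, 0], [0, 1], [1, 1], [1, 0]]
    print(krippendorff_alpha_nominal(toy_units))

    # Hypothetical human vs. LLM risk categories.
    human = ["low", "low", "low", "high"]
    llm = ["low", "low", "high", "high"]
    print(balanced_accuracy(human, llm))
```

Balanced accuracy averages recall over classes, which keeps a rare category such as very high risk from being swamped by the majority class; that is presumably why it is reported here instead of plain accuracy.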

Version published to 10.21203/rs.3.rs-6210376/v1 on Research Square
May 13, 2025

Related articles

Using a Fine-tuned Large Language Model for Symptom-based Depression Evaluation

This article has 11 authors:
1. Samantha Weber
2. Nicolas Deperrois
3. Robert Heun
4. Laura Frühschütz
5. Anna Monn
6. Stephanie Homan
7. Andrea Häfliger
8. Erich Seifritz
9. Tobias Kowatsch
10. Birgit Kleim
11. Sebastian Olbrich
This article has no evaluations. Latest version: May 13, 2025

Can Large Language Models address problem gambling? Expert insights from gambling treatment professionals

This article has 6 authors:
1. Kasra Ghaharian
2. Marta Soligo
3. Richard Young
4. Lukasz Golab
5. Shane W Kraus
6. Samantha Wells
This article has no evaluations. Latest version: May 21, 2025

Automated speech content analysis to detect depression with large language models: towards multilingual and few-shot capabilities

This article has 7 authors:
1. Rachid Riad
2. Alexandre Ducorroy
3. Sélim Benjamin GUESSOUM
4. Filomène ROQUEFORT
5. Adrien Lesage
6. Xuan-Nga Cao
7. Alexis Bourla
This article has no evaluations. Latest version: May 13, 2025
