AI-Powered Triage of Suicidal Ideation in Adolescents: A Comparative Evaluation of Large Language Models Using Synthetic Clinical Vignettes

Abstract

Objective

To evaluate the performance of leading Large Language Models (LLMs) in classifying suicide risk and generating clinically appropriate action plans for adolescent psychiatric cases presented through synthetic clinical vignettes.

Methods

We developed 40 synthetic clinical vignettes depicting adolescents with varying levels of suicide risk, structured according to established clinical formulation principles. For each vignette, a panel of two board-certified child and adolescent psychiatrists established a gold-standard risk level, based on the Columbia-Suicide Severity Rating Scale (C-SSRS) framework, together with the corresponding clinical actions. Three LLMs (GPT-4o, Claude 3.5 Sonnet, Llama-3.1-70B) were prompted with a structured chain-of-thought methodology to classify risk and propose a detailed action plan. Performance was assessed using quantitative classification metrics (accuracy, precision, recall, F1-score) and qualitative thematic analysis of the generated action plans.
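
To make the setup concrete, the following is a minimal sketch of such a structured chain-of-thought triage prompt, assuming the OpenAI Python SDK. The label set, prompt wording, and the `triage_vignette` helper are illustrative assumptions for demonstration, not the study's actual harness.

```python
# Minimal sketch of a structured chain-of-thought triage prompt.
# The risk label set, prompt wording, and model identifier below are
# illustrative assumptions; the study's actual harness is not shown here.
from openai import OpenAI

RISK_LEVELS = ["low", "moderate", "high"]  # assumed C-SSRS-derived categories

SYSTEM_PROMPT = (
    "You are assisting with adolescent suicide-risk triage. "
    "Reason step by step through ideation, plan, intent, behavior, "
    "and protective factors before answering. "
    f"Conclude with 'RISK: <{'/'.join(RISK_LEVELS)}>' and a numbered action plan."
)

def triage_vignette(client: OpenAI, vignette: str, model: str = "gpt-4o") -> str:
    """Send one synthetic vignette through the chain-of-thought prompt."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output aids reproducible scoring
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": vignette},
        ],
    )
    return response.choices[0].message.content
```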

Results

Quantitative analysis of risk classification revealed variable performance. GPT-4o achieved the highest accuracy (82.5%), followed by Claude 3.5 Sonnet (75.0%) and Llama-3.1-70B (67.5%). F1-scores revealed particular difficulty in correctly identifying higher-risk categories, especially nuanced presentations of intent. Qualitative thematic analysis of the action plans identified consistent adherence to basic safety protocols (e.g., recommending emergency evaluation for high-risk cases). However, critical failures were pervasive, including the omission of crucial inquiries about access to lethal means, failure to incorporate protective factors into planning, and the generation of clinically inappropriate therapeutic reassurance in a triage context.
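
For reference, the reported scores correspond to standard multi-class classification metrics. A minimal sketch with scikit-learn, using placeholder labels rather than the study's data, is shown below.

```python
# Sketch of the classification scoring; gold and predicted labels are
# illustrative placeholders, not the study's data.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

gold = ["low", "high", "moderate", "high"]       # psychiatrist panel labels
pred = ["low", "moderate", "moderate", "high"]   # one model's classifications

accuracy = accuracy_score(gold, pred)
# Macro-averaging weights each risk category equally, which exposes
# weaker performance on the higher-risk classes.
precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, labels=["low", "moderate", "high"],
    average="macro", zero_division=0,
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```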

Conclusions

While LLMs demonstrate a nascent ability to process clinical information for suicide risk assessment, significant deficits in clinical reasoning and safety planning persist. Their performance on idealized synthetic data suggests these models are not yet suitable for autonomous clinical decision-making. These findings underscore the need for rigorous, clinically grounded evaluation frameworks and the development of human-in-the-loop systems to ensure patient safety in any future deployment.

Key Messages

What is already known on this topic

Suicide is a leading cause of death in adolescents, yet current clinical risk assessment tools and subjective judgments have limited predictive accuracy and are difficult to scale in the face of rising demand and workforce shortages.

What this study adds

This study provides a direct comparative evaluation of multiple state-of-the-art Large Language Models on a standardized adolescent suicide risk triage task, using a synthetic data methodology that allows for controlled assessment of both classification accuracy and the clinical appropriateness of generated action plans.

How this study might affect research, practice or policy

The findings highlight the potential of LLMs as adjunctive tools in non-specialist settings but also reveal critical safety and reliability gaps that must be addressed through further research, the development of ethical guidelines, and regulatory oversight before any clinical implementation can be considered.
