AI-Powered Triage of Suicidal Ideation in Adolescents: A Comparative Evaluation of Large Language Models Using Synthetic Clinical Vignettes

Abstract

Objective

To evaluate the performance of leading Large Language Models (LLMs) in classifying suicide risk and generating clinically appropriate action plans for adolescent psychiatric cases presented through synthetic clinical vignettes.

Methods

We developed 40 synthetic clinical vignettes depicting adolescents with varying levels of suicide risk, structured according to established clinical formulation principles. For each vignette, a panel of two board-certified child and adolescent psychiatrists established a gold-standard risk level, based on the Columbia-Suicide Severity Rating Scale (C-SSRS) framework, together with the corresponding clinical actions. Three LLMs (GPT-4o, Claude 3.5 Sonnet, Llama-3.1-70B) were prompted with a structured chain-of-thought methodology to classify risk and propose a detailed action plan. Performance was assessed using quantitative classification metrics (accuracy, precision, recall, F1-score) and qualitative thematic analysis of the generated action plans.
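
To make the setup concrete, the following is a minimal sketch of such a structured chain-of-thought triage prompt, assuming the OpenAI Python SDK. The label set, prompt wording, and the `triage_vignette` helper are illustrative assumptions for demonstration, not the study's actual harness.

```python
# Minimal sketch of a structured chain-of-thought triage prompt.
# The risk label set, prompt wording, and model identifier below are
# illustrative assumptions; the study's actual harness is not shown here.
from openai import OpenAI

RISK_LEVELS = ["low", "moderate", "high"]  # assumed C-SSRS-derived categories

SYSTEM_PROMPT = (
    "You are assisting with adolescent suicide-risk triage. "
    "Reason step by step through ideation, plan, intent, behavior, "
    "and protective factors before answering. "
    f"Conclude with 'RISK: <{'/'.join(RISK_LEVELS)}>' and a numbered action plan."
)

def triage_vignette(client: OpenAI, vignette: str, model: str = "gpt-4o") -> str:
    """Send one synthetic vignette through the chain-of-thought prompt."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output aids reproducible scoring
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": vignette},
        ],
    )
    return response.choices[0].message.content
```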

Results

Quantitative analysis of risk classification revealed variable performance. GPT-4o achieved the highest accuracy (82.5%), followed by Claude 3.5 Sonnet (75.0%) and Llama-3.1-70B (67.5%). F1-scores revealed particular difficulty in correctly identifying higher-risk categories, especially nuanced presentations of intent. Qualitative thematic analysis of the action plans identified consistent adherence to basic safety protocols (e.g., recommending emergency evaluation for high-risk cases). However, critical failures were pervasive, including the omission of crucial inquiries about access to lethal means, failure to incorporate protective factors into planning, and the generation of clinically inappropriate therapeutic reassurance in a triage context.
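
For reference, the reported scores correspond to standard multi-class classification metrics. A minimal sketch with scikit-learn, using placeholder labels rather than the study's data, is shown below.

```python
# Sketch of the classification scoring; gold and predicted labels are
# illustrative placeholders, not the study's data.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

gold = ["low", "high", "moderate", "high"]       # psychiatrist panel labels
pred = ["low", "moderate", "moderate", "high"]   # one model's classifications

accuracy = accuracy_score(gold, pred)
# Macro-averaging weights each risk category equally, which exposes
# weaker performance on the higher-risk classes.
precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, labels=["low", "moderate", "high"],
    average="macro", zero_division=0,
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```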

Conclusions

While LLMs demonstrate a nascent ability to process clinical information for suicide risk assessment, significant deficits in clinical reasoning and safety planning persist. Their performance on idealized synthetic data suggests these models are not yet suitable for autonomous clinical decision-making. These findings underscore the need for rigorous, clinically grounded evaluation frameworks and the development of human-in-the-loop systems to ensure patient safety in any future deployment.

Key Messages

What is already known on this topic

Suicide is a leading cause of death in adolescents, yet current clinical risk assessment tools and subjective judgments have limited predictive accuracy and are difficult to scale in the face of rising demand and workforce shortages.

What this study adds

This study provides a direct comparative evaluation of multiple state-of-the-art Large Language Models on a standardized adolescent suicide risk triage task, using a synthetic data methodology that allows for controlled assessment of both classification accuracy and the clinical appropriateness of generated action plans.

How this study might affect research, practice or policy

The findings highlight the potential of LLMs as adjunctive tools in non-specialist settings but also reveal critical safety and reliability gaps that must be addressed through further research, the development of ethical guidelines, and regulatory oversight before any clinical implementation can be considered.
