Potential of ChatGPT in youth mental health emergency triage: Comparative analysis with clinicians

Abstract

Aim

Large language models, such as GPT‐4, are increasingly integrated into healthcare to support clinicians in making informed decisions. Given ChatGPT's potential, its application as a support tool warrants exploration, particularly within mental health telephone triage services. This study evaluates whether GPT‐4 models can accurately triage psychiatric emergency vignettes and compares their performance with that of clinicians.

Methods

A cross‐sectional study was performed to assess the performance of three GPT‐4 models (GPT‐4o, GPT‐4o Mini, and GPT‐4 Legacy) in psychiatric emergency triage. Twenty‐two psychiatric emergency vignettes, intended to represent realistic prehospital triage scenarios, were initially drafted using ChatGPT and subsequently reviewed and refined by the research team to ensure clinical accuracy and relevance. Each GPT‐4 model independently generated clinical responses to the vignettes over three iterations to assess consistency. Two advanced practice nurse practitioners then independently rated these responses on a 3‐point Likert‐type scale for the main triage criteria: risk level (Low = 1 to High = 3), necessity of hospital admission (Yes = 1; No = 2), and urgency of clinical evaluation (Low = 1 to High = 3). The nurse practitioners also provided their own clinical judgments independently for each vignette. Interrater reliability between the GPT models' responses and the nurse practitioners' independent clinical assessments was evaluated using Cohen's Kappa. A clinical expert committee (n = 3) conducted a qualitative analysis of the GPT models' responses, using a systematic coding method to evaluate triage accuracy, clarity, completeness, and total score across the same three triage criteria.
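As a point of reference only (not the authors' analysis code), the agreement statistic described above can be computed as in the minimal Python sketch below; the rating vectors are hypothetical placeholders standing in for per‐vignette ratings from a GPT‐4 model and a nurse practitioner on the 3‐point risk scale.

```python
# Minimal sketch, assuming hypothetical ratings: Cohen's Kappa for agreement
# between a GPT-4 model and a clinician on the risk criterion (Low = 1 to High = 3).
# The vectors below are illustrative placeholders, not the study data.
from sklearn.metrics import cohen_kappa_score

gpt_risk       = [3, 2, 2, 1, 3, 2, 1, 3, 2, 2]   # one rating per vignette
clinician_risk = [3, 2, 1, 1, 3, 2, 1, 3, 2, 3]

kappa = cohen_kappa_score(gpt_risk, clinician_risk)
print(f"Cohen's Kappa (risk): {kappa:.2f}")  # ~0.61-0.80 is conventionally "substantial"
```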

Results

The GPT models had a mean admission score of 1.73 (standard deviation [SD] = 0.45; scale: Yes = 1, No = 2), indicating a general trend toward recommending against hospital admission. Risk (mean = 2.12, SD = 0.83) and urgency (mean = 2.27, SD = 0.44) assessments suggested moderate‐to‐high perceived risk and urgency (scale: Low = 1, High = 3), reflecting conservative decision‐making. Interrater reliability between clinicians and the GPT‐4 models was substantial across the three triage criteria, with Cohen's Kappa values of 0.77 (admission), 0.78 (risk), and 0.76 (urgency). Mean triage scores from the GPT‐4 models and the clinicians showed consistent patterns with minimal variability. Overall, the GPT models tended toward slight over‐triage, with four false‐positive admission recommendations and zero false negatives.
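To illustrate how the over‐triage counts reported above could be derived (a hedged sketch, not the authors' code), the snippet below tallies false positives and false negatives for the admission criterion, treating the clinician rating as the reference standard; the vectors are hypothetical.

```python
# Illustrative only: counting over-triage (false positives) and under-triage
# (false negatives) for admission (Yes = 1, No = 2), with the clinician as reference.
# Data are hypothetical placeholders, not the study's vignette ratings.
gpt_admit       = [1, 2, 1, 2, 1, 2, 1, 2]
clinician_admit = [2, 2, 1, 2, 2, 2, 1, 2]

false_positives = sum(g == 1 and c == 2 for g, c in zip(gpt_admit, clinician_admit))
false_negatives = sum(g == 2 and c == 1 for g, c in zip(gpt_admit, clinician_admit))
print(f"False positives (over-triage): {false_positives}")
print(f"False negatives (under-triage): {false_negatives}")
```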

Conclusion

This study indicates that GPT models may serve as decision‐support tools in mental health telephone triage, particularly for psychiatric emergencies. Although response variability across iterations was minimal, most discrepancies in admission decisions were false positives, indicating that GPT models may tend to over‐triage relative to clinician judgment. Further investigation is needed to establish robust structures that increase alignment with clinical decisions and response relevance in clinical practice.