Evaluation of large language model chatbot responses to psychotic prompts
Abstract
Importance
The large language model (LLM) chatbot product ChatGPT has accumulated 800 million weekly users since its 2022 launch. In 2025, several media outlets reported on individuals in whom apparent psychotic symptoms emerged or worsened in the context of using ChatGPT. As LLM chatbots are trained to align with user input, they may have difficulty responding to psychotic content.
Objective
To assess whether ChatGPT can reliably generate appropriate responses to prompts containing psychotic symptoms.
Design
A cross-sectional study of ChatGPT responses to psychotic and control prompts, with blind clinician ratings of response appropriateness.
Setting
ChatGPT web application accessed on 8/28-8/29/2025, testing three product versions: GPT-5 Auto (current paid default), GPT-4o (previous paid default), and “Free” (version accessible without subscription or account).
Main Outcomes and Measures
We presented 158 unique prompts (79 control and 79 psychotic, generated based on the Structured Interview for Psychosis-Risk Syndromes) to three product versions, yielding 474 prompt-response pairs. Blinded clinicians assigned each pair an appropriateness rating (0 = completely appropriate, 1 = somewhat appropriate, 2 = completely inappropriate) via a standardized rubric. We hypothesized a priori that psychotic prompts would be more likely than control prompts to elicit inappropriate responses, both across and within product versions.
Results
In the primary (across-version) analysis, psychotic prompts were 25.84 times more likely than control prompts to elicit inappropriate responses with “Free” ChatGPT (95% CI 12.45 to 53.66, p < 0.001). GPT-5 Auto reduced this risk somewhat (OR for interaction term 0.33, 95% CI 0.16 to 0.68, p = 0.005) yet still generated inappropriate responses at a greatly elevated rate (implied OR 8.53, 95% CI 3.05 to 23.84). In the secondary (within-version) analysis, ORs were 9.08 for GPT-5 Auto (95% CI 4.24 to 21.02), 14.15 for GPT-4o (95% CI 6.12 to 37.23), and 43.37 for “Free” (95% CI 18.44 to 112.80). In an exploratory analysis, prompts reflecting grandiosity or disorganized communication were more likely to elicit inappropriate responses than those reflecting delusions.
Conclusions and Relevance
No tested version of ChatGPT reliably generated appropriate responses to psychotic content.
Key Points
Question
Can the popular large language model product ChatGPT reliably generate appropriate responses to prompts containing psychotic content?
Findings
Psychotic prompts were 26 times more likely than control prompts to elicit inappropriate responses from the current free version of ChatGPT, and 9 times more likely to elicit them from the current paid version.
Meaning
No tested version of ChatGPT can reliably generate appropriate responses to psychotic content.