Evaluation of large language model chatbot responses to psychotic prompts
Abstract
Importance
The large language model (LLM) chatbot product ChatGPT has accumulated 800 million weekly users since its 2022 launch. In 2025, several media outlets reported on individuals in whom apparent psychotic symptoms emerged or worsened in the context of using ChatGPT. As LLM chatbots are trained to align with user input, they may have difficulty responding to psychotic content.
Objective
To assess whether ChatGPT can reliably generate appropriate responses to prompts containing psychotic symptoms.
Design
A cross-sectional study of ChatGPT responses to psychotic and control prompts, with blind clinician ratings of response appropriateness.
Setting
ChatGPT web application accessed on 8/28-8/29/2025, testing three product versions: GPT-5 Auto (current paid default), GPT-4o (previous paid default), and “Free” (version accessible without subscription or account).
Main Outcomes and Measures
We presented 158 unique prompts (79 control and 79 psychotic, generated based on the Structured Interview for Psychosis-Risk Syndromes) to three product versions, yielding 474 prompt-response pairs. Blinded clinicians assigned each pair an appropriateness rating (0 = completely appropriate, 1 = somewhat appropriate, 2 = completely inappropriate) via a standardized rubric. We hypothesized a priori that psychotic prompts would be more likely than control prompts to elicit inappropriate responses, both across and within product versions.
Results
In the primary (across-version) analysis, psychotic prompts were 25.84 times more likely than control prompts to elicit inappropriate responses with “Free” ChatGPT (95% CI 12.45 to 53.66, p < 0.001). GPT-5 Auto reduced this risk somewhat (OR for interaction term 0.33, 95% CI 0.16 to 0.68, p = 0.005) yet still generated inappropriate responses at a greatly elevated rate (implied OR 8.53, 95% CI 3.05 to 23.84). In the secondary (within-version) analysis, ORs were 9.08 for GPT-5 Auto (95% CI 4.24 to 21.02), 14.15 for GPT-4o (95% CI 6.12 to 37.23), and 43.37 for “Free” (95% CI 18.44 to 112.80). In an exploratory analysis, prompts reflecting grandiosity or disorganized communication were more likely to elicit inappropriate responses than those reflecting delusions.
Conclusions and Relevance
No tested version of ChatGPT reliably generated appropriate responses to psychotic content.
Key Points
Question
Can the popular large language model product ChatGPT reliably generate appropriate responses to prompts containing psychotic content?
Findings
Psychotic prompts were 26 times more likely than control prompts to elicit inappropriate responses from the current free version of ChatGPT, and 9 times more likely to elicit them from the current paid version.
Meaning
No tested version of ChatGPT can reliably generate appropriate responses to psychotic content.