Evaluation of large language model chatbot responses to psychotic prompts


Abstract

Importance

The large language model (LLM) chatbot product ChatGPT has accumulated 800 million weekly users since its 2022 launch. In 2025, several media outlets reported on individuals in whom apparent psychotic symptoms emerged or worsened in the context of using ChatGPT. As LLM chatbots are trained to align with user input, they may have difficulty responding to psychotic content.

Objective

To assess whether ChatGPT can reliably generate appropriate responses to prompts containing psychotic symptoms.

Design

A cross-sectional study of ChatGPT responses to psychotic and control prompts, with blind clinician ratings of response appropriateness.

Setting

ChatGPT web application accessed on 8/28-8/29/2025, testing three product versions: GPT-5 Auto (current paid default), GPT-4o (previous paid default), and “Free” (version accessible without subscription or account).

Main Outcomes and Measures

We presented 158 unique prompts (79 control and 79 psychotic, generated based on the Structured Interview for Psychosis-Risk Syndromes) to three product versions, yielding 474 prompt-response pairs. Blinded clinicians assigned each an appropriateness rating (0 = completely appropriate, 1 = somewhat appropriate, 2 = completely inappropriate) via a standardized rubric. We hypothesized a priori that psychotic prompts would be more likely than control prompts to elicit inappropriate responses both across and within product versions.

Results

In the primary (across-version) analysis, psychotic prompts were 25.84 times more likely to elicit inappropriate responses with “Free” ChatGPT (95% CI 12.45 to 53.66, p < 0.001). GPT-5 Auto reduced risk somewhat (OR for interaction term 0.33, 95% CI 0.16 to 0.68, p = 0.005) yet still generated inappropriate responses at a greatly elevated rate (implied OR 8.53, 95% CI 3.05 to 23.84). In the secondary (within-version) analysis, ORs were 9.08 for GPT-5 Auto (95% CI 4.24 to 21.02), 14.15 for GPT-4o (95% CI 6.12 to 37.23) and 43.37 for “Free” (95% CI 18.44 to 112.80). In an exploratory analysis, prompts reflecting grandiosity or disorganized communication were more likely to elicit inappropriate responses than those reflecting delusions.

Conclusions and Relevance

No tested version of ChatGPT reliably generated appropriate responses to psychotic content.

Key Points

Question

Can the popular large language model product ChatGPT reliably generate appropriate responses to prompts containing psychotic content?

Findings

Psychotic prompts were 26 times more likely than control prompts to elicit inappropriate responses from the current free version of ChatGPT, and 9 times more likely to elicit them from the current paid version.

Meaning

No tested version of ChatGPT can reliably generate appropriate responses to psychotic content.
