Beyond Simulations: What 20,000 Real Conversations Reveal About Mental Health AI Safety

Abstract

Large language models (LLMs) are increasingly used for mental health support, yet safety evaluations rely primarily on small, simulation-based benchmarks removed from real-world language. We replicate four published safety evaluations assessing suicide-risk handling, harmful content generation, and jailbreak resistance for general-purpose frontier models and a purpose-built mental health AI. We then conduct an ecological audit of 20,000 real user conversations with the purpose-built system, which includes layered safeguards for suicide and non-suicidal self-injury (NSSI). The purpose-built AI was significantly less likely than general-purpose LLMs to produce harmful content across suicide/NSSI (0.4-11.27% vs. 29.0-54.4%), eating disorder (8.4% vs. 54.0%), and substance use (9.9% vs. 45.0%) benchmarks. In the real user data, clinician review found zero suicide-risk cases that lacked crisis resources. Three NSSI mentions (0.015% of conversations) lacked an intervention, implying a lower-bound false negative rate of 0.38%. These findings support the utility of ecological audits for safety estimation.