Consumer chatbots gave similarly empathic answers whether safe or unsafe: a physician-rated evaluation in six languages

Dean Ariel
Lyel Romina Grumberg
Sopak Supakul
Sirawit Wannasri
Ilan Y. Mitchnik
Anna Lev
Weerawat Ariyamethanon
Muhammad Agbarieh
Shafiq Miari
Guy Laban
Boaz Hasid

Abstract

Background

Patients cannot check the clinical content of chatbot health advice, so they judge it by what they can perceive. We examined whether physician-rated empathy tracked clinical quality, and whether that relationship held when the question was asked in another language.

Methods

Four consumer chatbots answered forum-derived, clinician-adapted patient scenarios in six languages (English, Hebrew, French, Russian, Arabic, Thai), yielding 504 responses. Two language-matched physicians per language, blinded to chatbot identity, rated accuracy, safety, referral, cultural appropriateness, and empathy, and completed an item-level checklist, giving 1,008 ratings across 21 scenarios. Associations were estimated within physicians. The same two physicians rated English and Hebrew, so that contrast was also within-physician.
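The reported counts are consistent with a fully crossed design, and "within physicians" can be read as removing each rater's personal scoring level before estimating a slope. The sketch below checks the arithmetic and shows one common way to compute a within-rater slope (centering both variables per rater). It is a minimal illustration under those assumptions, not the authors' model; the function name `within_rater_slope` and the toy ratings are hypothetical.

```python
from statistics import mean

# Design counts as stated in the Methods (full crossing is an assumption).
CHATBOTS = 4
SCENARIOS = 21
LANGUAGES = ["English", "Hebrew", "French", "Russian", "Arabic", "Thai"]
RATERS_PER_LANGUAGE = 2

responses = CHATBOTS * SCENARIOS * len(LANGUAGES)  # 504
ratings = responses * RATERS_PER_LANGUAGE          # 1008


def within_rater_slope(rows):
    """Pooled slope of y on x after centering x and y within each rater.

    rows: iterable of (rater_id, x, y) tuples.
    """
    by_rater = {}
    for rater, x, y in rows:
        by_rater.setdefault(rater, []).append((x, y))

    sxy = 0.0
    sxx = 0.0
    for pairs in by_rater.values():
        mx = mean(x for x, _ in pairs)
        my = mean(y for _, y in pairs)
        for x, y in pairs:
            sxy += (x - mx) * (y - my)
            sxx += (x - mx) ** 2
    return sxy / sxx


# Hypothetical toy data: rater "A" scores everything higher than rater "B",
# but within each rater, empathy (x) does not move with safety (y).
toy_rows = [("A", 4, 5), ("A", 5, 5), ("B", 2, 3), ("B", 3, 3)]

slope_within = within_rater_slope(toy_rows)
# Treating all rows as one rater reproduces the naive pooled slope.
slope_pooled = within_rater_slope([("all", x, y) for _, x, y in toy_rows])
```

In the toy data the pooled slope is positive only because one rater is more lenient on both scales; the within-rater slope removes that artifact.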

Results

Empathy did not separate safe from unsafe responses (AUC 0.49, 95% CI 0.39 to 0.62), and within-physician slopes on safety and substance were near zero (−0.006 and −0.004). Each dimension correlated with its own checklist (r = 0.81 and 0.70) and not with the other (0.01 and −0.01). The substance-minus-empathy gap narrowed from 0.92 to 0.44 and from 1.09 to 0.52 in the two English–Hebrew physicians, driven by lower substance. Unsafe ratings concentrated on the same three scenarios across products (p < 0.001), and ten responses were accurate yet unsafe.
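An AUC near 0.5 means that empathy scores rank a randomly chosen safe response above a randomly chosen unsafe one about half the time, which is chance. The sketch below computes AUC directly from that pairwise definition (ties count one half). The function name `auc` and the scores are hypothetical, and the abstract does not say how the reported AUC or its confidence interval were computed.

```python
def auc(scores_pos, scores_neg):
    """Probability that a positive case outscores a negative case.

    Equivalent to the Mann-Whitney U statistic divided by the number
    of positive-negative pairs; tied pairs contribute 0.5.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical empathy ratings with the same distribution in both groups:
# empathy carries no information about safety.
empathy_safe = [3, 4, 5, 4]
empathy_unsafe = [4, 3, 5, 4]
auc_uninformative = auc(empathy_safe, empathy_unsafe)

# Hypothetical contrast case: a score that separates the groups perfectly.
auc_perfect = auc([5, 5], [1, 2])
```

Under this definition the reported 0.49 sits almost exactly at the uninformative value, and its interval (0.39 to 0.62) includes 0.5.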

Conclusions

Empathy, the one cue a patient can judge, carried no information about whether the advice was safe, and clinical substance fell in Hebrew within both physicians who rated it. Evaluation should score clinical content independently of empathy, in each deployment language, and anchor on high-risk scenarios rather than any single product.

Version published to 10.64898/2026.05.09.26352813 on medRxiv
May 14, 2026

Related articles

Comparing Human and Large Language Model Responses to Patients’ Online Questions: Towards Multi-dimensional Patient-centered Support

This article has 4 authors:
1. Md Alomgeer Hussein
2. Rajmi Doshi
3. Lu He
4. Tera L. Reynolds
This article has no evaluations. Latest version: Jul 17, 2026

Silent Manipulation of Mental Health Treatment Recommendations from a Large Language Model

This article has 1 author:
1. Roy H. Perlis
This article has no evaluations. Latest version: Jun 17, 2026

Conversational trajectory degrades large language model detection of suicidal ideation relative to clinicians: a preregistered study

This article has 15 authors:
1. Mark Kalinich
2. James Luccarelli
3. John P. Santa Maria
4. Matthew Flathers
5. Phuong Anh Nguyen
6. Seo Ho Song
7. Kevin Makhoul
8. Maria Jose Rivera Criado
9. Callie Ginapp
10. Bryce Hill
11. J. Nicholas Shumate
12. Haruka Notsu
13. Christopher L. Smith
14. Frank Moss
15. John Torous
This article has no evaluations. Latest version: Jul 14, 2026
