Asymmetry between warmth and clinical substance in multilingual consumer health AI



Abstract

Background

A patient has little way to verify a chatbot’s health advice and is left to judge it by how it sounds: warm, fluent, and attentive. Whether that warmth corresponds to the clinical quality beneath it, and whether quality holds when a patient writes in a language other than English, have not been quantified with language-matched physician adjudication.

Methods

Four consumer chatbots (ChatGPT, Claude, Gemini, DeepSeek) were crossed with 21 forum-derived patient questions in six languages (English, Hebrew, French, Russian, Arabic, Thai). Two language-matched physician raters per language, blinded to chatbot identity, scored each of 504 responses on five 1–5 Likert dimensions (accuracy, safety, referral, cultural and local appropriateness, empathy; 1,008 rater-response records, 5,040 dimension-level scores). Primary analyses estimated language- and chatbot-associated variance per dimension and a clinical-substance composite; secondary analyses included the catastrophic-rating proportion, an empathy-discrimination test, and a language-property triangulation analysis.
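The design counts stated above follow from the fully crossed layout, and a minimal sketch can confirm the arithmetic. The variable names are illustrative assumptions and are not taken from the study's materials; only the factor levels and totals come from the Methods text.

```python
from itertools import product

# Factor levels as listed in the Methods paragraph.
chatbots = ["ChatGPT", "Claude", "Gemini", "DeepSeek"]
languages = ["English", "Hebrew", "French", "Russian", "Arabic", "Thai"]
n_questions = 21
raters_per_language = 2
dimensions = [
    "accuracy",
    "safety",
    "referral",
    "cultural_local_appropriateness",
    "empathy",
]

# One response per (chatbot, question, language) cell.
responses = list(product(chatbots, range(n_questions), languages))
n_responses = len(responses)

# Each response is scored by two language-matched raters on five dimensions.
n_records = n_responses * raters_per_language
n_scores = n_records * len(dimensions)

print(n_responses, n_records, n_scores)
```

The three printed totals correspond to the responses, rater-response records, and dimension-level scores reported in the text.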

Results

Clinical substance varied far more by patient language than by chatbot identity (composite η² 0.275 vs 0.035), and the language effect on the composite was 9.5-fold larger than the language effect on empathy (η² 0.029). Among the individual dimensions, cultural and local appropriateness showed the largest language effect (η² 0.272); accuracy, safety, and referral were each ≈ 0.11. Failures concentrated in silent omission on stroke time-criticality, carbon-monoxide diagnostic reasoning, and workplace-anaphylaxis occupational framing (0/24 each); 0/120 sentinel-fact responses were confidently wrong. Across non-English emergency-relevant responses, 34.5% gave the correct local emergency number, and none defaulted to US 911. The catastrophic-rating proportion varied 4.3-fold across languages (3.6% for English to 15.5% for Hebrew and Thai). Empathy scores did not discriminate catastrophic safety ratings (AUC 0.49). A language-property triangulation (URIEL distance, tokenization fertility, Joshi tier) gave in-sample AUC 0.90, LOOCV 0.92.
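For readers unfamiliar with η², the sketch below shows the classical one-way definition (between-group sum of squares divided by total sum of squares) and re-derives the two fold ratios quoted above. This is an assumption about the general form of the statistic, not the authors' analysis code: the abstract does not say whether η² came from a one-way ANOVA, a partial η², or a mixed model, and the toy scores are invented for illustration only.

```python
def eta_squared(groups: dict[str, list[float]]) -> float:
    """Classical one-way eta squared: SS_between / SS_total."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_between = sum(
        len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    return ss_between / ss_total


# Toy data (hypothetical, NOT from the study): two groups of Likert scores.
toy = {"group_a": [1.0, 2.0, 3.0], "group_b": [4.0, 5.0, 6.0]}
toy_eta = eta_squared(toy)

# Fold ratios recomputed from the figures reported in the Results.
language_vs_empathy = round(0.275 / 0.029, 1)   # composite vs empathy eta squared
catastrophic_range = round(15.5 / 3.6, 1)       # Hebrew/Thai vs English, in percent

print(toy_eta, language_vs_empathy, catastrophic_range)
```

The recomputed ratios round to the 9.5-fold and 4.3-fold figures in the text, which confirms that the 9.5-fold comparison is between the composite and empathy effect sizes.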

Conclusions

Clinical substance and warmth dissociated under patient language: substance degraded, warmth was preserved and did not track danger, and failures were silent omissions rather than confident errors. The pattern held across four independently trained products and six typologically distinct languages, which marks it as a phenomenon rather than a vendor comparison. English-only premarket evaluation does not characterize the quality on which non-English-speaking patients rely.
