Multilingual Evaluation of a Large Language Model-Based Primary Care Chatbot

Pei-Lun Chen
Amogh Ananda Rao
Sydney Pugh
Kevin B Johnson

Abstract

Pre-visit planning has the potential to reduce EHR documentation burden while improving workflow efficiency, care quality, and patient–provider engagement. Large language model (LLM) chatbots show promise for supporting this task. Their English-centric development raises concerns about disparity, but the extent to which these concerns translate into performance degradation in multilingual clinical settings remains unclear. In this mixed-methods study, we systematically evaluate the multilingual capabilities of PCP-Bot, an English-developed, LLM-based (GPT-4o) clinical chatbot that collects patient concerns and generates structured, physician-ready summaries (∼200 words) under structured output constraints. We enrolled 31 bilingual individuals (11 Mandarin, 10 Spanish, 10 Hindi) to role-play as patients and interact with PCP-Bot across five synthetic clinical cases in both English and a second language. Participants completed a structured survey comprising baseline language proficiency screening, standardized interactions with PCP-Bot in each language, and post-interaction evaluations. Case order was randomized, with each scenario completed first in English and subsequently in the participant’s second language. All summaries were generated in English, regardless of the interaction language. Our results show that Hindi achieved usability and conversation quality parity with English across all measured dimensions. Mandarin achieved usability parity but showed a significant conversation quality gap relative to English. Spanish demonstrated significant deficits in both conversation quality and summary quality. Trust and workload remained consistent across languages. Qualitatively, participants found PCP-Bot natural, smooth, and accurate overall, but noted repetition, transcription errors, missed follow-ups, and more frequent usability issues in non-English interactions. Overall, our findings demonstrate that LLM translation capabilities can enable effective deployment beyond English following appropriate performance validation.

Version published to 10.64898/2026.05.03.26352241 on medRxiv, May 5, 2026
