Language-dependent variability in large language model performance on pharmaceutical knowledge tasks
Abstract
Large language models (LLMs) are increasingly being considered for applications in healthcare systems, yet their reliability across languages has not been fully examined. In this study, we investigated language-dependent differences in LLM performance on pharmaceutical knowledge questions using datasets derived from Japanese professional examinations. A total of 32 open-weight LLMs were evaluated on questions from the Japanese National Examination for Pharmacists, in both the original Japanese and English translation, together with questions from the Japanese Registered Salesperson Examination (validated total number of questions, n = 4,892). Several open-weight LLMs achieved high accuracy on the pharmaceutical examination questions, performing well above the passing thresholds required for human examinees. Life science–related subjects generally showed higher accuracy, whereas chemistry and regulatory topics showed lower performance. On identical questions from the Japanese National Examination for Pharmacists presented in Japanese and English (n = 1,045), significant differences in accuracy were observed across multiple models. The magnitude of these language-related differences varied across knowledge domains: accuracy differences were larger in domains such as chemistry and regulatory topics and smaller in pharmacology. Model performance increased with model size, indicating an overall scaling relationship between parameter count and accuracy; however, language-related differences did not consistently decrease with increasing model size. In addition, analyses of subject-level performance correlations suggested that the relationships among knowledge domains differed between questions presented in Japanese and those presented in English. These findings suggest that LLM performance on pharmaceutical knowledge tasks may vary across language conditions and that language differences may also influence response consistency across related knowledge domains.
Together, these observations highlight the importance of evaluating AI systems in the languages relevant to their intended healthcare settings.