Language-dependent variability in large language model performance on pharmaceutical knowledge tasks
Abstract
Large language models (LLMs) are increasingly being considered for applications in healthcare systems, yet their reliability across languages has not been fully examined. In this study, we investigated language-dependent differences in LLM performance on pharmaceutical knowledge questions using datasets derived from Japanese professional examinations. A total of 32 open-weight LLMs were evaluated on questions from the Japanese National Examination for Pharmacists, in both the original Japanese and English translation, together with questions from the Japanese Registered Salesperson Examination (validated total number of questions, n = 4,892). Several open-weight LLMs achieved high accuracy on the pharmaceutical examination questions, performing well above the passing thresholds required for human examinees. Life science–related subjects generally showed higher accuracy, whereas chemistry and regulatory topics showed lower performance. On identical questions from the Japanese National Examination for Pharmacists presented in Japanese and English (n = 1,045), significant differences in accuracy were observed across multiple models. The magnitude of these language-related differences varied across knowledge domains: accuracy differences were larger in domains such as chemistry and regulatory topics and smaller in pharmacology. Model performance increased with model size, indicating an overall scaling relationship between parameter count and accuracy; however, language-related differences did not consistently decrease with increasing model size. In addition, analyses of subject-level performance correlations suggested that the relationships among knowledge domains differed between questions presented in Japanese and those presented in English. These findings suggest that LLM performance on pharmaceutical knowledge tasks may vary across language conditions and that language differences may also influence response consistency across related knowledge domains.
Together, these observations highlight the importance of evaluating AI systems in the languages relevant to their intended healthcare settings.