Performance Evaluation of Large Language Models in Real-World Perinatal Medication Consultations: A Cross-Sectional Study
Abstract
Introduction
Perinatal medication consultation is a core clinical pharmacy service that involves complex benefit–risk assessment for both maternal and fetal safety. Large language models (LLMs) have emerged as potential tools to improve access to medication information, yet their performance and safety in real-world, pharmacist-led perinatal consultation settings, particularly in non-English contexts, remain insufficiently evaluated.

Aim
To evaluate and compare the performance of multiple advanced LLMs in addressing real-world Chinese perinatal medication consultation queries, and to assess their potential role as supervised adjunctive tools within clinical pharmacy services.

Method
This cross-sectional study evaluated seven LLMs using real-world clinical data from pharmacist-led medication consultations at the Pharmacy Clinic of the Beijing Obstetrics and Gynecology Hospital, Capital Medical University. A standardized test set of 64 perinatal medication consultation questions was developed from 15,280 electronic consultation records collected between April 2014 and April 2024. The evaluated models comprised international (GPT-5.1, Grok 3, Gemini 3.0) and domestic (DeepSeek, Wenxin Yiyan, Kimi K2, Tongyi Qianwen) models. Senior clinical pharmacologists independently assessed responses across four dimensions (relevance, accuracy, usefulness, and empathy) using a 10-point Likert scale. Results were summarized as mean ± SD, and between-model differences were analyzed using non-parametric statistical tests.

Results
Among the 448 model-generated responses, inter-rater consistency was excellent (ICC = 0.91, 95% CI 0.88–0.94). Significant differences in overall performance were observed among the models (p < 0.001). GPT-5.1 achieved the highest mean total score (9.1 ± 0.8), outperforming all other models (all p < 0.01), followed by Kimi K2 and DeepSeek.
Accuracy was the primary determinant of performance differences, with GPT-5.1 achieving the highest accuracy score (9.2 ± 0.7). Performance gaps were more pronounced in complex clinical scenarios involving comorbidities or benefit–risk trade-offs, whereas domestic models showed relative advantages in consultations involving traditional Chinese medicine.

Conclusion
LLMs demonstrated variable performance on perinatal medication consultation queries. While high-performing models show potential to support pharmacist-led perinatal medication consultations by improving access to information, their current performance supports use only as supervised, adjunctive decision-support tools, rather than as independent sources of medication counseling. Careful governance, human oversight, and further validation of safety and reliability are required before broader integration into perinatal clinical pharmacy practice.
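The non-parametric between-model comparison described in the Methods can be sketched as follows. This is an illustrative example only, not the study's analysis code: the model labels and Likert scores below are synthetic placeholders, and the omnibus test (Kruskal–Wallis) plus pairwise follow-up (Mann–Whitney U) are one common choice for this design, since the abstract does not name the specific tests used.

```python
# Illustrative sketch of a non-parametric between-model comparison of
# 10-point Likert quality scores. All data below are synthetic.
from scipy.stats import kruskal, mannwhitneyu

# Hypothetical per-response scores for three models (placeholder names).
scores = {
    "model_a": [9, 9, 10, 8, 9, 9, 10, 8],
    "model_b": [8, 7, 8, 9, 7, 8, 8, 7],
    "model_c": [6, 7, 6, 5, 7, 6, 6, 7],
}

# Omnibus test: do the score distributions differ across models?
h_stat, p_value = kruskal(*scores.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4g}")

# Post-hoc pairwise comparison between the two top-scoring models.
u_stat, p_pair = mannwhitneyu(scores["model_a"], scores["model_b"])
print(f"model_a vs model_b: U = {u_stat:.1f}, p = {p_pair:.4g}")
```

In practice, pairwise p-values would also be adjusted for multiple comparisons (e.g. Bonferroni or Holm correction) before claims such as "all p < 0.01" are made.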