When Artificial Intelligence Speaks For The Obstetrician: Multilingual Accuracy On Real Patient Questions
Abstract
Background

The use of artificial intelligence technologies, particularly generative large language models (LLMs), in the health sector is growing rapidly. These models offer a new level of health counselling by enabling patients to access information more easily. However, comprehensive data are limited on the accuracy and clarity of responses from different LLMs across languages, as well as on the reliability of the scientific references they provide. This study compares the performance and reference quality of three free, open-access LLMs (ChatGPT, Google Gemini, DeepSeek) in both Turkish and English on frequently asked questions about pregnancy. The goal is to enhance digital health literacy and to assess the effectiveness and limitations of artificial intelligence-supported health tools.

Methods

In this comparative, observational, descriptive study, the 14 most common pregnancy questions encountered in the clinic and among the patient population were identified. These questions were posed to three LLMs (ChatGPT, Gemini, DeepSeek) in both Turkish and English, with instructions to draw on up-to-date, scientific, and reliable sources via a web extension. The models' initial answers were evaluated by an independent team of obstetricians and gynaecologists, blinded to the model of origin. The references accompanying the answers were assessed for reliability, scientific validity, and accessibility by a separate team of specialised physicians. The collected data were analysed statistically.

Results

Language and model infrastructure played a significant role in LLM performance. Google Gemini's and DeepSeek's answers to English questions achieved statistically higher accuracy scores than their answers to Turkish questions (p < 0.05). For ChatGPT, no overall accuracy difference was observed between answers to Turkish and English questions.
Among the LLMs, Google Gemini and DeepSeek were statistically more accurate than ChatGPT when questions were asked in English (p < 0.05). When the same questions were posed in Turkish, no significant difference in accuracy was found among the three LLMs (p > 0.05). Regarding reference reliability, ChatGPT achieved a significantly higher "perfect reference" rate than the other models for both Turkish (78.5%) and English (71.4%) questions.

Conclusions

This study shows that LLMs can be used for general information about pregnancy but have significant limitations in language competence and reference reliability. In particular, the response quality of LLMs may suffer for non-native English speakers, and the references they provide may not always be reliable. These findings emphasise that LLMs alone are not a reliable tool in situations requiring patient-specific and clinical decisions. In the future, training LLMs in patients' native languages, strengthening reference-validation algorithms, and developing systems that integrate physician supervision will be critical to maximising the potential of AI in healthcare. In this context, AI should be positioned as a tool that supports patient care under physician supervision rather than one that replaces physicians.