When Artificial Intelligence Speaks For The Obstetrician: Multilingual Accuracy On Real Patient Questions
Abstract
Background

The use of artificial intelligence technologies, particularly generative large language models (LLMs), in the health sector is growing rapidly. These models offer a new level of health counselling by enabling patients to access information more easily. However, comprehensive data are limited on the accuracy and clarity of responses from different LLMs across languages, as well as on the reliability of the scientific references they provide. This study compares the performance and reference quality of three free, open-access LLMs (ChatGPT, Google Gemini, DeepSeek) in both Turkish and English on frequently asked questions about pregnancy. The goal is to enhance digital health literacy and to assess the effectiveness and limitations of artificial intelligence-supported health tools.

Methods

In this comparative, observational, descriptive study, the 14 most common pregnancy questions encountered in the clinic and among the patient population were identified. These questions were posed to three LLMs (ChatGPT, Gemini, DeepSeek) in both Turkish and English, with instructions to draw on up-to-date, scientific, and reliable sources via a web extension. The models' initial answers were evaluated by an independent team of obstetricians and gynaecologists, blinded to the model of origin. The references accompanying the answers were assessed for reliability, scientific validity, and accessibility by a separate team of specialised physicians. The collected data were analysed statistically.

Results

Language and model infrastructure played a significant role in LLM performance. Google Gemini's and DeepSeek's answers to English questions achieved statistically higher accuracy scores than their answers to Turkish questions (p < 0.05). For ChatGPT, no overall accuracy difference was observed between answers to Turkish and English questions.
Among the LLMs, Google Gemini and DeepSeek were statistically more accurate than ChatGPT when questions were asked in English (p < 0.05). When the same questions were posed in Turkish, no significant difference in accuracy was found among the three LLMs (p > 0.05). Regarding reference reliability, ChatGPT achieved a significantly higher "perfect reference" rate than the other models for both Turkish (78.5%) and English (71.4%) questions.

Conclusions

This study shows that LLMs can be used for general information about pregnancy but have significant limitations in language competence and reference reliability. In particular, the response quality of LLMs may suffer for non-native English speakers, and the references they provide may not always be reliable. These findings emphasise that LLMs alone are not a reliable tool in situations requiring patient-specific and clinical decisions. In the future, training LLMs in patients' native languages, strengthening reference-validation algorithms, and developing systems that integrate physician supervision will be critical to maximising the potential of AI in healthcare. In this context, AI should be positioned as a tool that supports patient care under physician supervision rather than one that replaces physicians.