Quality, Reliability, and Readability of AI-Generated Breastfeeding Information: A Comparative Evaluation of Four Large Language Models
Abstract
Background
Large language models (LLMs) are increasingly used to provide breastfeeding information, yet concerns remain regarding the quality, reliability, and readability of AI-generated health content.

Objective
To comparatively evaluate the information quality, scientific reliability, and readability of breastfeeding-related responses generated by four widely used LLMs.

Methods
This descriptive cross-sectional study (September 2025) assessed responses from ChatGPT-5, Google Gemini, DeepSeek, and Claude to 10 expert-validated, clinically critical breastfeeding FAQs, derived from an initial pool of 100 questions (LLM-generated and Google “People also ask”). Prompts were submitted in newly initiated chat sessions on the same day. A blinded panel of three independent experts (a pediatrician, an obstetrician–gynecologist, and a senior midwife) rated each response using DISCERN (range 16–80) for information quality and a 5-point Likert scale for scientific reliability; readability was assessed with the Flesch Reading Ease Score (FRES) and the Flesch–Kincaid Grade Level (FKGL). Differences across models were tested with the Friedman test and Dunn’s post-hoc comparisons (DISCERN, Likert) and one-way ANOVA (readability). Ethical approval was obtained from the Atatürk University Faculty of Health Sciences Ethics Committee.

Results
DISCERN scores differed significantly across models (χ²(3) = 76.50, p < .001). DeepSeek (37.20 ± 7.17) and Claude (34.27 ± 4.93) scored higher than ChatGPT (19.93 ± 2.86) and Gemini (22.40 ± 2.19) (p < .05); no model reached “excellent” quality (≥ 63). Scientific-reliability ratings also differed (χ²(3) = 62.50, p < .001), with DeepSeek (3.47 ± 0.63) and Claude (3.17 ± 0.38) rated higher than ChatGPT (2.03 ± 0.18) and Gemini (2.07 ± 0.37). Readability differed as well (FRES: F(3,36) = 3.54, p = .024; FKGL: F(3,36) = 3.57, p = .023), and all models exceeded the ideal ≤ 6 FKGL benchmark.

Conclusions
LLMs show a clear trade-off between informational quality and readability. DeepSeek and Claude produced more comprehensive, guideline-consistent content, but it was less readable; ChatGPT and Gemini were more readable, albeit with lower quality and reliability. Expert oversight remains essential before integrating LLM outputs into breastfeeding education.
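For readers unfamiliar with the two readability indices, both are standard closed-form formulas over word, sentence, and syllable counts. The abstract does not state which implementation the authors used, so the expressions below are the canonical definitions, not the study’s code:

```latex
\mathrm{FRES} = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}} - 84.6\,\frac{\text{total syllables}}{\text{total words}}

\mathrm{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
```

On the FRES scale higher means easier (scores of 60–70 correspond roughly to plain English), while FKGL maps directly to a U.S. school grade, which is why a grade level of 6 or below is the benchmark commonly recommended for patient-facing materials.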
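A minimal sketch of the reported analysis pipeline, assuming SciPy for the Friedman test and one-way ANOVA and the scikit-posthocs package for Dunn’s post-hoc comparisons. The arrays are random placeholders standing in for the rater-by-question score matrices, not the study data:

```python
# Illustrative sketch of the abstract's statistical comparisons (not the authors' code).
import numpy as np
from scipy import stats
import scikit_posthocs as sp  # assumed here for Dunn's test; pip install scikit-posthocs

rng = np.random.default_rng(0)
models = ["ChatGPT-5", "Gemini", "DeepSeek", "Claude"]

# Placeholder DISCERN totals (range 16-80): 30 paired ratings per model
# (3 experts x 10 questions), one array per model.
discern = {m: rng.integers(16, 81, size=30) for m in models}

# Friedman test: nonparametric comparison of related samples across the 4 models.
chi2, p = stats.friedmanchisquare(*discern.values())
print(f"Friedman chi2(3) = {chi2:.2f}, p = {p:.4f}")

# Dunn's post-hoc pairwise comparisons with Bonferroni adjustment.
pairwise_p = sp.posthoc_dunn(list(discern.values()), p_adjust="bonferroni")
print(pairwise_p)

# One-way ANOVA on a readability metric (e.g., FKGL), 10 responses per model,
# which yields the F(3, 36) degrees of freedom reported in the abstract.
fkgl = {m: rng.normal(10.0, 2.0, size=10) for m in models}
f_stat, p_anova = stats.f_oneway(*fkgl.values())
print(f"ANOVA F(3,36) = {f_stat:.2f}, p = {p_anova:.4f}")
```

The Friedman test is the natural choice for the ordinal DISCERN and Likert ratings because the same 10 questions (and the same three raters) are repeated across all four models, making the samples related rather than independent.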