Quality, Reliability, and Readability of AI-Generated Breastfeeding Information: A Comparative Evaluation of Four Large Language Models
Abstract
Background
Large language models (LLMs) are increasingly used to provide breastfeeding information, yet concerns remain regarding the quality, reliability, and readability of AI-generated health content.

Objective
To comparatively evaluate the information quality, scientific reliability, and readability of breastfeeding-related responses generated by four widely used LLMs.

Methods
This descriptive cross-sectional study (September 2025) assessed responses from ChatGPT-5, Google Gemini, DeepSeek, and Claude to 10 expert-validated, clinically critical breastfeeding FAQs, derived from an initial pool of 100 questions (LLM-generated and Google “People also ask”). Prompts were submitted in newly initiated chat sessions on the same day. A blinded panel of three independent experts (a pediatrician, an obstetrician–gynecologist, and a senior midwife) rated each response using DISCERN (range 16–80) for information quality and a 5-point Likert scale for scientific reliability; readability was assessed with the Flesch Reading Ease Score (FRES) and the Flesch–Kincaid Grade Level (FKGL). Differences across models were tested with the Friedman test and Dunn’s post-hoc comparisons (DISCERN, Likert) and one-way ANOVA (readability). Ethical approval was obtained from the Atatürk University Faculty of Health Sciences Ethics Committee.

Results
DISCERN scores differed significantly across models (χ²(3) = 76.50, p < .001). DeepSeek (37.20 ± 7.17) and Claude (34.27 ± 4.93) scored higher than ChatGPT (19.93 ± 2.86) and Gemini (22.40 ± 2.19) (p < .05); no model reached “excellent” quality (≥ 63). Scientific-reliability ratings also differed (χ²(3) = 62.50, p < .001), with DeepSeek (3.47 ± 0.63) and Claude (3.17 ± 0.38) rated higher than ChatGPT (2.03 ± 0.18) and Gemini (2.07 ± 0.37). Readability differed as well (FRES: F(3,36) = 3.54, p = .024; FKGL: F(3,36) = 3.57, p = .023), and all models exceeded the ideal ≤ 6 FKGL benchmark.

Conclusions
LLMs show a clear trade-off between informational quality and readability. DeepSeek and Claude produced more comprehensive, guideline-consistent content, but it was less readable; ChatGPT and Gemini were more readable, albeit with lower quality and reliability. Expert oversight remains essential before integrating LLM outputs into breastfeeding education.
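For readers unfamiliar with the two readability indices, both are standard closed-form formulas over word, sentence, and syllable counts. The abstract does not state which implementation the authors used, so the expressions below are the canonical definitions, not the study’s code:

```latex
\mathrm{FRES} = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}} - 84.6\,\frac{\text{total syllables}}{\text{total words}}

\mathrm{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
```

On the FRES scale higher means easier (scores of 60–70 correspond roughly to plain English), while FKGL maps directly to a U.S. school grade, which is why a grade level of 6 or below is the benchmark commonly recommended for patient-facing materials.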
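A minimal sketch of the reported analysis pipeline, assuming SciPy for the Friedman test and one-way ANOVA and the scikit-posthocs package for Dunn’s post-hoc comparisons. The arrays are random placeholders standing in for the rater-by-question score matrices, not the study data:

```python
# Illustrative sketch of the abstract's statistical comparisons (not the authors' code).
import numpy as np
from scipy import stats
import scikit_posthocs as sp  # assumed here for Dunn's test; pip install scikit-posthocs

rng = np.random.default_rng(0)
models = ["ChatGPT-5", "Gemini", "DeepSeek", "Claude"]

# Placeholder DISCERN totals (range 16-80): 30 paired ratings per model
# (3 experts x 10 questions), one array per model.
discern = {m: rng.integers(16, 81, size=30) for m in models}

# Friedman test: nonparametric comparison of related samples across the 4 models.
chi2, p = stats.friedmanchisquare(*discern.values())
print(f"Friedman chi2(3) = {chi2:.2f}, p = {p:.4f}")

# Dunn's post-hoc pairwise comparisons with Bonferroni adjustment.
pairwise_p = sp.posthoc_dunn(list(discern.values()), p_adjust="bonferroni")
print(pairwise_p)

# One-way ANOVA on a readability metric (e.g., FKGL), 10 responses per model,
# which yields the F(3, 36) degrees of freedom reported in the abstract.
fkgl = {m: rng.normal(10.0, 2.0, size=10) for m in models}
f_stat, p_anova = stats.f_oneway(*fkgl.values())
print(f"ANOVA F(3,36) = {f_stat:.2f}, p = {p_anova:.4f}")
```

The Friedman test is the natural choice for the ordinal DISCERN and Likert ratings because the same 10 questions (and the same three raters) are repeated across all four models, making the samples related rather than independent.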