Comparative Readability of Large Language Model Responses to Male Infertility Questions: Impact of Contextual Prompting
Abstract
Large language models (LLMs) are increasingly used by patients seeking medical information, yet the readability of LLM-generated content on male infertility remains insufficiently characterized. We evaluated the readability of responses generated by five widely used LLM platforms to frequently asked questions (FAQs) on male infertility collected from urology association and hospital websites. Fifty-four FAQs were submitted to OpenAI (ChatGPT-5/5-mini), Claude (Sonnet 4.5), Google Gemini (2.5 Flash), DeepSeek (V3), and Grok (V3/V4) under two conditions: (1) no additional context and (2) contextual prompting that directed the model to explain its answer to a lay patient or couple worried about male infertility. Readability was assessed using the Flesch-Kincaid Reading Ease (FRE) and SMOG indices. In the non-prompted condition, DeepSeek generated the most readable responses (mean FRE 46.28±6.86; mean SMOG 11.71±0.76), whereas Claude produced the least readable outputs (mean FRE 23.83±12.29; mean SMOG 16.37±2.47). After prompting, Grok generated the most readable responses (mean FRE 69.71±6.09; mean SMOG 9.96±1.04), and readability improved across all models. These findings suggest that simple contextual prompting can substantially enhance the readability of LLM-generated male infertility education; however, readability gains must be paired with ongoing verification of clinical accuracy to mitigate misinformation risk.
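The two indices used above follow standard published formulas: Flesch Reading Ease scores text from roughly 0 (very difficult) to 100 (very easy) using sentence length and syllables per word, while SMOG estimates the school grade level from the count of polysyllabic (3+ syllable) words. A minimal Python sketch is shown below; note that the regex-based syllable counter is a crude heuristic of our own (real tools such as the `textstat` package use more careful, dictionary-informed counts), so scores from this sketch will only approximate those reported in the abstract's methodology.

```python
import math
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count vowel groups, discount a trailing silent "e".
    # This is an assumption for illustration, not the counter used in the study.
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    n = len(groups)
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def _tokenize(text: str):
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    return sentences, words

def flesch_reading_ease(text: str) -> float:
    # FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences, words = _tokenize(text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

def smog_index(text: str) -> float:
    # SMOG = 1.0430 * sqrt(polysyllables * 30/sentences) + 3.1291
    sentences, words = _tokenize(text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291
```

Under both formulas, shorter sentences and shorter words yield a higher FRE and a lower SMOG grade, which is why prompting models to address a lay audience moves both metrics in the "easier" direction.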