Evaluating Large Language Models for ADHD Education: A Comparative Study of ChatGPT-5, DeepSeek V3, and Grok 4
Abstract
Background
Children with attention-deficit/hyperactivity disorder (ADHD) often face barriers to participating in organized sports, particularly when physical education (PE) is delivered by outsourced coaches with limited training in disability inclusion. Meanwhile, large language models (LLMs) such as ChatGPT, DeepSeek, and Grok are increasingly used to generate educational content, yet their readability, stability, and accuracy for non-specialist educators remain unclear.
Methods
This study systematically compared three advanced LLMs (ChatGPT-5, DeepSeek V3, and Grok 4) using identical prompts covering ADHD definitions, symptoms, and medication–exercise interactions. Thirty responses per model were collected and analyzed for content accuracy, readability (Flesch Reading Ease, Flesch–Kincaid Grade Level, and the SMOG index), and lexical complexity.
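For readers who wish to reproduce such scoring, the sketch below is a minimal illustration (not the study's actual analysis pipeline), assuming the open-source Python textstat package; the `responses` list is a hypothetical stand-in for the collected model outputs.

```python
# Minimal sketch: scoring a batch of LLM responses for readability.
# Assumes the open-source `textstat` package (pip install textstat).
# `responses` is a hypothetical placeholder, one string per model output.
import statistics
import textstat

responses = [
    "ADHD is a neurodevelopmental disorder characterized by ...",
    # ... one entry per collected response
]

def readability_profile(texts):
    """Return mean Flesch Reading Ease, Flesch-Kincaid Grade Level,
    and SMOG index across a set of texts."""
    return {
        "FRE": statistics.mean(textstat.flesch_reading_ease(t) for t in texts),
        "FKGL": statistics.mean(textstat.flesch_kincaid_grade(t) for t in texts),
        "SMOG": statistics.mean(textstat.smog_index(t) for t in texts),
    }

print(readability_profile(responses))
```

Note that the SMOG index is most reliable on samples of 30 or more sentences, so per-response scores on short outputs should be interpreted with caution.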
Results
All models aligned with DSM-5 criteria in describing ADHD but differed in emphasis and stability. DeepSeek V3 produced the broadest and most variable outputs, Grok 4 showed the greatest consistency and clinical structure, and ChatGPT-5 generated concise, strengths-based explanations. However, all models exhibited high reading levels (FKGL > 12), exceeding the sixth-to-eighth-grade level typically recommended for public-health materials.
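For context, the Flesch–Kincaid Grade Level maps two surface statistics of a text onto a U.S. school-grade scale:

\[
\mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59
\]

so a score above 12 indicates text pitched beyond a twelfth-grade (high-school senior) reading level.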
Conclusion
While LLMs demonstrate strong potential for generating ADHD-related educational materials, their current readability and stability limitations restrict accessibility for non-specialist educators. Future work should focus on optimizing prompt design and language calibration to enhance usability in inclusive education contexts.