Response consistency of ChatGPT-4o for Type 2 Diabetes Nutrition and Physical-activity Recommendations: A Pilot NLP-based Assessment of GPT outputs
Abstract
Generative AI tools such as ChatGPT are increasingly used by the public to seek guidance on diet and physical activity for type 2 diabetes (T2D) prevention and management. However, the consistency of model outputs across different users and disease-stage scenarios remains insufficiently characterized. This pilot study evaluated the word-level and semantic-level consistency of GPT-4o’s diet and physical activity responses for T2D prevention and management. We designed 12 prompts covering four categories: prediabetes, diagnosed T2D, diagnosed T2D with complications, and general questions that did not specify dysglycemia stage. Word-level similarity was quantified with Term Frequency-Inverse Document Frequency (TF-IDF) cosine scores; sentence-level semantic similarity was measured as entailment probabilities computed with a large language model (LLM), DeBERTa-v3 MNLI. Across users, mean cosine similarity was moderate (0.44–0.66), whereas mean entailment similarity was higher (0.68–0.81). Across stages, word-level similarity was low to moderate (0.28–0.63), and entailment similarity remained moderate to high (0.63–0.80). Low-similarity response pairs commonly involved distinct food choices, operational details, safety warnings, and stage-specific suggestions. GPT-4o generated semantically consistent but variably detailed responses, and the moderate semantic variation suggested limited differentiation of response content across diabetes-related stages in this pilot consistency assessment.
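To make the word-level metric concrete, the sketch below computes TF-IDF cosine similarity between responses and averages it over all response pairs. It is an illustration under stated assumptions, not the authors' pipeline: the tokenizer, the smoothed IDF formula, the function names (`tfidf_vectors`, `cosine`, `mean_pairwise_similarity`), and the sample responses are all hypothetical, and the entailment step with DeBERTa-v3 MNLI is not shown.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Assumption: lowercase and keep alphanumeric runs; the study's
    # actual preprocessing is not described in this section.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(docs):
    """Average cosine similarity over every unordered pair of documents."""
    if len(docs) < 2:
        raise ValueError("need at least two documents to compare")
    vectors = tfidf_vectors(docs)
    pairs = list(combinations(range(len(vectors)), 2))
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


# Hypothetical responses to the same prompt from three users.
responses = [
    "Choose whole grains and vegetables, and walk 30 minutes a day.",
    "Eat more vegetables and whole grains; aim for daily brisk walking.",
    "Monitor blood glucose and talk to your doctor before exercising.",
]
print(round(mean_pairwise_similarity(responses), 2))
```

Because TF-IDF cosine only rewards shared vocabulary, two responses that give the same advice in different words score low here, which is consistent with the abstract reporting higher entailment similarity than cosine similarity.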
Author Summary
This pilot study investigated the consistency of nutrition and physical activity recommendations generated by ChatGPT-4o for type 2 diabetes. While content accuracy is an important aspect of evaluating AI-generated health advice, answer consistency also matters, especially for medical guidance such as diabetes nutrition and lifestyle recommendations. We collected single-round responses from ChatGPT-4o and quantitatively compared the generated answers across users and disease stages. Overall, ChatGPT-4o provided generally consistent recommendations on broad topics, including healthy eating, physical activity, weight management, and blood glucose monitoring. However, the operational details varied across users and stages, such as how recommendations were prioritized, framed, and translated into specific actions. Further details are discussed in the paper. This study serves as a proof of concept showing that the consistency of AI-generated health recommendations can be measured quantitatively. Future work may extend this publicly available framework to follow-up conversations, larger sample sizes, more diverse user profiles, and further evaluation of accuracy and actionability.