Response consistency of ChatGPT-4o for Type 2 Diabetes Nutrition and Physical-activity Recommendations: A Pilot NLP-based Assessment of GPT outputs

Yundan Zhang
Xue-Jing Liu
Qiongzhi Hu
Karla I. Galaviz
Ines Gonzalez Casanova
Jason Colditz
Danny Valdez

Abstract

Generative AI tools such as ChatGPT are increasingly used by the public to seek guidance on diet and physical activity for type 2 diabetes (T2D) prevention and management. However, the consistency of model outputs across different users and disease-stage scenarios remains insufficiently characterized. This pilot study evaluated the word-level and semantic-level consistency of GPT-4o’s diet and physical activity responses for T2D prevention and management. We designed 12 prompts covering four categories: prediabetes, diagnosed T2D, diagnosed T2D with complications, and general questions that did not specify a dysglycemia stage. Word-level similarity was quantified with Term Frequency-Inverse Document Frequency (TF-IDF) cosine scores; sentence-level semantic similarity was measured with a large language model (LLM), DeBERTa-v3 MNLI, used to calculate entailment probabilities. Across users, mean cosine similarity was moderate (0.44–0.66), whereas mean entailment similarity was higher (0.68–0.81). Across stages, word-level similarity was low to moderate (0.28–0.63), while entailment similarity remained moderate to high (0.63–0.80). Low-similarity responses commonly differed in food choices, operational details, safety warnings, and stage-specific suggestions. GPT-4o generated semantically consistent but variably detailed responses, and the moderate semantic variation suggested limited differentiation of response content across diabetes-related stages in this pilot consistency assessment.
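To make the word-level measure concrete, the following is a minimal sketch of TF-IDF cosine similarity averaged over all pairs of responses. It is an illustration only, not the authors' pipeline: the tokenizer, the smoothed IDF formula, the function names, and the example responses are all assumptions, and the study's actual preprocessing and TF-IDF settings are not stated in the abstract.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (assumed preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document, using a common smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(docs):
    """Average cosine similarity over every unordered pair of documents."""
    if len(docs) < 2:
        raise ValueError("need at least two documents to compare")
    vectors = tfidf_vectors(docs)
    scores = [cosine(a, b) for a, b in combinations(vectors, 2)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical responses to the same prompt from three users.
    responses = [
        "Choose whole grains and vegetables, and walk 30 minutes a day.",
        "Eat more vegetables and whole grains; aim for daily brisk walking.",
        "Limit sugary drinks and check your blood glucose before exercise.",
    ]
    print(round(mean_pairwise_similarity(responses), 2))
```

Because the score depends only on shared vocabulary, two responses that give the same advice in different words can score low here while still scoring high on an entailment-based measure, which matches the gap between the two metrics reported above.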

Author Summary

This pilot study investigated the consistency of nutrition and physical activity recommendations generated by ChatGPT-4o for type 2 diabetes. While content accuracy is an important aspect of evaluating AI-generated health advice, answer consistency also matters, especially for medical guidance such as diabetes nutrition and lifestyle recommendations. We collected one-round responses from ChatGPT-4o and quantitatively compared the generated answers across users and prevention stages. Overall, ChatGPT-4o provided generally consistent recommendations on broad topics, including healthy eating, physical activity, weight management, and blood glucose monitoring. However, the operational details varied across users and stages, such as how recommendations were prioritized, framed, and translated into specific actions. Further details are discussed in the paper. This study serves as a proof of concept showing that the consistency of AI-generated health recommendations can be measured quantitatively. Future work may expand this publicly available framework to follow-up conversations, larger sample sizes, more diverse user profiles, and further evaluation of accuracy and actionability.

Version published to 10.64898/2026.06.23.26356399 on medRxiv
Jun 26, 2026

Related articles

General-Purpose vs. Domain-Specific Large Language Models in Antibiotic Clinical Decision-Making: A Double-Blind Evaluation with a 2×2 Factorial Design

This article has 7 authors:
1. Yang Liu
2. Changjing Zhang
3. Feifei Wang
4. Wei Xu
5. Yunhe Zhang
6. Shaolin Ma
7. Haitao Zhang
This article has no evaluations
Latest version Jul 13, 2026

Extraction of Human Phenotype Ontology (HPO) Concepts from Clinical Notes Utilizing Large Language Models (LLM) with Model Context Protocol (MCP)

This article has 5 authors:
1. Michael Larsen
2. Ian M. Campbell
3. Lori A. Orlando
4. Peter Robinson
5. Nephi A. Walton
This article has no evaluations
Latest version May 25, 2026

Cost-Performance Evaluation of Large Language Models for Aspect-Based Sentiment Analysis of HCAHPS Patient Comments: A Validation Study

This article has 14 authors:
1. Khalid Nawab
2. Gretchen Ramsey
3. Samina Asfandiyar
4. Sayuj Atreya
5. Shadi Hijjawi
6. Sharatkumar Rokkam
7. Usman Ghayur
8. Akarshana Rajesh
9. Ihtesham Yousuf
10. Zefaf Ali Shah
11. Amit Kumar Misra
12. Madhushan Ponnala
13. Tauseef Hamid
14. Richard Schreiber
This article has no evaluations
Latest version Jun 15, 2026
