The Promise and Peril of Large Language Models in Digital Health: GPT-4 Personalizes Cardiovascular Patient Education but Amplifies Gender Biases
Abstract
Background
Gender-neutral patient education materials often overlook critical sex-based differences in cardiovascular disease (CVD). Large Language Models (LLMs) like GPT-4 offer a potential tool for personalizing health communication, but their ability to correct gender gaps without introducing new biases is unknown.
Methods
We identified seven publicly available English-language CVD prevention handouts from major health organizations. Using the GPT-4 API in August 2025, we generated gender-specific revisions of each handout for a 55-year-old male audience and a 55-year-old female audience, using standardized, structured prompts that instructed the model to include evidence-based, sex-specific risk factors and symptoms. Original and revised materials were evaluated using Flesch-Kincaid Reading Ease, a novel 10-point gender-inclusivity checklist, and qualitative thematic analysis.
Findings
GPT-4 revisions substantially improved gender-inclusivity scores (Original median: 3.0/10, IQR=0.0-3.0; Male-tailored median: 8.0, IQR=7.0-9.0; Female-tailored median: 10.0, IQR=10.0-10.0). Readability was maintained. However, qualitative analysis revealed that while female-tailored versions excelled at incorporating biological facts (e.g., menopause), male-tailored versions often missed key clinical factors. For instance, 4 of 7 revisions failed to mention erectile dysfunction as a CVD risk marker. Revisions also occasionally relied on social stereotypes (e.g., “bottling up emotions”). Furthermore, both versions showed evidence of linguistic bias (framing female symptoms as ‘atypical,’ thereby reinforcing the male-centric clinical paradigm) and gendered assumptions in recommended activities.
Conclusion
LLMs can rapidly improve gender-specificity in patient education but can also perpetuate harmful stereotypes and linguistic biases likely absorbed from their training data. Their use requires careful, critical oversight to avoid undermining, and potentially to advance, health equity.
Author Summary
Why was this study done?
Heart disease affects men and women differently, but most patient education materials are written the same way for everyone. We wanted to see whether Artificial Intelligence, specifically large language models like GPT-4, could help rewrite these materials to be more accurate and helpful for each gender.
What did the researchers do and find?
We took seven online heart disease prevention handouts from major public health and clinical organizations. We used GPT-4 to create new versions specifically for men and for women, including male-specific risk markers (e.g., erectile dysfunction) and female-specific risk enhancers (e.g., menopause). We found that the original handouts were not very gender-specific. The AI-revised versions were dramatically better, especially for women, providing more relevant information without making the text harder to read. However, we also found that the AI sometimes made mistakes, such as using stereotypes about how men handle emotions.
What do these findings mean?
These findings mean AI can be a powerful assistant for creating drafts, but it is not a replacement for human expertise. To be used safely, a healthcare professional must always check the AI's work to ensure it is medically accurate and free from harmful stereotypes.