Evaluation of Gender Bias in the Evaluation of Synthetic Cardiovascular Disease Cases with Open Source LLMs


Abstract

Objective

To systematically evaluate gender bias in open-source large language models (LLMs) for cardiovascular diagnostic decision-making using controlled synthetic case vignettes.

Methods

We generated 500 synthetic cardiovascular cases with randomly assigned gender (male/female, equal distribution) and age (45-80 years), keeping all other clinical variables identical. Two structured prompts simulated sequential cardiovascular evaluation stages: initial chest discomfort presentation and post-stress-test evaluation. Three open-source LLMs were evaluated via a local Ollama API: Gemma-2b, Phi, and TinyLLaMA. Primary outcomes included coronary artery disease (CAD) likelihood ratings (low/intermediate/high), diagnostic certainty (low/intermediate/high), and test usefulness scores (1-10 scale). Statistical analysis included chi-square tests, Mann-Whitney U tests, and logistic/linear regression with multiple-comparison adjustments. Power analysis indicated minimum detectable effects of 12.5% for individual models and 7.2% for pooled data.
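
For illustration, the minimal sketch below shows the core loop implied by this design: randomizing only gender and age, then querying a locally served model through Ollama's default REST endpoint. The exact prompt wording, model tags, and response parsing used in the study are not given in this abstract and are assumed here.

```python
# Illustrative sketch only; prompt text and model tags are assumptions.
import json
import random
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
MODELS = ["gemma:2b", "phi", "tinyllama"]            # assumed registry tags

def make_case(case_id: int) -> dict:
    """Randomize gender and age; all other clinical variables stay fixed."""
    return {
        "id": case_id,
        "gender": random.choice(["male", "female"]),
        "age": random.randint(45, 80),
    }

def query(model: str, prompt: str) -> str:
    """Send one non-streaming generation request to the local Ollama API."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

case = make_case(1)
prompt = (
    f"A {case['age']}-year-old {case['gender']} patient presents with chest discomfort. "
    "Rate CAD likelihood (low/intermediate/high), diagnostic certainty "
    "(low/intermediate/high), and test usefulness (1-10)."
)
for model in MODELS:
    print(model, query(model, prompt)[:80])
```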

Results

Evaluation of 1,500 model responses (500 cases × 3 models) revealed minimal gender-related differences. Only one comparison reached unadjusted statistical significance: Gemma-2b assigned higher diagnostic certainty to female patients at initial presentation (58% vs. 48%, p=0.031; adjusted p=0.092). No gender-based difference remained significant after multiple-comparison adjustment. Effect sizes were consistently small across all comparisons (Cohen’s h: 0.01-0.18; Cliff’s delta: -0.11 to 0.12). Substantial inter-model variability was observed, with Gemma-2b and Phi demonstrating assertive diagnostic patterns while TinyLLaMA showed conservative tendencies. Parsing success rates exceeded 95% for all models.
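
As a reference for the effect-size metrics cited above, the following sketch computes Cohen's h and Cliff's delta; the input values are illustrative, not study data.

```python
# Minimal sketch of the two reported effect-size metrics (illustrative inputs).
import numpy as np

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions: 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2))."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

def cliffs_delta(x, y) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    diffs = np.asarray(x)[:, None] - np.asarray(y)[None, :]
    return ((diffs > 0).sum() - (diffs < 0).sum()) / diffs.size

print(round(cohens_h(0.55, 0.50), 3))        # ~0.10, a small effect
print(cliffs_delta([7, 8, 6, 9], [6, 7, 7, 8]))  # 0.25 for these illustrative ratings
```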

Conclusions

Open-source LLMs demonstrated largely gender-neutral outputs in controlled cardiovascular scenarios, contrasting with documented biases in human clinicians and commercial LLMs. The isolated gender effect in Gemma-2b was modest and not clinically meaningful. More concerning was the substantial inter-model variability in diagnostic confidence and test recommendations, highlighting the critical importance of rigorous model benchmarking before clinical deployment. These preliminary findings suggest that open-source LLMs may offer advantages for equitable healthcare applications, but broader validation across diverse clinical contexts and real-world constraints remains essential.
