Evaluating the Influence of Demographic Identity in the Medical Use of Large Language Models
Abstract
As large language models (LLMs) are increasingly adopted in medical decision-making, concerns about demographic biases in AI-generated recommendations remain unaddressed. In this study, we systematically investigate how demographic attributes, specifically race and gender, affect the diagnostic, medication, and treatment decisions of LLMs. Using the MedQA dataset, we construct a controlled evaluation framework comprising 20,000 test cases with systematically varied doctor-patient demographic pairings. We evaluate two LLMs of different scales: Claude 3.5 Sonnet, a high-performance proprietary model, and Llama 3.1-8B, a smaller open-source alternative. Our analysis reveals significant disparities in both accuracy and bias patterns across models and tasks. While Claude 3.5 Sonnet demonstrates higher overall accuracy and more stable predictions, Llama 3.1-8B exhibits greater sensitivity to demographic attributes, particularly in diagnostic reasoning. Notably, we observe the largest accuracy drop when Hispanic patients are treated by White male doctors, underscoring potential risks of bias amplification. These findings highlight the need for rigorous fairness assessments in medical AI and inform strategies to mitigate demographic biases in LLM-driven healthcare applications.
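To make the evaluation setup concrete, the sketch below shows one way such demographic-paired test cases could be generated from MedQA-style items. This is a minimal illustration, not the authors' implementation: the attribute lists, prompt template, and field names (`question`, `answer`) are assumptions introduced here for clarity.

```python
# Illustrative sketch (not the paper's code): crossing MedQA-style questions
# with every doctor-patient demographic pairing. Attribute values and the
# prompt wording are hypothetical stand-ins.
from itertools import product

RACES = ["White", "Black", "Hispanic", "Asian"]  # assumed attribute values
GENDERS = ["male", "female"]                     # assumed attribute values


def build_cases(medqa_items):
    """Cross each question with all doctor-patient demographic pairings."""
    profiles = list(product(RACES, GENDERS))
    cases = []
    for item in medqa_items:  # item: {"question": ..., "answer": ...} (assumed schema)
        for (d_race, d_gender), (p_race, p_gender) in product(profiles, profiles):
            prompt = (
                f"A {d_race} {d_gender} doctor is treating a {p_race} {p_gender} patient.\n"
                f"{item['question']}\n"
                "Select the best answer."
            )
            cases.append({
                "prompt": prompt,
                "gold": item["answer"],
                "doctor": (d_race, d_gender),
                "patient": (p_race, p_gender),
            })
    return cases
```

Holding the underlying clinical question fixed while varying only the demographic framing isolates the effect of the doctor-patient pairing, so any accuracy difference across pairings can be attributed to the demographic attributes rather than to question difficulty.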