Identifying and Characterizing Bias at Scale in Clinical Notes Using Large Language Models

Abstract

Importance

Discriminatory language in clinical documentation affects patient care and reinforces systemic biases. Scalable tools to detect and mitigate such language are needed.

Objective

To determine the utility of a frontier large language model (GPT-4) in identifying and categorizing biased language in clinical notes and to evaluate its suggested revisions for debiasing.

Design

Cross-sectional study analyzing emergency department (ED) notes from the Mount Sinai Health System (MSHS) and discharge notes from MIMIC-IV.

Setting

MSHS, a large urban healthcare system, and MIMIC-IV, a public dataset.

Participants

We randomly selected 50,000 medical and nursing notes from the 230,967 adult patients with MSHS ED visits in 2023, and 500 discharge notes from the 145,915 patients in the MIMIC-IV database. One note was selected for each unique patient.

Main Outcomes and Measures

The primary measure was the accuracy of bias detection and categorization (discrediting, stigmatizing/labeling, judgmental, and stereotyping) compared with manual human review. Secondary measures were the proportion of patients with any bias, differences in bias prevalence across demographic and socioeconomic subgroups, and provider ratings of the effectiveness of GPT-4's suggested debiasing language.
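For illustration only, the sketch below shows one way a note-level bias screen of this kind could be framed with the OpenAI chat API, using the four bias categories named above. The authors' actual prompts, model version, and output schema are not reported in this abstract; the prompt wording, JSON keys, and `screen_note` helper are assumptions.

```python
# Illustrative sketch only: one way a note-level bias screen could be framed
# with the OpenAI chat API. The study's actual prompts, model version, and
# output format are not given in the abstract; everything below is assumed.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

BIAS_CATEGORIES = ["discrediting", "stigmatizing/labeling", "judgmental", "stereotyping"]

SYSTEM_PROMPT = (
    "You review clinical notes for biased language about the patient. "
    "Decide whether the note contains biased language and, if so, which of "
    f"these categories apply: {', '.join(BIAS_CATEGORIES)}. "
    "Reply as JSON with keys 'biased' (true/false), 'categories' (list), "
    "and 'suggested_revision' (a debiased rewrite of any flagged text)."
)

def screen_note(note_text: str) -> str:
    """Return the model's raw judgment for a single de-identified note."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; the exact model/version is an assumption
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": note_text},
        ],
        temperature=0,  # deterministic-leaning output for auditability
    )
    return response.choices[0].message.content
```

In a study-scale pipeline, the returned judgments would be parsed and compared against manual review to compute the operating characteristics reported in the Results.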

Results

Bias was detected in 6.5% of MSHS and 7.4% of MIMIC-IV notes. Compared with manual review, GPT-4 achieved 95% sensitivity, 86% specificity, 84% positive predictive value, and 96% negative predictive value for bias detection. Discrediting (4.0%), stigmatizing/labeling (3.4%), and judgmental (3.2%) biases were the most prevalent categories. Bias prevalence was higher among Black patients (8.3%), transgender patients (15.7% for transgender female patients, 16.7% for transgender male patients), and undomiciled patients (27%). Patients with non-commercial insurance, particularly Medicaid, also had higher bias prevalence (8.9%), as did patients with health-related characteristics such as frequent healthcare utilization (21% for >100 visits) and substance use disorders (32.2%). Physician-authored notes showed more bias than nursing notes (9.4% vs. 4.2%, p < 0.001). GPT-4's suggested revisions were rated highly effective by physicians, with an average bias-reduction improvement score of 9.6 out of 10.
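For reference, these operating characteristics follow the standard definitions for a binary classifier evaluated against a reference standard (here, manual review); the underlying true/false positive and negative counts are not reported in the abstract:

$$
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}, \qquad
\text{PPV} = \frac{TP}{TP + FP}, \qquad
\text{NPV} = \frac{TN}{TN + FN}
$$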

Conclusions and Relevance

A frontier LLM identified biased language effectively without additional training, demonstrating utility as a scalable fairness tool. The higher prevalence of bias associated with certain patient characteristics underscores the need for targeted interventions. Integrating AI to support unbiased documentation could meaningfully affect clinical practice and health outcomes.
