Large Language Model Biases in Healthcare: A Scoping Review and Call for an Integrated Assessment Framework
Abstract
Objective: To conduct a scoping review of bias assessment in studies applying Large Language Models (LLMs) to health data and to synthesize their prevailing conceptualizations of bias.

Material and methods: Following PRISMA guidelines, we queried PubMed and Scopus. Two annotators screened titles, abstracts, and full texts for eligibility, calibrating their assessments throughout the process. For included studies, we extracted and summarized data on the LLMs (name and version, development domain, open- or closed-source status, and commercial or academic origin), NLP tasks (task formulation, gold-standard dataset, evaluation metrics, and prompting or fine-tuning strategies), and biases (type, assessment method, and bias summary).

Results: Of the 1,585 records retrieved, 76 papers met the eligibility criteria for full review. Of these, 59 reported identifying bias. Three major conceptualizations of bias emerged: behavioral output bias (non-stereotyping and stereotyping), predictive outcome bias, and representational bias. Studies generally adopted either an observational approach (measuring bias within an existing dataset) or an experimental approach (altering prompts, e.g., by varying demographic information, and comparing outputs).

Discussion and Conclusion: Behavioral output bias and predictive outcome bias, both of which emphasize parity, dominate existing studies. Whether evaluated against external accuracy or internal equality benchmarks, these approaches often assume that equal performance across groups is inherently desirable. Treating all disparities as bias risks conflating poor model behavior with real-world disparities, and researchers should remain aware of potential trade-offs between parity and accuracy objectives. We introduce an integrated framework that combines parity and accuracy benchmarks and encourages transparent, context-aware interpretation of group differences.
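To make the experimental probing approach and the parity-versus-accuracy distinction above concrete, the following minimal sketch (ours, not drawn from any reviewed study) perturbs only the demographic descriptor in an otherwise identical clinical prompt and reports both a per-group accuracy metric and a simple parity gap. The `query_llm` function, the vignette, and the gold-standard label are all hypothetical placeholders, with a random mock standing in for a real model call.

```python
import random

# Hypothetical stand-in for a real chat-completion call; swap in an
# actual API client (e.g., an OpenAI or local model wrapper) in practice.
def query_llm(prompt: str) -> str:
    return random.choice(["Yes", "No"])  # mock output so the sketch runs

# Illustrative vignette; the demographic descriptor is the only element
# that varies across prompts (a counterfactual perturbation).
VIGNETTE = (
    "A 54-year-old {group} patient presents with chest pain radiating "
    "to the left arm. Should a cardiac workup be ordered? Answer Yes or No."
)
GROUPS = ["white male", "white female", "Black male", "Black female"]
GOLD_LABEL = "yes"   # assumed gold-standard answer for this vignette
N_TRIALS = 20        # repeated sampling to average over decoding noise

def probe_bias() -> None:
    per_group_rate: dict[str, float] = {}
    for group in GROUPS:
        answers = [
            query_llm(VIGNETTE.format(group=group)).strip().lower()
            for _ in range(N_TRIALS)
        ]
        per_group_rate[group] = (
            sum(a.startswith(GOLD_LABEL) for a in answers) / N_TRIALS
        )

    # Accuracy benchmark: each group's agreement with the gold standard.
    for group, rate in per_group_rate.items():
        print(f"{group}: workup recommended in {rate:.0%} of trials")

    # Parity benchmark: largest between-group gap in recommendation rate.
    gap = max(per_group_rate.values()) - min(per_group_rate.values())
    print(f"max between-group gap (parity metric): {gap:.0%}")

if __name__ == "__main__":
    probe_bias()
```

A nonzero gap in this sketch could reflect biased model behavior, but, as noted above, it could also mirror real-world disparities; the integrated framework calls for interpreting such group differences in context rather than treating every parity violation as bias by default.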