Evaluating gender bias in Large Language Models in long-term care
Abstract
Background: Large language models (LLMs) are being used to reduce the administrative burden in long-term care by automatically generating and summarising case notes. However, LLMs can reproduce bias present in their training data. This study evaluates gender bias in summaries of long-term care records generated with two state-of-the-art, open-source LLMs released in 2024: Meta's Llama 3 and Google's Gemma.

Methods: Gender-swapped versions of long-term care records for 617 older people from a London local authority were created. Summaries of the male and female versions were generated with Llama 3 and Gemma, as well as with two benchmark models from Meta and Google released in 2019: T5 and BART. Linguistic and inclusion bias were quantified through sentiment analysis and the frequency of words and themes.

Results: The benchmark models exhibited some variation in output on the basis of gender. Llama 3 showed no gender-based differences across any metric. Gemma displayed the most significant gender-based differences: summaries of male records focused more on physical and mental health issues, language used for men was more direct, and women's needs were downplayed more often than men's.

Conclusions: Care services are allocated on the basis of need. If women's health issues are underemphasised, this may lead to gender-based disparities in service receipt. LLMs may offer substantial benefits in easing administrative burden. However, the findings highlight the variation between state-of-the-art LLMs and the need to evaluate them for bias. Bias across gender and other protected characteristics should be evaluated in LLMs used in long-term care. The methods in this paper provide a practical framework for such evaluations. The code is available on GitHub.
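The sketch below is a minimal illustration of the gender-swap-and-compare design outlined in the Methods, not the authors' released code. It creates a gender-swapped counterpart of a care note, summarises both versions, and compares the sentiment of the outputs. The swap dictionary, the example record, and the model checkpoints (a BART summariser and the default sentiment classifier from Hugging Face transformers) are assumptions chosen for demonstration.

```python
"""Illustrative gender-swap evaluation sketch (not the study's released code)."""
import re
from transformers import pipeline

# Minimal pronoun/term swap table. A real evaluation would need a far more
# careful, context-aware gender-swapping procedure than this crude mapping.
SWAP = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "his": "her", "hers": "his",
    "mr": "mrs", "mrs": "mr",
    "man": "woman", "woman": "man",
}

def gender_swap(text: str) -> str:
    """Replace gendered tokens using the SWAP table, roughly preserving case."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = SWAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    pattern = r"\b(" + "|".join(SWAP) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

# BART was one of the benchmark summarisers in the study; this specific
# checkpoint and the default sentiment model are assumptions.
summariser = pipeline("summarization", model="facebook/bart-large-cnn")
sentiment = pipeline("sentiment-analysis")

# Hypothetical care record used only for illustration.
record_male = (
    "Mr Smith is an 84-year-old man. He has limited mobility and his "
    "memory has declined. He struggles to manage his medication alone."
)
record_female = gender_swap(record_male)

# Summarise both versions and compare sentiment of the generated summaries.
for label, record in [("male", record_male), ("female", record_female)]:
    summary = summariser(record, max_length=40, min_length=10)[0]["summary_text"]
    score = sentiment(summary)[0]
    print(f"{label}: {summary!r} -> {score['label']} ({score['score']:.2f})")
```

In the study itself, this comparison is run over hundreds of records and across several models, with word- and theme-frequency analysis alongside sentiment; the point of the sketch is only to show the paired male/female structure on which those metrics are computed.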