Evaluating gender bias in Large Language Models in long-term care
Abstract
Background: Large language models (LLMs) are being used to reduce the administrative burden in long-term care by automatically generating and summarising case notes. However, LLMs can reproduce bias present in their training data. This study evaluates gender bias in summaries of long-term care records generated with two state-of-the-art, open-source LLMs released in 2024: Meta's Llama 3 and Google's Gemma.

Methods: Gender-swapped versions of long-term care records for 617 older people from a London local authority were created. Summaries of the male and female versions were generated with Llama 3 and Gemma, as well as with two benchmark models from Meta and Google released in 2019: T5 and BART. Linguistic and inclusion bias were quantified through sentiment analysis and the frequency of words and themes.

Results: The benchmark models exhibited some variation in output on the basis of gender. Llama 3 showed no gender-based differences across any metric. Gemma displayed the most significant gender-based differences: summaries of male records focused more on physical and mental health issues, language used for men was more direct, and women's needs were downplayed more often than men's.

Conclusions: Care services are allocated on the basis of need. If women's health issues are underemphasised, this may lead to gender-based disparities in service receipt. LLMs may offer substantial benefits in easing administrative burden. However, the findings highlight the variation between state-of-the-art LLMs and the need to evaluate them for bias. Bias across gender and other protected characteristics should be evaluated in LLMs used in long-term care. The methods in this paper provide a practical framework for such evaluations. The code is available on GitHub.
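The sketch below is a minimal illustration of the gender-swap-and-compare design outlined in the Methods, not the authors' released code. It creates a gender-swapped counterpart of a care note, summarises both versions, and compares the sentiment of the outputs. The swap dictionary, the example record, and the model checkpoints (a BART summariser and the default sentiment classifier from Hugging Face transformers) are assumptions chosen for demonstration.

```python
"""Illustrative gender-swap evaluation sketch (not the study's released code)."""
import re
from transformers import pipeline

# Minimal pronoun/term swap table. A real evaluation would need a far more
# careful, context-aware gender-swapping procedure than this crude mapping.
SWAP = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "his": "her", "hers": "his",
    "mr": "mrs", "mrs": "mr",
    "man": "woman", "woman": "man",
}

def gender_swap(text: str) -> str:
    """Replace gendered tokens using the SWAP table, roughly preserving case."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = SWAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    pattern = r"\b(" + "|".join(SWAP) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

# BART was one of the benchmark summarisers in the study; this specific
# checkpoint and the default sentiment model are assumptions.
summariser = pipeline("summarization", model="facebook/bart-large-cnn")
sentiment = pipeline("sentiment-analysis")

# Hypothetical care record used only for illustration.
record_male = (
    "Mr Smith is an 84-year-old man. He has limited mobility and his "
    "memory has declined. He struggles to manage his medication alone."
)
record_female = gender_swap(record_male)

# Summarise both versions and compare sentiment of the generated summaries.
for label, record in [("male", record_male), ("female", record_female)]:
    summary = summariser(record, max_length=40, min_length=10)[0]["summary_text"]
    score = sentiment(summary)[0]
    print(f"{label}: {summary!r} -> {score['label']} ({score['score']:.2f})")
```

In the study itself, this comparison is run over hundreds of records and across several models, with word- and theme-frequency analysis alongside sentiment; the point of the sketch is only to show the paired male/female structure on which those metrics are computed.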