Evaluating gender bias in large language models in long-term care

Sam Rickman

Abstract

Background

Large language models (LLMs) are being used to reduce the administrative burden in long-term care by automatically generating and summarising case notes. However, LLMs can reproduce biases present in their training data. This study evaluates gender bias in summaries of long-term care records generated with two state-of-the-art, open-source LLMs released in 2024: Meta’s Llama 3 and Google’s Gemma.

Methods

Gender-swapped versions were created of long-term care records for 617 older people from a London local authority. Summaries of the male and female versions were generated with Llama 3 and Gemma, as well as with two benchmark models released in 2019: Google’s T5 and Meta’s BART. Counterfactual bias was quantified through sentiment analysis alongside an evaluation of word frequency and thematic patterns.
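The counterfactual setup described above can be illustrated with a minimal Python sketch. It is not the study's code (which the authors publish on GitHub): the swap table, the function names `swap_male_to_female` and `term_count_gap`, and the example sentences are all assumptions made for illustration. The sketch covers only the gender-swapping and word-frequency steps; the summarisation and sentiment-analysis steps are omitted because the abstract does not specify how they were run.

```python
import re
from collections import Counter

# Hypothetical, deliberately small swap table. Only the male-to-female
# direction is shown, because the reverse is ambiguous ("her" can map to
# "him" or "his"). A real pipeline would also need to handle names and titles.
MALE_TO_FEMALE = {
    "he": "she",
    "him": "her",
    "his": "her",
    "himself": "herself",
    "mr": "mrs",
    "man": "woman",
}

_SWAP_PATTERN = re.compile(
    r"\b(" + "|".join(MALE_TO_FEMALE) + r")\b", re.IGNORECASE
)


def swap_male_to_female(text: str) -> str:
    """Return a copy of `text` with male terms replaced by female terms."""

    def replace(match: re.Match) -> str:
        word = match.group(0)
        swapped = MALE_TO_FEMALE[word.lower()]
        # Keep the capitalisation of the first letter.
        return swapped.capitalize() if word[0].isupper() else swapped

    return _SWAP_PATTERN.sub(replace, text)


def term_count_gap(male_summaries, female_summaries, terms):
    """For each term, return (count in male summaries) - (count in female summaries)."""

    def count_words(summaries):
        counts = Counter()
        for summary in summaries:
            counts.update(re.findall(r"[a-z']+", summary.lower()))
        return counts

    male_counts = count_words(male_summaries)
    female_counts = count_words(female_summaries)
    return {term: male_counts[term] - female_counts[term] for term in terms}


# Invented example text, not taken from any care record.
record = "Mr Smith says he cannot wash himself. His son visits him."
print(swap_male_to_female(record))
```

A positive value from `term_count_gap` means the term appears more often in the summaries of male versions than in the summaries of the paired female versions. Because the two versions of each record differ only in gender, any systematic gap can be attributed to the model rather than to the underlying case.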

Results

The benchmark models exhibited some variation in output on the basis of gender. Llama 3 showed no gender-based differences across any metrics. Gemma displayed the most significant gender-based differences. Summaries of the male versions focused more on physical and mental health issues. Language used for men was more direct, and women’s needs were downplayed more often than men’s.

Conclusion

Care services are allocated on the basis of need. If women’s health issues are underemphasised, this may lead to gender-based disparities in service receipt. LLMs may offer substantial benefits in easing administrative burden. However, the findings highlight the variation in state-of-the-art LLMs, and the need for evaluation of bias. The methods in this paper provide a practical framework for quantitative evaluation of gender bias in LLMs. The code is available on GitHub.

Version published to 10.1186/s12911-025-03118-0
Aug 11, 2025
Version published to 10.21203/rs.3.rs-5166499/v3 on Research Square
Jul 9, 2025
Version published to 10.21203/rs.3.rs-5166499/v2 on Research Square
Oct 24, 2024
Version published to 10.21203/rs.3.rs-5166499/v1 on Research Square
Oct 15, 2024

Related articles

An Evaluation Framework for Dialectal Sentiment Classification and Linguistic Phenomena in Large Language Models

This article has 5 authors:
1. Tarek Rashed
2. Ramadan Alfared
3. Abduelbaset Goweder
4. Husien Alhammi
5. Abubaker Kashada
This article has no evaluations. Latest version: Dec 24, 2025

Integrating Explainability for Sentiment Interpretation, Misclassification, and Bias Detection in Women-in-STEM Social Media

This article has 2 authors:
1. Shereen Fouad
2. Ezzaldin Alkooheji
This article has no evaluations. Latest version: Jan 12, 2026

Large Language Model Biases in Healthcare: A Scoping Review and Call for an Integrated Assessment Framework

This article has 8 authors:
1. Lu He
2. D. Phuong Do
3. Vishesh Girish Shet
4. Omar Farghaly
5. Priya Deshpande
6. Praveen Madiraju
7. Jiancheng Ye
8. Molly Beestrum
This article has no evaluations. Latest version: Jan 16, 2026
