Bias in Large Language Models for Mental Health: Evidence from Vignette-Based Evaluation Across Nine Models

Abstract

With the increasing use of large language models (LLMs) for mental health needs and reports of inappropriate and biased responses, it is important to identify determinants of bias in LLM reasoning and responses. This study evaluated LLM-generated responses to mental health vignettes that varied in the severity and nature of symptoms, across 10 social questions such as comfort level with the person as a work colleague and perceived propensity for violence. Nine LLMs (Deepseek, Gemini, Gemma, GPT-3.5, GPT-4, GPT-4o, LLaMA, Microsoft, StabilityAI) were assessed using automated metrics, including BERTScore F1 and ROUGE-L, to quantify divergence from human expert-generated responses (degree of bias). Analyses showed that the models produced responses that differed lexically and semantically from the expert responses. Moreover, significant interaction effects of symptom type and severity across LLMs and types of social questions indicated weaker concordance between LLM and human-expert reasoning for specific symptom-severity combinations, suggesting potential biases and differential generalization. Research and clinical implications, including the importance of human expert oversight throughout the development and application of LLMs for mental health use, are discussed.
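For readers unfamiliar with the two automated metrics named above, the sketch below illustrates how divergence between an LLM response and a human expert-generated reference could be computed. It assumes the open-source bert-score and rouge-score Python packages, and the variable and function names are illustrative; it is not the authors' evaluation code.

```python
# Minimal sketch: compare one LLM response with one human expert-generated response.
# Assumes `pip install bert-score rouge-score`; names are hypothetical.
from bert_score import score as bert_score
from rouge_score import rouge_scorer


def divergence_from_expert(model_response: str, expert_response: str) -> dict:
    """Return semantic similarity (BERTScore F1) and lexical overlap (ROUGE-L F1).

    Lower values indicate greater divergence from the expert reference.
    """
    # BERTScore: contextual-embedding similarity between candidate and reference.
    _, _, f1 = bert_score([model_response], [expert_response], lang="en")

    # ROUGE-L: longest-common-subsequence overlap between candidate and reference.
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(expert_response, model_response)["rougeL"].fmeasure

    return {"bertscore_f1": f1.item(), "rouge_l_f1": rouge_l}


# Example usage with placeholder texts:
# scores = divergence_from_expert(llm_answer_to_vignette, expert_answer_to_vignette)
```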
