When Corporate Chatbots Show Bias: A Multi-Dimensional Analysis of LLMs in Enterprise Settings

Abstract

The increasing deployment of Large Language Models (LLMs) in enterprise settings necessitates a thorough understanding of their inherent biases, which can lead to unequal outcomes in multilingual information retrieval, question answering, and language processing. This study presents a systematic evaluation of bias dimensions in five leading LLMs (GPT-4, Claude, Cohere, Mistral, and DeepSeek) within a multilingual enterprise context. Leveraging a controlled Retrieval-Augmented Generation (RAG) pipeline and over 250 real-world queries grounded in corporate documentation, we examined four critical bias types: retrieval bias, reinforcement drift, language bias, and hallucination. Our results reveal distinct model behaviors: Claude demonstrates a strong recency preference and sensitivity to input grammar, while Cohere and GPT-4 are notably susceptible to output drift under repeated queries. Language bias persists across all models, with reduced performance on Dutch and German inputs relative to English, echoing known cross-lingual disparities. Notably, hallucination rates were negligible under RAG, reinforcing its value in grounding responses. These findings underscore the need for robust bias auditing, language-aware deployment strategies, and retrieval-grounded architectures in enterprise LLM applications.
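As a rough illustration only, the sketch below shows one way a controlled RAG evaluation loop of the kind the abstract describes could be wired up: a toy retriever supplies context, each query is repeated to expose output drift, and a simple overlap check stands in for hallucination detection. The model list, the lexical retriever, the echo-style generation stub, and the grounding check are all illustrative assumptions, not the study's actual pipeline.

```python
from collections import defaultdict

MODELS = ["gpt-4", "claude", "cohere", "mistral", "deepseek"]
LANGUAGES = ["en", "nl", "de"]   # English, Dutch, German variants of each query
N_REPEATS = 5                    # repeat each query to surface output drift


def retrieve(query: str, corpus: list[str], top_k: int = 5) -> list[str]:
    """Toy lexical retriever: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    return sorted(corpus, key=lambda doc: len(q_words & set(doc.lower().split())),
                  reverse=True)[:top_k]


def generate(model: str, query: str, passages: list[str]) -> str:
    """Stand-in for a provider API call; here it simply echoes the top passage."""
    return passages[0] if passages else ""


def is_grounded(answer: str, passages: list[str]) -> bool:
    """Crude grounding check: every sentence of the answer shares words with the context."""
    context_words = set(" ".join(passages).lower().split())
    sentences = [s for s in answer.split(".") if s.strip()]
    return all(set(s.lower().split()) & context_words for s in sentences)


def run_benchmark(queries: dict[str, list[str]], corpus: list[str]) -> dict:
    """Collect drift and grounding signals per (model, language) pair."""
    results = defaultdict(list)
    for model in MODELS:
        for lang in LANGUAGES:
            for query in queries.get(lang, []):
                passages = retrieve(query, corpus)  # which documents surface (retrieval bias)
                answers = [generate(model, query, passages) for _ in range(N_REPEATS)]
                results[(model, lang)].append({
                    "drift": len(set(answers)) > 1,  # reinforcement drift proxy
                    "grounded": all(is_grounded(a, passages) for a in answers),  # hallucination proxy
                })
    return dict(results)
```

In practice each signal would come from a stronger detector (a real retriever, semantic similarity for drift, an entailment or claim-verification model for grounding); the sketch only conveys how per-model, per-language results could be aggregated over the query set.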
