A Comparative Analysis to Evaluate Bias and Fairness Across Large Language Models with Benchmarks

Abstract

This study presents a comprehensive evaluation of bias and fairness in Large Language Models (LLMs), including ChatGPT-4, Google Gemini, and Llama 2, using the Google BIG-Bench benchmark. Our analysis reveals varying levels of bias across models, with disparities particularly notable in dimensions such as gender, race, and ethnicity. The BIG-Bench benchmark proved instrumental in identifying these biases, though its effectiveness is limited by the difficulty of capturing the more nuanced manifestations of bias that emerge in real-world contexts. Comparative performance analysis indicates that while each model exhibits strengths in certain areas, no single model uniformly excels across all fairness and bias metrics. The study underscores the intricate balance between model performance, fairness, and efficiency, and highlights the need for ongoing research and development in AI ethics to mitigate bias effectively. These insights argue for a multifaceted approach to AI development that integrates ethical considerations at every stage to ensure the equitable advancement of the technology. The findings call for continued innovation in model training and benchmarking methodologies to enhance the fairness and inclusivity of future LLMs.
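As a rough illustration of the kind of cross-model comparison the abstract describes, the sketch below computes a simple max-min disparity gap from per-group accuracies on a bias-oriented benchmark task. This is not the authors' code: the model names, group labels, and scores are placeholders, and the disparity metric is one common choice among many, assumed here for clarity.

```python
# Illustrative sketch only: quantifying per-group disparity for two
# hypothetical models on one bias-oriented benchmark task.
# All names and numbers below are placeholders, not results from the study.

from statistics import mean

# Hypothetical per-demographic-group accuracies for each model.
per_group_accuracy = {
    "model_a": {"group_1": 0.82, "group_2": 0.74, "group_3": 0.79},
    "model_b": {"group_1": 0.77, "group_2": 0.76, "group_3": 0.75},
}

def disparity_gap(group_scores: dict[str, float]) -> float:
    """Max-min accuracy gap across groups; 0 means all groups score equally."""
    return max(group_scores.values()) - min(group_scores.values())

for model, scores in per_group_accuracy.items():
    print(f"{model}: mean accuracy = {mean(scores.values()):.3f}, "
          f"disparity gap = {disparity_gap(scores):.3f}")
```

In this toy example, model_a would score higher on average but show a larger gap between groups, while model_b would be more uniform; that trade-off mirrors the abstract's observation that no single model uniformly excels across all fairness and performance metrics.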
