Evaluating Personality Traits of Large Language Models Through Scenario-based Interpretive Benchmarking
Abstract
The assessment of Large Language Models (LLMs) has traditionally focused on performance metrics tied directly to their task-solving capabilities. This paper introduces a novel benchmark explicitly designed to measure personality traits in LLMs through scenario-based interpretive prompts. We detail the methodology behind this benchmark, in which LLMs are presented with structured prompts inspired by psychological scenarios and their responses are scored by a judge LLM. The evaluation covers traits such as emotional stability, creativity, adaptability, and anxiety levels, among others. Consistency of the assigned scores across different judge models is assessed through consensus analysis. Anecdotal observations on score validity and on the orthogonality of the scores with conventional performance metrics are discussed. Results, implementation scripts, and updated leaderboards are publicly accessible at https://github.com/fit-alessandro-berti/llm-dreams-benchmark.
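As a rough illustration of the pipeline sketched in the abstract, the following Python snippet shows how scenario prompts, a target LLM, and several judge LLMs could be wired together to produce per-trait scores. The trait list, rating scale, prompt wording, and the `target`/`judge` callables are hypothetical placeholders, not the benchmark's actual configuration; the authors' implementation scripts are in the linked repository.

```python
# Minimal sketch of a scenario-based, judge-scored personality evaluation.
# All names below (TRAITS, score_response, evaluate_model, target, judges)
# are illustrative assumptions, not the benchmark's real API.
from statistics import mean
from typing import Callable, Dict, List

TRAITS = ["emotional stability", "creativity", "adaptability", "anxiety"]

def score_response(scenario: str, response: str, trait: str,
                   judge: Callable[[str], str]) -> float:
    """Ask one judge LLM to rate one response on one trait (1-10 scale)."""
    prompt = (
        f"Scenario: {scenario}\n"
        f"Model response: {response}\n"
        f"On a scale from 1 to 10, rate the {trait} expressed in the response. "
        f"Answer with a single number."
    )
    return float(judge(prompt).strip())

def evaluate_model(scenarios: List[str],
                   target: Callable[[str], str],
                   judges: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Average each trait score over all scenarios and all judge LLMs."""
    per_trait: Dict[str, List[float]] = {t: [] for t in TRAITS}
    for scenario in scenarios:
        response = target(scenario)          # response of the evaluated LLM
        for trait in TRAITS:
            for judge in judges.values():    # consensus across judge models
                per_trait[trait].append(
                    score_response(scenario, response, trait, judge)
                )
    return {trait: mean(scores) for trait, scores in per_trait.items()}
```

Averaging over multiple judges, as in `evaluate_model`, is one simple way to realize the consensus analysis mentioned above; the repository should be consulted for the actual aggregation used.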