Evaluating Sycophancy in Frontier Models Using Persona-Driven Challenge

Abstract

Large language models (LLMs) are increasingly used for lay health queries, yet they may abandon correct recommendations under pressure, a vulnerability termed sycophancy. We evaluated sycophancy across five frontier LLMs (Claude Opus 4.6, Claude Sonnet 4.6, GPT 5.4, Grok 4.1, Gemini 3 Flash) using 200 synthetic clinical vignettes, each anchored to a unanimous correct treatment baseline and challenged by nine personas representing both vulnerable and authority roles. Overall, 7.1% of responses were sycophantic, varying tenfold across personas (1.7% to 19.3%) and sixfold across LLMs (2.4% to 15.3%). Vulnerable personas elicited more sycophantic responses, with the medical student persona eliciting the highest rate (19.3%). In adjusted Generalized Estimating Equations (GEE) models, persona and LLM were both independent predictors of sycophantic responses, and vulnerable personas remained independent predictors, a reversal of the expected authority gradient. Persona-driven sycophancy evaluation should be integrated into pre-deployment safety assessment of clinical LLMs.
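
The descriptive figures in the abstract (a sycophancy rate per persona or per LLM, and the fold difference between the highest and lowest rate) can be illustrated with a minimal sketch. It assumes each response is stored as a record with `persona`, `llm`, and `sycophantic` fields; those field names, the helper functions, the second persona label, and the toy records are all hypothetical and are not taken from the study. It does not reproduce the adjusted GEE analysis.

```python
from collections import defaultdict


def sycophancy_rates(records, key):
    """Return {group: fraction of sycophantic responses}, grouped by `key`."""
    counts = defaultdict(lambda: [0, 0])  # group -> [sycophantic, total]
    for rec in records:
        tally = counts[rec[key]]
        tally[0] += int(rec["sycophantic"])
        tally[1] += 1
    return {group: syc / total for group, (syc, total) in counts.items()}


def fold_range(rates):
    """Ratio of the highest to the lowest group rate (lowest must be nonzero)."""
    return max(rates.values()) / min(rates.values())


# Hypothetical toy records for illustration only, not data from the study.
toy = [
    {"persona": "medical student", "llm": "model_a", "sycophantic": True},
    {"persona": "medical student", "llm": "model_b", "sycophantic": True},
    {"persona": "medical student", "llm": "model_a", "sycophantic": False},
    {"persona": "medical student", "llm": "model_b", "sycophantic": False},
    {"persona": "authority persona", "llm": "model_a", "sycophantic": True},
    {"persona": "authority persona", "llm": "model_b", "sycophantic": False},
    {"persona": "authority persona", "llm": "model_a", "sycophantic": False},
    {"persona": "authority persona", "llm": "model_b", "sycophantic": False},
]

persona_rates = sycophancy_rates(toy, "persona")
llm_rates = sycophancy_rates(toy, "llm")
persona_fold = fold_range(persona_rates)
```

Grouping by `"llm"` instead of `"persona"` gives the per-model rates in the same way; the adjusted comparison reported in the abstract would additionally need a model that accounts for repeated measures on the same vignette.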
