AI is Misled by GenAI: Stylistic Bias in Automated Assessment of Creativity in Large Language Models
Abstract
Outputs from large language models (LLMs) are often rated as highly original yet show low variability compared with human responses, a pattern we refer to as the LLM creativity paradox. However, prior work suggests that assessments of originality and variability may reflect stylistic features of LLM outputs rather than underlying conceptual novelty. The present study investigated this issue using outputs from seven distinct LLMs on a modified Alternative Uses Task. We scored verbatim and "humanized" LLM responses (reworded to reduce verbosity while preserving core ideas) using four automated metrics (the supervised OCSAI and CLAUS models and two unsupervised semantic-distance tools) and compared them with responses from 30 human participants. As expected, verbatim LLM responses were rated as substantially more original than human responses (median d = 2.27 for supervised models and 0.79 for semantic-distance models) but showed markedly lower variability (median d = 0.85). Humanizing the responses strongly decreased originality and weakly increased variability, indicating that part of the LLM creativity paradox is driven by stylistic cues. Nevertheless, even after humanization, the originality scores of LLM responses remained higher (median d = 0.80) and their variability lower (d = 0.57) than those of human responses. These findings suggest that automated assessment tools can be partially misled by the style of LLM outputs, highlighting the need for caution when using automated methods to evaluate machine-generated ideas, particularly in real-world applications such as providing feedback or guiding creative workflows.
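For readers unfamiliar with unsupervised semantic-distance scoring, the sketch below illustrates the general idea: a response to an Alternative Uses Task prompt is rated by how far its embedding lies from the embedding of the prompt object. This is a minimal illustration only, not the study's actual pipeline; the embedding model name and the 1-minus-cosine-similarity scoring rule are assumptions chosen for demonstration.

```python
# Minimal sketch of embedding-based semantic-distance originality scoring.
# NOTE: illustrative only; the specific tools used in the study may differ.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def semantic_distance_score(prompt_object: str, response: str) -> float:
    """Score a response as 1 minus the cosine similarity between the
    embeddings of the prompt object and the response text."""
    embeddings = model.encode([prompt_object, response])
    return 1.0 - float(cosine_similarity(embeddings[:1], embeddings[1:])[0, 0])

# A mundane use typically scores lower (semantically closer to the object)
# than an unusual one.
print(semantic_distance_score("brick", "build a wall"))
print(semantic_distance_score("brick", "grind it into pigment for painting"))
```

Under this kind of metric, longer and more elaborate phrasing can shift a response's embedding away from the prompt object, which is one way stylistic features rather than conceptual novelty could inflate originality scores.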