AI is Misled by GenAI: Stylistic Bias in Automated Assessment of Creativity in Large Language Models
Abstract
Outputs from large language models (LLMs) are often rated as highly original yet show low variability compared with human responses, a pattern we refer to as the LLM creativity paradox. However, prior work suggests that assessments of originality and variability may reflect stylistic features of LLM outputs rather than underlying conceptual novelty. The present study investigated this issue using outputs from seven distinct LLMs on a modified Alternative Uses Task. We scored verbatim and "humanized" LLM responses (reworded to reduce verbosity while preserving core ideas) using four automated metrics (the supervised OCSAI and CLAUS models and two unsupervised semantic-distance tools) and compared them with responses from 30 human participants. As expected, verbatim LLM responses were rated as substantially more original than human responses (median d = 2.27 for supervised models and 0.79 for semantic-distance models) but showed markedly lower variability (median d = 0.85). Humanizing the responses strongly decreased originality and weakly increased variability, indicating that part of the LLM creativity paradox is driven by stylistic cues. Nevertheless, even after humanization, the originality scores of LLM responses remained higher (median d = 0.80) and their variability lower (d = 0.57) than those of human responses. These findings suggest that automated assessment tools can be partially misled by the style of LLM outputs, highlighting the need for caution when using automated methods to evaluate machine-generated ideas, particularly in real-world applications such as providing feedback or guiding creative workflows.
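For readers unfamiliar with unsupervised semantic-distance scoring, the sketch below illustrates the general idea: a response to an Alternative Uses Task prompt is rated by how far its embedding lies from the embedding of the prompt object. This is a minimal illustration only, not the study's actual pipeline; the embedding model name and the 1-minus-cosine-similarity scoring rule are assumptions chosen for demonstration.

```python
# Minimal sketch of embedding-based semantic-distance originality scoring.
# NOTE: illustrative only; the specific tools used in the study may differ.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def semantic_distance_score(prompt_object: str, response: str) -> float:
    """Score a response as 1 minus the cosine similarity between the
    embeddings of the prompt object and the response text."""
    embeddings = model.encode([prompt_object, response])
    return 1.0 - float(cosine_similarity(embeddings[:1], embeddings[1:])[0, 0])

# A mundane use typically scores lower (semantically closer to the object)
# than an unusual one.
print(semantic_distance_score("brick", "build a wall"))
print(semantic_distance_score("brick", "grind it into pigment for painting"))
```

Under this kind of metric, longer and more elaborate phrasing can shift a response's embedding away from the prompt object, which is one way stylistic features rather than conceptual novelty could inflate originality scores.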