Investigating the Validity Evidence of Automated Scoring Methods for Divergent Thinking Assessments

Abstract

Divergent thinking (DT) ability is a fundamental aspect of creativity, but its assessment remains challenging due to the reliance on effortful human ratings and persistent uncertainty about how to aggregate scores across a variable number of responses. Recent work has demonstrated that automated scoring based on large language models (LLMs) can predict human creativity ratings. Other research has evaluated the psychometric quality of different response aggregation methods for human ratings by comparing their concurrent criterion validity with respect to external criteria. The present study integrates these two lines of work and investigates the criterion validity evidence of automated creativity scores derived from five LLMs (CLAUS, OCSAI, GPT-4, Llama 3.3, and Claude 3.5) under different aggregation methods. Instead of merely relating LLM-based ratings to human ratings, we compared the validity evidence of rater-based and LLM-based scores, which opens up the possibility that automated scoring could prove even more valid. Analyses were based on data from 300 participants who completed five alternate uses tasks. Findings showed that general-purpose LLMs yielded criterion validity evidence equal to, or even slightly higher than, that of human ratings, especially when max-3 scoring was used. These results suggest that automated DT scoring can serve as a psychometrically sound alternative to rater-based scoring.
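
To make the aggregation idea concrete, the following minimal Python sketch illustrates max-n aggregation of per-response creativity ratings. It assumes the common interpretation of "max-3" as the mean of a participant's three highest-rated responses per task, averaged across tasks; this interpretation and all variable names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the study's code): person-level DT scores
# from per-response creativity ratings using max-n aggregation.
from statistics import mean

def max_n_score(response_ratings, n=3):
    """Average of the n highest-rated responses for one task
    (assumed reading of 'max-3' scoring)."""
    top = sorted(response_ratings, reverse=True)[:n]
    return mean(top) if top else float("nan")

def participant_score(task_ratings, n=3):
    """Mean of the max-n task scores across all completed tasks."""
    return mean(max_n_score(r, n) for r in task_ratings)

# Hypothetical example: ratings for two alternate uses tasks
# with a variable number of responses per task.
ratings = [
    [2.1, 3.4, 1.8, 4.0, 2.9],  # task 1: five responses
    [3.2, 2.7, 3.9],            # task 2: three responses
]
print(participant_score(ratings))  # mean of per-task top-3 averages
```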
