Investigating the Validity Evidence of Automated Scoring Methods for Divergent Thinking Assessments

Abstract

Divergent thinking (DT) ability is a fundamental aspect of creativity, but its assessment remains challenging due to the reliance on effortful human ratings and persistent uncertainty about how to aggregate scores across a variable number of responses. Recent work has demonstrated that automated scoring based on large language models (LLMs) can predict human creativity ratings. Other research has evaluated the psychometric quality of different response aggregation methods for human ratings by comparing their concurrent criterion validity with respect to external criteria. The present study integrates these two lines of work and investigates the criterion validity evidence of automated creativity scores derived from five LLMs (CLAUS, OCSAI, GPT-4, Llama 3.3, and Claude 3.5) under different aggregation methods. Instead of merely relating LLM-based ratings to human ratings, we compared the validity evidence of rater-based and LLM-based scores, which opens up the possibility that automated scoring could prove even more valid. Analyses were based on data from 300 participants who completed five alternate uses tasks. Findings showed that general-purpose LLMs yielded criterion validity evidence equal to, or even slightly higher than, that of human ratings, especially when max-3 scoring was used. These results suggest that automated DT scoring can serve as a psychometrically sound alternative to rater-based scoring.
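
To make the aggregation idea concrete, the following minimal Python sketch illustrates max-n aggregation of per-response creativity ratings. It assumes the common interpretation of "max-3" as the mean of a participant's three highest-rated responses per task, averaged across tasks; this interpretation and all variable names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the study's code): person-level DT scores
# from per-response creativity ratings using max-n aggregation.
from statistics import mean

def max_n_score(response_ratings, n=3):
    """Average of the n highest-rated responses for one task
    (assumed reading of 'max-3' scoring)."""
    top = sorted(response_ratings, reverse=True)[:n]
    return mean(top) if top else float("nan")

def participant_score(task_ratings, n=3):
    """Mean of the max-n task scores across all completed tasks."""
    return mean(max_n_score(r, n) for r in task_ratings)

# Hypothetical example: ratings for two alternate uses tasks
# with a variable number of responses per task.
ratings = [
    [2.1, 3.4, 1.8, 4.0, 2.9],  # task 1: five responses
    [3.2, 2.7, 3.9],            # task 2: three responses
]
print(participant_score(ratings))  # mean of per-task top-3 averages
```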
