Are LLMs better expert judges? Rethinking content validity assessment in the age of AI


Abstract

In this article, we demonstrate a novel application of large language models (LLMs) as expert judges for item-level content relevance evaluation. Eleven advanced LLMs were evaluated, each treated as a separate expert panel composed of multiple procedurally independent judgments generated via repeated API queries. Their performance was compared with ratings provided by human judges, including psychology students and academic experts. Internal agreement within each panel was assessed using Krippendorff’s alpha and Kendall’s coefficient of concordance. Beyond agreement, theoretically predefined control items representing content-nonrelevant material were used to determine the accuracy of each panel’s ratings. Results indicated that several LLM-based panels (Gemini 3 Pro, GPT-5.2 Pro, Claude Sonnet 4.5, and DeepSeek-V3.2) combined near-perfect internal agreement with high accuracy in identifying nonrelevant items, outperforming the human panels. Even when individual model instances were treated as independent judges, agreement remained high and all control items were correctly identified. These findings demonstrate that selected LLMs can function as highly consistent content judges, particularly in detecting content-nonrelevant items that are conceptually distant from the measured construct. However, given the scarcity of empirical evidence on LLMs as expert judges, their ratings should currently be viewed as complementary to human expertise, and further research is needed to clarify the conditions under which they can be reliably incorporated into procedures for evaluating the content relevance of test items.
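
The abstract names two within-panel agreement coefficients, Krippendorff’s alpha and Kendall’s coefficient of concordance (W), applied to repeated, procedurally independent ratings from each model. The sketch below is not the authors’ code or data: it uses the third-party Python `krippendorff` package and synthetic ratings on a hypothetical 1–4 relevance scale (the judge and item counts are invented) to illustrate how such agreement indices can be computed for one simulated panel, with W derived from rank sums including a tie correction.

```python
# Minimal sketch (not the authors' code): agreement within one simulated
# "panel" of repeated LLM ratings. Assumes the third-party `krippendorff`
# package (pip install krippendorff); all data here are synthetic.
import numpy as np
from scipy.stats import rankdata
import krippendorff

rng = np.random.default_rng(0)

n_judges, n_items = 10, 30                              # hypothetical panel and item pool
true_relevance = rng.integers(1, 5, size=n_items)       # latent 1-4 relevance ratings
noise = rng.integers(-1, 2, size=(n_judges, n_items))   # small rater disagreement
ratings = np.clip(true_relevance + noise, 1, 4)         # judges x items matrix

# Krippendorff's alpha for ordinal data; reliability_data is raters x units
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")

# Kendall's W from rank sums: W = 12*S / (m^2*(n^3 - n) - m*sum(T_j)),
# where S is the sum of squared deviations of the item rank sums and
# T_j = sum(t^3 - t) over groups of tied ranks within judge j.
ranks = np.apply_along_axis(rankdata, 1, ratings)       # rank items within each judge
rank_sums = ranks.sum(axis=0)
S = ((rank_sums - rank_sums.mean()) ** 2).sum()

def tie_correction(row_ranks):
    # sum of (t^3 - t) over tied-rank groups in one judge's ranking
    _, counts = np.unique(row_ranks, return_counts=True)
    return float(np.sum(counts ** 3 - counts))

m, n = n_judges, n_items
T = sum(tie_correction(r) for r in ranks)
W = 12 * S / (m ** 2 * (n ** 3 - n) - m * T)

print(f"Krippendorff's alpha (ordinal): {alpha:.3f}")
print(f"Kendall's W:                    {W:.3f}")
```

The same computation can be repeated per panel (one per model, plus the human panels) to compare internal consistency; accuracy on the predefined content-nonrelevant control items is a separate check not shown here.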
