Can LLMs evaluate items measuring collaborative problem-solving?

Abstract

Collaborative problem-solving (CPS) is a vital skill for students to learn, but designing CPS assessments is challenging due to the construct's complexity. Advances in the capabilities of large language models (LLMs) have the potential to aid the design and evaluation of CPS items. In this study, we tested whether six LLMs agree with human judges on the quality of items measuring CPS. GPT-4 was consistently the best-performing model, with an overall accuracy of .77 (κ = .53). GPT-4 performed best with zero-shot prompts, while the other models benefited only marginally from more complex prompting strategies (few-shot, chain-of-thought). This work highlights challenges in using LLMs for assessment and proposes future research directions on the utility of LLMs for assessment design.
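
The accuracy and kappa figures summarize agreement between a model's item-quality judgments and the human ratings. As a rough illustration only (not the authors' analysis code; the labels and function names below are invented), agreement accuracy and Cohen's kappa can be computed from paired ratings as follows:

```python
# Illustrative sketch: scoring agreement between an LLM's item-quality
# judgments and human judges. All data and names are hypothetical.

from collections import Counter

def accuracy(llm_labels, human_labels):
    """Proportion of items on which the LLM and the human judges agree."""
    matches = sum(l == h for l, h in zip(llm_labels, human_labels))
    return matches / len(human_labels)

def cohens_kappa(llm_labels, human_labels):
    """Cohen's kappa: chance-corrected agreement, (p_o - p_e) / (1 - p_e)."""
    n = len(human_labels)
    p_o = accuracy(llm_labels, human_labels)  # observed agreement
    llm_counts = Counter(llm_labels)
    human_counts = Counter(human_labels)
    # Expected chance agreement from each rater's marginal label frequencies
    p_e = sum(
        (llm_counts[c] / n) * (human_counts[c] / n)
        for c in set(llm_labels) | set(human_labels)
    )
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of CPS items as acceptable (1) or flawed (0)
human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
llm   = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

print(f"accuracy = {accuracy(llm, human):.2f}")      # 0.80
print(f"kappa    = {cohens_kappa(llm, human):.2f}")  # 0.58
```

In this framing, accuracy alone can overstate performance when one label dominates; kappa corrects for the agreement expected by chance, which is why the study reports both.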
