Large Language Models as Mediators: Addressing Rater Disagreement in Turkish Essay Scoring
Abstract
This study explores the potential of large language models (LLMs) to resolve scoring discrepancies in educational assessments, focusing on a guided writing task from a Turkish language proficiency test. In a single administration, 1824 individuals completed a test comprising four sections: reading, listening, writing, and speaking. The writing section included two tasks, a guided task and an independent task; our research concentrates on the guided task. We identified 50 cases in which two human raters disagreed substantially, assigning scores that differed by at least 2.5 points out of 10. To evaluate LLMs' ability to resolve such rater discrepancies, we used ChatGPT-4 in a zero-shot setup with a rubric-based approach to score these contested essays. ChatGPT's scores were compared with those of a third rater, an experienced human expert who resolved the initial conflicts. By analyzing ChatGPT's performance against this expert benchmark, the study assessed the accuracy, reliability, and potential of LLMs as a tool for standardizing essay scoring, offering insights into their role in improving fairness and consistency in writing assessment.
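The procedure summarized above can be illustrated with a minimal sketch: flag essays whose two human ratings differ by at least 2.5 points out of 10, then rescore them with GPT-4 in a zero-shot, rubric-based setup. This is a hypothetical reconstruction, not the study's implementation; the rubric text, prompt wording, model settings, and the helper names `flag_disagreements` and `score_with_llm` are assumptions introduced for illustration.

```python
# Minimal sketch of the two steps described in the abstract (illustrative only):
# (1) select essays with a rater gap of at least 2.5 points out of 10,
# (2) rescore them zero-shot with a rubric prompt via the OpenAI API.
from openai import OpenAI

DISAGREEMENT_THRESHOLD = 2.5  # minimum rater gap (out of 10) used to flag a case

# Placeholder rubric text; the study's actual rubric is not reproduced here.
RUBRIC = (
    "Score the guided writing task from 0 to 10, considering task fulfilment, "
    "organization, vocabulary range, and grammatical accuracy."
)

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def flag_disagreements(essays):
    """Return essays whose two human ratings differ by at least the threshold."""
    return [
        e for e in essays
        if abs(e["rater1"] - e["rater2"]) >= DISAGREEMENT_THRESHOLD
    ]


def score_with_llm(essay_text, model="gpt-4"):
    """Ask the model for a single numeric score using only the rubric (zero-shot)."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep scoring as deterministic as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {
                "role": "user",
                "content": f"Essay:\n{essay_text}\n\nReturn only the score as a number.",
            },
        ],
    )
    return float(response.choices[0].message.content.strip())


if __name__ == "__main__":
    # Toy records; real essays and ratings would be loaded from the dataset.
    essays = [
        {"id": 1, "text": "...", "rater1": 4.0, "rater2": 7.5},
        {"id": 2, "text": "...", "rater1": 6.0, "rater2": 6.5},
    ]
    for essay in flag_disagreements(essays):
        llm_score = score_with_llm(essay["text"])
        print(essay["id"], llm_score)  # compared offline against the third (expert) rater
```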