Large Language Models as Mediators: Addressing Rater Disagreement in Turkish Essay Scoring
Abstract
This study explores the potential of large language models (LLMs) to resolve scoring discrepancies in educational assessments, focusing on a guided writing task from a Turkish language proficiency test. In a single administration, 1824 individuals completed a test comprising four sections: reading, listening, writing, and speaking. The writing section included two tasks, a guided task and an independent task; our research concentrates on the guided task. We identified 50 cases in which two human raters disagreed substantially, assigning scores that differed by at least 2.5 points out of 10. To evaluate LLMs' ability to resolve such rater discrepancies, we used ChatGPT-4 in a zero-shot setup with a rubric-based approach to score these contested essays. ChatGPT's scores were compared with those of a third rater, an experienced human expert who resolved the initial conflicts. By analyzing ChatGPT's performance against this expert benchmark, the study assessed the accuracy, reliability, and potential of LLMs as a tool for standardizing essay scoring, offering insights into their role in improving fairness and consistency in writing assessment.
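The procedure summarized above can be illustrated with a minimal sketch: flag essays whose two human ratings differ by at least 2.5 points out of 10, then rescore them with GPT-4 in a zero-shot, rubric-based setup. This is a hypothetical reconstruction, not the study's implementation; the rubric text, prompt wording, model settings, and the helper names `flag_disagreements` and `score_with_llm` are assumptions introduced for illustration.

```python
# Minimal sketch of the two steps described in the abstract (illustrative only):
# (1) select essays with a rater gap of at least 2.5 points out of 10,
# (2) rescore them zero-shot with a rubric prompt via the OpenAI API.
from openai import OpenAI

DISAGREEMENT_THRESHOLD = 2.5  # minimum rater gap (out of 10) used to flag a case

# Placeholder rubric text; the study's actual rubric is not reproduced here.
RUBRIC = (
    "Score the guided writing task from 0 to 10, considering task fulfilment, "
    "organization, vocabulary range, and grammatical accuracy."
)

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def flag_disagreements(essays):
    """Return essays whose two human ratings differ by at least the threshold."""
    return [
        e for e in essays
        if abs(e["rater1"] - e["rater2"]) >= DISAGREEMENT_THRESHOLD
    ]


def score_with_llm(essay_text, model="gpt-4"):
    """Ask the model for a single numeric score using only the rubric (zero-shot)."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep scoring as deterministic as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {
                "role": "user",
                "content": f"Essay:\n{essay_text}\n\nReturn only the score as a number.",
            },
        ],
    )
    return float(response.choices[0].message.content.strip())


if __name__ == "__main__":
    # Toy records; real essays and ratings would be loaded from the dataset.
    essays = [
        {"id": 1, "text": "...", "rater1": 4.0, "rater2": 7.5},
        {"id": 2, "text": "...", "rater1": 6.0, "rater2": 6.5},
    ]
    for essay in flag_disagreements(essays):
        llm_score = score_with_llm(essay["text"])
        print(essay["id"], llm_score)  # compared offline against the third (expert) rater
```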