Beyond BLEU: GPT-5, Human Judgment, and Classroom Validation for Multidimensional Machine Translation Evaluation
Abstract
This paper investigates the use of large language models (LLMs) as evaluators in multidimensional machine translation (MT) assessment, focusing on the English–Indonesian language pair. Building on established evaluation frameworks, we adopt a rubric aligned with the Multidimensional Quality Metrics (MQM) framework that assesses translation quality along morphosyntactic, semantic, and pragmatic dimensions. Three LLM-based translation systems, Qwen 3 (0.6B), LLaMA 3.2 (3B), and Gemma 3 (1B), are evaluated using both expert human judgments and an LLM-based evaluator (GPT-5), allowing for a detailed comparison of alignment, bias, and consistency between human and AI-based assessments. In addition, a classroom calibration study is conducted to examine how rubric-guided evaluation supports alignment among novice evaluators. The results indicate that GPT-5 exhibits strong agreement with human evaluators in terms of relative quality ranking, while systematic differences in absolute scoring highlight calibration challenges. Overall, this study provides insights into the role of LLMs as reference-free evaluators for MT and illustrates how multidimensional rubrics can support both research-oriented evaluation and pedagogical applications in a mid-resource language setting.
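To make the distinction between ranking agreement and score calibration concrete, the sketch below (illustrative only, not code from the paper; all score values and the 0–100 scale are hypothetical) contrasts Spearman rank correlation, which captures agreement on relative quality ordering, with the mean signed offset between GPT-5 and human rubric scores, which surfaces systematic differences in absolute scoring.

```python
# Illustrative sketch: ranking agreement vs. calibration offset between
# an LLM evaluator and human judges. Scores are hypothetical placeholders,
# not results reported in the paper.
from scipy.stats import spearmanr

# Hypothetical per-segment rubric scores (0-100 scale) for one MT system.
human_scores = [78, 65, 82, 54, 71, 88, 60]
gpt5_scores = [84, 70, 90, 62, 77, 93, 68]

# Relative-ranking agreement: Spearman rank correlation.
rho, p_value = spearmanr(human_scores, gpt5_scores)

# Absolute-scoring calibration: mean signed offset of the LLM vs. humans.
mean_offset = sum(g - h for g, h in zip(gpt5_scores, human_scores)) / len(human_scores)

print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
print(f"Mean GPT-5 offset vs. humans = {mean_offset:+.1f} points")
```

Under this kind of analysis, a high rank correlation combined with a consistently positive (or negative) mean offset would correspond to the pattern the abstract describes: strong agreement on relative quality ranking alongside a systematic difference in absolute scores.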