Beyond BLEU: GPT-5, Human Judgment, and Classroom Validation for Multidimensional Machine Translation Evaluation

Abstract

Recent progress in large language models (LLMs) has rekindled the promise of high-quality machine translation (MT), yet evaluation remains a bottleneck. Traditional automatic metrics (e.g., BLEU) are fast but fail to capture the semantic and pragmatic nuances reflected in human judgments. We present a multidimensional framework, inspired by Multidimensional Quality Metrics (MQM), that augments the standard dimensions of Adequacy and Fluency with three linguistic dimensions: Morphosyntactic, Semantic, and Pragmatic. We compare three small language models for English→Indonesian translation: Qwen 3 (0.6B), LLaMA 3.2 (3B), and Gemma 3 (1B). We conduct two controlled experiments: (i) a preliminary experiment (1,000 translations, GPT-5-only scoring of Adequacy and Fluency plus BLEU) and (ii) a final experiment (100 translations, three human experts plus GPT-5, scored on all five metrics). We compute inter-annotator reliability (Krippendorff's α, weighted κ) and annotator competence (MACE). Results show a consistent model ranking (Gemma 3 (1B) > LLaMA 3.2 (3B) > Qwen 3 (0.6B)) and a strong GPT-5–human correlation (r = 0.822). To validate practical applicability, we conducted a classroom study in which 26 translation students applied the metrics in real learning settings. Using the same multidimensional rubric, students rated MT outputs across pre-, post-, and final-test phases. After rubric calibration, their mean absolute error (MAE) decreased from 0.97 to 0.83 and their Exact Match Rate increased from 0.30 to 0.50, demonstrating that the proposed framework and GPT-5-based evaluation can be effectively transferred to educational contexts for evaluator training and feedback alignment.
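As an illustrative sketch only (not the authors' actual evaluation pipeline), the snippet below shows how the agreement statistics cited in the abstract, namely the GPT-5–human Pearson correlation, mean absolute error (MAE), Exact Match Rate, and weighted κ, can be computed from parallel rating lists with standard Python libraries; the score arrays are hypothetical placeholders, not data from the study.

```python
# Illustrative sketch: agreement statistics between GPT-5 and human ratings.
# The score arrays are hypothetical; in the study, ratings are 1-5 scores on
# the same set of translations.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human_scores = np.array([4, 3, 5, 4, 2, 5, 3, 4])  # e.g., consensus expert ratings (1-5)
gpt5_scores = np.array([4, 3, 4, 4, 2, 5, 3, 5])   # GPT-5 ratings on the same items (1-5)

r, p = pearsonr(human_scores, gpt5_scores)            # GPT-5-human correlation
mae = np.mean(np.abs(human_scores - gpt5_scores))     # mean absolute error
exact_match = np.mean(human_scores == gpt5_scores)    # exact match rate
kappa_w = cohen_kappa_score(human_scores, gpt5_scores,
                            weights="quadratic")      # weighted kappa for ordinal agreement

print(f"Pearson r = {r:.3f} (p = {p:.3f})")
print(f"MAE = {mae:.2f}, Exact Match Rate = {exact_match:.2f}, weighted kappa = {kappa_w:.2f}")
```

The same script structure extends to per-dimension scores (Adequacy, Fluency, Morphosyntactic, Semantic, Pragmatic) by looping over one rating column per dimension.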
