CalibJudge: Calibrated LLM-as-a-Judge for Multilingual RAG with Uncertainty-Aware Scoring
Abstract
Large Language Models (LLMs) serving as automatic evaluators (LLM-as-a-Judge) have become essential for assessing Retrieval-Augmented Generation (RAG) systems. However, in multilingual settings, these judges exhibit significant calibration drift across languages, producing scores that are neither comparable across languages nor aligned with human judgments. We present CalibJudge, a post-hoc calibration framework that addresses this challenge through: (1) language-specific temperature scaling, (2) uncertainty quantification, and (3) selective abstention. We evaluate CalibJudge on the MEMERAG benchmark, which covers five languages. Our experiments demonstrate that CalibJudge improves correlation with human annotations by up to 21.3% (relative) in Kendall's τ, while reducing cross-lingual fairness gaps by 42% and achieving 88% balanced accuracy at 70% coverage.
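The abstract names three components; the first and third can be sketched concretely. Below is a minimal, hypothetical illustration (not the authors' implementation) of language-specific temperature scaling fitted by grid search over held-out negative log-likelihood, combined with confidence-thresholded selective abstention. All function names, the grid range, and the 0.6 default threshold are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(z, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(logits, labels, temps=np.linspace(0.25, 4.0, 76)):
    """Grid-search the temperature T minimizing held-out NLL of softmax(logits / T).

    An overconfident judge (sharp but often wrong) yields a fitted T > 1,
    which flattens its score distribution toward honest probabilities.
    """
    best_t, best_nll = 1.0, np.inf
    for t in temps:
        p = softmax(logits / t)
        nll = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t

def calibrate_per_language(logits_by_lang, labels_by_lang):
    """Fit one temperature per language, so scores become comparable across languages."""
    return {lang: fit_temperature(logits_by_lang[lang], labels_by_lang[lang])
            for lang in logits_by_lang}

def judge_with_abstention(logits, temperature, threshold=0.6):
    """Return (predicted_label, confidence), or (None, confidence) to abstain.

    Abstaining on low-confidence items trades coverage for accuracy,
    matching the "balanced accuracy at partial coverage" framing above.
    """
    p = softmax(np.asarray(logits) / temperature)
    conf = float(p.max())
    label = int(p.argmax())
    return (label if conf >= threshold else None), conf
```

With per-language temperatures fitted on a small human-annotated split, the same confidence threshold can then be applied uniformly across languages, which is one plausible route to shrinking the cross-lingual fairness gap the abstract reports.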