DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering

Sher Badshah
Hassan Sajjad

Abstract

Evaluating the free-form responses generated by Large Language Models (LLMs) remains a challenge due to their diverse and open-ended nature. Traditional automatic metrics based on supervised signals fail to capture semantic equivalence or handle the variability of open-ended responses, while human evaluation, though reliable, is resource-intensive. Using LLMs as evaluators offers a promising alternative because of their strong language understanding and instruction-following capabilities. Taking advantage of these capabilities, we propose the Dynamic Arbitration Framework for Evaluation (DAFE), which employs two primary LLM judges and engages a third, arbitrator LLM only when the two disagree. This selective arbitration prioritizes evaluation reliability while reducing unnecessary computational demands compared to conventional majority voting. DAFE combines task-specific reference answers with dynamic arbitration to enhance judgment accuracy, resulting in significant improvements in evaluation metrics such as Macro F1 and Cohen’s Kappa. Through experiments, including a comprehensive human evaluation, we demonstrate DAFE’s ability to provide consistent, scalable, and resource-efficient assessments, establishing it as a robust framework for evaluating free-form model outputs.
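
The selective arbitration described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dafe_verdict`, the binary correct/incorrect verdict, the rule that the arbitrator's verdict is final, and the three toy string-matching judges (which stand in for real LLM calls) are all assumptions made for illustration only.

```python
import re
from typing import Callable, Tuple

# Hypothetical judge type: (question, reference, candidate) -> True if judged correct.
# In practice each judge would wrap an LLM call; here they are toy stand-ins.
Judge = Callable[[str, str, str], bool]


def dafe_verdict(
    question: str,
    reference: str,
    candidate: str,
    judge_a: Judge,
    judge_b: Judge,
    arbitrator: Judge,
) -> Tuple[bool, int]:
    """Return (verdict, number_of_judge_calls) under selective arbitration."""
    verdict_a = judge_a(question, reference, candidate)
    verdict_b = judge_b(question, reference, candidate)
    if verdict_a == verdict_b:
        # The two primary judges agree, so the arbitrator is never called.
        return verdict_a, 2
    # Disagreement: assume the arbitrator's verdict is final.
    return arbitrator(question, reference, candidate), 3


# Toy judges (assumptions, not part of the source).
def strict_judge(question: str, reference: str, candidate: str) -> bool:
    # Exact match after trimming and lowercasing.
    return candidate.strip().lower() == reference.strip().lower()


def lenient_judge(question: str, reference: str, candidate: str) -> bool:
    # Reference appears anywhere in the candidate.
    return reference.lower() in candidate.lower()


def token_arbitrator(question: str, reference: str, candidate: str) -> bool:
    # Every word of the reference appears as a word in the candidate.
    ref_tokens = set(re.findall(r"\w+", reference.lower()))
    cand_tokens = set(re.findall(r"\w+", candidate.lower()))
    return ref_tokens <= cand_tokens


if __name__ == "__main__":
    q, ref = "What is the capital of France?", "Paris"
    for cand in ["paris", "It is Paris.", "Lyon"]:
        print(cand, dafe_verdict(q, ref, cand, strict_judge, lenient_judge, token_arbitrator))
```

In this sketch the third judge is called only for the answer "It is Paris.", where the two toy judges disagree. A three-way majority vote would call all three judges on every item, which is the computational saving the abstract refers to.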

Version published to 10.32388/b69sky
Mar 21, 2025

Related articles

BHRE-RAG: A Benchmark and Retrieval-Augmented Framework for Advancing Comprehension-Based Question Answering in Bangla

This article has 2 authors:
1. Md Saiyem Raiyan
2. Nayeema Ferdous
This article has no evaluations
Latest version Jan 23, 2026

QNLP-Bench: A Standardized Benchmark and Evaluation Framework for Quantum Natural Language Processing

This article has 1 author:
1. Parham Ghayour
This article has no evaluations
Latest version Dec 19, 2025

From Generation to Detection: Leveraging Empirically Derived Linguistic Hints for LLM-Based Fake News Detection

This article has 1 author:
1. Piyush Ghasiya
This article has no evaluations
Latest version Jan 28, 2026
