Program-Guided Refinement with Debate: A Multi-Agent LLM-Based Automated Fact-Checking Model

Abstract

The explosive spread of online information has fueled the proliferation of false claims, making automated fact-checking increasingly urgent. Existing automated fact-checking models share three common problems: a lack of interpretability, hallucination, and low inference efficiency. In this paper, we propose a novel model, PGR-Debate (Program-Guided Refinement with Debate), to address these challenges. Through a multi-agent debate design, PGR-Debate decomposes complex claims into three executable sub-tasks (Question, Verify, and Predict), significantly enhancing the interpretability of fact-checking. To alleviate hallucination, we design two Debater agents and one Finalizer agent: the two Debaters engage in interactive debates to identify and correct errors in the reasoning program, and the Finalizer then rewrites the program, gradually improving the faithfulness and credibility of the explanations. To accelerate inference and enable lightweight deployment, we adopt a knowledge distillation strategy: a high-performance model serves as the teacher, and a task-aware distillation framework transfers its multi-hop reasoning capability to a smaller student model, improving inference efficiency while preserving reasoning consistency. The model requires neither domain-specific pretraining nor task-specific fine-tuning; it relies only on instruction-based prompting and knowledge distillation. Experiments on the standard FEVEROUS-S and HOVER datasets demonstrate that PGR-Debate outperforms multiple baselines under different evidence-availability settings (Gold, Open-book), reduces reasoning time to 30%–50% of that of traditional methods, and boosts the student model's inference speed by 1.9× after distillation. Moreover, the prediction error rate is reduced by about 50%, with the semantic error rate dropping from 6.8% to 2.88% after distillation. On the HOVER dataset, compared with the ProgramFC baseline (Qwen2.5-14B), PGR-Debate improves explanation faithfulness by 42–43 percentage points (about 3.3×) at the sentence level and by 7–8 percentage points (about 1.4×) at the program level, significantly enhancing the factual consistency of reasoning chains.
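
The decompose-debate-refine pipeline described in the abstract can be summarized by the following minimal Python sketch. It is an illustrative assumption, not the authors' implementation: the function names (pgr_debate, call_llm), the prompts, and the fixed number of debate rounds are all hypothetical placeholders for whatever LLM backend and prompt templates the paper actually uses.

# Minimal sketch of a program-guided refinement loop with two Debater agents
# and one Finalizer agent. All names and prompts below are illustrative
# assumptions, not the paper's implementation.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion backend (assumed, not specified here)."""
    raise NotImplementedError("Plug in your own LLM client here.")

def pgr_debate(claim: str, evidence: str, rounds: int = 2) -> str:
    # Step 1: decompose the claim into an executable reasoning program made of
    # Question / Verify / Predict sub-tasks.
    program = call_llm(
        "Decompose the claim into Question, Verify, and Predict steps.\n"
        f"Claim: {claim}\nEvidence: {evidence}"
    )

    # Step 2: two Debaters take turns criticizing the program; the Finalizer
    # rewrites it using their critiques, iteratively improving faithfulness.
    for _ in range(rounds):
        critique_a = call_llm(f"Debater A: find factual or logical errors in:\n{program}")
        critique_b = call_llm(
            f"Debater B: respond to Debater A and raise further issues.\n"
            f"Debater A said:\n{critique_a}\nProgram:\n{program}"
        )
        program = call_llm(
            "Finalizer: rewrite the program to fix the issues raised.\n"
            f"Program:\n{program}\nCritiques:\n{critique_a}\n{critique_b}"
        )

    # Step 3: execute the refined program to obtain the final verdict.
    return call_llm(f"Execute the program and output the final verdict:\n{program}")

In this sketch the distilled student model would simply be substituted for the teacher inside call_llm after training; the loop itself is unchanged, which is one way the distillation step could preserve reasoning consistency while speeding up inference.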
