SumLLM: Performance Evaluation and the Judgment of Large Language Models in Bengali Abstractive News Summarization
Abstract
Bengali abstractive summarization has long been hindered by noisy, low-quality reference datasets and limited evaluation methods. Prior benchmarks reported apparently strong performance, yet relied on small-scale human studies and reference-based metrics, both of which underestimate the generative capacity of modern LLMs. In this paper, we revisit Bangla summarization under zero-shot conditions, evaluating six recent large language models: GPT-4, Llama-3.1-8B, Mixtral-8x22B-Instruct-v0.1, Gemma-2-27B, DeepSeek-R1, and Qwen3-30B-A3B, on the Bengali Abstractive News Summarization (BANS) dataset. To overcome the issue of weak reference quality, we propose a robust evaluation framework using LLMs-as-Judges, in which multiple calibrated LLMs independently assess outputs for faithfulness, coherence, and relevance. Our results demonstrate that modern LLMs can rival, and in many cases surpass, human-written references in readability and informativeness, though humans still retain advantages in certain nuanced cases. This work establishes zero-shot LLM reasoning combined with reference-free evaluation as a new paradigm for high-quality Bangla summarization, providing a scalable and robust framework for future low-resource language research.
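The abstract describes the LLMs-as-Judges protocol only at a high level (multiple judges scoring each summary independently on faithfulness, coherence, and relevance). The Python sketch below is a minimal illustration of how such a reference-free, multi-judge evaluation could be wired up; the prompt wording, the 1-5 rating scale, the parsing, and the averaging are assumptions made for this example, not the paper's actual configuration, and the stand-in judge callables would in practice be calls to the listed LLMs.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# The three dimensions named in the abstract.
DIMENSIONS = ("faithfulness", "coherence", "relevance")

# Hypothetical rubric prompt; the paper's exact wording is not given here.
JUDGE_PROMPT = (
    "You are grading a Bengali news summary. Given the article and the "
    "candidate summary, rate the summary from 1 (worst) to 5 (best) on "
    "faithfulness, coherence, and relevance. Reply with three integers "
    "separated by spaces, e.g. '4 5 3'."
)


@dataclass
class JudgeResult:
    judge_name: str
    scores: Dict[str, int]


def parse_scores(reply: str) -> Dict[str, int]:
    """Parse a judge reply such as '4 5 3' into a dimension -> score mapping."""
    values = [int(tok) for tok in reply.split()[: len(DIMENSIONS)]]
    return dict(zip(DIMENSIONS, values))


def evaluate_summary(
    article: str,
    summary: str,
    judges: Dict[str, Callable[[str], str]],
) -> Dict[str, float]:
    """Ask every judge LLM independently, then average per-dimension scores.

    Reference-free: only the source article and the candidate summary are
    shown to the judges; no human-written reference summary is needed.
    """
    results: List[JudgeResult] = []
    for name, ask in judges.items():
        prompt = f"{JUDGE_PROMPT}\n\nArticle:\n{article}\n\nSummary:\n{summary}"
        results.append(JudgeResult(name, parse_scores(ask(prompt))))
    return {dim: mean(r.scores[dim] for r in results) for dim in DIMENSIONS}


if __name__ == "__main__":
    # Stand-in judges for demonstration; real judges would call LLM APIs.
    fake_judges = {
        "judge-a": lambda prompt: "4 5 4",
        "judge-b": lambda prompt: "5 4 4",
    }
    print(evaluate_summary("<Bengali article text>", "<candidate summary>", fake_judges))
```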