SumLLM: Performance Evaluation and the Judgment of Large Language Models in Bengali Abstractive News Summarization

Abstract

Bengali abstractive summarization has long been hindered by noisy, low-quality reference datasets and limited evaluation methods. Prior benchmarks reported apparently strong performance, yet relied on small-scale human studies and reference-based metrics, both of which underestimate the generative capacity of modern LLMs. In this paper, we revisit Bangla summarization under zero-shot conditions, evaluating six recent models (GPT-4, Llama-3.1-8B, Mixtral-8x22B-Instruct-v0.1, Gemma-2-27B, DeepSeek-R1, and Qwen3-30B-A3B) on the Bengali Abstractive News Summarization (BANS) dataset. To overcome the issue of weak reference quality, we propose a robust evaluation framework using LLMs-as-Judges, in which multiple calibrated LLMs independently assess outputs for faithfulness, coherence, and relevance. Our results demonstrate that modern LLMs can rival, and in many cases surpass, human-written references in readability and informativeness, though humans still retain advantages in certain nuanced cases. This work establishes zero-shot LLM reasoning combined with reference-free evaluation as a new paradigm for high-quality Bangla summarization, providing a scalable and robust framework for future low-resource language research.
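
The reference-free LLM-as-judge protocol described above can be pictured with a minimal sketch. The judge model identifiers, the call_llm helper, and the rubric wording here are illustrative assumptions, not the paper's actual prompts or implementation; only the idea of multiple independent judges scoring faithfulness, coherence, and relevance against the source article (with no gold summary) comes from the abstract.

```python
# Minimal sketch of reference-free LLM-as-judge scoring.
# Assumptions (not from the paper): judge identifiers, rubric wording, and a
# hypothetical call_llm(model, prompt) helper returning the judge's text output.
import json
import statistics

JUDGE_MODELS = ["judge-a", "judge-b", "judge-c"]  # placeholder judge models

RUBRIC = (
    "You are evaluating a Bengali news summary. Rate it from 1 to 5 on "
    "faithfulness, coherence, and relevance to the article. Respond only "
    'with JSON, e.g. {"faithfulness": 4, "coherence": 5, "relevance": 4}.'
)

def judge_summary(article: str, summary: str, call_llm) -> dict:
    """Average rubric scores from several independent judge models."""
    per_dimension = {"faithfulness": [], "coherence": [], "relevance": []}
    for model in JUDGE_MODELS:
        prompt = f"{RUBRIC}\n\nArticle:\n{article}\n\nSummary:\n{summary}"
        scores = json.loads(call_llm(model, prompt))  # expects strict JSON
        for dim in per_dimension:
            per_dimension[dim].append(scores[dim])
    # Reference-free: only the source article is consulted, never a gold summary.
    return {dim: statistics.mean(vals) for dim, vals in per_dimension.items()}
```

Averaging across several judges is one simple way to calibrate out a single judge's bias; other aggregation choices (majority vote, trimmed means) would fit the same structure.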
