Efficient Knowledge Distillation for News Classification Based on ModernBERT

Abstract

Large-scale pretrained language models (e.g., BERT-large, RoBERTa-large) achieve strong performance in text classification, but their substantial computational and energy costs hinder deployment, while their base counterparts still lag in accuracy. ModernBERT-large and ModernBERT-base are adopted as a teacher–student pair to systematically explore three knowledge distillation (KD) strategies for news classification: full-sample output-layer distillation, selective output-layer distillation, and attention-layer distillation. To assess sustainability, a carbon-efficiency metric, Accuracy per kWh, is introduced. Experiments show that the distilled student model improves accuracy (ACC) by 0.74% and energy efficiency by 138.9% over the teacher on AG News; on 20 Newsgroups, ACC increases by 2.62% with a 133.6% efficiency gain. These results validate the effectiveness of the proposed green distillation framework.
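The abstract does not spell out the training objective or how the efficiency metric is computed, so the following is only a minimal PyTorch sketch of what full-sample output-layer distillation and an Accuracy-per-kWh metric could look like; the temperature, loss weight `alpha`, and the toy accuracy/energy figures are assumptions for illustration, not values from the paper.

```python
import torch
import torch.nn.functional as F

def output_layer_distillation_loss(student_logits, teacher_logits, labels,
                                   temperature=2.0, alpha=0.5):
    """Soft-label KL loss (teacher -> student) combined with hard-label cross-entropy.

    `temperature` and `alpha` are assumed hyperparameters, not taken from the paper.
    """
    # Soft targets: teacher probabilities at a raised temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 to keep gradient magnitudes comparable across temperatures.
    kd_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1.0 - alpha) * ce_loss

def accuracy_per_kwh(num_correct, num_total, energy_kwh):
    """Carbon-efficiency metric named in the abstract: accuracy divided by energy used (kWh)."""
    return (num_correct / num_total) / energy_kwh

if __name__ == "__main__":
    # Toy usage with random logits for a 4-class task (AG News has 4 classes).
    torch.manual_seed(0)
    student_logits = torch.randn(8, 4)
    teacher_logits = torch.randn(8, 4)
    labels = torch.randint(0, 4, (8,))
    loss = output_layer_distillation_loss(student_logits, teacher_logits, labels)
    print(f"distillation loss: {loss.item():.4f}")
    # Hypothetical numbers purely to show the metric's shape.
    print(f"accuracy per kWh: {accuracy_per_kwh(93, 100, 0.42):.2f}")
```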
