Efficient Knowledge Distillation for News Classification Based on ModernBERT
Abstract
Large-scale pretrained language models (e.g., BERT-large, RoBERTa-large) achieve strong performance in text classification; however, their substantial computational and energy costs hinder deployment, whereas their base-size counterparts still lag in accuracy. ModernBERT-large and ModernBERT-base are adopted as a teacher–student pair to systematically explore three knowledge distillation (KD) strategies for news classification: full-sample output-layer distillation, selective output-layer distillation, and attention-layer distillation. To assess sustainability, a carbon-efficiency metric, Accuracy per kWh, is introduced. Experiments demonstrate that the distilled student model improves accuracy by 0.74% and boosts energy efficiency by 138.9% over the teacher on AG News; on 20 Newsgroups, accuracy increases by 2.62% with a 133.6% efficiency gain. The results validate the effectiveness of the proposed green distillation framework.
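The sketch below illustrates the two components the abstract names: a temperature-scaled output-layer distillation loss and the Accuracy-per-kWh metric. It is a minimal PyTorch sketch, not the paper's exact implementation; the temperature, loss weighting, and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def output_layer_kd_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         labels: torch.Tensor,
                         temperature: float = 2.0,   # assumed value
                         alpha: float = 0.5) -> torch.Tensor:  # assumed weighting
    """Combine hard-label cross-entropy with temperature-scaled KL to the teacher."""
    # Soften both output distributions with the temperature before comparing them.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 to keep its gradient magnitude comparable to CE.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

def accuracy_per_kwh(accuracy: float, energy_kwh: float) -> float:
    """Carbon-efficiency metric: classification accuracy per kWh of energy consumed."""
    return accuracy / energy_kwh
```

In this formulation, a student whose accuracy matches or exceeds the teacher's while consuming less energy scores strictly higher on Accuracy per kWh, which is the sense in which the reported 138.9% and 133.6% efficiency gains are computed.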