Efficient Knowledge Distillation for News Classification Based on ModernBERT
Abstract
Large-scale pretrained language models (e.g., BERT-large, RoBERTa-large) achieve strong performance in text classification; however, their substantial computational and energy costs hinder deployment, whereas their base-size counterparts still lag in accuracy. ModernBERT-large and ModernBERT-base are adopted as a teacher–student pair to systematically explore three knowledge distillation (KD) strategies for news classification: full-sample output-layer distillation, selective output-layer distillation, and attention-layer distillation. To assess sustainability, a carbon-efficiency metric, Accuracy per kWh, is introduced. Experiments demonstrate that the distilled student model improves accuracy by 0.74% and boosts energy efficiency by 138.9% over the teacher on AG News; on 20 Newsgroups, accuracy increases by 2.62% with a 133.6% efficiency gain. The results validate the effectiveness of the proposed green distillation framework.
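The sketch below illustrates the two components the abstract names: a temperature-scaled output-layer distillation loss and the Accuracy-per-kWh metric. It is a minimal PyTorch sketch, not the paper's exact implementation; the temperature, loss weighting, and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def output_layer_kd_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         labels: torch.Tensor,
                         temperature: float = 2.0,   # assumed value
                         alpha: float = 0.5) -> torch.Tensor:  # assumed weighting
    """Combine hard-label cross-entropy with temperature-scaled KL to the teacher."""
    # Soften both output distributions with the temperature before comparing them.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 to keep its gradient magnitude comparable to CE.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

def accuracy_per_kwh(accuracy: float, energy_kwh: float) -> float:
    """Carbon-efficiency metric: classification accuracy per kWh of energy consumed."""
    return accuracy / energy_kwh
```

In this formulation, a student whose accuracy matches or exceeds the teacher's while consuming less energy scores strictly higher on Accuracy per kWh, which is the sense in which the reported 138.9% and 133.6% efficiency gains are computed.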