Efficient Cluster Execution of Sparse Transformers: Joint Quantization and Carbon-Aware DVFS Scheduling

Abstract

The rising computational demands of Transformer-based deep learning models pose significant challenges to energy efficiency, particularly in distributed cluster environments. This paper presents a framework for executing sparse Transformer models on compute clusters through the joint application of model quantization and carbon-aware dynamic voltage and frequency scaling (DVFS) scheduling. By leveraging structured sparsity and low-precision quantization, we reduce the memory footprint and computational overhead without compromising model accuracy. In parallel, a dynamic scheduling mechanism optimizes power usage based on carbon intensity forecasts and workload characteristics, enabling environmentally sustainable inference. The framework is evaluated on benchmark Transformer architectures in both simulated and real-world cluster environments with carbon-aware DVFS. Results demonstrate energy consumption reductions of up to 45% with negligible impact on accuracy, while allowing flexible adaptation to fluctuating energy availability. Our approach offers a scalable, low-cost pathway for deploying deep learning models in resource-constrained and sustainability-sensitive computing clusters. This research bridges the gap between efficient AI inference and green computing, aligning with the emerging need for carbon-conscious, large-scale machine learning systems.
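To make the two levers concrete, the following is a minimal illustrative sketch, not the paper's implementation: it applies standard PyTorch dynamic INT8 quantization to a small Transformer encoder and pairs it with an assumed carbon-aware DVFS policy. The frequency levels, carbon thresholds, and the `choose_frequency` heuristic are hypothetical placeholders for the scheduling mechanism described in the abstract.

```python
# Illustrative sketch only: joint INT8 quantization and carbon-aware DVFS
# frequency selection for Transformer inference. FREQ_LEVELS_GHZ,
# CARBON_THRESHOLDS, and choose_frequency are assumptions, not the paper's API.
import torch
import torch.nn as nn

# --- Model side: dynamic INT8 quantization of a small Transformer encoder ---
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
).eval()

# Dynamic quantization converts Linear weights to INT8, shrinking memory and
# per-token compute (standard PyTorch API, not the paper's exact pipeline).
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# --- Scheduler side: pick a DVFS level from a carbon-intensity forecast ---
FREQ_LEVELS_GHZ = [1.2, 1.8, 2.4, 3.0]    # assumed per-node P-states
CARBON_THRESHOLDS = [150, 300, 450]        # assumed gCO2/kWh breakpoints

def choose_frequency(carbon_gco2_per_kwh: float, queue_depth: int) -> float:
    """Lower the clock when grid carbon intensity is high, unless the
    request queue is deep enough that latency targets would be violated."""
    level = sum(carbon_gco2_per_kwh > t for t in CARBON_THRESHOLDS)
    # Back off one step toward a higher frequency under heavy load.
    if queue_depth > 32 and level > 0:
        level -= 1
    return FREQ_LEVELS_GHZ[len(FREQ_LEVELS_GHZ) - 1 - level]

# Example: a moderately carbon-intensive hour with a shallow request queue
# maps to a reduced clock of 1.8 GHz.
print(choose_frequency(carbon_gco2_per_kwh=320.0, queue_depth=8))
```

In a cluster deployment, a policy of this shape would run per node, with the quantized model keeping accuracy loss small while the frequency choice trades latency for emissions as the grid's carbon intensity fluctuates.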
