Efficient Cluster Execution of Sparse Transformers: Joint Quantization and Carbon-Aware DVFS Scheduling

Abstract

The rising computational demands of Transformer-based deep learning models pose significant challenges to energy efficiency, particularly in distributed cluster environments. This paper presents a framework for executing sparse Transformer models on compute clusters through the joint application of model quantization and carbon-aware dynamic voltage and frequency scaling (DVFS) scheduling. By leveraging structured sparsity and low-precision quantization, we reduce the memory footprint and computational overhead without compromising model accuracy. In parallel, a dynamic scheduling mechanism optimizes power usage based on carbon intensity forecasts and workload characteristics, enabling environmentally sustainable inference. The framework is evaluated on benchmark Transformer architectures in both simulated and real-world cluster environments with carbon-aware DVFS. Results demonstrate energy consumption reductions of up to 45% with negligible impact on accuracy, while allowing flexible adaptation to fluctuating energy availability. Our approach offers a scalable, low-cost pathway for deploying deep learning models in resource-constrained and sustainability-sensitive computing clusters. This research bridges the gap between efficient AI inference and green computing, aligning with the emerging need for carbon-conscious, large-scale machine learning systems.
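To make the two levers concrete, the following is a minimal illustrative sketch, not the paper's implementation: it applies standard PyTorch dynamic INT8 quantization to a small Transformer encoder and pairs it with an assumed carbon-aware DVFS policy. The frequency levels, carbon thresholds, and the `choose_frequency` heuristic are hypothetical placeholders for the scheduling mechanism described in the abstract.

```python
# Illustrative sketch only: joint INT8 quantization and carbon-aware DVFS
# frequency selection for Transformer inference. FREQ_LEVELS_GHZ,
# CARBON_THRESHOLDS, and choose_frequency are assumptions, not the paper's API.
import torch
import torch.nn as nn

# --- Model side: dynamic INT8 quantization of a small Transformer encoder ---
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
).eval()

# Dynamic quantization converts Linear weights to INT8, shrinking memory and
# per-token compute (standard PyTorch API, not the paper's exact pipeline).
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# --- Scheduler side: pick a DVFS level from a carbon-intensity forecast ---
FREQ_LEVELS_GHZ = [1.2, 1.8, 2.4, 3.0]    # assumed per-node P-states
CARBON_THRESHOLDS = [150, 300, 450]        # assumed gCO2/kWh breakpoints

def choose_frequency(carbon_gco2_per_kwh: float, queue_depth: int) -> float:
    """Lower the clock when grid carbon intensity is high, unless the
    request queue is deep enough that latency targets would be violated."""
    level = sum(carbon_gco2_per_kwh > t for t in CARBON_THRESHOLDS)
    # Back off one step toward a higher frequency under heavy load.
    if queue_depth > 32 and level > 0:
        level -= 1
    return FREQ_LEVELS_GHZ[len(FREQ_LEVELS_GHZ) - 1 - level]

# Example: a moderately carbon-intensive hour with a shallow request queue
# maps to a reduced clock of 1.8 GHz.
print(choose_frequency(carbon_gco2_per_kwh=320.0, queue_depth=8))
```

In a cluster deployment, a policy of this shape would run per node, with the quantized model keeping accuracy loss small while the frequency choice trades latency for emissions as the grid's carbon intensity fluctuates.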
