Racing to Idle: Energy Efficiency of Matrix Multiplication on Heterogeneous CPU and GPU Architectures

Mufakir Qamar Ansari

Abstract

Heterogeneous computing has emerged as an essential approach to overcoming the power and thermal constraints that have stalled single-core processor scaling. By integrating multi-core CPUs with discrete and integrated GPUs, modern systems promise substantial gains in performance and energy efficiency, yet the practical magnitude of these benefits on consumer hardware remains underexplored. This study presents a rigorous experimental comparison of a canonical matrix-matrix multiplication workload across three architectures within a single, widely available laptop: a multi-core AMD Ryzen 7 CPU, a discrete NVIDIA GeForce GPU, and an integrated AMD Radeon Vega GPU. Using minimally intrusive, production-grade measurement tools, we deliver a transparent, quantitative analysis of the real-world trade-offs between speed and energy consumption. The results demonstrate that the discrete GPU not only provides a 93-fold speedup over the CPU but is also more than 50 times as energy efficient, consuming just 2% of the energy the CPU requires for the same computation. These findings provide direct evidence for the race-to-idle principle: to minimize total energy-to-solution, peak instantaneous power matters less than completing the workload quickly and returning to idle. Overall, this work offers clear empirical guidance for practitioners designing energy-aware high-performance computing systems, demonstrating that architectural specialization is critical for unlocking orders-of-magnitude improvements in computational efficiency on widely accessible platforms.
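The arithmetic behind the race-to-idle claim can be illustrated with a minimal sketch. It assumes energy-to-solution equals average power multiplied by runtime, and it uses only the rounded figures quoted in the abstract (a 93-fold speedup and roughly 2% of the CPU's energy). The function names are hypothetical and are not taken from the preprint; the implied power ratio is a back-of-the-envelope estimate, not a measured value.

```python
# Illustrative sketch only: relates the abstract's rounded figures under the
# assumption that energy-to-solution = average power x runtime.

def energy_ratio(power_ratio: float, speedup: float) -> float:
    """Energy of the accelerator relative to the baseline.

    power_ratio: accelerator average power / baseline average power
    speedup:     baseline runtime / accelerator runtime
    """
    return power_ratio / speedup


def implied_power_ratio(energy_fraction: float, speedup: float) -> float:
    """Average-power ratio implied by a reported energy fraction and speedup."""
    return energy_fraction * speedup


SPEEDUP = 93.0          # from the abstract: 93-fold speedup over the CPU
ENERGY_FRACTION = 0.02  # from the abstract: about 2% of the CPU's energy

power_ratio = implied_power_ratio(ENERGY_FRACTION, SPEEDUP)
print(f"implied average power ratio: {power_ratio:.2f}x")
print(f"energy ratio check: {energy_ratio(power_ratio, SPEEDUP):.3f}")
```

Under these assumptions, the discrete GPU could draw somewhat less than twice the CPU's average power and still use about 2% of the energy, because the runtime shrinks far more than the power grows.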

Version published to 10.21203/rs.3.rs-7890483/v1 on Research Square
Oct 21, 2025

Related articles

Accelerating fine-grained parallel incomplete factorization on MIMD many-core architecture

This article has 10 authors:
1. Yongzhen Shi
2. Qinglin Wang
3. Weihao Guo
4. Muchun Peng
5. Jie Liu
6. Lian Wang
7. Zhiyan Liu
8. Bingwei Wang
9. Feiming Liu
10. Xiangdong Pei
This article has no evaluations. Latest version: Oct 9, 2025

Towards a GPU-enabled billionare SVD in pyLOM

This article has 6 authors:
1. Arnau Miró
2. Benet Eiximeno
3. Lucas Gasparino
4. Nathan Kutz
5. Ivette Rodriguez
6. Oriol Lehmkuhl
This article has no evaluations. Latest version: Oct 10, 2025

Adaptive Dataflow and Precision Optimization for Deep Learning on Configurable Hardware Architectures

This article has 3 authors:
1. Gulnaz Rati
2. Rafael Mendes
3. Aisha Noor
This article has no evaluations. Latest version: Oct 8, 2025
