The Energy Efficiency Paradox: Lightweight CNNs Consume More Power than ResNets on Consumer GPUs

Abstract

Deploying deep neural networks in energy-constrained environments requires that inference optimisation strategies, such as runtime backends and reduced numerical precision, reliably deliver their promised gains on the target hardware. Yet how consistently these gains transfer across GPU tiers remains poorly characterised, and practitioners routinely apply optimisations benchmarked on high-end hardware to lower-tier devices without accounting for architectural differences. This paper presents a systematic, energy-aware benchmarking study of 15 image classification architectures evaluated across four runtime configurations on two consumer-grade GPUs: an NVIDIA GeForce GTX 1650 (Turing TU117, no Tensor Cores) and an NVIDIA GeForce RTX 3060 (Ampere GA106, Tensor Core-equipped). Each configuration is measured over ten independent runs at batch sizes 1, 8, and 32, recording per-inference latency, throughput, and energy consumption. Runtime configurations include native PyTorch FP32, ONNX Runtime with the CUDA Execution Provider (ORT-CUDA FP32), and FP16 variants in both PyTorch and ORT; all pairwise runtime differences are assessed via Wilcoxon signed-rank tests with Bonferroni correction. We also evaluate INT8 quantization (CPU) with accuracy agreement checks and use roofline analysis to explain the observed energy paradox.

Results show that ORT-CUDA consistently outperforms native PyTorch on both platforms, with a mean speedup of 2.01× on the GTX 1650 at batch size 1. The benefit is stratified by architecture family: lightweight depthwise-separable CNNs gain 2.79 ± 0.35×, standard CNNs 1.16 ± 0.18×, and LayerNorm-based models 1.05 ± 0.29×. FP16 behaviour, however, is strongly hardware-dependent. On the GTX 1650, which lacks Tensor Cores, FP16 systematically regresses latency across all 15 models: lightweight CNNs slow by a mean of 16%, standard CNNs by 115%, and LayerNorm-based architectures by 188% (up to 279% for ViT-B/16). A numerical stability failure renders EfficientNet-B3 unusable at batch size 32. On the RTX 3060, Tensor Core acceleration yields FP16 speedups only for compute-intensive models at moderate batch sizes (mean 1.95× at batch size 32), while at batch size 1 only attention-based architectures benefit modestly (4–5%). Additionally, FP16 ONNX models for ViT-B/16 and Swin-T fail ONNX Runtime's type checks on both GPUs due to a toolchain limitation in transformer attention layers, highlighting a separate portability constraint. Across platforms, the GTX 1650 outperforms the RTX 3060 under PyTorch FP32 for 12 of 15 models at batch size 1 (FPS ratios 1.59–1.84×), revealing a batch-1 throughput paradox. INT8 quantization on CPU achieves 54–98% energy savings for lightweight models with 98–100% accuracy agreement. Roofline analysis confirms that memory-bound lightweight models (arithmetic intensity < 35 FLOP/byte on the RTX 3060) are responsible for the energy paradox, explaining why models with fewer FLOPs can consume more energy than compute-bound ResNets.

Together, these findings expose a hardware- and toolchain-dependent portability gap that is invisible to single-platform benchmarks, with direct implications for hardware-aware model selection in energy-sensitive deployment scenarios. All experimental scripts, raw results, and intermediate data are publicly available to support reproducibility.

Index Terms: Energy efficiency, deep learning
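The abstract names the two FP32 runtime configurations but not the measurement harness itself. The following is a minimal sketch of how per-inference latency under native PyTorch FP32 and ONNX Runtime with the CUDA Execution Provider might be compared for a single model; the model choice (torchvision's mobilenet_v3_small), warm-up count, and iteration count are illustrative assumptions, not the study's exact protocol.

```python
# Minimal sketch: compare per-inference latency of native PyTorch FP32 vs
# ONNX Runtime (CUDA Execution Provider) for one torchvision model.
# Model choice, warm-up count, and iteration count are illustrative only.
import time
import torch
import torchvision.models as models
import onnxruntime as ort

BATCH = 1
ITERS = 200
model = models.mobilenet_v3_small(weights=None).eval().cuda()
dummy = torch.randn(BATCH, 3, 224, 224, device="cuda")

# --- Native PyTorch FP32 ---
with torch.no_grad():
    for _ in range(20):                       # warm-up
        model(dummy)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(ITERS):
        model(dummy)
    torch.cuda.synchronize()
    end = time.perf_counter()
pt_ms = (end - start) / ITERS * 1e3

# --- Export to ONNX and run under ORT-CUDA ---
torch.onnx.export(model, dummy, "model.onnx", opset_version=17,
                  input_names=["input"], output_names=["output"])
sess = ort.InferenceSession("model.onnx",
                            providers=["CUDAExecutionProvider"])
x = dummy.cpu().numpy()
for _ in range(20):                           # warm-up
    sess.run(None, {"input": x})
start = time.perf_counter()
for _ in range(ITERS):
    sess.run(None, {"input": x})
ort_ms = (time.perf_counter() - start) / ITERS * 1e3

print(f"PyTorch FP32: {pt_ms:.2f} ms  ORT-CUDA FP32: {ort_ms:.2f} ms  "
      f"speedup: {pt_ms / ort_ms:.2f}x")
```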
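The abstract records per-inference energy but does not describe how it is measured. One common approach on NVIDIA GPUs is to sample board power via NVML while the timed loop runs and integrate over time; the sketch below follows that pattern. The sampling period, the trapezoidal integration, and the stand-in workload are assumptions for illustration, not the paper's stated method.

```python
# Sketch: estimating energy by sampling GPU board power via NVML (pynvml)
# while a workload runs, then integrating power over time. The workload
# here is a stand-in; replace it with the actual timed inference loop.
import time
import threading
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []                      # (timestamp, watts)
stop = threading.Event()

def sample_power(period_s=0.01):
    while not stop.is_set():
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        samples.append((time.perf_counter(), watts))
        time.sleep(period_s)

sampler = threading.Thread(target=sample_power, daemon=True)
sampler.start()

time.sleep(2.0)                   # stand-in for the benchmark loop

stop.set()
sampler.join()
pynvml.nvmlShutdown()

# Trapezoidal integration of power over time gives energy in joules;
# dividing by the number of inferences would give per-inference energy.
energy_j = sum((t2 - t1) * (p1 + p2) / 2
               for (t1, p1), (t2, p2) in zip(samples, samples[1:]))
print(f"total energy over the run: {energy_j:.2f} J")
```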
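The statistical procedure named in the abstract (pairwise Wilcoxon signed-rank tests with Bonferroni correction over the runtime configurations) can be expressed along the following lines. The latency arrays are placeholders standing in for the ten paired per-run measurements; only the test-plus-correction structure is the point here.

```python
# Sketch: pairwise Wilcoxon signed-rank tests across runtime configurations
# with Bonferroni correction. The latency arrays below are placeholders for
# the ten paired per-run measurements described in the abstract.
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon

latencies_ms = {                  # 10 paired runs per configuration
    "pytorch_fp32": np.array([5.1, 5.0, 5.2, 5.1, 5.3, 5.0, 5.1, 5.2, 5.0, 5.1]),
    "ort_fp32":     np.array([2.6, 2.5, 2.6, 2.7, 2.5, 2.6, 2.6, 2.5, 2.7, 2.6]),
    "pytorch_fp16": np.array([5.9, 6.0, 5.8, 6.1, 5.9, 6.0, 5.8, 6.1, 5.9, 6.0]),
    "ort_fp16":     np.array([3.0, 2.9, 3.1, 3.0, 2.9, 3.0, 3.1, 2.9, 3.0, 3.0]),
}

pairs = list(combinations(latencies_ms, 2))
alpha = 0.05 / len(pairs)         # Bonferroni-corrected threshold
for a, b in pairs:
    stat, p = wilcoxon(latencies_ms[a], latencies_ms[b])
    verdict = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: p = {p:.4f} ({verdict} at corrected alpha = {alpha:.4f})")
```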
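For the INT8 results, the abstract states only that quantization runs on CPU and is checked for accuracy agreement against the full-precision model. A minimal sketch of that kind of check is shown below using ONNX Runtime's dynamic quantization for brevity; the study's actual INT8 recipe (for example, static quantization with calibration data) is not specified in the abstract, and the input batch and file names are placeholders (model.onnx is assumed to be the FP32 export from the earlier sketch).

```python
# Sketch: INT8 quantization of an ONNX model for CPU execution, with a
# top-1 "accuracy agreement" check against the FP32 model. Dynamic
# quantization and the random placeholder batch are illustrative only.
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic("model.onnx", "model_int8.onnx",
                 weight_type=QuantType.QInt8)

fp32 = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
int8 = ort.InferenceSession("model_int8.onnx", providers=["CPUExecutionProvider"])

x = np.random.rand(32, 3, 224, 224).astype(np.float32)   # placeholder images
pred_fp32 = fp32.run(None, {"input": x})[0].argmax(axis=1)
pred_int8 = int8.run(None, {"input": x})[0].argmax(axis=1)

agreement = (pred_fp32 == pred_int8).mean()
print(f"Top-1 agreement between FP32 and INT8: {agreement:.1%}")
```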
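Finally, the roofline argument in the abstract hinges on arithmetic intensity (FLOPs divided by bytes moved) relative to the GPU's ridge point (peak compute divided by memory bandwidth). The sketch below shows that classification; the nominal RTX 3060 peak-throughput and bandwidth figures and the per-model FLOP/byte counts are illustrative stand-ins, since in the study those would come from profiling.

```python
# Sketch: classifying a model as memory- or compute-bound with the roofline
# model. Peak throughput and bandwidth are nominal RTX 3060 figures used for
# illustration; per-model FLOPs and bytes moved would come from a profiler.
PEAK_FP32_TFLOPS = 12.7           # illustrative nominal FP32 peak
MEM_BANDWIDTH_GBS = 360.0         # illustrative nominal memory bandwidth

ridge_point = (PEAK_FP32_TFLOPS * 1e12) / (MEM_BANDWIDTH_GBS * 1e9)  # FLOP/byte

def classify(flops, bytes_moved):
    """Return arithmetic intensity and roofline regime for one inference."""
    ai = flops / bytes_moved
    regime = "memory-bound" if ai < ridge_point else "compute-bound"
    return ai, regime

# Placeholder numbers for a lightweight CNN at batch size 1.
ai, regime = classify(flops=0.06e9 * 2, bytes_moved=25e6)
print(f"ridge point = {ridge_point:.1f} FLOP/byte, "
      f"arithmetic intensity = {ai:.1f} FLOP/byte -> {regime}")
```

A model whose arithmetic intensity falls below the ridge point (about 35 FLOP/byte for the figures above) is limited by memory traffic rather than compute, which is the mechanism the abstract uses to explain why low-FLOP lightweight models can draw more energy than compute-bound ResNets.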
