The Energy Efficiency Paradox: Lightweight CNNs Consume More Power than ResNets on Consumer GPUs

Abstract

Deploying deep neural networks in energy-constrained environments requires that inference optimisation strategies, such as runtime backends and reduced numerical precision, reliably deliver their promised gains on the target hardware. Yet how consistently these gains transfer across GPU tiers remains poorly characterised, and practitioners routinely apply optimisations benchmarked on high-end hardware to lower-tier devices without accounting for architectural differences. This paper presents a systematic, energy-aware benchmarking study of 15 image classification architectures evaluated across four runtime configurations on two consumer-grade GPUs: an NVIDIA GeForce GTX 1650 (Turing TU117, no Tensor Cores) and an NVIDIA GeForce RTX 3060 (Ampere GA106, Tensor Core-equipped). Each configuration is measured over ten independent runs at batch sizes 1, 8, and 32, recording per-inference latency, throughput, and energy consumption. Runtime configurations include native PyTorch FP32, ONNX Runtime with the CUDA Execution Provider (ORT-CUDA FP32), and FP16 variants in both PyTorch and ORT; all pairwise runtime differences are assessed via Wilcoxon signed-rank tests with Bonferroni correction. We also evaluate INT8 quantization (CPU) with accuracy agreement checks and use roofline analysis to explain the observed energy paradox.

Results show that ORT-CUDA consistently outperforms native PyTorch on both platforms, with a mean speedup of 2.01× on the GTX 1650 at batch size 1. The benefit is stratified by architecture family: lightweight depthwise-separable CNNs gain 2.79 ± 0.35×, standard CNNs 1.16 ± 0.18×, and LayerNorm-based models 1.05 ± 0.29×. FP16 behaviour, however, is strongly hardware-dependent. On the GTX 1650, which lacks Tensor Cores, FP16 systematically regresses latency across all 15 models: lightweight CNNs slow by a mean of 16%, standard CNNs by 115%, and LayerNorm-based architectures by 188% (up to 279% for ViT-B/16). A numerical stability failure renders EfficientNet-B3 unusable at batch size 32. On the RTX 3060, Tensor Core acceleration yields FP16 speedups only for compute-intensive models at moderate batch sizes (mean 1.95× at batch size 32), while at batch size 1 only attention-based architectures benefit modestly (4–5%). Additionally, FP16 ONNX models for ViT-B/16 and Swin-T fail ONNX Runtime's type checks on both GPUs due to a toolchain limitation in transformer attention layers, highlighting a separate portability constraint. Across platforms, the GTX 1650 outperforms the RTX 3060 under PyTorch FP32 for 12 of 15 models at batch size 1 (FPS ratios 1.59–1.84×), revealing a batch-1 throughput paradox. INT8 quantization on CPU achieves 54–98% energy savings for lightweight models with 98–100% accuracy agreement. Roofline analysis confirms that memory-bound lightweight models (arithmetic intensity < 35 FLOP/byte on the RTX 3060) are responsible for the energy paradox, explaining why models with fewer FLOPs can consume more energy than compute-bound ResNets.

Together, these findings expose a hardware- and toolchain-dependent portability gap that is invisible to single-platform benchmarks, with direct implications for hardware-aware model selection in energy-sensitive deployment scenarios. All experimental scripts, raw results, and intermediate data are publicly available to support reproducibility.

Index Terms: Energy efficiency, deep learning
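The abstract names the two FP32 runtime configurations but not the measurement harness itself. The following is a minimal sketch of how per-inference latency under native PyTorch FP32 and ONNX Runtime with the CUDA Execution Provider might be compared for a single model; the model choice (torchvision's mobilenet_v3_small), warm-up count, and iteration count are illustrative assumptions, not the study's exact protocol.

```python
# Minimal sketch: compare per-inference latency of native PyTorch FP32 vs
# ONNX Runtime (CUDA Execution Provider) for one torchvision model.
# Model choice, warm-up count, and iteration count are illustrative only.
import time
import torch
import torchvision.models as models
import onnxruntime as ort

BATCH = 1
ITERS = 200
model = models.mobilenet_v3_small(weights=None).eval().cuda()
dummy = torch.randn(BATCH, 3, 224, 224, device="cuda")

# --- Native PyTorch FP32 ---
with torch.no_grad():
    for _ in range(20):                       # warm-up
        model(dummy)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(ITERS):
        model(dummy)
    torch.cuda.synchronize()
    end = time.perf_counter()
pt_ms = (end - start) / ITERS * 1e3

# --- Export to ONNX and run under ORT-CUDA ---
torch.onnx.export(model, dummy, "model.onnx", opset_version=17,
                  input_names=["input"], output_names=["output"])
sess = ort.InferenceSession("model.onnx",
                            providers=["CUDAExecutionProvider"])
x = dummy.cpu().numpy()
for _ in range(20):                           # warm-up
    sess.run(None, {"input": x})
start = time.perf_counter()
for _ in range(ITERS):
    sess.run(None, {"input": x})
ort_ms = (time.perf_counter() - start) / ITERS * 1e3

print(f"PyTorch FP32: {pt_ms:.2f} ms  ORT-CUDA FP32: {ort_ms:.2f} ms  "
      f"speedup: {pt_ms / ort_ms:.2f}x")
```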
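The abstract records per-inference energy but does not describe how it is measured. One common approach on NVIDIA GPUs is to sample board power via NVML while the timed loop runs and integrate over time; the sketch below follows that pattern. The sampling period, the trapezoidal integration, and the stand-in workload are assumptions for illustration, not the paper's stated method.

```python
# Sketch: estimating energy by sampling GPU board power via NVML (pynvml)
# while a workload runs, then integrating power over time. The workload
# here is a stand-in; replace it with the actual timed inference loop.
import time
import threading
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []                      # (timestamp, watts)
stop = threading.Event()

def sample_power(period_s=0.01):
    while not stop.is_set():
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        samples.append((time.perf_counter(), watts))
        time.sleep(period_s)

sampler = threading.Thread(target=sample_power, daemon=True)
sampler.start()

time.sleep(2.0)                   # stand-in for the benchmark loop

stop.set()
sampler.join()
pynvml.nvmlShutdown()

# Trapezoidal integration of power over time gives energy in joules;
# dividing by the number of inferences would give per-inference energy.
energy_j = sum((t2 - t1) * (p1 + p2) / 2
               for (t1, p1), (t2, p2) in zip(samples, samples[1:]))
print(f"total energy over the run: {energy_j:.2f} J")
```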
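The statistical procedure named in the abstract (pairwise Wilcoxon signed-rank tests with Bonferroni correction over the runtime configurations) can be expressed along the following lines. The latency arrays are placeholders standing in for the ten paired per-run measurements; only the test-plus-correction structure is the point here.

```python
# Sketch: pairwise Wilcoxon signed-rank tests across runtime configurations
# with Bonferroni correction. The latency arrays below are placeholders for
# the ten paired per-run measurements described in the abstract.
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon

latencies_ms = {                  # 10 paired runs per configuration
    "pytorch_fp32": np.array([5.1, 5.0, 5.2, 5.1, 5.3, 5.0, 5.1, 5.2, 5.0, 5.1]),
    "ort_fp32":     np.array([2.6, 2.5, 2.6, 2.7, 2.5, 2.6, 2.6, 2.5, 2.7, 2.6]),
    "pytorch_fp16": np.array([5.9, 6.0, 5.8, 6.1, 5.9, 6.0, 5.8, 6.1, 5.9, 6.0]),
    "ort_fp16":     np.array([3.0, 2.9, 3.1, 3.0, 2.9, 3.0, 3.1, 2.9, 3.0, 3.0]),
}

pairs = list(combinations(latencies_ms, 2))
alpha = 0.05 / len(pairs)         # Bonferroni-corrected threshold
for a, b in pairs:
    stat, p = wilcoxon(latencies_ms[a], latencies_ms[b])
    verdict = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: p = {p:.4f} ({verdict} at corrected alpha = {alpha:.4f})")
```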
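For the INT8 results, the abstract states only that quantization runs on CPU and is checked for accuracy agreement against the full-precision model. A minimal sketch of that kind of check is shown below using ONNX Runtime's dynamic quantization for brevity; the study's actual INT8 recipe (for example, static quantization with calibration data) is not specified in the abstract, and the input batch and file names are placeholders (model.onnx is assumed to be the FP32 export from the earlier sketch).

```python
# Sketch: INT8 quantization of an ONNX model for CPU execution, with a
# top-1 "accuracy agreement" check against the FP32 model. Dynamic
# quantization and the random placeholder batch are illustrative only.
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic("model.onnx", "model_int8.onnx",
                 weight_type=QuantType.QInt8)

fp32 = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
int8 = ort.InferenceSession("model_int8.onnx", providers=["CPUExecutionProvider"])

x = np.random.rand(32, 3, 224, 224).astype(np.float32)   # placeholder images
pred_fp32 = fp32.run(None, {"input": x})[0].argmax(axis=1)
pred_int8 = int8.run(None, {"input": x})[0].argmax(axis=1)

agreement = (pred_fp32 == pred_int8).mean()
print(f"Top-1 agreement between FP32 and INT8: {agreement:.1%}")
```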
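Finally, the roofline argument in the abstract hinges on arithmetic intensity (FLOPs divided by bytes moved) relative to the GPU's ridge point (peak compute divided by memory bandwidth). The sketch below shows that classification; the nominal RTX 3060 peak-throughput and bandwidth figures and the per-model FLOP/byte counts are illustrative stand-ins, since in the study those would come from profiling.

```python
# Sketch: classifying a model as memory- or compute-bound with the roofline
# model. Peak throughput and bandwidth are nominal RTX 3060 figures used for
# illustration; per-model FLOPs and bytes moved would come from a profiler.
PEAK_FP32_TFLOPS = 12.7           # illustrative nominal FP32 peak
MEM_BANDWIDTH_GBS = 360.0         # illustrative nominal memory bandwidth

ridge_point = (PEAK_FP32_TFLOPS * 1e12) / (MEM_BANDWIDTH_GBS * 1e9)  # FLOP/byte

def classify(flops, bytes_moved):
    """Return arithmetic intensity and roofline regime for one inference."""
    ai = flops / bytes_moved
    regime = "memory-bound" if ai < ridge_point else "compute-bound"
    return ai, regime

# Placeholder numbers for a lightweight CNN at batch size 1.
ai, regime = classify(flops=0.06e9 * 2, bytes_moved=25e6)
print(f"ridge point = {ridge_point:.1f} FLOP/byte, "
      f"arithmetic intensity = {ai:.1f} FLOP/byte -> {regime}")
```

A model whose arithmetic intensity falls below the ridge point (about 35 FLOP/byte for the figures above) is limited by memory traffic rather than compute, which is the mechanism the abstract uses to explain why low-FLOP lightweight models can draw more energy than compute-bound ResNets.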
