Performance Enhancement and Energy Consumption Improvement of Convolutional Neural Networks through Architecture-aware Code Optimization
Abstract
Three new architecture-aware code optimization techniques are proposed to improve the efficiency with which convolutional neural networks execute on modern processors. The focus is on reducing the executed instructions and memory accesses of the convolutional layers while opportunistically exploiting data access locality. The proposed post-compiler optimization technique unrolls the innermost loop in a manner that significantly reduces the number of loop-body instructions and memory accesses. It is shown that different loop permutations differ not only in their memory access patterns, which affect the cache miss ratio, but also in the count of executed instructions and memory requests. Processor register reuse is maximized, beyond what compiler optimizations achieve, to reduce the number of memory reference instructions. Evaluation on the gem5 full-system simulator yields a 1.6x performance improvement and a 62% reduction in energy consumption. These enhancements are achieved by a 48.3% reduction in the count of executed instructions and an 80% reduction in the D-cache miss rate, respectively.
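To make the abstract's two central ideas concrete, the sketch below contrasts a naive 1-D convolution inner loop with a hand-optimized variant that fully unrolls the innermost loop and keeps the weights and accumulator in registers. This is an illustrative sketch only, not the paper's actual kernels: the function names, the 1-D simplification, and the fixed kernel width of 3 are assumptions made for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline 1-D convolution: the accumulator lives in out[i], so every
   multiply re-reads and re-writes memory, and w[j] is reloaded on each
   innermost-loop iteration. */
void conv1d_baseline(const float *in, const float *w, float *out,
                     size_t n, size_t k) {
    for (size_t i = 0; i + k <= n; i++) {
        out[i] = 0.0f;
        for (size_t j = 0; j < k; j++)
            out[i] += in[i + j] * w[j];
    }
}

/* Architecture-aware variant for a fixed kernel width of 3: the innermost
   loop is fully unrolled, removing its branch and index-update instructions,
   and the weights plus the accumulator are held in locals so the compiler
   can pin them in registers. Each output costs exactly one memory store. */
void conv1d_k3_unrolled(const float *in, const float *w, float *out,
                        size_t n) {
    const float w0 = w[0], w1 = w[1], w2 = w[2]; /* weights kept in registers */
    for (size_t i = 0; i + 3 <= n; i++) {
        float acc = in[i] * w0;   /* register accumulator, no memory traffic */
        acc += in[i + 1] * w1;
        acc += in[i + 2] * w2;
        out[i] = acc;             /* single store per output element */
    }
}
```

Both routines compute the same result; the difference lies in the instruction count and the number of memory references per output, which is precisely the metric the proposed techniques target.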