Flexible MAC Design for Sparse-Aware Deep Learning Accelerator
Abstract
The increasing deployment of deep convolutional neural networks (DCNNs) in real‑time and resource‑constrained environments has intensified the demand for hardware accelerators capable of efficiently handling sparse and irregular computation patterns. Although systolic arrays offer high throughput, their rigid dataflow leads to severe processing‑element (PE) underutilization when executing unstructured sparse matrix operations, resulting in fragmented computation and unnecessary memory traffic. This work presents a flexible multiply‑accumulate (MAC) architecture that enables sparsity‑aware deep learning accelerators (SA‑DLAs) while supporting both floating‑point and fixed‑point arithmetic within a unified datapath. The proposed architecture dynamically adapts to operand sparsity and data distribution, improving PE utilization without introducing complex control overhead. A complete SA‑DLA engine incorporating the flexible MAC is implemented in TSMC 28‑nm CMOS technology and validated on FPGA. Experimental results demonstrate that the proposed design significantly improves computational efficiency under irregular workloads, achieving lower latency, lower power consumption, and higher energy efficiency than conventional dense systolic‑array‑based accelerators. These results highlight the effectiveness of the proposed architecture for next‑generation sparse‑aware AI hardware systems.
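The utilization argument above can be illustrated with a minimal sketch: a sparsity‑aware PE skips a multiply‑accumulate whenever either operand is zero, whereas a dense systolic‑array PE would still spend a cycle on it. This is only an behavioral illustration of the general zero‑skipping idea, not the paper's MAC datapath or RTL; the function name and counters are hypothetical.

```python
def sparse_mac(weights, activations):
    """Accumulate w*a only for nonzero operand pairs; count skipped ops.

    Illustrative behavioral model of zero-skipping (not the paper's design).
    """
    acc = 0
    skipped = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:
            skipped += 1  # a dense PE would still burn a cycle on this pair
            continue
        acc += w * a
    return acc, skipped

# Example: half of the operand pairs contribute nothing to the result.
acc, skipped = sparse_mac([3, 0, -2, 0], [1, 5, 4, 0])
# acc = 3*1 + (-2)*4 = -5; skipped = 2
```

The ratio of skipped to total pairs is a simple proxy for the wasted cycles a rigid dense dataflow incurs on unstructured sparsity, which is the gap the flexible MAC targets.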