Optimized Design of a Lightweight NPU Accelerator for the Internet of Things Based on Mixed-Precision Convolution and Systolic Arrays
Abstract
A lightweight neural processing unit (NPU) accelerator is crucial for improving the computational efficiency of neural radiance field (NeRF) 3D reconstruction tasks on IoT terminals. To address the high computational complexity and large storage overhead of the NeRF model, this study designs a lightweight NPU accelerator based on mixed-precision convolution, storing the convolution kernel weights in half-precision floating point. The design combines a three-level hierarchical storage architecture with an output-stationary dataflow. It further introduces a 16×16 systolic array, builds a collaborative "mixed-precision convolution + systolic array" architecture, and designs a weight-stationary systolic dataflow. The results show that the accelerator completes 2304 multiply-accumulate operations in a single clock cycle. The average single-scene NeRF inference latency is as low as 0.92 s, the average frame rate reaches 1.18 FPS, and the effective operation ratio is 94.15%. The average model storage footprint is 10.44 MB, the peak off-chip memory access bandwidth is only 0.96 GB/s, and NeRF reconstruction accuracy is high, with a structural similarity index measure (SSIM) of 0.98. In summary, the accelerator achieves a comprehensive balance of high performance, light weight, low power consumption, and high precision, and can provide key hardware support for the efficient deployment of NeRF 3D reconstruction tasks on IoT terminals.
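The two core ideas in the abstract, FP16 weight storage and a weight-stationary systolic dataflow, can be illustrated with a toy simulation. The sketch below is not the accelerator's actual hardware design; it is a minimal Python model in which each processing element (PE) holds one weight rounded to IEEE-754 half precision while activations stream past, and all names (`PE`, `systolic_matvec`) are hypothetical.

```python
# Illustrative sketch, not the paper's RTL: a tiny weight-stationary array
# with FP16-quantized weights (the paper stores kernel weights in half
# precision) computing a matrix-vector product.
import struct

def to_fp16(x: float) -> float:
    """Round a value to IEEE-754 half precision (struct format code 'e')."""
    return struct.unpack('e', struct.pack('e', x))[0]

class PE:
    """One processing element: keeps a stationary weight, multiply-accumulates."""
    def __init__(self, weight: float):
        self.weight = to_fp16(weight)  # weight stays resident for the whole run
        self.acc = 0.0

    def step(self, activation: float) -> None:
        self.acc += activation * self.weight

def systolic_matvec(weights, activations):
    """Compute y = W @ x with one row of PEs per output element.
    Weights never move; one activation is streamed per cycle to its
    column of PEs, and each output is the sum of partial sums along a row."""
    pe_rows = [[PE(w) for w in row] for row in weights]
    for j, a in enumerate(activations):   # cycle j: activation a_j enters column j
        for row in pe_rows:
            row[j].step(a)
    return [sum(pe.acc for pe in row) for row in pe_rows]

W = [[0.5, -1.0], [2.0, 0.25]]   # all values exactly representable in FP16
x = [4.0, 8.0]
print(systolic_matvec(W, x))     # → [-6.0, 10.0]
```

Scaling this pattern to the 16×16 array described in the abstract, with multiple MACs per PE per cycle, is how such a design can reach thousands of multiply-accumulate operations per clock.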