DLPack: A DSP-Based Low-Bitwidth Packing Architecture for Efficient 2-Bit CNN Inference on FPGA-based Edge Devices
Abstract
Convolutional neural networks (CNNs) have become a fundamental component of modern deep learning, particularly in intelligent edge systems. However, deploying CNNs on such platforms is challenging due to stringent power and compute constraints. Field-programmable gate arrays (FPGAs), with their reconfigurability and parallelism, offer a promising solution, yet their arithmetic resources are often used inefficiently for low-bitwidth inference. In this work, we present DLPack, a lightweight FPGA accelerator designed specifically for 2-bit CNN inference. DLPack introduces a structured packing technique that combines multiple low-precision multiply-accumulate (MAC) operations within a single DSP block, substantially increasing compute density and resource efficiency. The architecture further incorporates a tile-wise dataflow strategy and a streamlined control mechanism to reduce latency and power consumption. Implemented on a Xilinx UltraScale+ FPGA, DLPack achieves up to a 50% reduction in DSP usage, 83% lower power consumption, and roughly 99% lower inference latency than existing approaches. These results demonstrate that DLPack enables scalable, energy-efficient CNN inference on edge devices with limited computational budgets.
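To make the packing idea concrete, below is a minimal C sketch of how several 2-bit MACs can share a single wide hardware multiply, which is the principle a DSP-based packing scheme exploits. It assumes unsigned 2-bit operands and 8-bit guard lanes purely for illustration; the abstract does not specify DLPack's actual lane layout, operand signedness, or DSP mapping, so every constant and helper name here is hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch (not DLPack's actual format): pack several 2-bit
 * activations into one wide multiplicand so that a single multiply, as
 * performed by one DSP block, yields several MAC partial products at once.
 * An 8-bit lane leaves guard bits: each 2-bit x 2-bit product fits in
 * 4 bits, so many such products can accumulate per lane without one
 * lane overflowing into the next. */

#define LANES      4
#define LANE_BITS  8

/* Pack LANES unsigned 2-bit activations into one 32-bit word. */
static uint32_t pack_acts(const uint8_t a[LANES]) {
    uint32_t packed = 0;
    for (int i = 0; i < LANES; i++)
        packed |= (uint32_t)(a[i] & 0x3) << (i * LANE_BITS);
    return packed;
}

/* One wide multiply computes w * a[i] in every lane simultaneously,
 * and the accumulator gathers per-lane sums across MAC steps. */
static void packed_mac(uint64_t *acc, uint32_t packed_acts, uint8_t w) {
    *acc += (uint64_t)packed_acts * (w & 0x3);
}

/* Extract the accumulated sum held in lane i. */
static uint32_t lane(uint64_t acc, int i) {
    return (uint32_t)(acc >> (i * LANE_BITS)) & ((1u << LANE_BITS) - 1);
}

int main(void) {
    const uint8_t acts[2][LANES] = { {1, 2, 3, 0}, {2, 2, 1, 3} };
    const uint8_t wts[2] = {3, 1};
    uint64_t acc = 0;

    for (int t = 0; t < 2; t++)      /* two MAC steps, one multiply each */
        packed_mac(&acc, pack_acts(acts[t]), wts[t]);

    for (int i = 0; i < LANES; i++)  /* expected lane sums: 5, 8, 10, 3 */
        printf("lane %d: %u\n", i, lane(acc, i));
    return 0;
}
```

Because each 2-bit-by-2-bit product needs at most 4 bits, the 8-bit lanes in this sketch absorb accumulation across many MAC steps before any lane can spill into its neighbor; signed operands would additionally require correction terms that the sketch omits.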