Adaptive Dataflow and Precision Optimization for Deep Learning on Configurable Hardware Architectures
Abstract
As deep learning continues to revolutionize a wide range of domains, from computer vision and natural language processing to autonomous systems and edge computing, the demand for efficient, scalable, and domain-adaptable neural network acceleration has never been more critical. While Graphics Processing Units (GPUs) and Application-Specific Integrated Circuits (ASICs) have traditionally dominated the hardware landscape for both training and inference, Field-Programmable Gate Arrays (FPGAs) have recently gained significant traction due to their unique combination of reconfigurability, energy efficiency, and support for highly customized computation. This review presents a comprehensive and in-depth analysis of FPGA-based neural network accelerators, elucidating their architectural foundations, design methodologies, comparative performance characteristics, and deployment challenges in the context of modern machine learning workloads.

We begin by examining the core motivations behind using FPGAs for deep learning, highlighting their suitability for low-latency, high-throughput inference, especially in power- and resource-constrained environments such as edge devices and embedded platforms. The ability to define custom data paths, implement novel numeric representations, and tailor memory hierarchies enables FPGAs to execute specialized models with high efficiency, often outperforming GPUs in terms of energy per operation. The review then delves into the major design patterns and architectural strategies employed in FPGA-based accelerators, including systolic arrays, streaming dataflows, loop unrolling, pipelining, and parallelism at various levels of the computation graph. State-of-the-art compilation frameworks and high-level synthesis tools such as Vitis AI, hls4ml, and FINN are discussed in detail, alongside recent advances in quantization, pruning, and model compression techniques that enhance the viability of FPGA deployment.

A detailed comparison with GPU- and ASIC-based accelerators is presented, evaluating trade-offs across performance, flexibility, power efficiency, development complexity, and cost. Our findings suggest that FPGAs occupy a compelling middle ground between the general-purpose programmability of GPUs and the ultra-efficient specialization of ASICs, making them particularly well-suited for inference at the edge and in scenarios requiring frequent model updates or architectural experimentation. However, the adoption of FPGAs remains hindered by steep learning curves, toolchain immaturity, and limitations in dynamic runtime adaptability, resource utilization, and developer accessibility. To address these challenges, we survey emerging directions in FPGA research, including adaptive compute fabrics, hardware-software co-design automation, chiplet-based integration, support for dynamic workloads, and secure deployment frameworks.

In conclusion, this review articulates the pivotal role that FPGAs can play in the future of AI acceleration. By bridging the gap between general-purpose and application-specific hardware, and by enabling fine-grained control over computation and memory, FPGA-based accelerators offer a highly versatile platform for deploying neural networks in increasingly diverse and demanding operational contexts. Through continued innovation in compiler technologies, hardware architectures, and cross-layer optimization methodologies, the FPGA ecosystem has the potential to evolve into a mainstream enabler of efficient, scalable, and adaptive machine learning systems.
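To make the precision-optimization theme concrete, the sketch below illustrates symmetric fixed-point weight quantization, the kind of model compression step that toolflows such as FINN and hls4ml automate before hardware synthesis. It is a minimal, self-contained illustration rather than code from any of the reviewed frameworks; the function names, the 8-bit width, and the single shared scale factor are assumptions made for this example.

    # Illustrative sketch only (not code from the reviewed toolflows): symmetric
    # fixed-point post-training quantization of a weight tensor, the kind of
    # precision reduction that makes FPGA deployment of neural networks viable.
    # The 8-bit width and round-to-nearest scheme are assumptions for the example.
    import numpy as np

    def quantize_symmetric(weights: np.ndarray, n_bits: int = 8):
        """Map a float tensor to signed n_bits integers with one shared scale."""
        qmax = 2 ** (n_bits - 1) - 1                    # e.g. 127 for 8 bits
        scale = float(np.max(np.abs(weights))) / qmax   # largest weight maps to qmax
        q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int32)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        """Recover an approximate float tensor to check the quantization error."""
        return q.astype(np.float32) * scale

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        w = rng.standard_normal((64, 64)).astype(np.float32)
        q, s = quantize_symmetric(w, n_bits=8)
        max_err = float(np.max(np.abs(w - dequantize(q, s))))
        print(f"max absolute quantization error: {max_err:.5f}")

Quantizing weights and activations in this way shrinks on-chip memory footprints and replaces floating-point arithmetic with narrow integer operations, which is what allows FPGA accelerators to instantiate many more multiply-accumulate units per device and to reach the energy-per-operation advantages discussed above.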