Unified Operator Fusion for Heterogeneous Hardware in ML Inference Frameworks
Abstract
Modern machine learning inference workloads run on a diverse array of hardware accelerators, from cloud GPUs to edge NPUs and FPGAs. Operator fusion, which merges multiple graph operations into a single kernel, has proven highly effective on homogeneous platforms but struggles to generalize across devices with different execution and memory models. We propose Unified Operator Fusion (UOF), a framework that introduces a hardware-agnostic intermediate representation alongside a device-aware cost model. UOF performs graph rewrites to identify and evaluate fusion opportunities, then emits optimized fused kernels tailored to each target. We integrate UOF into an open-source inference engine, equipping it with plugin backends for CUDA, multicore C++ and vendor SDKs. Offline profiling collects device compute peaks, memory bandwidths and kernel-launch latencies; these feed into an automated cost evaluator that balances compute, data movement and launch overhead. On ResNet-50 and BERT-small benchmarks across Intel Xeon CPUs, NVIDIA V100 GPUs and a mobile NPU, UOF delivers up to 3.8× end-to-end speedups over unfused baselines and comes within 5–10% of hand-tuned vendor libraries. An ablation study that removes the cost model leads to over-fusion and up to 15% slowdowns, underscoring the need for hardware-aware fusion decisions. UOF thus offers a unified, extensible fusion strategy that minimizes manual backend engineering while maximizing performance across heterogeneous inference targets.
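The abstract does not spell out the cost model's exact form, but the quantities it profiles (compute peak, memory bandwidth, kernel-launch latency) suggest a roofline-style estimate. The Python sketch below is purely illustrative; the names DeviceProfile, kernel_cost and should_fuse are assumptions for exposition, not UOF's actual API, and the device figures are rough published V100 numbers.

# Illustrative roofline-style fusion cost model; names and formula are
# assumptions, not UOF's actual implementation.
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    peak_flops: float      # device compute peak (FLOP/s), from offline profiling
    mem_bw: float          # memory bandwidth (bytes/s)
    launch_latency: float  # per-kernel launch overhead (s)

def kernel_cost(flops: float, bytes_moved: float, dev: DeviceProfile) -> float:
    # A kernel is bound by either compute or data movement, plus launch overhead.
    return max(flops / dev.peak_flops, bytes_moved / dev.mem_bw) + dev.launch_latency

def should_fuse(a, b, fused, dev: DeviceProfile) -> bool:
    # Fuse only if the fused kernel is cheaper than running both separately.
    # a, b, fused are (flops, bytes_moved) tuples for the candidate kernels.
    return kernel_cost(*fused, dev=dev) < kernel_cost(*a, dev=dev) + kernel_cost(*b, dev=dev)

# Example: elementwise add followed by ReLU on 64 MB float32 tensors.
v100 = DeviceProfile(peak_flops=15.7e12, mem_bw=900e9, launch_latency=5e-6)
add = (16e6, 192e6)     # reads two inputs, writes one intermediate
relu = (16e6, 128e6)    # reads the intermediate, writes the output
fused = (32e6, 192e6)   # intermediate stays on-chip; inputs and output only
print(should_fuse(add, relu, fused, v100))  # True: less traffic, one fewer launch

Under a model like this, fusing two memory-bound kernels pays off because the intermediate tensor never round-trips through device memory and a launch is saved; dropping the device-aware terms lets a fuser merge kernels past the point where those savings hold, which is consistent with the over-fusion slowdowns the ablation reports.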