Unified Operator Fusion for Heterogeneous Hardware in ML Inference Frameworks

Abstract

Modern machine learning inference workloads run on a diverse array of hardware accelerators, from cloud GPUs to edge NPUs and FPGAs. Operator fusion, which merges multiple graph operations into a single kernel, has proven highly effective on homogeneous platforms but struggles to generalize across devices with different execution and memory models. We propose Unified Operator Fusion (UOF), a framework that introduces a hardware-agnostic intermediate representation alongside a device-aware cost model. UOF performs graph rewrites to identify and evaluate fusion opportunities, then emits optimized fused kernels tailored to each target. We integrate UOF into an open-source inference engine, equipping it with plugin backends for CUDA, multicore C++, and vendor SDKs. Offline profiling collects device compute peaks, memory bandwidths, and kernel-launch latencies; these feed into an automated cost evaluator that balances compute, data movement, and launch overhead. On ResNet-50 and BERT-small benchmarks across Intel Xeon CPUs, NVIDIA V100 GPUs, and a mobile NPU, UOF delivers up to 3.8× end-to-end speedups over unfused baselines and matches hand-tuned vendor libraries to within 5–10%. An ablation study that removes the cost model results in over-fusion and slowdowns of up to 15%, underscoring the need for hardware-aware fusion decisions. UOF thus offers a unified, extensible fusion strategy that minimizes manual backend engineering while maximizing performance across heterogeneous inference targets.
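The cost evaluator described above weighs compute, data movement, and launch overhead when deciding whether a fusion is profitable on a given device. As a rough illustration of how such a device-aware estimate can work, the sketch below uses a simple roofline-style model; the DeviceProfile and Op classes, the formulas, and all numbers are illustrative assumptions for this sketch, not the paper's actual implementation.

# Minimal sketch (illustrative, not the paper's code) of a roofline-style
# fusion cost model over hypothetical device profiles collected offline.
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    peak_flops: float       # sustained compute peak, FLOP/s
    mem_bandwidth: float    # memory bandwidth, bytes/s
    launch_latency: float   # per-kernel launch overhead, seconds

@dataclass
class Op:
    flops: float            # floating-point operations performed
    bytes_in: float         # input bytes read from memory
    bytes_out: float        # output bytes written to memory

def kernel_time(flops: float, bytes_moved: float, dev: DeviceProfile) -> float:
    """Roofline estimate: a kernel is bound by either compute or memory
    traffic, plus a fixed launch cost."""
    return max(flops / dev.peak_flops,
               bytes_moved / dev.mem_bandwidth) + dev.launch_latency

def unfused_time(ops: list[Op], dev: DeviceProfile) -> float:
    """Each op runs as a separate kernel; every intermediate tensor
    round-trips through device memory."""
    return sum(kernel_time(op.flops, op.bytes_in + op.bytes_out, dev)
               for op in ops)

def fused_time(ops: list[Op], dev: DeviceProfile) -> float:
    """One fused kernel: intermediates stay on-chip, so only the first op's
    inputs and the last op's outputs touch memory; one launch is paid."""
    flops = sum(op.flops for op in ops)
    bytes_moved = ops[0].bytes_in + ops[-1].bytes_out
    return kernel_time(flops, bytes_moved, dev)

def should_fuse(ops: list[Op], dev: DeviceProfile) -> bool:
    """Fuse only when the estimate predicts a win on this device."""
    return fused_time(ops, dev) < unfused_time(ops, dev)

# Example with made-up numbers: a conv followed by two elementwise ops on a
# GPU-like profile. Fusion wins because the saved intermediate traffic and
# launch costs outweigh the unchanged compute time.
gpu = DeviceProfile(peak_flops=14e12, mem_bandwidth=900e9, launch_latency=5e-6)
chain = [Op(flops=2e9, bytes_in=8e6, bytes_out=8e6),
         Op(flops=4e6, bytes_in=8e6, bytes_out=8e6),
         Op(flops=4e6, bytes_in=8e6, bytes_out=8e6)]
print(should_fuse(chain, gpu))  # True on this profile

A model of this shape also explains the ablation result: without per-device profiles, the evaluator cannot tell when fused kernels become compute-bound or exceed on-chip capacity, so it over-fuses and slows down.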