Unified Operator Fusion for Heterogeneous Hardware in ML Inference Frameworks
Abstract
Modern machine learning inference workloads run on a diverse array of hardware accelerators, from cloud GPUs to edge NPUs and FPGAs. Operator fusion, which merges multiple graph operations into a single kernel, has proven highly effective on homogeneous platforms but struggles to generalize across devices with different execution and memory models. We propose Unified Operator Fusion (UOF), a framework that introduces a hardware-agnostic intermediate representation alongside a device-aware cost model. UOF performs graph rewrites to identify and evaluate fusion opportunities, then emits optimized fused kernels tailored to each target. We integrate UOF into an open-source inference engine, equipping it with plugin backends for CUDA, multicore C++ and vendor SDKs. Offline profiling collects device compute peaks, memory bandwidths and kernel-launch latencies; these feed into an automated cost evaluator that balances compute, data movement and launch overhead. On ResNet-50 and BERT-small benchmarks across Intel Xeon CPUs, NVIDIA V100 GPUs and a mobile NPU, UOF delivers up to 3.8× end-to-end speedups over unfused baselines and comes within 5–10% of hand-tuned vendor libraries. An ablation study that removes the cost model leads to over-fusion and up to 15% slowdowns, underscoring the need for hardware-aware fusion decisions. UOF thus offers a unified, extensible fusion strategy that minimizes manual backend engineering while maximizing performance across heterogeneous inference targets.
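The abstract does not spell out the cost model's exact form, but the quantities it profiles (compute peak, memory bandwidth, kernel-launch latency) suggest a roofline-style estimate. The Python sketch below is purely illustrative; the names DeviceProfile, kernel_cost and should_fuse are assumptions for exposition, not UOF's actual API, and the device figures are rough published V100 numbers.

# Illustrative roofline-style fusion cost model; names and formula are
# assumptions, not UOF's actual implementation.
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    peak_flops: float      # device compute peak (FLOP/s), from offline profiling
    mem_bw: float          # memory bandwidth (bytes/s)
    launch_latency: float  # per-kernel launch overhead (s)

def kernel_cost(flops: float, bytes_moved: float, dev: DeviceProfile) -> float:
    # A kernel is bound by either compute or data movement, plus launch overhead.
    return max(flops / dev.peak_flops, bytes_moved / dev.mem_bw) + dev.launch_latency

def should_fuse(a, b, fused, dev: DeviceProfile) -> bool:
    # Fuse only if the fused kernel is cheaper than running both separately.
    # a, b, fused are (flops, bytes_moved) tuples for the candidate kernels.
    return kernel_cost(*fused, dev=dev) < kernel_cost(*a, dev=dev) + kernel_cost(*b, dev=dev)

# Example: elementwise add followed by ReLU on 64 MB float32 tensors.
v100 = DeviceProfile(peak_flops=15.7e12, mem_bw=900e9, launch_latency=5e-6)
add = (16e6, 192e6)     # reads two inputs, writes one intermediate
relu = (16e6, 128e6)    # reads the intermediate, writes the output
fused = (32e6, 192e6)   # intermediate stays on-chip; inputs and output only
print(should_fuse(add, relu, fused, v100))  # True: less traffic, one fewer launch

Under a model like this, fusing two memory-bound kernels pays off because the intermediate tensor never round-trips through device memory and a launch is saved; dropping the device-aware terms lets a fuser merge kernels past the point where those savings hold, which is consistent with the over-fusion slowdowns the ablation reports.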