A Fully Hardware-Managed Scheduling Architecture for AI Accelerators

Libo Cheng
Liang Yang
Jian Shao
Xinyi Gu
Rong Qian
Xinwei Zhang
Donghao Li
Hongbin Wang
Xiaoqi Xia


Abstract

The proliferation of AI applications across diverse domains has pushed AI accelerators toward higher performance and energy efficiency. This paper addresses the critical challenge of task scheduling in AI accelerators by introducing a fully hardware-managed scheduling system. Our approach uses Operator Completion Status Registers (OCSRs) and a novel computation-scheduling instruction set to minimize software overhead and maximize execution parallelism. The hardware-software co-design comprises: (1) a dedicated hardware scheduling unit with a complete instruction pipeline, (2) a compiler that maps operators to scheduling instructions and manages OCSR allocation, and (3) a lightweight runtime for efficient task dispatch. Experimental results show that the system significantly reduces scheduling latency and improves overall throughput, achieving an average performance gain of approximately 30% across multiple CNN models at an area overhead of only 7.41%. The proposed architecture establishes a new paradigm for high-efficiency AI accelerator design.
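The abstract does not specify the instruction encoding or the OCSR allocation policy. The sketch below shows one way the compiler's role could look, assuming that each operator owns one OCSR completion bit, that each scheduling instruction waits on a bit mask before dispatching its operator, that the scheduler issues instructions strictly in program order, and that a bit can be reused once every consumer of its producer has been issued. `SchedInstr`, `compile_schedule`, and the operator names are hypothetical illustrations, not the authors' instruction set or compiler.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedInstr:
    """Hypothetical scheduling instruction: wait on OCSR bits, dispatch, signal."""
    op: str         # operator handed to the compute engine
    wait_mask: int  # OCSR bits that must all be set before dispatch
    done_bit: int   # OCSR bit cleared at dispatch and set when `op` completes


def compile_schedule(order, deps, num_bits):
    """Map operators (listed in topological order) to scheduling instructions.

    Assumes in-order issue: once the last consumer of a producer has been
    issued, the producer's OCSR bit returns to the free pool for reuse.
    """
    pos = {op: i for i, op in enumerate(order)}
    # Index of the last instruction that waits on each operator's bit.
    # Sinks (no consumers) keep their bit so the runtime can poll completion.
    last_use = {}
    for op in order:
        readers = [pos[c] for c in order if op in deps.get(c, ())]
        last_use[op] = max(readers) if readers else len(order)

    free = list(range(num_bits))
    heapq.heapify(free)  # always hand out the lowest free bit
    bit_of, program = {}, []
    for i, op in enumerate(order):
        srcs = set(deps.get(op, ()))
        wait_mask = 0
        for d in srcs:
            wait_mask |= 1 << bit_of[d]
        if not free:
            # A real compiler might stall or split the graph; the sketch just fails.
            raise RuntimeError(f"no free OCSR bit for {op!r}")
        bit_of[op] = heapq.heappop(free)
        program.append(SchedInstr(op, wait_mask, bit_of[op]))
        # Release producers whose last reader is this instruction. Releasing
        # after allocation keeps done_bit out of this instruction's wait_mask.
        for d in srcs:
            if last_use[d] == i:
                heapq.heappush(free, bit_of[d])
    return program


# Hypothetical residual-style block: add consumes both conv1 and conv2.
order = ["conv1", "relu1", "conv2", "add", "pool"]
deps = {"relu1": ["conv1"], "conv2": ["relu1"],
        "add": ["conv2", "conv1"], "pool": ["add"]}
for ins in compile_schedule(order, deps, num_bits=3):
    print(f"{ins.op:6} wait={ins.wait_mask:03b} done_bit={ins.done_bit}")
# conv1  wait=000 done_bit=0
# relu1  wait=001 done_bit=1
# conv2  wait=010 done_bit=2
# add    wait=101 done_bit=1
# pool   wait=010 done_bit=0
```

In this toy graph five operators fit in three OCSR bits because relu1's bit is recycled for add and conv1's bit for pool; with only two bits the same graph fails at conv2, since conv1's result is still live until add issues. Under these assumptions, the number of OCSR bits bounds how many operator results can be outstanding at once, which is one reason the compiler, rather than the hardware, is a natural place to manage allocation.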

Version published as 10.21203/rs.3.rs-8183074/v1 on Research Square, Dec 12, 2025

Related articles

LoRPIA: Low-power Reconfigurable Pallet-Integrated Accelerator for Depthwise Separable Convolutions
Authors: Sajad Eydivandi, Hakem Beitollahi
Latest version: Jan 8, 2026

Implementation and Performance Optimization of a DPDK Packet Gateway on Manycore CPUs
Author: Daisuke Sugisawa
Latest version: Jan 19, 2026

ZOE: Zero Overhead ECC Techniques for Flash Memory Used in AI Accelerators
Authors: Shyue-Kung Lu, Yi-Zheng Wu
Latest version: Dec 25, 2025
