EcoRL-Sched: Energy-Aware Heterogeneous GPU–FPGA Task Scheduling for Sustainable RLHF Training Pipelines
Abstract
Reinforcement Learning from Human Feedback (RLHF) has become the dominant post-training paradigm for aligning large language models (LLMs), yet it remains among the most energetically expensive workloads in modern AI infrastructure. Existing RLHF frameworks optimise primarily for throughput on homogeneous GPU clusters, neglecting the severe energy inefficiencies inherent in the multi-stage RLHF pipeline. We identify a fundamental and previously unexploited structural asymmetry: inference stages (Reward Model, Reference Policy, Critic) draw 60–75% less power per GPU than training stages, and their predictable single-pass computation maps naturally to FPGA accelerators. We present EcoRL-Sched, an energy-aware heterogeneous GPU–FPGA task scheduling framework comprising three tightly integrated innovations: (1) a power-profiling subsystem that characterises per-stage, per-model-size energy density via a novel Energy Density Index (EDI) metric; (2) an FPGA offloading engine on Xilinx Alveo U55C that achieves 4.9× better tokens/Joule than H100 GPUs for reward and reference inference and runs concurrently with GPU training via a latency-overlap protocol; and (3) a dynamic scheduler, implemented as a lightweight PPO-trained policy network, that uses real-time power telemetry and ROLL multi-task workloads to minimise pipeline bubbles and idle GPU cycles. Across 8B, 70B, and 405B parameter models on a 32-GPU H100 cluster, EcoRL-Sched achieves up to 14.6× throughput speedup, 38.4% energy reduction, 40.6% CO₂ reduction, and 51% faster convergence on ROLL benchmarks, all without degrading model quality. Lifecycle analysis confirms that net carbon benefits exceed FPGA manufacturing overhead by >30×.
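The abstract names the Energy Density Index (EDI) without defining it. As an illustration only, the sketch below assumes EDI is energy consumed per token processed by a pipeline stage (Joules/token), integrated from fixed-interval power telemetry; the function name and formula are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of an Energy Density Index (EDI). The paper does not
# give the formula; here EDI is ASSUMED to mean Joules per token for one
# pipeline stage, integrated from power telemetry samples.

def energy_density_index(power_samples_w, interval_s, tokens):
    """Integrate power samples (Watts) taken at a fixed interval into
    Joules via rectangle-rule integration, then normalise by the number
    of tokens processed to obtain J/token."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    energy_j = sum(power_samples_w) * interval_s  # W * s = J
    return energy_j / tokens

# Example: a reward-model inference stage drawing ~180 W, sampled 10 times
# at 1 s intervals, while processing 60,000 tokens.
edi = energy_density_index([180.0] * 10, 1.0, 60_000)
print(f"{edi:.4f} J/token")  # 1800 J / 60000 tokens = 0.0300 J/token
```

Under this assumption, a low-power inference stage would show a markedly lower EDI than a training stage, which is the asymmetry the profiling subsystem is described as exploiting.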