UAV Navigation using Reinforcement Learning: A Systematic Approach to Progressive Reward Function Design
Abstract
Fixed-wing unmanned aerial vehicles (UAVs) present significant path-following control challenges due to underactuation, coupled dynamics, and stall constraints. These challenges complicate traditional control design and motivate the application of reinforcement learning (RL), which can learn effective policies without explicit aerodynamic models. A key difficulty in RL is reward function design: simple reward functions based solely on position and heading errors frequently produce oscillatory policies that struggle to generalize beyond the paths seen during training. We address these limitations through systematic reward function decomposition, evaluating four progressively complex designs: (I) goal-distance minimization, (II) sequential waypoint navigation, (III) control-smoothness penalties, and (IV) 3D altitude tracking. Each policy is trained in a kinematic fixed-wing simulator and evaluated using reward-agnostic metrics: Path Deviation (mean distance to the reference trajectory) and Oscillation Index (variance of control-rate changes). Across three RL algorithms, Proximal Policy Optimization (PPO), Soft Actor-Critic (SAC), and Twin Delayed Deep Deterministic Policy Gradient (TD3), waypoint-based navigation (Stage II) reduces path deviation by 78–88% compared to goal-based rewards (Stage I), while smoothness penalties (Stage III) decrease control oscillations by 45–82%. The resulting policies maintain 100% success under wind disturbances despite being trained in zero-wind conditions. The framework extends to 3D trajectories (Stage IV), achieving 100% success on both seen and unseen paths while handling wind disturbances. Our results demonstrate that waypoint observations and control-rate penalties are essential components for stable fixed-wing RL control, whereas goal-only rewards consistently produce unstable behavior regardless of the underlying algorithm. This systematic decomposition provides a principled methodology for reward function design in RL-based control of underactuated aerial systems.
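As a concrete illustration of the two reward-agnostic metrics named in the abstract, the sketch below computes Path Deviation (mean distance from the flown trajectory to the reference path) and Oscillation Index (variance of control-rate changes). This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, array shapes, and nearest-point matching are illustrative choices.

```python
import numpy as np


def path_deviation(trajectory: np.ndarray, reference: np.ndarray) -> float:
    """Mean distance from each flown point to its nearest point on the reference path.

    trajectory: (T, d) array of flown positions (d = 2 or 3).
    reference:  (R, d) array of sampled reference-path positions.
    """
    # Pairwise distances between every flown point and every reference point.
    diffs = trajectory[:, None, :] - reference[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # shape (T, R)
    nearest = dists.min(axis=1)               # closest reference point per flown point
    return float(nearest.mean())


def oscillation_index(controls: np.ndarray) -> float:
    """Variance of control-rate changes (first differences of the control signal).

    controls: (T, m) array of commanded controls over time.
    """
    rates = np.diff(controls, axis=0)          # per-step change in each control channel
    return float(rates.var())
```

Under these assumptions, lower values of both metrics correspond to tighter path tracking and smoother actuation, which is how the Stage II and Stage III comparisons reported above are to be read.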