Robotic pursuit evasion problem in a constrained game area using deep reinforcement learning and self-play training


Abstract

The pursuit-evasion game (PEG) is a dynamic differential game that has received considerable attention thanks to its ability to model many real-life applications in the military, aerospace, and mobile robotics domains. Several mathematical tools and processes have been used to solve such problems, but techniques relying on deep reinforcement learning (DRL) have recently gained popularity, in particular DRL methods adapted to continuous action spaces such as Deep Deterministic Policy Gradient (DDPG). Most of these studies use a two-phase training approach, where in the first phase only the pursuer is trained against a fixed evader trajectory, and in the second phase both DRL agents are trained simultaneously. The first phase requires trajectory generation, which introduces bias, although the approach remains suited to unbounded game areas. In addition, DDPG is known to suffer from value overestimation, which led to the introduction of Twin Delayed DDPG (TD3). Only a small portion of the literature applies TD3 to the one-versus-one pursuit-evasion game, especially in a bounded game area and without a two-phase training approach. This paper explores the one-versus-one pursuit-evasion game in a constrained game area, using two TD3 agents trained simultaneously and from scratch via self-play only. Several reward terms are proposed which, when combined, improve training. Three training alternatives are presented: normal self-play, self-play with a buffer zone, and self-play with noisy actions. The three alternatives yield similar results: both the pursuer and the evader agents find optimal control strategies without any human intervention or trajectory generation. Simulations show that the agents outperform conventional methods such as Non-linear Model Predictive Control (NMPC).

In summary, this study proposes a novel framework for a one-versus-one PEG in a constrained environment, leveraging self-play to train two TD3 agents simultaneously and from scratch. Theoretical contributions include the design of a multi-faceted reward function that integrates game status, game evolution, game duration, and agent performance to enhance training. Practical contributions involve evaluating the three training configurations (normal self-play, self-play with a buffer zone, and noisy actions) and demonstrating their effectiveness. Results show that all configurations enable the agents to discover optimal strategies autonomously, outperforming conventional methods such as NMPC. Simulations reveal the agents' intelligent behaviors, with the pursuer exploiting the game constraints and the evader executing evasive maneuvers. Comparison with existing methods highlights superior capture time and adaptability to noise, paving the way for real-world implementations.
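To make the multi-faceted reward described in the abstract more concrete, the sketch below shows what a combined pursuer/evader reward mixing game status (capture or timeout), game evolution (change in inter-agent distance), and game duration (per-step penalty) might look like. All names, weights, radii, and the boundary-penalty term are assumptions for illustration only and are not taken from the paper's actual formulation.

```python
import numpy as np

# Hypothetical constants; the paper's actual values are not reproduced here.
CAPTURE_RADIUS = 0.5     # distance at which the pursuer catches the evader (assumed)
ARENA_RADIUS   = 10.0    # radius of the constrained (bounded) game area (assumed)
MAX_STEPS      = 500     # episode length limit (assumed)

def pursuer_reward(p_pos, e_pos, prev_dist, step):
    """Combine sparse terminal terms with dense shaping terms (illustrative weights)."""
    dist = np.linalg.norm(p_pos - e_pos)

    # Game status: large terminal bonus/penalty on capture or timeout.
    if dist <= CAPTURE_RADIUS:
        status = +100.0          # capture achieved
    elif step >= MAX_STEPS:
        status = -100.0          # evader survived the whole episode
    else:
        status = 0.0

    # Game evolution: dense reward for closing the distance since the last step.
    evolution = 1.0 * (prev_dist - dist)

    # Game duration: small per-step penalty to encourage fast capture.
    duration = -0.01

    return status + evolution + duration, dist

def evader_reward(p_pos, e_pos, prev_dist, step):
    """Zero-sum-style mirror of the pursuer's reward, plus a boundary penalty."""
    r_p, dist = pursuer_reward(p_pos, e_pos, prev_dist, step)
    # Penalize approaching the arena boundary so the evader does not simply hug the walls.
    boundary = -1.0 if np.linalg.norm(e_pos) > 0.9 * ARENA_RADIUS else 0.0
    return -r_p + boundary, dist
```

In a self-play setup along these lines, both TD3 agents would receive their respective rewards from the same transition and update their own replay buffers and networks each step; the mirrored shaping keeps the game approximately zero-sum while the terminal and boundary terms encode the bounded-arena constraint.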
