TSPPO: Transformer-Based Sequential Proximal Policy Optimization for Multi-Agent Systems

Abstract

Multi-agent reinforcement learning (MARL) has emerged as a transformative approach for solving complex tasks in dynamic, cooperative environments such as resource allocation, robotics, and swarm control. However, integrating long-term strategic planning with immediate reactive decision-making remains a significant challenge due to the inherent non-stationarity, partial observability, and scalability issues in multi-agent systems. In this paper, we propose a novel framework, Transformer-Based Sequential Proximal Policy Optimization (TSPPO). Specifically, we introduce Contextual State Encoding with Transformers to capture both long-term dependencies and fine-grained temporal dynamics, enabling agents to dynamically balance strategic planning and reactive decision-making. Furthermore, we develop a Pre-order Advantage Correction mechanism that mitigates non-stationarity by correcting the advantage function during sequential policy updates, ensuring stable convergence. To enhance learning efficiency, we propose Sequential Decisions on Marginal Contributions, which prioritizes agents for policy updates according to their estimated contributions to team performance; a conceptual sketch of this sequential update loop is given below. Extensive experiments on benchmark environments, including the StarCraft Multi-Agent Challenge and Multi-Agent MuJoCo, demonstrate that TSPPO consistently outperforms state-of-the-art baselines in convergence speed, stability, and final performance. These results validate the effectiveness of the proposed framework in handling the complex interplay of cooperation and competition in multi-agent systems, setting a new standard for scalable and robust MARL approaches.
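To make the abstract's three components concrete, the following is a minimal, hedged Python/PyTorch sketch of how such a pipeline could fit together: a small Transformer encoder over each agent's recent observations (standing in for Contextual State Encoding), a standard PPO clipped objective, sequential per-agent updates ordered by an externally supplied contribution score (standing in for Sequential Decisions on Marginal Contributions), and a ratio-based re-weighting of the shared advantage after each update (standing in for Pre-order Advantage Correction). All class names, interfaces, and the specific correction rule are assumptions made for illustration; they are not the authors' implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: component names and the advantage-correction rule
# below are assumptions based on the abstract, not the paper's actual code.

class TransformerStateEncoder(nn.Module):
    """Encodes a window of an agent's observations; the last step is the context."""
    def __init__(self, obs_dim: int, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, time, obs_dim) -> (batch, d_model)
        return self.encoder(self.embed(obs_seq))[:, -1]

class Agent(nn.Module):
    """Minimal discrete-action policy head on top of the encoder."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.encoder = TransformerStateEncoder(obs_dim)
        self.head = nn.Linear(64, n_actions)

    def log_prob(self, obs_seq: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        logits = self.head(self.encoder(obs_seq))
        return torch.distributions.Categorical(logits=logits).log_prob(actions)

def ppo_clip_loss(new_logp, old_logp, adv, clip_eps=0.2):
    """Standard PPO clipped surrogate objective for one agent's update."""
    ratio = torch.exp(new_logp - old_logp)
    return -torch.min(ratio * adv,
                      torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()

def sequential_update(agents, optimizers, batch, contribution_scores):
    """Update agents one at a time, highest estimated marginal contribution first,
    re-weighting the shared advantage after each update so later agents see the
    effect of earlier policy changes (a placeholder for the paper's correction)."""
    order = sorted(range(len(agents)),
                   key=lambda i: contribution_scores[i], reverse=True)
    adv = batch["advantage"].clone()
    for i in order:
        new_logp = agents[i].log_prob(batch["obs_seq"][i], batch["actions"][i])
        loss = ppo_clip_loss(new_logp, batch["old_logp"][i], adv)
        optimizers[i].zero_grad()
        loss.backward()
        optimizers[i].step()
        with torch.no_grad():
            # Re-weight the advantage by the updated agent's probability ratio.
            ratio = torch.exp(agents[i].log_prob(batch["obs_seq"][i],
                                                 batch["actions"][i])
                              - batch["old_logp"][i])
            adv = adv * torch.clamp(ratio, 0.5, 2.0)
```

The key design point this sketch tries to convey is the ordering: because agents are updated sequentially rather than jointly, the order in which they are updated and the correction applied to the shared advantage between updates determine how earlier policy changes propagate to later agents.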