An Attention-Enhanced and Exploration-Optimized Algorithm for Value Decomposition in Cooperative Multi-Agent Reinforcement Learning
Abstract
Value decomposition methods in cooperative Multi-Agent Reinforcement Learning (c-MARL) face two key bottlenecks: individual value functions lack awareness of the cooperative context, and a globally uniform exploration rate leads to inefficient exploration. To address these issues, this paper proposes AESO-QMIX (Attention and Exploration Strategy Optimization), which embeds a multi-head self-attention mechanism into the individual Q-networks, enabling context-rich individual value functions, and introduces a dynamic exploration optimization module that adjusts exploration intensity according to the "performance–entropy" gap at both the individual and team levels. While preserving the monotonicity constraint of the mixing network, AESO-QMIX achieves significant performance improvements on the SMAC benchmark: it outperforms strong baselines across all six evaluated maps and attains win rates of 84.8% and 30% on the super-hard scenarios MMM2 and 6h_vs_8z, respectively, substantially outperforming its predecessor QMIX (Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning) and other mainstream methods. Ablation studies further validate the method's components: removing the dynamic exploration optimization module reduces the final win rate on key maps by approximately 6–10 percentage points, and a four-head attention configuration achieves a favorable trade-off between prediction accuracy and computational overhead.
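To make the two mechanisms described above concrete, the sketch below shows one plausible reading of them in PyTorch: an individual Q-network that applies multi-head self-attention over per-entity observation features before producing action values, and a simple rule that scales the exploration rate from a normalized performance–entropy gap. This is not the authors' implementation; all names (AttentiveAgentQNet, adjust_epsilon, n_entities, feat_dim, and the specific update rule) are illustrative assumptions.

```python
# Minimal sketch, assuming per-agent observations can be split into entity features
# and that exploration is epsilon-greedy. Not the paper's actual architecture.

import torch
import torch.nn as nn


class AttentiveAgentQNet(nn.Module):
    """Individual Q-network with multi-head self-attention over entity features."""

    def __init__(self, n_entities: int, feat_dim: int, n_actions: int,
                 embed_dim: int = 64, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(feat_dim, embed_dim)
        # Four heads mirrors the configuration the abstract reports as a good trade-off.
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.q_head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, n_actions),
        )

    def forward(self, obs_entities: torch.Tensor) -> torch.Tensor:
        # obs_entities: (batch, n_entities, feat_dim), one agent's decomposed observation.
        x = self.embed(obs_entities)
        attended, _ = self.attn(x, x, x)   # self-attention over observed entities
        pooled = attended.mean(dim=1)      # aggregate context-enriched features
        return self.q_head(pooled)         # per-action Q-values


def adjust_epsilon(eps: float, performance: float, entropy: float,
                   lr: float = 0.05, eps_min: float = 0.02, eps_max: float = 1.0) -> float:
    """Illustrative exploration update: if normalized performance lags the current
    exploration entropy, raise epsilon; otherwise decay it. The exact AESO-QMIX
    rule is defined in the paper body and is not reproduced here."""
    gap = entropy - performance            # both assumed normalized to [0, 1]
    return float(min(eps_max, max(eps_min, eps + lr * gap)))


if __name__ == "__main__":
    net = AttentiveAgentQNet(n_entities=8, feat_dim=16, n_actions=10)
    q_values = net(torch.randn(32, 8, 16))         # shape (32, 10)
    print(q_values.shape, adjust_epsilon(0.5, performance=0.7, entropy=0.4))
```

Because the attention layer only reshapes each agent's own value function and the mixing network is left untouched, a sketch like this keeps QMIX's monotonicity constraint intact, which matches the claim in the abstract.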