A Hierarchical Multi-Agent Reinforcement Learning Framework with High-Level Guidance from Large Language Models
Abstract
Multi-agent reinforcement learning (MARL) has achieved notable progress in cooperative control tasks; however, its scalability and learning efficiency remain limited in environments characterized by sparse rewards, long decision horizons, and complex coordination dynamics. Most existing MARL approaches rely primarily on end-to-end numerical optimization, which makes it difficult to incorporate structured high-level guidance and often leads to unstable training and inefficient exploration in challenging scenarios. In this paper, we propose a hierarchical learning framework that integrates a large language model (LLM) as a high-level guidance module within a value-decomposition-based MARL architecture. Operating at a coarse temporal scale, the LLM generates structured situation abstractions, long-horizon planning directives, and subtask specifications based on semantic environment descriptions. These high-level outputs are grounded through low-level MARL agents, which learn decentralized action policies via standard reinforcement learning optimization. To facilitate effective interaction between semantic guidance and numerical learning, the proposed framework introduces an LLM-driven semantic reward shaping mechanism that maps abstract subgoals into dense auxiliary learning signals, as well as a task-guided policy learning scheme with dynamic action masking to bias exploration toward strategically relevant action subsets during training. Both mechanisms are designed to be algorithm-agnostic and can be integrated into existing MARL methods without modifying their core network architectures. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) benchmark demonstrate that the proposed framework improves learning efficiency, convergence stability, and final performance across a diverse set of cooperative scenarios, particularly in sparse-reward and large-scale settings. 
Additional ablation studies and qualitative analyses further corroborate the effectiveness of hierarchical guidance in accelerating coordinated multi-agent policy learning.
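The two algorithm-agnostic mechanisms described in the abstract, semantic reward shaping and task-guided action masking, can be illustrated with a minimal sketch. All function names, signatures, and the shaping coefficient below are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of the two mechanisms; the interfaces are
# assumptions made for illustration, not the paper's actual code.

def shaped_reward(env_reward, subgoal_progress, prev_progress, beta=0.1):
    """Semantic reward shaping: convert progress toward an LLM-specified
    subgoal (a scalar in [0, 1]) into a dense auxiliary signal added to
    the sparse environment reward."""
    return env_reward + beta * (subgoal_progress - prev_progress)

def masked_action_probs(q_values, task_mask):
    """Dynamic action masking: restrict the exploration distribution to
    the strategically relevant action subset flagged by the high-level
    directive, via a softmax over the unmasked Q-values."""
    q = np.where(task_mask, q_values, -np.inf)  # suppress masked actions
    exp_q = np.exp(q - q.max())                 # stable softmax over allowed actions
    return exp_q / exp_q.sum()
```

Because both operate only on rewards and action distributions, they can wrap an existing value-decomposition learner (e.g., QMIX-style agents) without touching its network architecture, which is the algorithm-agnostic property the abstract claims.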