Offline Multi-Agent Reinforcement Learning via In‑Sample Sequential Policy Optimization (InSPO)
The paper introduces InSPO, an offline multi‑agent reinforcement‑learning algorithm that combines a behavior‑regularized Markov game with in‑sample sequential policy updates. It uses reverse KL divergence and maximum‑entropy regularization to avoid out‑of‑distribution (OOD) joint actions and improve coordination, and it is shown to improve monotonically and converge to a quantal response equilibrium (QRE). The method is validated on XOR, Bridge, and StarCraft II benchmarks.
Cooperative Markov Game
Offline multi‑agent reinforcement learning (MARL) is framed as a cooperative Markov game G=⟨N,S,A,P,r,γ,d⟩, where N is the set of agents, S the state space, A the joint action space, P the transition kernel, r a shared reward, γ the discount factor, and d the initial‑state distribution. At each timestep, each agent i selects an action a_i in state s; the agents then receive the shared reward r, and the environment transitions to the next state according to P.
IGM Principle and Value Decomposition
Directly learning a joint Q‑function Q(s,a) is intractable because the joint action space grows exponentially with the number of agents. Value‑decomposition methods rewrite Q(s,a) as a combination of individual Q_i(s_i,a_i) functions under the Individual‑Global‑Max (IGM) principle, which assumes that the optimal joint action can be obtained by taking each agent’s greedy action with respect to its own Q_i. The paper notes that IGM can break down when the environment exhibits multimodal reward structures, that is, when several distinct joint actions are optimal.
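To make the multimodal failure case concrete, here is a minimal sketch on a hypothetical two‑agent XOR‑style payoff matrix (the matrix and the helper `fit_additive` are illustrative and not taken from the paper). It fits a purely additive decomposition q1[a1] + q2[a2] by least squares, which is one simple form of value decomposition.

```python
import numpy as np

# Hypothetical 2x2 XOR-style payoff: reward 1 only when the agents pick different actions.
payoff = np.array([[0.0, 1.0],
                   [1.0, 0.0]])

def fit_additive(payoff):
    """Least-squares fit of payoff[a1, a2] ~ q1[a1] + q2[a2]."""
    n1, n2 = payoff.shape
    rows, targets = [], []
    for a1 in range(n1):
        for a2 in range(n2):
            x = np.zeros(n1 + n2)
            x[a1] = 1.0          # indicator for agent 1's action
            x[n1 + a2] = 1.0     # indicator for agent 2's action
            rows.append(x)
            targets.append(payoff[a1, a2])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return theta[:n1], theta[n1:]

q1, q2 = fit_additive(payoff)
fitted = q1[:, None] + q2[None, :]   # decomposed estimate of every joint action
print(fitted)
```

In this toy case the additive fit assigns the same value, 0.5, to every joint action, so greedy per‑agent action selection cannot distinguish the two optimal joint actions from the two worst ones.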
Behavior‑Regularized Markov Game for Offline MARL
To mitigate the distribution shift caused by out‑of‑distribution (OOD) actions in offline MARL, a behavior‑regularized Markov game adds a data‑dependent penalty to the reward, encouraging policies to stay close to the behavior policy that generated the dataset. The objective maximizes the expected discounted return minus this regularization term, trading off return against closeness to the data and helping to prevent convergence to sub‑optimal local optima.
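One way to write such an objective is sketched below, assuming the penalty is a divergence D between the learned joint policy π and the behavior policy π_b, weighted by a temperature α. The symbols D, π_b, and the exact placement of α are notation chosen here for illustration and may differ from the paper's.

```latex
J(\pi) \;=\; \mathbb{E}_{s_0 \sim d,\; \mathbf{a}_t \sim \pi,\; s_{t+1} \sim P}
\left[ \sum_{t=0}^{\infty} \gamma^{t}
\Big( r(s_t, \mathbf{a}_t) \;-\; \alpha\, D\big(\pi(\cdot \mid s_t),\, \pi_b(\cdot \mid s_t)\big) \Big) \right]
```

A larger α makes the learned policy more conservative, and α → 0 recovers the unregularized return.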
In‑Sample Sequential Policy Optimization (InSPO)
InSPO operates within the behavior‑regularized Markov game, combining reverse KL divergence and maximum‑entropy regularization. By updating agents’ policies sequentially, it avoids selecting OOD joint actions and enhances inter‑agent coordination.
Mathematical Derivation
The core idea is to regularize the learned policy toward the behavior policy using the reverse KL divergence D_{KL}(π‖π_b), whose expectation is taken under the learned policy π. For policies that factorize across agents, this term decomposes into a sum of per‑agent divergences, enabling sequential updates. Applying the Karush‑Kuhn‑Tucker (KKT) conditions to the regularized objective yields a closed‑form optimal policy, and each agent’s policy is then updated by minimizing the KL divergence to that closed‑form target, which keeps the updates consistent.
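The following is a minimal numerical sketch of these two facts for a single state with discrete actions, assuming the per‑state objective is E_π[Q] − α·KL(π‖π_b). Under that assumption the maximizer is proportional to π_b(a)·exp(Q(a)/α). All names and numbers are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

def regularized_objective(pi, q_values, pi_b, alpha):
    """Expected Q under pi minus alpha * KL(pi || pi_b)."""
    return float(pi @ q_values) - alpha * kl(pi, pi_b)

def closed_form_policy(q_values, pi_b, alpha):
    """pi*(a) proportional to pi_b(a) * exp(Q(a) / alpha)."""
    logits = np.log(pi_b) + q_values / alpha
    logits -= logits.max()               # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(0)
q_values = np.array([1.0, 0.2, -0.5])    # illustrative action values
pi_b = np.array([0.1, 0.6, 0.3])         # illustrative behavior policy
alpha = 0.5

pi_star = closed_form_policy(q_values, pi_b, alpha)
best = regularized_objective(pi_star, q_values, pi_b, alpha)

# Score many random policies; none should beat the closed-form solution.
random_scores = [
    regularized_objective(rng.dirichlet(np.ones(3)), q_values, pi_b, alpha)
    for _ in range(1000)
]

# KL between product (factorized) policies equals the sum of per-agent KLs.
p1, p2 = np.array([0.7, 0.3]), np.array([0.4, 0.6])
b1, b2 = np.array([0.5, 0.5]), np.array([0.2, 0.8])
joint_kl = kl(np.outer(p1, p2).ravel(), np.outer(b1, b2).ravel())
sum_kl = kl(p1, b1) + kl(p2, b2)
```

Because the closed‑form policy only reweights π_b, any action with zero probability under the behavior policy keeps zero probability, which is what keeps the update inside the data distribution.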
Maximum‑Entropy Behavior‑Regularized Markov Game (MEBR‑MG)
To further encourage exploration, InSPO adds an entropy term H(π) to the objective, forming the MEBR‑MG framework. The combined objective (a negative KL penalty plus an entropy bonus) balances actions that are frequent in the behavior policy against those that are rare, steering the policy toward the quantal response equilibrium (QRE), which remains stable under perturbed rewards.
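As a rough illustration of how the entropy term shifts probability toward actions that are rare in the data, the sketch below solves the single‑state objective E_π[Q] − α·KL(π‖π_b) + β·H(π) in closed form. The separate entropy weight β and the function name `mebr_policy` are assumptions made here; the paper may parameterize the two terms differently.

```python
import numpy as np

def mebr_policy(q_values, pi_b, alpha, beta):
    """Maximizer of E_pi[Q] - alpha*KL(pi||pi_b) + beta*H(pi) over the simplex.

    Setting the gradient of the Lagrangian to zero gives
    pi*(a) proportional to pi_b(a)**(alpha/(alpha+beta)) * exp(Q(a)/(alpha+beta)).
    """
    logits = (alpha * np.log(pi_b) + q_values) / (alpha + beta)
    logits -= logits.max()               # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

q_values = np.array([1.0, 0.0])          # action 0 is better ...
pi_b = np.array([0.01, 0.99])            # ... but rare in the dataset
alpha = 1.0

without_entropy = mebr_policy(q_values, pi_b, alpha, beta=0.0)
with_entropy = mebr_policy(q_values, pi_b, alpha, beta=1.0)
print(without_entropy, with_entropy)
```

With β > 0 the behavior policy enters with an exponent below one, so its influence is flattened and the better but rarely sampled action receives more probability than it does with KL regularization alone.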
Algorithm Details
Algorithm 1: InSPO Steps
Input: offline dataset D, initial policy, and initial Q‑functions.
Output: final policy.
Compute a behavior policy via simple behavior cloning.
Iteratively optimize: at each iteration compute the current Q‑functions.
Randomly permute the agents and update each agent’s policy in sequence.
For each agent, update the policy using the derived objective function.
Repeat until a predefined number of iterations K is reached.
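The steps above can be sketched for a stateless, tabular two‑agent game as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: local Q‑values are estimated with self‑normalized importance weights over dataset samples, the policy update uses the closed form of a KL‑plus‑entropy objective, and every function name, the payoff matrix, and the values of K, alpha, and beta are hypothetical.

```python
import numpy as np

def behavior_cloning(actions, n_actions, eps=1e-3):
    """Per-agent empirical action frequencies with light smoothing."""
    policies = []
    for i in range(actions.shape[1]):
        counts = np.bincount(actions[:, i], minlength=n_actions) + eps
        policies.append(counts / counts.sum())
    return policies

def local_q(i, actions, rewards, policies, behavior, n_actions):
    """Estimate Q_i(a_i) from dataset samples only, reweighted toward
    the teammates' current policies."""
    weights = np.ones(len(rewards))
    for j in range(actions.shape[1]):
        if j != i:
            weights *= policies[j][actions[:, j]] / behavior[j][actions[:, j]]
    q = np.zeros(n_actions)
    for a in range(n_actions):
        mask = actions[:, i] == a
        total = weights[mask].sum()
        if total > 0:
            q[a] = np.sum(weights[mask] * rewards[mask]) / total
    return q

def update_policy(q, pi_b, alpha, beta):
    """Closed-form maximizer of E[Q] - alpha*KL(pi||pi_b) + beta*H(pi)."""
    logits = (alpha * np.log(pi_b) + q) / (alpha + beta)
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

def inspo_sketch(actions, rewards, n_actions, K=20, alpha=0.1, beta=0.05, seed=0):
    rng = np.random.default_rng(seed)
    behavior = behavior_cloning(actions, n_actions)      # behavior cloning step
    policies = [p.copy() for p in behavior]              # initialize at behavior
    for _ in range(K):                                   # K outer iterations
        for i in rng.permutation(actions.shape[1]):      # random agent order
            # Q_i is recomputed after each teammate update, so agent i
            # responds to the teammates' latest policies.
            q_i = local_q(i, actions, rewards, policies, behavior, n_actions)
            policies[i] = update_policy(q_i, behavior[i], alpha, beta)
    return policies

# Hypothetical game: joint action (0, 0) pays 1.0 and (1, 1) pays 0.5.
payoff = np.array([[1.0, 0.0], [0.0, 0.5]])
actions = np.tile(np.array([[0, 0], [0, 1], [1, 0], [1, 1]]), (25, 1))
rewards = payoff[actions[:, 0], actions[:, 1]]
final = inspo_sketch(actions, rewards, n_actions=2)
print([p.round(3) for p in final])
```

In this toy dataset every joint action appears equally often, and both agents end up preferring action 0, the higher‑paying of the two coordinated outcomes.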
Policy Evaluation
Policy evaluation uses local Q‑functions to approximate the global Q‑function, updating them sequentially so they reflect the latest policies. An importance‑re‑sampling technique constructs a new dataset with reduced variance, stabilizing training.
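The summary does not spell out the resampling procedure, so the following is only a sketch of standard sampling‑importance‑resampling, which is one common way to realize this idea: dataset indices are drawn with probability proportional to their importance weights, and the resampled data are then used with uniform weights. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def importance_resample(weights, n_samples, rng):
    """Draw dataset indices with probability proportional to their weights."""
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()
    return rng.choice(len(probs), size=n_samples, replace=True, p=probs)

rng = np.random.default_rng(0)
values = rng.random(1000)                 # stand-in for per-sample targets
weights = rng.exponential(size=1000)      # stand-in for importance ratios

# Self-normalized weighted estimate computed directly from the ratios.
weighted_mean = float(np.sum(weights * values) / np.sum(weights))

# Equivalent estimate from a resampled dataset with uniform weights.
idx = importance_resample(weights, n_samples=20000, rng=rng)
resampled_mean = float(values[idx].mean())
print(weighted_mean, resampled_mean)
```

After resampling, each training sample contributes equally to the loss, so a few very large ratios no longer dominate individual gradient steps.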
Policy Improvement
After obtaining updated local Q‑functions, the algorithm minimizes KL divergence to improve each agent’s policy while preserving behavior‑policy characteristics, guaranteeing monotonic improvement.
Implementation Optimizations
Local Q‑function optimization: avoids exponential growth of the joint action space by approximating with per‑agent Q‑functions.
Importance re‑sampling: reduces variance of importance‑sampling ratios.
Automatic temperature α tuning: dynamically adjusts the conservatism level based on target values.
Experimental Validation
Experiments were conducted on three domains.
M‑NE Game
Two datasets were used: a balanced dataset collected by a uniform policy and an imbalanced dataset collected by a near‑local‑optimal policy. In the balanced case most algorithms found the global optimum; only InSPO identified the global optimum on the imbalanced dataset, highlighting the impact of dataset distribution on convergence.
Bridge Game (XOR‑like)
Two datasets—optimal (500 trajectories from an optimal deterministic policy) and mixed (optimal + 500 trajectories from a uniform random policy)—were evaluated. Only InSPO and AlberDICE achieved near‑optimal performance on both datasets, whereas value‑decomposition methods failed to converge.
StarCraft II Micromanagement Benchmarks
Four maps and four datasets (medium, expert, medium‑replay, mixed) were used. InSPO achieved state‑of‑the‑art results on most tasks, demonstrating its scalability to high‑dimensional, complex environments.
Ablation Study
Removing the entropy term caused InSPO to get stuck in local optima on the imbalanced M‑NE dataset. Replacing sequential updates with simultaneous updates led to conflicting update directions and OOD joint actions. Varying the temperature α showed that automatic tuning finds a suitable conservatism level and improves performance.
Conclusion and Future Directions
The proposed InSPO algorithm combines reverse KL divergence and entropy regularization to address OOD joint actions and convergence to local optima in offline MARL. Theoretical analysis guarantees monotonic policy improvement and convergence to QRE, and extensive experiments confirm superior performance over existing offline MARL methods. Future work includes combining InSPO with other MARL techniques, scaling to larger environments, enhancing dataset quality via generative models, and tackling multimodal reward landscapes.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
