Multi-Agent Reinforcement Learning


How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain. Learn more about how BloomWiki works.

Multi-agent reinforcement learning (MARL) studies how multiple autonomous agents learn to make decisions in shared environments where their actions affect each other's rewards. MARL extends single-agent RL to cooperative, competitive, and mixed settings: agents may need to collaborate to achieve a common goal, compete to maximize their individual rewards, or do both simultaneously. Applications include traffic signal coordination, robot swarm control, game playing (StarCraft II, Dota 2), financial market simulation, and multi-robot warehouse logistics.

Remembering

  • Multi-agent system — A system composed of multiple interacting autonomous agents, each making decisions in a shared environment.
  • Cooperative MARL — All agents share a common reward and must coordinate to maximize it collectively.
  • Competitive MARL — Agents have opposing objectives; one agent's gain is another's loss (zero-sum games).
  • Mixed (general-sum) MARL — Agents have partially aligned and partially conflicting objectives.
  • Centralized training, decentralized execution (CTDE) — The dominant MARL paradigm: train agents with access to global information, but at deployment each agent acts only on local observations.
  • Joint action space — The combined action space of all agents; grows exponentially with the number of agents.
  • Partial observability — Each agent observes only part of the global state; the Dec-POMDP framework models this.
  • Non-stationarity — From any agent's perspective, the environment is non-stationary because other agents are also learning and changing their policies.
  • Value decomposition — Methods that decompose a joint value function into per-agent components for scalable learning (VDN, QMIX).
  • QMIX — A cooperative MARL algorithm that learns a monotonic mixing function over individual agent Q-values.
  • MADDPG (Multi-Agent Deep Deterministic Policy Gradient) — A CTDE actor-critic method for continuous action spaces in multi-agent settings.
  • Communication in MARL — Allowing agents to send messages to each other, improving coordination.
  • Emergent behavior — Complex collective behaviors arising from individual agent policies without explicit programming.
  • Mean-field game — An approximation for very large agent populations where each agent interacts with the mean behavior of others.

Understanding

MARL introduces challenges absent in single-agent RL:

Non-stationarity: From agent A's perspective, the environment includes the other agents B, C, and D, whose policies change as they learn. This breaks the stationarity assumption that single-agent RL relies on: the transition and reward dynamics agent A experiences keep shifting even when A's own policy is fixed. As a result, convergence guarantees are harder to establish and training is less stable.
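
To make this concrete, here is a minimal sketch (the payoff matrix and policies are illustrative, not from any benchmark): with agent A's action held fixed, the reward A can expect still drifts as agent B's policy changes during learning.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative 2x2 matrix game: payoff_A[a_A, a_B] is agent A's reward.
payoff_A = np.array([[1.0, 0.0],
                     [0.0, 1.0]])

def expected_reward_for_A(action_A, policy_B):
    """Expected reward of A's fixed action under B's (changing) mixed policy."""
    return float(payoff_A[action_A] @ policy_B)

# Agent A keeps playing action 0, but B's policy shifts as B learns:
# the reward distribution A faces changes even though A changed nothing.
for policy_B in ([0.9, 0.1], [0.5, 0.5], [0.1, 0.9]):
    print(expected_reward_for_A(0, np.array(policy_B)))   # 0.9, 0.5, 0.1
</syntaxhighlight>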

The credit assignment problem: In cooperative settings with a shared reward, how do we determine each agent's contribution to the collective outcome? If the team succeeds, which agent deserves credit? Solving this is essential for effective individual agent learning.
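
One family of remedies is counterfactual or difference rewards, in the spirit of COMA's counterfactual baseline: credit an agent with how much the team outcome changes relative to a counterfactual in which that agent had acted differently. The sketch below is illustrative only; it assumes a team reward function that can be re-evaluated with a substituted action, and the names and toy reward are not from any specific library.

<syntaxhighlight lang="python">
def difference_reward(team_reward_fn, joint_action, agent_idx, default_action=0):
    """Credit `agent_idx` with the change in team reward it caused, relative to
    a counterfactual where it had taken a default action instead."""
    actual = team_reward_fn(joint_action)
    counterfactual = list(joint_action)
    counterfactual[agent_idx] = default_action
    return actual - team_reward_fn(tuple(counterfactual))

# Toy shared reward: the team succeeds only if agents 0 and 1 both pick action 1.
team_reward = lambda acts: 1.0 if acts[0] == 1 and acts[1] == 1 else 0.0

print(difference_reward(team_reward, (1, 1, 0), agent_idx=0))  # 1.0 -> agent 0 mattered
print(difference_reward(team_reward, (1, 1, 0), agent_idx=2))  # 0.0 -> agent 2 did not
</syntaxhighlight>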

Scalability: The joint action space grows exponentially with the number of agents. With 10 agents each having 10 actions, the joint space has 10^10 possibilities — intractable for explicit joint optimization.
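
A quick sketch of the arithmetic behind this, using the numbers from the example above:

<syntaxhighlight lang="python">
n_agents, n_actions = 10, 10
joint_actions = n_actions ** n_agents        # 10**10 joint actions
print(f"{joint_actions:,}")                  # 10,000,000,000

# Even a modest state count makes a tabular joint Q-function hopeless:
n_states = 1_000
print(f"{n_states * joint_actions:,} joint Q-table entries")
</syntaxhighlight>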

CTDE solutions: The dominant approach, CTDE, addresses these challenges by training with centralized information: a critic or mixing network sees all agents' observations and actions, which mitigates non-stationarity and eases credit assignment, while each agent's policy executes using only its local observations, enabling decentralized deployment. QMIX instantiates this framework for cooperative settings by constraining the joint Q-function to be monotonic in the individual agent Q-values.
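
A minimal PyTorch sketch of the CTDE split for an actor-critic method such as MAPPO (class names, dimensions, and the shared-critic design are illustrative assumptions, not a reference implementation): each actor conditions only on its own observation, while a single critic used during training sees the concatenation of all agents' observations.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class LocalActor(nn.Module):
    """Decentralized policy: conditions only on the agent's own observation."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class CentralCritic(nn.Module):
    """Centralized value function: sees every agent's observation, but only
    during training; it is discarded at deployment."""
    def __init__(self, obs_dim, n_agents, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim * n_agents, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, all_obs):              # all_obs: (batch, n_agents * obs_dim)
        return self.net(all_obs).squeeze(-1)

# Training computes advantages with CentralCritic's value estimates;
# at execution time each agent runs only its LocalActor on local observations.
</syntaxhighlight>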

Emergent communication: Agents trained in multi-agent settings can develop their own communication protocols — learned languages not designed by humans but effective for coordination. This is fascinating from both an AI and linguistics perspective.

Applying

Cooperative MARL with QMIX (full implementations are available in frameworks such as PyMARL/EPyMARL; the code below is a conceptual sketch):

<syntaxhighlight lang="python">
# Conceptual QMIX implementation
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentQNetwork(nn.Module):
    """Individual agent Q-network (decentralized)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions)
        )

    def forward(self, obs):
        return self.net(obs)

class QMIXMixer(nn.Module):
    """Monotonic mixing network (centralized, training only)."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        # Hypernetworks generate the mixing weights from the global state;
        # taking their absolute value enforces monotonicity in each agent's Q.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        B = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(B, self.n_agents, -1)
        b1 = self.hyper_b1(state).view(B, 1, -1)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(B, -1, 1)
        b2 = self.hyper_b2(state).view(B, 1, 1)
        q_total = torch.bmm(hidden, w2) + b2  # Monotonic mixing
        return q_total.squeeze(-1)

</syntaxhighlight>
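
For orientation, here is a minimal sketch of how the two modules above combine in one centralized training step, reusing the AgentQNetwork and QMIXMixer classes just defined. The batch shapes, the random data, and the simplified TD target (a full implementation would bootstrap from target networks and the next state) are illustrative assumptions.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

n_agents, obs_dim, state_dim, n_actions, batch = 3, 10, 24, 5, 32
agents = [AgentQNetwork(obs_dim, n_actions) for _ in range(n_agents)]
mixer = QMIXMixer(n_agents, state_dim)

# Illustrative replay batch: per-agent observations, chosen actions,
# global state, and the shared team reward.
obs = torch.randn(batch, n_agents, obs_dim)
actions = torch.randint(0, n_actions, (batch, n_agents))
state = torch.randn(batch, state_dim)
reward = torch.randn(batch)

# Q-value of each agent's chosen action, then monotonic mixing into Q_tot.
agent_qs = torch.stack(
    [agents[i](obs[:, i]).gather(1, actions[:, i:i + 1]).squeeze(1)
     for i in range(n_agents)], dim=1)        # (batch, n_agents)
q_total = mixer(agent_qs, state)              # (batch, 1)

# Simplified target: a real QMIX update uses target networks and the
# next state's maximal mixed Q-value.
td_target = reward.unsqueeze(1)
loss = F.mse_loss(q_total, td_target)
loss.backward()
</syntaxhighlight>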

MARL algorithm selection:

  • Cooperative, discrete actions → QMIX, MAPPO (shared PPO with a centralized critic)
  • Cooperative, continuous actions → MADDPG, FACMAC
  • Competitive (2-player) → Self-play + PPO; AlphaZero-style methods (for games)
  • Large populations (100+ agents) → Mean-field RL (MF-Q, MF-AC)
  • Communication allowed → QMIX + attention-based communication; CommNet
  • Real-world robotics → Decentralized variants (when centralized training is not feasible)

Analyzing

MARL Setting Comparison

{| class="wikitable"
! Setting !! Agents !! Reward !! Key algorithms !! Example
|-
| Cooperative || Multiple || Shared || QMIX, MAPPO || Robot swarms, traffic control
|-
| Competitive (zero-sum) || 2+ || Opposing || Self-play, NFSP || Chess, poker, StarCraft
|-
| Mixed (general-sum) || Multiple || Individual || MADDPG, LOLA || Economic simulation
|-
| Team vs. team || 2+ teams || Team-shared || Team self-play || Dota 2, sports simulation
|}

Failure modes:

  • Non-stationarity causes oscillating or divergent training.
  • Emergent equilibria may not be globally optimal (e.g., agents learn to cooperate in a suboptimal way).
  • Reward shaping misalignment: individual reward components don't align with true cooperative objectives.
  • Scalability collapse when agent numbers exceed ~20 for joint-action methods.

Evaluating

Evaluation beyond individual agent performance:

  1. Cooperative task success rate: the overall team success rate on the shared objective.
  2. Coordination quality: a breakdown of failure modes, identifying which agents failed and which coordination steps went wrong.
  3. Emergent behavior analysis: visualize learned strategies; are they sensible and interpretable?
  4. Transfer: do agents generalize to different team compositions or opponent policies?
  5. Scalability: how does performance degrade as the number of agents scales from 5 to 50 to 500?
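
A small sketch of how points (1) and (5) might be summarized from logged evaluation runs; the records below are placeholders, not real results.

<syntaxhighlight lang="python">
from collections import defaultdict

# Placeholder evaluation log: (team size, did the team meet the shared objective?)
records = [(5, True), (5, True), (5, False),
           (50, True), (50, False),
           (500, False), (500, False)]

by_size = defaultdict(list)
for n_agents, success in records:
    by_size[n_agents].append(success)

for n_agents in sorted(by_size):
    outcomes = by_size[n_agents]
    rate = sum(outcomes) / len(outcomes)
    print(f"{n_agents:>4} agents: success rate {rate:.2f} over {len(outcomes)} episodes")
</syntaxhighlight>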

Creating

Designing a MARL system for warehouse robot coordination:

  1. Define agents: each robot is an independent agent.
  2. Observation: each robot observes its own position, the nearest shelves, and other robots within its sensing radius.
  3. Action: move to an adjacent cell, pick an item, place an item, or wait.
  4. Reward: shared team throughput (items delivered per minute).
  5. Training: CTDE with MAPPO; global observation for the critic, local observation for the actor.
  6. Communication: allow robots to broadcast their intended next position to prevent collisions.
  7. Evaluation: test on held-out warehouse configurations; measure throughput against a rule-based baseline.
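
As a starting point, here is a hedged sketch of steps (2) through (4): the per-robot action set, a local observation built from nearby robots, and the shared throughput reward broadcast to every agent. All names, field choices, and the Manhattan-distance sensing rule are illustrative assumptions, not part of any particular framework.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from enum import IntEnum

class RobotAction(IntEnum):
    """Step (3): the discrete per-robot action set."""
    MOVE_N = 0
    MOVE_S = 1
    MOVE_E = 2
    MOVE_W = 3
    PICK = 4
    PLACE = 5
    WAIT = 6

@dataclass
class Robot:
    position: tuple       # (row, col) grid cell

def local_observation(robot, others, sensing_radius=3):
    """Step (2): a robot observes its own position and robots within its radius."""
    nearby = [o.position for o in others
              if abs(o.position[0] - robot.position[0])
               + abs(o.position[1] - robot.position[1]) <= sensing_radius]
    return {"position": robot.position, "nearby_robots": nearby}

def shared_team_reward(items_delivered_this_step, n_agents):
    """Step (4): every robot receives the same team throughput signal, leaving
    credit assignment to the learner (e.g. MAPPO's centralized critic)."""
    return {f"robot_{i}": float(items_delivered_this_step) for i in range(n_agents)}

# Example: a 3-robot team delivers one item this step.
print(shared_team_reward(1, n_agents=3))
print(local_observation(Robot((0, 0)), [Robot((1, 1)), Robot((9, 9))]))
</syntaxhighlight>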