Multi-Agent Reinforcement Learning
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Multi-agent reinforcement learning (MARL) studies how multiple autonomous agents learn to make decisions in shared environments where their actions affect each other's rewards. MARL extends single-agent RL to cooperative, competitive, and mixed settings: agents may need to collaborate to achieve a common goal, compete to maximize their individual rewards, or do both simultaneously. Applications include traffic signal coordination, robot swarm control, game playing (StarCraft II, Dota 2), financial market simulation, and multi-robot warehouse logistics.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Multi-agent system''' – A system composed of multiple interacting autonomous agents, each making decisions in a shared environment.
* '''Cooperative MARL''' – All agents share a common reward and must coordinate to maximize it collectively.
* '''Competitive MARL''' – Agents have opposing objectives; one agent's gain is another's loss (zero-sum games).
* '''Mixed (general-sum) MARL''' – Agents have partially aligned and partially conflicting objectives.
* '''Centralized training, decentralized execution (CTDE)''' – The dominant MARL paradigm: train agents with access to global information, but at deployment each agent acts only on its local observations.
* '''Joint action space''' – The combined action space of all agents; it grows exponentially with the number of agents.
* '''Partial observability''' – Each agent observes only part of the global state; the Dec-POMDP framework models this.
* '''Non-stationarity''' – From any one agent's perspective the environment is non-stationary, because the other agents are also learning and changing their policies.
* '''Value decomposition''' – Methods that decompose a joint value function into per-agent components for scalable learning (VDN, QMIX).
* '''QMIX''' – A cooperative MARL algorithm that learns a monotonic mixing function over individual agent Q-values.
* '''MADDPG (Multi-Agent Deep Deterministic Policy Gradient)''' – A CTDE actor-critic method for continuous action spaces in multi-agent settings.
* '''Communication in MARL''' – Allowing agents to exchange messages with each other, improving coordination.
* '''Emergent behavior''' – Complex collective behavior that arises from individual agent policies without explicit programming.
* '''Mean-field game''' – An approximation for very large agent populations in which each agent interacts with the mean behavior of the others.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
MARL introduces challenges absent in single-agent RL:

'''Non-stationarity''': From agent A's perspective, the environment includes agents B, C, and D, whose policies change as they learn. This breaks the stationarity assumption on which single-agent RL relies: the effective "environment" each agent faces keeps shifting, which makes convergence guarantees harder to establish and training more unstable.

'''The credit assignment problem''': In cooperative settings with a shared reward, how do we determine each agent's contribution to the collective outcome? If the team succeeds, which agent deserves the credit? Solving this is essential for effective individual agent learning.

'''Scalability''': The joint action space grows exponentially with the number of agents. With 10 agents each having 10 actions, the joint space has 10<sup>10</sup> combinations – intractable for explicit joint optimization.

'''CTDE solutions''': The dominant approach, CTDE, addresses these problems by training with a centralized critic that sees all agents' observations and actions (mitigating non-stationarity and easing credit assignment), while each policy executes using only its local observation (enabling decentralized deployment). QMIX applies this framework to cooperative settings, constraining the joint Q-function to be monotonic in the individual Q-values.
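To make the CTDE pattern concrete, here is a minimal, illustrative PyTorch sketch. The dimensions, network sizes, and greedy action selection are placeholder assumptions for exposition, not a specific published algorithm:

<syntaxhighlight lang="python">
# Minimal CTDE sketch; all dimensions and sizes are illustrative assumptions.
import torch
import torch.nn as nn

n_agents, obs_dim, state_dim, n_actions = 3, 16, 48, 5

# Decentralized actors: one per agent, each conditioned only on its own
# local observation (this is all that is available at deployment).
actors = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                        nn.Linear(64, n_actions)) for _ in range(n_agents)]

# Centralized critic: used only during training. It sees the global state
# plus every agent's (one-hot) action, so its value estimate accounts for
# what the other agents actually did.
critic = nn.Sequential(nn.Linear(state_dim + n_agents * n_actions, 128),
                       nn.ReLU(), nn.Linear(128, 1))

obs = torch.randn(n_agents, obs_dim)    # per-agent local observations
state = torch.randn(state_dim)          # global state (training only)

# Greedy action selection, purely for illustration.
actions = [actor(o).argmax() for actor, o in zip(actors, obs)]
joint = torch.cat([nn.functional.one_hot(a, n_actions).float() for a in actions])
value = critic(torch.cat([state, joint]))   # centralized value estimate
</syntaxhighlight>

At deployment only the actors are kept, so each agent runs on local observations alone; the critic, and the global state it consumed, is discarded after training.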
'''Emergent communication''': Agents trained in multi-agent settings can develop their own communication protocols – learned signaling conventions that no human designed but that are effective for coordination. This is fascinating from both an AI and a linguistics perspective.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Cooperative MARL with QMIX (conceptual, in the style of PyMARL/EPyMARL):'''

<syntaxhighlight lang="python">
# Conceptual QMIX implementation
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentQNetwork(nn.Module):
    """Individual agent Q-network (decentralized)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions)
        )

    def forward(self, obs):
        return self.net(obs)


class QMIXMixer(nn.Module):
    """Monotonic mixing network (centralized, training only)."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        # Hypernetworks generate the mixing weights from the global state.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        B = agent_qs.size(0)
        # abs() keeps the mixing weights non-negative, which makes Q_total
        # monotonic in every individual agent's Q-value.
        w1 = torch.abs(self.hyper_w1(state)).view(B, self.n_agents, -1)
        b1 = self.hyper_b1(state).view(B, 1, -1)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(B, -1, 1)
        b2 = self.hyper_b2(state).view(B, 1, 1)
        q_total = torch.bmm(hidden, w2) + b2  # monotonic mixing
        return q_total.squeeze(-1)            # (batch, 1)
</syntaxhighlight>
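As a usage sketch, the pieces above can be wired into a TD-style update. The random placeholder batch, the single optimizer, and the use of the current-step Q-total in the target (real QMIX bootstraps from target networks on the next state) are all simplifying assumptions:

<syntaxhighlight lang="python">
# Hypothetical one-step update using AgentQNetwork and QMIXMixer from above.
# Random placeholder data; no replay buffer or target networks.
n_agents, obs_dim, state_dim, n_actions, B, gamma = 2, 8, 16, 4, 32, 0.99

agents = [AgentQNetwork(obs_dim, n_actions) for _ in range(n_agents)]
mixer = QMIXMixer(n_agents, state_dim)
params = [p for a in agents for p in a.parameters()] + list(mixer.parameters())
opt = torch.optim.Adam(params, lr=5e-4)

obs = torch.randn(B, n_agents, obs_dim)          # placeholder batch
state = torch.randn(B, state_dim)
actions = torch.randint(0, n_actions, (B, n_agents))
rewards = torch.randn(B, 1)                      # shared team reward

# Q-value of each agent's chosen action, mixed into a joint Q_total.
chosen_qs = torch.stack(
    [agents[i](obs[:, i]).gather(1, actions[:, i:i+1])
     for i in range(n_agents)], dim=1).squeeze(-1)   # (B, n_agents)
q_total = mixer(chosen_qs, state)                    # (B, 1)

# Stand-in TD target; the real algorithm uses a target mixer on the next state.
target = rewards + gamma * q_total.detach()
loss = F.mse_loss(q_total, target)
opt.zero_grad(); loss.backward(); opt.step()
</syntaxhighlight>

Because gradients flow from the mixed Q_total back through every agent network, the shared team reward is implicitly apportioned among agents – this is how QMIX addresses credit assignment.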
; MARL algorithm selection
: '''Cooperative, discrete actions''' – QMIX; MAPPO (shared PPO with a centralized critic)
: '''Cooperative, continuous actions''' – MADDPG, FACMAC
: '''Competitive (two-player)''' – Self-play + PPO; AlphaZero (for board games)
: '''Large populations (100+ agents)''' – Mean-field RL (MF-Q, MF-AC)
: '''Communication allowed''' – QMIX with attention-based communication; CommNet
: '''Real-world robotics''' – Decentralized variants (when centralized training is not feasible)
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ MARL Setting Comparison
! Setting !! Agents !! Reward !! Key Algorithms !! Example
|-
| Cooperative || Multiple || Shared || QMIX, MAPPO || Robot swarms, traffic control
|-
| Competitive (zero-sum) || 2+ || Opposing || Self-play, NFSP || Chess, poker, StarCraft
|-
| Mixed (general-sum) || Multiple || Individual || MADDPG, LOLA || Economic simulation
|-
| Team vs. team || 2+ teams || Team-shared || Team self-play || Dota 2, sports simulation
|}

'''Failure modes''': Non-stationarity causes oscillating or divergent training. Emergent equilibria may not be globally optimal (e.g., agents settle into a suboptimal way of cooperating). Reward-shaping misalignment: individual reward components fail to align with the true cooperative objective. Scalability collapse: joint-action methods become intractable once the number of agents exceeds roughly 20.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Evaluation should go beyond individual agent performance:
# '''Cooperative task success rate''': the team's overall success rate on the shared objective.
# '''Coordination quality''': a breakdown of failure modes – which agents failed, and which coordination steps went wrong.
# '''Emergent behavior analysis''': visualize the learned strategies; are they sensible and interpretable?
# '''Transfer''': do agents generalize to different team compositions or opponent policies?
# '''Scalability''': how does performance degrade as the number of agents grows from 5 to 50 to 500?
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a MARL system for warehouse robot coordination:
# Define agents: each robot is an independent agent.
# Observation: each robot observes its own position, the nearest shelves, and other robots within its sensing radius.
# Actions: move to an adjacent cell, pick an item, place an item, or wait.
# Reward: shared team throughput (items delivered per minute).
# Training: CTDE with MAPPO – global observation for the critic, local observation for each actor.
# Communication: allow robots to broadcast their intended next position to prevent collisions.
# Evaluation: test on held-out warehouse configurations; measure throughput against a rule-based baseline.
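To make the design concrete, here is a deliberately minimal environment skeleton covering steps 1–4. The grid dimensions, observation layout, and simplified pick/place logic (no shelf or item model) are illustrative assumptions; a production version would implement a standard multi-agent API such as PettingZoo's:

<syntaxhighlight lang="python">
# Hypothetical warehouse grid-world skeleton; all details are illustrative.
import random

ACTIONS = ["up", "down", "left", "right", "pick", "place", "wait"]

class WarehouseEnv:
    def __init__(self, n_robots=4, width=10, height=10):
        self.n_robots, self.width, self.height = n_robots, width, height
        self.reset()

    def reset(self):
        self.pos = [(random.randrange(self.width), random.randrange(self.height))
                    for _ in range(self.n_robots)]
        self.carrying = [False] * self.n_robots
        self.delivered = 0
        return [self._obs(i) for i in range(self.n_robots)]

    def _obs(self, i):
        # Local observation: own position, carry flag, and other robots'
        # relative positions (a stand-in for a limited sensing radius).
        x, y = self.pos[i]
        others = [(ox - x, oy - y)
                  for j, (ox, oy) in enumerate(self.pos) if j != i]
        return {"pos": (x, y), "carrying": self.carrying[i], "others": others}

    def step(self, actions):
        """Apply one action per robot; return observations and shared reward."""
        moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
        before = self.delivered
        for i, a in enumerate(actions):
            if a in moves:
                dx, dy = moves[a]
                x, y = self.pos[i]
                self.pos[i] = (min(max(x + dx, 0), self.width - 1),
                               min(max(y + dy, 0), self.height - 1))
            elif a == "pick":
                self.carrying[i] = True    # placeholder: no shelf/item model
            elif a == "place" and self.carrying[i]:
                self.carrying[i] = False
                self.delivered += 1        # counts toward team throughput
        reward = float(self.delivered - before)   # shared team reward (step 4)
        return [self._obs(i) for i in range(self.n_robots)], reward

env = WarehouseEnv()
obs = env.reset()
obs, team_reward = env.step(["up", "pick", "wait", "left"])
</syntaxhighlight>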
</div>

[[Category:Artificial Intelligence]]
[[Category:Reinforcement Learning]]
[[Category:Multi-Agent Systems]]