Multi-Agent Reinforcement Learning
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Multi-agent reinforcement learning (MARL) studies how multiple autonomous agents learn to make decisions in shared environments where their actions affect each other's rewards. MARL extends single-agent RL to cooperative, competitive, and mixed settings: agents may need to collaborate to achieve a common goal, compete to maximize their individual rewards, or do both simultaneously. Applications include traffic signal coordination, robot swarm control, game playing (StarCraft II, Dota 2), financial market simulation, and multi-robot warehouse logistics.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Multi-agent system''' – A system composed of multiple interacting autonomous agents, each making decisions in a shared environment.
* '''Cooperative MARL''' – All agents share a common reward and must coordinate to maximize it collectively.
* '''Competitive MARL''' – Agents have opposing objectives; one agent's gain is another's loss (zero-sum games).
* '''Mixed (general-sum) MARL''' – Agents have partially aligned and partially conflicting objectives.
* '''Centralized training, decentralized execution (CTDE)''' – The dominant MARL paradigm: train agents with access to global information, but at deployment each agent acts only on its local observations.
* '''Joint action space''' – The combined action space of all agents; it grows exponentially with the number of agents.
* '''Partial observability''' – Each agent observes only part of the global state; the Dec-POMDP framework models this.
* '''Non-stationarity''' – From any one agent's perspective the environment is non-stationary, because the other agents are also learning and changing their policies.
* '''Value decomposition''' – Methods that decompose a joint value function into per-agent components for scalable learning (VDN, QMIX).
* '''QMIX''' – A cooperative MARL algorithm that learns a monotonic mixing function over individual agent Q-values.
* '''MADDPG (Multi-Agent Deep Deterministic Policy Gradient)''' – A CTDE actor-critic method for continuous action spaces in multi-agent settings.
* '''Communication in MARL''' – Allowing agents to exchange messages with each other, improving coordination.
* '''Emergent behavior''' – Complex collective behavior that arises from individual agent policies without explicit programming.
* '''Mean-field game''' – An approximation for very large agent populations in which each agent interacts with the mean behavior of the others.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
MARL introduces challenges absent in single-agent RL:

'''Non-stationarity''': From agent A's perspective, the environment includes agents B, C, and D, whose policies change as they learn. This breaks the stationarity assumption on which single-agent RL relies: the effective "environment" each agent faces keeps shifting, which makes convergence guarantees harder to establish and training more unstable.

'''The credit assignment problem''': In cooperative settings with a shared reward, how do we determine each agent's contribution to the collective outcome? If the team succeeds, which agent deserves the credit? Solving this is essential for effective individual agent learning.

'''Scalability''': The joint action space grows exponentially with the number of agents. With 10 agents each having 10 actions, the joint space has 10<sup>10</sup> combinations – intractable for explicit joint optimization.

'''CTDE solutions''': The dominant approach, CTDE, addresses these problems by training with a centralized critic that sees all agents' observations and actions (mitigating non-stationarity and easing credit assignment), while each policy executes using only its local observation (enabling decentralized deployment). QMIX applies this framework to cooperative settings, constraining the joint Q-function to be monotonic in the individual Q-values.
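To make the CTDE pattern concrete, here is a minimal, illustrative PyTorch sketch. The dimensions, network sizes, and greedy action selection are placeholder assumptions for exposition, not a specific published algorithm:

<syntaxhighlight lang="python">
# Minimal CTDE sketch; all dimensions and sizes are illustrative assumptions.
import torch
import torch.nn as nn

n_agents, obs_dim, state_dim, n_actions = 3, 16, 48, 5

# Decentralized actors: one per agent, each conditioned only on its own
# local observation (this is all that is available at deployment).
actors = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                        nn.Linear(64, n_actions)) for _ in range(n_agents)]

# Centralized critic: used only during training. It sees the global state
# plus every agent's (one-hot) action, so its value estimate accounts for
# what the other agents actually did.
critic = nn.Sequential(nn.Linear(state_dim + n_agents * n_actions, 128),
                       nn.ReLU(), nn.Linear(128, 1))

obs = torch.randn(n_agents, obs_dim)    # per-agent local observations
state = torch.randn(state_dim)          # global state (training only)

# Greedy action selection, purely for illustration.
actions = [actor(o).argmax() for actor, o in zip(actors, obs)]
joint = torch.cat([nn.functional.one_hot(a, n_actions).float() for a in actions])
value = critic(torch.cat([state, joint]))   # centralized value estimate
</syntaxhighlight>

At deployment only the actors are kept, so each agent runs on local observations alone; the critic, and the global state it consumed, is discarded after training.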
'''Emergent communication''': Agents trained in multi-agent settings can develop their own communication protocols – learned signaling conventions that no human designed but that are effective for coordination. This is fascinating from both an AI and a linguistics perspective.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Cooperative MARL with QMIX (conceptual, in the style of PyMARL/EPyMARL):'''

<syntaxhighlight lang="python">
# Conceptual QMIX implementation
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentQNetwork(nn.Module):
    """Individual agent Q-network (decentralized)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions)
        )

    def forward(self, obs):
        return self.net(obs)


class QMIXMixer(nn.Module):
    """Monotonic mixing network (centralized, training only)."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        # Hypernetworks generate the mixing weights from the global state.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        B = agent_qs.size(0)
        # abs() keeps the mixing weights non-negative, which makes Q_total
        # monotonic in every individual agent's Q-value.
        w1 = torch.abs(self.hyper_w1(state)).view(B, self.n_agents, -1)
        b1 = self.hyper_b1(state).view(B, 1, -1)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(B, -1, 1)
        b2 = self.hyper_b2(state).view(B, 1, 1)
        q_total = torch.bmm(hidden, w2) + b2  # monotonic mixing
        return q_total.squeeze(-1)            # (batch, 1)
</syntaxhighlight>
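As a usage sketch, the pieces above can be wired into a TD-style update. The random placeholder batch, the single optimizer, and the use of the current-step Q-total in the target (real QMIX bootstraps from target networks on the next state) are all simplifying assumptions:

<syntaxhighlight lang="python">
# Hypothetical one-step update using AgentQNetwork and QMIXMixer from above.
# Random placeholder data; no replay buffer or target networks.
n_agents, obs_dim, state_dim, n_actions, B, gamma = 2, 8, 16, 4, 32, 0.99

agents = [AgentQNetwork(obs_dim, n_actions) for _ in range(n_agents)]
mixer = QMIXMixer(n_agents, state_dim)
params = [p for a in agents for p in a.parameters()] + list(mixer.parameters())
opt = torch.optim.Adam(params, lr=5e-4)

obs = torch.randn(B, n_agents, obs_dim)          # placeholder batch
state = torch.randn(B, state_dim)
actions = torch.randint(0, n_actions, (B, n_agents))
rewards = torch.randn(B, 1)                      # shared team reward

# Q-value of each agent's chosen action, mixed into a joint Q_total.
chosen_qs = torch.stack(
    [agents[i](obs[:, i]).gather(1, actions[:, i:i+1])
     for i in range(n_agents)], dim=1).squeeze(-1)   # (B, n_agents)
q_total = mixer(chosen_qs, state)                    # (B, 1)

# Stand-in TD target; the real algorithm uses a target mixer on the next state.
target = rewards + gamma * q_total.detach()
loss = F.mse_loss(q_total, target)
opt.zero_grad(); loss.backward(); opt.step()
</syntaxhighlight>

Because gradients flow from the mixed Q_total back through every agent network, the shared team reward is implicitly apportioned among agents – this is how QMIX addresses credit assignment.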
; MARL algorithm selection
: '''Cooperative, discrete actions''' – QMIX; MAPPO (shared PPO with a centralized critic)
: '''Cooperative, continuous actions''' – MADDPG, FACMAC
: '''Competitive (two-player)''' – Self-play + PPO; AlphaZero (for board games)
: '''Large populations (100+ agents)''' – Mean-field RL (MF-Q, MF-AC)
: '''Communication allowed''' – QMIX with attention-based communication; CommNet
: '''Real-world robotics''' – Decentralized variants (when centralized training is not feasible)
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ MARL Setting Comparison
! Setting !! Agents !! Reward !! Key Algorithms !! Example
|-
| Cooperative || Multiple || Shared || QMIX, MAPPO || Robot swarms, traffic control
|-
| Competitive (zero-sum) || 2+ || Opposing || Self-play, NFSP || Chess, poker, StarCraft
|-
| Mixed (general-sum) || Multiple || Individual || MADDPG, LOLA || Economic simulation
|-
| Team vs. team || 2+ teams || Team-shared || Team self-play || Dota 2, sports simulation
|}

'''Failure modes''': Non-stationarity causes oscillating or divergent training. Emergent equilibria may not be globally optimal (e.g., agents settle into a suboptimal way of cooperating). Reward-shaping misalignment: individual reward components fail to align with the true cooperative objective. Scalability collapse: joint-action methods become intractable once the number of agents exceeds roughly 20.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Evaluation should go beyond individual agent performance:
# '''Cooperative task success rate''': the team's overall success rate on the shared objective.
# '''Coordination quality''': a breakdown of failure modes – which agents failed, and which coordination steps went wrong.
# '''Emergent behavior analysis''': visualize the learned strategies; are they sensible and interpretable?
# '''Transfer''': do agents generalize to different team compositions or opponent policies?
# '''Scalability''': how does performance degrade as the number of agents grows from 5 to 50 to 500?
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a MARL system for warehouse robot coordination:
# Define agents: each robot is an independent agent.
# Observation: each robot observes its own position, the nearest shelves, and other robots within its sensing radius.
# Actions: move to an adjacent cell, pick an item, place an item, or wait.
# Reward: shared team throughput (items delivered per minute).
# Training: CTDE with MAPPO – global observation for the critic, local observation for each actor.
# Communication: allow robots to broadcast their intended next position to prevent collisions.
# Evaluation: test on held-out warehouse configurations; measure throughput against a rule-based baseline.
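To make the design concrete, here is a deliberately minimal environment skeleton covering steps 1–4. The grid dimensions, observation layout, and simplified pick/place logic (no shelf or item model) are illustrative assumptions; a production version would implement a standard multi-agent API such as PettingZoo's:

<syntaxhighlight lang="python">
# Hypothetical warehouse grid-world skeleton; all details are illustrative.
import random

ACTIONS = ["up", "down", "left", "right", "pick", "place", "wait"]

class WarehouseEnv:
    def __init__(self, n_robots=4, width=10, height=10):
        self.n_robots, self.width, self.height = n_robots, width, height
        self.reset()

    def reset(self):
        self.pos = [(random.randrange(self.width), random.randrange(self.height))
                    for _ in range(self.n_robots)]
        self.carrying = [False] * self.n_robots
        self.delivered = 0
        return [self._obs(i) for i in range(self.n_robots)]

    def _obs(self, i):
        # Local observation: own position, carry flag, and other robots'
        # relative positions (a stand-in for a limited sensing radius).
        x, y = self.pos[i]
        others = [(ox - x, oy - y)
                  for j, (ox, oy) in enumerate(self.pos) if j != i]
        return {"pos": (x, y), "carrying": self.carrying[i], "others": others}

    def step(self, actions):
        """Apply one action per robot; return observations and shared reward."""
        moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
        before = self.delivered
        for i, a in enumerate(actions):
            if a in moves:
                dx, dy = moves[a]
                x, y = self.pos[i]
                self.pos[i] = (min(max(x + dx, 0), self.width - 1),
                               min(max(y + dy, 0), self.height - 1))
            elif a == "pick":
                self.carrying[i] = True    # placeholder: no shelf/item model
            elif a == "place" and self.carrying[i]:
                self.carrying[i] = False
                self.delivered += 1        # counts toward team throughput
        reward = float(self.delivered - before)   # shared team reward (step 4)
        return [self._obs(i) for i in range(self.n_robots)], reward

env = WarehouseEnv()
obs = env.reset()
obs, team_reward = env.step(["up", "pick", "wait", "left"])
</syntaxhighlight>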
</div>

[[Category:Artificial Intelligence]]
[[Category:Reinforcement Learning]]
[[Category:Multi-Agent Systems]]