Mixture of Experts
How to read this page: This article maps the topic from beginner to expert across six levels — Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Mixture of Experts (MoE) is a neural network architecture where the model is split into specialized sub-networks called "experts," with a learned router that activates only a subset for each input. This decouples total parameters from compute: a model can have 100B+ parameters but activate only a fraction per token. MoE is the architecture behind Mixtral, and reportedly GPT-4 and Gemini.
Remembering
- Expert — A sub-network (typically an FFN block) that specializes in certain types of inputs.
- Gating network (router) — A learned function mapping each token to a probability distribution over experts.
- Sparse MoE — Only the top-K experts (small K, often 1 or 2) are activated per token; compute stays roughly constant as experts are added.
- Top-K routing — Selecting K highest-probability experts per token.
- Expert capacity — Maximum tokens an expert processes per batch; overflow tokens are dropped.
- Load balancing loss — Auxiliary loss encouraging uniform token distribution across experts.
- Expert collapse — Failure mode where router learns to route all tokens to one expert.
- Active parameters — Parameters actually used per forward pass; far fewer than total in MoE.
- Mixtral 8x7B — 8 experts per MoE layer, top-2 routing, 47B total, ~13B active per token.
- Switch Transformer — Google's top-1 routing MoE scaled to 1.6 trillion parameters.
- Expert parallelism — Distributing different experts across different GPUs for scale.
Understanding
The core motivation for MoE is parameter-compute decoupling: in a dense model, doubling parameters doubles compute. In a sparse MoE with E experts and top-K routing, total parameters scale with E while compute stays roughly constant — only K experts fire per token.
The router computes a score for each expert (dot product of token representation with learned expert embeddings), applies softmax, and selects top-K. The token passes through each selected expert; outputs are weighted by router probabilities and summed.
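As a sketch of that computation (function and argument names here are illustrative; real implementations add noise, capacity limits, and batched dispatch):

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def route_topk(x, expert_embed, k=2):
    """x: [tokens, d_model]; expert_embed: [n_experts, d_model] learned
    router weights. Returns per-token expert indices and mixing weights."""
    logits = x @ expert_embed.T               # score each expert per token
    probs = F.softmax(logits, dim=-1)         # distribution over experts
    weights, idx = probs.topk(k, dim=-1)      # keep the K highest-probability experts
    weights = weights / weights.sum(-1, keepdim=True)  # renormalize over the kept K
    return idx, weights  # output = sum over k of weights[:, k] * expert_idx[:, k](x)
</syntaxhighlight>

Renormalizing over the kept K matches Mixtral's convention; Switch-style top-1 routing uses the raw gate probability instead.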
- Why experts specialize: Training dynamics naturally lead experts to focus on different input types. In a language MoE, different experts activate for code, scientific text, different languages, or different grammatical structures — without explicit programming.
- Load balancing challenge: Without intervention, routing is unstable — popular experts improve faster, attracting more tokens and improving further. The auxiliary load balancing loss penalizes uneven distribution, forcing all experts to be used roughly equally (a minimal sketch of one common formulation follows below).
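One common formulation is the Switch Transformer auxiliary loss (the function and argument names below are illustrative):

<syntaxhighlight lang="python">
import torch

def load_balancing_loss(router_probs, expert_idx, n_experts):
    """Switch-style auxiliary loss: n_experts * sum_i f_i * P_i, where
    f_i is the fraction of tokens dispatched to expert i and P_i is the
    mean router probability for expert i. Minimized when load is uniform."""
    counts = torch.bincount(expert_idx.flatten(), minlength=n_experts).float()
    f = counts / counts.sum()             # f_i: realized token fractions
    p = router_probs.mean(dim=0)          # P_i: average routing probability
    return n_experts * torch.sum(f * p)   # equals 1.0 at perfect balance
</syntaxhighlight>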
In practice, every other FFN layer in the transformer is replaced with an MoE layer (alternating dense and sparse), while attention layers remain dense.
Applying
Using Mixtral 8x7B for inference:

<syntaxhighlight lang="python">
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# All 47B parameters must be loaded even though only ~13B are active per
# token; device_map="auto" shards the weights across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

prompt = "[INST] Explain quantum entanglement simply. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0], skip_special_tokens=True))
</syntaxhighlight>
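Note that sparse activation saves compute, not memory: at float16, the full 47B parameters need roughly 94 GB of accelerator memory, so multi-GPU sharding or quantization is typically required even though only ~13B parameters fire per token.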
MoE model comparison:
- Mixtral 8x7B → 8 experts, top-2, 47B total, ~13B active; strong open model
- Mixtral 8x22B → 141B total, top-2; very high quality
- DeepSeek-V2 → 160 routed experts, top-6; extremely large expert pool
- DeepSeek-V3 → 256 routed experts, top-8; even larger, fine-grained expert pool
- Switch Transformer → top-1 routing, scales to 1.6T parameters; Google research
Analyzing
Dense vs. Sparse MoE trade-offs (ratios describe the FFN blocks; attention layers remain dense):

| Property | Dense Model | Sparse MoE (top-2 of 8) |
|---|---|---|
| Total parameters | N | 8N |
| Active parameters | N | 2N/8 = N/4 |
| Compute per token | ∝ N | ∝ N/4 |
| Memory required | N × bytes per param | 8N × bytes per param |
| Communication | None | Inter-device expert routing |
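As a rough sanity check against Mixtral 8x7B (the per-expert and shared sizes below are ballpark assumptions, not official figures):

<syntaxhighlight lang="python">
# Back-of-envelope: why "8x7B" yields ~47B total but only ~13B active.
# Attention/embedding weights are shared across experts, so the totals
# are not a clean 8x/4x. Both sizes below are rough assumptions.
ffn_per_expert = 5.6e9   # approx. expert FFN params summed over all layers
shared = 1.9e9           # approx. shared attention + embedding params
total = shared + 8 * ffn_per_expert    # ~46.7B total parameters
active = shared + 2 * ffn_per_expert   # ~13.1B active per token (top-2)
print(f"total ~{total/1e9:.1f}B, active ~{active/1e9:.1f}B")
</syntaxhighlight>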
Failure modes: Expert collapse, token dropping when experts are full, load imbalance causing some experts to never train effectively, and inter-device communication bottlenecks in expert parallelism.
Evaluating
MoE-specific evaluation:
1. Expert load distribution — Gini coefficient or entropy over expert usage; alert if any expert handles >3× the average load.
2. Expert specialization — do different experts handle semantically distinct inputs? Visualize the token types routed to each expert.
3. Routing consistency — does the same input consistently route to the same experts? High variance suggests instability.
4. Quality vs. active FLOPs — compare models at equal compute budgets, not equal total parameters.
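A minimal sketch of check (1), assuming you can log per-token expert assignments from the router (function and argument names are illustrative):

<syntaxhighlight lang="python">
import torch

def expert_load_report(expert_idx, n_experts):
    """expert_idx: flat tensor of expert assignments over many tokens.
    Returns usage entropy (log(n_experts) means perfectly uniform) and
    the indices of experts above the 3x-average-load alert threshold."""
    counts = torch.bincount(expert_idx.flatten(), minlength=n_experts).float()
    frac = counts / counts.sum()
    entropy = -(frac * (frac + 1e-9).log()).sum().item()
    overloaded = (frac > 3.0 / n_experts).nonzero().flatten().tolist()
    return entropy, overloaded
</syntaxhighlight>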
Creating
Designing an MoE architecture:
- Replace every other FFN with an MoE layer.
- Use top-2 routing for stability.
- Add an auxiliary load balancing loss (weight 0.01).
- Set the expert capacity factor to 1.25.
- Implement expert parallelism by assigning expert groups to different GPUs.
- Monitor expert usage histograms every 1000 steps.
- At inference, cache frequently activated experts in faster memory for common patterns.
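One way to wire those choices together on a single device, as a minimal sketch (class and argument names are illustrative; capacity enforcement, token dropping, and expert parallelism are omitted):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Drop-in replacement for an FFN block: top-2 routing over n_experts
    with a Switch-style auxiliary load-balancing loss (weight 0.01)."""
    def __init__(self, d_model, d_ff, n_experts=8, k=2, aux_weight=0.01):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                           nn.Linear(d_ff, d_model)) for _ in range(n_experts)]
        )
        self.k, self.aux_weight = k, aux_weight

    def forward(self, x):                        # x: [tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)
        w, idx = probs.topk(self.k, dim=-1)      # top-2 experts per token
        w = w / w.sum(-1, keepdim=True)          # renormalize over kept experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):  # dense loop; real systems dispatch sparsely
            for slot in range(self.k):
                mask = idx[:, slot] == i
                if mask.any():
                    out[mask] += w[mask, slot].unsqueeze(-1) * expert(x[mask])
        # Auxiliary loss: n_experts * sum_i f_i * P_i (add to the training loss)
        f = torch.bincount(idx.flatten(), minlength=len(self.experts)).float()
        aux = len(self.experts) * torch.sum((f / f.sum()) * probs.mean(dim=0))
        return out, self.aux_weight * aux
</syntaxhighlight>

In a real transformer this would replace the FFN in every other block, with the returned auxiliary term added to the training loss; frameworks such as DeepSpeed-MoE or Megatron-LM handle the expert-parallel dispatch across GPUs.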