Mixture Of Experts
How to read this page: This article maps the topic from beginner to expert across six levels — Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Mixture of Experts (MoE) is a neural network architecture where the model is split into specialized sub-networks called "experts," with a learned router that activates only a subset for each input. This decouples total parameters from compute: a model can have 100B+ parameters but activate only a fraction per token. MoE is the architecture behind Mixtral, and reportedly GPT-4 and Gemini.
Remembering
- Expert — A sub-network (typically an FFN block) that specializes in certain types of inputs.
- Gating network (router) — A learned function mapping each token to a probability distribution over experts.
- Sparse MoE — Only top-K experts (K=1 or 2) activated per token; compute stays constant.
- Top-K routing — Selecting K highest-probability experts per token.
- Expert capacity — Maximum tokens an expert processes per batch; overflow tokens are dropped.
- Load balancing loss — Auxiliary loss encouraging uniform token distribution across experts.
- Expert collapse — Failure mode where router learns to route all tokens to one expert.
- Active parameters — Parameters actually used per forward pass; far fewer than total in MoE.
- Mixtral 8x7B — 8 experts per MoE layer, top-2 routing, 47B total, ~13B active per token.
- Switch Transformer — Google's top-1 routing MoE scaled to 1.6 trillion parameters.
- Expert parallelism — Distributing different experts across different GPUs for scale.
Understanding
The core motivation for MoE is parameter-compute decoupling: in a dense model, doubling parameters doubles compute. In a sparse MoE with E experts and top-K routing, total parameters scale with E while compute stays roughly constant — only K experts fire per token.
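The decoupling is easy to quantify. A rough back-of-the-envelope sketch (the per-expert and shared sizes below are illustrative values back-solved from Mixtral's published 47B total / ~13B active, not official figures):

```python
# Illustrative arithmetic for parameter-compute decoupling.
# `ffn` = parameters per expert FFN, `shared` = attention/embedding
# parameters used by every token (both in billions, assumed values).
ffn, shared = 5.67, 1.67
num_experts, top_k = 8, 2

total_params = shared + num_experts * ffn   # everything stored in memory
active_params = shared + top_k * ffn        # only top-K experts fire per token

print(f"total: {total_params:.1f}B, active per token: {active_params:.1f}B")
# Total grows with num_experts; active params (and FLOPs) are fixed by top_k.
```

Doubling the expert count here would roughly double `total_params` while leaving `active_params` unchanged.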
The router computes a score for each expert (dot product of token representation with learned expert embeddings), applies softmax, and selects top-K. The token passes through each selected expert; outputs are weighted by router probabilities and summed.
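The routing computation just described can be sketched in a few lines. This is a minimal single-token NumPy sketch; the dimensions, random weights, and ReLU "expert FFN" are stand-ins, and renormalizing the gate weights over the selected experts is one common choice:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 16, 8, 2

token = rng.standard_normal(d_model)                      # one token's hidden state
expert_emb = rng.standard_normal((num_experts, d_model))  # learned router embeddings
expert_w = rng.standard_normal((num_experts, d_model, d_model)) * 0.1  # stand-in expert FFNs

# 1. Score each expert, 2. softmax, 3. keep the top-K.
logits = expert_emb @ token
probs = np.exp(logits - logits.max())
probs /= probs.sum()
top = np.argsort(probs)[-top_k:]                # indices of the K best experts

# Weight each selected expert's output by its (renormalized) router probability.
weights = probs[top] / probs[top].sum()
output = sum(g * np.maximum(expert_w[i] @ token, 0.0)   # ReLU stands in for the FFN
             for g, i in zip(weights, top))
```

Only `top_k` of the `num_experts` weight matrices are ever multiplied, which is where the compute savings come from.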
Why experts specialize: Training dynamics naturally lead experts to focus on different input types. In a language MoE, different experts activate for code, scientific text, different languages, or different grammatical structures — without explicit programming.
Load balancing challenge: Without intervention, routing is unstable — popular experts improve faster, attracting more tokens, improving more. The auxiliary load balancing loss penalizes uneven distribution, forcing all experts to be used roughly equally.
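One widely used formulation (from the Switch Transformer) multiplies, per expert, the fraction of tokens actually routed to it by its mean router probability. A NumPy sketch with a random stand-in router:

```python
import numpy as np

rng = np.random.default_rng(1)
num_tokens, num_experts = 64, 8

# Router probabilities over a batch (random stand-in for a trained router).
logits = rng.standard_normal((num_tokens, num_experts))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

assignments = probs.argmax(axis=1)                    # top-1 expert per token
f = np.bincount(assignments, minlength=num_experts) / num_tokens  # token fraction per expert
P = probs.mean(axis=0)                                # mean router probability per expert

aux_loss = num_experts * np.dot(f, P)
# Equals exactly 1.0 when both f and P are uniform, and grows as routing
# skews, so adding alpha * aux_loss to the training loss pushes toward balance.
```

Because `f` is non-differentiable (it comes from an argmax), the gradient flows through `P`, nudging the router's probabilities toward uniformity.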
In practice, every other FFN layer in the transformer is replaced with an MoE layer (alternating dense and sparse), while attention layers remain dense.
Applying
Using Mixtral 8x7B for inference: <syntaxhighlight lang="python">
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.float16,
    device_map="auto",  # Distributes experts across available GPUs
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

prompt = "[INST] Explain quantum entanglement simply. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0], skip_special_tokens=True))
</syntaxhighlight>
MoE model comparison:
- Mixtral 8x7B → 8 experts, top-2, 47B total, 13B active; strong open model
- Mixtral 8x22B → 141B total, top-2; very high quality
- DeepSeek-V2 → 160 experts, top-6; DeepSeek-V3 → 256 experts, top-8; extremely large, fine-grained expert pools
- Switch Transformer → Top-1 routing, scales to 1.6T; Google research
Analyzing
| Property | Dense Model | Sparse MoE (top-2 of 8) |
|---|---|---|
| Total parameters | N | N×8 |
| Active parameters | N | N×8 × (2/8) = N×2 |
| Compute per token | ∝N | ∝N×2 (vs. ∝N×8 for an equally large dense model) |
| Memory required | N × dtype | N×8 × dtype |
| Communication | None | Inter-device expert routing |
Failure modes: Expert collapse, token dropping when experts are full, load imbalance causing some experts to never train effectively, and inter-device communication bottleneck in expert parallelism.
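Token dropping is easy to reason about numerically. A small sketch using the common convention of a 1.25 capacity factor (the 900-token skew below is a made-up example):

```python
import math

tokens_per_batch, num_experts, capacity_factor = 4096, 8, 1.25

# Buffer size per expert; tokens beyond it are dropped (typically passed
# through the residual connection unchanged rather than discarded outright).
capacity = math.ceil(capacity_factor * tokens_per_batch / num_experts)
print(capacity)  # 640

# If skewed routing sends 900 tokens to one expert, the excess is dropped:
overflow = max(0, 900 - capacity)
print(overflow)  # 260
```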
Evaluating
MoE-specific evaluation: (1) Expert load distribution — Gini coefficient or entropy over expert usage; alert if any expert handles >3× average load. (2) Expert specialization — do different experts handle semantically distinct inputs? Visualize token types routed to each expert. (3) Routing consistency — does the same input consistently route to the same experts? High variance suggests instability. (4) Quality vs. active FLOPs — compare at equal compute budgets, not equal total parameters.
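The load-distribution check in (1) can be implemented directly. A sketch using entropy and a max-load ratio with the >3× alert threshold suggested above (the counts are invented examples):

```python
import numpy as np

def load_stats(expert_counts):
    """Entropy and max-load ratio over per-expert token counts."""
    counts = np.asarray(expert_counts, dtype=float)
    p = counts / counts.sum()
    entropy = -np.sum(np.where(p > 0, p * np.log(p), 0.0))  # max = log(num_experts)
    max_ratio = counts.max() / counts.mean()                # alert if > 3
    return entropy, max_ratio

# Balanced vs. near-collapsed routing over 8 experts:
balanced = load_stats([100] * 8)
collapsed = load_stats([730, 10, 10, 10, 10, 10, 10, 10])
# balanced: entropy = log(8) ~ 2.079, ratio 1.0; collapsed: ratio 7.3 -> alert
```

Entropy near `log(num_experts)` indicates healthy balance; a max-load ratio creeping upward over training is an early sign of collapse.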
Creating
Designing an MoE architecture:
- Replace every other FFN with an MoE layer.
- Use top-2 routing for stability.
- Add an auxiliary load balancing loss (weight 0.01).
- Set the expert capacity factor to 1.25.
- Implement expert parallelism by assigning expert groups to different GPUs.
- Monitor expert usage histograms every 1000 steps.
- At inference, cache frequently activated experts in faster memory for common patterns.
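The routing and expert pieces of such a design combine into a forward pass like the following. This is a pure-NumPy sketch of one batched top-2 MoE layer with random stand-in weights; capacity limiting, load balancing, and expert parallelism are omitted, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
d, E, K, T = 16, 8, 2, 32            # hidden size, experts, top-K, tokens

router_w = rng.standard_normal((d, E)) * 0.1
W1 = rng.standard_normal((E, d, 4 * d)) * 0.1   # per-expert FFN up-projection
W2 = rng.standard_normal((E, 4 * d, d)) * 0.1   # per-expert FFN down-projection

def moe_layer(x):                     # x: (T, d)
    logits = x @ router_w                               # (T, E) router scores
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    topk = np.argsort(probs, axis=1)[:, -K:]            # (T, K) chosen expert ids
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = probs[t, topk[t]]
        gates = gates / gates.sum()                     # renormalize over top-K
        for g, e in zip(gates, topk[t]):
            h = np.maximum(x[t] @ W1[e], 0.0)           # expert e's ReLU FFN
            out[t] += g * (h @ W2[e])
    return out

y = moe_layer(rng.standard_normal((T, d)))
```

A production implementation would replace the per-token Python loop with a gather/scatter over tokens grouped by expert, which is what makes expert parallelism across GPUs practical.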