Meta-Learning
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Meta-learning, also called "learning to learn," is a subfield of machine learning focused on designing models and algorithms that improve their learning ability with experience. While standard machine learning trains a model to perform a specific task, meta-learning trains a model to learn new tasks quickly — often from very few examples. The goal is not to solve a particular problem, but to become a more efficient learner across a distribution of problems. Meta-learning underpins few-shot learning and rapid adaptation in robotics, and it is increasingly applied to hyperparameter optimization and neural architecture search.
Remembering
- Meta-learning — Learning to learn; training models to adapt quickly to new tasks using experience across many tasks.
- Meta-learner — The higher-level model that learns across tasks and produces or adapts base learners.
- Base learner — The model that is applied to individual tasks; updated by the meta-learner.
- Support set — The small labeled dataset provided at test time to adapt to a new task (analogous to training data for the base learner).
- Query set — The test examples for the new task on which performance is evaluated after adaptation.
- N-way K-shot learning — A meta-learning task setting: N classes, K labeled examples per class in the support set.
- Episode — One meta-learning training iteration, consisting of a sampled task with its support and query sets.
- MAML (Model-Agnostic Meta-Learning) — A gradient-based meta-learning algorithm that finds initialization parameters enabling rapid fine-tuning on new tasks.
- Prototypical Networks — A metric-based meta-learning approach that classifies by distance to class prototype embeddings.
- Matching Networks — A metric-based approach using an attention mechanism over support set embeddings.
- Meta-SGD — An extension of MAML that also meta-learns per-parameter learning rates.
- In-context learning — The emergent ability of large language models (LLMs) to learn new tasks from examples provided in the prompt, without gradient updates.
- Hyperparameter optimization (HPO) — Automatically finding optimal hyperparameters; meta-learning approaches (BOHB, SMAC) use experience across runs.
Understanding
Standard training gives a model a fixed behavior. Meta-learning gives a model the ability to quickly adapt its behavior given a few new examples.
The meta-learning objective: across many tasks T sampled from a distribution p(T), find model parameters θ that can quickly adapt to any task using only a few examples. Formally: min_θ E_{T∼p(T)}[L_T(f_{θ′})], where θ′ = Adapt(θ, support_set_T).
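Rendered in LaTeX, and assuming MAML's single inner gradient step as the Adapt operator (the symbols α and S_T are introduced here for illustration, not part of the original formula):
<syntaxhighlight lang="latex">
\min_{\theta} \; \mathbb{E}_{T \sim p(T)} \left[ \mathcal{L}_{T}\!\left( f_{\theta'} \right) \right],
\qquad
\theta' = \theta - \alpha \, \nabla_{\theta} \mathcal{L}_{T}\!\left( f_{\theta};\, \mathcal{S}_{T} \right)
</syntaxhighlight>
Here \mathcal{S}_{T} denotes the support set of task T and \alpha is the inner-loop learning rate.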
Three meta-learning approaches:
Metric-based: Learn an embedding space where classification is easy — similar examples are close, different ones are far. At test time, classify by distance to class prototypes (Prototypical Networks) or by weighted attention over support examples (Matching Networks); a minimal Prototypical Networks sketch follows these three approaches.
Optimization-based (MAML): Find model initialization θ such that a few gradient steps on the support set produce a good model for the query set. The meta-update optimizes through the adaptation process — it literally backpropagates through gradient descent steps.
Model-based: Use a recurrent or attention architecture that quickly updates its "memory" when shown support examples. The model's hidden state encodes the task context, enabling immediate adaptation.
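To make the metric-based approach concrete, here is a minimal Prototypical Networks prediction step. It is a sketch under assumptions: encoder stands for any embedding network, the shapes follow the N-way K-shot convention, and the function name is illustrative rather than from any library.
<syntaxhighlight lang="python">
import torch

def prototypical_predict(encoder, support_x, support_y, query_x, n_classes):
    """Classify query examples by distance to class prototype embeddings."""
    z_support = encoder(support_x)   # (N*K, D) support embeddings
    z_query = encoder(query_x)       # (Q, D) query embeddings
    # Prototype = mean embedding of each class's support examples.
    prototypes = torch.stack([
        z_support[support_y == c].mean(dim=0) for c in range(n_classes)
    ])                               # (N, D)
    # Negative squared Euclidean distance serves as the logit.
    logits = -torch.cdist(z_query, prototypes) ** 2
    return logits.argmax(dim=1)      # predicted class per query
</syntaxhighlight>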
In-context learning (emergent in LLMs) is meta-learning without gradient updates: GPT-4 can learn to translate into a new language, write in a new style, or follow new formatting rules from just a few examples in the prompt. The model's weights don't change — it "adapts" purely through the attention mechanism reading the context.
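A minimal sketch of how such a few-shot prompt is assembled; the task, example pairs, and formatting are illustrative and not tied to any particular model API:
<syntaxhighlight lang="python">
def build_few_shot_prompt(examples, query, instruction):
    """Assemble an in-context prompt: instruction, K worked examples, then the query."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    examples=[("Hello.", "Bonjour."), ("Thank you.", "Merci.")],
    query="Good night.",
    instruction="Translate English to French.",
)
# The model infers the task pattern from the two examples alone.
</syntaxhighlight>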
Applying
First-order MAML (FOMAML) implementation for few-shot classification. Because inner_adapt copies the model, gradients cannot flow back through the adaptation steps to θ, so the outer loop applies the adapted model's query-loss gradients directly to θ (the first-order approximation):
<syntaxhighlight lang="python">
import torch
import torch.nn as nn
from copy import deepcopy

class MAML:
    def __init__(self, model, inner_lr=0.01, outer_lr=0.001, n_inner_steps=5):
        self.model = model
        self.inner_lr = inner_lr
        self.optimizer = torch.optim.Adam(model.parameters(), lr=outer_lr)
        self.n_inner_steps = n_inner_steps
        self.loss_fn = nn.CrossEntropyLoss()

    def inner_adapt(self, support_x, support_y):
        """Fast adaptation on the support set (simulated fine-tuning)."""
        fast_model = deepcopy(self.model)
        fast_optimizer = torch.optim.SGD(fast_model.parameters(), lr=self.inner_lr)
        for _ in range(self.n_inner_steps):
            fast_optimizer.zero_grad()
            self.loss_fn(fast_model(support_x), support_y).backward()
            fast_optimizer.step()
        return fast_model

    def meta_update(self, episodes):
        """Outer loop: update θ to minimize query loss after adaptation."""
        self.optimizer.zero_grad()
        for support_x, support_y, query_x, query_y in episodes:
            adapted = self.inner_adapt(support_x, support_y)
            query_loss = self.loss_fn(adapted(query_x), query_y)
            # deepcopy cut the graph to θ, so take gradients w.r.t. the
            # adapted parameters and accumulate them into θ's .grad slots.
            grads = torch.autograd.grad(query_loss, list(adapted.parameters()))
            for p, g in zip(self.model.parameters(), grads):
                p.grad = g.clone() if p.grad is None else p.grad + g
        for p in self.model.parameters():
            p.grad /= len(episodes)  # average over the episode batch
        self.optimizer.step()
</syntaxhighlight>
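Hypothetical usage, continuing from the block above; sample_episodes is a placeholder for an episodic data loader yielding (support_x, support_y, query_x, query_y) tuples:
<syntaxhighlight lang="python">
# Toy 5-way classifier over flattened 28x28 images (illustrative only).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 5))
maml = MAML(model, inner_lr=0.01, outer_lr=0.001, n_inner_steps=5)

for step in range(10_000):                    # meta-training iterations
    episodes = sample_episodes(batch_size=4)  # hypothetical episodic sampler
    maml.meta_update(episodes)
</syntaxhighlight>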
Meta-learning approach selection:
- Image classification, few-shot → Prototypical Networks (simple, effective)
- Any architecture, gradient-based → MAML, FOMAML (first-order approximation)
- NLP, few-shot → In-context learning with a large LLM
- HPO automation → BOHB (Bayesian Optimization + Hyperband), Optuna (a minimal Optuna sketch follows this list)
- NAS → DARTS (gradient-based), evolutionary search
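For the HPO entry above, a minimal Optuna sketch; train_and_validate is a hypothetical training routine, and the search space is illustrative:
<syntaxhighlight lang="python">
import optuna

def objective(trial):
    # Illustrative search space; train_and_validate is a placeholder.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 4)
    return train_and_validate(lr=lr, n_layers=n_layers)  # returns validation loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
</syntaxhighlight>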
Analyzing
| Approach | Differentiates Through Adaptation? | Speed | Generalization |
|---|---|---|---|
| Metric-based (Prototypical) | No | Very fast | Good (within domain) |
| MAML | Yes (expensive) | Slow (2nd order) | Good |
| FOMAML | No (first-order approx) | Moderate | Good |
| In-context learning | No (inference only) | Fast | Excellent (large LLMs) |
| Model-based (NTM) | No | Fast | Moderate |
Failure modes: MAML's second-order gradients are computationally expensive and numerically unstable. Metric-based methods fail when tasks are too diverse for a shared embedding space. In-context learning degrades when the context window fills up or when examples are poorly formatted. Meta-overfitting occurs when the meta-learner overfits to the meta-training task distribution and fails on truly novel tasks.
Evaluating
Evaluation on standard benchmarks: Omniglot (20-way 1-shot, 5-shot character recognition), miniImageNet (5-way 1-shot, 5-shot), Meta-Dataset (diverse cross-domain few-shot). Report mean ± 95% CI across episodes. Expert practitioners evaluate generalization to task distributions outside meta-training — if a model only works on tasks similar to what it meta-trained on, it hasn't truly learned to learn.
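A minimal sketch of this reporting convention, assuming episode_accuracies holds one accuracy per test episode and using the normal-approximation interval:
<syntaxhighlight lang="python">
import numpy as np

def mean_ci95(episode_accuracies):
    """Mean accuracy and 95% confidence half-width across test episodes."""
    acc = np.asarray(episode_accuracies)
    half_width = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))
    return acc.mean(), half_width

mean, ci = mean_ci95(episode_accuracies)  # one accuracy value per episode
print(f"accuracy: {mean:.4f} ± {ci:.4f}")
</syntaxhighlight>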
Creating
Designing a meta-learning system for rapid domain adaptation:
- Assemble a large, diverse collection of tasks from the target distribution (if supervised) or define task samplers.
- For metric-based: train a shared encoder; at deployment, embed support set and classify by nearest prototype.
- For MAML: use FOMAML or Reptile (a simpler first-order approximation) for practical training; a Reptile sketch follows this list.
- For LLM-based: invest in prompt design and few-shot example selection — example quality matters more than quantity.
- Continually collect new tasks at deployment and add them to the meta-training distribution to prevent drift.
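As referenced in the MAML bullet above, a minimal Reptile meta-update; loss_fn and the (x, y) task-batch format are assumptions, not part of the original article:
<syntaxhighlight lang="python">
import torch
from copy import deepcopy

def reptile_step(model, task_batches, loss_fn, inner_lr=0.01, meta_lr=0.1):
    """One Reptile meta-update: adapt a copy on one task, then move θ toward it."""
    fast = deepcopy(model)
    opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for x, y in task_batches:          # inner loop: plain SGD on the task
        opt.zero_grad()
        loss_fn(fast(x), y).backward()
        opt.step()
    with torch.no_grad():              # meta-update: θ ← θ + ε (θ' − θ)
        for p, fp in zip(model.parameters(), fast.parameters()):
            p += meta_lr * (fp - p)
</syntaxhighlight>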