Continual Learning

From BloomWiki

How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.

Continual learning, also called lifelong learning or incremental learning, is the ability of a machine learning model to learn new tasks or data sequentially over time without forgetting previously acquired knowledge. This mirrors human cognition — we accumulate skills over a lifetime without having to re-learn everything from scratch. The central challenge is catastrophic forgetting: when a neural network is trained on new data, gradients overwrite the weights that encoded previous knowledge. Continual learning is essential for AI systems deployed in dynamic, evolving real-world environments.

Remembering

  • Continual learning — Training a model on a sequence of tasks or data streams over time without forgetting prior knowledge.
  • Catastrophic forgetting — The tendency of neural networks to abruptly lose previously learned knowledge when trained on new data.
  • Task-incremental learning — The model learns a sequence of distinct tasks, with task identity known at inference time.
  • Class-incremental learning — The model incrementally learns new classes; task identity is not given at test time (harder).
  • Domain-incremental learning — Same task but data distribution changes over time (e.g., new image styles).
  • Plasticity — The model's ability to learn new information quickly.
  • Stability — The model's ability to retain previously learned information.
  • Stability-plasticity dilemma — The fundamental trade-off: high plasticity enables fast learning but causes forgetting; high stability prevents forgetting but blocks new learning.
  • Elastic Weight Consolidation (EWC) — A regularization approach that penalizes changes to parameters important for previous tasks.
  • Progressive Neural Networks — Freeze previous task columns and add new lateral connections for new tasks; no forgetting but grows with each task.
  • Experience replay — Storing a small buffer of past examples and mixing them into training on new tasks.
  • Dark Experience Replay (DER) — Stores soft targets (logits) from past predictions, not just input-output pairs.
  • PackNet — Prunes and packs model weights for multiple tasks into the same fixed-size network.
  • Fisher information matrix — Used in EWC to measure parameter importance to previous tasks.

Understanding

The core problem: neural network weights encode knowledge through their specific values. When you train on Task B, gradient descent moves weights toward the minimum loss for Task B — often moving them away from the minimum for Task A. This is catastrophic forgetting: the gradient update for Task B destroys Task A's solution.
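Catastrophic forgetting is easy to reproduce. A minimal sketch on toy data (the two Gaussian tasks here are illustrative, not a standard benchmark):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(shift):
    """Toy binary task: Gaussian blob centered at `shift`, split on the first axis."""
    x = torch.randn(512, 2) + shift
    y = (x[:, 0] > shift[0]).long()
    return x, y

def train(model, x, y, steps=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        nn.CrossEntropyLoss()(model(x), y).backward()
        opt.step()

def accuracy(model, x, y):
    return (model(x).argmax(1) == y).float().mean().item()

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
xa, ya = make_task(torch.tensor([0.0, 0.0]))   # Task A
xb, yb = make_task(torch.tensor([5.0, -5.0]))  # Task B (shifted distribution)

train(model, xa, ya)
print("Task A accuracy after A:", accuracy(model, xa, ya))  # near 1.0

train(model, xb, yb)  # naive sequential fine-tuning, no forgetting prevention
print("Task A accuracy after B:", accuracy(model, xa, ya))  # typically collapses
</syntaxhighlight>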

The stability-plasticity dilemma: A model that never forgets (high stability) must also never change weights (low plasticity) — so it can't learn new things. A model that learns quickly (high plasticity) overwrites old knowledge. Managing this trade-off is the central challenge.

Three families of solutions:

Regularization-based: Add a penalty term to the loss that discourages large changes to parameters important for previous tasks. EWC computes a diagonal approximation of the Fisher information matrix (how important each parameter is to Task A's performance) and uses it to weight the penalty: L = L_new + λ Σ_i F_i (θ_i − θ*_i)². Simpler quadratic penalties use the plain L2 distance from the previous parameters, weighting every parameter equally.

Memory-based (Replay): Keep a small buffer of past examples (coreset) and interleave them with new task data. This directly prevents forgetting by ensuring gradients for old tasks continue to appear. Gradient Episodic Memory (GEM) ensures new task gradients don't increase loss on stored examples.
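Experience replay needs only a buffer and a sampling policy. A minimal sketch using reservoir sampling, which keeps the buffer an unbiased sample of the whole stream (class and method names are illustrative):

<syntaxhighlight lang="python">
import random
import torch

class ReplayBuffer:
    """Fixed-size buffer filled by reservoir sampling over the data stream."""
    def __init__(self, capacity=500):
        self.capacity = capacity
        self.data = []   # list of (x, y) example tensors
        self.seen = 0    # total examples observed so far

    def add(self, x, y):
        for xi, yi in zip(x, y):
            self.seen += 1
            if len(self.data) < self.capacity:
                self.data.append((xi, yi))
            else:
                # Keep each seen example with probability capacity / seen.
                j = random.randrange(self.seen)
                if j < self.capacity:
                    self.data[j] = (xi, yi)

    def sample(self, batch_size):
        batch = random.sample(self.data, min(batch_size, len(self.data)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)

# During training on a new task, interleave replayed examples:
#   x_old, y_old = buffer.sample(32)
#   loss = criterion(model(x_new), y_new) + criterion(model(x_old), y_old)
</syntaxhighlight>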

Architecture-based: Allocate different model capacity to different tasks — freeze old weights, expand the model for new tasks (Progressive Neural Networks), or use dynamic sparse masks per task (PackNet, HAT).
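A simplified sketch of the masking idea behind PackNet; it illustrates freezing weights assigned to an earlier task and omits the full iterative prune-and-retrain procedure of the actual algorithm:

<syntaxhighlight lang="python">
import torch

def magnitude_mask(model, keep_frac=0.5):
    """After training a task, mark its largest-magnitude weights as owned by it."""
    masks = {}
    for n, p in model.named_parameters():
        threshold = torch.quantile(p.abs().flatten(), 1 - keep_frac)
        masks[n] = (p.abs() >= threshold).float()
    return masks

def freeze_previous(model, frozen_mask):
    """Zero gradients on weights owned by earlier tasks (call after backward())."""
    for n, p in model.named_parameters():
        if p.grad is not None and n in frozen_mask:
            p.grad.mul_(1.0 - frozen_mask[n])

# Training task 2: weights kept for task 1 stay fixed, the rest remain free.
#   loss.backward(); freeze_previous(model, task1_mask); optimizer.step()
</syntaxhighlight>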

Applying

EWC continual learning implementation:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class EWC:
    def __init__(self, model, dataloader, device='cuda', importance=1000):
        self.model = model
        self.importance = importance  # lambda weighting the penalty
        # Snapshot the parameters learned on the previous task.
        self.prev_params = {n: p.clone().detach() for n, p in model.named_parameters()}
        self.fisher = self._compute_fisher(dataloader, device)

    def _compute_fisher(self, dataloader, device):
        """Estimate the diagonal Fisher information matrix on previous-task data."""
        fisher = {n: torch.zeros_like(p) for n, p in self.model.named_parameters()}
        self.model.eval()
        for x, y in dataloader:
            x, y = x.to(device), y.to(device)
            self.model.zero_grad()
            output = self.model(x)
            loss = nn.CrossEntropyLoss()(output, y)
            loss.backward()
            for n, p in self.model.named_parameters():
                if p.grad is not None:
                    # Squared gradients approximate parameter importance,
                    # averaged over batches.
                    fisher[n] += p.grad.pow(2) / len(dataloader)
        return fisher

    def penalty(self):
        """EWC regularization term: penalize moving important parameters."""
        loss = 0.0
        for n, p in self.model.named_parameters():
            loss += (self.fisher[n] * (p - self.prev_params[n]).pow(2)).sum()
        return self.importance * loss

# Training loop with EWC:
#   total_loss = task_loss + ewc.penalty()
</syntaxhighlight>
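A minimal usage sketch, continuing from the block above; task_a_loader, task_b_loader, and optimizer are assumed placeholders, not defined in this article:

<syntaxhighlight lang="python">
# After Task A training finishes, estimate importance from Task A's data.
ewc = EWC(model, task_a_loader, device='cuda', importance=1000)

# Then train on Task B with the penalty anchoring Task A's important weights.
model.train()
for x, y in task_b_loader:
    x, y = x.to('cuda'), y.to('cuda')
    optimizer.zero_grad()
    task_loss = nn.CrossEntropyLoss()(model(x), y)
    total_loss = task_loss + ewc.penalty()
    total_loss.backward()
    optimizer.step()
</syntaxhighlight>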

Continual learning approach selection:

  • Memory-constrained, simple setup → EWC or SI (Synaptic Intelligence) regularization
  • Memory available → Experience replay (keep 200-500 examples per past task)
  • Separate task heads needed → Progressive Neural Networks
  • Production NLP → Domain-adaptive pre-training on streaming data with KL replay

Analyzing

{| class="wikitable"
|+ Continual Learning Approach Comparison
! Approach !! Memory Overhead !! Forgetting Level !! Scalability
|-
| Fine-tuning (no CL) || None || Catastrophic || High (but useless)
|-
| EWC regularization || Small (Fisher matrix) || Moderate || Good
|-
| Experience replay || Coreset size || Low || Good
|-
| Progressive networks || Grows with tasks || Zero || Poor (unbounded growth)
|-
| PackNet || None (fixed size) || Zero || Moderate (limited capacity)
|}

Failure modes: EWC fails when tasks are very different (the Fisher approximation breaks down). Replay buffers don't scale to many tasks (the coreset becomes unrepresentative). Progressive networks grow without bound. Class-incremental learning, where task identity is unavailable at test time, remains largely unsolved at high accuracy.

Evaluating

Evaluate on standard benchmarks: Split-MNIST, Split-CIFAR-100, Permuted-MNIST. Metrics:

  1. Average accuracy after all tasks trained.
  2. Backward transfer — how much does learning new tasks affect old task accuracy (negative = forgetting).
  3. Forward transfer — does learning past tasks accelerate learning of new tasks?
  4. Memory efficiency — coreset size vs. performance trade-off.

Expert practitioners always include a "joint training" upper bound (train on all tasks simultaneously) and a "fine-tuning" lower bound (sequential training with no forgetting prevention). A sketch computing the first two metrics follows below.
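The metrics above are usually computed from an accuracy matrix. A minimal sketch, assuming R[i, j] holds accuracy on task j after training through task i (a common bookkeeping convention; the function name is illustrative):

<syntaxhighlight lang="python">
import numpy as np

def cl_metrics(R):
    """R: (T, T) array; R[i, j] = accuracy on task j after training task i."""
    T = R.shape[0]
    avg_acc = R[-1].mean()  # average accuracy once all tasks are trained
    # Backward transfer: how much later training changed old-task accuracy
    # (negative values indicate forgetting).
    bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])
    # Forward transfer additionally needs baseline accuracies b[j] from a
    # randomly initialized model: fwt = mean(R[j-1, j] - b[j]) for j >= 1.
    return avg_acc, bwt
</syntaxhighlight>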

Creating

Designing a production continual learning system:

  1. Identify the task sequence and whether task identity is available at inference.
  2. Choose replay if memory allows: maintain a diverse coreset using herding or iCaRL selection.
  3. Add EWC regularization on top of replay for additional stability.
  4. Use separate output heads per task if task identity is known.
  5. Monitor backward transfer in production: after every model update, evaluate on held-out samples from all past tasks.
  6. Implement a rollback mechanism that restores the previous checkpoint when forgetting on any past task exceeds a set threshold (see the sketch after this list).
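A hedged sketch of steps 5 and 6; evaluate, heldout_sets, and train_new_task are placeholders for your own evaluation function, per-task held-out data, and replay-plus-EWC training step, not a specific framework's API:

<syntaxhighlight lang="python">
import copy

BWT_THRESHOLD = -0.05  # tolerate at most a 5-point drop on any past task

def safe_update(model, train_new_task, heldout_sets, evaluate):
    """Train on a new task, then roll back if any old task degrades too much."""
    before = {t: evaluate(model, ds) for t, ds in heldout_sets.items()}
    snapshot = copy.deepcopy(model.state_dict())

    train_new_task(model)  # replay + EWC training (steps 2-3 above)

    after = {t: evaluate(model, ds) for t, ds in heldout_sets.items()}
    worst_delta = min(after[t] - before[t] for t in before)
    if worst_delta < BWT_THRESHOLD:
        model.load_state_dict(snapshot)  # rollback: forgetting exceeded budget
        return False
    return True
</syntaxhighlight>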