Continual Learning

<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Continual learning, also called lifelong learning or incremental learning, is the ability of a machine learning model to learn new tasks or data sequentially over time without forgetting previously acquired knowledge. This mirrors human cognition — we accumulate skills over a lifetime without having to re-learn everything from scratch. The central challenge is catastrophic forgetting: when a neural network is trained on new data, gradients overwrite the weights that encoded previous knowledge. Continual learning is essential for AI systems deployed in dynamic, evolving real-world environments.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Continual learning''' — Training a model on a sequence of tasks or data streams over time without forgetting prior knowledge.
* '''Catastrophic forgetting''' — The tendency of neural networks to abruptly lose previously learned knowledge when trained on new data.
* '''Task-incremental learning''' — The model learns a sequence of distinct tasks, with task identity known at inference time.
* '''Class-incremental learning''' — The model incrementally learns new classes; task identity is not given at test time (harder).
* '''Domain-incremental learning''' — Same task but data distribution changes over time (e.g., new image styles).
* '''Plasticity''' — The model's ability to learn new information quickly.
* '''Stability''' — The model's ability to retain previously learned information.
* '''Stability-plasticity dilemma''' — The fundamental trade-off: high plasticity enables fast learning but causes forgetting; high stability prevents forgetting but blocks new learning.
* '''Elastic Weight Consolidation (EWC)''' — A regularization approach that penalizes changes to parameters important for previous tasks.
* '''Progressive Neural Networks''' — Freeze previous task columns and add new lateral connections for new tasks; no forgetting but grows with each task.
* '''Experience replay''' — Storing a small buffer of past examples and mixing them into training on new tasks.
* '''Dark Experience Replay (DER)''' — Stores soft targets (logits) from past predictions, not just input-output pairs.
* '''PackNet''' — Prunes and packs model weights for multiple tasks into the same fixed-size network.
* '''Fisher information matrix''' — Used in EWC to measure parameter importance to previous tasks.
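
The experience replay entry above can be sketched with a fixed-size buffer. This is a minimal illustration, assuming reservoir sampling as the selection rule (one common choice that the entries above do not prescribe); the class name <code>ReservoirBuffer</code> and the example pairs are hypothetical.

<syntaxhighlight lang="python">
import random


class ReservoirBuffer:
    """Fixed-size replay buffer filled by reservoir sampling (assumed selection rule)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of examples offered so far
        self.rng = random.Random(seed)

    def add(self, example):
        """Offer one example; every example seen so far is kept with equal probability."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        """Draw up to k stored examples to mix into the current training batch."""
        return self.rng.sample(self.items, min(k, len(self.items)))


buffer = ReservoirBuffer(capacity=5)
for i in range(100):
    buffer.add((f"x{i}", i % 10))  # hypothetical (input, label) pairs
replayed = buffer.sample(3)
</syntaxhighlight>

The buffer never grows past <code>capacity</code>, so memory stays constant no matter how long the data stream is.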
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
The core problem: neural network weights encode knowledge through their specific values. When you train on Task B, gradient descent moves weights toward the minimum loss for Task B — often moving them away from the minimum for Task A. This is catastrophic forgetting: the gradient update for Task B destroys Task A's solution.
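
A toy illustration of this effect, assuming a one-parameter "model" and two hypothetical tasks whose squared-error losses have different optima (<code>TASK_A</code> and <code>TASK_B</code> are made-up values):

<syntaxhighlight lang="python">
def loss(w, target):
    """Squared-error loss for a task whose optimum is `target`."""
    return (w - target) ** 2


def train(w, target, steps=100, lr=0.1):
    """Plain gradient descent on a single task."""
    for _ in range(steps):
        grad = 2 * (w - target)
        w -= lr * grad
    return w


TASK_A, TASK_B = 1.0, -1.0  # hypothetical optima of the two tasks

w = train(0.0, TASK_A)
loss_a_before = loss(w, TASK_A)  # close to 0: Task A is solved
w = train(w, TASK_B)             # sequential training on Task B only
loss_a_after = loss(w, TASK_A)   # close to 4: Task A's solution is gone
</syntaxhighlight>

Nothing in the second call to <code>train</code> refers to Task A, so the parameter simply moves to Task B's optimum.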

'''The stability-plasticity dilemma''': A model that never forgets (high stability) must also never change weights (low plasticity) — so it can't learn new things. A model that learns quickly (high plasticity) overwrites old knowledge. Managing this trade-off is the central challenge.

'''Three families of solutions''':

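'''Regularization-based''': Add a penalty term to the loss that discourages large changes to parameters important for previous tasks. EWC estimates the diagonal of the Fisher information matrix (how important each parameter is to Task A's performance) and uses it to weight the penalty: <math>L(\theta) = L_{\text{new}}(\theta) + \lambda \sum_i F_i (\theta_i - \theta^*_i)^2</math>, where <math>\theta^*_i</math> is the value of parameter <math>i</math> after training on Task A. A plain quadratic penalty is the simpler variant that uses the unweighted L2 distance from the previous parameters.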
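
A one-dimensional worked case may make the trade-off in the EWC penalty concrete. This is an illustrative toy setting, not part of the method's definition: assume a single parameter <math>\theta</math>, a new-task loss <math>L_{\text{new}}(\theta) = (\theta - b)^2</math>, a previous optimum <math>\theta^* = a</math>, and a Fisher value <math>F</math>.

<math display="block">
\begin{align}
L(\theta) &= (\theta - b)^2 + \lambda F (\theta - a)^2 \\
\frac{dL}{d\theta} &= 2(\theta - b) + 2 \lambda F (\theta - a) = 0 \\
\theta &= \frac{b + \lambda F a}{1 + \lambda F}
\end{align}
</math>

Under these assumptions the minimizer is a weighted average of the two optima: as <math>\lambda F \to 0</math> it approaches <math>b</math> (full plasticity), and as <math>\lambda F</math> grows it stays near <math>a</math> (full stability).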

'''Memory-based (Replay)''': Keep a small buffer of past examples (coreset) and interleave them with new task data. This directly prevents forgetting by ensuring gradients for old tasks continue to appear. Gradient Episodic Memory (GEM) ensures new task gradients don't increase loss on stored examples.
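
The idea of keeping new-task gradients from increasing the loss on stored examples can be sketched with a single projection step. Note the substitution: this is the one-constraint projection used by Averaged GEM (A-GEM), shown here because it is simpler than the quadratic program that GEM itself solves. The function names and the example gradients are hypothetical.

<syntaxhighlight lang="python">
def dot(u, v):
    """Inner product of two equal-length gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_gradient(grad_new, grad_memory):
    """Return grad_new unchanged if it does not conflict with grad_memory;
    otherwise remove the component that points against grad_memory."""
    conflict = dot(grad_new, grad_memory)
    if conflict >= 0:
        return list(grad_new)
    scale = conflict / dot(grad_memory, grad_memory)
    return [g - scale * m for g, m in zip(grad_new, grad_memory)]


g_new = [1.0, -2.0]  # hypothetical gradient of the new-task loss
g_mem = [0.0, 1.0]   # hypothetical gradient of the loss on stored examples
g_proj = project_gradient(g_new, g_mem)
</syntaxhighlight>

After projection the inner product with the memory gradient is zero, so to first order a step along <code>g_proj</code> leaves the loss on the stored examples unchanged.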

'''Architecture-based''': Allocate different model capacity to different tasks — freeze old weights, expand the model for new tasks (Progressive Neural Networks), or use dynamic sparse masks per task (PackNet, HAT).
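
A rough sketch of the mask-based idea, loosely in the spirit of PackNet and not a faithful reimplementation: after a task is trained, the largest-magnitude free weights are assigned to it and frozen, while the remaining weights are reset and left for later tasks. The flat weight list and the helper names <code>claim_weights</code> and <code>masked_update</code> are assumptions made for illustration.

<syntaxhighlight lang="python">
def claim_weights(weights, owner, task_id, keep_fraction):
    """Assign the largest-magnitude free weights to task_id and reset the rest to zero."""
    free = [i for i, o in enumerate(owner) if o is None]
    free.sort(key=lambda i: abs(weights[i]), reverse=True)
    n_keep = int(len(free) * keep_fraction)
    for i in free[:n_keep]:
        owner[i] = task_id  # frozen for all later tasks
    for i in free[n_keep:]:
        weights[i] = 0.0    # pruned, stays free for future tasks


def masked_update(weights, grads, owner, lr=0.1):
    """Gradient step that only touches weights not yet owned by any task."""
    for i, g in enumerate(grads):
        if owner[i] is None:
            weights[i] -= lr * g


weights = [0.9, -0.1, 0.5, 0.05]
owner = [None] * 4
claim_weights(weights, owner, task_id=0, keep_fraction=0.5)  # after training task 0
masked_update(weights, [1.0, 1.0, 1.0, 1.0], owner)          # while training task 1
</syntaxhighlight>

Because the weights owned by task 0 are never updated again, its predictions are unaffected by later tasks, at the cost of a shrinking pool of free weights.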
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''EWC continual learning implementation:'''
<syntaxhighlight lang="python">
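import torch
import torch.nn as nn


class EWC:
    """EWC penalty anchored at the weights learned on the previous task."""

    def __init__(self, model, dataloader, device=None, importance=1000):
        self.model = model
        self.importance = importance  # lambda: strength of the penalty
        # Default to the device the model already lives on instead of assuming a GPU.
        self.device = device if device is not None else next(model.parameters()).device
        # Snapshot of the trainable parameters after the previous task (theta*).
        self.prev_params = {n: p.clone().detach()
                            for n, p in model.named_parameters() if p.requires_grad}
        self.fisher = self._compute_fisher(dataloader)

    def _compute_fisher(self, dataloader):
        """Estimate the diagonal of the empirical Fisher information matrix.

        Gradients are squared per mini-batch, a common approximation;
        use batch_size=1 in the dataloader for a per-example estimate.
        """
        fisher = {n: torch.zeros_like(p) for n, p in self.prev_params.items()}
        was_training = self.model.training
        self.model.eval()
        criterion = nn.CrossEntropyLoss()
        for x, y in dataloader:
            x, y = x.to(self.device), y.to(self.device)
            self.model.zero_grad()
            loss = criterion(self.model(x), y)
            loss.backward()
            for n, p in self.model.named_parameters():
                if n in fisher and p.grad is not None:
                    fisher[n] += p.grad.detach().pow(2) / len(dataloader)
        self.model.zero_grad()            # do not leak gradients into the next step
        self.model.train(was_training)    # restore the caller's train/eval mode
        return fisher

    def penalty(self):
        """EWC regularization term: importance * sum_i F_i * (theta_i - theta*_i)^2."""
        loss = 0.0
        for n, p in self.model.named_parameters():
            if n in self.fisher:
                loss = loss + (self.fisher[n] * (p - self.prev_params[n]).pow(2)).sum()
        return self.importance * loss

# Usage: build the EWC object after finishing a task,
# then add its penalty to the loss while training on the next task.
# ewc = EWC(model, prev_task_loader)
# total_loss = task_loss + ewc.penalty()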
</syntaxhighlight>

; Continual learning approach selection
: '''Memory-constrained, simple''' → EWC or SI (Synaptic Intelligence) regularization
: '''Memory available''' → Experience replay (keep 200-500 examples per past task)
: '''Separate task heads needed''' → Progressive Neural Networks
: '''Production NLP''' → Domain-adaptive pre-training on streaming data with KL replay
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Continual Learning Approach Comparison
! Approach !! Memory Overhead !! Forgetting Level !! Scalability
|-
| Fine-tuning (no CL) || None || Catastrophic || High (but old tasks are lost)
|-
| EWC regularization || Small (diagonal Fisher plus a parameter snapshot) || Moderate || Good
|-
| Experience replay || Coreset size || Low || Good
|-
| Progressive networks || Grows with tasks || Zero || Poor (unbounded growth)
|-
| PackNet || Small (per-task binary masks; fixed network size) || Zero || Moderate (limited capacity)
|}

'''Failure modes''': EWC degrades when tasks are very different, because the local Fisher-based approximation around the old solution breaks down. With a fixed-size replay buffer, the share available to each past task shrinks as tasks accumulate, so the coreset becomes less representative. Progressive networks grow without bound. Class-incremental learning, where task identity is not available at test time, remains an open problem at high accuracy.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Evaluate on standard benchmarks: Split-MNIST, Split-CIFAR-100, Permuted-MNIST. Metrics:
# Average accuracy: mean accuracy over all tasks after the final task has been trained.
# Backward transfer: how much learning new tasks changes accuracy on old tasks (negative values indicate forgetting).
# Forward transfer: whether learning past tasks improves or accelerates learning of new tasks.
# Memory efficiency: coreset size versus performance trade-off.

It is standard practice to also report a "joint training" upper bound (train on all tasks simultaneously) and a "fine-tuning" lower bound (sequential training with no forgetting prevention).
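
The first three metrics can be computed from a matrix of accuracies recorded during sequential training. The sketch below assumes the commonly used definitions in which <code>acc[i][j]</code> is the accuracy on task <code>j</code> after training on tasks <code>0..i</code>, and forward transfer is measured against an untrained baseline; the numbers are made up purely for illustration.

<syntaxhighlight lang="python">
def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task."""
    final = acc[-1]
    return sum(final) / len(final)


def backward_transfer(acc):
    """Mean change in accuracy on each earlier task between the moment it was
    learned and the end of training (negative values indicate forgetting)."""
    n = len(acc)
    return sum(acc[-1][j] - acc[j][j] for j in range(n - 1)) / (n - 1)


def forward_transfer(acc, baseline):
    """Mean accuracy on each task just before training on it,
    relative to an untrained baseline model."""
    n = len(acc)
    return sum(acc[j - 1][j] - baseline[j] for j in range(1, n)) / (n - 1)


# acc[i][j] = accuracy on task j after training on tasks 0..i (made-up numbers).
acc = [
    [0.90, 0.20, 0.10],
    [0.80, 0.85, 0.30],
    [0.70, 0.75, 0.95],
]
baseline = [0.10, 0.10, 0.10]  # hypothetical accuracy of a randomly initialised model

avg = average_accuracy(acc)
bwt = backward_transfer(acc)
fwt = forward_transfer(acc, baseline)
</syntaxhighlight>

In this made-up example the backward transfer is negative, which signals forgetting on the first two tasks.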
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a production continual learning system:
# Identify the task sequence and whether task identity is available at inference.
# Choose replay if memory allows: maintain a diverse coreset using herding or iCaRL selection.
# Add EWC regularization on top of replay for additional stability.
# Use separate output heads per task if task identity is known.
# Monitor backward transfer in production: after every model update, evaluate on held-out samples from all past tasks.
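# Implement a rollback mechanism that restores the previous model when backward transfer falls below a chosen threshold, that is, when forgetting on past tasks is larger than acceptable.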
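
The monitoring and rollback steps above might be wired together roughly as follows. This is a sketch under stated assumptions: the function name <code>should_roll_back</code>, the per-task accuracy dictionaries, and the 0.02 tolerance are all hypothetical and would need to be chosen for the system at hand.

<syntaxhighlight lang="python">
def should_roll_back(acc_before, acc_after, max_drop=0.02):
    """Return (decision, worst_task): roll back if any past task lost more
    than max_drop accuracy after the model update."""
    drops = {task: acc_before[task] - acc_after[task] for task in acc_before}
    worst_task = max(drops, key=drops.get)
    return drops[worst_task] > max_drop, worst_task


before = {"task_a": 0.91, "task_b": 0.88}  # held-out accuracy before the update
after = {"task_a": 0.90, "task_b": 0.80}   # held-out accuracy after the update
decision, task = should_roll_back(before, after)
</syntaxhighlight>

Checking the worst-affected task instead of the average avoids hiding a large regression on one task behind small gains on others.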

[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Continual Learning]]
</div>