Continual Learning - Revision history

Wordpad: BloomWiki: Continual Learning

2026-04-25T01:49:20Z

BloomWiki: Continual Learning

← Older revision		Revision as of 01:49, 25 April 2026
Line 1:		Line 1:
			<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
	{{BloomIntro}}		{{BloomIntro}}
	Continual learning, also called lifelong learning or incremental learning, is the ability of a machine learning model to learn new tasks or data sequentially over time without forgetting previously acquired knowledge. This mirrors human cognition — we accumulate skills over a lifetime without having to re-learn everything from scratch. The central challenge is catastrophic forgetting: when a neural network is trained on new data, gradients overwrite the weights that encoded previous knowledge. Continual learning is essential for AI systems deployed in dynamic, evolving real-world environments.		Continual learning, also called lifelong learning or incremental learning, is the ability of a machine learning model to learn new tasks or data sequentially over time without forgetting previously acquired knowledge. This mirrors human cognition — we accumulate skills over a lifetime without having to re-learn everything from scratch. The central challenge is catastrophic forgetting: when a neural network is trained on new data, gradients overwrite the weights that encoded previous knowledge. Continual learning is essential for AI systems deployed in dynamic, evolving real-world environments.
			</div>

	== Remembering ==		__TOC__

			<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Remembering</span> ==
	* '''Continual learning''' — Training a model on a sequence of tasks or data streams over time without forgetting prior knowledge.		* '''Continual learning''' — Training a model on a sequence of tasks or data streams over time without forgetting prior knowledge.
	* '''Catastrophic forgetting''' — The tendency of neural networks to abruptly lose previously learned knowledge when trained on new data.		* '''Catastrophic forgetting''' — The tendency of neural networks to abruptly lose previously learned knowledge when trained on new data.
Line 17:		Line 22:
	* '''PackNet''' — Prunes and packs model weights for multiple tasks into the same fixed-size network.		* '''PackNet''' — Prunes and packs model weights for multiple tasks into the same fixed-size network.
	* '''Fisher information matrix''' — Used in EWC to measure parameter importance to previous tasks.		* '''Fisher information matrix''' — Used in EWC to measure parameter importance to previous tasks.
			</div>

	== Understanding ==		<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Understanding</span> ==
	The core problem: neural network weights encode knowledge through their specific values. When you train on Task B, gradient descent moves weights toward the minimum loss for Task B — often moving them away from the minimum for Task A. This is catastrophic forgetting: the gradient update for Task B destroys Task A's solution.		The core problem: neural network weights encode knowledge through their specific values. When you train on Task B, gradient descent moves weights toward the minimum loss for Task B — often moving them away from the minimum for Task A. This is catastrophic forgetting: the gradient update for Task B destroys Task A's solution.

Line 30:		Line 37:

	'''Architecture-based''': Allocate different model capacity to different tasks — freeze old weights, expand the model for new tasks (Progressive Neural Networks), or use dynamic sparse masks per task (PackNet, HAT).		'''Architecture-based''': Allocate different model capacity to different tasks — freeze old weights, expand the model for new tasks (Progressive Neural Networks), or use dynamic sparse masks per task (PackNet, HAT).
			</div>

	== Applying ==		<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Applying</span> ==
	'''EWC continual learning implementation:'''		'''EWC continual learning implementation:'''
	<syntaxhighlight lang="python">		<syntaxhighlight lang="python">
Line 76:		Line 85:
	: '''Separate task heads needed''' → Progressive Neural Networks		: '''Separate task heads needed''' → Progressive Neural Networks
	: '''Production NLP''' → Domain-adaptive pre-training on streaming data with KL replay		: '''Production NLP''' → Domain-adaptive pre-training on streaming data with KL replay
			</div>

	== Analyzing ==		<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Analyzing</span> ==
	{| class="wikitable"		{| class="wikitable"
	|+ Continual Learning Approach Comparison		|+ Continual Learning Approach Comparison
Line 94:		Line 105:

	'''Failure modes''': EWC fails when tasks are very different (Fisher approximation breaks down). Replay buffers don't scale to many tasks (coreset becomes unrepresentative). Progressive networks grow without bound. Class-incremental learning without task identity remains essentially unsolved at high accuracy.		'''Failure modes''': EWC fails when tasks are very different (Fisher approximation breaks down). Replay buffers don't scale to many tasks (coreset becomes unrepresentative). Progressive networks grow without bound. Class-incremental learning without task identity remains essentially unsolved at high accuracy.
			</div>

	== Evaluating ==		<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Evaluating</span> ==
	Evaluate on standard benchmarks: Split-MNIST, Split-CIFAR-100, Permuted-MNIST. Metrics:		Evaluate on standard benchmarks: Split-MNIST, Split-CIFAR-100, Permuted-MNIST. Metrics:
	# Average accuracy after all tasks trained.		# Average accuracy after all tasks trained.
Line 101:		Line 114:
	# Forward transfer — does learning past tasks accelerate learning of new tasks?		# Forward transfer — does learning past tasks accelerate learning of new tasks?
	# Memory efficiency — coreset size vs. performance trade-off. Expert practitioners always include a "joint training" upper bound (train on all tasks simultaneously) and a "fine-tuning" lower bound (no forgetting prevention).		# Memory efficiency — coreset size vs. performance trade-off. Expert practitioners always include a "joint training" upper bound (train on all tasks simultaneously) and a "fine-tuning" lower bound (no forgetting prevention).
			</div>

	== Creating ==		<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Creating</span> ==
	Designing a production continual learning system:		Designing a production continual learning system:
	# Identify the task sequence and whether task identity is available at inference.		# Identify the task sequence and whether task identity is available at inference.
Line 114:		Line 129:
	[[Category:Machine Learning]]		[[Category:Machine Learning]]
	[[Category:Continual Learning]]		[[Category:Continual Learning]]
			</div>

Wordpad: BloomWiki: Continual Learning

2026-04-23T14:35:26Z

BloomWiki: Continual Learning

← Older revision		Revision as of 14:35, 23 April 2026
Line 96:		Line 96:

	== Evaluating ==		== Evaluating ==
	Evaluate on standard benchmarks: Split-MNIST, Split-CIFAR-100, Permuted-MNIST. Metrics: ~~(1)~~ Average accuracy after all tasks trained. ~~(2)~~ Backward transfer — how much does learning new tasks affect old task accuracy (negative = forgetting). ~~(3)~~ Forward transfer — does learning past tasks accelerate learning of new tasks? ~~(4)~~ Memory efficiency — coreset size vs. performance trade-off. Expert practitioners always include a "joint training" upper bound (train on all tasks simultaneously) and a "fine-tuning" lower bound (no forgetting prevention).		Evaluate on standard benchmarks: Split-MNIST, Split-CIFAR-100, Permuted-MNIST. Metrics:
			# Average accuracy after all tasks trained.
			# Backward transfer — how much does learning new tasks affect old task accuracy (negative = forgetting).
			# Forward transfer — does learning past tasks accelerate learning of new tasks?
			# Memory efficiency — coreset size vs. performance trade-off. Expert practitioners always include a "joint training" upper bound (train on all tasks simultaneously) and a "fine-tuning" lower bound (no forgetting prevention).

	== Creating ==		== Creating ==
	Designing a production continual learning system: ~~(1)~~ Identify the task sequence and whether task identity is available at inference. ~~(2)~~ Choose replay if memory allows: maintain a diverse coreset using herding or iCaRL selection. ~~(3)~~ Add EWC regularization on top of replay for additional stability. ~~(4)~~ Use separate output heads per task if task identity is known. ~~(5)~~ Monitor backward transfer in production: after every model update, evaluate on held-out samples from all past tasks. ~~(6)~~ Implement a rollback mechanism if backward transfer exceeds threshold.		Designing a production continual learning system:
			# Identify the task sequence and whether task identity is available at inference.
			# Choose replay if memory allows: maintain a diverse coreset using herding or iCaRL selection.
			# Add EWC regularization on top of replay for additional stability.
			# Use separate output heads per task if task identity is known.
			# Monitor backward transfer in production: after every model update, evaluate on held-out samples from all past tasks.
			# Implement a rollback mechanism if backward transfer exceeds threshold.

	[[Category:Artificial Intelligence]]		[[Category:Artificial Intelligence]]
	[[Category:Machine Learning]]		[[Category:Machine Learning]]
	[[Category:Continual Learning]]		[[Category:Continual Learning]]

Wordpad: BloomWiki: Continual Learning

2026-04-23T14:20:00Z

BloomWiki: Continual Learning

← Older revision		Revision as of 14:20, 23 April 2026
Line 21:		Line 21:
	The core problem: neural network weights encode knowledge through their specific values. When you train on Task B, gradient descent moves weights toward the minimum loss for Task B — often moving them away from the minimum for Task A. This is catastrophic forgetting: the gradient update for Task B destroys Task A's solution.		The core problem: neural network weights encode knowledge through their specific values. When you train on Task B, gradient descent moves weights toward the minimum loss for Task B — often moving them away from the minimum for Task A. This is catastrophic forgetting: the gradient update for Task B destroys Task A's solution.

	The stability-plasticity dilemma: A model that never forgets (high stability) must also never change weights (low plasticity) — so it can't learn new things. A model that learns quickly (high plasticity) overwrites old knowledge. Managing this trade-off is the central challenge.		'''The stability-plasticity dilemma''': A model that never forgets (high stability) must also never change weights (low plasticity) — so it can't learn new things. A model that learns quickly (high plasticity) overwrites old knowledge. Managing this trade-off is the central challenge.

	Three families of solutions:		'''Three families of solutions''':

	Regularization-based: Add a penalty term to the loss that discourages large changes to parameters important for previous tasks. EWC computes the Fisher information matrix (how important each parameter is to Task A's performance) and uses it to weight the penalty: L = ~~L_new~~ + λ ~~Σ_i F_i~~ (~~θ_i~~ - θ*_i)². Quadratic Penalties penalize based on simple L2 distance from previous parameters.		'''Regularization-based''': Add a penalty term to the loss that discourages large changes to parameters important for previous tasks. EWC computes the Fisher information matrix (how important each parameter is to Task A's performance) and uses it to weight the penalty: L = L_new + λ Σ_i F_i (θ_i - θ*_i)². Quadratic Penalties penalize based on simple L2 distance from previous parameters.

	Memory-based (Replay): Keep a small buffer of past examples (coreset) and interleave them with new task data. This directly prevents forgetting by ensuring gradients for old tasks continue to appear. Gradient Episodic Memory (GEM) ensures new task gradients don't increase loss on stored examples.		'''Memory-based (Replay)''': Keep a small buffer of past examples (coreset) and interleave them with new task data. This directly prevents forgetting by ensuring gradients for old tasks continue to appear. Gradient Episodic Memory (GEM) ensures new task gradients don't increase loss on stored examples.

	Architecture-based: Allocate different model capacity to different tasks — freeze old weights, expand the model for new tasks (Progressive Neural Networks), or use dynamic sparse masks per task (PackNet, HAT).		'''Architecture-based''': Allocate different model capacity to different tasks — freeze old weights, expand the model for new tasks (Progressive Neural Networks), or use dynamic sparse masks per task (PackNet, HAT).

	== Applying ==		== Applying ==

Wordpad: New BloomWiki article: Continual Learning

2026-04-23T06:46:08Z

New BloomWiki article: Continual Learning

New page

{{BloomIntro}}
Continual learning, also called lifelong learning or incremental learning, is the ability of a machine learning model to learn new tasks or data sequentially over time without forgetting previously acquired knowledge. This mirrors human cognition — we accumulate skills over a lifetime without having to re-learn everything from scratch. The central challenge is catastrophic forgetting: when a neural network is trained on new data, gradients overwrite the weights that encoded previous knowledge. Continual learning is essential for AI systems deployed in dynamic, evolving real-world environments.

== Remembering ==
* '''Continual learning''' — Training a model on a sequence of tasks or data streams over time without forgetting prior knowledge.
* '''Catastrophic forgetting''' — The tendency of neural networks to abruptly lose previously learned knowledge when trained on new data.
* '''Task-incremental learning''' — The model learns a sequence of distinct tasks, with task identity known at inference time.
* '''Class-incremental learning''' — The model incrementally learns new classes; task identity is not given at test time (harder).
* '''Domain-incremental learning''' — Same task but data distribution changes over time (e.g., new image styles).
* '''Plasticity''' — The model's ability to learn new information quickly.
* '''Stability''' — The model's ability to retain previously learned information.
* '''Stability-plasticity dilemma''' — The fundamental trade-off: high plasticity enables fast learning but causes forgetting; high stability prevents forgetting but blocks new learning.
* '''Elastic Weight Consolidation (EWC)''' — A regularization approach that penalizes changes to parameters important for previous tasks.
* '''Progressive Neural Networks''' — Freeze previous task columns and add new lateral connections for new tasks; no forgetting but grows with each task.
* '''Experience replay''' — Storing a small buffer of past examples and mixing them into training on new tasks.
* '''Dark Experience Replay (DER)''' — Stores soft targets (logits) from past predictions, not just input-output pairs.
* '''PackNet''' — Prunes and packs model weights for multiple tasks into the same fixed-size network.
* '''Fisher information matrix''' — Used in EWC to measure parameter importance to previous tasks.

== Understanding ==
The core problem: neural network weights encode knowledge through their specific values. When you train on Task B, gradient descent moves weights toward the minimum loss for Task B — often moving them away from the minimum for Task A. This is catastrophic forgetting: the gradient update for Task B destroys Task A's solution.

'''The stability-plasticity dilemma''': A model that never forgets (high stability) must also never change weights (low plasticity), so it can't learn new things. A model that learns quickly (high plasticity) overwrites old knowledge. Managing this trade-off is the central challenge.

'''Three families of solutions''':

'''Regularization-based''': Add a penalty term to the loss that discourages large changes to parameters important for previous tasks. EWC computes the Fisher information matrix (how important each parameter is to Task A's performance) and uses it to weight the penalty: L = L_new + λ Σ_i F_i (θ_i - θ*_i)², where θ*_i is the value of parameter i after training on Task A, F_i is its Fisher importance, and λ sets the strength of the penalty. A plain quadratic penalty drops the Fisher weighting and penalizes the simple L2 distance from the previous parameters.
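
As a worked illustration of the penalty formula, the following minimal sketch evaluates L = L_new + λ Σ_i F_i (θ_i - θ*_i)² for three scalar parameters. The function name and every number (Fisher values, parameters, λ, and the new-task loss) are made up for illustration and do not come from any real model.

```python
def ewc_total_loss(new_task_loss, fisher, params, old_params, lam):
    """Return L_new + lam * sum_i F_i * (theta_i - theta*_i)^2."""
    penalty = sum(f * (p - p_old) ** 2
                  for f, p, p_old in zip(fisher, params, old_params))
    return new_task_loss + lam * penalty


# Hypothetical values: the first parameter mattered a lot for Task A,
# the third did not matter at all.
fisher = [4.0, 1.0, 0.0]
old_params = [1.0, -2.0, 0.5]   # theta* after Task A
params = [1.5, -2.0, 3.0]       # theta during Task B training

total = ewc_total_loss(new_task_loss=0.25, fisher=fisher,
                       params=params, old_params=old_params, lam=2.0)
print(total)  # prints 2.25
```

The third parameter moves by 2.5 at no cost because its Fisher value is zero, while a shift of only 0.5 in the first parameter accounts for the entire penalty. This is the sense in which EWC is "elastic": unimportant parameters stay free to change.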

'''Memory-based (Replay)''': Keep a small buffer of past examples (coreset) and interleave them with new task data. This directly counteracts forgetting by ensuring gradients for old tasks continue to appear. Gradient Episodic Memory (GEM) instead uses the stored examples as constraints, adjusting each new-task gradient so that it does not increase the loss on them.
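
The replay idea can be sketched in a few lines of plain Python. The version below fills a fixed-size buffer by reservoir sampling, which is one common choice and an assumption here, not something the article prescribes. The names <code>ReplayBuffer</code> and <code>mixed_batch</code> and the example tuples are hypothetical.

```python
import random


class ReplayBuffer:
    """Fixed-size buffer of past examples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.examples = []
        self.num_seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        """Offer one example from the data stream to the buffer."""
        self.num_seen += 1
        if len(self.examples) < self.capacity:
            self.examples.append(example)
        else:
            # Keep the new example with probability capacity / num_seen,
            # so every example seen so far is equally likely to be stored.
            slot = self.rng.randrange(self.num_seen)
            if slot < self.capacity:
                self.examples[slot] = example

    def sample(self, k):
        """Draw up to k stored examples without replacement."""
        return self.rng.sample(self.examples, min(k, len(self.examples)))


def mixed_batch(new_examples, buffer, replay_size):
    """Interleave new-task examples with replayed old ones."""
    return list(new_examples) + buffer.sample(replay_size)


buffer = ReplayBuffer(capacity=4)
for i in range(10):
    buffer.add(("task_a", i))   # hypothetical (task, example id) pairs

batch = mixed_batch([("task_b", 0), ("task_b", 1)], buffer, replay_size=2)
print(len(buffer.examples), len(batch))  # prints 4 4
```

Each training step on the new task would then compute its loss on <code>batch</code> rather than on the new examples alone, so gradients from the old task keep appearing.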

'''Architecture-based''': Allocate different model capacity to different tasks: freeze old weights, expand the model for new tasks (Progressive Neural Networks), or use dynamic sparse masks per task (PackNet, HAT).
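
To make the per-task mask idea concrete, here is a deliberately simplified, PackNet-inspired sketch on a flat list of weights. It is not the published PackNet procedure (which also retrains after pruning). It only shows the bookkeeping: each task claims the largest-magnitude weights that are still free, and claimed weights are frozen. All function names and values are hypothetical.

```python
def assign_weights_to_task(weights, owner, task_id, keep_fraction):
    """Claim the largest-magnitude free weights for task_id.

    owner[i] is None while weight i is still free; otherwise it holds the
    id of the task that owns (and has frozen) that weight.
    """
    free = [i for i, o in enumerate(owner) if o is None]
    n_keep = int(len(free) * keep_fraction)
    ranked = sorted(free, key=lambda i: abs(weights[i]), reverse=True)
    for i in ranked[:n_keep]:
        owner[i] = task_id
    return owner


def trainable_mask(owner):
    """Only weights that no task owns yet may receive gradient updates."""
    return [o is None for o in owner]


def inference_mask(owner, task_id):
    """At inference, a task uses weights owned by it or by earlier tasks."""
    return [o is not None and o <= task_id for o in owner]


weights = [0.9, -0.1, 0.5, -0.7, 0.05, 0.3]   # hypothetical trained weights
owner = [None] * len(weights)

assign_weights_to_task(weights, owner, task_id=0, keep_fraction=0.5)
print(owner)                  # prints [0, None, 0, 0, None, None]
print(trainable_mask(owner))  # prints [False, True, False, False, True, True]
```

Because the set of free weights shrinks with every task, the sketch also shows why fixed-size approaches of this kind eventually run out of capacity.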

== Applying ==
'''EWC continual learning implementation:'''
<syntaxhighlight lang="python">
import torch
import torch.nn as nn


class EWC:
    """Elastic Weight Consolidation penalty for one previous task."""

    def __init__(self, model, dataloader, device='cuda', importance=1000):
        self.model = model
        self.importance = importance
        # Snapshot of the parameters after training on the previous task.
        self.prev_params = {n: p.clone().detach() for n, p in model.named_parameters()}
        self.fisher = self._compute_fisher(dataloader, device)

    def _compute_fisher(self, dataloader, device):
        """Estimate the diagonal of the Fisher information matrix."""
        fisher = {n: torch.zeros_like(p) for n, p in self.model.named_parameters()}
        criterion = nn.CrossEntropyLoss()
        self.model.eval()
        for x, y in dataloader:
            x, y = x.to(device), y.to(device)
            self.model.zero_grad()
            loss = criterion(self.model(x), y)
            loss.backward()
            for n, p in self.model.named_parameters():
                if p.grad is not None:
                    # Squared batch gradients, averaged over batches; a batch
                    # size of 1 gives a closer per-sample estimate.
                    fisher[n] += p.grad.detach().pow(2) / len(dataloader)
        return fisher

    def penalty(self):
        """EWC regularization term: importance * sum_i F_i (theta_i - theta*_i)^2."""
        loss = 0.0
        for n, p in self.model.named_parameters():
            loss += (self.fisher[n] * (p - self.prev_params[n]).pow(2)).sum()
        return self.importance * loss


# Training loop with EWC: after finishing Task A, build
#   ewc = EWC(model, task_a_loader, device)
# then, for every Task B batch, optimize
#   total_loss = task_loss + ewc.penalty()
</syntaxhighlight>

; Continual learning approach selection
: '''Memory-constrained, simple''' → EWC or SI (Synaptic Intelligence) regularization
: '''Memory available''' → Experience replay (keep 200-500 examples per past task)
: '''Separate task heads needed''' → Progressive Neural Networks
: '''Production NLP''' → Domain-adaptive pre-training on streaming data with KL replay

== Analyzing ==
{| class="wikitable"
|+ Continual Learning Approach Comparison
! Approach !! Memory Overhead !! Forgetting Level !! Scalability
|-
| Fine-tuning (no CL) || None || Catastrophic || High (but useless)
|-
| EWC regularization || Small (Fisher matrix) || Moderate || Good
|-
| Experience replay || Coreset size || Low || Good
|-
| Progressive networks || Grows with tasks || Zero || Poor (unbounded growth)
|-
| PackNet || None (fixed size) || Zero || Moderate (limited capacity)
|}

'''Failure modes''': EWC fails when tasks are very different (Fisher approximation breaks down). Replay buffers don't scale to many tasks (coreset becomes unrepresentative). Progressive networks grow without bound. Class-incremental learning without task identity remains essentially unsolved at high accuracy.

== Evaluating ==
Evaluate on standard benchmarks such as Split-MNIST, Split-CIFAR-100, and Permuted-MNIST. Metrics:
# Average accuracy: mean accuracy over all tasks after the final task has been trained.
# Backward transfer: how much learning new tasks changes accuracy on old tasks (negative values indicate forgetting).
# Forward transfer: whether what was learned on past tasks accelerates learning of new tasks.
# Memory efficiency: the trade-off between coreset size and performance.

Results should be reported alongside a "joint training" upper bound (train on all tasks simultaneously) and a "fine-tuning" lower bound (sequential training with no forgetting prevention).
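
The first three metrics can be computed from an accuracy matrix in which entry <code>acc[i][j]</code> is the accuracy on task j after training on task i. The sketch below follows one widely used set of definitions (those introduced alongside GEM). Other papers define the metrics slightly differently, so treat the formulas as an assumption. The function names and the numbers in the matrix are invented for illustration.

```python
def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the last task."""
    final = acc[-1]
    return sum(final) / len(final)


def backward_transfer(acc):
    """Mean change in each earlier task's accuracy by the end of training.

    Negative values indicate forgetting. Assumes at least two tasks.
    """
    n = len(acc)
    return sum(acc[-1][j] - acc[j][j] for j in range(n - 1)) / (n - 1)


def forward_transfer(acc, baseline):
    """Mean gain on each task just before it is trained, relative to the
    accuracy of an untrained model on that task (baseline[j])."""
    n = len(acc)
    return sum(acc[j - 1][j] - baseline[j] for j in range(1, n)) / (n - 1)


# Hypothetical 3-task run: acc[i][j] = accuracy on task j after training task i.
acc = [
    [0.90, 0.20, 0.10],
    [0.70, 0.80, 0.30],
    [0.60, 0.70, 0.90],
]
baseline = [0.10, 0.10, 0.10]   # hypothetical untrained-model accuracy per task

print(round(average_accuracy(acc), 3))            # prints 0.733
print(round(backward_transfer(acc), 3))           # prints -0.2
print(round(forward_transfer(acc, baseline), 3))  # prints 0.15
```

In this made-up run the model ends with reasonable average accuracy but has lost 0.2 accuracy on average on earlier tasks, which the average alone would hide.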

== Creating ==
Designing a production continual learning system:
# Identify the task sequence and whether task identity is available at inference.
# Choose replay if memory allows: maintain a diverse coreset using herding-based exemplar selection, as in iCaRL.
# Add EWC regularization on top of replay for additional stability.
# Use separate output heads per task if task identity is known.
# Monitor backward transfer in production: after every model update, evaluate on held-out samples from all past tasks.
# Implement a rollback mechanism that restores the previous model when forgetting (negative backward transfer) exceeds a set threshold.
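
The monitoring and rollback steps could look roughly like the following sketch. It assumes an <code>evaluate(model, data)</code> callable that returns accuracy; that callable, the function names, and the 0.02 default threshold are all hypothetical placeholders, not part of any particular framework.

```python
def should_roll_back(acc_before, acc_after, max_drop):
    """Return True if mean accuracy on past tasks fell by more than max_drop."""
    drops = [before - after for before, after in zip(acc_before, acc_after)]
    mean_drop = sum(drops) / len(drops)
    return mean_drop > max_drop


def deploy_update(current_model, candidate_model, evaluate, past_task_sets,
                  max_drop=0.02):
    """Keep the candidate model only if it does not forget too much."""
    acc_before = [evaluate(current_model, data) for data in past_task_sets]
    acc_after = [evaluate(candidate_model, data) for data in past_task_sets]
    if should_roll_back(acc_before, acc_after, max_drop):
        return current_model   # roll back: keep serving the old model
    return candidate_model


# Stand-ins for real models and held-out sets: each "model" is a dict that
# maps a task name to its accuracy on that task's held-out samples.
old_model = {"task_a": 0.90, "task_b": 0.85}
new_model = {"task_a": 0.70, "task_b": 0.84}


def evaluate(model, task):
    return model[task]


chosen = deploy_update(old_model, new_model, evaluate, ["task_a", "task_b"])
print(chosen is old_model)  # prints True
```

Here the candidate loses about 0.105 accuracy on average across past tasks, well above the threshold, so the previous model stays in service. A real system would also need to decide whether one badly degraded task should trigger a rollback even when the mean looks acceptable.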

[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:Continual Learning]]