Adversarial ML
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Adversarial machine learning is the study of attacks against AI systems and defenses against those attacks. Just as traditional software has security vulnerabilities, machine learning models are vulnerable to adversarial attacks: carefully crafted inputs that cause a model to fail, often in ways invisible to humans. A stop sign with a few printed stickers can cause a self-driving car's vision system to see it as a speed limit sign. A "jailbreak" prompt can override an LLM's safety training. An imperceptible perturbation to an audio waveform can cause a speech recognizer to transcribe entirely different words. Adversarial ML is both an offensive research domain (finding vulnerabilities) and a defensive one (building robust AI).
Remembering
- Adversarial example — An input deliberately modified to cause a model to produce an incorrect output, often with the modification imperceptible to humans.
- Perturbation — The modification added to a clean input to create an adversarial example; typically constrained to be small.
- L∞ perturbation — Limits the maximum change to any single pixel/feature; the most common adversarial constraint.
- L2 perturbation — Limits the total Euclidean distance between the original and perturbed input (both norms are illustrated in the sketch after this list).
- White-box attack — An attack with full knowledge of the model architecture, weights, and gradients.
- Black-box attack — An attack without model access; attacker only observes inputs and outputs.
- Targeted attack — An adversarial attack crafted to make the model produce a specific wrong output.
- Untargeted attack — An attack that only needs to make the model produce any wrong output.
- FGSM (Fast Gradient Sign Method) — A simple one-step adversarial attack using the sign of the gradient to perturb inputs.
- PGD (Projected Gradient Descent) — A stronger iterative multi-step adversarial attack; the gold standard for evaluating robustness.
- Adversarial training — The most effective defense: include adversarial examples in training data.
- Transferability — Adversarial examples often transfer between models trained on the same data, enabling black-box attacks.
- Backdoor attack (Trojan) — Poisoning training data with a trigger pattern that causes misbehavior only when the trigger is present.
- Data poisoning — Corrupting training data to cause specific model failures at test time.
- Certified robustness — A formal guarantee that a model's prediction will not change within a specified perturbation radius.
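Both norm constraints are easy to compute directly. A minimal plain-PyTorch sketch, with placeholder tensors rather than the output of a real attack:
<syntaxhighlight lang="python">
import torch

x = torch.rand(3, 224, 224)          # clean image, values in [0, 1]
delta = 0.01 * torch.randn_like(x)   # some candidate perturbation
x_adv = (x + delta).clamp(0, 1)      # perturbed input, still a valid image

# L∞ norm: the largest change made to any single pixel/feature
linf = (x_adv - x).abs().max().item()

# L2 norm: total Euclidean distance between clean and perturbed input
l2 = (x_adv - x).flatten().norm(p=2).item()

# A common image threat model allows each pixel to move by at most ε = 8/255
print(f"L-inf = {linf:.4f} (budget {8/255:.4f}), L2 = {l2:.2f}")
</syntaxhighlight>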
Understanding
The fundamental discovery (Szegedy et al., 2014) that shocked the ML community: deep neural networks, which achieve superhuman accuracy on image classification, are trivially fooled by adding small, structured noise imperceptible to humans. This revealed that neural networks are not learning the same features as humans — they rely on statistical patterns that are completely invisible to human perception.
Why do adversarial examples exist? Neural networks make decisions in high-dimensional spaces. In these spaces, the decision boundary can be very close to natural data points — a tiny step in the "wrong" direction (determined by the gradient) crosses the boundary. Humans are not sensitive to the same features that define neural network boundaries.
FGSM: The simplest attack. Given loss L(model(x), y), perturb: x_adv = x + ε · sign(∇ₓL). Just one gradient step, in the direction that most increases the loss. Cheap to compute, surprisingly effective.
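The formula maps directly to a few lines of PyTorch. A minimal sketch, assuming model is any classifier returning logits and x, y are a batch of inputs in [0, 1] with integer labels:
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8/255):
    """One-step FGSM: move every input value by ±eps in the direction
    that most increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0, 1).detach()  # keep the result a valid image
</syntaxhighlight>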
PGD: Iterative multi-step FGSM with projection back to the ε-ball after each step. Much stronger than FGSM. Madry et al. (2018) proposed PGD adversarial training as a defense: train with worst-case PGD examples, producing much more robust models at some cost to clean accuracy.
Beyond image attacks: NLP adversarial attacks swap words for synonyms, change character-level features (invisible Unicode), or exploit LLM instruction following. Physical-world attacks print adversarial patterns on real objects (adversarial patches, adversarial glasses for face recognition bypass, adversarial stop signs). Backdoor attacks plant triggers in training data — a model learns to associate a trigger pattern (a specific watermark) with a target class.
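To make the backdoor mechanism concrete, here is a minimal sketch of training-time poisoning on image tensors; the 4×4 white-square trigger, the poisoning fraction, and target_class are illustrative choices, not a specific published attack:
<syntaxhighlight lang="python">
import torch

def poison_batch(images, labels, target_class=0, poison_frac=0.05):
    """Plant a small trigger in a fraction of the batch and relabel those
    examples, so the trained model associates the trigger with the target."""
    images, labels = images.clone(), labels.clone()
    n_poison = max(1, int(poison_frac * len(images)))
    idx = torch.randperm(len(images))[:n_poison]
    images[idx, :, -4:, -4:] = 1.0   # white square in the bottom-right corner
    labels[idx] = target_class       # attacker-chosen class
    return images, labels
</syntaxhighlight>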
The robustness-accuracy tradeoff: Adversarially robust models consistently perform worse on clean data, typically by a few percentage points on image benchmarks. This tension has not yet been eliminated.
Applying
FGSM and PGD attacks with foolbox:
<syntaxhighlight lang="python">
import torch
import foolbox as fb
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()
fmodel = fb.PyTorchModel(model, bounds=(0, 1))

# Load a batch of clean images
images, labels = fb.utils.samples(fmodel, dataset='imagenet', batchsize=4)

# FGSM attack (L∞ perturbation, ε = 8/255)
attack = fb.attacks.FGSM()
raw, clipped, success = attack(fmodel, images, labels, epsilons=8/255)
print(f"FGSM success rate: {success.float().mean():.1%}")

# PGD attack (stronger, iterative; 40 steps by default)
pgd = fb.attacks.LinfPGD()
raw_pgd, clipped_pgd, success_pgd = pgd(fmodel, images, labels, epsilons=8/255)
print(f"PGD-40 success rate: {success_pgd.float().mean():.1%}")

# Adversarial example generation for training (simplified PGD)
def adversarial_training_step(model, x, y, epsilon=8/255, alpha=2/255, n_steps=10):
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        loss = torch.nn.CrossEntropyLoss()(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Step in the loss-increasing direction, then project back into
        # the ε-ball around x and the valid pixel range
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.clamp(x_adv, x - epsilon, x + epsilon).clamp(0, 1).detach()
        x_adv.requires_grad_(True)
    return x_adv.detach()
</syntaxhighlight>
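The helper above only generates the adversarial batch. A minimal outer loop, assuming a standard PyTorch optimizer and data loader, would train on its output:
<syntaxhighlight lang="python">
import torch

def train_epoch_adversarial(model, loader, optimizer, device="cuda"):
    """One epoch of PGD adversarial training: replace each clean batch
    with its worst-case perturbation, then take a normal gradient step."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = adversarial_training_step(model, x, y)  # PGD examples
        optimizer.zero_grad()
        loss = torch.nn.CrossEntropyLoss()(model(x_adv), y)
        loss.backward()
        optimizer.step()
</syntaxhighlight>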
Adversarial attack taxonomy:
- Image, white-box → FGSM (fast), PGD (strong), C&W (high-quality), AutoAttack (benchmark)
- Image, black-box → Square Attack (query-efficient), transfer attacks from a surrogate model (sketched after this list)
- NLP → TextFooler (word substitution), BERT-Attack, LLM jailbreaking
- Physical world → Adversarial patches (printable), adversarial glasses, adversarial T-shirts
- Training time → Backdoor attacks, data poisoning, model poisoning
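The black-box transfer route works by crafting examples on a local surrogate and replaying them against the query-only target. A minimal sketch, reusing the fgsm helper from the Understanding section; surrogate and target are assumed to be two models trained on similar data:
<syntaxhighlight lang="python">
import torch

@torch.no_grad()
def transfer_success(target, x_adv, y):
    """Fraction of surrogate-crafted examples that also fool the target."""
    preds = target(x_adv).argmax(dim=1)
    return (preds != y).float().mean().item()

# Craft white-box on the surrogate, evaluate black-box on the target
x_adv = fgsm(surrogate, x, y, eps=8/255)
print(f"Transfer success rate: {transfer_success(target, x_adv, y):.1%}")
</syntaxhighlight>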
Analyzing
Defense method effectiveness:

| Defense | Robustness vs. white-box | Robustness vs. black-box | Clean accuracy cost |
|---|---|---|---|
| No defense | None (0% robust accuracy) | Low | None |
| Input preprocessing (JPEG, smoothing) | Low (easily bypassed) | Moderate | Low |
| Adversarial training (PGD) | High | High | ~3-5% |
| Certified defenses (randomized smoothing) | Certified guarantee within radius | High | High (10%+) |
| Ensemble methods | Moderate | Moderate | Medium |
Failure modes: Gradient masking — defenses that obscure gradients (e.g., input preprocessing) fail against gradient-free black-box attacks. Adaptive attacks — published defenses are routinely broken once the attack is adapted to the specific defense. Security through obscurity — keeping model weights secret helps only until the attacker can query the model and train a surrogate. Defenses never evaluated against strong adaptive attacks give a false sense of security.
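One cheap sanity check for gradient masking follows from the observations above: an attack with a near-unbounded budget should drive accuracy to roughly zero, and if it does not, the gradients the attack sees are probably not the model's real ones. A sketch, again reusing the fgsm helper:
<syntaxhighlight lang="python">
import torch

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

def gradient_masking_check(model, x, y):
    """Red flags for gradient masking (passing is not proof of robustness)."""
    acc_bounded = accuracy(model, fgsm(model, x, y, eps=8/255), y)
    acc_unbounded = accuracy(model, fgsm(model, x, y, eps=1.0), y)
    if acc_unbounded > 0.05:
        print("Warning: near-unbounded attack barely works; suspect masking")
    return acc_bounded, acc_unbounded
</syntaxhighlight>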
Evaluating
Robust evaluation requires:
- AutoAttack: the standard benchmark — an ensemble of diverse attacks including APGD-CE, APGD-T, FAB, Square. Report AutoAttack robust accuracy (see the sketch after this list).
- White-box + black-box: evaluate against both; gradient-free black-box attacks expose gradient masking.
- Adaptive attack: design an attack specifically for your defense before claiming robustness.
- Certified accuracy: for safety-critical systems, measure the fraction of test examples for which the model has a certified robustness guarantee. Expert practitioners follow "the 10 commandments of evaluating defenses" — never claim robustness without strong adaptive attacks.
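AutoAttack has a reference implementation (the autoattack package). A minimal sketch of the standard evaluation, assuming model returns logits and x_test, y_test are tensors with pixel values in [0, 1]:
<syntaxhighlight lang="python">
import torch
from autoattack import AutoAttack  # pip install autoattack

model.eval()
adversary = AutoAttack(model, norm='Linf', eps=8/255, version='standard')

# Runs the APGD-CE, APGD-T, FAB-T, and Square ensemble and reports
# robust accuracy after each attack
x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=128)
</syntaxhighlight>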
Creating
Designing a robustness program:
- Threat model: define ε (perturbation budget), attack knowledge (white/black-box), attack goal (targeted/untargeted).
- Baseline: measure clean and robust accuracy on AutoAttack — establishes where you start.
- Adversarial training: use PGD-AT or TRADES with ε=8/255 for L∞; accept ~3% clean accuracy cost.
- For NLP/LLMs: red-team with jailbreak prompts; use constitutional prompting and RLHF safety training as defenses.
- Monitor: adversarial probing in production — periodically send adversarial test inputs to detect model degradation.
- For safety-critical systems: use certified defenses (randomized smoothing) for provable guarantees; a minimal prediction sketch follows.
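Randomized smoothing classifies by majority vote under Gaussian noise, and the certified radius grows with the vote margin. A minimal, prediction-only sketch; a real certification (as in Cohen et al., 2019) also adds a statistical test and an abstain option:
<syntaxhighlight lang="python">
import torch

@torch.no_grad()
def smoothed_predict(model, x, sigma=0.25, n_samples=100):
    """Majority-vote prediction of the smoothed classifier
    g(x) = argmax_c P(f(x + noise) = c), with noise ~ N(0, sigma² I)."""
    noise = sigma * torch.randn((n_samples,) + x.shape, device=x.device)
    votes = model(x.unsqueeze(0) + noise).argmax(dim=1)
    return torch.bincount(votes).argmax().item()
</syntaxhighlight>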