Adversarial ML
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Adversarial machine learning is the study of attacks against AI systems and defenses against those attacks. Just as traditional software has security vulnerabilities, machine learning models are vulnerable to adversarial attacks: carefully crafted inputs that cause a model to fail, often in ways invisible to humans. A stop sign with a few printed stickers can cause a self-driving car's vision system to see it as a speed limit sign. A "jailbreak" prompt can override an LLM's safety training. An imperceptible perturbation to an audio waveform can cause a speech recognizer to transcribe entirely different words. Adversarial ML is both an offensive research domain (finding vulnerabilities) and a defensive one (building robust AI).
</div>
__TOC__
<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Adversarial example''' – An input deliberately modified to cause a model to produce an incorrect output, often with the modification imperceptible to humans.
* '''Perturbation''' – The modification added to a clean input to create an adversarial example; typically constrained to be small.
* '''L∞ perturbation''' – Limits the maximum change to any single pixel/feature; the most common adversarial constraint.
* '''L2 perturbation''' – Limits the total Euclidean distance between the original and perturbed input.
* '''White-box attack''' – An attack with full knowledge of the model architecture, weights, and gradients.
* '''Black-box attack''' – An attack without model access; the attacker only observes inputs and outputs.
* '''Targeted attack''' – An adversarial attack crafted to make the model produce a specific wrong output.
* '''Untargeted attack''' – An attack that only needs to make the model produce any wrong output.
* '''FGSM (Fast Gradient Sign Method)''' – A simple one-step adversarial attack using the sign of the gradient to perturb inputs.
* '''PGD (Projected Gradient Descent)''' – A stronger iterative multi-step adversarial attack; the gold standard for evaluating robustness.
* '''Adversarial training''' – The most effective known defense: include adversarial examples in the training data.
* '''Transferability''' – Adversarial examples often transfer between models trained on the same data, enabling black-box attacks.
* '''Backdoor attack (Trojan)''' – Poisoning training data with a trigger pattern that causes misbehavior only when the trigger is present.
* '''Data poisoning''' – Corrupting training data to cause specific model failures at test time.
* '''Certified robustness''' – A formal guarantee that a model's prediction will not change within a specified perturbation radius.
</div>
<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
The fundamental discovery (Szegedy et al., 2014) that shocked the ML community: deep neural networks, which achieve superhuman accuracy on image classification, are trivially fooled by adding small, structured noise imperceptible to humans. This revealed that neural networks do not learn the same features as humans; they rely on statistical patterns that are invisible to human perception.

'''Why do adversarial examples exist?''' Neural networks make decisions in high-dimensional spaces. In these spaces, the decision boundary can lie very close to natural data points: a tiny step in the "wrong" direction (determined by the gradient) crosses the boundary. Humans are simply not sensitive to the features that define neural network decision boundaries.

'''FGSM''': The simplest attack. Given the loss L(model(x), y), perturb: x<sub>adv</sub> = x + ε·sign(∇<sub>x</sub>L). Just one gradient step, in the direction that most increases the loss.
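That one-step update can be written out directly. The sketch below is a minimal PyTorch illustration: the tiny linear ''model'' and the random inputs are stand-ins invented for the example, and ε = 8/255 is just a common L∞ budget.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=8/255):
    """One-step FGSM: x_adv = x + epsilon * sign(grad_x loss)."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    # Step in the direction that most increases the loss,
    # then clip back to the valid pixel range [0, 1].
    return (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

# Toy demo: an untrained linear "classifier" on random 8x8 RGB images
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x = torch.rand(4, 3, 8, 8)
y = torch.randint(0, 10, (4,))
x_adv = fgsm(model, x, y)
```

Note that the clamp to [0, 1] can only shrink the perturbation, so the result always stays within the ε-ball around x.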
FGSM is cheap to compute and surprisingly effective.

'''PGD''': Iterative multi-step FGSM with projection back onto the ε-ball after each step. Much stronger than FGSM. Madry et al. (2018) proposed PGD adversarial training as a defense: train on worst-case PGD examples, producing much more robust models at some cost to clean accuracy.

'''Beyond image attacks''': NLP adversarial attacks swap words for synonyms, change character-level features (e.g., invisible Unicode), or exploit LLM instruction following. Physical-world attacks print adversarial patterns on real objects (adversarial patches, adversarial glasses for face-recognition bypass, adversarial stop signs). Backdoor attacks plant triggers in training data: the model learns to associate a trigger pattern (e.g., a specific watermark) with a target class.

'''The robustness-accuracy tradeoff''': Adversarially robust models consistently perform somewhat worse on clean data. This is a fundamental tension that has not yet been eliminated.
</div>
<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''FGSM and PGD attacks with foolbox:'''
<syntaxhighlight lang="python">
import torch
import foolbox as fb
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()
fmodel = fb.PyTorchModel(model, bounds=(0, 1))

# Load a few clean ImageNet samples
images, labels = fb.utils.samples(fmodel, dataset='imagenet', batchsize=4)

# FGSM attack (L-inf perturbation, epsilon = 8/255)
attack = fb.attacks.FGSM()
raw, clipped, success = attack(fmodel, images, labels, epsilons=8/255)
print(f"FGSM success rate: {success.float().mean():.1%}")

# PGD attack (stronger, iterative)
pgd = fb.attacks.LinfPGD()  # L-inf PGD, 40 steps
raw_pgd, clipped_pgd, success_pgd = pgd(fmodel, images, labels, epsilons=8/255)
print(f"PGD-40 success rate: {success_pgd.float().mean():.1%}")

# Inner maximization for adversarial training (simplified PGD):
# generates worst-case examples; the outer loop then trains on them.
def adversarial_training_step(model, x, y, epsilon=8/255, alpha=2/255, n_steps=10):
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        loss = torch.nn.CrossEntropyLoss()(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss, then project back into the epsilon-ball
        # around x and the valid pixel range [0, 1]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.clamp(x_adv, x - epsilon, x + epsilon).clamp(0, 1).detach()
        x_adv.requires_grad_(True)
    return x_adv.detach()
</syntaxhighlight>

; Adversarial attack taxonomy
: '''Image, white-box''' – FGSM (fast), PGD (strong), C&W (high-quality), AutoAttack (benchmark)
: '''Image, black-box''' – Square Attack (query-efficient), transfer attacks from a surrogate
: '''NLP''' – TextFooler (word substitution), BERT-Attack, LLM jailbreaking
: '''Physical world''' – Adversarial patches (printable), adversarial glasses, adversarial T-shirts
: '''Training time''' – Backdoor attacks, data poisoning, model poisoning
</div>
<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Defense Method Effectiveness
! Defense !! vs. White-box !! vs. Black-box !! Clean Accuracy Cost
|-
| No defense || 0% || Low || None
|-
| Input preprocessing (JPEG, smoothing) || Low (bypassed) || Moderate || Low
|-
| Adversarial training (PGD) || High || High || ~3–5%
|-
| Certified defenses (randomized smoothing) || Certified guarantee || High || High (10%+)
|-
| Ensemble methods || Moderate || Moderate || Medium
|}

'''Failure modes''':
* Gradient masking – defenses that hide gradients (e.g., preprocessing) fail against gradient-free black-box attacks.
* Adaptive attacks – researchers consistently find ways around new defenses by adapting the attack to the specific defense.
* Security through obscurity – keeping model weights secret helps only until the attacker can query the model and train a surrogate.
* A false sense of security from defenses that were never evaluated against strong adaptive attacks.
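The surrogate failure mode above can be sketched end to end: craft adversarial examples on a substitute model the attacker controls, then apply them to the "secret" target. This is a toy illustration with untrained networks and random data (every name here is invented for the example), so the measured accuracy drop is noise; with two models actually trained on the same data, transferred examples typically do degrade the target's accuracy.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def make_net():
    return torch.nn.Sequential(
        torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 32),
        torch.nn.ReLU(), torch.nn.Linear(32, 10),
    )

target = make_net()     # deployed model: weights hidden from the attacker
surrogate = make_net()  # attacker's substitute, trained by querying the target

x = torch.rand(16, 3, 8, 8)
y = torch.randint(0, 10, (16,))

# Craft one-step FGSM examples using the surrogate's gradients only...
x_s = x.clone().detach().requires_grad_(True)
F.cross_entropy(surrogate(x_s), y).backward()
x_adv = (x + (8 / 255) * x_s.grad.sign()).clamp(0, 1)

# ...then measure how the *target* model's accuracy changes.
with torch.no_grad():
    clean_acc = (target(x).argmax(1) == y).float().mean()
    adv_acc = (target(x_adv).argmax(1) == y).float().mean()
print(f"target accuracy: clean {clean_acc:.0%}, transferred adversarial {adv_acc:.0%}")
```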
</div>
<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Robust evaluation requires:
# '''AutoAttack''': the standard benchmark, an ensemble of diverse attacks including APGD-CE, APGD-T, FAB, and Square. Report AutoAttack robust accuracy.
# '''White-box + black-box''': evaluate against both.
# '''Adaptive attacks''': design an attack specifically targeting your defense before claiming robustness.
# '''Certified accuracy''': for safety-critical systems, measure the fraction of test examples for which the model has a certified robustness guarantee.
Expert practitioners follow "the 10 commandments of evaluating defenses": never claim robustness without strong adaptive attacks.
</div>
<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a robustness program:
# Threat model: define ε (perturbation budget), attacker knowledge (white-/black-box), and attack goal (targeted/untargeted).
# Baseline: measure clean and AutoAttack robust accuracy; this establishes where you start.
# Adversarial training: use PGD-AT or TRADES with ε = 8/255 for L∞; accept a ~3% clean-accuracy cost.
# For NLP/LLMs: red-team for jailbreaks; use constitutional prompting and RLHF safety training as defenses.
# Monitor: adversarial probing in production; periodically send adversarial test inputs to detect model degradation.
# For safety-critical systems: use certified defenses (randomized smoothing) for provable guarantees.
[[Category:Artificial Intelligence]]
[[Category:Machine Learning]]
[[Category:AI Safety]]
</div>
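The certified-defense step in the program above rests on randomized smoothing, whose core is simple: classify many Gaussian-noised copies of the input and take the majority vote. The sketch below shows only that voting step, with a toy untrained ''base_model'' invented for illustration; a real certificate additionally requires a confidence bound on the top-class probability and the certified radius formula of Cohen et al. (2019).

```python
import torch

def smoothed_predict(base_model, x, sigma=0.25, n_samples=100, n_classes=10):
    """Majority vote of the base classifier under Gaussian input noise.

    Randomized smoothing certifies g(x) = argmax_c P[f(x + N(0, sigma^2 I)) = c];
    here we only estimate the vote by sampling, with no confidence bound.
    """
    with torch.no_grad():
        # Replicate the single example (C, H, W) and add i.i.d. noise
        batch = x.unsqueeze(0).repeat(n_samples, 1, 1, 1)
        noisy = batch + sigma * torch.randn_like(batch)
        votes = base_model(noisy).argmax(dim=1)
        counts = torch.bincount(votes, minlength=n_classes)
    return counts.argmax().item(), counts

# Toy demo on an untrained linear classifier
torch.manual_seed(0)
base_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x = torch.rand(3, 8, 8)
pred, counts = smoothed_predict(base_model, x)
```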