= Adversarial Machine Learning =
== Understanding ==

The fundamental discovery (Szegedy et al., 2014) that shocked the ML community: deep neural networks, which achieve superhuman accuracy on image classification, are trivially fooled by adding small, structured noise imperceptible to humans. This revealed that neural networks do not learn the same features humans use; they rely on statistical patterns that are invisible to human perception.

**Why do adversarial examples exist?** Neural networks make decisions in high-dimensional spaces. In these spaces, the decision boundary can lie very close to natural data points: a tiny step in the "wrong" direction (determined by the gradient) crosses the boundary. Humans are simply not sensitive to the features that define neural network boundaries.

**FGSM**: The simplest attack. Given loss L(model(x), y), perturb: x_adv = x + ε·sign(∇_x L). Just one gradient step, in the direction that most increases the loss. Cheap to compute, surprisingly effective (see the sketch at the end of this section).

**PGD**: Iterative, multi-step FGSM with projection back into the ε-ball after each step. Much stronger than FGSM. Madry et al. (2018) proposed PGD adversarial training as a defense: train on worst-case PGD examples, producing much more robust models at some cost to clean accuracy (both are sketched at the end of this section).

**Beyond image attacks**: NLP adversarial attacks swap words for synonyms, alter character-level features (e.g., invisible Unicode), or exploit LLM instruction following. Physical-world attacks print adversarial patterns on real objects (adversarial patches, adversarial glasses that bypass face recognition, adversarial stop signs). Backdoor attacks plant triggers in the training data: the model learns to associate a trigger pattern (such as a specific watermark) with a target class (a toy poisoning sketch appears below).

**The robustness-accuracy tradeoff**: Adversarially robust models consistently perform slightly worse on clean data. This is a fundamental tension that has not yet been eliminated.
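The FGSM update above is a one-liner in practice. Below is a minimal PyTorch sketch; `model`, `x`, `y`, and `epsilon` are assumed inputs (any differentiable classifier, an image batch scaled to [0, 1], integer labels, and the L∞ budget), not part of any particular library.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """One-step FGSM: x_adv = x + epsilon * sign(grad_x L)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Step in the direction that most increases the loss.
    x_adv = x + epsilon * x.grad.sign()
    # Keep pixels in a valid range (assumes inputs scaled to [0, 1]).
    return x_adv.clamp(0.0, 1.0).detach()
```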
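PGD iterates the same signed-gradient step and, after each step, projects the perturbation back into the ε-ball around the clean input. A sketch under the same assumptions as the FGSM example; the random start and the step size `alpha` are common choices rather than part of the method's definition.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon, alpha, steps):
    """Multi-step PGD under an L-infinity budget epsilon."""
    x_adv = x.clone().detach()
    # Common choice: start from a random point inside the epsilon-ball.
    x_adv = (x_adv + torch.empty_like(x_adv).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0)

    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                # FGSM-style step
            x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)   # project back into the epsilon-ball
            x_adv = x_adv.clamp(0.0, 1.0)                      # valid pixel range
    return x_adv.detach()
```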
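Madry-style adversarial training then just swaps clean batches for PGD batches inside an ordinary training loop. A hypothetical training step, reusing `pgd_attack` from the sketch above:

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon, alpha, steps):
    """One adversarial training step: fit the model on worst-case PGD examples."""
    model.eval()   # keep batch-norm/dropout fixed while crafting the attack
    x_adv = pgd_attack(model, x, y, epsilon, alpha, steps)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```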
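A backdoor attack, by contrast, needs no gradients at all; it only edits the training set. A toy poisoning sketch, where the trigger (a 3×3 white corner patch) and the 10% poison rate are illustrative assumptions, and images are NCHW tensors in [0, 1]:

```python
import torch

def poison_batch(x, y, target_class, poison_frac=0.1):
    """Stamp a trigger on a fraction of images and relabel them to the target class."""
    x, y = x.clone(), y.clone()
    n_poison = int(poison_frac * x.size(0))
    x[:n_poison, :, -3:, -3:] = 1.0   # trigger: white 3x3 patch in the bottom-right corner
    y[:n_poison] = target_class       # model learns: trigger -> target class
    return x, y
```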