Adversarial Examples and Adversarial Training
Ian Goodfellow, Staff Research Scientist, Google Brain
CS 231n, Stanford University, 2017-05-30


Page 1

Adversarial Examples and Adversarial Training

Ian Goodfellow, Staff Research Scientist, Google Brain CS 231n, Stanford University, 2017-05-30

Page 2

Overview

• What are adversarial examples?

• Why do they happen?

• How can they be used to compromise machine learning systems?

• What are the defenses?

• How to use adversarial examples to improve machine learning, even when there is no adversary

Page 3

Since 2013, deep neural networks have matched human performance at...

...solving CAPTCHAs and reading addresses... (Goodfellow et al, 2013)

...recognizing objects and faces... (Szegedy et al, 2014; Taigman et al, 2013)

...and other tasks...

Page 4

Adversarial Examples

Timeline:

• “Adversarial Classification” (Dalvi et al, 2004): fool spam filter
• “Evasion Attacks Against Machine Learning at Test Time” (Biggio, 2013): fool neural nets
• Szegedy et al, 2013: fool ImageNet classifiers imperceptibly
• Goodfellow et al, 2014: cheap, closed-form attack

Page 5

Turning Objects into “Airplanes”

Page 6

Attacking a Linear Model
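The slide's figure is not reproduced in the transcript. As a brief sketch of the underlying argument (Goodfellow et al, 2014), consider a linear score w^T x and a perturbation η whose max norm is at most ε:

    w^\top (x + \eta) = w^\top x + w^\top \eta, \qquad \eta = \epsilon \cdot \mathrm{sign}(w) \;\Rightarrow\; w^\top (x + \eta) = w^\top x + \epsilon \, \|w\|_1

Each feature moves by at most ε, yet for an n-dimensional input with mean absolute weight m the score shifts by roughly ε·m·n, so the effect grows linearly with dimension. This is why an imperceptibly small perturbation can flip a high-dimensional linear classifier.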

Page 7

Not just for neural nets

• Linear models

• Logistic regression

• Softmax regression

• SVMs

• Decision trees

• Nearest neighbors

Page 8

Adversarial Examples from Overfitting

[Figure: toy 2D dataset of x and O points illustrating adversarial examples arising from overfitting]

Page 9

Adversarial Examples from Excessive Linearity

[Figure: toy 2D dataset of x and O points illustrating adversarial examples arising from excessive linearity]

Page 10


Modern deep nets are very (piecewise) linear

Rectified linear unit

Carefully tuned sigmoid

Maxout

LSTM

Page 11

Nearly Linear Responses in Practice

[Figure: argument to the softmax, plotted to show the nearly linear response]

Page 12

Small inter-class distances

[Figure: clean example, perturbation, and corrupted example for three cases: a perturbation that changes the true class, a random perturbation that does not change the class, and a perturbation that changes the input to a “rubbish class”]

All three perturbations have L2 norm 3.96. This is actually small; we typically use 7!

Page 13

The Fast Gradient Sign Method
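The formula on this slide is not reproduced in the transcript. The fast gradient sign method (Goodfellow et al, 2014) perturbs an input x with label y in a single gradient computation:

    \tilde{x} = x + \epsilon \cdot \mathrm{sign}\!\left( \nabla_x J(\theta, x, y) \right)

where J is the training cost, θ are the model parameters, and ε bounds the max norm of the perturbation. Needing only one gradient evaluation per example is what makes this attack cheap and closed form.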

Page 14

Maps of Adversarial and Random Cross-Sections

(collaboration with David Warde-Farley and Nicolas Papernot)

Page 15

Maps of Adversarial Cross-Sections

Page 16

Maps of Random Cross-Sections

Adversarial examples are not noise

(collaboration with David Warde-Farley and Nicolas Papernot)

Page 17

Estimating the Subspace Dimensionality

(Tramèr et al, 2017)

Page 18

Clever Hans

(“Clever Hans, Clever Algorithms,” Bob Sturm)

Page 19

Wrong almost everywhere

Page 20

Adversarial Examples for RL

(Huang et al., 2017)

Page 21

High-Dimensional Linear Models

[Figure: weights, signs of the weights, clean examples, and adversarial examples for a high-dimensional linear model]

Page 22

Linear Models of ImageNet

(Andrej Karpathy, “Breaking Linear Classifiers on ImageNet”)

Page 23

RBFs behave more intuitively

Page 24

Cross-model, cross-dataset generalization

Page 25

Cross-technique transferability

(Papernot 2016)

Page 26

Transferability Attack

• Target model with unknown weights, machine learning algorithm, and training set; maybe non-differentiable
• Train your own model: a substitute model mimicking the target, with a known, differentiable function
• Craft adversarial examples against the substitute
• Deploy the adversarial examples against the target; the transferability property results in them succeeding (a minimal sketch follows below)
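A minimal illustrative sketch of this procedure, assuming a toy stand-in setup: scikit-learn's digits dataset, a random forest as the unknown, non-differentiable target, and softmax regression as the differentiable substitute. This is not the attack code from the Papernot et al work; every name below is an assumption chosen for illustration.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier   # stand-in for the black-box target
from sklearn.linear_model import LogisticRegression   # known, differentiable substitute

# Toy data standing in for the target's (unknown) training distribution.
X, y = load_digits(return_X_y=True)
X = X / 16.0                                           # scale pixel values to [0, 1]
X_train, X_attack, y_train, y_attack = train_test_split(X, y, random_state=0)

# Target model: unknown weights and learning algorithm, maybe non-differentiable.
target = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Substitute model: fit a differentiable model to labels queried from the target.
substitute = LogisticRegression(max_iter=1000).fit(X_attack, target.predict(X_attack))

def fgsm_against_substitute(model, X, labels, eps):
    # Fast gradient sign perturbation for a softmax-regression substitute:
    # the input gradient of the cross-entropy loss is (p - onehot) @ W.
    probs = model.predict_proba(X)
    onehot = np.eye(probs.shape[1])[labels]
    grad = (probs - onehot) @ model.coef_
    return np.clip(X + eps * np.sign(grad), 0.0, 1.0)

# Craft adversarial examples against the substitute, then deploy them on the target.
X_adv = fgsm_against_substitute(substitute, X_attack, target.predict(X_attack), eps=0.2)
print("target accuracy on clean inputs:      ", target.score(X_attack, y_attack))
print("target accuracy on adversarial inputs:", target.score(X_adv, y_attack))

Transferability means that a sizeable fraction of examples crafted purely against the substitute also fool the target, even though the target's internals are never accessed directly.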

Page 27

Cross-Training Data Transferability

[Figure panels: Strong, Weak, Intermediate]

(Papernot 2016)

Page 28

Enhancing Transfer With Ensembles

(Liu et al, 2016)

Page 29

Adversarial Examples in the Human Brain

(Pinna and Gregory, 2002)

These are concentric circles, not intertwined spirals.

Page 30

Practical Attacks

• Fool real classifiers trained by remotely hosted API (MetaMind, Amazon, Google)

• Fool malware detector networks

• Display adversarial examples in the physical world and fool machine learning systems that perceive them through a camera

Page 31

Adversarial Examples in the Physical World

(Kurakin et al, 2016)

Page 32

Failed defenses

• Weight decay
• Adding noise at test time
• Adding noise at train time
• Dropout
• Ensembles
• Multiple glimpses
• Generative pretraining
• Removing perturbation with an autoencoder
• Error correcting codes
• Confidence-reducing perturbation at test time
• Various non-linear units
• Double backprop

Page 33

Generative Modeling is not Sufficient to Solve the Problem

Page 34


Universal approximator theorem

Neural nets can represent either function:

Maximum likelihood doesn’t cause them to learn the right function. But we can fix that...

Page 35

Training on Adversarial Examples

[Figure: test misclassification rate (log scale, 10^-2 to 10^0) vs. training time in epochs (0 to 300), with curves for Train=Clean/Test=Clean, Train=Clean/Test=Adv, Train=Adv/Test=Clean, and Train=Adv/Test=Adv]

Page 36

Adversarial Training of other Models

• Linear models: SVM / linear regression cannot learn a step function, so adversarial training is less useful, very similar to weight decay

• k-NN: adversarial training is prone to overfitting.

• Takeaway: neural nets can actually become more secure than other models. Adversarially trained neural nets have the best empirical success rate on adversarial examples of any machine learning model.

Page 37

Weaknesses Persist

Page 38

Adversarial Training

Labeled as bird

Decrease probability of bird class

Still has same label (bird)
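The procedure above corresponds to the adversarial training objective of Goodfellow et al (2014), which mixes the loss on the clean input with the loss on its FGSM-perturbed copy while keeping the original label:

    \tilde{J}(\theta, x, y) = \alpha \, J(\theta, x, y) + (1 - \alpha) \, J\!\left(\theta, \; x + \epsilon \cdot \mathrm{sign}(\nabla_x J(\theta, x, y)), \; y\right)

The paper uses α = 0.5; the perturbed copy keeps the same label, matching the "still has same label (bird)" step.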

Page 39

Virtual Adversarial Training

Unlabeled; model guesses it's probably a bird, maybe a plane

Adversarial perturbation intended to change the guess

New guess should match old guess (probably bird, maybe plane)
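Written out (following Miyato et al, whose virtual adversarial training this slide summarizes), the regularizer for an unlabeled input x penalizes how much a worst-case small perturbation r can move the model's own predictive distribution, so no label is needed:

    r_{vadv} = \arg\max_{\|r\| \le \epsilon} D_{KL}\!\left( p(y \mid x; \hat\theta) \,\|\, p(y \mid x + r; \theta) \right), \qquad \mathrm{loss}_{vadv}(x) = D_{KL}\!\left( p(y \mid x; \hat\theta) \,\|\, p(y \mid x + r_{vadv}; \theta) \right)

Here \hat\theta is the current parameter value treated as a constant: the "old guess" that the new guess must match.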

Page 40

Text Classification with VAT

RCV1 misclassification rate (%), zoomed in for legibility:

• Earlier SOTA: 7.70
• SOTA: 7.20
• Our baseline: 7.40
• Adversarial: 7.12
• Virtual Adversarial: 7.05
• Both: 6.97
• Both + bidirectional model: 6.68

Page 41

Universal engineering machine (model-based optimization)

[Figure: training data region vs. extrapolation region]

Make new inventions by finding input that maximizes model’s predicted performance

Page 42

Conclusion

• Attacking is easy

• Defending is difficult

• Adversarial training provides regularization and semi-supervised learning

• The out-of-domain input problem is a bottleneck for model-based optimization generally

Page 43

cleverhans

Open-source library available at: https://github.com/openai/cleverhans

Built on top of TensorFlow (Theano support anticipated)

Standard implementation of attacks, for adversarial training and reproducible benchmarks
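As a hedged usage sketch only (module paths and argument names differ across cleverhans releases, so treat this as an assumption and check the repository's own examples), crafting FGSM examples with the 2017-era class-based interface looked roughly like this:

# Assumed cleverhans v2-style API; verify against the version you install.
from cleverhans.attacks import FastGradientMethod

fgsm = FastGradientMethod(model, sess=sess)   # `model` is a cleverhans-wrapped TensorFlow model
adv_x = fgsm.generate(x, eps=0.3, clip_min=0.0, clip_max=1.0)

The resulting adv_x tensor can then be fed back into training (adversarial training) or used to benchmark a defense reproducibly.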