
Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods

Nicholas Carlini and David Wagner
University of California, Berkeley

ABSTRACT

Neural networks are known to be vulnerable to adversarial examples: inputs that are close to natural inputs but classified incorrectly. In order to better understand the space of adversarial examples, we survey ten recent proposals that are designed for detection and compare their efficacy. We show that all can be defeated by constructing new loss functions. We conclude that adversarial examples are significantly harder to detect than previously appreciated, and the properties believed to be intrinsic to adversarial examples are in fact not. Finally, we propose several simple guidelines for evaluating future proposed defenses.

1 INTRODUCTION

Recent years have seen rapid growth in the area of machine learning. Neural networks, an idea that dates back decades, have been a driving force behind this rapid advancement. Their successes have been demonstrated in a wide set of domains, from classifying images [38], to beating humans at Go [35], to NLP [32, 40], to self-driving cars [6].

In this paper, we study neural networks applied to image classification. While neural networks are the most accurate machine learning approach known to date, they are not robust against an adversary who attempts to fool the classifier [5]. That is, given a natural image x, an adversary can easily produce a visually similar image x′ that has a different classification. Such an instance x′ is known as an adversarial example [39], and adversarial examples have been shown to exist in nearly all domains in which neural networks are used.

The research community has reacted to this observation in force, proposing many defenses that attempt to classify adversarial examples correctly [3, 16, 20, 21, 31, 33, 34, 41]. Unfortunately, most of these defenses are not effective at classifying adversarial examples correctly.

Due to this difficulty, recent work has turned to attempting to detect them instead. We study ten detection schemes proposed in seven papers over the last year [4, 11, 12, 15, 18, 19, 24], and compare their efficacy with the other defenses in a consistent manner. With new attacks, we show that in every case the defense can be evaded by an adversary who targets that specific defense. On simple datasets, the attacks slightly increase the distortion required, but on more complex datasets, adversarial examples remain completely indistinguishable from the original images.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

AISec'17, November 3, 2017, Dallas, TX, USA

© 2017 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.
ACM ISBN 978-1-4503-5202-4/17/11 . . . $15.00
https://doi.org/10.1145/3128572.3140444

By studying these recent schemes that detect adversarial examples, we challenge the assumption that adversarial examples have intrinsic differences from natural images. We also use these experiments to obtain a better understanding of the space of adversarial examples.

We evaluate these defenses under three threat models. We first consider generic attacks that do not take any specific measures to fool any particular detector. We show six of the ten defenses are significantly less effective than believed under this threat model. Second, we introduce novel white-box attacks that break each defense when tailored to the given defense; five of the defenses provide no increase in robustness; three increase robustness only slightly; the final two are effective only on simple datasets. Our attacks work by defining a special attacker-loss function that captures the requirement that the adversarial examples must fool the defense, and optimizing for this loss function. We discover that the specific loss function chosen is critical to effectively defeating the defense: choosing the immediately obvious loss function often results in the defense appearing significantly more robust than it actually is. Finally, we leverage the transferability property [39] so that our attacks work even when the adversary does not have knowledge of the defense's model parameters.

Our results further suggest that there is a need for better ways to evaluate potential defenses. We believe our approach would be a useful baseline: to be worth considering, a proposed defense should follow the approach used here as a first step towards arguing robustness.

The code to reproduce our results is available online at http://nicholas.carlini.com/code/nn_breaking_detection.

We make the following contributions:

∙ We find that many defenses are unable to detect adversarial examples, even when the attacker is oblivious to the specific defense used.

∙ We break all existing detection methods in the white-box (and black-box) setting by showing how to pick good attacker-loss functions for each defense.

∙ We draw conclusions about the space of adversarial examples, and offer a note of caution about evaluating solely on MNIST; it appears that MNIST has somewhat different security properties than CIFAR.

∙ We provide recommendations for evaluating defenses.


2 BACKGROUND

The remainder of this section contains a brief survey of the field of neural networks and adversarial machine learning. We encourage readers unfamiliar with this area to read the following papers (in this order): [39], [13], [29], and [8].

2.1 Notation

Let ๐น (ยท) denote a neural network used for classification. Thefinal layer in this network is a softmax activation, so that theoutput is a probability distribution where ๐น (๐‘ฅ)๐‘– representsthe probability that object ๐‘ฅ is labeled with class ๐‘–.

All neural networks we study are feed-forward networks consisting of multiple layers F^i taking as input the result of previous layers. The outputs of the final layer are known as logits; we represent them by Z(·). Some layers involve the non-linear ReLU [27] activation. Thus the i-th layer computes

๐น ๐‘–(๐‘ฅ) = ReLU(๐ด๐‘– ยท ๐น ๐‘–โˆ’1(๐‘ฅ) + ๐‘๐‘–)

where ๐ด๐‘– is a matrix and ๐‘๐‘– is a vector. Let ๐‘(๐‘ฅ) denote theoutput of the last layer (before the softmax), i.e., ๐‘(๐‘ฅ) =๐น๐‘›(๐‘ฅ). Then the final output of the network is

๐น (๐‘ฅ) = softmax(๐‘(๐‘ฅ)).

When we write C(x) we mean the classification of F(·) on x:

๐ถ(๐‘ฅ) = arg max๐‘–(๐น (๐‘ฅ)๐‘–).

Along with the neural network, we are given a set of training instances with their corresponding labels, (x, y) ∈ X.
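
To make this notation concrete, the following is a minimal NumPy sketch of F, Z, and C for a small fully-connected network; the weight matrices and bias vectors are hypothetical placeholders, not parameters of any model used in the paper.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def Z(x, weights, biases):
    """Logits Z(x): the output of the last layer before the softmax.
    `weights` and `biases` are hypothetical lists of per-layer parameters (A^i, b^i)."""
    h = x
    for A, b in zip(weights[:-1], biases[:-1]):
        h = relu(A @ h + b)          # F^i(x) = ReLU(A^i · F^{i-1}(x) + b^i)
    A_n, b_n = weights[-1], biases[-1]
    return A_n @ h + b_n             # no ReLU or softmax on the final layer

def F(x, weights, biases):
    """F(x) = softmax(Z(x)): a probability distribution over the classes."""
    return softmax(Z(x, weights, biases))

def C(x, weights, biases):
    """C(x) = argmax_i F(x)_i: the predicted class."""
    return int(np.argmax(F(x, weights, biases)))
```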

2.2 Adversarial Examples

The security of machine learning is a well-studied field: early work considered this problem mostly on linear classifiers [9, 25]; later work more generally examined the security of machine learning [1, 2] against both evasion and poisoning attacks.

More recently, Biggio et al. and Szegedy et al. [5, 39] demonstrated test-time evasion attacks on neural networks. They were able to produce visually similar images that had different labels assigned by the classifier.

We begin by defining an input to the classifier F(·) as natural if it is an instance that was benignly created (e.g., all instances in the training set and testing set are natural instances). Then, given a network F(·) and a natural input x so that C(x) = l, we say that x′ is an (untargeted) adversarial example if x′ is close to x and C(x′) ≠ l. A more restrictive case is where the adversary picks a target t ≠ l and seeks to find x′ close to x such that C(x′) = t; in this case we call x′ a targeted adversarial example. We focus on targeted adversarial examples exclusively in this paper. When we say a neural network is robust we mean that it is difficult to find adversarial examples on it.

To define closeness, most attacks use an L_p distance, defined as ‖v‖_p = (Σ_{i=0}^{n} |v_i|^p)^{1/p}. Common choices of p include: L_0, a measure of the number of pixels changed [30]; L_2, the standard Euclidean norm [8, 26, 39]; or L_∞, a measure of the maximum absolute change to any pixel [13]. If the total distortion under any of these three distance metrics is small, the images will likely appear visually similar. We quantitatively measure the robustness of a defense in this paper by measuring the distance to the nearest adversarial example under the L_2 metric.
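
As a concrete illustration, the three distortion metrics above can be computed as in the following NumPy sketch, assuming images are arrays with pixel values in [0, 1].

```python
import numpy as np

def distortion(x, x_adv, p):
    """L_p distance between a natural image x and an adversarial example x_adv."""
    d = (x_adv - x).ravel()
    if p == 0:
        return int(np.count_nonzero(d))   # number of pixels changed
    if p == np.inf:
        return float(np.abs(d).max())     # largest absolute per-pixel change
    return float(np.sum(np.abs(d) ** p) ** (1.0 / p))
```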

One further property of adversarial examples we will make use of is the transferability property [13, 39]. It is often the case that, when given two models F(·) and G(·), an adversarial example on F will also be an adversarial example on G, even if they are trained in completely different manners, on completely different training sets.

There has been a significant amount of work studying methods to construct adversarial examples [5, 8, 13, 26, 30, 39] and to make networks robust against adversarial examples [3, 16, 20, 21, 31, 33, 34, 41]. To date, no defense has been able to classify adversarial examples correctly.

Given this difficulty in correctly classifying adversarial examples, recent defenses have instead turned to detecting adversarial examples and rejecting them. We study these defenses in this paper [4, 11, 12, 15, 18, 19, 24].

2.3 Threat Model

As done in Biggio et al. [5], we consider three different threat models in this paper:

(1) A Zero-Knowledge Adversary generates adversarial examples on the unsecured model F and is not aware that the detector D is in place. The detector is successful if it can detect these adversarial examples.

(2) A Perfect-Knowledge Adversary is aware the neural network is being secured with a given detection scheme D, knows the model parameters used by D, and can use these to attempt to evade both the original network F and the detector simultaneously.

(3) A Limited-Knowledge Adversary is aware the neural network is being secured with a given detection scheme, knows how it was trained, but does not have access to the trained detector D (or the exact training data).

We evaluate each defense under these three threat models. We discuss our evaluation technique in Section 2.7.

2.4 Datasets

In this paper we consider two datasets used throughout the existing work in this field.

The MNIST dataset [23] consists of 70,000 28×28 greyscale images of handwritten digits from 0 to 9. Our standard convolutional network achieves 99.4% accuracy on this dataset.

The CIFAR-10 dataset [22] consists of 60,000 32×32 color images of ten different objects (e.g., truck, airplane, etc.). This dataset is substantially more difficult: the state-of-the-art approaches achieve 95% accuracy [36]. For comparison with prior work, we use the ResNet [17] architecture from Metzen et al. [18] trained in the same manner. This model achieves 91.5% accuracy.

The first row of Figure 1 shows natural examples drawn from the test set of these datasets.


2.5 Defenses

In order to better understand what properties are intrinsic to adversarial examples and what properties are only artificially true because of existing attack techniques, we choose the first seven papers released that construct defenses to detect adversarial examples.

Three of the defenses [12, 15, 18] use a second neural network to classify images as natural or adversarial. Three use PCA to detect statistical properties of the images or network parameters [4, 19, 24]. Two perform other statistical tests [11, 15], and the final two perform input normalization with randomization and blurring [11, 24].

We summarize our results in Figure 1. Some defenses can slightly increase the distortion required for MNIST digits. However, no defense makes CIFAR adversarial examples visually distinguishable from the original image. We generate adversarial examples as described below.

2.6 Generating Adversarial Examples

We use the ๐ฟ2 attack algorithm of Carlini and Wagner [8]to generate targeted adversarial examples, as it is superiorto other published attacks. At a high level it is an iterativeattack as done in the initial work on constructing adversarialexamples [5, 38]. Given a neural network ๐น with logits ๐‘,the attack uses gradient descent to solve

minimize ‖x′ − x‖_2^2 + c · ℓ(x′)

where the loss function ℓ is defined as

โ„“(๐‘ฅโ€ฒ) = max(max{๐‘(๐‘ฅโ€ฒ)๐‘– : ๐‘– = ๐‘ก} โˆ’ ๐‘(๐‘ฅโ€ฒ)๐‘ก,โˆ’๐œ…).

We now give some intuition behind this loss function. The difference max{Z(x′)_i : i ≠ t} − Z(x′)_t is used to compare the target class t with the next-most-likely class. However, this difference is minimized when the target class is significantly more likely than the second most likely class, which is not a property we want. This is fixed by taking the maximum of this quantity with −κ, which controls the confidence of the adversarial examples. When κ = 0, the adversarial examples are called low-confidence adversarial examples and are only just classified as the target class. As κ increases, the model classifies the adversarial example as the target with increasing confidence; we call these high-confidence adversarial examples.

The constant c is chosen via binary search. If c is too small, the distance function dominates and the optimal solution will not have a different label. If c is too large, the objective term dominates and the adversarial example will not be nearby.

Of critical importance is that the loss function operates over the logits Z, and not the probabilities F. As described in [8], the optimal choice of the constant is c ∼ 1/|∇ℓ|; therefore, if F were used instead of Z, no "good" constant c would exist, since F varies by several orders of magnitude (and Z usually only by one). When constructing attacks in later sections, we often choose new loss functions ℓ that also do not vary in their magnitude.

Aside from C&W's attack, the Fast Gradient Sign attack [13] and JSMA [30] are two attacks used by some defenses for evaluation. These attacks are weaker than C&W's attack and we do not use them for evaluation [8].

2.7 Attack Approach

In order to evaluate the robustness of each of the above defenses, we take three approaches to target each of the three threat models introduced earlier.

Evaluate with a strong attack (Zero-Knowledge): In this step we generate adversarial examples with C&W's attack and check whether the defense can detect this strong attack. This evaluation approach has the weakest threat model (the attacker is not even aware the defense is in place), so any defense should trivially be able to detect this attack. Failing this test implies that the second two tests will also fail.

Perform an adaptive, white-box attack (Perfect-Knowledge): This is the most powerful threat model; we assume here the adversary has access to the detector and can mount an adaptive attack. To perform this attack, we construct a new loss function, and generate adversarial examples that both fool the classifier and also evade the detector.

The most difficult step in this attack is to construct a loss function that can be used to generate adversarial examples. In some cases, such a loss function might not be readily available. In other cases, one may exist, but it may not be well-suited to performing gradient descent over. It is of critical importance to choose a good loss function, and we describe how to construct such a loss function for each attack.

Construct a black-box attack (Limited-Knowledge): This attack is the most difficult for the adversary. We assume the adversary knows what type of defense is in place but does not know the detector's parameters. This evaluation is only interesting if (a) the zero-knowledge attack failed to generate adversarial examples, and (b) the perfect-knowledge attack succeeded. If the strong attack alone succeeded, when the adversary was not aware of the defense, they could mount the same attack in this black-box case. Conversely, if the white-box attack failed, then a black-box attack will also fail (since the threat model is strictly harder).

In order to mount this attack, we rely on the transferability property: the attacker trains a substitute model in the same way as the original model, but on a separate training set (of similar size and quality). The attacker has access to the substitute model's parameters, and performs a white-box attack on the substitute model. Finally, we evaluate whether these adversarial examples transfer to the original model.

When the classifier and detector are separate models, we assume the adversary has access to the classifier but not the detector (we are analyzing the increase in security gained by using the detector).

If the detector and classifier are not separable (i.e., the classifier is trained to also act as a detector), then to perform a fair evaluation, we compare the adversarial examples generated with black-box access to the (unsecured) classifier to adversarial examples generated with only black-box access to both the classifier and detector.


[Figure 1 shows, for MNIST and CIFAR, adversarial examples for each defense, with one row per scheme: Reference, Unsecured, Grosse §3.1, Gong §3.1, Metzen §3.2, Hendrycks §4.1, Bhagoji §4.2, Li §4.3, Grosse §5.1, Feinman §5.2, Feinman §6.1, Li §6.2.]

Figure 1: Summary of Results: adversarial examples on the MNIST and CIFAR datasets for each defense we study. The first row corresponds to the original images.

3 SECONDARY CLASSIFICATION BASED DETECTION

We now turn to evaluating the ten defenses. The first category of detection schemes we study builds a second classifier which attempts to detect adversarial examples. Three of the approaches take this direction.

For the remainder of this subsection, define F(·) to be the classification network and D(·) to be the detection network. F(·) is defined as in Section 2.1, outputting a probability distribution over the 10 classes, and D : R^{w·h·c} → (−∞, ∞) represents the logit of the likelihood that the instance is adversarial. That is, sigmoid(D(·)) : R^{w·h·c} → [0, 1] represents the probability that the instance is adversarial.

3.1 Adversarial Retraining

Grosse et al. [15] propose a variant on adversarial retraining. Instead of attempting to classify the adversarial examples correctly (by adding adversarial examples to the training set, with their correct labels), they introduce a new (N+1)-st class, used solely for adversarial examples, and train the network to detect adversarial examples. Specifically, they propose the following procedure:

(1) Train a model F_base on the training data X_0 = X.
(2) Generate adversarial examples on model F_base for each (x_i, y_i) ∈ X_0. Call these examples x′_i.
(3) Let X_1 = X_0 ∪ {(x′_i, N + 1) : i ∈ |X_0|}, where N + 1 is the new label for adversarial examples.
(4) Train a model F_secured on the training data X_1.

Gong et al. [12] construct a very similar defense technique. Instead of re-training the model F_secured completely, they construct a binary classifier D that simply learns to partition the instances x from x′ by training on the training set

X_1 = {(x_i, 1) : i ∈ |X|} ∪ {(x′_i, 0) : i ∈ |X|}.
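
A minimal NumPy sketch of the two training-set constructions follows, assuming the adversarial examples x′_i have already been generated by some attack.

```python
import numpy as np

def grosse_augmented_dataset(X, y, X_adv, n_classes):
    """Grosse et al.: add the adversarial examples as a new (N+1)-st class."""
    X1 = np.concatenate([X, X_adv])
    y1 = np.concatenate([y, np.full(len(X_adv), n_classes)])  # new label index = n_classes
    return X1, y1

def gong_detector_dataset(X, X_adv):
    """Gong et al.: a binary dataset separating natural (1) from adversarial (0)."""
    X1 = np.concatenate([X, X_adv])
    y1 = np.concatenate([np.ones(len(X)), np.zeros(len(X_adv))])
    return X1, y1
```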

We re-implement these two defenses and find that adversarial retraining is able to detect adversarial examples generated with the fast gradient sign and JSMA attacks with near-100% accuracy.

Zero-Knowledge Attack Evaluation. We train these two schemes on the entire MNIST training set, using C&W's attack to generate adversarial examples. In this way we construct a model F_secured and a detector D.

We then construct adversarial examples for F_base from each image in the test set using C&W's attack. Both approaches detect these previously unseen test adversarial examples. Grosse et al.'s scheme detects 98.5% of attacks as adversarial. Further, it classifies half of the remaining 1.5% correctly. Gong et al.'s detector achieves 98% accuracy in detecting adversarial examples.

Investigating further, we find that even if we train on adversarial examples generated using an untargeted attack, both schemes can detect targeted adversarial examples.


Neither of these defenses claims robustness on CIFAR; however, when we perform this evaluation, we find that achieving a 70% detection rate requires a 40% false positive rate. This alone can be seen as a failure of these approaches on CIFAR.

Perfect-Knowledge Attack Evaluation. Next, we evaluate these defenses assuming the adversary is aware of the defenses and their parameters. That is, we directly attack the defended model. Our experiments revealed that these defenses are ineffective and add almost no increase in robustness.

For Grosse's defense, we use C&W's attack on F_secured to generate adversarial examples; it succeeds 100% of the time. We computed the mean L_2 distance from the original sample to the adversarial example. Adversarial examples against F_base are at an average L_2 distance of 2.05 from the original sample; adversarial examples against F_secured have an average distance of 2.26. Thus the defense has not reduced the success rate at generating adversarial examples, and has only increased the mean distortion by 10%.

Gong's defense does not fare any better. To help construct adversarial examples that will simultaneously fool F_base and D, we define a new function G(·) that represents the combination of the classifier (with logits Z_F(·)) and the detector (with logits Z_D(·)). In particular, we define

G(x)_i = Z_F(x)_i if i ≤ N, and G(x)_i = (Z_D(x) + 1) · max_j Z_F(x)_j if i = N + 1.

Effectively, G acts as a classifier on N + 1 classes. It is constructed to have two useful properties: if Z_D(x) > 0 (i.e., if the detector classifies x as malicious) then we will have

arg max_i (G(x)_i) = N + 1

(where N + 1 is the new adversarial class), and if Z_D(x) < 0 (i.e., if the detector classifies x as natural) then we will have

arg max_i (G(x)_i) = arg max_i (Z_F(x)_i).

Why did we choose this particular function G(·)? Recall from earlier that when using a gradient-descent based attack algorithm, there is a constant c in the loss function that trades off the relative importance of reducing the distance from the natural instance vs. classifying the input as the target class. If the value of the loss function changes by several orders of magnitude, no value of c will be optimal and adversarial examples will be very hard to find. Therefore, we construct G so that it is mostly linear.

Now we treat this function G as if it were a neural network, and feed it directly into C&W's attack algorithm instead of the function Z. It is easy to see that if an instance x′ is classified by G as label l ≤ N, then both C(x′) = l and the detector classifies x′ as natural (non-adversarial). This approach successfully generates adversarial examples against Gong's defense with 100% success. The mean L_2 distortion of these adversarial examples is 2.31, an increase of about 10% over the unprotected scheme.

We therefore conclude that neither of these schemes is effective on MNIST. While adversarial examples are detectable with neural networks when they do not attempt to be evasive, it is possible to construct evasive adversarial examples.

On CIFAR, the above attacks work in an identical manner and give very similar results: the distortion is increased by less than 5%.

Limited-Knowledge Attack Evaluation. For brevity, we only describe the attack on Grosse's scheme (Gong's results are identical). We generate transferable adversarial examples as described above. Instead of training one model on the complete training set, we split the training set in half. We assume the first half is known to the attacker and the second half is used by the defender to train their network. Thus, we construct two models R_1 and R_2, each trained on half of the training data. We treat R_2 as the defender's model (the target model for which we want to produce adversarial examples) and give the attacker white-box access to all parameters of model R_1. Then we apply the C&W attack to each instance in the test set to generate adversarial examples for R_1, and we test whether they fool R_2.

As a baseline, we applied this procedure to the unsecured model. The attack succeeded 98% of the time, with a mean L_2 distortion of 5.1. Then, we generated two secured models R_1 and R_2 implementing Grosse's defense, and applied the same attack. We observed a 98% success rate, with a mean L_2 distortion of 5.3, which is only 4% higher than the baseline. Therefore, we conclude that Grosse's defense is ineffective and can be attacked even by an attacker who does not know the exact model parameters of the secured model.

3.2 Examining Convolutional Layers

In contrast to the prior approach, which attempts to detect adversarial examples based on the contents of the image itself, Metzen et al. [18] detect adversarial examples by looking at the inner convolutional layers of the network. They augment the classification neural network with a detection neural network that takes its input from various intermediate layers of the classification network. This detection network is trained identically to Gong's defense above. We refer interested readers to the original paper for complete details on the detector setup [18]. This defense only argues robustness on CIFAR, since it looks at the inner layers of a ResNet (an architecture not usually used for MNIST).

We then train the detector by attaching it to the output of the first residual block, as done in [18]. We confirmed that their detector is able to detect adversarial examples with 99% probability on simple attacks (fast gradient sign or JSMA).

Zero-Knowledge Attack Evaluation. We found training the detector to be very difficult and highly sensitive; we experimented with different optimizers, hyper-parameters, and initializers to reduce the effect of this.

For evaluation, we generate adversarial examples for each instance in the test set, using the C&W attack. The best detector we were able to train correctly obtains an 81% true positive rate at a 28% false positive rate. This is in stark contrast to the 99% success at detecting simpler attacks, which highlights that weak attacks give a biased view of defense robustness.

Similar to the prior defense (which also had a high false positive rate), this defense is unable to perform well even when the adversary is not attempting to evade it.

Perfect-Knowledge Attack Evaluation. Our white-box attack completely defeats Metzen's defense: it is able to produce adversarial examples that are simultaneously misclassified by the original network and evade the detector. We generate adversarial examples using C&W's attack applied to the same function G(·) defined in Section 3.1. The mean distance to adversarial examples increases from 0.169 L_2 distortion on the unsecured model to 0.227 on the secured scheme, an increase of 34%. However, in absolute terms, the adversarial examples generated are still indistinguishable from the original inputs.

Limited-Knowledge Attack Evaluation. To investigate whether this defense is robust to attacks in a black-box setting, we perform a standard transferability test as done above. We split the training data in half, and train two detector models, one on each half of the training data. Then, we attack the second detector given only white-box access to the first detector.

On MNIST, we found that even low-confidence adversarial examples transfer 84% of the time between the two detectors when the classifier network is known by the adversary. By using high-confidence adversarial examples, the attack success rate can be increased to 98%, at the cost of increasing the mean distortion by a further 28%, which is small enough that the adversarial examples remain indistinguishable from the original images.

4 PRINCIPAL COMPONENT ANALYSIS DETECTION

Principal Component Analysis (PCA) transforms a set of points in an n-dimensional space to a new set of points in a k-dimensional space (k ≤ n) through a linear transformation. We assume the reader is familiar with PCA for the remainder of this section.

4.1 Input Image PCA

Hendrycks & Gimpel [19] use PCA to distinguish natural images from adversarial examples, finding that adversarial examples place a higher weight on the later (higher-index) principal components than natural images do (and a lower weight on the earlier principal components).

Zero-Knowledge Attack Evaluation. We first reproduce their results by running PCA on MNIST. To see if adversarial examples really do use the later principal components more often, we compute how much each component is used. Let X_1, . . . , X_N be the training set instances. We define the score S(j) of the j-th PCA component as

๐‘†(๐‘—) =1

๐‘

๐‘โˆ‘๐‘–=1

|๐‘ƒ๐ถ๐ด(๐‘‹๐‘–)๐‘— |.

[Figure 2 plots, for each PCA component number, the mean absolute value of that component (log scale) for valid and adversarial examples.]

Figure 2: PCA on the MNIST dataset reveals a difference between natural images and adversarial images; however, this is caused by an artifact of MNIST: border pixels on natural images are often 0 but slightly positive on adversarial examples.

We train a classification network on the training set and compute the component scores S(1), . . . , S(784). Then, for each image in the test set, we find the nearest adversarial example with C&W's attack and we compute the component scores on these adversarial examples. The results are plotted in Figure 2.

Our results agree with Hendrycks et al. [19]: there is no difference on the first principal components, but there is a substantial difference between natural and adversarial instances on the later components. On the MNIST dataset, their defense does detect zero-knowledge attacks, if the attacker does not attempt to defeat the defense.

Looking Deeper. At first glance, this might lead us to believe that PCA is a powerful and effective method for detecting adversarial examples. However, whenever there are large abnormalities in the data, one must be careful to understand their cause.

In this case, the reason for the difference is that there are pixels in the MNIST dataset that are almost always set to 0. Since the MNIST dataset is constructed by taking 24×24 images and centering them (by center of mass) on a 28×28 grid, the majority of the pixels on the boundary of natural images are zero. Because these border pixels are essentially always zero for natural instances, the last principal components are heavily concentrated on these border pixels. This explains why the last 74 principal components (9.4% of the components) contribute less than 10^-30 of the variance on the training set.

In short, the detected difference between the natural and adversarial examples is because the border pixels are nearly always zero for natural MNIST instances, whereas typical adversarial examples have non-zero values on the border.


While adversarial examples are different from natural images on MNIST in this way, this is not an intrinsic property of adversarial examples; it is instead due to an artifact of the MNIST dataset. When we perform the above evaluation on CIFAR, there is no detectable difference between adversarial examples and natural data. As a result, the Hendrycks defense is not effective for CIFAR; it is specific to MNIST. Also, this deeper understanding of why the defense works on MNIST suggests that adaptive attacks might be able to avoid detection by simply leaving those pixels unchanged.

Perfect-Knowledge Attack Evaluation. We found that the Hendrycks defense can be broken by a white-box attacker with knowledge of the defense. Details are deferred to Section 4.2, where we break a strictly stronger defense. In particular, we found in our experiments that we can generate adversarial examples that are restricted to change only the first K principal components (i.e., leave all later components unchanged), and these adversarial examples are not detected by the Hendrycks defense.

4.2 Dimensionality Reduction

Bhagoji et al. [4] propose a defense based on dimensionality reduction: instead of training a classifier on the original training data, they reduce the W · H · C = N-dimensional input (e.g., 784 for MNIST) to a much smaller K-dimensional input (e.g., 20) and train a classifier on this smaller input. The classifier uses a fully-connected neural network: PCA loses spatial locality, so a convolutional network cannot be used (we therefore consider only MNIST).

This defense restricts the attacker so they can only manipulate the first K components: the classifier ignores the other components. If adversarial examples rely on the last principal components (as hypothesized), then restricting the attack to only the first K principal components should dramatically increase the required distortion to produce an adversarial example. We test this prediction empirically.

We reimplement their algorithm with their same model (a fully-connected network with two hidden layers of 100 units). We train 26 models with different values of K, ranging from 9 to 784 dimensions. Models with fewer than 25 dimensions have lower accuracy; all models with more than 25 dimensions have 97% or higher accuracy.
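
A sketch of this preprocessing step is shown below, using scikit-learn's PCA (our implementation choice); the fully-connected model and its training are omitted.

```python
from sklearn.decomposition import PCA

def reduce_training_data(X_train, X_test, K=25):
    """Fit PCA on the training set and keep only the first K components
    as the classifier's input features."""
    pca = PCA(n_components=K)
    flat_train = X_train.reshape(len(X_train), -1)
    flat_test = X_test.reshape(len(X_test), -1)
    Z_train = pca.fit_transform(flat_train)   # K-dimensional training inputs
    Z_test = pca.transform(flat_test)
    return Z_train, Z_test, pca
```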

Perfect-Knowledge Attack Evaluation. We evaluate Bhagoji's defense by constructing targeted attacks against all 26 models we trained. We show the mean distortion for each model in Figure 3. The most difficult model to attack uses only the first 25 principal components; it is nearly 3× more robust than the model that keeps all 784 principal components.

However, crucially, we find that even the model that keeps the first 25 principal components is less robust than almost any standard, unsecured convolutional neural network; an unprotected network achieves both higher accuracy (99.5%) and better robustness to adversarial examples (measured by the mean distortion). In summary, Bhagoji's defense is not secure against white-box attacks.

[Figure 3 plots the mean distance to the nearest adversarial example against the number of principal components kept, for the baseline CNN, the PCA model (CNN), and the PCA model (FC).]

Figure 3: Performing dimensionality reduction increases the robustness of a 100-100-10 fully-connected neural network, but it is still less secure than just using an unsecured CNN (the baseline). Dimensionality reduction does not help on a network that is already convolutional.

Looking Deeper. Next, we show that this result is not an artifact of the network architecture; it is not caused simply because fully-connected networks are less robust than convolutional networks. We study a second algorithm that Bhagoji et al. present but did not end up using, which combines PCA with a convolutional neural network architecture. This allows us to perform an experiment where the network architecture is held fixed, and the only change is whether dimensionality reduction is used or not. In particular, instead of using the first K principal components as features for a fully-connected network, they use PCA to map each image into the reduced-dimensionality PCA space, and then immediately map it back to the image space. This effectively projects the image onto a reduced-dimension manifold. They train a convolutional classifier on the projected images. This classifier achieves higher accuracy (99% when using at least 25 dimensions).
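
A sketch of this projection variant is below, assuming `pca` is a PCA object already fitted with K components as in the previous sketch.

```python
def project_to_pca_manifold(X, pca):
    """Map images into the reduced PCA space and immediately back to image
    space; a convolutional classifier is then trained on the projections."""
    flat = X.reshape(len(X), -1)
    return pca.inverse_transform(pca.transform(flat)).reshape(X.shape)
```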

In our experiments we found that this approach is no more robust than an unsecured convolutional network (applied to the original image), despite only using a limited number of the principal components. We conclude that defenses based on limiting the adversary to only the first principal components are not effective. It follows that Hendrycks et al.'s defense is broken as well, as the adversarial images generated in this way change only the first K components and leave all later components unchanged.

4.3 Hidden Layer PCA

Li et al. [24] apply PCA to the values after inner convolutional layers of the neural network, and use a cascade classifier to detect adversarial examples. Specifically, they propose building a cascade classifier that accepts the input as natural only if all classifiers C_i accept the input, and rejects it if any classifier rejects it. Each classifier C_i is a linear SVM that acts on the PCA of the i-th convolutional layer of the network.

They evaluate their scheme on ImageNet. In the remainder of this section we demonstrate that their defense is not effective on MNIST and CIFAR; others have shown that attacking an ImageNet classifier would be even easier [8, 26].

Zero-Knowledge Attack Evaluation. Li et al. generated adversarial examples with Szegedy et al.'s L-BFGS algorithm [39] and found that the first linear SVM achieved an 80% true positive rate at a 0% false positive rate, an ideal use case for a cascade classifier. We evaluated the effectiveness of their method at detecting adversarial examples generated with C&W's attack (when trained on adversarial examples generated with the C&W attack). For MNIST, their scheme only achieves a 62% true positive rate at a 37% false positive rate (and 56% at 40% on CIFAR) on the first layer of the cascade classifier. This results in the cascade classifiers having 92% and 98% false positive rates on MNIST and CIFAR, respectively. We conclude that Li's defense is not effective against the C&W attack.

5 DISTRIBUTIONAL DETECTION

Next, we study two defenses that detect adversarial examples by comparing the distribution of natural images to the distribution of adversarial examples. They use classical statistical methods to distinguish natural images from adversarial images.

5.1 Maximum Mean Discrepancy

Grosse et al. [15] consider a very powerful threat model: assume we are given two sets of images S_1 and S_2, such that we know S_1 contains only natural images, and we know that S_2 contains either all adversarial examples or all natural images. They ask the question: can we determine which of these two situations is the case?

To achieve this, they use the Maximum Mean Discrepancy (MMD) test [7, 14], a statistical hypothesis test that answers the question "are these two sets drawn from the same underlying distribution?"

The MMD is a theoretically useful technique that can be formally shown to always detect a difference if one occurs. However, it is computationally infeasible to compute exactly, so a simple polynomial-time approximation is almost always used. In our experiments, we use the same approximation used by Grosse et al. [14].

To test whether X_1 and X_2 are drawn from the same distribution, Grosse et al. use Fisher's permutation test [28] with the MMD test statistic. To do this, initially let a = MMD(X_1, X_2). Then, shuffle the elements of X_1 and X_2 into two new sets Y_1 and Y_2, and let b = MMD(Y_1, Y_2). Repeat this shuffling many times and report the p-value as the fraction of shuffles for which b is at least as large as a; if the p-value is small, reject the null hypothesis and conclude the two sets are drawn from different distributions.
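
A sketch of this test follows. The Gaussian-kernel estimate of the MMD and the bandwidth sigma are illustrative choices on our part, not necessarily the exact approximation used by Grosse et al.; inputs are assumed to be 2-D arrays of flattened images.

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy between two
    samples, using a Gaussian kernel (a common polynomial-time approximation)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def permutation_test(X1, X2, trials=1000, seed=0):
    """Fisher's permutation test with MMD as the test statistic: the p-value is
    the fraction of shuffles whose statistic is at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    a = mmd2(X1, X2)
    pooled = np.concatenate([X1, X2])
    n = len(X1)
    count = 0
    for _ in range(trials):
        perm = rng.permutation(len(pooled))
        Y1, Y2 = pooled[perm[:n]], pooled[perm[n:]]
        if mmd2(Y1, Y2) >= a:
            count += 1
    return count / trials
```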

As a first step, we verified that MMD correctly detects natural instances as being natural. To do this, we assign X_1 and X_2 to be random samples from the test set. The procedure above correctly fails to reject the null hypothesis (p > 0.05) when distinguishing the two distributions.

Zero-Knowledge Attack Evaluation. We repeated this experiment, this time producing targeted adversarial examples with C&W's attack algorithm. Even when using a set of 100 images, MMD fails to reject the null hypothesis (p > 0.05). MMD also fails to detect attacks on CIFAR, even with a set size as large as 10,000 (which uses every sample in the test set). Since MMD is one of the most powerful multidimensional statistical tests, and even it is not effective, we argue that without significant modification, statistical tests will not be able to detect adversarial examples.

5.2 Kernel Density Estimation

Feinman et al. [11] present a defense they call kernel density estimation. They use a Gaussian Mixture Model to model outputs from the final hidden layer of a neural network, and argue that adversarial examples belong to a different distribution than that of natural images.

Specifically, given an instance x classified as label t, kernel density estimation estimates the likelihood of x as

๐พ๐ท๐ธ(๐‘ฅ) =1

|๐‘‹๐‘ก|โˆ‘๐‘ โˆˆ๐‘‹๐‘ก

exp(|๐น๐‘›โˆ’1(๐‘ฅ)โˆ’ ๐น๐‘›โˆ’1(๐‘ )|2

๐œŽ2)

where ๐‘‹๐‘ก is the set of training instances with label ๐‘ก and๐น๐‘›โˆ’1(๐‘ฅ) is the output of the final hidden layer on input ๐‘ฅ.The detector is therefore constructed by selecting a threshold๐œ and reporting ๐‘ฅ as adversarial if ๐พ๐ท๐ธ(๐‘ฅ) < ๐œ , otherwisereporting ๐‘ฅ as natural.

The motivation behind this approach is that the later hidden layers of a neural network have been shown to capture high-level semantic information about the input. Therefore, using a simple classifier on this final layer will be more accurate than if it were applied to the original input images, as the prior defense did.

Zero-Knowledge Attack Evaluation. Feinman's defense is able to detect C&W adversarial examples on MNIST, but not on CIFAR. Looking deeper, on CIFAR, for each image x in the test set and its closest adversarial example x′, we compare KDE(x′) to KDE(x). Surprisingly, we find that 80% of the time, the adversarial example has a higher likelihood score than the original image. Therefore, Feinman's defense cannot work on CIFAR. In the remainder of this section, we show how to break this defense on MNIST with increased distortion.

Perfect-Knowledge Attack Evaluation. To mount a white-box attack, we construct a new minimization formulation that differs from the original only in that we introduce a new loss term ℓ_2(x′) that penalizes being detected by the detector:

minimize ‖x − x′‖_2^2 + c · (ℓ(x′) + ℓ_2(x′))

where we define

ℓ_2(x′) = max(−log(KDE(x′)) − ε, 0),

and where ε controls the likelihood measure of the adversarial examples. In our attack, we set ε to the median of −log(KDE(·)) on the training set, so that ℓ_2(x′) = 0 if and only if KDE(x′) is greater than the KDE of at least half of the training instances.

In practice, we mount this attack in two phases. First, we solve the original C&W minimization formulation to obtain an adversarial example x̂. Typically x̂ will be detected by the detector, so in the second phase we modify it to no longer be detected: we use this x̂ as the initial value of x′ in the above optimization problem and use gradient descent to improve it. Performing this two-step optimization is useful because it allows different constants c to be chosen for initially generating an adversarial example and for making it evade detection.
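
A sketch of evaluating this combined objective is below (the objective only, not the gradient-descent attack itself); all logits and hidden-layer features are assumed to be precomputed by the model.

```python
import numpy as np

def kde_evasion_objective(x, x_adv, z_adv, target, x_feat_adv, train_feats,
                          sigma, eps, c, kappa=0.0):
    """||x - x'||_2^2 + c * (l(x') + l_2(x')), where l_2 penalizes a low KDE score."""
    dist = float(((x_adv - x) ** 2).sum())
    other = np.max(np.delete(z_adv, target))
    l = max(other - z_adv[target], -kappa)                 # C&W logit loss
    d2 = ((train_feats - x_feat_adv) ** 2).sum(axis=1)
    kde = np.mean(np.exp(-d2 / sigma ** 2))
    l2 = max(-np.log(kde + 1e-30) - eps, 0.0)              # detector-evasion term
    return dist + c * (l + l2)
```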

This approach is effective at generating adversarial examples for MNIST that are not detected by Feinman's defense. The mean distortion is 3.81, up from 2.05 on an unsecured network. In order to implement this attack more efficiently, instead of computing the KDE across all training instances during the inner optimization loop, we randomly sample 100 instances (< 5% of the total) at first and use only those. A distortion of 3.81 begins to allow humans to notice the adversarial nature of these images (Figure 1).

We believe the reason this defense is so much more effective on MNIST than on CIFAR is that the defense is effectively a K-nearest-neighbor classifier with a neural network feature extractor. In fact, prior work has shown that using an RBF kernel SVM (essentially identical to what is done here) is effective at defending MNIST against adversarial examples.

Limited-Knowledge Attack Evaluation. Again we only demonstrate the vulnerability on MNIST: the defense does not work on CIFAR. Note that our white-box attack algorithm did not use the entire training set; it uses only a random sample of 100 instances from the training set (less than 5% of the training instances). This hints that black-box attacks might succeed. This is indeed the case: we found that when the defender uses the first 95% of the training data, and the last 5% of the training data is left for the adversary (who uses it to form their own KDE estimator), the attack still succeeds with equal distortion.

6 NORMALIZATION DETECTION

6.1 Dropout Randomization

Feinman et al. propose a second detection method called Bayesian neural network uncertainty that measures the uncertainty of the neural network on the given input. Instead of relying on the reported confidence of the network (which can easily be controlled by an adversary), they add randomization to the network. The hope is that a natural image will have the same (correct) label regardless of the random values chosen, while adversarial examples will not always be predicted with the same label. Dropout [37] is used as the method of adding randomness.

To determine the uncertainty of the network on a given instance, we apply the randomized network F_r (with dropout enabled) to the input L times. Then, the uncertainty is defined as

U(x) = (1/L) Σ_{i=1}^{L} ‖F_r(x)‖ − ‖ (1/L) Σ_{i=1}^{L} F_r(x) ‖

where we take ‖y‖ to mean y^T y (the squared L_2 norm). Notice that if the predictions F_r(x) are the same every time, the uncertainty is 0. In other words, the uncertainty is the sum of the variances of each component of the output, computed over the L outputs from the randomized network.

To classify an input x as adversarial or natural, we pick a threshold τ and a number of times L to evaluate the network, and compute U(x). We have found the algorithm is not sensitive to the choice of L as long as it is greater than 20 (and we pick L = 30 for the remainder of this section). We return "adversarial" if U(x) > τ and "natural" otherwise.
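
A minimal sketch of this uncertainty measure; `F_r` is a stand-in for the network evaluated with dropout enabled at inference time, returning a probability vector.

```python
import numpy as np

def uncertainty(F_r, x, L=30):
    """U(x): mean squared norm of the stochastic predictions minus the squared
    norm of their mean, i.e. the sum of per-class variances over L passes."""
    preds = np.stack([F_r(x) for _ in range(L)])           # L stochastic forward passes
    mean_sq_norm = (preds ** 2).sum(axis=1).mean()         # (1/L) sum_i ||F_r(x)||
    sq_norm_mean = (preds.mean(axis=0) ** 2).sum()         # ||(1/L) sum_i F_r(x)||
    return mean_sq_norm - sq_norm_mean

def detect(F_r, x, tau=0.02, L=30):
    """Flag x as adversarial when the uncertainty exceeds the threshold."""
    return uncertainty(F_r, x, L) > tau
```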

When we run this detection scheme on the adversarial examples generated by the FGS and JSMA attack algorithms on MNIST, it detects 96% of adversarial examples with a false positive rate lower than 1% when τ = 0.02.

Zero-Knowledge Attack Evaluation. We generated adversarial examples with C&W's attack and find 75% of these adversarial examples are detected by this defense on MNIST. Surprisingly, this defense is even more effective at detecting attacks on CIFAR: it detects 95% of C&W adversarial examples.

Perfect-Knowledge Attack Evaluation. Our white-box attacks work by defining an objective function that combines the model and the detector. In an initial attempt, we follow the approach taken to break the prior detection schemes and let

๐บ(๐‘ฅ)๐‘– =

{๐‘(๐‘ฅ)๐‘– if ๐‘– โ‰ค ๐‘

(1 + ๐‘ˆ(๐‘ฅ)โˆ’ ๐œ) ยทmax๐‘–

๐‘(๐‘ฅ)๐‘– if ๐‘– = ๐‘ + 1

(where ๐‘ + 1 is the new adversarial class). This objectivefunction is defined so that if ๐‘ˆ(๐‘ฅ) > ๐œ (i.e., ๐‘ฅ is classified asadversarial by Feinmanโ€™s defense) then we will have

arg max๐‘–(๐บ(๐‘ฅ)๐‘–) = ๐‘ + 1,

and if ๐‘ˆ(๐‘ฅ) < ๐œ (i.e., ๐‘ฅ is classified as natural) then

arg max๐‘–(๐บ(๐‘ฅ)๐‘–) = arg max๐‘–(๐น (๐‘ฅ)๐‘–).

We then attempt to generate adversarial examples by applying C&W's attack to G(·). However, this attack fails: the defense is still able to detect 60% of adversarial examples.

Looking Deeper. We investigate why this attack failed. Because the values of the function U(·) are computed by calling F(·) (and not Z(·)), there is no good constant c that C&W's algorithm can select. This causes gradient descent to perform poorly and rarely returns successful adversarial examples.

Therefore, for this defense, we abandon the approach of constructing a single function to optimize over and instead change the loss function ℓ used by the attack. Recall that the typical loss function used (for a given network Z(·)) is

โ„“๐‘(๐‘ฅโ€ฒ) = max(max{๐‘(๐‘ฅโ€ฒ)๐‘– : ๐‘– = ๐‘ก} โˆ’ ๐‘(๐‘ฅโ€ฒ)๐‘ก,โˆ’๐œ…).

If every prediction F_r(x) of the randomized network produced a very similar result, the uncertainty value U(x) would be very small.

We sample K different deterministic networks {Z_j(·) : j ∈ [1, K]}, each with different randomness used during dropout. If we were able to have arg max_i Z_j(x)_i = t for every network j, then for K big enough it would be highly likely that F_r(x) would always produce label t for any randomness. Thus, we construct a new loss function

ℓ′(x′) = ∑_{j=1}^{K} ℓ_{Z_j}(x′)

as the sum (equivalently, up to the constant factor 1/K, the average) of the loss functions on each fixed model Z_j. Then we use C&W's attack with this revised loss function.

This approach successfully generates adversarial examples that fool the dropout defense with 98% success. On MNIST, the mean L2 distortion is 3.68, up from the baseline of 2.05. This is the largest distortion required by any of the defenses we have evaluated; the distortion here is nearing the levels of human perceptibility (Figure 1). On CIFAR the distortion required again increases, to 1.1, a factor of 5× larger, but is still entirely imperceptible (Figure 1).

Limited-Knowledge Attack Evaluation. It turns out that adversarial examples generated with the white-box approach transfer with high probability across models. This is due to the fact that our white-box attack assumes we do not know the exact randomization settings and therefore constructs adversarial examples that are effective regardless of randomization. This is similar to the black-box threat model, where the adversary does not have access to the model parameters.

However, to improve the rate of transferability, we again construct two models F(·) and G(·) on different subsets of the training data. We provide the adversary access to the parameters of F and use the white-box attack above to generate adversarial examples for F; we then test whether they fool G.

On MNIST, we find that adversarial examples generated with our algorithm transfer to G with 90% success. We can increase the transfer rate to 98% at the cost of increasing the mean distortion by only 15%, to 4.23. While this 15% increase may seem like a failure of this defense under the black-box threat model, we see this instead as a success of the defense under the white-box threat model. It has made constructing adaptive white-box attacks nearly as hard as constructing black-box attacks.

In fact, we find that this is the reason why the CIFAR network has such a large increase in distortion: obtaining transferable CIFAR adversarial examples requires a distortion approximately 4× larger than non-transferable attacks. We consider this the only defense we evaluate that is not completely broken (even though we can still generate adversarial examples that are imperceptible on CIFAR).

6.2 Mean Blur

The second detection method proposed by Li et al. applies a 3 × 3 average filter to blur the image before applying the classifier. The authors admit this defense is "overly simplistic" but still argue it is effective at alleviating adversarial examples. We confirm this simple defense can remove adversarial examples generated with fast gradient sign, as they found in their paper.

Zero-Knowledge Attack Evaluation. When we use C&W's attack, we find that this defense effectively removes low-confidence adversarial examples: 80% of adversarial examples (at a mean L2 distortion of 2.05) are no longer classified incorrectly.

This defense can even partially alleviate high-confidence adversarial examples. To ensure they remain adversarial after blurring, we must increase the distortion by a factor of 3×.

Perfect-Knowledge Attack Evaluation. Observe that taking the mean over every 3 × 3 region of the image is the same as adding another convolutional layer to the beginning of the neural network with one output channel that performs this calculation. Given the network F, we define F′(x) = F(blur(x)) and apply C&W's attack against F′. When we do so, we find that the mean distance to adversarial examples does not increase. Therefore, blurring is not an effective defense.
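A minimal PyTorch-style sketch of this composed model follows (the helper name blurred_model and the NCHW input layout are our assumptions; we alias torch.nn.functional as F_nn to avoid clashing with the notation F above):

```python
import torch
import torch.nn.functional as F_nn

def blurred_model(model, x):
    """F'(x) = F(blur(x)): a per-channel 3x3 mean filter expressed as a
    convolution, so the composition stays differentiable end to end and an
    iterative attack can be run against it directly."""
    c = x.shape[1]
    kernel = torch.full((c, 1, 3, 3), 1.0 / 9.0, device=x.device, dtype=x.dtype)
    x_blur = F_nn.conv2d(x, kernel, padding=1, groups=c)   # depthwise 3x3 mean
    return model(x_blur)
```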

7 LESSONS

7.1 Properties of adversarial examples

After examining these ten defenses, we now draw conclusions about the nature of the space of adversarial examples and the ability to detect them with different approaches.

Randomization can increase required distortion. By far the most effective defense technique, dropout randomization, made generating adversarial examples nearly five times more difficult on CIFAR. In particular, it makes generating adversarial examples on the network as difficult as generating transferable adversarial examples, a task known to be harder [29]. Additionally, if it were possible to find a way to eliminate transferability, a randomization-based defense may be able to detect adversarial examples. At this time, we believe this is the most promising direction of future work.

MNIST properties may not hold on CIFAR. Most defenses that increased the distortion on MNIST had a significantly lower distortion increase on CIFAR. In particular, kernel density estimation, the most effective defense on MNIST, was completely ineffective on CIFAR.

Detection neural networks can be bypassed. Across all of the defenses we evaluate, the least effective schemes used another neural network (or more neural network layers) to attempt to identify adversarial examples. Given that adversarial examples can fool a single classifier, it makes sense that adversarial examples can fool a classifier and detector.


Operating on raw pixel values is ineffective. Defenses that operated directly on the pixel values were too simple to succeed. On MNIST, these defenses provided reasonable robustness against weak attacks; however, when evaluated against stronger attacks, these defenses all failed. This should not be surprising: the reason neural networks are used is that they are able to extract deep and meaningful features from the input data. A simple linear detector is not effective at classification when operating on raw pixel values, so it should not be surprising that it does not work at detecting adversarial examples. (This can be seen especially well on CIFAR, where even weak attacks often succeed against defenses that operate on the input pixel space.)

7.2 Recommendations for Defenses

We have several recommendations for how researchers proposing new defenses can better evaluate their proposals. Many of these recommendations may appear obvious; however, most of the papers we evaluate follow none of them.

Evaluate using a strong attack. Evaluate proposed defenses using the strongest attacks known. Do not use fast gradient sign or JSMA exclusively: most defenses that detect these attacks fail against stronger attacks. In particular, fast gradient sign was not even designed to produce high-quality attacks: it was created to demonstrate that neural networks are highly linear. Using these algorithms as a first test is a reasonable first step, but it is not sufficient. We recommend that new schemes evaluate against strong iterative attacks.

Demonstrate white-box attacks fail. It is not sufficient to show that a defense can detect adversarial examples: one must also show that an adversary aware of the defense cannot generate attacks that evade detection. We show how to perform that kind of evaluation: construct a differentiable function that is minimized when the image fools the classifier and is treated as natural by the detector, and apply a strong iterative attack (e.g., C&W's attack) to this function.

Report false positive and true positive rates. When constructing a detection-based defense, it is not enough to report the accuracy of the detector. A 60% accuracy can either be very useful (e.g., if it achieves a high true-positive rate at a 0% false-positive rate) or entirely useless (e.g., if it flags most adversarial images as adversarial only at the cost of also flagging many natural images as adversarial). Instead, report both the false positive and true positive rates. To allow for comparisons with other work, we suggest reporting at least the true positive rate at a 1% false positive rate; showing a ROC curve would be even better.
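For instance, the true-positive rate at a fixed false-positive rate can be computed as in the sketch below (the score arrays and function name are hypothetical; higher detector scores mean "more likely adversarial"):

```python
import numpy as np

def tpr_at_fpr(scores_natural, scores_adversarial, max_fpr=0.01):
    """Choose the detection threshold so that at most max_fpr of natural
    inputs are flagged, then report the fraction of adversarial inputs caught."""
    tau = np.quantile(np.asarray(scores_natural), 1.0 - max_fpr)
    return float(np.mean(np.asarray(scores_adversarial) > tau))
```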

Evaluate on more than MNIST. We have found that defenses that only evaluated on the MNIST dataset typically either (a) were unable to produce an accurate classifier on CIFAR, (b) were entirely useless on CIFAR and were not able to detect even the fast gradient sign attack, or (c) were even weaker against attacks on CIFAR than the other defenses we evaluated. Future schemes need to be evaluated on multiple datasets; evaluating their security solely on MNIST is not sufficient. While we have found CIFAR to be a reasonable task for evaluating security, in the future, as defenses improve, it may become necessary to evaluate on harder datasets (such as ImageNet [10]).

Release source code. In order to allow others to build on their work, authors should release the source code of their defenses. Not releasing source code only sets back the research community and hinders future security analysis. Seven of the ten defenses we evaluate did not release their code (even after we contacted the authors), requiring us to reimplement the defenses before evaluation.

8 CONCLUSION

Unlike standard machine-learning tasks, where achieving higher accuracy on a single benchmark is in itself a useful and interesting result, benchmark accuracy alone is not sufficient for secure machine learning. We must consider how an attacker might react to any proposed defense, and evaluate whether the defense remains secure against an attacker who knows how the defense works.

In this paper we evaluate ten proposed defenses and demonstrate that none of them are able to withstand a white-box attack. We do this by constructing defense-specific loss functions that we minimize with a strong iterative attack algorithm. With these attacks, on CIFAR an adversary can create imperceptible adversarial examples for each defense.

By studying these ten defenses, we have drawn two lessons: existing defenses lack thorough security evaluations, and adversarial examples are much more difficult to detect than previously recognized. We hope that our work will help raise the bar for evaluation of proposed defenses and perhaps help others to construct more effective defenses. Further, our evaluations of these defenses expand what is believed to be possible when constructing adversarial examples: we have shown that, so far, there are no known intrinsic properties that differentiate adversarial examples from regular images. We believe that constructing defenses to adversarial examples is an important challenge that must be overcome before these networks can be used in potentially security-critical domains, and we hope our work brings us closer to this goal.

9 ACKNOWLEDGEMENTS

We would like to thank Kathrin Grosse, Reuben Feinman, Fuxin Li, and Jan Hendrik Metzen for discussing their defenses with us, along with the anonymous reviewers for their feedback. This work was supported by the AFOSR under MURI award FA9550-12-1-0040, Intel through the ISTC for Secure Computing, the Hewlett Foundation through the Center for Long-Term Cybersecurity, and Qualcomm.


REFERENCES

[1] Marco Barreno, Blaine Nelson, Anthony D Joseph, and JD Tygar. 2010. The security of machine learning. Machine Learning 81, 2 (2010), 121–148.

[2] Marco Barreno, Blaine Nelson, Russell Sears, Anthony D Joseph, and J Doug Tygar. 2006. Can machine learning be secure? In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security. ACM, 16–25.

[3] Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya Nori, and Antonio Criminisi. 2016. Measuring neural net robustness with constraints. In Advances in Neural Information Processing Systems. 2613–2621.

[4] Arjun Nitin Bhagoji, Daniel Cullina, and Prateek Mittal. 2017. Dimensionality Reduction as a Defense against Evasion Attacks on Machine Learning Classifiers. arXiv preprint arXiv:1704.02654 (2017).

[5] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. 2013. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 387–402.

[6] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, and others. 2016. End to End Learning for Self-Driving Cars. arXiv preprint arXiv:1604.07316 (2016).

[7] Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Scholkopf, and Alex J Smola. 2006. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22, 14 (2006), e49–e57.

[8] Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. IEEE Symposium on Security and Privacy (2017).

[9] Nilesh Dalvi, Pedro Domingos, Sumit Sanghai, Deepak Verma, and others. 2004. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 99–108.

[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.

[11] Reuben Feinman, Ryan R Curtin, Saurabh Shintre, and Andrew B Gardner. 2017. Detecting Adversarial Samples from Artifacts. arXiv preprint arXiv:1703.00410 (2017).

[12] Zhitao Gong, Wenlu Wang, and Wei-Shinn Ku. 2017. Adversarial and Clean Data Are Not Twins. arXiv preprint arXiv:1704.04960 (2017).

[13] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).

[14] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Scholkopf, and Alexander Smola. 2012. A kernel two-sample test. Journal of Machine Learning Research 13, Mar (2012), 723–773.

[15] Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. 2017. On the (Statistical) Detection of Adversarial Examples. arXiv preprint arXiv:1702.06280 (2017).

[16] Shixiang Gu and Luca Rigazio. 2014. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068 (2014).

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.

[18] Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. 2017. On Detecting Adversarial Perturbations. In International Conference on Learning Representations. arXiv preprint arXiv:1702.04267.

[19] Dan Hendrycks and Kevin Gimpel. 2017. Early Methods for Detecting Adversarial Images. In International Conference on Learning Representations (Workshop Track).

[20] Ruitong Huang, Bing Xu, Dale Schuurmans, and Csaba Szepesvari. 2015. Learning with a strong adversary. CoRR abs/1511.03034 (2015).

[21] Jonghoon Jin, Aysegul Dundar, and Eugenio Culurciello. 2015. Robust Convolutional Neural Networks under Adversarial Noise. arXiv preprint arXiv:1511.06306 (2015).

[22] Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. (2009).

[23] Yann LeCun, Corinna Cortes, and Christopher JC Burges. 1998. The MNIST database of handwritten digits. (1998).

[24] Xin Li and Fuxin Li. 2016. Adversarial Examples Detection in Deep Networks with Convolutional Filter Statistics. arXiv preprint arXiv:1612.07767 (2016).

[25] Daniel Lowd and Christopher Meek. 2005. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM, 641–647.

[26] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. 2016. DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2574–2582.

[27] Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 807–814.

[28] Anders Oden and Hans Wedel. 1975. Arguments for Fisher's permutation test. The Annals of Statistics (1975), 518–520.

[29] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. 2016. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277 (2016).

[30] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. 2016. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on. IEEE, 372–387.

[31] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. 2016. Distillation as a defense to adversarial perturbations against deep neural networks. IEEE Symposium on Security and Privacy (2016).

[32] Slav Petrov. 2016. Announcing SyntaxNet: The world's most accurate parser goes open source. Google Research Blog, May 12, 2016.

[33] Andras Rozsa, Ethan M Rudd, and Terrance E Boult. 2016. Adversarial diversity and hard positive generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 25–32.

[34] Uri Shaham, Yutaro Yamada, and Sahand Negahban. 2015. Understanding Adversarial Training: Increasing Local Stability of Neural Nets through Robust Optimization. arXiv preprint arXiv:1511.05432 (2015).

[35] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, and others. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.

[36] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. 2015. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations (Workshop Track).

[37] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.

[38] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818–2826.

[39] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. (2014).

[40] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, and others. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).

[41] Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. 2016. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4480–4488.