interpretable deep learning under fire - arxiv · and even playing go [49]), enabling use cases...

Interpretable Deep Learning under Fire

Xinyang Zhang∗ Ningfei Wang? Hua Shen∗ Shouling Ji† Xiapu Luo‡ Ting Wang∗∗Pennsylvania State University ?University of California Irvine

†Zhejiang University and Alibaba-ZJU Joint Institute of Frontier Technologies‡Hong Kong Polytechnic University

AbstractProviding explanations for deep neural network (DNN)

models is crucial for their use in security-sensitive domains.A plethora of interpretation models have been proposed tohelp users understand the inner workings of DNNs: how doesa DNN arrive at a specific decision for a given input? Theimproved interpretability is believed to offer a sense of se-curity by involving human in the decision-making process.Yet, due to its data-driven nature, the interpretability itselfis potentially susceptible to malicious manipulations, aboutwhich little is known thus far.

Here we bridge this gap by conducting the first systematicstudy on the security of interpretable deep learning systems(IDLSes). We show that existing IDLSes are highly vulner-able to adversarial manipulations. Specifically, we presentADV2, a new class of attacks that generate adversarial inputsnot only misleading target DNNs but also deceiving theircoupled interpretation models. Through empirical evaluationagainst four major types of IDLSes on benchmark datasetsand in security-critical applications (e.g., skin cancer diag-nosis), we demonstrate that with ADV2 the adversary is ableto arbitrarily designate an input’s prediction and interpreta-tion. Further, with both analytical and empirical evidence, weidentify the prediction-interpretation gap as one root causeof this vulnerability – a DNN and its interpretation modelare often misaligned, resulting in the possibility of exploitingboth models simultaneously. Finally, we explore potentialcountermeasures against ADV2, including leveraging its lowtransferability and incorporating it in an adversarial trainingframework. Our findings shed light on designing and operat-ing IDLSes in a more secure and informative fashion, leadingto several promising research directions.

1 IntroductionThe recent advances in deep learning have led to break-

throughs in many long-standing machine learning tasks (e.g.,image classification [22], natural language processing [54],and even playing Go [49]), enabling use cases previouslyconsidered strictly experimental.

However, the state-of-the-art performance of deep neuralnetwork (DNN) models is often achieved at the cost of in-terpretability. It is challenging to intuitively understand the

Input<latexit sha1_base64="HfiBL7/3eQKuEO5raBIX81viL7c=">AAAB9XicbVDLSgMxFL1TX7W+qi7dBIvgqsxUQZdFN7qrYB/QjiWTZtrQJDMkGaUM/Q83LhRx67+482/MtLPQ1gOBwzn3knNPEHOmjet+O4WV1bX1jeJmaWt7Z3evvH/Q0lGiCG2SiEeqE2BNOZO0aZjhtBMrikXAaTsYX2d++5EqzSJ5byYx9QUeShYygo2VHnoCm5ES6a2MEzPtlytu1Z0BLRMvJxXI0eiXv3qDiCSCSkM41rrrubHxU6wMI5xOS71E0xiTMR7SrqUSC6r9dJZ6ik6sMkBhpOyTBs3U3xspFlpPRGAns5R60cvE/7xuYsJLP2XZSVSS+UdhwpGJUFYBGjBFieETSzBRzGZFZIQVJsYWVbIleIsnL5NWreqdVWt355X6VV5HEY7gGE7Bgwuoww00oAkEFDzDK7w5T86L8+58zEcLTr5zCH/gfP4AMxOS9w==</latexit>

Interpretation<latexit sha1_base64="mINq20rofX39BmBdszBiSEb4NVI=">AAACAHicbVDLSsNAFJ3UV62vqgsXboJFcFWSKuiy6EZ3FewD2lAm05t26MwkzEyEErLxV9y4UMStn+HOv3GSZqGtBwYO59zD3Hv8iFGlHefbKq2srq1vlDcrW9s7u3vV/YOOCmNJoE1CFsqejxUwKqCtqWbQiyRg7jPo+tObzO8+glQ0FA96FoHH8VjQgBKsjTSsHg041hPJkzuhQZqozo10WK05dSeHvUzcgtRQgdaw+jUYhSTmIDRhWKm+60TaS7DUlDBIK4NYQYTJFI+hb6jAHJSX5Aek9qlRRnYQSvOEtnP1dyLBXKkZ981ktq5a9DLxP68f6+DKS6iIYg2CzD8KYmbr0M7asEdUAtFsZggmkppdbTLBEhNThqqYEtzFk5dJp1F3z+uN+4ta87qoo4yO0Qk6Qy66RE10i1qojQhK0TN6RW/Wk/VivVsf89GSVWQO0R9Ynz/4EpdK</latexit>

Input<latexit sha1_base64="HfiBL7/3eQKuEO5raBIX81viL7c=">AAAB9XicbVDLSgMxFL1TX7W+qi7dBIvgqsxUQZdFN7qrYB/QjiWTZtrQJDMkGaUM/Q83LhRx67+482/MtLPQ1gOBwzn3knNPEHOmjet+O4WV1bX1jeJmaWt7Z3evvH/Q0lGiCG2SiEeqE2BNOZO0aZjhtBMrikXAaTsYX2d++5EqzSJ5byYx9QUeShYygo2VHnoCm5ES6a2MEzPtlytu1Z0BLRMvJxXI0eiXv3qDiCSCSkM41rrrubHxU6wMI5xOS71E0xiTMR7SrqUSC6r9dJZ6ik6sMkBhpOyTBs3U3xspFlpPRGAns5R60cvE/7xuYsJLP2XZSVSS+UdhwpGJUFYBGjBFieETSzBRzGZFZIQVJsYWVbIleIsnL5NWreqdVWt355X6VV5HEY7gGE7Bgwuoww00oAkEFDzDK7w5T86L8+58zEcLTr5zCH/gfP4AMxOS9w==</latexit>

Interpretation<latexit sha1_base64="mINq20rofX39BmBdszBiSEb4NVI=">AAACAHicbVDLSsNAFJ3UV62vqgsXboJFcFWSKuiy6EZ3FewD2lAm05t26MwkzEyEErLxV9y4UMStn+HOv3GSZqGtBwYO59zD3Hv8iFGlHefbKq2srq1vlDcrW9s7u3vV/YOOCmNJoE1CFsqejxUwKqCtqWbQiyRg7jPo+tObzO8+glQ0FA96FoHH8VjQgBKsjTSsHg041hPJkzuhQZqozo10WK05dSeHvUzcgtRQgdaw+jUYhSTmIDRhWKm+60TaS7DUlDBIK4NYQYTJFI+hb6jAHJSX5Aek9qlRRnYQSvOEtnP1dyLBXKkZ981ktq5a9DLxP68f6+DKS6iIYg2CzD8KYmbr0M7asEdUAtFsZggmkppdbTLBEhNThqqYEtzFk5dJp1F3z+uN+4ta87qoo4yO0Qk6Qy66RE10i1qojQhK0TN6RW/Wk/VivVsf89GSVWQO0R9Ynz/4EpdK</latexit>

(c)D

ual

Adve

rsar

ial

<latexit sha1_base64="9uq8lCkDaGJbheou4njlAPrJj+k=">AAACBnicbVDLSsNAFJ3UV62vqEsRBotQNyWpgi7rY+Gygn1AE8rNZNoOnTyYmQgldOXGX3HjQhG3foM7/8ZJm4W2Hhg4nHMuc+/xYs6ksqxvo7C0vLK6VlwvbWxube+Yu3stGSWC0CaJeCQ6HkjKWUibiilOO7GgEHictr3Rdea3H6iQLArv1TimbgCDkPUZAaWlnnnoBKCGIkgr5OQmAe7gSz/Lg2DAJz2zbFWtKfAisXNSRjkaPfPL8SOSBDRUhIOUXduKlZuCUIxwOik5iaQxkBEMaFfTEAIq3XR6xgQfa8XH/UjoFyo8VX9PpBBIOQ48ncyWlvNeJv7ndRPVv3BTFsaJoiGZfdRPOFYRzjrBPhOUKD7WBIhgeldMhiCAKN1ESZdgz5+8SFq1qn1ard2dletXeR1FdICOUAXZ6BzV0S1qoCYi6BE9o1f0ZjwZL8a78TGLFox8Zh/9gfH5A/y6mNI=</latexit>

(b)A

dve

rsar

ial

<latexit sha1_base64="RgeUUGhFH94kVwxCq4vpimsljSE=">AAACAHicbVC7TsMwFHXKq5RXgIGBxaJCKkuVFCQYCyyMRaIPqY0qx3Faq7YT2Q5SFWXhV1gYQIiVz2Djb3DaDNByJEtH55wr33v8mFGlHefbKq2srq1vlDcrW9s7u3v2/kFHRYnEpI0jFsmejxRhVJC2ppqRXiwJ4j4jXX9ym/vdRyIVjcSDnsbE42gkaEgx0kYa2kcDjvRY8rTmn10HeRJJilg2tKtO3ZkBLhO3IFVQoDW0vwZBhBNOhMYMKdV3nVh7KZKaYkayyiBRJEZ4gkakb6hAnCgvnR2QwVOjBDCMpHlCw5n6eyJFXKkp900yX1ctern4n9dPdHjlpVTEiSYCzz8KEwZ1BPM2YEAlwZpNDUFYUrMrxGMkEdamiYopwV08eZl0GnX3vN64v6g2b4o6yuAYnIAacMElaII70AJtgEEGnsEreLOerBfr3fqYR0tWMXMI/sD6/AHanpaT</latexit>

(a)B

enig

n<latexit sha1_base64="ioXxm8UlObaRLpfMj+F1bqWsqj0=">AAAB+3icbVDLSsNAFL2pr1pfsS7dBItQNyWpgi5L3bisYB/QhjKZTtqhM5MwMxFL6K+4caGIW3/EnX/jpM1CWw8MHM65l3vmBDGjSrvut1XY2Nza3inulvb2Dw6P7ONyR0WJxKSNIxbJXoAUYVSQtqaakV4sCeIBI91gepv53UciFY3Eg57FxOdoLGhIMdJGGtrlAUd6InlaRRdNIuhYzId2xa25CzjrxMtJBXK0hvbXYBThhBOhMUNK9T031n6KpKaYkXlpkCgSIzxFY9I3VCBOlJ8uss+dc6OMnDCS5gntLNTfGyniSs14YCazpGrVy8T/vH6iwxs/pSJONBF4eShMmKMjJyvCGVFJsGYzQxCW1GR18ARJhLWpq2RK8Fa/vE469Zp3WavfX1UazbyOIpzCGVTBg2towB20oA0YnuAZXuHNmlsv1rv1sRwtWPnOCfyB9fkDx4uURQ==</latexit>

Figure 1: Sample (a) benign, (b) regular adversarial, and (c) dualadversarial inputs and interpretations on ResNet [22] (classifier) andCAM [64] (interpreter).

inference of complicated DNNs – how does a DNN arriveat a specific decision for a given input – due to their highnon-linearity and nested architectures. This is a major draw-back for applications in which the interpretability of decisionsis a critical prerequisite, while simple black-box predictionscannot be trusted by default. Another drawback of DNNsis their inherent vulnerability to adversarial inputs – mali-ciously crafted samples to trigger target DNNs to malfunc-tion [9,28,56] – which leads to unpredictable model behaviorsand hinders their use in security-sensitive domains.

The drawbacks have spurred intensive research on improv-ing the DNN interpretability via providing explanations ateither model-level [26,45,63] or instance-level [10,16,43,50].For example, in Figure 1 (a), an attribution map highlights aninput’s most informative part with respect to its classification,revealing their causal relationship. Such interpretability helpsusers understand the inner workings of DNNs, enabling usecases including model debugging [39], digesting security anal-ysis results [20], and detecting adversarial inputs [13]. Forinstance, in Figure 1 (b), an adversarial input, which causesthe target DNN to deviate from its normal behavior, gener-ates an attribution map highly distinguishable from its benigncounterpart, and is thus easily detectable.

As illustrated in Figure 2, a DNN model (classifier), cou-pled with an interpretation model (interpreter), forms an in-terpretable deep learning system (IDLS). The enhanced inter-

1

arX

iv:1

812.

0089

1v3

[cs

.CR

] 1

8 Se

p 20

19

Input ClassifierPrediction

f

?

Interpretation Interpreter

Figure 2: Workflow of an interpretable deep learning system (IDLS).

pretability of IDLSes is believed to offer a sense of security byinvolving human in the decision process [57]. However, givenits data-driven nature, this interpretability itself is potentiallysusceptible to malicious manipulations. Unfortunately, thusfar, little is known about the security vulnerability of IDLSes,not to mention mitigating such threats.

Our Work. To bridge the gap, in this paper, we conduct acomprehensive study on the security vulnerability of IDLSes,which leads to the following interesting findings.

First, we demonstrate that existing IDLSes are highly vul-nerable to adversarial manipulations. We present ADV2, anew class of attacks that generate adversarial inputs not onlymisleading a target DNN but also deceiving its coupled in-terpreter. By empirically evaluating ADV2 against four ma-jor types of IDLSes on benchmark datasets and in security-critical applications (e.g., skin cancer diagnosis), we showthat it is practical to generate adversarial inputs with predic-tions and interpretations arbitrarily chosen by the adversary.For example, Figure 1 (c) shows adversarial inputs that aremisclassified by target DNNs and also interpreted highly sim-ilarly to their benign counterparts. Thus the interpretability ofIDLSes merely provides limited security assurance.

Then, we show that one possible root cause of this attackvulnerability lies in the prediction-interpretation gap: the in-terpreter is often misaligned with the classifier, while theinterpreter’s interpretation only partially explains the classi-fier’s behavior, allowing the adversary to exploit both modelssimultaneously. This finding entails several intriguing ques-tions: (i) what, in turn, is the possible cause of this gap? (ii)how does this gap vary across different interpreters? (iii) whatis its implication for designing more robust interpreters? Weexplore all these key questions in our study.

Further, we investigate the transferability of ADV2 acrossdifferent interpreters. We note that it is often difficult to findadversarial inputs transferable across distinct types of inter-preters, as they generate interpretations from complementaryperspectives (e.g., back-propagation, intermediate representa-tions, input-prediction correspondence). This finding pointsto training an ensemble of interpreters as one potential coun-termeasure against ADV2.

Finally, we present adversarial interpretation distillation(AID), an adversarial training framework which integratesADV2 in training interpreters. We show that AID effectively

Notation Definition

f , g target classifier, interpreterx◦, x∗ benign, adversarial inputct , mt adversary’s target class, interpretationx[i] i-th dimension of xε perturbation magnitude bound‖ · ‖ vector norm

`int, `prd, `adv interpretation, prediction, overall lossα learning rate

Table 1. Symbols and notations.

reduces the prediction-interpretation gap and potentially helpsimprove the robustness of interpreters against ADV2.

To our best knowledge, this work represents the first sys-tematic study on the security vulnerability of existing IDLSes.We believe our findings shed light on designing and operatingIDLSes in a more secure and informative manner.

Roadmap. The remainder of the paper proceeds as follows.§ 2 introduces fundamental concepts; § 3 presents the ADV2

attack and details its implementation against four major typesof interpreters; § 4 empirically evaluates its effectiveness; § 5explores the fundamental causes of the attack vulnerabilityand discusses possible countermeasures; § 6 surveys relevantliterature; the paper is concluded in § 7.

2 PreliminariesWe begin with introducing a set of fundamental concepts

and assumptions. The symbols and notations used in thispaper are summarized in Table 1.

Classifier – In this paper, we primarily focus on predictivetasks (e.g., image classification [12]), in which a DNN f (i.e.,classifier) assigns a given input x to one of a set of predefinedclasses C , f (x) = c ∈ C .

Interpreter – In general, the DNN interpretability canbe obtained in two ways: designing interpretable DNNs[45, 62] or extracting post-hoc interpretations. The lattercase does not require modifying model architectures or pa-rameters, thereby leading to higher prediction accuracy. Wethus mainly consider post-hoc interpretations in this paper.More specifically, we focus on instance-level interpretabil-ity [10,16,26,37,38,45,48,50,63], which explains how a DNNf classifies a given input x and uncovers the causal relation-ship between x and f (x). We assume such interpretations aregiven in the form of attribution maps. As shown in Figure 2,the interpreter g generates an attribution map m = g(x; f ),with its i-th element m[i] quantifying the importance of x’si-th feature x[i] with respect to f (x).

Adversarial Attack – DNNs are inherently vulnerable toadversarial inputs, which are maliciously crafted samples toforce DNNs to misbehave [36, 56]. Typically, an adversarialinput x∗ is generated by modifying a benign input x◦ via pixelperturbation (e.g., PGD [35]) or spatial transformation (e.g.,STADV [60]), with the objective of forcing f to misclassifyx∗ to a target class ct , f (x∗) = ct 6= f (x◦). To ensure the at-tack evasiveness, the modification is often constrained to anallowed set (e.g., a norm ball Bε(x◦) = {x|‖x− x◦‖∞ ≤ ε}).

2

Consider PGD, a universal first-order adversarial attack, as aconcrete case. At a high level, PGD implements a sequenceof project gradient descent on the loss function:

x(i+1) = ΠBε(x◦)(x(i)−αsgn

(∇x`prd

(f(x(i)),ct)))

(1)

where Π is the projection operator, α represents the learn-ing rate, the loss function `prd measures the difference ofthe model prediction f (x) and the class ct targeted by theadversary (e.g., cross entropy), and x(0) is initialized as x◦.

Threat Model – Following the line of work on adversarialattacks [9,19,35,56], we assume in this paper a white-box set-ting: the adversary has complete access to the classifier f andthe interpreter g, including their architectures and parameters.This is a conservative and realistic assumption. Prior work hasshown that it is possible to train a surrogate model f ′ givenblack-box access to a target DNN f [41]; given that the inter-preter is often derived directly from the classifier (details in§ 3), the adversary may then train a substitution interpreter g′

based on f ′. We consider investigating such black-box attacksas our ongoing work.

3 ADV2 AttackThe interpretability of IDLSes is believed to offer a sense

of security by involving human in the decision process [13,17, 20, 57]; this belief has yet to be rigorously tested. Webridge this gap by presenting ADV2, a new class of attacksthat deceive target DNNs and their interpreters simultaneously.Below we first give an overview of ADV2 and then detail itsinstantiations against four major types of interpreters.

3.1 Attack FormulationThe ADV2 attack deceives both the DNN f and its coupled

interpreter g. Specifically, ADV2 generates an adversarial in-put x∗ by modifying a benign input x◦ such that

• (i) x∗ is misclassified by f to a target class ct , f (x∗) = ct ;

• (ii) x∗ triggers g to generate a target attribution map mt ,g(x∗; f ) = mt ;

• (iii) The difference between x∗ and x◦, ∆(x∗,x◦), is im-perceptible;

where the distance function ∆ depends on the concrete mod-ification: for pixel perturbation (e.g., [35]), it is instantiatedas Lp norm, while for spatial transformation (e.g., [60]), it isdefined as the overall spatial distortion.

In other words, the goal is to find sufficiently small per-turbation to the benign input that leads to the prediction andinterpretation desired by the adversary.

At a high level, we formulate ADV2 using the followingoptimization framework:

minx

∆(x,x◦) s.t.{

f (x) = ctg(x; f ) = mt

(2)

where the constraints ensure that (i) the adversarial input ismisclassified as ct and (ii) it triggers g to generate the targetattribution map mt .

As the constraints of f (x) = ct and g(x; f ) = mt are highlynon-linear for practical DNNs, we reformulate Eqn (2) in aform more suited for optimization:

minx

`prd( f (x),ct)+λìnt (g(x; f ),mt)

s.t. ∆(x,x◦)≤ ε (3)

where the prediction loss `prd is the same as in Eqn (1), theinterpretation loss ìnt measures the difference of adversarialmap g(x; f ) and target map mt , and the hyper-parameter λ

balances the two factors. Below we use àdv(x) to denote theoverall loss function defined in Eqn (3).

We construct the solver of Eqn (3) upon an adversarial at-tack framework. While it is flexible to choose the concreteframework, below we primarily use PGD [35] as the refer-ence and discuss the construction of ADV2 upon alternativeframeworks (e.g., spatial transformation [60]) in § 4.

Under this setting, we define `prd( f (x),ct) =− log( fct (x))(i.e., the negative log likelihood of x with respect to the classct), ∆(x,x◦) = ‖x− x◦‖∞, and ìnt(g(x; f ),mt) = ‖g(x; f )−mt‖2

2. In general, ADV2 searches for x∗ using a sequence ofgradient descent updates:

x(i+1) = ΠBε(x◦)(x(i)−αsgn

(∇xàdv

(x(i))))

(4)

However, directly applying Eqn (4) is often found inef-fective, due to the unique characteristics of individual inter-preters. In the following, we detail the instantiations of ADV2

against the back-propagation-, representation-, model-, andperturbation-guided interpreters, respectively.

3.2 Back-Propagation-Guided InterpretationThis class of interpreters compute the gradient (or its vari-

ants) of the model prediction with respect to a given input toderive the importance of each input feature. The hypothesisis that larger gradient magnitude indicates higher relevanceof the feature to the prediction. We consider gradient saliency(GRAD) [50] as a representative of this class.

Intuitively, GRAD considers a linear approximation of themodel prediction (probability) fc(x) for a given input x and agiven class c, and derives the attribution map m as:

m =

∣∣∣∣∂ fc(x)∂x

∣∣∣∣ (5)

To attack GRAD-based IDLSes, we may search for x∗ usinga sequence of gradient descent updates as defined in Eqn (4).However, according to Eqn (5), computing the gradient of theattribution map g(x; f ) amounts to computing the Hessianmatrix of fc(x), which is all-zero for DNNs with ReLU acti-vation functions. Thus the gradient of the interpretation lossìnt provides little information for updating x, which makesdirectly applying Eqn (4) ineffective.

3

1

-0.1 0 0.1

Figure 3: Comparison of h(z), σ(z), and r(z) near z = 0.

To overcome this, when performing back-propagation, wesmooth the gradient of ReLU, denoted by r(z), with a functionh(z) defined as (τ is a small constant, e.g., 10−4):

h(z),

{(z+√

z2 + τ)′ = 1+ z/√

z2 + τ (z < 0)(√

z2 + τ)′ = z/√

z2 + τ (z≥ 0)

Intuitively, h(z) tightly approximates r(z), while its gradi-ent is non-zero everywhere. Another possibility is the sig-moid function σ(z) = 1/(1+ e−z). Figure 3 compares dif-ferent functions near z = 0. Our evaluation shows that h(z)significantly outperforms σ(z) and r(z) in attacking GRAD.

This attack is extensible to other back-propagation-basedinterpreters (e.g., DEEPLIFT [48], SMOOTHGRAD [51], andLRP [6]), due to their fundamentally equivalent, gradient-centric formulations [3].

3.3 Representation-Guided InterpretationThis class of interpreters leverage the feature maps at in-

termediate layers of DNNs to generate attribution maps. Weconsider class activation mapping (CAM) [64] as a represen-tative interpreter of this class.

At a high level, CAM performs global average pooling [30]over the feature maps of the last convolutional layer, anduses the outputs as features for a linear layer with softmaxactivation to approximate the model predictions. Based onthis connectivity structure, CAM computes the attributionmaps by projecting the weights of the linear layer back to theconvolutional feature maps.

Formally, let ak[i, j] denote the activation of the k-th chan-nel of the last convolutional layer at the spatial position(i, j). The output of global average pooling is defined asAk = ∑i, j ak[i, j]. Further let wk,c be the weight of the con-nection between the k-th input and the c-th output of thelinear layer. The input to the softmax function for a class cwith respect to a given input x is approximated by:

zc(x)≈∑k

wk,c Ak = ∑i, j

∑k

wk,c ak[i, j] (6)

The class activation map mc is then given by:

mc[i, j] = ∑k

wk,c ak[i, j] (7)

Due to its use of deep representations at intermediate layers,CAM generates attribution maps of high visual quality andlimited noise and artifacts [30].

We instantiate g with a DNN that concatenates the part off up to its last convolutional layer and a linear layer parame-terized by {wk,c}. To attack CAM, we search for x∗ using a se-quence of gradient descent updates as defined in Eqn (4). Thisattack can be readily extended to other representation-guidedinterpreters (e.g., GRADCAM [47]), with details deferred toAppendix A1.

3.4 Model-Guided InterpretationInstead of relying on deep representations at intermediate

layers, model-guided methods train a meta-model to directlypredict the attribution map for any given input in a singlefeed-forward pass. We consider RTS [10] as a representativemethod in this category.

For a given input x in a class c, RTS finds its attributionmap m by solving the following optimization problem:

minm λ1rtv(m)+λ2rav(m)− log( fc (φ(x;m)))

+λ3 fc (φ(x;1−m))λ4

s.t. 0≤ m≤ 1(8)

Here rtv(m) denotes the total variation of m, which reducesnoise and artifacts in m; rav(m) represents the average valueof m, which minimizes the size of retained parts; φ(x;m) isthe operator using m as a mask to blend x with random colorsand Gaussian blur, which captures the impact of retained parts(where the mask is non-zero) on the model prediction; thehyper-parameters {λi}4

i=1 balance these factors. Intuitively,this formulation finds the sufficient and necessary parts of x,based on which f is able to make the prediction f (x) withhigh confidence.

However, solving Eqn (8) for every input during inferenceis fairly expensive. Instead, RTS trains a DNN to directlypredict the attribution map for any given input, without ac-cessing to the DNN f after training. In [44], this is achievedby composing a ResNet [22] pre-trained on ImageNet [12]as the encoder (which extracts feature maps of given inputsat different scales) and a U-NET [44] as the masking model,which is then trained to directly optimize Eqn (8). We con-sider the composition of this encoder and this masking modelas the interpreter g.

To attack RTS, one may directly apply Eqn (4). However,our evaluation shows that this strategy is often ineffectivefor finding desirable adversarial inputs. This is explained bythat the encoder enc(·) plays a significant role in generatingattribution maps, while solely relying on the outputs of themasking model is insufficient to guide the attack. We thusadd to Eqn (3) an additional loss term `enc(enc(x),enc(ct)),which measures the difference of the encoder’s outputs forthe adversarial input x and the target class ct .

We then search for the adversarial input x∗ with a sequenceof gradient descent updates defined in Eqn (4). More imple-mentation details are discussed in § 3.6.

4

3.5 Perturbation-Guided InterpretationThe fourth class of interpreters formulate finding the attri-

bution map by perturbing the input with minimum noise andobserving the change in the model prediction. We considerMASK [16] as a representative interpreter in this class.

For a given input x, MASK identifies its most informativeparts by checking whether changing such parts influences theprediction f (x). It learns a mask m, where m[i] = 0 if the i-thinput feature is retained and m[i] = 1 if the feature is replacedwith Gaussian noise. The optimal mask is found by solvingan optimization problem:

minm

fc(φ(x;m))+λ‖1−m‖1 s.t. 0≤ m≤ 1 (9)

where c denotes the current prediction c = f (x) and φ(x;m)is the perturbation operator which blends x with Gaussiannoise. The first term finds m that causes the probability of cto decrease significantly, while the second term encourages mto be sparse. Intuitively, solving Eqn (9) amounts to findingthe most informative and necessary parts of x with respect toits prediction f (x). Note that this formulation may result insignificant artifacts in m. A more refined formulation is givenin Appendix A2.

Unlike other classes of interpreters, to attack MASK, it isinfeasible to directly optimize Eqn (3) with iterative gradientdescent (Eqn (4)), because the interpreter g itself is formulatedas an optimization procedure.

Instead, we reformulate ADV2 using a bilevel optimiza-tion framework. For given x◦, ct , mt , f , and g, we re-definethe adversarial loss function as àdv(x,m) , `prd( f (x),ct)+λìnt(m,mt) by introducing m as an additional variable. Let`map(m;x) be the objective function defined in Eqn (9) (or itsvariant Eqn (16)). Note that m∗(x) = argminm `map(m;x) isthe attribution map found by MASK for a given input x. Wethen have the following attack framework:

minx

àdv (x, m∗(x))

s.t. m∗(x) = argminm

`map(m;x) (10)

Still, solving the bilevel optimization in Eqn (10) exactlyis challenging, as it requires recomputing m∗(x) by solvingthe inner optimization problem whenever x is updated. Wepropose an approximate iterative procedure which optimizesx and m by alternating between gradient descent on àdv and`map respectively.

More specifically, at the i-th iteration, given the currentinput x(i−1), we compute its attribution map m(i) by updatingm(i−1) with gradient descent on `map

(m(i−1);x(i−1)

); we then fix

m(i) and obtain x(i) by minimizing àdv after a single step ofgradient descent with respect to m(i). Formally, we define theobjective function for updating x(i) as:

àdv

(x(i−1), m(i)−ξ∇m`map

(m(i);x(i−1)

))

where ξ is the learning rate for this virtual gradient descent.The rationale behind this procedure is as follows. While it

is difficult to directly minimizing àdv (x,m∗(x)) with respectto x, we use a single-step unrolled map as a surrogate of m∗(x).A similar approach is used in [15]. Essentially, this iterativeoptimization defines a Stackelberg game [46] between theoptimizer for x (leader) and the optimizer for m (follower),which requires the leader to anticipate the follower’s nextmove to reach the equilibrium.

Algorithm 1: ADV2 against MASK.Input: x◦: benign input; ct : target class; mt : target map; f : target DNN;

g: MASK interpreterOutput: x∗: adversarial input

1 initialize x and m as x◦ and g(x◦; f );2 while not converged do

// update m3 update m by gradient descent along ∇m`map(m;x);

// update x with single-step lookahead4 update x by gradient descent along

∇xàdv(x, m−ξ∇m`map (m;x)

);

5 return x;

Algorithm 1 sketches the attack against MASK. More im-plementation details are given in § 3.6. The theoretical justifi-cation for its effectiveness is deferred to Appendix A3.

3.6 Implementation and OptimizationNext we detail the implementation of ADV2 and present

a suite of optimizations to improve the attack effectivenessagainst specific interpreters.

Iterative Optimizer – We build the optimizer based uponPGD [35], which iteratively updates the adversarial input usingEqn (4). By default, we use L∞ norm to measure the perturba-tion magnitude. It is possible to adopt alternative frameworksif other perturbation metrics are considered. For instance,instead of modifying pixels directly, one may generate adver-sarial inputs via spatial transformation [2, 60], in which theperturbation magnitude is often measured by the overall spa-tial distortion. We detail and evaluate spatial transformation-based ADV2 in § 4.

Warm Start – It is observed in our evaluation that it is of-ten inefficient to search for adversarial inputs by running theupdate steps of ADV2 (Eqn (4)) from scratch. Rather, first run-ning a fixed number (e.g., 400) of update steps of the regularadversarial attack and then resuming the ADV2 update stepssignificantly improves the search efficiency. Intuitively, thisstrategy first quickly approaches the manifold of adversarialinputs, and then searches for inputs satisfying both predictionand interpretation constraints.

Label Smoothing – Recall that we measure the predictionloss `prd( f (x),ct) with cross entropy. When attacking GRAD,ADV2 may generate intermediate inputs that cause f to makeover-confident predictions (e.g., with probability 1). The all-zero gradient of `prd prevents the attack from finding inputswith desirable interpretations. To solve this, we refine cross

5

entropy with label smoothing [55]. We sample yct from auniform distribution U(1−ρ,1) and define yc =

1−yct|C |−1 for c 6=

ct and `prd( f (x),ct) =−∑c∈C yc log fc(x). During the attack,we gradually decrease ρ from 0.05 to 0.01.

Multistep Lookahead – In implementing Algorithm 1, weapply multiple steps of gradient descent in both updatingm (line 3) and computing the surrogate map m∗(x) (line 4),which is observed to lead to faster convergence in our empiri-cal evaluation. Further, to improve the optimization stability,we use the average gradient to update m. Specifically, let{m(i)

j } be the sequence of maps obtained at the i-th iterationby applying multistep gradient descent. We use the aggregatedinterpretation loss ∑ j ‖m(i)

j −mt‖22 to compute the gradient for

updating m.Adaptive Learning Rate – To improve the convergence

of Algorithm 1, we also dynamically adjust the learning ratefor updating m and x. At each iteration, we use a runningAdam optimizer as a meta-learner [4] to estimate the optimallearning rate for updating m (line 3). We update x in a two-step fashion to stabilize the training: (i) first updating x interms of the prediction loss `prd, and (ii) updating it in termsof the interpretation loss `int. During (ii), we use a binarysearch to find the largest step size, such that x’s confidence isstill above a certain threshold κ after the perturbation.

Periodical Reset – Recall that in Algorithm 1, we updatethe estimate of the attribution map by following gradientdescent on `map. As the number of update steps increases,this estimate may deviate significantly from the true mapgenerated by the MASK interpreter, which negatively impactsthe attack effectiveness. To address this, periodically (e.g.,every 50 iterations), we replace the estimated map with themap g(x; f ) that is directly computed by MASK based on thecurrent adversarial input. At the same time, we reset the Adamstep parameter to correct its internal state.

4 Attack EvaluationNext we conduct an empirical study of ADV2 on a variety of

DNNs and interpreters from both qualitative and quantitativeperspectives. Specifically, our experiments are designed toanswer the following key questions about ADV2:

Q1: Is it effective to deceive target classifiers?Q2: Is it effective to mislead target interpreters?Q3: Is it evasive with respect to attack detection methods?Q4: Is it effective in real security-critical applications?Q5: Is it flexible to adopt alternative attack frameworks?

Experimental SettingWe first introduce the setting of our empirical evaluation.Datasets – Our evaluation primarily uses ImageNet [12],

which consists of 1.2 million images from 1,000 classes. Ev-ery image is center-cropped to 224×224 pixels. For a givenclassifier f , from the validation set of ImageNet, we randomlysample 1,000 images that are classified correctly by f to formour test set. All the pixels are normalized to [0,1].

Classifiers – We use two state-of-the-art DNNs as theclassifiers, ResNet-50 [22] and DenseNet-169 [24], whichrespectively attain 77.15% and 77.92% top-1 accuracy onImageNet. Using two DNNs of distinct capacities (50 layersversus 169 layers) and architectures (residual blocks versusdense blocks), we factor out the influence of the characteris-tics of individual DNNs.

Interpreters – We adopt GRAD [50], CAM [64], RTS [10],and MASK [16] as the representatives of back-propagation-,representation-, model-, and perturbation-guided interpretersrespectively. We adopt their open-source implementation inour evaluation. As RTS is tightly coupled with its target DNN(i.e., ResNet), we train a new masking model for DenseNet.To assess the validity of the implementation, we evaluateall the interpreters in a weakly semi-supervised localizationtask [8] using the benchmark dataset and method in [10].Table 2 summarizes the results. The performance of all theinterpreters is consistent with that reported in [10], with slightvariation due to the difference of underlying DNNs.

Interpreter GRAD CAM MASK RTS

Measure 43.1 43.8 45.2 34.2Table 2. Performance of the interpreters in this paper in a weaklysemi-supervised localization task (with ResNet as the classifier).

Attacks – We implement all the variants of ADV2 in § 3 onthe PGD framework. In addition, we also implement ADV2 ona spatial transformation framework (STADV) [2]. We compareADV2 with regular PGD [35], a universal first-order adversar-ial attack. For both ADV2 and PGD, we assume the settingof targeted attacks, in which the adversary attempts to forcethe DNNs to misclassify the adversarial inputs into randomlydesignated classes. The parameter settings of all the attacksare summarized in Appendix B.

Q1. Attack Effectiveness (Prediction)We first evaluate the effectiveness of ADV2 in terms of

deceiving target DNNs. The effectiveness is measured usingattack success rate, which is defined as

Attack Success Rate(ASR) =#successful trials

# total trials

and misclassification confidence (MC), which is the probabil-ity assigned by the DNN to the target class ct .

ResNet DenseNetGRAD CAM MASK RTS GRAD CAM MASK RTS

P 100% (1.0) 100% (1.0)100% 100% 98% 100% 100% 100% 96% 100%

A (0.99) (1.0) (0.99) (1.0) (0.98) (1.0) (0.98) (1.0)Table 3. Effectiveness of PGD (P) and ADV2 (A) against differentclassifiers and interpreters in terms of ASR (MC).

Table 3 summarizes the attack success rate and misclassi-fication confidence of ADV2 and PGD against different com-binations of classifiers and interpreters. Note that as PGD isonly applied on the classifier, its effectiveness is agnostic tothe interpreters. To make fair comparison, we fix the max-imum number of iterations as 1,000 for both attacks. It is

6

observed that ADV2 achieves high success rate (above 95%)and misclassification confidence (above 0.98) across all thecases, which is comparable with the regular PGD attack. Wethus have the following conclusion.

Observation 1

Despite its dual objectives, ADV2 is as effective asregular adversarial attacks in deceiving target DNNs.

Q2. Attack Effectiveness (Interpretation)Next we evaluate the effectiveness of ADV2 in terms of gen-

erating similar interpretations to benign inputs. Specifically,we compare the interpretations of benign and adversarial in-puts, which is crucial for understanding the security implica-tions of using interpretability as a means of defenses [13, 57].Due to the lack of standard metrics for interpretation plausi-bility, we use a variety of measures in our evaluation.

Image

AD

V2

<latexit sha1_base64="ndlTBZhGdyfqfclpd+RDOxZSbho=">AAACVHicbVBdTxNBFJ1dRKEiAj7ysrGY8NTsVo0+4seDj5jYQsIUMnt7l046H5uZu0gz2f/Bq/woEv+LD05LE7V4kklOzj33Y05ZK+kpz38m6dqj9cdPNjY7T7eebT/f2d0bets4wAFYZd1pKTwqaXBAkhSe1g6FLhWelNNP8/rJFTovrflGsxpHWlwaWUkQFKVzrgVNfBU+fB625/2LnW7eyxfIHpJiSbpsieOL3STnYwuNRkOghPdnRV7TKAhHEhS2Hd54rAVMxSWeRWqERj8Ki7Pb7FVUxlllXXyGsoX6d0cQ2vuZLqNzceZqbS7+t1bqlc1UvR8FaeqG0MD94qpRGdlsnkk2lg6B1CwSAU7G2zOYCCeAYnIdbvA7WK2FGQcOIB20gU/Rmbz3Fq/5FcTPowt8UtrrcMB9nFCTp5lCPjcftO0fd9uJGReriT4kw36veN3rf33TPfq4THuD7bOX7JAV7B07Yl/YMRswYI7dsB/sNrlLfqVr6fq9NU2WPS/YP0i3fwO2CbVe</latexit>

CAM<latexit sha1_base64="t9qJAjypfVgyUIJ4eLByrGfLUqE=">AAACUnicbVJNTxsxEPWmfKZ8tsdeVoRKPUW7QFWOUC69VAKpASQcIe9klljxx8qeBSJr/0av7Y/qpX+FE06IVAgdydLTe2884ycXlZKesuxv0nqzsLi0vLLafru2vrG5tf3u3NvaAfbAKusuC+FRSYM9kqTwsnIodKHwohidTPSLW3ReWvODxhX2tbgxspQgKFKca0FDX4aT4+/N9VYn62bTSl+DfAY6bFan19tJxgcWao2GQAnvr/Kson4QjiQobNq89lgJGIkbvIrQCI2+H6ZLN+nHyAzS0rp4DKVT9nlHENr7sS6ic7rkvDYh/6sVem4ylYf9IE1VExp4GlzWKiWbThJJB9IhkBpHIMDJuHsKQ+EEUMytzQ3egdVamEHgANJBE/gIncm6n/Ge30J8PLrAh4W9D7vcxxsq8jRWyCfm3ab5527aMeN8PtHX4Hyvm+93984OOkdfZ2mvsA9sh31iOfvCjtg3dsp6DFjFfrJf7HfyJ3loxV/yZG0ls5737EW11h4BPu20sA==</latexit>

MASK<latexit sha1_base64="mR9A3drYeDdYcEdkw7znvgB+AII=">AAACU3icbVDLThRBFK1uUXEUBVm66TiYuJp0o0aXqBsSQoLBARKqQ6rv3GYqU4+26jYyqfR3uNWPcsG3sLFmmAQdPEklJ+ee+6hTNUp6yvOrJL23cv/Bw9VHvcdP1p4+W994fuRt6wCHYJV1J5XwqKTBIUlSeNI4FLpSeFxNPs/qxxfovLTmK00bLLU4N7KWIChKJdeCxr4O+x8P97qz9X4+yOfI7pJiQfpsgYOzjSTnIwutRkOghPenRd5QGYQjCQq7Hm89NgIm4hxPIzVCoy/D/OouexWVUVZbF5+hbK7+3RGE9n6qq+icX7lcm4n/rVV6aTPVH8ogTdMSGrhZXLcqI5vNIslG0iGQmkYiwMl4ewZj4QRQDK7HDX4Hq7Uwo8ABpIMu8Ak6kw/e4SW/gPh5dIGPK3sZtriPExryNFXIZ+atrrt1d72YcbGc6F1ytD0o3gy2v7zt73xapL3KXrCX7DUr2Hu2w3bZARsyYN/YD/aT/Up+J9dpmq7cWNNk0bPJ/kG69gcUsLUV</latexit>

RTS<latexit sha1_base64="96Kp6KWXsT6/Gx6bjlx6ge0kIMI=">AAACUnicbVJNTxsxEPWmH9CUttAee1k1VOop2oVW5YjKpUeg+ZJwhLyTWWLFHyt7lhJZ+ze4wo/qpX+FE06I1DZ0JEtP773xjJ9cVEp6yrLfSevJ02fPNzZftF9uvXr9Znvn7cDb2gH2wSrrRoXwqKTBPklSOKocCl0oHBazo4U+vETnpTU9mlc41uLCyFKCoEhxrgVNfRlOez+a8+1O1s2WlT4G+Qp02KqOz3eSjE8s1BoNgRLen+VZReMgHElQ2LR57bESMBMXeBahERr9OCyXbtKPkZmkpXXxGEqX7N8dQWjv57qIzuWS69qC/K9W6LXJVB6MgzRVTWjgYXBZq5RsukgknUiHQGoegQAn4+4pTIUTQDG3Njf4E6zWwkwCB5AOmsBn6EzW/YJX/BLi49EFPi3sVdjlPt5Qkae5Qr4w7zbNH3fTjhnn64k+BoO9br7f3Tv53Dn8tkp7k71nH9gnlrOv7JB9Z8esz4BV7JrdsNvkV3LXir/kwdpKVj3v2D/V2roHir602A==</latexit>

GRAD<latexit sha1_base64="8M/5ECn0yx4s+c8rEiwRj7ahjPI=">AAACU3icbVDLThRBFK1uUXEUBVm66TiYuJp0o0aXqCSwROIACdUh1XduM5WpR1t1G5lU+jvc6ke54FvYWDNMgg6epJKTc8991KkaJT3l+VWS3lu5/+Dh6qPe4ydrT5+tbzw/8rZ1gEOwyrqTSnhU0uCQJCk8aRwKXSk8riafZ/XjC3ReWvOVpg2WWpwbWUsQFKWSa0FjX4e9w4+73dl6Px/kc2R3SbEgfbbAwdlGkvORhVajIVDC+9Mib6gMwpEEhV2Ptx4bARNxjqeRGqHRl2F+dZe9isooq62Lz1A2V//uCEJ7P9VVdM6vXK7NxP/WKr20meoPZZCmaQkN3CyuW5WRzWaRZCPpEEhNIxHgZLw9g7FwAigG1+MGv4PVWphR4ADSQRf4BJ3JB+/wkl9A/Dy6wMeVvQxb3McJDXmaKuQz81bX3bq7Xsy4WE70LjnaHhRvBttf3vZ3Pi3SXmUv2Ev2mhXsPdth++yADRmwb+wH+8l+Jb+T6zRNV26sabLo2WT/IF37A/ontQc=</latexit>

PG

D<latexit sha1_base64="37lHHgOI5+NdtgstF/aZnejcenw=">AAACUnicbVJNTxsxEPWmhUIKNLTHXlYNSJyiXaBqj6hFao9BagAJR8g7mSVW/LGyZymRtX+j1/KjuPBXeqoTIhVCR7L09N4bz/jJRaWkpyy7T1ovXq6svlpbb7/e2Nx609l+e+pt7QAHYJV154XwqKTBAUlSeF45FLpQeFZMvs70s2t0Xlrzg6YVDrW4MrKUIChSnGtBY1+G/rfj5rLTzXrZvNLnIF+ALltU/3I7yfjIQq3RECjh/UWeVTQMwpEEhU2b1x4rARNxhRcRGqHRD8N86SbdjcwoLa2Lx1A6Zx93BKG9n+oiOudLLmsz8r9aoZcmU/l5GKSpakIDD4PLWqVk01ki6Ug6BFLTCAQ4GXdPYSycAIq5tbnBn2C1FmYUOIB00AQ+QWey3ke84dcQH48u8HFhb8IO9/GGijxNFfKZeadp/rmbdsw4X070OTjd7+UHvf2Tw+7Rl0Xaa+w9+8D2WM4+sSP2nfXZgAGr2C/2m90md8mfVvwlD9ZWsuh5x55Ua+MvUfW0ug==</latexit>

Ben

ign

<latexit sha1_base64="I5npcQN4LAtdmDP5Sk7C8uyw2sI=">AAACV3icbVDLbhMxFPUM0IbwaAJLNiNSJFbRTAHBsiqbLotE2kp1FHlubhIrfozsO20ja76EbftR/RrqSSMBKUeydHTuuQ+fslLSU57fJemTp892djvPuy9evnq91+u/OfW2doAjsMq681J4VNLgiCQpPK8cCl0qPCuX39v62SU6L635SasKx1rMjZxJEBSlSW+Pa0ELp8MRGjk3zaQ3yIf5GtljUmzIgG1wMuknOZ9aqDUaAiW8vyjyisZBOJKgsOny2mMlYCnmeBGpERr9OKwvb7IPUZlmM+viM5St1b87gtDer3QZne2dfrvWiv+tlXprM82+jYM0VU1o4GHxrFYZ2ayNJZtKh0BqFYkAJ+PtGSyEE0AxvC43eAVWa2GmgQNIB03gS3QmH37Ba34J8fPoAl+U9jrscx8nVORppZC35v2m+eNuujHjYjvRx+T0YFh8Gh78+Dw4PNqk3WHv2Hv2kRXsKztkx+yEjRiwmv1iN+w2uUt+pztp58GaJpuet+wfpP17JmG2hw==</latexit>

Imag

e<latexit sha1_base64="9iHrmTNcTY9jNG+6WX4G+SLCMv4=">AAACVHicbVDLThsxFPUM5ZXybJdsRoRKXUUzQFWWqN20O5AIIOEUeW5uiBU/RvYdSmTNf3TbflSl/ksX9YRILaFXsnR0zrkPn7JS0lOe/0rSpRfLK6tr652XG5tb2zu7ry69rR1gH6yy7roUHpU02CdJCq8rh0KXCq/KycdWv7pH56U1FzStcKDFnZEjCYIi9YVrQWOnw+dIY3O70817+ayy56CYgy6b19ntbpLzoYVaoyFQwvubIq9oEIQjCQqbDq89VgImcfpNhEZo9IMwO7vJ3kRmmI2si89QNmP/7QhCez/VZXS2Z/pFrSX/q5V6YTONTgZBmqomNPC4eFSrjGzWZpINpUMgNY1AgJPx9gzGwgmgmFyHG/wKVmthhoEDSAdN4BN0Ju+9wwd+D/Hz6AIfl/YhHHAfJ1TkaaqQt+aDpvnrbjox42Ix0efg8rBXHPUOz4+7px/maa+xPbbP3rKCvWen7BM7Y30GzLFv7Dv7kfxMfqdL6fKjNU3mPa/Zk0q3/gCkw7Xc</latexit>

Figure 4: Attribution maps of benign and adversarial (PGD, ADV2)inputs with respect to GRAD, CAM, MASK, and RTS on ResNet.

Visualization – We first qualitatively compare the interpre-tations of benign and adversarial (PGD, ADV2) inputs. Fig-ure 4 show a set of sample inputs and their attribution mapswith respect to GRAD, CAM, MASK, and RTS (more samplesin Appendix C1). Observe that in all the cases, the ADV2

inputs generate interpretations perceptually indistinguishablefrom their benign counterparts. In comparison, the PGD inputsare easily identifiable by inspecting their attribution maps.

LLLppp Measure – Besides qualitatively comparing the attribu-tion maps of benign and adversarial inputs, we also measuretheir similarity quantitatively. By considering attribution mapsas matrices, we measure the L1 distance between benign andadversarial maps. Figure 5 summarizes the results (other Lpmeasures in Appendix C1). For comparison, we normalizeall the measures to [0,1] by dividing them by the number ofpixels.

We have the following observations. (i) Compared withPGD, ADV2 generates attribution maps much more similarto benign cases. The average L1 measure of ADV2 is morethan 60% lower than PGD across all the interpreters. (ii) Theeffectiveness of ADV2 varies with the target interpreter. Forinstance, compared with other interpreters, the difference be-tween PGD and ADV2 is relatively marginal on GRAD, imply-

GRAD CAM MASK RTS0

0.1

0.2

0.3

GRAD CAM MASK RTS

(ResNet) (DenseNet)

PGD

ADV2

Figure 5: Average L1 distance between benign and adversarial (PGD,ADV2) attribution maps.

ing that different interpreters may inherently feature varyingrobustness against ADV2. (iii) The effectiveness of ADV2

seems insensitive to the underlying DNN. On both ResNetand DenseNet, it achieves similar L1 measures.

IoU Test – Another quantitative measure for the similar-ity of attribution maps is the intersection-over-union (IoU)score. It is widely used in object detection [21] to comparemodel predictions with ground-truth bounding boxes. For-mally, the IoU score of a binary-valued map m with respectto a baseline map m◦ is defined as their Jaccard similarity:IoU(m) = |O(m)∩O(m◦)|/|O(m)∪O(m◦)|, where O(m) de-notes the set of non-zero dimensions in m. In our case, as thevalues of attribution maps are floating numbers, we first applythreshold binarization on the maps.

GRAD CAM MASK RTS

(ResNet)

0

0.2

0.4

0.6

0.8

1

IoU

Sco

re

GRAD CAM MASK RTS

(DenseNet)

PGDADV

2

Figure 6: IoU scores of adversarial attribution maps (PGD, ADV2)with respect to benign maps.

Following a typical rule used in the object detection task[21] where a detected region of interest (RoI) is consideredpositive if its IoU score is above 0.5 with respect to a ground-truth mask, we thus consider an attribution map as plausible ifits IoU score exceeds 0.5 with respect to the benign attributionmap. Figure 6 compares the average IoU scores of adversarialmaps (PGD, ADV2) with respect to the benign cases. Observethat ADV2 achieves IoU scores above 0.5 across all the inter-preters, which are more than 40% higher than PGD in all thecases. Especially on RTS, in which the attribution maps arenatively binary-valued, ADV2 achieves IoU scores above 0.9on both ResNet and DenseNet.

Based on both qualitative and quantitative measures, wehave the following conclusion.

Observation 2

ADV2 is able to generate adversarial inputs with inter-pretations highly similar to benign cases.

7

Q3. Attack EvasivenessIntuitively, from the adversary’s perspective, ADV2 entails

a search space for adversarial inputs no larger than its underly-ing adversarial attack (e.g., PGD), as ADV2 needs to optimizeboth the prediction loss `prd and interpretation loss ìnt, whileADV2 only needs to optimize `prd. Next we compare PGD andADV2 in terms of their evasiveness with respect to adversarialattack detection methods.

Basic ADV2 – To be succinct, we consider feature squeez-ing (FS) [61] as a concrete detection method. FS reduces theadversary’s search space by coalescing inputs correspondingto different feature vectors into a single input, and detectsadversarial inputs by comparing their predictions under origi-nal and squeezed settings. This operation is implemented inthe form of a set of “squeezers”: bit depth reduction, localsmoothing, and non-local smoothing.

Squeezer Setting PGD MASK-A RTS-A MASK-A∗ RTS-A∗

Bit Depth 2-bit 92.3% 84.1% 94.0% 11.7% 29.4%Reduction 3-bit 72.7% 89.2% 88.3% 35.9% 13.9%

L. Smoothing 3×3 97.3% 98.6% 99.0% 16.5% 3.4%N. Smoothing 11-3-4 52.3% 74.7% 75.3% 51.7% 29.4%

Table 4. Detectability of adversarial inputs by PGD, basic ADV2 (A),and adaptive ADV2 (A∗) using feature squeezing.

Table 4 lists the detection rate of adversarial inputs (PGD,ADV2) using different types of squeezers on ResNet. Observethat the squeezers seem effective to detect both ADV2 andPGD inputs. For instance, local smoothing achieves higherthan 97% success rate in detecting both ADV2 and PGD inputs,with difference less than 2%. We thus have:

Observation 3

The overall detectability of ADV2 and PGD with re-spect to feature squeezing is not significantly different.

Adaptive ADV2 – We now adapt ADV2 to evade the detec-tion of FS. Related to existing adaptive attacks against FS [23],this optimization is interesting in its own right. Specifically,for smoothing squeezers, we augment the loss function àdv(x)(Eqn (3)) with the term `sqz( f (x), f (ψ(x)), which is the crossentropy of the predictions of original and squeezed inputs (ψis the squeezer).

Algorithm 2: Adaptive ADV2 against Feature Squeezing.Input: x◦: benign input; ct : target class; f : target DNN; g:

target interpreter; ψ: bit depth reduction; i: bit depthOutput: x∗: adversarial input// augmented àdv with `sqz w.r.t. smoothing

// attack in squeezed space

1 x+← PGD on ψ(x◦) with target ct and α = 1/2i;// attack in original space

2 search for x∗ = argminx∈Bε(x◦) àdv(x)+λ‖ f (x)− f (x+)‖1;3 return x∗;

For bit depth reduction, we use a two-stage strategy. (i) Wefirst search in the squeezed space for an adversarial input x+

that is close to x◦’s ε-neighborhood. To do so, we run PGDover ψ(x◦) with learning rate α = 1/2i (i is the bit depth). (ii)We then search in x◦’s ε-neighborhood for an adversarial inputx∗ that is classified similarly as x+. To do so, we augment theloss function àdv(x) with a probability loss term ‖ f (x)−f (x+)‖1 ( f (x+) is x+’s probability vector), and then applyPGD to search for x∗ within x◦’s ε-neighborhood. The overallalgorithm is sketched in Algorithm 2.

Metric MASK RTSP A A∗ P A A∗

∆L1 0.28 0.09 0.09 0.22 0.01 0.02IoU 0.21 0.65 0.61 0.29 0.93 0.94

Table 5. L1 measures and IoU scores of adversarial attribution maps(PGD, basic and adaptive ADV2) with respect to benign maps.

Table 4 summarizes the detection rate of adversarial inputsgenerated by adaptive ADV2, which drops significantly, com-pared with the case of basic ADV2. Note that here we onlyshow the possibility of adapting ADV2 to evade a representa-tive detection method, and consider an in-depth study on thismatter as our ongoing work. Meanwhile, we compare the L1measures and IoU scores of the attribution maps generated bybasic and adaptive ADV2 (with respect to the benign maps).Table 5 shows the results. Observe that the optimization inadaptive ADV2 has little impact on its attack effectivenessagainst the interpreters. We may thus conclude:

Observation 4

It is possible to adapt ADV2 to generate adversarialinputs evasive with respect to feature squeezing.




AD

V2


PG


Ben

ign


Imag



Figure 7: Attribution maps of benign and adversarial (ADV2) inputsin the skin cancer screening application.

Q4. Real ApplicationWe now evaluate the effectiveness of ADV2 in real security-

critical applications. We use the skin cancer screening taskfrom the ISIC 2018 challenge [18] as a case study, in whichgiven skin lesion images are categorized into a seven-diseasetaxonomy. We adopt a competition-winning model1 (withResNet as its backbone) as the classifier, which attains 82.27%weighted multi-class accuracy on the holdout set (more detailsin Appendix C2).

1https://github.com/ngessert/isic2018

8

https://github.com/ngessert/isic2018

We apply ADV2 on this classifier and measure its effective-ness of generating plausible interpretations. Figure 7 showsa set of samples and their attribution maps on the four inter-preters. Observe that ADV2 generates interpretations visuallyindiscernible from their benign counterparts in all the cases.

GRAD CAM MASK RTS0

0.2

0.4

0.6

0.8

1

IoU

Sco

re

GRAD CAM MASK RTS0

0.05

0.1

0.15

0.2

0.25PGD

ADV2

PGD

ADV2

(a) (b)

Figure 8: L1 measures (a) and IoU scores (b) of adversarial attribu-tion maps (PGD, ADV2) with respect to benign maps.

This similarity is further quantitatively validated in Fig-ure 8, which shows the L1 measures (other Lp measures inAppendix C2) and IoU scores of the maps generated by ADV2

with respect to the benign maps. For instance, the IoU scoresof ADV2 exceed 0.62 across all the interpreters.

Q5. Alternative Attack FrameworkBesides the PGD framework, ADV2 can also be flexibly

built upon alternative frameworks. Here we construct ADV2

upon STADV [60], a spatial transformation-based adversarialattack. The implementation details are given in Appendix A4.




Ben

ign


Imag


StA

dv

<latexit sha1_base64="NzR0vFw1/rtnavTKvEaAIpMYkWM=">AAACVHicbVDLThRBFK1uRHBUBFy66TiYuJp0g0aWiBuXGB0goUZSffs2U5l6dKpuj0wq/R9s8aNM/BcX1gyTqIMnqeTk3HMfdcpGSU95/jNJ1x6sP9zYfNR7/OTp1rPtnd1Tb1sHOASrrDsvhUclDQ5JksLzxqHQpcKzcvJhXj+bovPSmi80a3CkxZWRtQRBUfrKtaCxr8Nnel9Nu8vtfj7IF8juk2JJ+myJk8udJOeVhVajIVDC+4sib2gUhCMJCrsebz02AibiCi8iNUKjH4XF2V32KipVVlsXn6Fsof7dEYT2fqbL6FycuVqbi/+tlXplM9WHoyBN0xIauFtctyojm80zySrpEEjNIhHgZLw9g7FwAigm1+MGv4HVWpgqcADpoAt8gs7kg7d4zacQP48u8HFpr8Me93FCQ55mCvncvNd1f9xdL2ZcrCZ6n5zuD4qDwf6nN/2j42Xam+wFe8les4K9Y0fsIzthQwbMsRt2y74nP5Jf6Vq6fmdNk2XPc/YP0q3fl1611Q==</latexit>

AD

V2

<latexit sha1_base64="NqmC7Em/oF1WRetJaXKkRYdecS4=">AAAB9XicbVDLSsNAFL2pr1pfUZduBovgqiRV0GV9LFxWsA9o0zKZTtqhk0mYmSgl9D/cuFDErf/izr9xmmahrQcGDufcyz1z/JgzpR3n2yqsrK6tbxQ3S1vbO7t79v5BU0WJJLRBIh7Jto8V5UzQhmaa03YsKQ59Tlv++Gbmtx6pVCwSD3oSUy/EQ8ECRrA2Uq8bYj1SQXp12+xVp3277FScDGiZuDkpQ4563/7qDiKShFRowrFSHdeJtZdiqRnhdFrqJorGmIzxkHYMFTikykuz1FN0YpQBCiJpntAoU39vpDhUahL6ZjJLuejNxP+8TqKDSy9lIk40FWR+KEg40hGaVYAGTFKi+cQQTCQzWREZYYmJNkWVTAnu4peXSbNacc8q1fvzcu06r6MIR3AMp+DCBdTgDurQAAISnuEV3qwn68V6tz7mowUr3zmEP7A+fwAuIZJM</latexit>

Figure 9: Attribution maps of benign and adversarial (STADV,STADV-based ADV2) inputs with respect to GRAD, CAM, MASK,and RTS on ResNet.

Figure 9 illustrates sample benign and adversarial (STADV,ADV2) inputs and their interpretations. Compared withSTADV, ADV2 generates adversarial inputs with maps muchmore similar to the benign cases, which indicates the effec-tiveness of ADV2 constructed upon the STADV framework.

This observation is also confirmed by the L1 measures andIoU scores of adversarial attribution maps, which are shown inFigure 10 (more results in Appendix C3). Interestingly, com-pared with the other interpreters, MASK seems more resilientto STADV-based ADV2. The comparison with the results ofPGD-based ADV2 (Figure 5 and 6) implies that the (relative)robustness of different interpreters may vary with the concreteattacks (details in § 5).

Overall we have the following conclusion.

GRAD CAM MASK RTS(a)

0

0.08

0.16

0.24

0.32

0.4

GRAD CAM MASK RTS(b)

0

0.2

0.4

0.6

0.8

1

IoU

Sco

re

StAdv

ADV2

StAdv

ADV2

Figure 10: L1 measures (a) and IoU scores (b) of adversarial attri-bution maps (STADV, STADV-based ADV2) with respect to benignmaps on ResNet.

Observation 5

As a general class of attacks, ADV2 can be flexiblybuilt upon alternative adversarial attack frameworks.

5 DiscussionWhile it is shown in § 4 that ADV2 is effective against a

range of classifiers and interpreters, the cause of this effec-tiveness is unclear yet. Next we conduct a study on this rootcause from both analytical and empirical perspectives. Basedon our findings, we further discuss potential countermeasuresagainst ADV2.

Q1. Root of Attack VulnerabilityRecall that the formulation of ADV2 in Eqn (3) defines two

seemingly conflicting objectives: (i) maximizing the predic-tion change while (ii) minimizing the interpretation change.We thus conjecture that the effectiveness of ADV2 may stemfrom the partial independence between a classifier and itsinterpreter – the interpreter’s explanations only partially de-scribe the classifier’s predictions, making it practical to exploitboth models simultaneously.

To validate the existence of this prediction-interpretationgap, we consider ADV2 targeting randomly generated predic-tions and interpretations. For a given input x◦, we randomlygenerate a target class ct and a target interpretation mt , andsearch for an adversarial input x∗ that triggers the classifier tomisclassify it as ct and also generates an interpretation similarto mt (i.e., f (x∗) = ct and g(x∗; f )≈mt ). Intuitively, if ADV2

is able to find such x∗, it indicates that the classifier and itsinterpreter can be manipulated separately; in other words, theyare only partially aligned with each other.

Random Patch Interpretation – In the first case, for agiven input, we define its target attribution map by (i) sam-pling a patch of random shape (either a rectangle or a circle),random angle, and random position over the input, and (ii)setting the elements inside the patch as ‘1’ and that outside itas ‘0’. Typically this target map deviates significantly fromits benign counterpart, due to its randomness.

We evaluate the effectiveness of ADV2 under this set-ting. Table 6 summarizes the attack success rate of ADV2

on ResNet. Observe that compared with Table 3, targetingrandom patch interpretations has little impact on the attack ef-

9

GRAD CAM MASK RTS

ADV2 100% 100% 99% 100%(0.98) (1.0) (0.95) (1.0)

Table 6. ASR (MC) of ADV2 targeting random patch interpretations.

fectiveness in terms of deceiving the classifiers, implying thatthe space of adversarial inputs is sufficiently large to containones with targeted interpretations.




AD

V2

Map

<latexit sha1_base64="5/mZiFD+Ihb0/jDnm5GodjBeKGQ=">AAACBXicbVC7TsMwFHXKq5RXgBEGiwqJqUoKEozlMbAgFYk+pCZUjuu0Vm0nsh2kKsrCwq+wMIAQK//Axt/gph2g5UiWjs+5V/feE8SMKu0431ZhYXFpeaW4Wlpb39jcsrd3mipKJCYNHLFItgOkCKOCNDTVjLRjSRAPGGkFw8ux33ogUtFI3OlRTHyO+oKGFCNtpK6973GkBypMz6+a2X3Vg/lf8vQGxVnXLjsVJwecJ+6UlMEU9a795fUinHAiNGZIqY7rxNpPkdQUM5KVvESRGOEh6pOOoQJxovw0vyKDh0bpwTCS5gkNc/V3R4q4UiMemMp851lvLP7ndRIdnvkpFXGiicCTQWHCoI7gOBLYo5JgzUaGICyp2RXiAZIIaxNcyYTgzp48T5rVintcqd6elGsX0ziKYA8cgCPgglNQA9egDhoAg0fwDF7Bm/VkvVjv1sektGBNe3bBH1ifP2IfmIA=</latexit>

AD

V2

Input

<latexit sha1_base64="+bXV59hkkJEfEzSI8p3tZABeoyk=">AAACB3icbVDLSsNAFJ3UV62vqEtBBovgqiRV0GV9LHRXwT6giWUynbRDJ5MwMxFKyM6Nv+LGhSJu/QV3/o2TNAttPTBw5px7ufceL2JUKsv6NkoLi0vLK+XVytr6xuaWub3TlmEsMGnhkIWi6yFJGOWkpahipBsJggKPkY43vsz8zgMRkob8Tk0i4gZoyKlPMVJa6pv7ToDUSPrJ+VU7va87MP+LILnhUazSvlm1alYOOE/sglRBgWbf/HIGIY4DwhVmSMqebUXKTZBQFDOSVpxYkgjhMRqSnqYcBUS6SX5HCg+1MoB+KPTjCubq744EBVJOAk9X5lvPepn4n9eLlX/mJjQ7iXA8HeTHDKoQZqHAARUEKzbRBGFB9a4Qj5BAWOnoKjoEe/bkedKu1+zjWv32pNq4KOIogz1wAI6ADU5BA1yDJmgBDB7BM3gFb8aT8WK8Gx/T0pJR9OyCPzA+fwAxb5mG</latexit>

Ben

ign

Map

<latexit sha1_base64="QJ/hLC8WDWxr5siNC9xPMgd950w=">AAAB/XicbVDLSsNAFJ3UV62v+Ni5GSyCq5JUQZelbtwIFewDmlAm09t26GQSZiZCDcVfceNCEbf+hzv/xkmbhbYeGDiccy/3zAlizpR2nG+rsLK6tr5R3Cxtbe/s7tn7By0VJZJCk0Y8kp2AKOBMQFMzzaETSyBhwKEdjK8zv/0AUrFI3OtJDH5IhoINGCXaSD37yAuJHskwrYNgQ+HhWxJPe3bZqTgz4GXi5qSMcjR69pfXj2gSgtCUE6W6rhNrPyVSM8phWvISBTGhYzKErqGChKD8dJZ+ik+N0seDSJonNJ6pvzdSEio1CQMzmWVVi14m/ud1Ez248lMm4kSDoPNDg4RjHeGsCtxnEqjmE0MIlcxkxXREJKHaFFYyJbiLX14mrWrFPa9U7y7KtXpeRxEdoxN0hlx0iWroBjVQE1H0iJ7RK3qznqwX6936mI8WrHznEP2B9fkDgsqVQQ==</latexit>

Ben

ign

Input

<latexit sha1_base64="G+UhG8bCKcMc8EpoVI3zVEXidmY=">AAAB/3icbVDLSsNAFJ3UV62vqODGzWARXJWkCrosdaO7CvYBTSiT6W07dDIJMxOhxC78FTcuFHHrb7jzb5y0WWjrgYHDOfcy554g5kxpx/m2Ciura+sbxc3S1vbO7p69f9BSUSIpNGnEI9kJiALOBDQ10xw6sQQSBhzawfg689sPIBWLxL2exOCHZCjYgFGijdSzj7yQ6JEM0zoINhQevhVxoqc9u+xUnBnwMnFzUkY5Gj37y+tHNAlBaMqJUl3XibWfEqkZ5TAteYmCmNAxGULXUEFCUH46yz/Fp0bp40EkzRMaz9TfGykJlZqEgZnM0qpFLxP/87qJHlz5KctOAkHnHw0SjnWEszJwn0mgmk8MIVQykxXTEZGEalNZyZTgLp68TFrVinteqd5dlGv1vI4iOkYn6Ay56BLV0A1qoCai6BE9o1f0Zj1ZL9a79TEfLVj5ziH6A+vzB0wclkc=</latexit>

Tar

get

Map

<latexit sha1_base64="wBcq9myxKvgSYxwPBMcNVLrx7Jw=">AAAB/XicbVDLSsNAFL2pr1pf8bFzM1gEVyWpgi6LbtwIFfqCJpTJdNoOnUzCzESoofgrblwo4tb/cOffOGmz0NYDA4dz7uWeOUHMmdKO820VVlbX1jeKm6Wt7Z3dPXv/oKWiRBLaJBGPZCfAinImaFMzzWknlhSHAaftYHyT+e0HKhWLRENPYuqHeCjYgBGsjdSzj7wQ65EM0waWQ6o9dIfjac8uOxVnBrRM3JyUIUe9Z395/YgkIRWacKxU13Vi7adYakY4nZa8RNEYkzEe0q6hAodU+eks/RSdGqWPBpE0T2g0U39vpDhUahIGZjLLqha9TPzP6yZ6cOWnTMSJpoLMDw0SjnSEsipQn0lKNJ8YgolkJisiIywx0aawkinBXfzyMmlVK+55pXp/Ua5d53UU4RhO4AxcuIQa3EIdmkDgEZ7hFd6sJ+vFerc+5qMFK985hD+wPn8AoeKVVQ==</latexit>


Figure 11: Visualization of ADV2 targeting random patch interpreta-tions across different interpreters on ResNet.

We then evaluate the effectiveness of ADV2 in terms ofgenerating the target interpretations. For a given benign in-put x◦ and a target random patch map mt , ADV2 attempts togenerate an adversarial input ct with the interpretation similarto mt . Figure 11 visualizes a set of sample results. Note thatin all the cases the ADV2 maps appear visually similar to thetarget maps, highlighting the attack effectiveness. This effec-tiveness is further validated in Table 7. Observe that acrossall the interpreters, an ADV2 map is much more similar to itstarget map, compared with its benign counterpart.

GRAD CAM MASK RTS

∆bL1 0.16 0.50 0.42 0.49∆tL1 0.10 0.04 0.15 0.07

Table 7. Comparison of ADV2 and target maps (∆t) and that of ADV2

and benign maps (∆b), measured by L1 distance.

Random Class Interpretation – In the second case, for agiven input (with ct as the target class), we instantiate its targetinterpretation with the attribution map of a benign input ran-domly sampled from another class ct . We particularly enforcect 6= ct ; in other words, the adversarial input is misclassifiedinto one class but interpreted as another one.

GRAD CAM MASK RTS

ADV2 100% 100% 100% 100%(0.99) (0.99) (0.99) (1.0)

Table 8. ASR (MC) of ADV2 with random class interpretations.

The ASR of ADV2 is summarized in Table 8. Observe thattargeting random class interpretations has little influence onthe attack effectiveness of deceiving the classifiers. Figure 12visualizes a set of sample target and ADV2 inputs and theirinterpretations (DenseNet results in Appendix C4). Note thatthe target and ADV2 inputs are fairly distinct, but with highly

similar interpretations. This is quantitatively validated by theirL1 measures and IoU scores listed in Figure 13.




AD

V2

Map

<latexit sha1_base64="5/mZiFD+Ihb0/jDnm5GodjBeKGQ=">AAACBXicbVC7TsMwFHXKq5RXgBEGiwqJqUoKEozlMbAgFYk+pCZUjuu0Vm0nsh2kKsrCwq+wMIAQK//Axt/gph2g5UiWjs+5V/feE8SMKu0431ZhYXFpeaW4Wlpb39jcsrd3mipKJCYNHLFItgOkCKOCNDTVjLRjSRAPGGkFw8ux33ogUtFI3OlRTHyO+oKGFCNtpK6973GkBypMz6+a2X3Vg/lf8vQGxVnXLjsVJwecJ+6UlMEU9a795fUinHAiNGZIqY7rxNpPkdQUM5KVvESRGOEh6pOOoQJxovw0vyKDh0bpwTCS5gkNc/V3R4q4UiMemMp851lvLP7ndRIdnvkpFXGiicCTQWHCoI7gOBLYo5JgzUaGICyp2RXiAZIIaxNcyYTgzp48T5rVintcqd6elGsX0ziKYA8cgCPgglNQA9egDhoAg0fwDF7Bm/VkvVjv1sektGBNe3bBH1ifP2IfmIA=</latexit>

AD

V2

Img

<latexit sha1_base64="9diO+gSujBUebJtrADp4GjiizMA=">AAACBXicbVC7TsMwFHXKq5RXgBEGiwqJqUoKEozlMcBWJPqQmlA5rtNatZ3IdpCqKAsLv8LCAEKs/AMbf4ObdoCWI1k6Pude3XtPEDOqtON8W4WFxaXlleJqaW19Y3PL3t5pqiiRmDRwxCLZDpAijArS0FQz0o4lQTxgpBUML8d+64FIRSNxp0cx8TnqCxpSjLSRuva+x5EeqDA9v2pm91UP5n/J0xvez7p22ak4OeA8caekDKaod+0vrxfhhBOhMUNKdVwn1n6KpKaYkazkJYrECA9Rn3QMFYgT5af5FRk8NEoPhpE0T2iYq787UsSVGvHAVOY7z3pj8T+vk+jwzE+piBNNBJ4MChMGdQTHkcAelQRrNjIEYUnNrhAPkERYm+BKJgR39uR50qxW3ONK9fakXLuYxlEEe+AAHAEXnIIauAZ10AAYPIJn8ArerCfrxXq3PialBWvaswv+wPr8AWCemH8=</latexit>

Targ

etM

ap<latexit sha1_base64="wBcq9myxKvgSYxwPBMcNVLrx7Jw=">AAAB/XicbVDLSsNAFL2pr1pf8bFzM1gEVyWpgi6LbtwIFfqCJpTJdNoOnUzCzESoofgrblwo4tb/cOffOGmz0NYDA4dz7uWeOUHMmdKO820VVlbX1jeKm6Wt7Z3dPXv/oKWiRBLaJBGPZCfAinImaFMzzWknlhSHAaftYHyT+e0HKhWLRENPYuqHeCjYgBGsjdSzj7wQ65EM0waWQ6o9dIfjac8uOxVnBrRM3JyUIUe9Z395/YgkIRWacKxU13Vi7adYakY4nZa8RNEYkzEe0q6hAodU+eks/RSdGqWPBpE0T2g0U39vpDhUahIGZjLLqha9TPzP6yZ6cOWnTMSJpoLMDw0SjnSEsipQn0lKNJ8YgolkJisiIywx0aawkinBXfzyMmlVK+55pXp/Ua5d53UU4RhO4AxcuIQa3EIdmkDgEZ7hFd6sJ+vFerc+5qMFK985hD+wPn8AoeKVVQ==</latexit>

Targ

etIm

g<latexit sha1_base64="+IXxUYjQg9dE1r8J2tWR7/6XgJ0=">AAAB/XicbVDLSgMxFM3UV62v8bFzEyyCqzJTBV0W3eiuQl/QGUomzUxDk8yQZIQ6FH/FjQtF3Pof7vwbM+0stPVA4HDOvdyTEySMKu0431ZpZXVtfaO8Wdna3tnds/cPOipOJSZtHLNY9gKkCKOCtDXVjPQSSRAPGOkG45vc7z4QqWgsWnqSEJ+jSNCQYqSNNLCPPI70SPKshWREtAfveDQd2FWn5swAl4lbkCoo0BzYX94wxiknQmOGlOq7TqL9DElNMSPTipcqkiA8RhHpGyoQJ8rPZumn8NQoQxjG0jyh4Uz9vZEhrtSEB2Yyz6oWvVz8z+unOrzyMyqSVBOB54fClEEdw7wKOKSSYM0mhiAsqckK8QhJhLUprGJKcBe/vEw69Zp7XqvfX1Qb10UdZXAMTsAZcMElaIBb0ARtgMEjeAav4M16sl6sd+tjPlqyip1D8AfW5w+gYZVU</latexit>

GRAD<latexit sha1_base64="jijuc9C6B6tD5QNSewaHuI8qAjE=">AAAB9HicbVDLSgMxFL1TX7W+qi7dBIvgqszUgi7rA3RZxT6gHUomzbShmcyYZApl6He4caGIWz/GnX9jOp2Fth4IHM65l3tyvIgzpW3728qtrK6tb+Q3C1vbO7t7xf2DpgpjSWiDhDyUbQ8rypmgDc00p+1IUhx4nLa80fXMb42pVCwUj3oSUTfAA8F8RrA2ktsNsB4qP7l9uLyZ9oolu2ynQMvEyUgJMtR7xa9uPyRxQIUmHCvVcexIuwmWmhFOp4VurGiEyQgPaMdQgQOq3CQNPUUnRukjP5TmCY1S9fdGggOlJoFnJtOQi95M/M/rxNq/cBMmolhTQeaH/JgjHaJZA6jPJCWaTwzBRDKTFZEhlpho01PBlOAsfnmZNCtl56xcua+WaldZHXk4gmM4BQfOoQZ3UIcGEHiCZ3iFN2tsvVjv1sd8NGdlO4fwB9bnD4+bkfU=</latexit>

Figure 12: Target and adversarial (ADV2) inputs and their attributionmaps on ResNet.


0

0.08

0.16

0.24

0.32

0.4


0

0.2

0.4

0.6

0.8

1

IoU

Sco

re

w.r.t. Benignw.r.t. Target


Figure 13: L1 measures (a) and IoU scores (b) of adversarial mapswith respect to benign and target cases on ResNet.

The experiments above show that it is practical to generateadversarial inputs targeting arbitrary predictions and interpre-tations. We can therefore conclude:

Observation 6

A DNN and its interpreter are often not fully aligned,allowing the adversary to exploit both models simulta-neously.

Q2. Root of Prediction-Interpretation GapNext we explore the fundamental causes of this prediction-

interpretation gap. We speculate one following possible expla-nation as: existing interpretation models do not comprehen-sively capture the dynamics of DNNs, each only describingone aspect of their behavior.

Specifically, GRAD solely relies on the gradient informa-tion; MASK focuses on the input-prediction correspondencewhile ignoring the internal representations; CAM leveragesthe deep representations at intermediate layers, but neglect-ing the input-prediction correspondence; RTS uses the in-ternal representations in an auxiliary encoder and the input-interpretation correspondence in the training data, which how-ever may deviate from the true behavior of DNNs.

Intuitively the exclusive focus on one aspect (e.g., input-prediction correspondence) of the DNN behavior results inloose constraints: when performing the attack, the adversaryonly needs to ensure that benign and adversarial inputs causeDNNs to behave similarly from one specific perspective. Wevalidate this speculation from two observations, low attack

10

transferability and disparate attack robustness.

Attack Transferability – One intriguing property of adver-sarial inputs is their transferability: an adversarial input effec-tive against one DNN is often found effective against anotherDNN, though it is not crafted on the second one [33, 36, 40].In this set of experiments, we investigate whether such trans-ferability exists in attacks against interpreters; that is, whetheran adversarial input that generates a plausible interpretationagainst one interpreter is also able to generate a probableinterpretation against another interpreter.




CA

M<latexit sha1_base64="t9qJAjypfVgyUIJ4eLByrGfLUqE=">AAACUnicbVJNTxsxEPWmfKZ8tsdeVoRKPUW7QFWOUC69VAKpASQcIe9klljxx8qeBSJr/0av7Y/qpX+FE06IVAgdydLTe2884ycXlZKesuxv0nqzsLi0vLLafru2vrG5tf3u3NvaAfbAKusuC+FRSYM9kqTwsnIodKHwohidTPSLW3ReWvODxhX2tbgxspQgKFKca0FDX4aT4+/N9VYn62bTSl+DfAY6bFan19tJxgcWao2GQAnvr/Kson4QjiQobNq89lgJGIkbvIrQCI2+H6ZLN+nHyAzS0rp4DKVT9nlHENr7sS6ic7rkvDYh/6sVem4ylYf9IE1VExp4GlzWKiWbThJJB9IhkBpHIMDJuHsKQ+EEUMytzQ3egdVamEHgANJBE/gIncm6n/Ge30J8PLrAh4W9D7vcxxsq8jRWyCfm3ab5527aMeN8PtHX4Hyvm+93984OOkdfZ2mvsA9sh31iOfvCjtg3dsp6DFjFfrJf7HfyJ3loxV/yZG0ls5737EW11h4BPu20sA==</latexit>

MA

SK

<latexit sha1_base64="mR9A3drYeDdYcEdkw7znvgB+AII=">AAACU3icbVDLThRBFK1uUXEUBVm66TiYuJp0o0aXqBsSQoLBARKqQ6rv3GYqU4+26jYyqfR3uNWPcsG3sLFmmAQdPEklJ+ee+6hTNUp6yvOrJL23cv/Bw9VHvcdP1p4+W994fuRt6wCHYJV1J5XwqKTBIUlSeNI4FLpSeFxNPs/qxxfovLTmK00bLLU4N7KWIChKJdeCxr4O+x8P97qz9X4+yOfI7pJiQfpsgYOzjSTnIwutRkOghPenRd5QGYQjCQq7Hm89NgIm4hxPIzVCoy/D/OouexWVUVZbF5+hbK7+3RGE9n6qq+icX7lcm4n/rVV6aTPVH8ogTdMSGrhZXLcqI5vNIslG0iGQmkYiwMl4ewZj4QRQDK7HDX4Hq7Uwo8ABpIMu8Ak6kw/e4SW/gPh5dIGPK3sZtriPExryNFXIZ+atrrt1d72YcbGc6F1ytD0o3gy2v7zt73xapL3KXrCX7DUr2Hu2w3bZARsyYN/YD/aT/Up+J9dpmq7cWNNk0bPJ/kG69gcUsLUV</latexit>

RT

S<latexit sha1_base64="96Kp6KWXsT6/Gx6bjlx6ge0kIMI=">AAACUnicbVJNTxsxEPWmH9CUttAee1k1VOop2oVW5YjKpUeg+ZJwhLyTWWLFHyt7lhJZ+ze4wo/qpX+FE06I1DZ0JEtP773xjJ9cVEp6yrLfSevJ02fPNzZftF9uvXr9Znvn7cDb2gH2wSrrRoXwqKTBPklSOKocCl0oHBazo4U+vETnpTU9mlc41uLCyFKCoEhxrgVNfRlOez+a8+1O1s2WlT4G+Qp02KqOz3eSjE8s1BoNgRLen+VZReMgHElQ2LR57bESMBMXeBahERr9OCyXbtKPkZmkpXXxGEqX7N8dQWjv57qIzuWS69qC/K9W6LXJVB6MgzRVTWjgYXBZq5RsukgknUiHQGoegQAn4+4pTIUTQDG3Njf4E6zWwkwCB5AOmsBn6EzW/YJX/BLi49EFPi3sVdjlPt5Qkae5Qr4w7zbNH3fTjhnn64k+BoO9br7f3Tv53Dn8tkp7k71nH9gnlrOv7JB9Z8esz4BV7JrdsNvkV3LXir/kwdpKVj3v2D/V2roHir602A==</latexit>

Sou

rce

<latexit sha1_base64="x8Qa/2d5vI4lgGkahLSIueuQjRQ=">AAAB+HicbVC7TsMwFL0pr1IeDTCyWFRITFVSkGCsYGEsgj6kNqoc12mt2klkO0gl6pewMIAQK5/Cxt/gpBmg5UiWjs651z4+fsyZ0o7zbZXW1jc2t8rblZ3dvf2qfXDYUVEiCW2TiEey52NFOQtpWzPNaS+WFAuf064/vcn87iOVikXhg57F1BN4HLKAEayNNLSrA4H1RIr0Pr9xPrRrTt3JgVaJW5AaFGgN7a/BKCKJoKEmHCvVd51YeymWmhFO55VBomiMyRSPad/QEAuqvDQPPkenRhmhIJLmhBrl6u+NFAulZsI3k1lMtexl4n9eP9HBlZeyME40DcnioSDhSEcoawGNmKRE85khmEhmsiIywRITbbqqmBLc5S+vkk6j7p7XG3cXteZ1UUcZjuEEzsCFS2jCLbSgDQQSeIZXeLOerBfr3fpYjJasYucI/sD6/AFo2JOT</latexit>

Target<latexit sha1_base64="9nrUCVEvVYd5wBP53f43EMr1XZE=">AAAB+HicbVDLSgMxFM3UV62Pjrp0EyyCqzJTBV0W3bis0Be0Q8mkaRuaZIbkjlCHfokbF4q49VPc+Tdm2llo64HA4Zx7uScnjAU34HnfTmFjc2t7p7hb2ts/OCy7R8dtEyWashaNRKS7ITFMcMVawEGwbqwZkaFgnXB6l/mdR6YNj1QTZjELJBkrPuKUgJUGbrkvCUy0TJtEjxnMB27Fq3oL4HXi56SCcjQG7ld/GNFEMgVUEGN6vhdDkBINnAo2L/UTw2JCp2TMepYqIpkJ0kXwOT63yhCPIm2fArxQf2+kRBozk6GdzGKaVS8T//N6CYxugpSrOAGm6PLQKBEYIpy1gIdcMwpiZgmhmtusmE6IJhRsVyVbgr/65XXSrlX9y2rt4apSv83rKKJTdIYukI+uUR3dowZqIYoS9Ixe0Zvz5Lw4787HcrTg5Dsn6A+czx9ZVpOJ</latexit>


GRA

D<latexit sha1_base64="8M/5ECn0yx4s+c8rEiwRj7ahjPI=">AAACU3icbVDLThRBFK1uUXEUBVm66TiYuJp0o0aXqCSwROIACdUh1XduM5WpR1t1G5lU+jvc6ke54FvYWDNMgg6epJKTc8991KkaJT3l+VWS3lu5/+Dh6qPe4ydrT5+tbzw/8rZ1gEOwyrqTSnhU0uCQJCk8aRwKXSk8riafZ/XjC3ReWvOVpg2WWpwbWUsQFKWSa0FjX4e9w4+73dl6Px/kc2R3SbEgfbbAwdlGkvORhVajIVDC+9Mib6gMwpEEhV2Ptx4bARNxjqeRGqHRl2F+dZe9isooq62Lz1A2V//uCEJ7P9VVdM6vXK7NxP/WKr20meoPZZCmaQkN3CyuW5WRzWaRZCPpEEhNIxHgZLw9g7FwAigG1+MGv4PVWphR4ADSQRf4BJ3JB+/wkl9A/Dy6wMeVvQxb3McJDXmaKuQz81bX3bq7Xsy4WE70LjnaHhRvBttf3vZ3Pi3SXmUv2Ev2mhXsPdth++yADRmwb+wH+8l+Jb+T6zRNV26sabLo2WT/IF37A/ontQc=</latexit>

Figure 14: Visualization of attribution maps of adversarial inputsacross different interpreters on ResNet.

Specifically, for each given interpreter g, we randomly se-lect a set of adversarial inputs crafted against g (source) andcompute their interpretations on another interpreter g′ (target).Figure 14 illustrates the attribution maps of a given adver-sarial input on g and g′. Further, for each case, we comparethe adversarial map (right) against the corresponding benignmap (left). Observe that the interpretation transferability isfairly low: an adversarial input crafted against one interpreterg rarely generates highly plausible interpretation on anotherinterpreter g′.

GRAD CAM MASK RTS

GRAD 0.04 0.24 0.22 0.24CAM 0.09 0.05 0.18 0.13

MASK 0.12 0.34 0.09 0.74RTS 0.10 0.17 0.20 0.01PGD 0.10 0.22 0.28 0.22

Table 9. L1 distance between attribution maps of adversarial (ADV2,PGD) on ResNet (row/column as source/target).

We further quantitatively validate this observation. Table 9measures the L1 distance between the adversarial and benignattribution maps across different interpreters. For comparison,it also shows the L1 measure for the adversarial inputs gener-ated by PGD. Observe that the adversarial inputs crafted ong tends to generate low-quality interpretations on a differentinterpreter g′, with quality comparable to that generated byan interpretation-agnostic attack (i.e., PGD). We can thereforeconclude:

Observation 7

The transferability of adversarial inputs across differ-ent interpreters seems low.

Attack Robustness – It is observed in § 4 that the effec-tiveness of ADV2 varies with the target interpreter. As shownin Figure 6, among all the interpreters, ADV2 attains the low-est IoU scores on GRAD, suggesting that GRAD may be morerobust against ADV2. This observation may be explainedas follows: GRAD uses the gradient magnitude of each in-put feature to measure its relevance to the model prediction;meanwhile, ADV2 heavily uses the gradient information tooptimize the prediction loss `prd; it is inherently difficult tominimize `prd while keeping the gradient intact.

We validate the conjecture by analyzing the robustnessof integrated gradient (IG) [53], another back-propagation-guided interpreter, against ADV2. Due to their fundamentalequivalence [3], the discussion here also generalizes to otherback-propagation interpreters (e.g., [48, 50, 51]).

At a high level, for the i-th feature of a given input x, IG

computes its attribution m[i] by aggregating the gradient off (x) along the path from a baseline input x to x:

m[i] = (x[i]− x[i])∫ 1

0

∂ f (tx+(1− t)x)∂x[i]

dt (11)

Like other back-propagation interpretation models [3], IG

satisfies the desirable completeness axiom [48] that the attri-butions sum up to the difference between f ’s predictions forthe given input x and the baseline x.

To simplify the exposition, let us assume a binary classifica-tion setting with classes C = {+,−}. The DNN f predicts theprobability of x belonging to the positive class as f (x). Givenan input x◦ from the negative class, the adversary attempts tocraft an adversarial input x∗ to force f to misclassify x∗ as pos-itive. We define the prediction loss as `prd(x∗) = f (x∗)− f (x◦)(i.e., the increase in the probability of positive prediction),which can be computed as:

`prd(x∗) =∫ 1

0∇ f (tx∗+(1− t)x◦)>(x∗− x◦)dt (12)

Meanwhile, we define the interpretation loss as ìnt(x∗) =‖m◦−m∗‖1, where m◦ and m∗ are the attribution maps of x◦and x∗ respectively. While it is difficult to directly quantifyìnt(x∗), we may use the attribution map of x∗ with x◦ as asurrogate baseline:

∆m[i] = (x∗[i]− x◦[i])∫ 1

0

∂ f (tx∗+(1− t)x◦)∂x∗[i]

dt (13)

which quantifies the impact of the i-th input feature on thedifference of f (x◦) and f (x∗). Thus, ìnt(x∗) = ‖∆m‖1.

Proposition 1. With IG, the prediction loss is upper boundedby the interpretation loss as: `prd(x∗)≤ ìnt(x∗).

Proof. We define u as the input difference u = (x∗− x◦) andv as the integral vector with its i-th element v[i] defined as

v[i] =∫ 1

0

∂ f (tx∗+(1− t)x◦)∂x∗[i]

dt

11

According to the definitions, we have `prd(x∗) = u>v andìnt(x∗) = ‖u� v‖1, where � is the Hadamard product.

We have the following derivation: `prd(x∗) = ∑i u[i]v[i]≤∑i ‖u[i] · v[i]‖ = ìnt(x∗). Thus the prediction loss is upper-bounded by the interpretation loss.

In other words, in order to force x∗ to be misclassifiedwith high confidence, the difference of benign and adversarialattribution maps needs to be large. As the objectives of ADV2

here is to maximize the prediction loss while minimizingthe interpretation loss. The coupling between prediction andinterpretation losses results in a fundamental conflict.

Note that however this conflict does not preclude effec-tive adversarial attacks. First, the constraint of prediction andinterpretation losses may be loose. Let γprd and γint be thethresholds of effective attacks. That is, for an effective attack,`prd(x∗) ≥ γprd and ìnt(x∗) ≤ γint. There could be cases thatγprd� γint, making ADV2 still highly effective (e.g., Figure 8).Second, the adversary may pursue attacks that rely less on thegradient information to circumvent this conflict.

Overall, with the evidence of low attack transferability anddisparate attack robustness, we can conclude:

Observation 8

Existing interpreters tend to focus on distinct aspectsof DNN behavior, which may result in the prediction-interpretation gap.

Q3. Potential CountermeasuresBased on our findings, next we discuss potential counter-

measures against ADV2 attacks.Defense 1: Ensemble Interpretation – Motivated by the

observation that different interpreters focus on distinct aspectsof DNN behavior (e.g., CAM focuses on deep representationswhile MASK focuses on input-prediction correspondence),a promising direction to defend against ADV2 is to deploymultiple, complementary interpreters to provide a holisticview of DNN behavior.

Yet, two major challenges remain to be addressed. First, dif-ferent interpreters may provide disparate interpretations (e.g.,Figure 14). It is challenging to optimally aggregate such inter-pretations to detect ADV2. Second, the adversary may adaptADV2 to the ensemble interpreter (e.g. optimizing the inter-pretation loss with respect to all the interpreters). It is crucialto account for such adaptiveness in designing the ensemble in-terpreter. We consider developing the ensemble defenses andexploring the adversary’s adaptive strategies as our ongoingresearch directions.

Defense 2: Adversarial Interpretation – Along the sec-ond direction, we explore the idea of adversarial training.Recall that ADV2 exploit the prediction-interpretation gap togenerate adversarial inputs. Here we employ ADV2 as a driveto minimize this gap during training interpreters.

Specifically, we propose an adversarial interpretation dis-tillation (AID) framework. Let A be the ADV2 attack. Duringtraining an interpreter g, for a given input x◦, besides theregular loss `map(x◦), we consider an additional loss termàid(x◦) = −‖g(x◦)− g(A(x◦))‖1, which is the negative L1measure between the attribution maps of x◦ and its adversar-ial counterpart A(x◦). We encourage g to minimize this lossduring the training (details in Appendix A5).

To assess the effectiveness of AID to reduce the prediction-interpretation gap, we use RTS as a concrete case study. Recallthat RTS is a model-guided interpreter which directly predictsthe interpretation of a given input. We construct two variantsof RTS, a regular one and another with AID training (denotedby RTSA). We measure the sensitivity of the two interpretersto the underlying DNN behavior.

3% 30%Normal Uniform

3% 30%

Noise N U0.01 0.010% 0.01 0.010.01 0.013% 0.02 0.030.09 0.1030% 0.17 0.17

Figure 15: Attribution maps generated by regular R

L1 measures

RT

S

RTSA

Input+

Noi

se

RTS

RT

SA

Figure 15: Attribution maps generated by RTS and RTSA underdifferent noise levels and types (normal N, unifrom U) on ResNet.

In the first case, we inject random noise (either normal oruniform) to the inputs and compare the attribution maps gen-erated by the two interpreters. We consider two noise levels,which respectively cause 3% and 30% misclassification onthe test set. Figure 15 shows a set of misclassified samplesunder the two noise levels. Observe that compared with RTS,RTSA appears much more sensitive to the DNN’s behaviorchange, by generating highly contrastive maps. This sensi-tivity is also quantitatively confirmed by the L1 measuresbetween clean and noisy maps on RTS and RTSA. The find-ings also corroborate a similar phenomenon observed in [59]:the representations generated by robust models tend to alignbetter with salient data characteristics.

Input<latexit sha1_base64="sswdsU9b9cSNor+qUqUMAkJOums=">AAACVHicbVBNTxsxEPUuUGjaUijHXlaESj1Fu7QVHBG9tDeQGkDCKfJOJsSKP1b2LBBZ+z+4tj+qUv8LB7whUiH0SZae3rzxzLyyUtJTnv9N0qXllReray87r16/WX+7sfnuxNvaAfbBKuvOSuFRSYN9kqTwrHIodKnwtJx8beunV+i8tOYHTSscaHFp5EiCoCj95FrQ2Onw3VQ1NRcb3byXz5A9J8WcdNkcRxebSc6HFmqNhkAJ78+LvKJBEI4kKGw6vPZYCZiISzyP1AiNfhBmazfZh6gMs5F18RnKZurjjiC091NdRme7pl+steJ/a6VemEyj/UGQ7Y1o4GHwqFYZ2azNJBtKh0BqGokAJ+PuGYyFE0AxuQ43eA1Wa2GGgQNIB03gE3Qm733BG34F8Xh0gY9LexN2uI8/VORpqpC35p2m+eduOjHjYjHR5+Rkt1d86u0ef+4eHM7TXmPv2Tb7yAq2xw7YN3bE+gyYY7fsF/ud/Enu0qV05cGaJvOeLfYE6fo9+ga2CQ==</latexit>

Benign<latexit sha1_base64="I5npcQN4LAtdmDP5Sk7C8uyw2sI=">AAACV3icbVDLbhMxFPUM0IbwaAJLNiNSJFbRTAHBsiqbLotE2kp1FHlubhIrfozsO20ja76EbftR/RrqSSMBKUeydHTuuQ+fslLSU57fJemTp892djvPuy9evnq91+u/OfW2doAjsMq681J4VNLgiCQpPK8cCl0qPCuX39v62SU6L635SasKx1rMjZxJEBSlSW+Pa0ELp8MRGjk3zaQ3yIf5GtljUmzIgG1wMuknOZ9aqDUaAiW8vyjyisZBOJKgsOny2mMlYCnmeBGpERr9OKwvb7IPUZlmM+viM5St1b87gtDer3QZne2dfrvWiv+tlXprM82+jYM0VU1o4GHxrFYZ2ayNJZtKh0BqFYkAJ+PtGSyEE0AxvC43eAVWa2GmgQNIB03gS3QmH37Ba34J8fPoAl+U9jrscx8nVORppZC35v2m+eNuujHjYjvRx+T0YFh8Gh78+Dw4PNqk3WHv2Hv2kRXsKztkx+yEjRiwmv1iN+w2uUt+pztp58GaJpuet+wfpP17JmG2hw==</latexit>

ADV2<latexit sha1_base64="ZWHQm72Yv3xv1mA9WRicXNWuI64=">AAACVHicbVBdTxNBFJ1dRKEiAj7ysrGY8NTsVo0+4seDj5jYQsIUMnt7Syedj83MXaSZ7P/gVX4Uif/FB2dLE7V4kklOzj33Y05ZKekpz38m6dqj9cdPNjY7T7eebT/f2d0bels7wAFYZd1pKTwqaXBAkhSeVg6FLhWelLNPbf3kCp2X1nyjeYUjLS6NnEgQFKVzrgVNnQ4fPg/P+83FTjfv5QtkD0mxJF22xPHFbpLzsYVaoyFQwvuzIq9oFIQjCQqbDq89VgJm4hLPIjVCox+FxdlN9ioq42xiXXyGsoX6d0cQ2vu5LqOzPdOv1lrxv7VSr2ymyftRkKaqCQ3cL57UKiObtZlkY+kQSM0jEeBkvD2DqXACKCbX4Qa/g9VamHHgANJBE/gMncl7b/GaX0H8PLrAp6W9DgfcxwkVeZor5K35oGn+uJtOzLhYTfQhGfZ7xete/+ub7tHHZdobbJ+9ZIesYO/YEfvCjtmAAXPshv1gt8ld8itdS9fvrWmy7HnB/kG6/RvBGrVk</latexit>

RTS<latexit sha1_base64="lFxypesjedtwNDbnseQwEPV2zGw=">AAACUnicbVJNbxMxEPWm9IP0uxy5rEgr9RTttkX0WMGFY4CmrVRHlXcyaaz4Y2XPlkbW/g2u8KO48Fc44U0iAWlHsvT03hvP+MlFqaSnLPuVtFZerK6tb7xsb25t7+zu7R9ceVs5wD5YZd1NITwqabBPkhTelA6FLhReF5MPjX79gM5Lay5pWuJAi3sjRxIERYpzLWjsdPh8+aW+2+tk3WxW6VOQL0CHLap3t59kfGih0mgIlPD+Ns9KGgThSILCus0rj6WAibjH2wiN0OgHYbZ0nR5FZpiOrIvHUDpj/+0IQns/1UV0Nkv6Za0hn9UKvTSZRueDIE1ZERqYDx5VKiWbNomkQ+kQSE0jEOBk3D2FsXACKObW5ga/gtVamGHgANJBHfgEncm6b/GRP0B8PLrAx4V9DIfcxxtK8jRVyBvzYV3/ddftmHG+nOhTcHXSzU+7J5/OOhfvF2lvsNfsDTtmOXvHLthH1mN9Bqxk39h39iP5mfxuxV8yt7aSRc8r9l+1tv4Ali203g==</latexit>

RTSA<latexit sha1_base64="s3ZcRV6L160OV+7FOgU8N3JLhfg=">AAACVHicbVDLThRBFK1uQHFUBF266TCYuJp0o0aWqBuWCAyQUAOpvnOHqUw9OlW3gUml/8OtfJSJ/+LC6mESZeAklZyce+6jTlkp6SnPfyfp0vLKk6erzzrPX7xce7W+8frY29oB9sEq605L4VFJg32SpPC0cih0qfCknHxr6ydX6Ly05oimFQ60uDRyJEFQlM65FjR2OhwcHZ5/aS7Wu3kvnyF7SIo56bI59i82kpwPLdQaDYES3p8VeUWDIBxJUNh0eO2xEjARl3gWqREa/SDMzm6yd1EZZiPr4jOUzdT/O4LQ3k91GZ3tmX6x1oqP1kq9sJlGO4MgTVUTGrhbPKpVRjZrM8mG0iGQmkYiwMl4ewZj4QRQTK7DDV6D1VqYYeAA0kET+ASdyXuf8IZfQfw8usDHpb0JW9zHCRV5mirkrXmraf65m07MuFhM9CE53u4VH3rb3z92d7/O015lb9kme88K9pntsj22z/oMmGM/2E92m/xK/qRL6cqdNU3mPW/YPaRrfwEWq7WR</latexit>



L1 measures<latexit sha1_base64="qcjCSPy16smkXN/8tc8k+NkIMEI=">AAACa3icbVDLbhMxFHWmPEp4tKU7YGGRVmKBopnSCpYV3bBgUSTSVqqj6M7NTWPFj5HtKY2s+YB+DVv4lH4E/4AnjQSkXMvS0bnnvk5ZKelDnt90srV79x88XH/Uffzk6bONza3nJ97WDmmAVll3VoInJQ0NggyKzipHoEtFp+XsqM2fXpLz0pqvYV7RUMOFkROJEBI12uwJDWGKoOLnZlRw8Ta9lnE6agJfO/JNUuX9fBH8LiiWoMeWcTza6uRibLHWZAIq8P68yKswjOCCREVNV9SeKsAZXNB5ggY0+WFcXNPw3cSM+cS69E3gC/bvigja+7kuk7Ld1K/mWvK/uVKvTA6TD8MoTVUHMng7eFIrHixvreJj6QiDmicA6GTaneMUHGBIhnaFoW9otQYzjgJROmyimJEzef+ArsQlpuPJRTEt7VXcET51qIIPc0WiFe80zR91000eF6uO3gUne/3iXX/vy37v8OPS7XX2kr1mb1jB3rND9okdswFDds2+sx/sZ+dXtp29yF7dSrPOsmab/RPZ7m9pEr2c</latexit>

the DNN’s behavior change, by generating highly contrastivemaps. This sensitivity is also quantitatively confirmed by theL1 distance between the clean and noisy attribution maps.

Input<latexit sha1_base64="sswdsU9b9cSNor+qUqUMAkJOums=">AAACVHicbVBNTxsxEPUuUGjaUijHXlaESj1Fu7QVHBG9tDeQGkDCKfJOJsSKP1b2LBBZ+z+4tj+qUv8LB7whUiH0SZae3rzxzLyyUtJTnv9N0qXllReray87r16/WX+7sfnuxNvaAfbBKuvOSuFRSYN9kqTwrHIodKnwtJx8beunV+i8tOYHTSscaHFp5EiCoCj95FrQ2Onw3VQ1NRcb3byXz5A9J8WcdNkcRxebSc6HFmqNhkAJ78+LvKJBEI4kKGw6vPZYCZiISzyP1AiNfhBmazfZh6gMs5F18RnKZurjjiC091NdRme7pl+steJ/a6VemEyj/UGQ7Y1o4GHwqFYZ2azNJBtKh0BqGokAJ+PuGYyFE0AxuQ43eA1Wa2GGgQNIB03gE3Qm733BG34F8Xh0gY9LexN2uI8/VORpqpC35p2m+eduOjHjYjHR5+Rkt1d86u0ef+4eHM7TXmPv2Tb7yAq2xw7YN3bE+gyYY7fsF/ud/Enu0qV05cGaJvOeLfYE6fo9+ga2CQ==</latexit>

Benign<latexit sha1_base64="I5npcQN4LAtdmDP5Sk7C8uyw2sI=">AAACV3icbVDLbhMxFPUM0IbwaAJLNiNSJFbRTAHBsiqbLotE2kp1FHlubhIrfozsO20ja76EbftR/RrqSSMBKUeydHTuuQ+fslLSU57fJemTp892djvPuy9evnq91+u/OfW2doAjsMq681J4VNLgiCQpPK8cCl0qPCuX39v62SU6L635SasKx1rMjZxJEBSlSW+Pa0ELp8MRGjk3zaQ3yIf5GtljUmzIgG1wMuknOZ9aqDUaAiW8vyjyisZBOJKgsOny2mMlYCnmeBGpERr9OKwvb7IPUZlmM+viM5St1b87gtDer3QZne2dfrvWiv+tlXprM82+jYM0VU1o4GHxrFYZ2ayNJZtKh0BqFYkAJ+PtGSyEE0AxvC43eAVWa2GmgQNIB03gS3QmH37Ba34J8fPoAl+U9jrscx8nVORppZC35v2m+eNuujHjYjvRx+T0YFh8Gh78+Dw4PNqk3WHv2Hv2kRXsKztkx+yEjRiwmv1iN+w2uUt+pztp58GaJpuet+wfpP17JmG2hw==</latexit>

ADV2<latexit sha1_base64="ZWHQm72Yv3xv1mA9WRicXNWuI64=">AAACVHicbVBdTxNBFJ1dRKEiAj7ysrGY8NTsVo0+4seDj5jYQsIUMnt7Syedj83MXaSZ7P/gVX4Uif/FB2dLE7V4kklOzj33Y05ZKekpz38m6dqj9cdPNjY7T7eebT/f2d0bels7wAFYZd1pKTwqaXBAkhSeVg6FLhWelLNPbf3kCp2X1nyjeYUjLS6NnEgQFKVzrgVNnQ4fPg/P+83FTjfv5QtkD0mxJF22xPHFbpLzsYVaoyFQwvuzIq9oFIQjCQqbDq89VgJm4hLPIjVCox+FxdlN9ioq42xiXXyGsoX6d0cQ2vu5LqOzPdOv1lrxv7VSr2ymyftRkKaqCQ3cL57UKiObtZlkY+kQSM0jEeBkvD2DqXACKCbX4Qa/g9VamHHgANJBE/gMncl7b/GaX0H8PLrAp6W9DgfcxwkVeZor5K35oGn+uJtOzLhYTfQhGfZ7xete/+ub7tHHZdobbJ+9ZIesYO/YEfvCjtmAAXPshv1gt8ld8itdS9fvrWmy7HnB/kG6/RvBGrVk</latexit>





the DNN’s behavior change, by generating highly contrastivemaps. This sensitivity is also quantitatively confirmed by theL1 distance between the clean and noisy attribution maps.

RT

S<latexit sha1_base64="96Kp6KWXsT6/Gx6bjlx6ge0kIMI=">AAACUnicbVJNTxsxEPWmH9CUttAee1k1VOop2oVW5YjKpUeg+ZJwhLyTWWLFHyt7lhJZ+ze4wo/qpX+FE06I1DZ0JEtP773xjJ9cVEp6yrLfSevJ02fPNzZftF9uvXr9Znvn7cDb2gH2wSrrRoXwqKTBPklSOKocCl0oHBazo4U+vETnpTU9mlc41uLCyFKCoEhxrgVNfRlOez+a8+1O1s2WlT4G+Qp02KqOz3eSjE8s1BoNgRLen+VZReMgHElQ2LR57bESMBMXeBahERr9OCyXbtKPkZmkpXXxGEqX7N8dQWjv57qIzuWS69qC/K9W6LXJVB6MgzRVTWjgYXBZq5RsukgknUiHQGoegQAn4+4pTIUTQDG3Njf4E6zWwkwCB5AOmsBn6EzW/YJX/BLi49EFPi3sVdjlPt5Qkae5Qr4w7zbNH3fTjhnn64k+BoO9br7f3Tv53Dn8tkp7k71nH9gnlrOv7JB9Z8esz4BV7JrdsNvkV3LXir/kwdpKVj3v2D/V2roHir602A==</latexit>

Figure 16: Attribution maps of benign and adversarial (ADV2) inputson RTS and RTSA.

In the second case, we assess the resilience of RTSA againstADV2. In Figure ??, we compare the attribution maps of be-nign and adversarial inputs on RTS and RTSA. It is observedthat while ADV2 generates adversarial inputs with interpreta-tions fairly similar to benign cases on RTS, it fails to do so onRTSA: the maps of adversarial inputs are fairly distinguish-able from their benign counterparts. Moreover, RTSA behavesalmost identically to RTS on benign inputs, indicating that theAID training has little impact on benign cases. These obser-vations are confirmed by the L1 measures as well.

RTS RTSA

Benign 0.03ADV2 0.01 0.10

Table 10. Comparison of AID and ADV2 with corresponded benignmaps, measured by L1 distance.


Observation 8

It is possible to exploit ADV2 to reduce the prediction-interpretation gap in training interpreters.

6 Related WorkIn this section, we survey three categories of work rele-

vant to this work, namely, adversarial attacks and defenses,transferability, and interpretability.

Attacks and Defenses – Due to their widespread usein security-critical domains, machine learning models areincreasingly becoming the targets of malicious attacks [?].Two primary threat models are considered in literature. Poi-soning attacks – the adversary pollutes the training data toeventually compromise the target models [?, ?, ?]; Evasionattacks – the adversary manipulates the input data during in-ference to trigger target models to misbehave [?, ?, ?].

Compared with simple models (e.g., support vector ma-chines), securing deep neural networks (DNNs) in adversar-ial settings entails more challenges due to their significantlyhigher model complexity [?]. One line of work focuses ondeveloping new evasion attacks against DNNs [?, ?, ?, ?, ?].

Another line of work attempts to improve DNN resilienceagainst such attacks by inventing new training and inferencestrategies [?, ?, ?, ?]. Yet, such defenses are often circum-vented by more powerful attacks [?] or adaptively engineeredadversarial inputs [?, ?], resulting in a constant arms race be-tween attackers and defenders [?].

This work is among the first to explore attacks againstDNNs with interpretability as a means of defense.

Transferability – One intriguing property of adversarialattacks is their transferability [?]: adversarial inputs craftedagainst one DNN is often effective against another one. Thisproperty enables black-box attacks: the adversary generatesadversarial inputs based on a surrogate DNN and apply themon the target model [?, ?, ?]. To defend against such attacks,the method of ensemble adversarial training [?] has been pro-posed, which trains DNNs using data augmented with adver-sarial inputs crafted on other models.

This work complements this line of work by investigat-ing the transferability of adversarial inputs across differentinterpretation models.

Interpretability – A plethora of interpretation modelshave been proposed to provide interpretability for black-boxDNNs, using techniques based on back-propagation [?, ?, ?],intermediate representations [?, ?, ?], input perturbation [?],and meta models [?].

The improved interpretability is believed to offer a senseof security by involving human in the decision-making pro-cess. Existing work has exploited interpretability to debugDNNs [?], digest security analysis results [?], and detect ad-versarial inputs [?, ?]. Intuitively, as adversarial inputs causeunexpected DNN behaviors, the interpretation of DNN dy-namics is expected to differ significantly between benign andadversarial inputs.

However, recent work empirically shows that some inter-pretation models seem insensitive to either DNNs or data gen-eration processes [?], while transformation with no effect onDNNs (e.g., constant shift) may significantly affect the be-haviors of interpretation models [?].

This work shows the possibility of deceiving DNNs andtheir coupled interpretation models simultaneously, imply-ing that the improved interpretability only provides limitedsecurity assurance, which also complements prior work byexamining the reliability of existing interpretation modelsfrom the perspective of adversarial vulnerability.

7 ConclusionThis work represents a systematic study on the security

of interpretable deep learning systems (IDLSes). We presentADV2, a general class of attacks that generate adversarial in-puts not only misleading target DNNs but also deceiving theircoupled interpretation models. Through extensive empiricalevaluation, we show the effectiveness of ADV2 against a rangeof DNNs and interpretation models, implying that the inter-pretability of existing IDLSes may merely offer a false sense

13

L1 measures<latexit sha1_base64="qcjCSPy16smkXN/8tc8k+NkIMEI=">AAACa3icbVDLbhMxFHWmPEp4tKU7YGGRVmKBopnSCpYV3bBgUSTSVqqj6M7NTWPFj5HtKY2s+YB+DVv4lH4E/4AnjQSkXMvS0bnnvk5ZKelDnt90srV79x88XH/Uffzk6bONza3nJ97WDmmAVll3VoInJQ0NggyKzipHoEtFp+XsqM2fXpLz0pqvYV7RUMOFkROJEBI12uwJDWGKoOLnZlRw8Ta9lnE6agJfO/JNUuX9fBH8LiiWoMeWcTza6uRibLHWZAIq8P68yKswjOCCREVNV9SeKsAZXNB5ggY0+WFcXNPw3cSM+cS69E3gC/bvigja+7kuk7Ld1K/mWvK/uVKvTA6TD8MoTVUHMng7eFIrHixvreJj6QiDmicA6GTaneMUHGBIhnaFoW9otQYzjgJROmyimJEzef+ArsQlpuPJRTEt7VXcET51qIIPc0WiFe80zR91000eF6uO3gUne/3iXX/vy37v8OPS7XX2kr1mb1jB3rND9okdswFDds2+sx/sZ+dXtp29yF7dSrPOsmab/RPZ7m9pEr2c</latexit>

Figure 16: Attribution maps of benign and adversarial (ADV2) inputson RTS and RTSA.

In the second case, we assess the resilience of RTSA againstADV2. In Figure 16, we compare the attribution maps of be-nign and adversarial inputs on RTS and RTSA. It is observedthat while ADV2 generates adversarial inputs with interpreta-tions fairly similar to benign cases on RTS, it fails to do so onRTSA: the maps of adversarial inputs are fairly distinguish-able from their benign counterparts. Moreover, RTSA behavesalmost identically to RTS on benign inputs, indicating that theAID training has little impact on benign cases. These obser-vations are confirmed by the L1 measures as well.

RTS RTSA

Benign 0.03ADV2 0.01 0.10

Table 10. Comparison of AID and ADV2 with corresponded benignmaps, measured by L1 distance.

Overall we have the following conclusion.Observation 8

It is possible to exploit ADV2 to reduce the prediction-interpretation gap in training interpreters.



Attacks and Defenses – Due to their widespread usein security-critical domains, machine learning models areincreasingly becoming the targets of malicious attacks [9].Two primary threat models are considered in literature. Poi-soning attacks – the adversary pollutes the training data toeventually compromise the target models [8, 76, 46]; Evasionattacks – the adversary manipulates the input data during in-ference to trigger target models to misbehave [16, 40, 49].

Compared with simple models (e.g., support vector ma-chines), securing deep neural networks (DNNs) in adversar-ial settings entails more challenges due to their significantlyhigher model complexity [35]. One line of work focuses ondeveloping new evasion attacks against DNNs [71, 24, 54,

13, 42]. Another line of work attempts to improve DNN re-silience against such attacks by inventing new training andinference strategies [53, 44, 77, 41]. Yet, such defenses areoften circumvented by more powerful attacks [13] or adap-tively engineered adversarial inputs [12, 5], resulting in aconstant arms race between attackers and defenders [37].


Transferability – One intriguing property of adversarialattacks is their transferability [71]: adversarial inputs craftedagainst one DNN is often effective against another one. Thisproperty enables black-box attacks: the adversary generatesadversarial inputs based on a surrogate DNN and apply themon the target model [51, 14, 39]. To defend against such at-tacks, the method of ensemble adversarial training [73] hasbeen proposed, which trains DNNs using data augmentedwith adversarial inputs crafted on other models.


Interpretability – A plethora of interpretation modelshave been proposed to provide interpretability for black-boxDNNs, using techniques based on back-propagation [63, 66,67], intermediate representations [80, 60, 19], input pertur-bation [21], and meta models [15].

The improved interpretability is believed to offer a senseof security by involving human in the decision-making pro-cess. Existing work has exploited interpretability to debugDNNs [50], digest security analysis results [25], and detectadversarial inputs [38, 72]. Intuitively, as adversarial inputscause unexpected DNN behaviors, the interpretation of DNNdynamics is expected to differ significantly between benignand adversarial inputs.

However, recent work empirically shows that some inter-pretation models seem insensitive to either DNNs or data gen-eration processes [1], while transformation with no effect onDNNs (e.g., constant shift) may significantly affect the be-haviors of interpretation models [33].



of interpretable deep learning systems (IDLSes). We presentADV2, a general class of attacks that generate adversarial in-puts not only misleading target DNNs but also deceiving theircoupled interpretation models. Through extensive empiricalevaluation, we show the effectiveness of ADV2 against a rangeof DNNs and interpretation models, implying that the inter-pretability of existing IDLSes may merely offer a false sense

13

Figure 16: Attribution maps of benign and adversarial (ADV2) inputswith respect to RTS and RTSA on ResNet.

In the second case, we assess the resilience of RTSA against

12

ADV2. In Figure 16, we compare the attribution maps of be-nign and adversarial inputs on RTS and RTSA. It is observedthat while ADV2 generates adversarial inputs with interpre-tations fairly similar to benign cases on RTS, it fails to do soon RTSA: the maps of adversarial inputs are fairly distinguish-able from their benign counterparts. Moreover, RTSA behavesalmost identically to RTS on benign inputs, indicating that theAID training has little impact on benign cases. These findingsare confirmed by the L1 measures as well.


Observation 9

It is possible to exploit ADV2 to reduce the prediction-interpretation gap during training interpreters.



Attacks and Defenses – Due to their widespread use insecurity-critical domains, machine learning models are in-creasingly becoming the targets of malicious attacks. Twoprimary threat models are considered in literature. Poisoningattacks – the adversary pollutes the training data to eventuallycompromise the target models [7]; Evasion attacks – the ad-versary manipulates the input data during inference to triggertarget models to misbehave [11].

Compared with simple models (e.g., support vector ma-chines), securing deep neural networks (DNNs) in adversar-ial settings entails more challenges due to their significantlyhigher model complexity [29]. One line of work focuses ondeveloping new evasion attacks against DNNs [19, 35, 56].Another line of work attempts to improve DNN resilienceagainst such attacks by inventing new training and inferencestrategies [34, 42, 61]. Yet, such defenses are often circum-vented by more powerful attacks [9] or adaptively engineeredadversarial inputs [5], resulting in a constant arms race be-tween attackers and defenders [31].


Transferability – One intriguing property of adversarialattacks is their transferability [56]: adversarial inputs craftedagainst one DNN is often effective against another one. Thisproperty enables black-box attacks – the adversary generatesadversarial inputs based on a surrogate DNN and then appliesthem on the target model [33, 40]. To defend against suchattacks, the method of ensemble adversarial training [58] hasbeen proposed, which trains DNNs using data augmentedwith adversarial inputs crafted on other models.


Interpretability – A plethora of interpretation models

have been proposed to provide interpretability for black-box DNNs, using techniques based on back-propagation[50, 52, 53], intermediate representations [14, 47, 64], inputperturbation [16], and meta models [10].

The improved interpretability is believed to offer a senseof security by involving human in the decision-making pro-cess. Existing work has exploited interpretability to debugDNNs [39], digest security analysis results [20], and detectadversarial inputs [32, 57]. Intuitively, as adversarial inputscause unexpected DNN behaviors, the interpretation of DNNdynamics is expected to differ significantly between benignand adversarial inputs.

However, recent work empirically shows that some inter-pretation models seem insensitive to either DNNs or datageneration processes [1], while transformation with no effecton DNNs (e.g., constant shift) may significantly affect thebehaviors of interpretation models [27].



of interpretable deep learning systems (IDLSes). We presentADV2, a general class of attacks that generate adversarial in-puts not only misleading target DNNs but also deceiving theircoupled interpretation models. Through extensive empiricalevaluation, we show the effectiveness of ADV2 against a rangeof DNNs and interpretation models, implying that the inter-pretability of existing IDLSes may merely offer a false senseof security. We identify the prediction-interpretation gap asone possible cause of this vulnerability, raising the criticalconcern about the current assessment metrics of interpreta-tion models. Further, we discuss potential countermeasuresagainst ADV2, which sheds light on designing and operatingIDLSes in a more robust and informative fashion.

AcknowledgmentsThis material is based upon work supported by the National

Science Foundation under Grant No. 1846151 and 1910546.Any opinions, findings, and conclusions or recommendationsexpressed in this material are those of the author(s) and donot necessarily reflect the views of the National Science Foun-dation. Shouling Ji is partially supported by NSFC under No.61772466 and U1836202, the Zhejiang Provincial Natural Sci-ence Foundation for Distinguished Young Scholars under No.LR19F020003, and the Provincial Key Research and Devel-opment Program of Zhejiang, China under No. 2017C01055.Xiapu Luo is partially supported by Hong Kong RGC Project(No. PolyU 152279/16E, CityU C1008-16G).

13

References[1] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow,

M. Hardt, and B. Kim. Sanity Checks for SaliencyMaps. In Proceedings of Advances in Neural Informa-tion Processing Systems (NIPS), 2018.

[2] Rima Alaifari, Giovanni S. Alberti, and Tandri Gauks-son. ADef: an Iterative Algorithm to Construct Adver-sarial Deformations. In Proceedings of InternationalConference on Learning Representations (ICLR), 2019.

[3] M. Ancona, E. Ceolini, C. Öztireli, and M. Gross. To-wards Better Understanding of Gradient-based Attribu-tion Methods for Deep Neural Networks. In Proceedingsof International Conference on Learning Representa-tions (ICLR), 2018.

[4] Marcin Andrychowicz, Misha Denil, Sergio Gómez,Matthew W Hoffman, David Pfau, Tom Schaul, BrendanShillingford, and Nando de Freitas. Learning to learnby gradient descent by gradient descent. In Proceedingsof Advances in Neural Information Processing Systems(NIPS), 2016.

[5] Anish Athalye, Nicholas Carlini, and David Wagner.Obfuscated Gradients Give a False Sense of Security:Circumventing Defenses to Adversarial Examples. InProceedings of International Conference on LearningRepresentations (ICLR), 2018.

[6] Sebastian Bach, Alexander Binder, Grégoire Montavon,Frederick Klauschen, Klaus-Robert Müller, and Woj-ciech Samek. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise RelevancePropagation. PLoS ONE, 10(7):e0130140, 2015.

[7] Battista Biggio, Blaine Nelson, and Pavel Laskov. Poi-soning Attacks against Support Vector Machines. InProceedings of IEEE Conference on Machine Learning(ICML), 2012.

[8] Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu,Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang,Chang Huang, Wei Xu, Deva Ramanan, and Thomas SHuang. Look and Think Twice: Capturing Top-DownVisual Attention with Feedback Convolutional NeuralNetworks. In Proceedings of IEEE International Con-ference on Computer Vision (ICCV), 2015.

[9] Nicholas Carlini and David A. Wagner. Towards Evalu-ating the Robustness of Neural Networks. In Proceed-ings of IEEE Symposium on Security and Privacy (S&P),2017.

[10] P. Dabkowski and Y. Gal. Real Time Image Saliency forBlack Box Classifiers. In Proceedings of Advances inNeural Information Processing Systems (NIPS), 2017.

[11] Nilesh Dalvi, Pedro Domingos, Mausam, Sumit Sanghai,and Deepak Verma. Adversarial Classification. In Pro-ceedings of ACM International Conference on Knowl-edge Discovery and Data Mining (KDD), 2004.

[12] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical imagedatabase. In Proceedings of IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR), 2009.

[13] Mengnan Du, Ninghao Liu, Qingquan Song, and XiaHu. Towards Explanation of DNN-based Predictionwith Guided Feature Inversion. In Proceedings of ACMInternational Conference on Knowledge Discovery andData Mining (KDD), 2018.

[14] Mengnan Du, Ninghao Liu, Qingquan Song, and XiaHu. Towards Explanation of DNN-based Predictionwith Guided Feature Inversion. ArXiv e-prints, 2018.

[15] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for Fast Adaptation of DeepNetworks. In Proceedings of IEEE Conference on Ma-chine Learning (ICML), 2017.

[16] Ruth C Fong and Andrea Vedaldi. Interpretable Ex-planations of Black Boxes by Meaningful Perturbation.In Proceedings of IEEE International Conference onComputer Vision (ICCV), 2017.

[17] T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov,S. Chaudhuri, and M. Vechev. AI2: Safety and Robust-ness Certification of Neural Networks with AbstractInterpretation. In Proceedings of IEEE Symposium onSecurity and Privacy (S&P), 2018.

[18] Nils Gessert, Thilo Sentker, Frederic Madesta, Rüdi-ger Schmitz, Helge Kniep, Ivo M. Baltruschat, RenéWerner, and Alexander Schlaefer. Skin lesion diagnosisusing ensembles, unscaled multi-crop evaluation andloss weighting. ArXiv e-prints, abs/1808.01694, 2018.

[19] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy.Explaining and Harnessing Adversarial Examples. InProceedings of International Conference on LearningRepresentations (ICLR), 2015.

[20] Wenbo Guo, Dongliang Mu, Jun Xu, Purui Su, GangWang, and Xinyu Xing. LEMNA: Explaining DeepLearning Based Security Applications. In Proceedingsof ACM SAC Conference on Computer and Communica-tions (CCS), 2018.

[21] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of IEEE International Conferenceon Computer Vision (ICCV), 2017.

14

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and JianSun. Deep Residual Learning for Image Recognition.In Proceedings of IEEE Conference on Computer Visionand Pattern Recognition (CVPR), 2016.

[23] Warren He, James Wei, Xinyun Chen, Nicholas Carlini,and Dawn Song. Adversarial Example Defenses: En-sembles of Weak Defenses are not Strong. In USENIXWorkshop on Offensive Technologies, 2017.

[24] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Wein-berger. Densely Connected Convolutional Networks. InProceedings of IEEE Conference on Computer Visionand Pattern Recognition (CVPR), 2017.

[25] Max Jaderberg, Karen Simonyan, Andrew Zisserman,and Koray Kavukcuoglu. Spatial transformer networks.In Proceedings of Advances in Neural Information Pro-cessing Systems (NIPS), 2015.

[26] A. Karpathy, J. Johnson, and L. Fei-Fei. Visualizing andUnderstanding Recurrent Networks. In Proceedings ofInternational Conference on Learning Representations(ICLR), 2016.

[27] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo,Maximilian Alber, Kristof T. Schütt, Sven Dähne, Du-mitru Erhan, and Been Kim. The (Un)reliability ofSaliency Methods. ArXiv e-prints, 2017.

[28] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio.Adversarial Machine Learning at Scale. In Proceedingsof International Conference on Learning Representa-tions (ICLR), 2017.

[29] Yann Lecun, Yoshua Bengio, and Geoffrey Hinton.Deep learning. Nature, 521(7553):436–444, 2015.

[30] Min Lin, Qiang Chen, and Shuicheng Yan. Network inNetwork. In Proceedings of International Conferenceon Learning Representations (ICLR), 2014.

[31] X. Ling, S. Ji, J. Zou, J. Wang, C. Wu, B. Li, and T. Wang.DEEPSEC: A Uniform Platform for Security Analysisof Deep Learning Model. In Proceedings of IEEE Sym-posium on Security and Privacy (S&P), 2019.

[32] Ninghao Liu, Hongxia Yang, and Xia Hu. AdversarialDetection with Model Interpretation. In Proceedings ofACM International Conference on Knowledge Discoveryand Data Mining (KDD), 2018.

[33] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song.Delving into Transferable Adversarial Examples andBlack-Box Attacks. In Proceedings of InternationalConference on Learning Representations (ICLR), 2017.

[34] Xingjun Ma, Bo Li, Yisen Wang, Sarah M. Erfani, Su-danthi Wijewickrema, Grant Schoenebeck, Dawn Song,Michael E. Houle, and James Bailey. CharacterizingAdversarial Subspaces Using Local Intrinsic Dimension-ality. In Proceedings of International Conference onLearning Representations (ICLR), 2018.

[35] Aleksander Madry, Aleksandar Makelov, LudwigSchmidt, Dimitris Tsipras, and Adrian Vladu. TowardsDeep Learning Models Resistant to Adversarial Attacks.In Proceedings of International Conference on LearningRepresentations (ICLR), 2018.

[36] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, andP. Frossard. Universal Adversarial Perturbations. InProceedings of IEEE Conference on Computer Visionand Pattern Recognition (CVPR), 2017.

[37] W. J. Murdoch, P. J. Liu, and B. Yu. Beyond WordImportance: Contextual Decomposition to Extract Inter-actions from LSTMs. In Proceedings of InternationalConference on Learning Representations (ICLR), 2018.

[38] W. J. Murdoch and A. Szlam. Automatic Rule Ex-traction from Long Short Term Memory Networks. InProceedings of International Conference on LearningRepresentations (ICLR), 2017.

[39] A. Nguyen, J. Yosinski, and J. Clune. Deep Neural Net-works are Easily Fooled: High Confidence Predictionsfor Unrecognizable Images. In Proceedings of IEEEConference on Computer Vision and Pattern Recogni-tion (CVPR), 2014.

[40] Nicolas Papernot, Patrick McDaniel, and Ian Goodfel-low. Transferability in Machine Learning: from Phenom-ena to Black-box Attacks Using Adversarial Samples.ArXiv e-prints, 2016.

[41] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow,Somesh Jha, Z Berkay Celik, and Ananthram Swami.Practical Black-Box Attacks Against Machine Learn-ing. In Proceedings of ACM Symposium on Information,Computer and Communications Security (AsiaCCS),2017.

[42] Nicolas Papernot, Patrick McDaniel, Xi Wu, SomeshJha, and Ananthram Swami. Distillation as a Defenseto Adversarial Perturbations Against Deep Neural Net-works. In Proceedings of IEEE Symposium on Securityand Privacy (S&P), 2016.

[43] Marco Tulio Ribeiro, Sameer Singh, and CarlosGuestrin. "Why Should I Trust You?": Explaining thePredictions of Any Classifier. In Proceedings of ACMInternational Conference on Knowledge Discovery andData Mining (KDD), 2016.

15

[44] O. Ronneberger, P.Fischer, and T. Brox. U-Net: Con-volutional Networks for Biomedical Image Segmenta-tion. In Proceedings of Medical Image Computing andComputer-Assisted Intervention (MICCAI), 2015.

[45] S. Sabour, N. Frosst, and G. E Hinton. Dynamic Rout-ing Between Capsules. In Proceedings of Advances inNeural Information Processing Systems (NIPS), 2017.

[46] F.M. Scherer. Heinrich von Stackelberg’s Marktformund Gleichgewicht. Journal of Economic Studies,23(5/6):58–70, 1996.

[47] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam,D. Parikh, and D. Batra. Grad-CAM: Visual Explana-tions from Deep Networks via Gradient-Based Localiza-tion. In Proceedings of IEEE International Conferenceon Computer Vision (ICCV), 2017.

[48] Avanti Shrikumar, Peyton Greenside, and Anshul Kun-daje. Learning Important Features Through PropagatingActivation Differences. In Proceedings of IEEE Confer-ence on Machine Learning (ICML), 2017.

[49] David Silver, Aja Huang, Chris J. Maddison, ArthurGuez, Laurent Sifre, George van den Driessche, JulianSchrittwieser, Ioannis Antonoglou, Veda Panneershel-vam, Marc Lanctot, Sander Dieleman, Dominik Grewe,John Nham, Nal Kalchbrenner, Ilya Sutskever, TimothyLillicrap, Madeleine Leach, Koray Kavukcuoglu, ThoreGraepel, and Demis Hassabis. Mastering the Game ofGo with Deep Neural Networks and Tree Search. Na-ture, (7587):484–489, 2016.

[50] Karen Simonyan, Andrea Vedaldi, and Andrew Zisser-man. Deep Inside Convolutional Networks: VisualisingImage Classification Models and Saliency Maps. InProceedings of International Conference on LearningRepresentations (ICLR), 2014.

[51] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wat-tenberg. SmoothGrad: Removing Noise by AddingNoise. In International Conference on Machine Learn-ing Workshop, 2017.

[52] Jost Tobias Springenberg, Alexey Dosovitskiy, ThomasBrox, and Martin Riedmiller. Striving for Simplicity:The All Convolutional Net. Proceedings of InternationalConference on Learning Representations (ICLR), 2015.

[53] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Ax-iomatic Attribution for Deep Networks. In Proceed-ings of IEEE Conference on Machine Learning (ICML),2017.

[54] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequenceto Sequence Learning with Neural Networks. In Pro-ceedings of Advances in Neural Information ProcessingSystems (NIPS), 2014.

[55] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe,Jonathon Shlens, and Zbigniew Wojna. Rethinking theInception Architecture for Computer Vision. In Pro-ceedings of IEEE Conference on Computer Vision andPattern Recognition (CVPR), 2016.

[56] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever,Joan Bruna, Dumitru Erhan, Ian Goodfellow, and RobFergus. Intriguing Properties of Neural Networks. InProceedings of International Conference on LearningRepresentations (ICLR), 2014.

[57] Guanhong Tao, Shiqing Ma, Yingqi Liu, and XiangyuZhang. Attacks Meet Interpretability: Attribute-SteeredDetection of Adversarial Samples. In Proceedings ofAdvances in Neural Information Processing Systems(NIPS), 2018.

[58] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow,D. Boneh, and P. McDaniel. Ensemble AdversarialTraining: Attacks and Defenses. In Proceedings of In-ternational Conference on Learning Representations(ICLR), 2018.

[59] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom,Alexander Turner, and Aleksander Madry. RobustnessMay Be at Odds with Accuracy. In Proceedings ofInternational Conference on Learning Representations(ICLR), 2019.

[60] Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He,Mingyan Liu, and Dawn Song. Spatially TransformedAdversarial Examples. In Proceedings of InternationalConference on Learning Representations (ICLR), 2018.

[61] W. Xu, D. Evans, and Y. Qi. Feature Squeezing: Detect-ing Adversarial Examples in Deep Neural Networks. InProceedings of Network and Distributed System SecuritySymposium (NDSS), 2018.

[62] Q. Zhang, Y. Nian Wu, and S.-C. Zhu. Interpretable Con-volutional Neural Networks. In Proceedings of IEEEConference on Computer Vision and Pattern Recogni-tion (CVPR), 2018.

[63] Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu.Interpretable Convolutional Neural Networks. In Pro-ceedings of IEEE Conference on Computer Vision andPattern Recognition (CVPR), 2018.

[64] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Tor-ralba. Learning Deep Features for Discriminative Lo-calization. In Proceedings of IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR), 2016.

16

AppendixA. Implementation DetailsA1. ADV222 against Grad-CAM

Gradient-weighted class activation mapping (GRADCAM)[47] is another feature-guided interpretation model. Similarto CAM, it approximates the model predictions as a weightedsummation of the feature maps of the last convolutional layer.Differently, it projects global averaged gradients back to theconvolutional feature maps:

wk,c =1Z ∑

i, j

∂ fc(x)∂ak[i, j]

(14)

where fc(x) is the model prediction with respect to input xand class c, ak[i, j] is the activation of the k-th channel of thelast convolutional layer at the spatial position (i, j), and Z isa normalization constant. The attribution map is defined as:

mc[i, j] = ReLU

(∑k

wk,c ak[i, j]

)(15)

To attack GRADCAM, we apply the same optimizationprocedure in § 3.3. Note that although the gradient of wk,y(x)with respect to x is zero almost everywhere, it is feasible tofind high-quality solutions using stochastic gradient descentmethods, since ak has non-zero gradients with respect to x.

Attack ASR ∆ (L1) ∆ (L2)

P 100.0% (1.0) 0.22 0.28A 100.0% (0.98) 0.07 0.09

Table 10. Effectiveness of PGD and ADV2 against GRADCAM.

Table 10 summarizes the attack success rate, misclassifi-cation confidence, L1 and L2 distance between benign andadversarial maps (by PGD and ADV2).A2. Optimized MASK Formulation

The complete optimization objective for MASK in [16] isgiven as follows:

minm

λ1rtv(m)+λ2‖1−m‖1 +Eτ [ fc (φ(x(·− τ),m))]

s.t. 0≤ m≤ 1 (16)

Here the term rtv(m) is the total variation of m, which reducesits noise and artifacts; the term ‖1−m‖1 encourages the spar-sity of m; φ(x;m) is the perturbation operator which blends xwith Gaussian noise (controlled by the parameter τ); while λ1and λ2 are the regularized coefficients for total variation andsparsity respectively.A3. Analysis of ADV222 against MASK

Here we provide simple analysis for the effectiveness ofADV2 in Algorithm 1. We have the following proposition.

Proposition 2. Let f (x,y) : Rm×Rn→R be a function withcontinuous second-order derivative in the neighborhood N =Nx×Ny of a point (x0,y0). Assume the following conditionshold:

• C1. For each x ∈ Nx, there is a unique g(x) , y ∈ Nysuch that y is a local minimizer of f (x, ·);

• C2. y0 is a local minimizer of f (x0, ·) (i.e., g(x0) = y0);

• C3. The Hessian of f (x0, ·),H(x0, ·)=∇2y f (x0,y0) is non-

degenerate at y0 (i.e., det(H(x0,y0))> 0).

Then for every g0 ∈Ny, the gradient of G(x) = 12‖g(x)−g0‖2

2at x = x0 is given by

−∇xy f (x0,y0)(∇2y f (x0,y0))

−1(g(x0)−g0). (17)

Proof. Let Q be the partial derivative of f with respect to y:Q(x,y), ∇y f (x,y). Then Eqn (17) is equivalent to

−∇xQ(x0,y0)(∇yQ(x0,y0))−1(g(x0)−g0) (18)

Since g(x0) = y0 (by C2), Q(x0,y0) = 0. Based on C3,for J , ∇yQ(x0,y0), det(J) 6= 0. According to the implicitfunction theorem, there exists a neighborhood of x0, Nx ⊂Nx ⊂Rm, a neighborhood of y0, Ny ⊂Ny ⊂Rn, and a uniquesmooth function h(x) : Nx→ Ny such that

Q(x,h(x)) = 0 ∀x ∈ Nx (19)

According to C1, for each x ∈ Nx, f (x, ·) has a unique localminimizer g(x) near y0. Then the local minimizer must be h(x)due to the first-order optimality condition. Thus h(x) = g(x)for all x ∈ Nx. Computing the gradient with respect to x forEqn (19), we have

∇xg(x0) =−∇xQ(x0,y0)(∇yQ(x0,y0))−1 (20)

which leads to Eqn (18) by the product rule of gradient.

Back to our case, let `map(m;x) be the objective functiondefined in Eqn (9) (or Eqn (16)) and mt be the target map.Although, strictly speaking, `map(m;x) is not continuouslydifferentiable, we assume f (x,m) = `map(m;x) satisfies theconditions of Proposition 2 for analysis purpose.

Note that m∗(x) = argminm `map(m;x) is the optimal mapfound by MASK for given x. At a given iteration, let m∗ bethe optimal map of the current input x. We can take a steptowards the direction of

∆ =−∇xm`map(m∗;x)(∇2m`map(m∗;x))−1(m∗−mt)

to make the map generated in the next iteration approachmt . In practice we only have m, an estimate of m∗. Let H =∇2

m`map(m;x) be the Hessian matrix for fixed x. By pluggingm into ∆ and setting a proper step size α > 0, we have

α∆≈−α∇xm`map(m;x)(∇2m`map(m;x))−1(m−mt)

=−α∇xm`map(m;x)H−1(m−mt)

=∇x (m−αH−1∇m`map(m;x))︸︷︷︸

inner update step

(m−mt)

17

In our attack, we instantiate the inner update step with anAdam step to get fast convergence and to avoid the issue ofvanishing Hessian for DNNs with ReLU activation.A4: Details of StAdv-based ADV222

We first briefly introduce the concept of spatial transforma-tion. Let xi be the i-th pixel of adversarial input x and (ui, vi)be its spatial coordinates. With flow-based transformation, xis generated from another input x by a per-pixel flow vector r,where ri = (∆ui,∆vi). The corresponding coordinates of xi inx are given by (ui,vi) = (ui +∆ui, vi +∆vi). As (ui,vi) do notnecessarily lie on the integer grid, bilinear interpolation [25]is used to compute xi:

xi =∑j

x j max(0,1−|ui+∆ui−u j|)max(0,1−|vi+∆vi−v j|)

where j iterates over the pixels adjacent to (ui,vi) in x. WithSTADV as the underlying attack framework, ADV2 can beconstructed as optimizing the following objective:

minr

`prd( f (x+ r),ct)+λìnt(g(x+ r; f ),mt)+ τ`flow(r) (21)

where `flow(r) = ∑i ∑ j∈N (i)

√‖∆ui−∆u j‖2

2 +‖∆vi−∆v j‖22

measures the magnitude of spatial transformation and τ is ahyper-parameter controlling its importance. In implementa-tion, we solve Eqn (21) using an Adam optimizer.A5: Details of AID

We use RTS as a concrete example to show the implemen-tation of AID. In RTS, one trains a DNN g (parameterized byθ) to directly predict the attribution map g(x;θ) for a giveninput x. To train g, one minimizes the interpretation loss:

ìnt(θ),λ1rtv(g(x;θ))+λ2rav(g(x;θ))− log( fc (φ(x;g(x;θ))))

+λ3 fc (φ(x;1−g(x;θ)))λ4 (22)

with all the terms defined similarly as in Eqn (8).In AID, let A denote the ADV2 attack. We further consider

an adversarial distillation loss:

àid(θ),−‖g(x;θ)−g(A(x);θ)‖1 (23)

which measures the difference of attribution maps of benignand adversarial inputs under the current interpreter g(·;θ).

AID trains g by alternating between minimizing ìnt(θ) andminimizing àid(θ) until convergence.

B. Parameter SettingHere we summarize the default parameter setting for the

attacks implemented in this paper.B1. PGD-based ADV222

For regular PGD, we set the learning rate α = 1./255 andthe perturbation threshold ε = 0.031. Table 11 list the param-eter setting of PGD-based ADV2.

Parameter ResNet DenseNet

GRAD# iterations ntotal 800 800ìnt coefficient (λ) 0.007 0.007

CAM# iterations ntotal 1200 1200ìnt coefficient (λ) 0.204/0.02 0.204

MASK

# iterations ntotal 1000 1000# gradient descent steps nstep 4 4# iterations per reset nreset 50 50max. search step size αmax 0.08 0.08# max. search steps nbs 12 12

RTS# iterations ntotal 1200 1200ìnt coefficient (λ1) 0.002/0.006 0.008`prd coefficient (λ2) 0.1/- 0.1

Table 11. Parameter setting of PGD-based ADV2. The setting for Q4in § 4 (if different) is shown after ‘/’.

Parameter ResNet DenseNet

GRAD# iterations ntotal 800 800ìnt coefficient (λ) 0.023 0.023`flow coefficient (τ) 0.0005 0.0005

CAM# iterations ntotal 600 600ìnt coefficient (λ) 0.653 0.653`flow coefficient (τ) 0.0005 0.0005

MASK

# iterations ntotal 1000 1000# gradient descent steps nstep 4 4# iterations per reset nreset 50 50ìnt coefficient (λ) 500 500`flow coefficient (τ) 0.004 0.005

RTS

# iterations ntotal 600 600ìnt coefficient (λ1) 0.0408 0.0612`prd coefficient (λ2) 0.1 0.1`flow coefficient (τ) 0.0005 0.0005

Table 12. Parameter setting of STADV-based ADV2.

B2. StAdv-based ADV222

Table 12 list the parameter setting of STADV-based ADV2.The Adam optimizer in our experiments uses the hyper-

parameter setting of (α,β1,β2) = (0.01,0.9,0.999).

C. Additional Experimental ResultsBelow we include more experimental results that comple-

ment the ones presented in § 4 and § 5.

C1. § 4 Q2 Attack Effectiveness (Interpretation)Figure 17 shows a set of sample inputs (benign and adver-

sarial) and their attribution maps generated by GRAD, CAM,MASK, and RTS on DenseNet.

Image

AD

V2






PG


Ben

ign


Imag


Figure 17: Attribution maps of benign and adversarial (PGD, ADV2)inputs with respect to GRAD, CAM, MASK, and RTS on DenseNet.

Table 13 lists the average Lp distance (p = 1,2) between

18

the attribution maps of benign and adversarial (PGD, ADV2)inputs, which complements the results in Figure 5. We nor-malize the L2 measures by dividing them by the square rootof the number of pixels.

Attack ResNet DenseNet∆ (L1) ∆ (L2) ∆ (L1) ∆ (L2)

GRADP 0.10 0.14 0.11 0.15A 0.04 0.07 0.05 0.07

CAMP 0.22 0.28 0.31 0.36A 0.05 0.06 0.04 0.05

MASKP 0.28 0.38 0.27 0.37A 0.09 0.16 0.09 0.17

RTSP 0.22 0.42 0.26 0.48A 0.01 0.06 0.02 0.09

Table 13. Lp distance between attribution maps of benign and adver-sarial (P-PGD, A-ADV2) inputs.

C2. § 4 Q4 Real ApplicationIn § 4, we use the dataset from the ISIC 2018 challenge2,

and adopt a competition-winning model [18] (with ResNet asits backbone), which achieves the second place in the chal-lenge. The confusion matrix in Figure 18 shows the perfor-mance of the classifier in our study.

MEL NV BCC AKIEC BKL DF VASC

ME

LN

VB

CC

AK

IE

CB

KL

DF

VA

SC

ISIC Confusion Matrix

1.0

0.8

0.6

0.4

0.2

0.0

Figure 18: Confusion matrix of the classifier used in § 4 Q4 on theISIC 2018 challenge dataset [18].

Table 14 lists the average Lp distance (p = 1,2) betweenthe attribution maps of benign and adversarial (PGD, ADV2)inputs in the case study of skin cancer diagnosis.

Attack ∆ (L1) ∆ (L2)

GRADP 0.19 0.23A 0.06 0.09

CAMP 0.25 0.31A 0.06 0.08

MASKP 0.23 0.30A 0.08 0.11

RTSP 0.15 0.26A 0.02 0.07

Table 14. Lp distance of attribution maps of benign and adversarial(PGD, ADV2) inputs in the case study of skin cancer diagnosis.

C3. § 4 Q5 Alternative Attack FrameworkFigure 19 visualizes attribution maps of benign and adver-

sarial (STADV, STADV-based ADV2) inputs on DenseNet.

2https://challenge2018.isic-archive.com/task3/

Figure 20 further compares the L1 measures and IoU scores(w.r.t. benign cases) of adversarial inputs on DenseNet.




Ben

ign


Imag


StA

dv

<latexit sha1_base64="NzR0vFw1/rtnavTKvEaAIpMYkWM=">AAACVHicbVDLThRBFK1uRHBUBFy66TiYuJp0g0aWiBuXGB0goUZSffs2U5l6dKpuj0wq/R9s8aNM/BcX1gyTqIMnqeTk3HMfdcpGSU95/jNJ1x6sP9zYfNR7/OTp1rPtnd1Tb1sHOASrrDsvhUclDQ5JksLzxqHQpcKzcvJhXj+bovPSmi80a3CkxZWRtQRBUfrKtaCxr8Nnel9Nu8vtfj7IF8juk2JJ+myJk8udJOeVhVajIVDC+4sib2gUhCMJCrsebz02AibiCi8iNUKjH4XF2V32KipVVlsXn6Fsof7dEYT2fqbL6FycuVqbi/+tlXplM9WHoyBN0xIauFtctyojm80zySrpEEjNIhHgZLw9g7FwAigm1+MGv4HVWpgqcADpoAt8gs7kg7d4zacQP48u8HFpr8Me93FCQ55mCvncvNd1f9xdL2ZcrCZ6n5zuD4qDwf6nN/2j42Xam+wFe8les4K9Y0fsIzthQwbMsRt2y74nP5Jf6Vq6fmdNk2XPc/YP0q3fl1611Q==</latexit>

AD

V2

<latexit sha1_base64="NqmC7Em/oF1WRetJaXKkRYdecS4=">AAAB9XicbVDLSsNAFL2pr1pfUZduBovgqiRV0GV9LFxWsA9o0zKZTtqhk0mYmSgl9D/cuFDErf/izr9xmmahrQcGDufcyz1z/JgzpR3n2yqsrK6tbxQ3S1vbO7t79v5BU0WJJLRBIh7Jto8V5UzQhmaa03YsKQ59Tlv++Gbmtx6pVCwSD3oSUy/EQ8ECRrA2Uq8bYj1SQXp12+xVp3277FScDGiZuDkpQ4563/7qDiKShFRowrFSHdeJtZdiqRnhdFrqJorGmIzxkHYMFTikykuz1FN0YpQBCiJpntAoU39vpDhUahL6ZjJLuejNxP+8TqKDSy9lIk40FWR+KEg40hGaVYAGTFKi+cQQTCQzWREZYYmJNkWVTAnu4peXSbNacc8q1fvzcu06r6MIR3AMp+DCBdTgDurQAAISnuEV3qwn68V6tz7mowUr3zmEP7A+fwAuIZJM</latexit>

Figure 19: Attribution maps of benign and adversarial (STADV andSTADV-based ADV2) inputs on DenseNet.

GRAD CAM MASK RTS0

0.08

0.16

0.24

0.32

0.4

GRAD CAM MASK RTS0

0.2

0.4

0.6

0.8

1

IoU

Sco

re

(a) (b)

StAdv

ADV2

StAdv

ADV2

Figure 20: L1 measures and IoU scores of adversarial (STADV,STADV-based ADV2) inputs w.r.t. benign maps on DenseNet.

C4. § 5 Q1 Random Class InterpretationFigure 21 visualizes attribution maps of target and adver-

sarial (ADV2) inputs on DenseNet, which complements theresults shown in Figure 12. Figure 22 compares the L1 mea-sures and IoU scores of adversarial maps w.r.t. benign andtarget cases on DenseNet.




Tar

get

Img

<latexit sha1_base64="+IXxUYjQg9dE1r8J2tWR7/6XgJ0=">AAAB/XicbVDLSgMxFM3UV62v8bFzEyyCqzJTBV0W3eiuQl/QGUomzUxDk8yQZIQ6FH/FjQtF3Pof7vwbM+0stPVA4HDOvdyTEySMKu0431ZpZXVtfaO8Wdna3tnds/cPOipOJSZtHLNY9gKkCKOCtDXVjPQSSRAPGOkG45vc7z4QqWgsWnqSEJ+jSNCQYqSNNLCPPI70SPKshWREtAfveDQd2FWn5swAl4lbkCoo0BzYX94wxiknQmOGlOq7TqL9DElNMSPTipcqkiA8RhHpGyoQJ8rPZumn8NQoQxjG0jyh4Uz9vZEhrtSEB2Yyz6oWvVz8z+unOrzyMyqSVBOB54fClEEdw7wKOKSSYM0mhiAsqckK8QhJhLUprGJKcBe/vEw69Zp7XqvfX1Qb10UdZXAMTsAZcMElaIBb0ARtgMEjeAav4M16sl6sd+tjPlqyip1D8AfW5w+gYZVU</latexit>

Tar

get

Map

<latexit sha1_base64="wBcq9myxKvgSYxwPBMcNVLrx7Jw=">AAAB/XicbVDLSsNAFL2pr1pf8bFzM1gEVyWpgi6LbtwIFfqCJpTJdNoOnUzCzESoofgrblwo4tb/cOffOGmz0NYDA4dz7uWeOUHMmdKO820VVlbX1jeKm6Wt7Z3dPXv/oKWiRBLaJBGPZCfAinImaFMzzWknlhSHAaftYHyT+e0HKhWLRENPYuqHeCjYgBGsjdSzj7wQ65EM0waWQ6o9dIfjac8uOxVnBrRM3JyUIUe9Z395/YgkIRWacKxU13Vi7adYakY4nZa8RNEYkzEe0q6hAodU+eks/RSdGqWPBpE0T2g0U39vpDhUahIGZjLLqha9TPzP6yZ6cOWnTMSJpoLMDw0SjnSEsipQn0lKNJ8YgolkJisiIywx0aawkinBXfzyMmlVK+55pXp/Ua5d53UU4RhO4AxcuIQa3EIdmkDgEZ7hFd6sJ+vFerc+5qMFK985hD+wPn8AoeKVVQ==</latexit>

AD

V2

Img

<latexit sha1_base64="CEPcawL314a/zk2rHhd0jhsoHYw=">AAAB/HicbVC7TsMwFHV4lvIKdGSxqJCYqqQgwVgeA2xFog+pCZXjOq1V24lsBymKyq+wMIAQKx/Cxt/gtBmg5UiWjs65V/f4BDGjSjvOt7W0vLK6tl7aKG9ube/s2nv7bRUlEpMWjlgkuwFShFFBWppqRrqxJIgHjHSC8VXudx6JVDQS9zqNic/RUNCQYqSN1LcrHkd6JHl2cd1+qHvwlg8nfbvq1Jwp4CJxC1IFBZp9+8sbRDjhRGjMkFI914m1nyGpKWZkUvYSRWKEx2hIeoYKxInys2n4CTwyygCGkTRPaDhVf29kiCuV8sBM5lHVvJeL/3m9RIfnfkZFnGgi8OxQmDCoI5g3AQdUEqxZagjCkpqsEI+QRFibvsqmBHf+y4ukXa+5J7X63Wm1cVnUUQIH4BAcAxecgQa4AU3QAhik4Bm8gjfryXqx3q2P2eiSVexUwB9Ynz/xDZRO</latexit>

AD

V2

Map

<latexit sha1_base64="RFMUNIxydYewbANgey7W/YwDMdM=">AAAB/HicbVDLSgMxFM34rPU12qWbYBFclZkq6LI+Fm6ECvYBnbFk0kwbmmSGJCMMQ/0VNy4UceuHuPNvzLSz0NYDgcM593JPThAzqrTjfFtLyyura+uljfLm1vbOrr2331ZRIjFp4YhFshsgRRgVpKWpZqQbS4J4wEgnGF/lfueRSEUjca/TmPgcDQUNKUbaSH274nGkR5JnF9fth7oHb1E86dtVp+ZMAReJW5AqKNDs21/eIMIJJ0JjhpTquU6s/QxJTTEjk7KXKBIjPEZD0jNUIE6Un03DT+CRUQYwjKR5QsOp+nsjQ1yplAdmMo+q5r1c/M/rJTo89zMq4kQTgWeHwoRBHcG8CTigkmDNUkMQltRkhXiEJMLa9FU2JbjzX14k7XrNPanV706rjcuijhI4AIfgGLjgDDTADWiCFsAgBc/gFbxZT9aL9W59zEaXrGKnAv7A+vwB8o6UTw==</latexit>

GRAD<latexit sha1_base64="jijuc9C6B6tD5QNSewaHuI8qAjE=">AAAB9HicbVDLSgMxFL1TX7W+qi7dBIvgqszUgi7rA3RZxT6gHUomzbShmcyYZApl6He4caGIWz/GnX9jOp2Fth4IHM65l3tyvIgzpW3728qtrK6tb+Q3C1vbO7t7xf2DpgpjSWiDhDyUbQ8rypmgDc00p+1IUhx4nLa80fXMb42pVCwUj3oSUTfAA8F8RrA2ktsNsB4qP7l9uLyZ9oolu2ynQMvEyUgJMtR7xa9uPyRxQIUmHCvVcexIuwmWmhFOp4VurGiEyQgPaMdQgQOq3CQNPUUnRukjP5TmCY1S9fdGggOlJoFnJtOQi95M/M/rxNq/cBMmolhTQeaH/JgjHaJZA6jPJCWaTwzBRDKTFZEhlpho01PBlOAsfnmZNCtl56xcua+WaldZHXk4gmM4BQfOoQZ3UIcGEHiCZ3iFN2tsvVjv1sd8NGdlO4fwB9bnD4+bkfU=</latexit>

Figure 21: Target and adversarial (ADV2) inputs and their attributionmaps on DenseNet.


0

0.1

0.2

0.3

0.4

0.5


0

0.2

0.4

0.6

0.8

1

IoU

Sco

re



Figure 22: L1 measures (a) and IoU scores (b) of adversarial mapswith respect to benign and target cases on DenseNet.

19

https://challenge2018.isic-archive.com/task3/

interpretable deep learning under fire - arxiv · and even playing go [49]), enabling use cases...

Documents