interpretable adversarial perturbation in input embedding...

21
Interpretable Adversarial Perturbation in Input Embedding Space for Text Motoki Sato [1] , Jun Suzuki [2,4] , Hiroyuki Shindo [3,4] , Yuji Matsumoto [3,4] [1] Preferred Networks, Inc. [2] NTT Communication Science Laboratories [3] Nara Institute of Science and Technology [4] RIKEN Center for Advanced Intelligence Project

Upload: others

Post on 08-Jul-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

Interpretable Adversarial Perturbation in Input Embedding Space for Text

Motoki Sato [1], Jun Suzuki [2,4], Hiroyuki Shindo[3,4], Yuji Matsumoto[3,4]

[1] Preferred Networks, Inc. [2] NTT Communication Science Laboratories[3] Nara Institute of Science and Technology [4] RIKEN Center for Advanced Intelligence Project

Page 2: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

2 / 30

Overview of this paper

lAdversarial Training for Text

better

+ →+ →Word Vector

??? (symbol)

Adversarial Example (continuous)

Continuous

Discrete We can not interpret this vector in discrete space

Adversarial Perturbation

Page 3: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

3 / 30

Adversarial Perturbationl 2 , 2 5 5C4 2 2 5

10H 5G ������ 5 E ����� � 2

l 10H 5G ������ 2 I C A C 3 E :A . l 1 5 E ������ 2 I 3 3 5 ,3 AA 5 A3 3 3 A .

0 ������

5 1 2, 2 5

2

Page 4: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

4 / 30

Adversarial Perturbationl 2 , 2 5 0 2 2 2 5

5 4 ������ 11 , 1 ����� �

0 ������

5 1 2, 2 5

2

The gradient of loss function

Page 5: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

5 / 30

Adversarial Training

l 1 171 A 7 1 2 7 1 Goodfellow et al .,2015]

5

0 1 7 7

Original Training Data Adversarial Example

l 0 1 1 7 0 1 ������ ,– , - , -

- , - -

0 1 7 7 2 1

Page 6: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

6 / 30

Our main idea

better

worse bad

good

better

worse bad

good

u Restrict the directions of the perturbations u Restrict toward the locations of existing words.

Previous Method Ours

[Miyato et al., 2017]

Page 7: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

7 / 30

Related Work

Page 8: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

8 / 30

Related Work

– Human Knowledge [Jia and Liang, 2017]

Fooling Reading Comprehension System using crowdsourcing

– Random search [Belinkov and Bisk, 2018]

Random character-level swaps can break output of Neural

Machine Translation

– Synonym dictionary [Samanta and Mehta, 2017]

Replacing a word with its synonym

l Existing methods require human knowledge or heuristic8

Page 9: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

9 / 30

Previous Method , ������

l Takeru Miyato, Andrew M Dai, and Ian Goodfellow, ICLR 2017“Adversarial training methods for semi-supervised text classification.”

l Uni-directional LSTM + Pre-Training (Language Model) + Adversarial Training

Word VectorAdversarial Perturbation

Page 10: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

10 / 30

Word Vector�Adversarial Perturbation�

�Word vector with perturbation

Definition

� Find to increase the lossfunction

How to obtain the perturbation

� Compute the gradient with L2 normalization

Previous Method ������

�hyper-parameter (e.g.: 1.0)

Page 11: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

11 / 30

Our Method

Definition

�Direction Vector

�Compute α with the gradient

�Compute the perturbation

How to obtain the perturbation?

: is a weight for the direction

Word Vector�Adversarial Perturbation�

Page 12: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

12 / 30

Experiments

Page 13: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

13 / 30

Experiments

l Setting

– Text Classification : IMDB (Sentiment Analysis)

– Sequence Labeling : FCE-public (Grammatical Error Detection)

13Word Vector�Adversarial Perturbation�

Page 14: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

14 / 30

Evaluation by task performance

14• 5 5 0 .5

5) (

Baseline 7.05 (%) 39.21

Random Perturbation 6.74 (%) 39.90

AdvT-Text [Miyato et al., 2017] 6.12 (%) 42.28

iAdvT-Text (Ours) 6.08 (%) 42.26

VAT-Text [Miyato et al., 2017] 5.69 (%) 41.81

iVAT-Text (Ours) 5.66 (%) 41.88

SEC: Sentiment Classification Task (IMDB)GED: Grammatical Error Detection Task (FCE-public)

Page 15: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

15 / 30

Model Analysis

Page 16: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

16 / 3016

( ) )

Right Words reconstructed fromperturbations

We visualized the perturbations for understanding its behavior.

(( )

) )

Words reconstructed from perturbations

Visualization of sentence-level perturbationsOurs

Page 17: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

17 / 3017

) ( ( ( ( ) (

Right Most cosine similar words

Cosine similarities betweenperturbation and words.

( ( )( ) () )

( ( (

) ) ) () )

Words reconstructed from perturbations

Visualization of sentence-level perturbations[Miyato et al., 2017]

Page 18: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

18 / 30

Creating Adversarial Example

18

Test Text

Predict: PositiveAdversarial Example

Predict: Negative

There is really but one thing to say about this sorry movie It should never have been made The first one one of my favourites An American Werewolf in London is a great movie with a good plot good actors and good FX But this one It stinks to heaven with a cry of helplessness <eos>

There is really but one thing to say about that sorry movie It should never have been made The first one one of my favourites An American Werewolf in London is a great movie with a good plot good actors and good FX But this one It stinks to heaven with a cry of helplessness <eos>

Find the largest perturbation and replace the original word with one that matches the largest perturbation.

Page 19: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

19 / 30

Conclusion

l We discussed the interpretability of adversarial perturbation in

the NLP (Text) field.l Our methods can generate reasonable adversarial texts and

interpretable visualizations.

19

Code: https://github.com/aonotas/interpretable-adv

Thank you!

Page 20: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

20 / 30

Page 21: Interpretable Adversarial Perturbation in Input Embedding ...sato-motoki.com/pdf/ijcai2018_slide.pdf · have been made The first one one of my favourites An American Werewolf in London

21 / 30

Adversarial ExampleOriginal Text (Incorrect)

Adversarial Example

Prediction

Prediction

Incorrect → Correct