interpretable adversarial perturbation in input embedding...
TRANSCRIPT
Interpretable Adversarial Perturbation in Input Embedding Space for Text
Motoki Sato [1], Jun Suzuki [2,4], Hiroyuki Shindo[3,4], Yuji Matsumoto[3,4]
[1] Preferred Networks, Inc. [2] NTT Communication Science Laboratories[3] Nara Institute of Science and Technology [4] RIKEN Center for Advanced Intelligence Project
2 / 30
Overview of this paper
lAdversarial Training for Text
better
+ →+ →Word Vector
??? (symbol)
Adversarial Example (continuous)
Continuous
Discrete We can not interpret this vector in discrete space
Adversarial Perturbation
3 / 30
Adversarial Perturbationl 2 , 2 5 5C4 2 2 5
10H 5G ������ 5 E ����� � 2
l 10H 5G ������ 2 I C A C 3 E :A . l 1 5 E ������ 2 I 3 3 5 ,3 AA 5 A3 3 3 A .
0 ������
5 1 2, 2 5
2
4 / 30
Adversarial Perturbationl 2 , 2 5 0 2 2 2 5
5 4 ������ 11 , 1 ����� �
0 ������
5 1 2, 2 5
2
The gradient of loss function
5 / 30
Adversarial Training
l 1 171 A 7 1 2 7 1 Goodfellow et al .,2015]
5
0 1 7 7
Original Training Data Adversarial Example
l 0 1 1 7 0 1 ������ ,– , - , -
- , - -
0 1 7 7 2 1
6 / 30
Our main idea
better
worse bad
good
better
worse bad
good
u Restrict the directions of the perturbations u Restrict toward the locations of existing words.
Previous Method Ours
[Miyato et al., 2017]
7 / 30
Related Work
8 / 30
Related Work
– Human Knowledge [Jia and Liang, 2017]
Fooling Reading Comprehension System using crowdsourcing
– Random search [Belinkov and Bisk, 2018]
Random character-level swaps can break output of Neural
Machine Translation
– Synonym dictionary [Samanta and Mehta, 2017]
Replacing a word with its synonym
l Existing methods require human knowledge or heuristic8
9 / 30
Previous Method , ������
l Takeru Miyato, Andrew M Dai, and Ian Goodfellow, ICLR 2017“Adversarial training methods for semi-supervised text classification.”
l Uni-directional LSTM + Pre-Training (Language Model) + Adversarial Training
Word VectorAdversarial Perturbation
10 / 30
Word Vector�Adversarial Perturbation�
�Word vector with perturbation
Definition
� Find to increase the lossfunction
How to obtain the perturbation
� Compute the gradient with L2 normalization
Previous Method ������
�hyper-parameter (e.g.: 1.0)
11 / 30
Our Method
Definition
�Direction Vector
�Compute α with the gradient
�Compute the perturbation
How to obtain the perturbation?
: is a weight for the direction
Word Vector�Adversarial Perturbation�
12 / 30
Experiments
13 / 30
Experiments
l Setting
– Text Classification : IMDB (Sentiment Analysis)
– Sequence Labeling : FCE-public (Grammatical Error Detection)
13Word Vector�Adversarial Perturbation�
14 / 30
Evaluation by task performance
14• 5 5 0 .5
5) (
Baseline 7.05 (%) 39.21
Random Perturbation 6.74 (%) 39.90
AdvT-Text [Miyato et al., 2017] 6.12 (%) 42.28
iAdvT-Text (Ours) 6.08 (%) 42.26
VAT-Text [Miyato et al., 2017] 5.69 (%) 41.81
iVAT-Text (Ours) 5.66 (%) 41.88
SEC: Sentiment Classification Task (IMDB)GED: Grammatical Error Detection Task (FCE-public)
15 / 30
Model Analysis
16 / 3016
( ) )
Right Words reconstructed fromperturbations
We visualized the perturbations for understanding its behavior.
(( )
) )
Words reconstructed from perturbations
Visualization of sentence-level perturbationsOurs
17 / 3017
) ( ( ( ( ) (
Right Most cosine similar words
Cosine similarities betweenperturbation and words.
( ( )( ) () )
( ( (
) ) ) () )
Words reconstructed from perturbations
Visualization of sentence-level perturbations[Miyato et al., 2017]
18 / 30
Creating Adversarial Example
18
Test Text
Predict: PositiveAdversarial Example
Predict: Negative
There is really but one thing to say about this sorry movie It should never have been made The first one one of my favourites An American Werewolf in London is a great movie with a good plot good actors and good FX But this one It stinks to heaven with a cry of helplessness <eos>
There is really but one thing to say about that sorry movie It should never have been made The first one one of my favourites An American Werewolf in London is a great movie with a good plot good actors and good FX But this one It stinks to heaven with a cry of helplessness <eos>
Find the largest perturbation and replace the original word with one that matches the largest perturbation.
19 / 30
Conclusion
l We discussed the interpretability of adversarial perturbation in
the NLP (Text) field.l Our methods can generate reasonable adversarial texts and
interpretable visualizations.
19
Code: https://github.com/aonotas/interpretable-adv
Thank you!
20 / 30
21 / 30
Adversarial ExampleOriginal Text (Incorrect)
Adversarial Example
Prediction
Prediction
Incorrect → Correct