
Adversarial Attacks and Defense on Texts: A Survey

Aminul Huq & Mst. Tasnim Pervin
ID: 2019280161 & 2019280162

Dept. of Computer Science & Technology, Tsinghua University

{huqa10,pervinmt10}@mails.tsinghua.edu.cn

Abstract

Deep learning models have been used widely in recent years for various purposes such as object recognition, self-driving cars, face recognition, speech recognition, sentiment analysis, and many others. However, it has been shown that these models are vulnerable to small perturbations that force them to misclassify. This issue has been studied profoundly in the image and audio domains; very little has been studied concerning textual data, and even fewer surveys have been conducted to understand the different types of attacks and defense techniques. In this manuscript, we collect and analyze different attacking techniques and various defense models to provide a more comprehensive picture. We then point out some interesting findings across the surveyed papers and the challenges that need to be overcome to move this field forward.

1 Introduction

Since the beginning of the past decade, the study and application of Deep Neural Network (DNN) models have sky-rocketed in every research field. They are currently being used for computer vision (Buch et al., 2011; Borji et al., 2014), speech recognition (Deng et al., 2013; Huang et al., 2015), medical image analysis (Litjens et al., 2017; Shen et al., 2017), natural language processing (Zhang et al., 2018; Otter et al., 2020), and many more. DNNs are capable of solving large-scale complex problems with relative ease, which tends to be difficult for regular statistical machine learning models; that is why DNNs are used so widely in real-world applications. However, Szegedy et al. (2013) showed that DNN models are not that robust in the image domain: they are, in fact, quite easy to fool and can be tampered with in such a way that they obey the will of an adversary. This study caused an uproar in the research community, and researchers started to explore this issue in other areas as well. They showed that DNN models are also vulnerable in object recognition systems (Goodfellow et al., 2014), audio recognition (Carlini and Wagner, 2018), malware detection (Grosse et al., 2017), and sentiment analysis systems (Ebrahimi et al., 2017). An example of adversarial attacks is shown in Figure 1.

The number of studies on adversarial attacks and defenses in the image domain far outnumbers the number of studies performed on textual data (Wang et al., 2019a). In Natural Language Processing (NLP), different attacks and defenses have been employed for applications such as sentiment analysis, machine translation, and question answering. Papernot et al. (2016) paved the way in NLP by showing that adversarial attacks can be implemented for textual data as well. Since then, various research has explored adversarial attacks and defenses in the textual domain.

Adversarial attacks are a security concern for all real-world applications that currently run on DNNs, and NLP is no exception. Many deployed real-world systems are based on DNNs, such as sentiment analysis (Pang and Lee, 2008), text question answering (Gupta and Gupta, 2012), machine translation (Wu et al., 2016), and many others. Users rely on these applications to get suggestions about a product, movie, or restaurant, or to translate texts. An adversary could easily use attack techniques with ill intent to provide wrong recommendations or falsify texts. Since these attacks go unnoticed by models that lack robustness, such applications would lose their value. Thus, the study of adversarial attacks and defenses with respect to text is of utmost importance.

In this review, our major contributions can be summarized as follows.


Figure 1: Adversarial attacks on image and audio data (Gong and Poellabauer, 2018)

• We provide a systematic analysis and study of the different adversarial attack and defense techniques presented in research on classification, machine translation, question answering, and other tasks.

• We present adversarial attack and defense techniques organized by attack level.

• After going through all the research works, we try to answer which attack and defense techniques have an advantage over others.

• Finally, we present some interesting findings from the surveyed works and point out challenges that need to be overcome in the future.

Our manuscript is organized as follows. Related research works are mentioned in Section 2. In Section 3 we provide preliminary information about adversarial machine learning for both image and textual data, along with a classification of different attack and defense approaches. Following this, in Section 4 we discuss various attack techniques based on the taxonomy presented in Section 3. To defend models against these attacks, we discuss different defense techniques in Section 5. We provide an in-depth discussion of our findings and some open challenges in Section 6, and Section 7 concludes the manuscript.

2 Related Works

For textual data this topic has not been explored as much as it has for the image domain. Because of this, only a small number of publications exist, and even fewer review works have been done. We were able to go through three review papers related to this topic (Belinkov and Glass, 2019; Xu et al., 2019; Zhang et al., 2020). Belinkov and Glass (2019) mainly studied different analysis methods in NLP, visualization, and the type of information neural networks capture; they introduced adversarial examples to explain that traditional DNN models are weak. Xu et al. (2019) explained adversarial examples in all domains, i.e., image, audio, text, etc.; their survey was not specialized for text and was mostly related to the image domain. Zhang et al. (2020) described different publications related to adversarial examples; unfortunately, their survey focused largely on attack strategies and shed little light on defense techniques, mainly discussing data augmentation, adversarial training, and one distillation technique proposed by Papernot et al. (2016). Table 1 provides a comparative analysis between these survey papers and ours.

Table 1: Comparison with recent surveys; the number of stars indicates how much each topic (coverage of the text domain, attacks, and defense) is discussed by Belinkov (2019), Xu (2019), Zhang (2020), and our survey.

3 Adversarial Machine Learning

Modern machine learning and deep learning have reached new heights because of high computational power and powerful architectures, which can give the impression of being fail-proof. However, recent advances in adversarial attack research have broken this illusion: a well-performing model can be made to misbehave by a simple attack with adversarial examples. An adversarial example is a specimen of input data that has been slightly transformed in such a way that it fools a machine learning classifier into misclassifying it. The main idea behind this attack is to inject some noise into the input, unnoticeable to the human eye, so that the resulting prediction changes from the actual class to another class. This illustrates the threat this kind of attack poses to classification models.


3.1 Definition

For a given input and its label (x, y), and a classifier F that maps inputs x to their designated labels y, in the general case we can write F(x) = y. In an adversarial attack, a small perturbation δ is added to the input before it is given to the classifier F. This perturbation is imperceptible to human eyes and is limited by a threshold, ||δ|| < ε. In this case, the classifier is unable to map the perturbed input to its original label, i.e., F(x + δ) ≠ y. The concept of imperceptibility is discussed at length in Section 6.2.

A robust DNN model should be able to look beyond this added perturbation and classify the input data properly, i.e., F(x + δ) = y.
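To make the definition concrete, the following is a minimal sketch, assuming a differentiable PyTorch classifier (here called model) over continuous inputs such as images; it implements the classic FGSM perturbation of Goodfellow et al. (2014), which produces a δ with ||δ||∞ ≤ ε. Text requires discrete edit operations instead, which are discussed in Section 4.

    import torch
    import torch.nn.functional as F

    def fgsm_example(model, x, y, eps):
        # x: input batch, y: true labels; model: any differentiable classifier.
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        delta = eps * x.grad.sign()          # ||delta||_inf <= eps by construction
        x_adv = (x + delta).detach()
        # The attack succeeds on the samples where F(x + delta) != y:
        fooled = model(x_adv).argmax(dim=1) != y
        return x_adv, fooled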

3.2 Existence of Adversarial Noises

Since adversarial examples were uncovered, a growing and difficult question has been looming over the research community: why do adversarial examples exist? Several hypotheses have been presented in an attempt to answer this question, but none has achieved unanimous agreement among researchers. The very first explanation comes from Szegedy et al. (2013), who argue that adversarial examples exist because there is too much non-linearity and the networks are not regularized properly. Opposing this hypothesis, Goodfellow et al. (2014) argue that they exist because of too much linearity in machine learning and deep learning models. Various activation functions used today, such as ReLU and Sigmoid, are straight lines in their middle parts; since we want to protect gradients from vanishing or exploding, we tend to keep activation functions close to linear. Hence, if a small noise is added to the input, this linearity lets it propagate in the same direction, accumulate through the classifier, and produce a misclassification. Another hypothesis is the tilted boundary (Tanay and Griffin, 2016): since the classifier is never able to fit the data exactly, there is room for adversarial examples to exist near the decision boundaries. A recent paper argues that adversarial examples are not bugs but features, reflecting how deep neural networks perceive their input (Ilyas et al., 2019). They classify features into two categories, robust and non-robust, and show that by adding small noises, non-robust features can make the classifier provide a wrong prediction.

3.3 Classification of Adversarial Examples

Here, we provide a basic taxonomy of adversarial attacks and defense techniques based on different metrics. Adversarial attacks can first be classified by how much knowledge the adversary has about the model. We can divide them into two types.

• White-Box Attacks: To execute these attacks, the adversary needs full access to the classifier model. Using the model parameters, architecture, inputs, and outputs, the adversary launches the attack. These attacks are the most effective and harmful since the adversary has access to the whole model.

• Black-Box Attacks: These attacks represent real-life scenarios where the adversary has no knowledge of the model architecture and only knows the inputs and outputs of the model. To obtain further information, the adversary issues queries.

We can also classify attacks based on the goal of the adversary, again into two types (both objectives are sketched below).

• Non-Targeted Attack: The adversary does not care which label the model produces; they are only interested in reducing the accuracy of the model, i.e., F(x + δ) ≠ y.

• Targeted Attack: The adversary forces the model to produce a specific output label for a given input, i.e., F(x + δ) = y*.
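As a small illustration of the two goals, an attacker ascends one of the following objectives with respect to the perturbation; this is only a sketch of the loss functions involved, with logits assumed to come from a PyTorch model.

    import torch.nn.functional as F

    def attack_objective(logits, y_true, y_target=None):
        if y_target is None:
            # Non-targeted: push the prediction away from the true label,
            # i.e. make F(x + delta) != y.
            return F.cross_entropy(logits, y_true)
        # Targeted: pull the prediction towards the chosen label y*,
        # i.e. make F(x + delta) = y*.
        return -F.cross_entropy(logits, y_target)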

These are the general classifications of attacks. However, for NLP tasks we can classify attacks differently: since text data differ from image or audio data, the attack strategies and attack types are somewhat different. Based on which components of the text are modified, we can classify attack techniques into four types: character-level, word-level, sentence-level, and multi-level attacks. In these adversarial attacks, pieces of the text are generally inserted, removed, swapped/replaced, or flipped, though not all of these options have been explored at every attack level.

• Character-Level Attack: Individual characters are modified: new characters, special characters, or numbers are added to the text, swapped with a neighbor, removed from a word, or flipped (the basic edit operations are sketched after this list).

• Word-Level Attack: Words in the text are replaced with synonyms or antonyms, changed to appear as typing mistakes, or removed completely.

• Sentence-Level Attack: Generally, new sentences are inserted into the text as adversarial examples. No other approach has been explored yet.

• Multi-Level Attack: Attacks that combine character, word, and sentence level operations are called multi-level attacks.
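The basic character-level edit operations referred to above can be sketched as follows; the positions and replacement symbols are chosen here at random, whereas an actual attack picks them to maximize the damage to the classifier.

    import random

    def char_insert(word, ch="*"):
        i = random.randrange(len(word) + 1)
        return word[:i] + ch + word[i:]

    def char_swap(word):
        # Swap a character with its right-hand neighbour.
        if len(word) < 2:
            return word
        i = random.randrange(len(word) - 1)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]

    def char_delete(word):
        if not word:
            return word
        i = random.randrange(len(word))
        return word[:i] + word[i + 1:]

    def char_flip(word, ch="1"):
        # "Flip" one character to another symbol, e.g. a digit or look-alike.
        if not word:
            return word
        i = random.randrange(len(word))
        return word[:i] + ch + word[i + 1:]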

Figure 2: Adversarial attacks classification

Adversarial defense techniques have been studied in mainly three directions (Akhtar and Mian, 2018). They are:

• Modified Training/Input: The model is trained differently to learn more robust features and become aware of adversarial attacks. During testing, inputs can also be modified to make sure no adversarial perturbation has been added.

• Modifying Networks: Defense is sought by adding layers or sub-networks and by changing loss or activation functions.

• Network Add-on: External networks are used as additional components for classifying unseen data.

Reference         Attack type              Application
Ebrahimi (2018)   White-box and black-box  Machine translation
Belinkov (2017)   Black-box                Machine translation
Gao (2018)        Black-box                Classification
Li (2018)         White-box and black-box  Classification
Gil (2019)        Black-box                Classification
Hosseini (2017)   Black-box                Classification

Table 2: Character-level attack types with applications.

4 Adversarial Attacks

Most of the literature concerns attack techniques, so that is where we start our discussion. In this section, we analyze in detail different attack techniques published in recent years. To provide a clear understanding, we divide our explanations based on the taxonomy for NLP presented in Section 3.

4.1 Character-Level Attack

As mentioned before, character-level attacks include schemes that try to insert, modify, swap, or remove a character, number, or special character. Ebrahimi (2018) worked on generating adversarial examples for character-level neural machine translation. They provided white-box and black-box attack techniques and showed that white-box attacks were more damaging than black-box attacks. They proposed a controlled adversary that tried to mute a particular word in the translation and a targeted adversary that aimed to push a word into it. They used gradient-based optimization, and to edit the text they performed four operations: insert a character, swap two characters, replace one character with another, and delete a character. For the black-box attack, they simply picked a character at random and made the corresponding change.

Belinkov (2017) also worked with character-based neural machine translation. In their paper, they did not use or assume any gradients; they relied on natural and synthetic noise for generating adversarial examples. For natural noise, they collected errors and mistakes from various datasets and replaced correct words with the erroneous ones. To generate synthetic noise, they relied on swapping characters, randomizing the characters of a word except for the first and last one, randomizing all the characters, and replacing one character with a character from its keyboard neighborhood.

Figure 3: Example of a character-level attack (TEXTBUGGER) (Li et al., 2018)
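A minimal sketch of two of these synthetic noise operations is given below; the keyboard map shown is a tiny hypothetical excerpt, not the one used in the paper.

    import random

    def scramble_middle(word):
        # Keep the first and last characters and shuffle the rest.
        if len(word) <= 3:
            return word
        middle = list(word[1:-1])
        random.shuffle(middle)
        return word[0] + "".join(middle) + word[-1]

    KEYBOARD_NEIGHBORS = {"a": "qs", "e": "wr", "o": "ip", "t": "ry"}  # tiny excerpt

    def keyboard_typo(word):
        # Replace one character with one of its keyboard neighbours.
        candidates = [i for i, c in enumerate(word) if c in KEYBOARD_NEIGHBORS]
        if not candidates:
            return word
        i = random.choice(candidates)
        return word[:i] + random.choice(KEYBOARD_NEIGHBORS[word[i]]) + word[i + 1:]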

Another black-box attack, coined DEEPWORDBUG, was proposed by Gao (2018). They worked on classification tasks with the Enron spam email and IMDB datasets. Since in the black-box setting an adversary does not have access to gradients, they proposed a two-step process to determine which words are the most significant: a temporal score and a temporal tail score are calculated for each word. The temporal score measures how much effect each word has on the classification result when it is appended to the preceding words. The temporal tail score is the complement of the temporal score: it compares the results for two trailing parts of a sentence, one that contains a particular word and one that does not.
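A hedged sketch of this scoring step is shown below, assuming a hypothetical black-box query function predict_proba(text, label) that returns the model's probability for the current label; the exact way the two scores are combined (the paper weights the tail score) is simplified here.

    def word_importance(words, label, predict_proba):
        scores = []
        for i in range(len(words)):
            # Temporal score: effect of appending word i to its prefix.
            head = (predict_proba(" ".join(words[:i + 1]), label)
                    - predict_proba(" ".join(words[:i]), label))
            # Temporal tail score: effect of word i on the trailing part.
            tail = (predict_proba(" ".join(words[i:]), label)
                    - predict_proba(" ".join(words[i + 1:]), label))
            scores.append(head + tail)   # the paper combines them with a weight
        return scores                    # higher score = more important word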

TEXTBUGGER is both a white-box and black-box attack framework proposed by Li (2018). To generate bugs (adversarial examples) they focused on five kinds of edits: insertion, deletion, swapping, substitution with a visually similar word, and substitution with a semantically similar word. For white-box attacks, they proposed a two-step approach: first determine which words are most significant with the help of the Jacobian matrix, then generate all five bugs and choose the one that is most effective at reducing accuracy. For black-box attacks, they propose a three-step approach: since there is no access to the gradient, they first determine which sentence is the most important, then which word is the most significant, and finally generate the five bugs for it and choose the most effective one.

Gil (2019) transformed a white-box attack technique into a black-box one. They generated adversarial examples with a white-box attack and then trained a neural network model to imitate the overall procedure, transferring the adversarial example generation of the HotFlip approach to a neural network. They coined these distilled models DISTFLIP. Their approach has the advantage of not depending on the optimization process, which makes adversarial example generation faster.

Reference         Attack type              Application
Papernot (2016)   White-box                Classification
Samanta (2017)    White-box                Classification
Liang (2017)      White-box and black-box  Classification
Alzantot (2018)   Black-box                Classification
Kuleshov (2008)   White-box                Classification
Zang (2019)       Black-box                Classification

Table 3: Word-level attack types with applications.

Perspective is an API built by Google and Jigsaw to detect toxicity in comments. Hosseini (2017) showed that it can be deceived by modifying its inputs. They did not use any calculated approach; they mainly modified toxic words by adding a dot (.) or a space between two characters or by swapping two characters, and showed that the API then assigned a lower toxicity score than before.

4.2 Word Level Attack

Papernot (2016) was the first to generate adversarial examples for text. They used a computational graph unfolding technique to calculate the forward derivative and, with its help, the Jacobian, which is then used to generate adversarial examples with the FGSM technique. The replacement words are chosen randomly, so the sentence does not keep its original meaning or grammatical correctness.

Figure 4: Example of a word-level attack (Alzantot et al., 2018)

To change a particular text classification label with the minimum number of alterations, Samanta (2017) proposed a model in which they either insert a new word, delete one, or replace one. They first determine which words contribute most to the classifier's decision: a word is highly contributing if removing it changes the class probability to a large extent. To replace words, they created a candidate pool based on synonyms, typos that produce meaningful words, and genre-specific words. A particular word is then changed based on the following conditions (a Python sketch follows the pseudocode).

if the word is an adverb and highly contributing then
    remove it
else
    choose a word from the candidate pool
    if the word is an adjective and the candidate word is an adverb then
        insert the candidate word
    else
        replace the particular word with the candidate word
    end if
end if
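The same decision rule can be written as a short Python sketch; pos_tag and candidate_pool are hypothetical helpers standing in for a part-of-speech tagger and the synonym/typo/genre-specific candidate pool described above.

    def perturb(words, i, candidate_pool, pos_tag):
        # words: token list; i: index of a highly contributing word.
        word = words[i]
        if pos_tag(word) == "ADV":
            return words[:i] + words[i + 1:]                 # remove the adverb
        candidate = candidate_pool[word][0]                  # best-ranked candidate
        if pos_tag(word) == "ADJ" and pos_tag(candidate) == "ADV":
            return words[:i] + [candidate] + words[i:]       # insert next to the adjective
        return words[:i] + [candidate] + words[i + 1:]       # replace the word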

Liang (2017) proposed white-box and black-box attack strategies based on insertion, deletion, and modification. To generate adversarial examples they used a natural language watermarking technique (Atallah et al., 2001). To perform the white-box attack, they introduced the concepts of Hot Training Phrases (HTP) and Hot Sample Phrases (HSP), which are obtained with the help of backpropagation by computing the cost gradients of each character. HTPs help determine what needs to be inserted, while HSPs help determine where to insert, delete, and modify. For black-box attacks, they borrowed the idea of fuzzing (Sutton et al., 2007) to probe the model for HTPs and HSPs.

To preserve syntactic and semantic meaning, Kuleshov (2008) used thought vectors, taking inspiration from work that maps sentences to vectors (Bengio et al., 2003; Mikolov et al., 2013) so that sentences with similar meanings are placed close together. To ensure semantic meaning they introduce a syntactic constraint. Their approach is iterative: in each iteration they replace only one word with its nearest neighbor, choosing the replacement that changes the objective function the most.

A genetic algorithm based black-box attack technique was proposed by Alzantot (2018). They tried to generate adversarial examples that are semantically and syntactically similar to the original input. For a particular sentence, they randomly select a word and replace it with a suitable word that fits the context of the sentence. To do this, they compute the few nearest neighbor words in the GloVe embedding space, then use Google's 1-billion-word language model to filter out words that do not match the context. After that, they select the word that maximizes the target prediction and insert it into the sentence.
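The nearest-neighbour lookup at the heart of this attack can be sketched as follows, assuming embeddings is a dict that maps words to NumPy vectors (e.g. GloVe); in the actual attack the candidates are further filtered by the language model, and the one maximizing the target prediction is kept.

    import numpy as np

    def nearest_neighbors(word, embeddings, k=8):
        v = embeddings[word]
        def cosine(u):
            return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))
        scored = [(cosine(vec), w) for w, vec in embeddings.items() if w != word]
        scored.sort(reverse=True)
        return [w for _, w in scored[:k]]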

An improvement of this genetic algorithm based attack was proposed by Wang (2019b). They modified it by allowing a single word of a particular sentence to be changed multiple times, while requiring the replacement to remain a synonym of the original word.

A sememe-based word substitution method using particle swarm optimization was proposed by Zang et al. (2019). A sememe is the minimum semantic unit in human language. They argued that word embedding and language model based substitution methods can find many replacements, but these are not always semantically correct or related to the context. They compared their work with the attack technique of Alzantot et al. (2018) and showed that their approach was better.


Reference        Attack type              Application
Jia (2017)       White-box and black-box  Question answering
Wang (2018)      White-box                Classification
Zhao (2017)      Black-box                Natural language inference
Cheng (2019)     White-box                Machine translation
Michel (2019)    White-box                Machine translation

Table 4: Sentence-level attack types with applications.

4.3 Sentence Level Attack

In the domain of question answering, Jia and Liang (2017) introduced two attack techniques called ADDSENT and ADDANY, along with two variants, ADDONESENT and ADDCOMMON. ADDONESENT is model-independent, i.e., a black-box attack. Using these attacks, they generated adversarial sentences that do not contradict the original answer and inserted them at the end of the paragraph. To show the effectiveness of ADDSENT and ADDANY, they applied them to 16 different classifiers and showed that all of them suffered a reduced F1 score. Figure 5 provides a visual representation of how the ADDSENT and ADDANY attacks generate text.

Another group of researchers, Wang and Bansal (2018), worked on modifying the ADDSENT model. They proposed two modifications and named their model ADDSENTDIVERSE. ADDSENT creates fake answers that are semantically irrelevant but follow a syntax similar to the question. In ADDSENTDIVERSE, they aimed to generate adversarial examples with higher variance, where distractors have randomized placements so that the set of fake answers is expanded. Moreover, to address the antonym-style semantic perturbations used in ADDSENT, they added semantic relationship features, enabling the model to identify the semantic relationships among the contexts of questions with the help of WordNet. The paper shows that the ADDSENTDIVERSE-trained model beats the ADDSENT-trained model by an average improvement of 24.22% in F1 score across three different classifiers, indicating an increase in robustness.

Zhao (2017) proposed a new framework utilizing Generative Adversarial Networks (GANs) on the Stanford Natural Language Inference (SNLI) dataset to generate grammatically legible and natural adversarial examples that are valid and semantically close to the input, and that can probe the local behavior of the input by searching the semantic space of a continuous data representation. They applied these adversaries in different applications such as image classification, machine translation, and textual entailment to evaluate the performance of their approach on black-box classifiers such as ARAE (Adversarially Regularized Autoencoder), LSTM, and TreeLSTM. With this work they showed that their model can generate adversaries that pass common-sense reasoning by logical inference and can expose the vulnerability of the Google Translate model during machine translation.

Cheng (2019) worked with neural machine translation and proposed a gradient-based white-box attack technique called AdvGen. Guided by the training loss, they used a greedy-choice-based approach to find the best solution. They also incorporated a language model, since it is computationally easier for approximating an otherwise intractable solution and it helps retain some semantic meaning. Their paper uses adversarial examples both for attack generation and for improving the robustness of the model.

Michel (2019) also worked with neural machine translation and proposed a natural criterion for untargeted attacks: adversarial examples should be meaning-preserving on the source side but meaning-destroying on the target side. That is, they focus on preserving the meaning of the source sentence while pushing adversarial changes into it. They propose a white-box attack that uses the gradients of the model to replace one word in the sentence so as to maximize the loss. To preserve the meaning of the sentence, they use kNN to determine the ten words most similar to a given word, which has the advantage of preserving the semantic meaning of the sentence. They also allowed swapping characters to create substitute words; if the resulting word is out of vocabulary, they repeated the last character to generate the substitute word.


Figure 5: ADDANY and ADDSENT Attack Generation (Jia and Liang, 2017)

4.4 Multi-level Attack

Reference        Attack type              Application
Ebrahimi (2017)  White-box                Classification
Blohm (2018)     White-box and black-box  Question answering
Wallace (2019)   White-box                Classification, question answering

Table 5: Multi-level attack types with applications.

HotFlip is a very popular, fast, and simple attack technique proposed by Ebrahimi (2017). It is a white-box, gradient-based attack. At the core of the attack lies a simple flip operation based on the directional derivatives of the model with respect to the one-hot encoded input; only one forward and one backward pass are required to estimate the best flip. The attack can also include insertions and deletions if they are represented as character sequences. After estimating which changes cause the highest classification error, a beam search finds a set of manipulations that work together to confuse the classifier. The original manuscript focused on character-level adversarial attacks, but the authors also showed that the approach extends to word-level attacks. Since flipping a word to another word risks losing the original meaning, they flipped a word only if certain conditions were satisfied: the cosine similarity of the word embeddings had to be higher than a given threshold and both words had to belong to the same part of speech. They did not allow stop words to be removed.
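A sketch of the first-order flip scoring that underlies this kind of attack is shown below; it assumes grad is the gradient of the loss with respect to the one-hot character input onehot (both of shape sequence length x vocabulary size), obtained from one backward pass.

    import torch

    def best_flip(grad, onehot):
        # Estimated loss increase for flipping position i from its current
        # character a to character b: score[i, b] = grad[i, b] - grad[i, a].
        current = (grad * onehot).sum(dim=1, keepdim=True)
        scores = grad - current
        scores[onehot.bool()] = float("-inf")   # never "flip" to the same character
        flat = torch.argmax(scores).item()
        pos, new_char = divmod(flat, scores.size(1))
        return pos, new_char, scores[pos, new_char].item()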

On the topic of question answering systems, Blohm (2018) implemented word- and sentence-level white-box and black-box attacks. They started by achieving a state-of-the-art score on the MovieQA dataset and then investigated the effect of different attacks.

• Word-level Black-box Attack: The authors substituted words manually by choosing lexical substitutions that preserved their meanings. To ensure the words were inside the vocabulary, they only switched to words included in the pretrained GloVe embeddings.

• Word-level White-box Attack: With the help of the attention model they used for classification, they determined which sentence and which word were the most important ones; altering these had a huge impact on the prediction results.

• Sentence-level Black-box Attack: Adopting the strategy of the ADDANY attack proposed by Jia and Liang (2017), they initialized a sentence with ten common English words. Each word is then changed to another word that reduces the prediction confidence the most.

• Sentence-level White-box Attack: Similar to the word-level white-box attack, they target the sentence with the highest attention, i.e., the plot sentence. They removed the plot sentence to see whether the classifier was indeed focusing on it and how this affected its prediction capability.

Wallace (2019) proposed a technique in which tokens are added at the beginning or end of a sentence. They attempt to find universal adversarial triggers that are optimized with a white-box approach but can also be transferred to other models. They start by choosing the trigger length, which is an important criterion: longer triggers are more effective but more noticeable than shorter ones. To replace the current tokens they took inspiration from the HotFlip approach of Ebrahimi et al. (2017). They showed that for text classification and sentiment analysis the triggers cause targeted errors, and that for reading comprehension the triggers can cause paragraphs to generate arbitrary target predictions.

5 Adversarial Defense

As mentioned earlier, most researchers in NLP have focused on attacking DNN models; few have focused on defending them. Here we divide the studied manuscripts into two sections: the first covers the most common approach, adversarial training (Goodfellow et al., 2014); the second includes research that tackles attacks with a specific defense technique.

5.1 Adversarial Training

Adversarial training is the process of training a model on both the original dataset and adversarial examples with correct labels. The general idea behind this approach is that, since the classifier is now exposed to both original and adversarial data, the model will learn to look beyond the perturbations and recognize the data properly.
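A minimal, PyTorch-style sketch of one adversarial training step is given below; attack_fn stands for any of the attack techniques of Section 4, and the balance between the clean and adversarial loss terms is a design choice.

    def adversarial_training_step(model, optimizer, loss_fn, batch, attack_fn):
        x, y = batch
        # Generate adversarial versions of the inputs, keeping the correct labels.
        x_adv = attack_fn(model, x, y)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y) + loss_fn(model(x_adv), y)
        loss.backward()
        optimizer.step()
        return loss.item()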

Belinkov (2017) showed in their experiments that training the model with different types of mixed noise improves its robustness to different kinds of noise. Li (2018) likewise showed that, for the TEXTBUGGER attack, adversarial training can improve model performance and robustness against adversarial examples. Zang (2019) showed that their sememe-based substitution and PSO-based optimization improved classifiers' robustness to attacks. By using CharSwap during adversarial training on their attack, Michel (2019) showed that adversarial training can also improve the robustness of the model. Ebrahimi (2017), in their HotFlip manuscript, also performed adversarial training. During their testing phase they used beam search, which was not used during training, so the adversary seen in training was not as strong as the one at test time; this is reflected in their adversarial training results. Although the model attains some robustness after training with adversarial examples, its accuracy is not as high as on the original test set.

5.2 Topic Specific Defense Techniques

One of the major problems with adversarial training is that the different types of attacks need to be known at training time. Since adversaries do not publicize their attack strategies, adversarial training is limited by the defender's knowledge. Moreover, if one tries to perform adversarial training against all known attacks, the model may no longer classify properly, as it would see relatively little of the original data.

In their research work, Alzantot (2018) found that their genetic algorithm based attack was unaffected by adversarial training. A likely reason is that their attack diversified the input so much that adversarial training had little effect on it.

To protect models from synonym-based attack techniques, Wang (2019b) proposed the synonym encoding method (SEM), which puts an encoder network before the classifier model and checks for perturbations. In this approach, they cluster and encode all synonyms to a unique code, thereby forcing neighboring words to have similar codes in the embedding space. They compared their approach with adversarial training on four different attack techniques and showed that the SEM-based technique was better against synonym substitution attacks.
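The idea of encoding every synonym cluster to one code before classification can be sketched as follows; synonym_clusters is a hypothetical iterable of word sets and the choice of representative is arbitrary.

    def build_synonym_encoder(synonym_clusters):
        table = {}
        for cluster in synonym_clusters:
            canonical = sorted(cluster)[0]        # one fixed representative per cluster
            for word in cluster:
                table[word] = canonical
        def encode(tokens):
            # Any synonym an attacker substitutes maps back to the same token,
            # so the classifier effectively sees an unchanged input.
            return [table.get(t, t) for t in tokens]
        return encode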

Adversarial spelling mistakes were the prime concern of Pruthi (2019). Their approach handles adversarial examples that include insertion, deletion, and swapping of characters, as well as keyboard mistakes. They used a semi-character-based RNN (ScRNN) word recognition model with three different back-off strategies: pass-through, back-off to a neutral word, and back-off to a background model. They tested their approach against adversarial training and data-augmentation-based defenses and found that ScRNN with the pass-through back-off strategy provided the highest robustness.

A defense framework named DISP was proposed by Zhou (2019) to determine whether a particular token is a perturbation. A discriminator proposes candidate perturbations, and an embedding estimator then restores the original word based on the context, with the help of a kNN search. The discriminator is trained on the original corpus to recognize which tokens are perturbations, and tokens of the embedding corpus are fed to the embedding estimator to train it to recover the original word. During the testing phase, the discriminator provides candidate tokens that are perturbations, and for each candidate the estimator provides an approximate embedding vector and attempts to restore the word; the restored text is then passed to the model for prediction. To evaluate the framework's ability to identify perturbation tokens, they compared their results against a spell-checking technique on three character-level attacks and two word-level attacks, and their approach achieved better results. To test robustness, they also compared against adversarial training, spell checking, and data augmentation, and their approach performed well in this experiment too.

6 Discussion

We now discuss some of the interesting findings we encountered while studying the different manuscripts, and shed some light on the challenges in this area.

6.1 Interesting Findings

Based on the papers we studied, we summarize and list some interesting findings here.

• Character-level Perturbations: It can be astounding to see that changing a single character can affect a model's prediction. Hosseini et al. (2017) showed that adding dots (.) or spaces in words can be enough to confuse the Perspective API. Likewise, in HotFlip the authors swapped a character based on the gradients to fool the model. So, when designing defense strategies, even character-level manipulations need to be considered.

• Research Direction: Most of the papers we studied were based on different attack strategies; very few focused on defending the model. The same can be said for multi-level attacks: only a few researchers have produced manuscripts on them.

• Adversarial Example Generation: Across the different manuscripts, we found that most researchers follow a two-step approach to generate adversarial examples: first find which word is the most significant for the prediction, and then replace it with a suitable candidate that benefits the adversary (a skeleton of this pattern is sketched below).
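A skeleton of this two-step pattern, common to many of the surveyed attacks, is sketched below; importance_fn, candidates_fn, and model_score (the probability of the original class) are hypothetical callbacks that each attack instantiates differently.

    def two_step_attack(words, importance_fn, candidates_fn, model_score):
        # Step 1: rank words by their contribution to the current prediction.
        order = sorted(range(len(words)),
                       key=lambda i: importance_fn(words, i), reverse=True)
        adv = list(words)
        for i in order:
            # Step 2: try candidate replacements and keep the most damaging one.
            best = min(candidates_fn(adv[i]),
                       key=lambda c: model_score(adv[:i] + [c] + adv[i + 1:]),
                       default=adv[i])
            adv[i] = best
            if model_score(adv) < 0.5:   # stop once the prediction flips
                break
        return adv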

6.2 Challenges

In our study, we found several challenges in this field. They are listed below.

• This phenomenon was first discovered in the image domain, where it has gained a lot of attention and many research works have been published. It may seem an easy assumption that these techniques can be used in the text domain as well. However, there is a significant difference: image data are continuous while text data are discrete. Hence, attacks proposed in the image domain cannot be utilized directly in the text domain.

• Another limitation of textual data is the concept of imperceptibility. In the image domain, the perturbation can often be made virtually imperceptible to human perception, causing humans and state-of-the-art models to disagree. In the text domain, however, small perturbations are usually clearly perceptible, and the replacement of a single word may drastically alter the semantics of the sentence and be noticeable to human beings. So, the formulation of imperceptibility for text remains an open issue.


Classification
  Datasets: IMDB, SST-2, MR, AG News, Yelp Review, DBpedia, Amazon Review, Trec07p, Enron Spam Detection
  References: Gao (2018), Li (2018), Ebrahimi (2017), Kuleshov (2008), Samanta (2017), Zang (2019), Liang (2017)

Machine Translation
  Datasets: TED Talks parallel corpus for IWSLT 2016, LDC corpus, NIST, WMT'14
  References: Ebrahimi (2018), Belinkov (2017), Cheng (2019), Michel (2019)

Natural Language Inference
  Datasets: SNLI
  References: Alzantot (2018), Zang (2019), Zhao (2017)

Question Answering
  Datasets: SQuAD, MovieQA
  References: Jia (2017), Blohm (2018)

Table 6: Datasets used by different researchers for attack generation.

• So far, no defense strategy can handle all the different types of attacks mentioned here. Each defense strategy targets a single type of attack: for example, spelling mistakes can be handled by the technique proposed by Pruthi et al. (2019), and synonym-based attacks by the SEM model. A unified model that can tackle all these issues has not been proposed yet.

• The concept of universal perturbation has still not been introduced for textual data. In the image domain, researchers established a method that generates a single perturbation capable of fooling the model.

• Whenever a new attack technique is proposed, researchers use different classifiers and datasets, as there is no benchmark. Table 6 shows that, for a given application, different datasets are used and no standard dataset is adopted for attack generation. Without a benchmark, it is not easy to compare different attack and defense strategies with each other; the lack of such a benchmark is a big gap in this field of research.

• There is no standard toolbox that can be used to easily reproduce different researchers' work. Several toolboxes exist for the image domain, such as cleverhans (Papernot et al., 2018), ART (Nicolae et al., 2018), and Foolbox (Rauber et al., 2017), but there is no standard toolbox for the text domain.

7 Conclusion

In this review, we discussed adversarial attack and defense techniques for textual data. Since the inception of adversarial examples, they have been an important research topic for many deep learning applications: DNNs perform very well on standard datasets but poorly in the presence of adversarial examples. We presented an accumulated view of why adversarial examples exist and of the different attack and defense strategies, organized by their taxonomy. We also pointed out several challenges that can be addressed to guide future research in this field.

References

Naveed Akhtar and Ajmal Mian. 2018. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410–14430.

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998.

Mikhail J Atallah, Victor Raskin, Michael Crogan, Christian Hempelmann, Florian Kerschbaum, Dina Mohamed, and Sanket Naik. 2001. Natural language watermarking: Design, analysis, and a proof-of-concept implementation. In International Workshop on Information Hiding, pages 185–200. Springer.

Yonatan Belinkov and Yonatan Bisk. 2017. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.


Matthias Blohm, Glorianna Jagfeld, Ekta Sood, Xiang Yu, and Ngoc Thang Vu. 2018. Comparing attention-based convolutional and recurrent neural networks: Success and limitations in machine reading comprehension. arXiv preprint arXiv:1808.08744.

Ali Borji, Ming-Ming Cheng, Qibin Hou, Huaizu Jiang, and Jia Li. 2014. Salient object detection: A survey. Computational Visual Media, pages 1–34.

Norbert Buch, Sergio A Velastin, and James Orwell. 2011. A review of computer vision techniques for the analysis of urban traffic. IEEE Transactions on Intelligent Transportation Systems, 12(3):920–939.

Nicholas Carlini and David Wagner. 2018. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pages 1–7. IEEE.

Yong Cheng, Lu Jiang, and Wolfgang Macherey. 2019. Robust neural machine translation with doubly adversarial inputs. arXiv preprint arXiv:1906.02443.

Li Deng, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Michael Seltzer, Geoff Zweig, Xiaodong He, Jason Williams, et al. 2013. Recent advances in deep learning for speech research at Microsoft. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8604–8608. IEEE.

Javid Ebrahimi, Daniel Lowd, and Dejing Dou. 2018. On adversarial examples for character-level neural machine translation. arXiv preprint arXiv:1806.09030.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. HotFlip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751.

Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pages 50–56. IEEE.

Yotam Gil, Yoav Chai, Or Gorodissky, and Jonathan Berant. 2019. White-to-black: Efficient distillation of black-box adversarial attacks. arXiv preprint arXiv:1904.02405.

Yuan Gong and Christian Poellabauer. 2018. Protecting voice controlled systems using sound source identification based on acoustic cues. In 2018 27th International Conference on Computer Communication and Networks (ICCCN), pages 1–9. IEEE.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel. 2017. Adversarial examples for malware detection. In European Symposium on Research in Computer Security, pages 62–79. Springer.

Poonam Gupta and Vishal Gupta. 2012. A survey of text question answering techniques. International Journal of Computer Applications, 53(4).

Hossein Hosseini, Sreeram Kannan, Baosen Zhang, and Radha Poovendran. 2017. Deceiving Google's Perspective API built for detecting toxic comments. arXiv preprint arXiv:1702.08138.

Jui-Ting Huang, Jinyu Li, and Yifan Gong. 2015. An analysis of convolutional neural networks for speech recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4989–4993. IEEE.

Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. 2019. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pages 125–136.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328.

V Kuleshov, S Thakoor, T Lau, and S Ermon. 2008. Adversarial examples for natural language classification problems. URL https://openreview.net/forum.

Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. 2018. TextBugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271.

Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2017. Deep text classification can be fooled. arXiv preprint arXiv:1704.08006.

Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I Sanchez. 2017. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88.

Paul Michel, Xian Li, Graham Neubig, and Juan Miguel Pino. 2019. On evaluation of adversarial perturbations for sequence-to-sequence models. arXiv preprint arXiv:1903.06620.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Maria-Irina Nicolae, Mathieu Sinn, Minh Ngoc Tran, Beat Buesser, Ambrish Rawat, Martin Wistuba, Valentina Zantedeschi, Nathalie Baracaldo, Bryant Chen, Heiko Ludwig, Ian Molloy, and Ben Edwards. 2018. Adversarial robustness toolbox v1.2.0. CoRR, 1807.01069.

Daniel W Otter, Julian R Medina, and Jugal K Kalita. 2020. A survey of the usages of deep learning for natural language processing. IEEE Transactions on Neural Networks and Learning Systems.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Nicolas Papernot, Fartash Faghri, Nicholas Carlini, Ian Goodfellow, Reuben Feinman, Alexey Kurakin, Cihang Xie, Yash Sharma, Tom Brown, Aurko Roy, Alexander Matyasko, Vahid Behzadan, Karen Hambardzumyan, Zhishuai Zhang, Yi-Lin Juang, Zhi Li, Ryan Sheatsley, Abhibhav Garg, Jonathan Uesato, Willi Gierke, Yinpeng Dong, David Berthelot, Paul Hendricks, Jonas Rauber, and Rujun Long. 2018. Technical report on the cleverhans v2.1.0 adversarial examples library. arXiv preprint arXiv:1610.00768.

Nicolas Papernot, Patrick McDaniel, Ananthram Swami, and Richard Harang. 2016. Crafting adversarial input sequences for recurrent neural networks. In MILCOM 2016 - 2016 IEEE Military Communications Conference, pages 49–54. IEEE.

Danish Pruthi, Bhuwan Dhingra, and Zachary C Lipton. 2019. Combating adversarial misspellings with robust word recognition. arXiv preprint arXiv:1905.11268.

Jonas Rauber, Wieland Brendel, and Matthias Bethge. 2017. Foolbox: A Python toolbox to benchmark the robustness of machine learning models. In Reliable Machine Learning in the Wild Workshop, 34th International Conference on Machine Learning.

Suranjana Samanta and Sameep Mehta. 2017. Towards crafting text adversarial samples. arXiv preprint arXiv:1707.02812.

Dinggang Shen, Guorong Wu, and Heung-Il Suk. 2017. Deep learning in medical image analysis. Annual Review of Biomedical Engineering, 19:221–248.

Michael Sutton, Adam Greene, and Pedram Amini. 2007. Fuzzing: Brute force vulnerability discovery. Pearson Education.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

Thomas Tanay and Lewis Griffin. 2016. A boundary tilting perspective on the phenomenon of adversarial examples. arXiv preprint arXiv:1608.07690.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for NLP. arXiv preprint arXiv:1908.07125.

Wenqi Wang, Benxiao Tang, Run Wang, Lina Wang, and Aoshuang Ye. 2019a. A survey on adversarial attacks and defenses in text. arXiv preprint arXiv:1902.07285.

Xiaosen Wang, Hao Jin, and Kun He. 2019b. Natural language adversarial attacks and defenses in word level. arXiv preprint arXiv:1909.06723.

Yicheng Wang and Mohit Bansal. 2018. Robust machine comprehension models via adversarial training. arXiv preprint arXiv:1804.06473.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Han Xu, Yao Ma, Haochen Liu, Debayan Deb, Hui Liu, Jiliang Tang, and Anil Jain. 2019. Adversarial attacks and defenses in images, graphs and text: A review. arXiv preprint arXiv:1909.08072.

Y Zang, C Yang, F Qi, Z Liu, M Zhang, Q Liu, and M Sun. 2019. Textual adversarial attack as combinatorial optimization. arXiv preprint arXiv:1910.12196.

Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1253.

Wei Emma Zhang, Quan Z Sheng, Ahoud Alhazmi, and Chenliang Li. 2020. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 11(3):1–41.

Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2017. Generating natural adversarial examples. arXiv preprint arXiv:1710.11342.

Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, and Wei Wang. 2019. Learning to discriminate perturbations for blocking adversarial attacks in text classification. arXiv preprint arXiv:1909.03084.