
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3080–3085, July 5–10, 2020. ©2020 Association for Computational Linguistics


Camouflaged Chinese Spam Content Detection with Semi-supervised Generative Active Learning

Zhuoren Jiang1∗, Zhe Gao2∗, Yu Duan2, Yangyang Kang2, Changlong Sun2, Qiong Zhang2, Xiaozhong Liu3†

1School of Public Affairs, Zhejiang University, Hangzhou, China
2Alibaba Group, Hangzhou & Sunnyvale, China & USA

3School of Informatics, Computing and Engineering, IUB, Bloomington, USA
[email protected], [email protected]

{derrick.dy,yangyang.kangyy,qz.zhang}@alibaba-inc.com, [email protected], [email protected]

Abstract

We propose a Semi-supervIsed GeNerative Active Learning (SIGNAL) model to address the imbalance, efficiency, and text camouflage problems of the Chinese text spam detection task. A “self-diversity” criterion is proposed for measuring the “worthiness” of a candidate for annotation. A semi-supervised variational autoencoder with a masked attention learning approach and a character variation graph-enhanced augmentation procedure are proposed for data augmentation. The preliminary experiment demonstrates that the proposed SIGNAL model is not only sensitive to spam sample selection, but can also improve the performance of a series of conventional active learning models for the Chinese spam detection task. To the best of our knowledge, this is the first work to integrate active learning and semi-supervised generative learning for text spam detection.

1 Introduction

The recent successes of learning-based models all share the same prerequisite: a decent labeled training dataset is available for a given task (Jiang et al., 2019b; Arora and Agarwal, 2007). However, the annotating process can be “a tedious, laborious, and time consuming task for humans” (Sharma et al., 2015). To achieve high task performance with low labeling cost, (pool-based) active learning algorithms (Cohn et al., 1996) have been proposed to select the most representative and informative samples to be labeled by human oracles (Druck et al., 2009). Although effective in general, in the Chinese text spam detection context, the following factors make active learning a challenging task:

∗These two authors contributed equally to this research.
†Corresponding author

Imbalance: in reality, the ratio of spam samples to normal ones is very imbalanced. For instance, in North America, “much less than 1% of SMS messages were spam” (Almeida et al., 2013). As a result, the active learning model should be more sensitive to spam samples. The general active learning methods, e.g., (Lewis and Gale, 1994; Li and Guo, 2013; Roth and Small, 2006), can hardly address this problem.

Efficiency: when competing with anti-spam models, spammers are constantly creating new forms of spam text (Xie et al., 2012; Jiang et al., 2019a). The amount of unlabeled samples is huge and keeps increasing. The classical diversity-based approach (Brinker, 2003; Xu et al., 2003), which iteratively compares each unlabeled sample with each labeled sample to select the most “diverse” ones for annotation, will perform poorly because its computational complexity is O(n²). An efficiency-oriented active learning algorithm is needed.

Camouflage1: Chinese characters have glyph and phonetic variations (Norman, 1988), e.g., “账 (account)” and “帐 (curtain)” have a similar structure and pronunciation. Spammers can take advantage of this characteristic to escape from detection algorithms (Jindal and Liu, 2007; Jiang et al., 2019a). It is important to propose a novel active learning model that can predict new Chinese character variation patterns that do not appear in the labeled dataset.

To address these challenges, we propose a novel solution, the Semi-supervIsed GeNerative Active Learning (SIGNAL) model, to naturally integrate active learning and semi-supervised generative learning into a unified framework.

1“Camouflaged text spam” refers to the intentional mutation of Chinese characters to escape from spam detection algorithms. The variation-based spam text is purposely created and highly camouflaged for machine learning algorithms. Typos in normal text are not spam.


SIGNAL is inspired by a simple yet powerful observation in the computer vision domain (Zhou et al., 2017): the patches generated from the same image share the same label, and are naturally expected to have similar predictions by the classifier. Hence, the diversity of the predictions for these patches can successfully measure the “power” of a candidate image in elevating the performance of the current classifier. Similarly, in this study, a set of semantically similar texts for each candidate sample is automatically generated through data augmentation. We hypothesize that the diversity of the predictions for the augmented texts is a useful indicator of a candidate text sample's ability to boost the performance of the classifier. We define this strategy as a “self-diversity” based active learning strategy.

Algorithmically, unsupervised generative models, such as the variational autoencoder (Kingma and Welling, 2013), only learn to generate similar texts without considering the labeling information. Therefore, we utilize a Semi-supervised Variational AutoEncoder (S-VAE) (Kingma et al., 2014) to automatically generate semantically similar texts for each candidate sample, while trying to keep label-consistency. To enable S-VAE to perceive the sensitive positions of the candidate sample, we enrich the human annotation feedback: the annotator is required to provide not only a label for the candidate but also a rationale (critical terms in the candidate) (Sharma et al., 2015) for the chosen spam label. Based on the human-annotated rationales, we introduce a pseudo-mask distribution P_m to guide the attention learning in S-VAE. A character variation graph-enhanced augmentation procedure is then applied to integrate the Chinese character variation knowledge and simulate the glyph and phonetic variation mutations in further data augmentation.

Compared with conventional active learning, SIGNAL offers three advantages: (1) SIGNAL is more sensitive to spam samples2. (2) SIGNAL does not need to compare each candidate with the labeled samples, which reduces its computational complexity to O(N). (3) SIGNAL considers the heterogeneous variation knowledge of Chinese characters for spam detection.

2More detailed information can be found in the experiment section.

The major contributions of this paper can be summarized as follows:

1. We propose the SIGNAL model, in the context of Chinese text spam detection, to address the imbalance, efficiency, and text camouflage problems. To the best of our knowledge, this is the first work to integrate active learning and semi-supervised generative learning for the text spam detection task.

2. The preliminary experiments on the Chinese SMS dataset demonstrate the efficacy and potential of SIGNAL for Chinese spam detection. A series of conventional active learning models can be improved after merging with the SIGNAL model.

3. While this study focuses on the Chinese spam detection task, SIGNAL theoretically has great potential to be applied to other NLP tasks: it can mitigate the data-hungry problem by cutting the labeling cost.

2 SIGNAL Model

Figure 1 depicts the proposed SIGNAL framework3. It starts with a small set of labeled samples, a large set of unlabeled samples, and an initial classifier trained on the labeled samples. The goal of SIGNAL is to seek “salient” samples from the pool of unlabeled samples for annotation. Then the classifier can be continuously improved by incrementally enlarging the training set with the newly annotated samples. The pseudocode of SIGNAL is described in Algorithm 1.

Self-Diversity Based Active Learning. As aforementioned, in SIGNAL, we develop a “self-diversity” criterion for active candidate selection. Formally, for a candidate sample x_i, a set of augmented texts AT_i = {at_i^1, at_i^2, ..., at_i^j, ..., at_i^M} is generated. The self-diversity SD_i of x_i can be defined as:

SD_i = \frac{\sum_{j=1}^{M} \left( p_i^j - \bar{p}_i \right)^2}{M}    (1)

where p_i^j is the prediction of the current classifier for the augmented text at_i^j; p̄_i is the arithmetic mean of all predictions for AT_i; and M is the total number of augmented texts. SD suggests the “worthiness” of a candidate for annotation. A large SD indicates that the current classifier's prediction for the target candidate is unstable: with a slight mutation, the prediction changes drastically. Such a candidate is worthy of annotation. This criterion has the potential to locate the vital samples and also to reduce the computational complexity.

3https://github.com/Giruvegan/generative-camouflaged-spam-detector


Figure 1: An Illustration of “SIGNAL” Framework

Furthermore, in the context of Chinese text spam detection, a spam candidate has a greater possibility of obtaining a larger SD. For instance, if a spam candidate mutates at its critical positions, the label of the augmented text is likely to change. On the contrary, normal candidates are less likely to be affected in this way.
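Because the criterion in Eq. 1 only needs the classifier's predictions on a candidate's own augmented texts, it can be sketched in a few lines. The snippet below is a minimal illustration, not the released implementation; `classifier.predict_proba` and `augment` are hypothetical stand-ins for the current classifier and for the S-VAE/graph augmentation described later.

```python
from statistics import mean

def self_diversity(candidate, classifier, augment, m=10):
    """Compute the self-diversity SD_i of one candidate text (Eq. 1).

    `augment(candidate, m)` is assumed to return m semantically similar
    texts; `classifier.predict_proba(text)` is assumed to return the
    current classifier's spam probability for one text.
    """
    augmented = augment(candidate, m)                          # AT_i
    preds = [classifier.predict_proba(t) for t in augmented]   # p_i^j
    p_bar = mean(preds)                                        # arithmetic mean
    return sum((p - p_bar) ** 2 for p in preds) / len(preds)

def select_candidates(pool, classifier, augment, k=100):
    """Rank the unlabeled pool by self-diversity and pick the top-k."""
    scored = [(self_diversity(x, classifier, augment), x) for x in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:k]]
```

Note that the ranking never touches the labeled set, which is what keeps the selection cost linear in the pool size.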

S-VAE with Masked Attention Learning. As shown in Figure 1, we utilize S-VAE with masked attention learning to generate similar texts at the semantic level. In this study, with annotated rationales R (a set of critical terms), a pseudo-mask distribution P_m is generated for each candidate sample. For the i-th term t_i of the candidate sample, the pseudo-mask probability Pr_i can be calculated as:

Pr_i = \frac{\rho \cdot I_R(t_i)}{\Delta}    (2)

where I_R(t_i) is an indicator function that determines whether t_i belongs to R; Δ is used for normalization; and ρ is the weight ensuring that the critical terms will receive less attention, in other words, that they have a greater possibility of being “masked” during the generative process.
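For illustration, the following toy sketch builds the pseudo-mask distribution over a candidate's terms under the product form of Eq. 2 reconstructed above; the names, the example text, and the exact parameterization are assumptions rather than the authors' released code.

```python
def pseudo_mask_distribution(terms, rationales, rho=1.0):
    """Sketch of the pseudo-mask distribution P_m over a candidate's terms (Eq. 2).

    Each term found in the annotated rationales R gets an unnormalized
    weight rho (other terms get 0), and Delta normalizes the weights
    into a distribution over positions.
    """
    raw = [rho if t in rationales else 0.0 for t in terms]  # rho * I_R(t_i)
    delta = sum(raw) or 1.0                                 # normalizer Delta
    return [w / delta for w in raw]

# Hypothetical usage: the annotator marks the camouflaged critical term.
terms = ["加", "薇", "信", "领", "红", "包"]
rationales = {"薇"}
print(pseudo_mask_distribution(terms, rationales))
```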

Following (Kingma et al., 2014), the generative semi-supervised model with masked attention learning can be defined as:

Pr(y) = \mathrm{Cat}(y \mid \pi)
Pr(z) = \mathcal{N}(z \mid 0, I)
Pr_\omega(x' \mid f_r(x)) = f_a(x'; f_r(x), \omega)
Pr_\theta(x' \mid y, z) = f(x'; y, z, \theta)    (3)

where x is a sample (labeled or unlabeled); f_r(x) is a matrix generated by a non-linear transformation of x; x' is a representation of f_r(x) obtained with an attention calculation, x' = \sum_i \omega_i f_r(x)_i; ω denotes the attention distribution, where each ω_i = softmax(f_c(f_r(x)))_i is a scalar; f_c is a single-dimensional non-linear transformation; Cat(y|π) is the multinomial distribution, and if x is unlabeled, the class labels y are treated as latent variables; z is the latent variable; and θ denotes the parameters of a non-linear transformation. Labeled samples can be used to train a classifier that predicts the class labels y. During the inference process, we can predict the missing class for an unlabeled sample from the inferred posterior distribution Pr_θ(y|x').

The loss function of S-VAE with masked attention learning is defined as:

\mathcal{L} = \mathcal{L}_{S\text{-}VAE} + \alpha \, D_{KL}(P_{att} \,\|\, P_m)    (4)

where L_S-VAE is the loss of the original S-VAE (Kingma et al., 2014), and D_KL(P_att || P_m) is the KL divergence of the attention distribution P_att from the pseudo-mask distribution P_m.
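To make the role of Eq. 4 concrete, the sketch below shows one way the attention pooling and the KL regularizer could be wired together, assuming P_att and P_m are both distributions over the candidate's term positions. It is a minimal PyTorch-style illustration under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def attention_pool(f_r, f_c):
    """x' = sum_i w_i * f_r(x)_i with w = softmax(f_c(f_r(x)))."""
    scores = f_c(f_r).squeeze(-1)          # one scalar score per position
    p_att = F.softmax(scores, dim=-1)      # attention distribution P_att
    x_prime = (p_att.unsqueeze(-1) * f_r).sum(dim=-2)
    return x_prime, p_att

def masked_attention_loss(svae_loss, p_att, p_m, alpha=0.1, eps=1e-8):
    """L = L_S-VAE + alpha * KL(P_att || P_m), following Eq. 4."""
    kl = (p_att * ((p_att + eps) / (p_m + eps)).log()).sum(dim=-1).mean()
    return svae_loss + alpha * kl
```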

Character Variation Graph-enhanced Augmentation. In this study, a random-walk based graph-enhanced augmentation procedure is used to integrate the Chinese character variation knowledge and simulate the glyph and phonetic variation mutations. A Chinese character variation graph G (Jiang et al., 2019a) is utilized: G = (C, R), where C denotes the Chinese character (vertex) set, R denotes the variation relation (edge) set, and the edge weight is the similarity of two characters given the target relation (variation) type. For the critical positions in a piece of text, we adopt a random walk based graph exploration to predict the possible Chinese character variation patterns.


Algorithm 1 Semi-supervised Generative Active Learning

Self-Diversity Based Active Learning (Labeled set: L, Unlabeled set: U = {x_1, ..., x_N}, Initial Classifier: C_t, t = 0, Chinese Character Variation Graph: G, Annotated Rationales: R)
  R = ∅
  repeat
    for all x_i ∈ U do
      With R, generate a pseudo-mask distribution P_m^i using Eq. 2
      SS_i = S-VAE(x_i, P_m^i)
      AT_i = GraphAugmentation(SS_i, G, P_m^i)
      With AT_i and C_t, calculate SD_i using Eq. 1
    end for
    Select the top-K unlabeled samples Q from U
    Get the newly labeled samples L̂ and rationales R̂ for Q from enriched human annotation
    L ← L ∪ L̂,  R ← R ∪ R̂,  U ← U \ Q
    t++, C_t ← Train(L, C_{t−1})
  until convergence
  return C_t, L

GraphAugmentation (Similar text set: SS, Chinese Character Variation Graph: G, Pseudo-mask distribution: P_m)
  AT = ∅
  for all ss_j ∈ SS do
    Probabilistically generate a position list POS with P_m
    for all pos_k ∈ POS do
      Get the character Ch_{pos_k} at position pos_k
      Ch_o ← Ch_{pos_k}
      Randomly generate a walking step T_p ∈ (0, T]
      Ch_n = RandomWalk(Ch_o, T_p, G)
      Ch_{pos_k} ← Ch_n
    end for
    Append ss_j to AT
  end for
  return AT

For more detailed information on this procedure, please refer to Algorithm 1.
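As a rough sketch of the GraphAugmentation routine in Algorithm 1, the following code replaces characters at positions sampled from the pseudo-mask distribution with variation characters reached by a short weighted random walk. The graph representation (a dictionary from a character to weighted neighbors) and the helper names are assumptions made for illustration.

```python
import random

def random_walk(graph, start_char, steps):
    """Walk `steps` edges over the character variation graph.

    `graph` is assumed to map a character to a list of (neighbor, weight)
    pairs, the weight being the variation similarity; each step draws the
    next character proportionally to the edge weights.
    """
    current = start_char
    for _ in range(steps):
        neighbors = graph.get(current)
        if not neighbors:
            break
        chars, weights = zip(*neighbors)
        current = random.choices(chars, weights=weights, k=1)[0]
    return current

def graph_augmentation(similar_texts, graph, p_m, max_steps=2):
    """Sketch of GraphAugmentation: mutate sampled critical positions."""
    augmented = []
    for text in similar_texts:
        chars = list(text)
        positions = [i for i, prob in enumerate(p_m[:len(chars)])
                     if random.random() < prob]       # sample POS with P_m
        for pos in positions:
            steps = random.randint(1, max_steps)      # walking step T_p in (0, T]
            chars[pos] = random_walk(graph, chars[pos], steps)
        augmented.append("".join(chars))
    return augmented
```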

3 Preliminary Experiment

Dataset and Experiment Setting. A Chinese SMS dataset4 was used for the experiment. There were 48,896 testing samples, including 23,891 spam samples and 25,005 normal samples. The size of the active learning sample set was 48,884, including 23,891 spam samples and 24,993 normal samples. 200 samples were randomly selected as the initial labeled set, and the remaining samples were used as the unlabeled sample pool. For each iteration, 100 samples were selected by the different active learning models, and the iterative active learning process was repeated 10 times. For evaluation, a single-layer CNN classifier was trained on the labeled samples. Uncertainty (Lewis and Gale, 1994), Margin (Roth and Small, 2006), and Entropy (Li and Guo, 2013) were chosen as baseline models.

4https://github.com/Giruvegan/generative-camouflaged-spam-detector
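The overall experimental protocol can be summarized as a standard pool-based loop. The sketch below only restates the setup described above (200 seed samples, 100 queries per round, 10 rounds, retraining a single-layer CNN each round); `select`, `annotate`, `train_cnn`, and `evaluate` are hypothetical stand-ins for the selection strategy, the human oracle, the classifier trainer, and the test-set evaluation.

```python
def run_active_learning(labeled, pool, select, annotate, train_cnn, evaluate,
                        rounds=10, per_round=100):
    """Pool-based active learning harness matching the stated setup."""
    classifier = train_cnn(labeled)
    history = [evaluate(classifier)]
    for _ in range(rounds):
        queries = select(pool, classifier, k=per_round)  # pick 100 samples
        labeled += annotate(queries)                     # human oracle labels
        pool = [x for x in pool if x not in queries]
        classifier = train_cnn(labeled)                  # retrain the CNN
        history.append(evaluate(classifier))
    return classifier, history
```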

Figure 2: Preliminary Experiment Result: (A) the number of selected spam samples after 10 iterations of active learning; (B) the classifier performance (accuracy) comparison between “Uncertainty” and “Uncertainty merging SIGNAL”; (C) the classifier performance (accuracy) comparison between “Entropy” and “Entropy merging SIGNAL”; (D) the classifier performance (accuracy) comparison between “Margin” and “Margin merging SIGNAL”

Similar baseline settings can be found in (Zhou et al., 2017; Huang et al., 2018; Yoo and Kweon, 2019).

In the SIGNAL model, for S-VAE training4, we chose “BiGRU + Attention + MLP” as the encoder structure, a “single-layer GRU” as the decoder structure, and a “single-layer CNN + MLP” as the classifier. For each candidate sample, 10 augmented texts are generated for the “self-diversity” calculation.
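A minimal sketch of the component shapes named above (BiGRU encoder with attention pooling, single-layer GRU decoder, single-layer CNN + MLP classifier) is given below. All dimensions and layer sizes are arbitrary placeholders chosen for illustration; the released repository should be consulted for the actual configuration.

```python
import torch
import torch.nn as nn

class SVAEComponents(nn.Module):
    """Illustrative shapes for the S-VAE parts described in the text."""

    def __init__(self, vocab_size, emb_dim=128, hidden=256, latent=64, classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Encoder: BiGRU over characters, attention pooling, MLP to (mu, logvar)
        self.bigru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.to_latent = nn.Linear(2 * hidden, 2 * latent)
        # Decoder: single-layer GRU conditioned on the latent code
        self.decoder = nn.GRU(emb_dim + latent, hidden, batch_first=True)
        self.vocab_out = nn.Linear(hidden, vocab_size)
        # Classifier: single-layer CNN over embeddings + MLP
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        self.cls = nn.Sequential(nn.ReLU(), nn.Linear(hidden, classes))

    def encode(self, tokens):
        h, _ = self.bigru(self.embed(tokens))               # (B, T, 2H)
        p_att = torch.softmax(self.attn(h).squeeze(-1), -1) # attention P_att
        pooled = (p_att.unsqueeze(-1) * h).sum(1)           # attention pooling
        mu, logvar = self.to_latent(pooled).chunk(2, -1)
        return mu, logvar, p_att

    def classify(self, tokens):
        feats = self.conv(self.embed(tokens).transpose(1, 2))  # (B, H, T)
        return self.cls(feats.max(dim=2).values)               # max-pool + MLP
```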

Sensitivity of Spam Sample Selection. As shown in Figure 2 (A), compared with the baseline models, SIGNAL is more sensitive to spam samples: the spam samples selected by SIGNAL were significantly more numerous than those selected by the other baselines. This observation indicates the potential of SIGNAL for addressing the “imbalance” problem in Chinese text spam detection.

The Elevating “Power” of SIGNAL. As shown in Figure 2 (B), (C), and (D), after merging5 SIGNAL, all baseline models were improved to varying degrees. Especially for margin-based active learning (Roth and Small, 2006), SIGNAL improved the performance in every active learning iteration. On average, by merging SIGNAL, Margin was improved by 10% in classification performance.

5In the preliminary experiment, we apply a simple yet effective merging strategy: in each iteration, the baseline model and the SIGNAL model each select 50 samples.


Figure 3: Case study: augmented texts from SIGNAL

Case Study. To gain a straightforward understanding of the generation quality of SIGNAL, we present two augmented texts in Figure 3. From these two cases, we have the following observations: (1) the augmented texts are semantically similar to the original sample; (2) although the original sample has no variation character, the augmented texts can simulate the phonetic or glyph variation mutations; (3) if the critical terms in the original sample are replaced, the label of the text can change.

4 Conclusion

In this paper, we propose the SIGNAL model for Chinese text spam detection. SIGNAL integrates active learning and semi-supervised generative learning into a unified framework. As an exploration study for this newly proposed problem, the preliminary results have revealed the potential of SIGNAL to address the critical problems in the proposed task. For instance, Figure 2 (A) shows that SIGNAL is more sensitive to spam samples (Imbalance Challenge); the case study (Figure 3) shows the generation capacity of SIGNAL to simulate the phonetic or glyph variation mutations (Camouflage Challenge); and, compared with the classical diversity-based approach, we integrate self-diversity based active learning and generative learning, which greatly reduces the computational complexity (O(N²) → O(N), Efficiency Challenge).

In the future, we plan to enable glyph and phonetic variation detection by integrating variation graph representation learning, which may improve SIGNAL's performance.

Acknowledgments

We are thankful to the anonymous reviewers for their helpful comments. This work is supported by Alibaba Group through the Alibaba Research Fellowship Program, the National Natural Science Foundation of China (61876003), the Guangdong Basic and Applied Basic Research Foundation (2019A1515010837), and the Opening Project of the State Key Laboratory of Digital Publishing Technology (cndplab-2020-Z001).

References

Tiago Almeida, José María Gómez Hidalgo, and Tiago Pasqualini Silva. 2013. Towards SMS spam filtering: Results under a new dataset. International Journal of Information Security Science, 2(1):1–18.

Shilpa Arora and Sachin Agarwal. 2007. Active learning for natural language processing. Language Technologies Institute, School of Computer Science, Carnegie Mellon University.

Klaus Brinker. 2003. Incorporating diversity in active learning with support vector machines. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 59–66.

David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. 1996. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145.

Gregory Druck, Burr Settles, and Andrew McCallum. 2009. Active learning by labeling features. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 – Volume 1, pages 81–90. Association for Computational Linguistics.

Sheng-Jun Huang, Jia-Wei Zhao, and Zhao-Yang Liu. 2018. Cost-effective training of deep CNNs with active model adaptation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1580–1588.

Zhuoren Jiang, Zhe Gao, Guoxiu He, Yangyang Kang, Changlong Sun, Qiong Zhang, Luo Si, and Xiaozhong Liu. 2019a. Detect camouflaged spam content via StoneSkipping: Graph and text joint embedding for Chinese character variation representation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6188–6197.

Zhuoren Jiang, Jian Wang, Lujun Zhao, Changlong Sun, Yao Lu, and Xiaozhong Liu. 2019b. Cross-domain aspect category transfer and detection via traceable heterogeneous graph representation learning. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 289–298. ACM.

Nitin Jindal and Bing Liu. 2007. Review spam detection. In Proceedings of the 16th International Conference on World Wide Web, pages 1189–1190. ACM.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR).

Durk P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589.

David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In SIGIR '94, pages 3–12. Springer.

Xin Li and Yuhong Guo. 2013. Adaptive active learning for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 859–866.

Jerry Norman. 1988. Chinese. Cambridge University Press.

Dan Roth and Kevin Small. 2006. Active learning with perceptron for structured output. In ICML Workshop on Learning in Structured Output Spaces.

Manali Sharma, Di Zhuang, and Mustafa Bilgic. 2015. Active learning with rationales for text classification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 441–451.

Sihong Xie, Guan Wang, Shuyang Lin, and Philip S. Yu. 2012. Review spam detection via temporal pattern discovery. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 823–831. ACM.

Zhao Xu, Kai Yu, Volker Tresp, Xiaowei Xu, and Jizhi Wang. 2003. Representative sampling for text classification using support vector machines. In European Conference on Information Retrieval, pages 393–407. Springer.

Donggeun Yoo and In So Kweon. 2019. Learning loss for active learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 93–102.

Zongwei Zhou, Jae Shin, Lei Zhang, Suryakanth Gurudu, Michael Gotway, and Jianming Liang. 2017. Fine-tuning convolutional neural networks for biomedical image analysis: Actively and incrementally. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7340–7351.