
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2017

Detecting hate speech on Twitter
A comparative study on the naive Bayes classifier

SAM HAMRA

BORAN SAHINDAL

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Upptäcka hatspråk på Twitter
En jämförande studie om naiv Bayesiansk klassificerare

SAM HAMRA

BORAN SAHINDAL

Computer Science
Date: June 6, 2017
Supervisor: Mårten Björkman
Examiner: Örjan Ekeberg
School of Computer Science and Communication


Abstract

Hate speech and cyberbullying on the social media platform Twitter are a growing issue, and researchers have turned to machine learning and computer science to combat them. This study investigates and compares different configurations of the naive Bayes classifier when classifying hate speech on Twitter. We obtained a data set of nearly 13,000 tweets, some containing hate speech, and trained and tested our classifier with different configurations. The study shows that character level n-grams outperform word level n-grams, and that the optimal n-gram size at the character level is a combination of sizes 1-3.


Sammanfattning

Hate speech and bullying on Twitter are a growing problem, and machine learning and computer science have been turned to in order to combat them. This study investigates and compares different configurations of a naive Bayes classifier for classifying hate speech on Twitter. We have collected nearly 13,000 tweets on which we train and test our classifier. The study shows that character level n-grams perform better than word level n-grams, and that the optimal character level n-gram size is a combination of sizes 1-3.


Contents

1 Introduction
   1.1 Problem Definition
   1.2 Scope and Constraints

2 Background
   2.1 Twitter
   2.2 Hate speech on Twitter
   2.3 Machine Learning
   2.4 N-gram model
   2.5 Bayes' rule
   2.6 Naive Bayes Classifiers
   2.7 Bernoulli
   2.8 Multinomial
   2.9 Performance Measures
   2.10 Related Work

3 Method
   3.1 The data set
   3.2 Experiments
   3.3 Data preprocessing

4 Results
   4.1 Overall Accuracy
   4.2 Precision and Recall
      4.2.1 Word level n-grams
      4.2.2 Character level n-grams
   4.3 Confusion matrices

5 Discussion and Conclusion
   5.1 Discussion
   5.2 Conclusion

Bibliography

Appendices

A Confusion Matrices


1. Introduction

As we approach the pinnacle of the information age, instant global communication has brought with it some amazing perks. But internet-based communication no longer has a physical element to it, taking us further away from how we communicate in real life, and the internet also provides the possibility of anonymity on social media platforms. These two facts combined bring us to a situation where virtually anyone in the world can write distasteful and hateful comments to others, seemingly without repercussions. On Twitter, one of the world's largest social media networks with 300 million monthly users [32], users can freely message anyone in the network with very few limitations. If it is not clear already, this can lead to huge problems.

A recent study by the anti-bullying organization Ditch The Label on the prevalence of hate speech on Twitter shows that of the 19 million tweets they analyzed, over 7 million contained racially insensitive language [4]. Another study, by researchers at the University of Wisconsin, shows that about 15,000 bullying-related tweets get posted every single day [37].

So what can be done to prevent it? Firstly, to prevent hateful comments, one must be able to detect them. Luckily, the information age also brought us machine learning: methods and algorithms that can be taught to predict future events given previously acquired knowledge. Machine learning is an umbrella term for all algorithms that have the ability to learn without being explicitly programmed [25]. It is usually divided into two sub-branches, supervised and unsupervised machine learning. Unsupervised learning is used to unravel hidden structure in the data, and the data fed to the algorithms is unlabeled. Supervised learning algorithms, on the other hand, are fed labeled data, e.g. instances marked as hateful/not hateful. This is called the training phase, where the algorithm adjusts its formulas to adapt to the given training data.

After the training phase, the algorithm is ready for the testing phase, where it is fed unlabeled data and tries to determine which class each instance belongs to. This can be used in a variety of scenarios, but more importantly for this study, it can be used to categorize texts. Tweets are, for all practical purposes, short texts that in theory can be categorized using text classification algorithms. Given previous knowledge about a set of data, such an algorithm can classify future instances into predetermined classes. Under the hood, this is done by fitting some formula to the beforehand labeled data, and then using this formula to make predictions on future unlabeled data.

Now there are many different algorithms that could perform this task, for example Support Vector Machines (SVM), Random Forests, Neural Networks and Decision Trees, just to mention a few. But there is another interesting algorithm called the naive Bayes classifier. It is in reality a family of classifiers that uses feature frequency probabilities to perform predictions on data. A feature in this context simply means a variable or property of each instance of the data set. The naive Bayes classifier is known for its relative simplicity compared to other alternatives in the same realm [30]. The features used to train a naive Bayes classifier are usually a so-called bag-of-words representation of a document, meaning the document becomes a list of words and how many times they occurred. Clearly this does not capture the context of each word, as sentence structure is overlooked: the algorithm does not differentiate between the sentences "I hate you" and "You hate I"; they will be represented the exact same way. But there are other ways to represent a document as a feature vector. By pairing words together and counting each pair, you introduce some context to each word; this is called a bigram representation, where the sentence "I hate you" would be represented as ["I hate", "hate you"] for example. Instead of grouping each word or pair of words, you can also use character level n-grams, which use groups of characters as features. The character and word level n-gram document representations are the approaches that will be the focus of this study.
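As a minimal illustration (a sketch in Python, not code from the thesis; the function name is ours), the difference between the unigram bag-of-words and the bigram representation can be computed as follows:

from collections import Counter

def word_ngrams(text, n):
    # Collect word-level n-gram counts of a text.
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

print(word_ngrams("I hate you", 1))  # Counter({'i': 1, 'hate': 1, 'you': 1})
print(word_ngrams("I hate you", 2))  # Counter({'i hate': 1, 'hate you': 1})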


1.1 Problem Definition

The goal of this thesis is to explore the effectiveness of the naive Bayes family of classifiers when trying to predict hate speech on Twitter. More specifically, we will delve deeper into the Multinomial and Bernoulli variants of the naive Bayes classifier and compare their performance. For feature selection we will try word and character level n-gram representations of length 1-6 to find out which method and which n-gram length are most appropriate when classifying hate speech in shorter texts. To measure and compare the performance of the different approaches, we will present graphs displaying the change in accuracy, precision and recall. When classifying hate speech on Twitter, does the Bernoulli or the Multinomial naive Bayes classifier yield better results? Do character level n-grams perform better than word level n-grams? Which n-gram length gives the best results?

1.2 Scope and Constraints

The scope of this study is strictly limited to studying the Bernoulli and Multinomial variants of the naive Bayes classifier. We will focus our effort on comparing the performance of word level n-grams and character level n-grams.

Because of time limitations, we will not be able to spend an extensive amount of time on maximizing the performance of our classifier, nor is that the purpose of the study. Thus we will not be performing any sophisticated data preprocessing such as stemming, lemmatization, or creating custom stop word lexicons.

Since we are using supervised machine learning algorithms, we need a data set of labeled tweets. Naturally, we are constrained by the data we are able to obtain and have to adapt and mold our study according to that data. Thus we will only focus on classifying hate speech on Twitter, and we are also slightly constrained by the format of the data set and the hate speech definition its authors use.


2. Background

2.1 Twitter

Twitter is a micro-blogging service that allows users to post messages of up to 140 characters, called tweets. Although it is referred to as a blogging service, it is often considered a closer relative of more fully equipped social media services like Facebook [19].

The home page is presented as a "timeline" showing a stream of tweets from accounts the user has chosen to follow on Twitter [33]. Following another user means subscribing to their tweets and to the tweets they interact with by re-posting or liking [35]. This following-based network does not restrict users from seeing other users' tweets. User-built strings called hashtags, prefixed with the symbol '#', are used for creating topics that allow sharing tweets by subject. Anyone can see public tweets without creating an account or logging in [36]. Twitter has approximately 313 million monthly active users, 1 billion unique visits to sites with embedded Twitter frames, and 3,860 employees around the world [32].

2.2 Hate speech on Twitter

Twitter’s rules clearly state that hateful conduct is not tolerated on thesite. But there are no sanctions against using offensive language outsideof the rules stated for hateful conduct.


"Hateful conduct: You may not promote violence against or directlyattack or threaten other people on the basis of race, ethnicity, national ori-gin, sexual orientation, gender, gender identity, religious affiliation, age,disability, or disease. We also do not allow accounts whose primary pur-pose is inciting harm towards others on the basis of these categories." [34]

According to Silva et al. [28], tweets containing hate speech have short phrases in common, such as 'I hate', 'I can't stand', and 'I am so sick of'. The two main categories targeted by hate speech on Twitter are race, targeted by 48.73% of observed tweets, and behavior, targeted by 37.05%.

2.3 Machine Learning

Machine learning is the field of computer science that covers a computer's ability to learn without being explicitly programmed [25]. Machine learning algorithms are data-driven: instead of the programmer defining rules and structures, the programmer only feeds the algorithm data, and the algorithm adjusts its machinery to perform some task [26]. It is currently being used in a wide variety of fields including robotics, text classification, search engines, optical character recognition and many more.

The term document classification is a broad topic; it can be used to assign classes to images, music, texts and more. For this study, we will focus on text classification to classify text documents such as tweets. This can be achieved using one of two general strategies, depending on what kind of analysis one is looking for [5]. On the one hand there is unsupervised learning, whose goal is to model the underlying structure in the data in order to learn more about it. In unsupervised machine learning, the input data is not labeled, and thus there are no correct answers and no teacher [1]. In supervised machine learning, the input data is labeled, and therefore there is a way to measure the accuracy of the algorithm. Simplistically, a supervised machine learning algorithm tries to fit its inner machinery to match the mapping function of the labeled data. The data set is split into two parts, the training set and the testing set [20]. The algorithm iteratively tries to make predictions on the training data until a sufficient level of performance has been achieved.

This is the learning phase. In the testing phase, the algorithm makes predictions on the testing set and compares the acquired classifications with the actual labels. There are many different supervised machine learning algorithms out there, but we will focus on a specific family of classification algorithms, the naive Bayes classifiers, which will be covered in the following sections.

2.4 N-gram model

There are many different feature selection techniques that can be used in machine learning; one of them is called the bag-of-words or bag-of-n-grams representation [13]. In the case of text classification, it works by tokenizing strings and giving an integer id to each possible token. The tokenizing can be done in several ways: chopping text into sequences of words or characters, or even pairs and larger combinations of words and characters [7]. After the tokenizing step, you end up with a set of tokens, which is commonly called the corpus [18]. Then, for each instance of the data set, you count the occurrences of each token, and this vector of integers becomes the feature vector.

We will refer to the different approaches as word level and character level n-grams, where the 'n' signifies the size of each token. For example, a character level 2-gram representation of the string "Hello" would become "-h", "he", "el", "ll", "lo", "o-", whereas a word level 2-gram representation of the string "I love you" would become "I love", "love you", "you -".1

As mentioned previously, a word level 1-gram representation does not take context into account, simply treating each word as an individual entity. But by increasing the size of the tokens, a word level 3-gram representation actually brings some context into the sequence, as it differentiates between the strings "Me hate you" and "You hate me". Character level n-grams, on the other hand, are well suited for handling texts containing grammatical errors [8], which is often the case on Twitter, where grammar has low priority [9].

1 The dashes symbolize the white space character.
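For the curious reader, this tokenization can be reproduced with scikit-learn's CountVectorizer (a sketch; scikit-learn is the library used later in the method chapter, but these exact calls and parameter values are illustrative). Note that the built-in 'char_wb' analyzer pads each word with spaces, much like the dashes above, and that the default word tokenizer drops one-character tokens such as "I":

from sklearn.feature_extraction.text import CountVectorizer

# Character level 2-grams; 'char_wb' pads words with spaces.
char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 2))
print(char_vec.fit(["Hello"]).get_feature_names_out())
# [' h' 'el' 'he' 'll' 'lo' 'o ']

# Word level 2-grams.
word_vec = CountVectorizer(analyzer="word", ngram_range=(2, 2))
print(word_vec.fit(["I love you"]).get_feature_names_out())
# ['love you']  (the default token pattern drops "I")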


2.5 Bayes’ rule

Bayes' rule involves the manipulation of conditional probabilities. It is derived from the joint probability of two events, which is formulated as [21]:

\Pr(AB) = \Pr(A \mid B) \times \Pr(B) \qquad (2.1)

These two events A and B can be considered as a hypothesis H and data D. According to Bayes' rule, we judge the relative truth of the hypothesis given the data via the relation:

\Pr(H \mid D) = \frac{\Pr(D \mid H) \times \Pr(H)}{\Pr(D)} \qquad (2.2)
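As a purely illustrative numeric example (the likelihoods below are invented; only the 14% prior is taken from the class distribution of the data set used later in this thesis): if 14% of tweets are hateful, a token D appears in 50% of hateful tweets, and D appears in 5% of all other tweets, then

\Pr(H \mid D) = \frac{0.5 \times 0.14}{0.5 \times 0.14 + 0.05 \times 0.86} \approx 0.62

so observing D raises the probability of the hate hypothesis from 0.14 to about 0.62.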

2.6 Naive Bayes Classifiers

The naive Bayes classifier (NBC) is a widely used framework for classification that is based on Bayes' rule [15]:

\Pr(C = c_k \mid X = x) = \frac{\Pr(C = c_k) \times \Pr(X = x \mid C = c_k)}{\Pr(X = x)} \qquad (2.3)

where:

\Pr(X = x) = \sum_{k=1}^{|C|} \Pr(X = x \mid C = c_k) \times \Pr(C = c_k) \qquad (2.4)

X is a vector random variable whose values are vectors of feature values x = (x_1, \ldots, x_j, \ldots, x_d) of the documents to be classified, and C is a random variable whose values are the classes (c_1, \ldots, c_k, \ldots, c_{|C|}) that we want to classify the documents with. The posterior probability \Pr(C = c_k \mid X = x), which builds the core of the NBC, can be interpreted as "What is the probability that a particular object belongs to class c_k, given its observed feature values x?" [23]. It is unknown in the beginning, and the idea behind the NBC is to estimate it by combining the conditional probability \Pr(X = x \mid C = c_k), the prior probability \Pr(C = c_k) and the evidence \Pr(X = x). However, even estimating the conditional probability \Pr(X = x \mid C = c_k) poses problems, since there are usually too many possible values for x = (x_1, \ldots, x_j, \ldots, x_d). This problem is solved by assuming that the occurrence of a particular value of x_j is statistically independent of the occurrence of any other x_{j'}, given a document class c_k [24]. Thereby, if we assume

\Pr(X = x \mid C = c_k) = \prod_{j=1}^{d} \Pr(X_j = x_j \mid C = c_k) \qquad (2.5)

then formula 2.3 becomes

\Pr(C = c_k \mid X = x) = \frac{\Pr(C = c_k) \times \prod_{j=1}^{d} \Pr(X_j = x_j \mid C = c_k)}{\Pr(X = x)} \qquad (2.6)

where:

\Pr(X = x) = \sum_{k'=1}^{|C|} \Pr(C = c_{k'}) \times \prod_{j=1}^{d} \Pr(X_j = x_j \mid C = c_{k'}) \qquad (2.7)

These unknown quantities are estimated by processing training data, which consists of documents for which we create feature vectors, together with the labels (c_1, \ldots, c_k, \ldots, c_{|C|}) attached to those documents [23].

The goal of the NBC is to find the maximum a posteriori (MAP) hypothesis given an example x, thereby minimizing the number of errors [24]. This is accomplished by assigning a document with feature vector x to the class c_k for which \Pr(C = c_k \mid X = x) is highest.
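A minimal sketch of this MAP decision rule in Python (our own illustration, not the implementation used in the experiments). The denominator Pr(X = x) is dropped since it is constant across classes, and log probabilities are summed to avoid numeric underflow:

import math

def map_classify(features, priors, likelihoods):
    # priors:      dict mapping class -> Pr(C = c_k)
    # likelihoods: dict mapping class -> dict mapping feature -> Pr(x_j | c_k)
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for x in features:
            # The tiny floor stands in for smoothing of unseen features.
            score += math.log(likelihoods[c].get(x, 1e-9))
        if score > best_score:
            best_class, best_score = c, score
    return best_class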


2.7 Bernoulli

The Bernoulli naive Bayes classifier (BNBC) is a naive Bayes variant that represents a document by a feature vector with binary elements, taking the value 1 if the corresponding feature is present and 0 if it is not [27]. In text classification applications of this method, the feature vector has m dimensions, where m is the number of words in the corpus. When creating a feature vector for a document, the value 1 is placed at index j in the vector if the j:th word in the corpus is present in the document. The remaining positions in the feature vector take the value 0 [23].

Let \Pr(x_t \mid C) be the probability of feature x_t being present in a document of class C, and 1 - \Pr(x_t \mid C) the probability of feature x_t not being present. If we again assume independence between features, then we can write the document likelihood \Pr(X \mid C) in terms of the individual feature likelihoods \Pr(x_t \mid C) as

\Pr(X \mid C) \approx \prod_{t=1}^{|X|} \Pr(x_t \mid C)^{b_t} \times (1 - \Pr(x_t \mid C))^{1 - b_t} \qquad (2.8)

where b_t is the t:th value of the feature vector [27].

In text classification, the (smoothed) maximum-likelihood estimate that a particular word x_t occurs in class C is formulated as:

\Pr(x_t \mid C) = \frac{d_{x_t} + 1}{d_C + 2}


where:

• d_{x_t} is the number of documents in the training data set that contain the feature x_t and belong to class C

• d_C is the number of documents in the training data set that belong to class C

• +1 and +2 are the parameters of Laplace smoothing [23]. They are included to avoid probabilities of zero or one in the case of zero occurrences of a word within a particular class, or zero occurrences of a particular class in the training data [16].
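A sketch of this Laplace-smoothed Bernoulli estimate in Python (our illustration; it assumes `documents` is a list of (feature_set, label) pairs, and all names are ours):

def bernoulli_estimate(feature, cls, documents):
    # Pr(x_t | C) = (d_xt + 1) / (d_C + 2), with Laplace smoothing.
    d_c = sum(1 for feats, label in documents if label == cls)
    d_xt = sum(1 for feats, label in documents if label == cls and feature in feats)
    return (d_xt + 1) / (d_c + 2)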

2.8 Multinomial

The Multinomial naive Bayes classifier (MNBC) is based on the multinomial distribution (MD). A vector Y = (y_1, \ldots, y_t, \ldots, y_d) with parameters p_1, \ldots, p_d representing \Pr(y_1), \ldots, \Pr(y_d) has a multinomial distribution defined by:

\Pr(Y) = \frac{n!}{\prod_{t=1}^{d} y_t!} \times \prod_{t=1}^{d} p_t^{y_t}

The MNBC differs from the BNBC in that the document feature vectors capture the frequency of features, not just their presence [16]. The feature vectors are constructed as in the BNBC, but the number of occurrences of each word is stored instead of a binary value. This yields the bag-of-words representation of documents in text classification.

Let X = (x_1, \ldots, x_t, \ldots, x_{|W|}) be the MNBC feature vector of a document D, let W = (w_1, \ldots, w_t, \ldots, w_{|W|}) be the words in the vocabulary, and let \Pr(w_t \mid C) be the probability of word w_t occurring in class C. By the naive Bayes assumption, assuming a MD, and assuming that the number of words n_i in a document is independent of the class [27], the document likelihood \Pr(X \mid C) can be written as:

\Pr(X \mid C) = \frac{n_i!}{\prod_{t=1}^{|W|} x_t!} \times \prod_{t=1}^{|W|} \Pr(w_t \mid C)^{x_t} \approx \prod_{t=1}^{|W|} \Pr(w_t \mid C)^{x_t}


The normalization term

\frac{n_i!}{\prod_{t=1}^{|W|} x_t!}

can be dropped without any change in the results, because it does not depend on the class C [27, 11].

As in the BNBC, \Pr(w_t \mid C) is the probability of a particular word occurring in class C and is estimated by [23]:

\Pr(w_t \mid C) = \frac{c_{w_t} + 1}{c_C + |W|}

where:

• c_{w_t} is the total number of occurrences of w_t in documents of class C in the training data

• c_C is the total number of words appearing in documents that belong to class C in the training data

• +1 and |W| are the parameters of Laplace smoothing, which avoid zero probabilities and keep the probabilities normalized [27].
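And a corresponding sketch of the multinomial estimate (our illustration; it assumes `documents` is a list of (word_list, label) pairs and `vocabulary` the set of all words, and all names are ours):

def multinomial_estimate(word, cls, documents, vocabulary):
    # Pr(w_t | C) = (c_wt + 1) / (c_C + |W|), with Laplace smoothing.
    class_words = [w for words, label in documents if label == cls for w in words]
    return (class_words.count(word) + 1) / (len(class_words) + len(vocabulary))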

2.9 Performance Measures

When tasked with the creation of a classifying algorithm, one is often concerned with the effectiveness of the classifier. There is no point in using a classifier that gets it wrong most of the time, so there is a need for a way to measure the effectiveness of classifiers. One performance metric is the accuracy of the classifier, which can be defined as the ratio between the correct predictions and the total predictions made [6]. Depending on the application, accuracy may not be a good enough metric; it is often misleading when the classifier has a high success rate at predicting the "not so important" classes, but very limited success at predicting the important ones [31]. As an example, say that your color prediction algorithm has 80% accuracy. It gets white and black right 100% of the time, but it classifies yellow and red wrongly 50% of the time. Looking only at the accuracy, you would say the classifier performs reasonably well, and if it were the only measure you looked at, you would never realize how poorly it performed in reality.

Thus we require more inclusive performance measures that give us a better indication of the algorithm's true performance.

Precision and recall are two widely used performance measures, not only in classification but also in pattern recognition and information retrieval. To understand them, we must first present the terms true positives, true negatives, false positives and false negatives. In the case of a binary classifier, true and false can be considered the two classes as labeled by the "jury", while positive and negative correspond to the label that the classifier gave the instance. So, for example, the true positives are all the instances of the "true" class that were labeled correctly by the classifier, and the false positives are all the instances that belonged to the class "false" but were incorrectly labeled as true by the classifier [31].

The same idea holds when calculating recall and precision for multi-class classifiers [31], but the definition of true and false changes depending on which class you are currently viewing. When calculating recall and precision for class X, we define X as the "true" class and all other classes as "false". The false negatives, FN(X), are then the instances that should have been labeled as X but were not. In the same fashion, the false positives, FP(X), are the instances that were incorrectly labeled as X. The recall for a class X is defined as:

\text{recall}(X) = \frac{TP(X)}{TP(X) + FN(X)} \qquad (2.9)

i.e., of all the instances that should have been labeled as X, how many were captured by the classifier? [29] The precision for a class X is defined as:

\text{precision}(X) = \frac{TP(X)}{TP(X) + FP(X)} \qquad (2.10)

i.e., of all the instances labeled as X, how many were in fact correctly labeled by the classifier? [29]
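These two definitions translate directly into code; a sketch (our illustration) for computing per-class precision and recall from true and predicted labels:

def precision_recall(y_true, y_pred, cls):
    # Treat cls as the "positive" class and everything else as "negative".
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall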

2.10 Related Work

The NBC is widely used in software for classifying documents based on their content [17]. It is especially popular in anti-spam filters.

Although other machine learning algorithms have also been used for email filtering purposes, the NBC's simplicity and linear time complexity have made it a more widely used solution [8, 17].

According to Davidson et al. [3], the bag-of-words approach is common in NBC implementations. A natural consequence of this approach is an increase in the number of false positive results, meaning, in our case, not-hate-speech instances being labeled as hate speech. This is a result of the context in which the words appear being essentially ignored due to the NBC's independence assumption.

Davidson's study [3] is the origin of the data set we are using. The study makes the distinction between hate speech and offensive language and tests a variety of models that have been used in prior work: logistic regression, NBC, decision trees, random forests and linear SVMs. Looking at the results of the best performing model (an SVM), almost 40% of hate speech is misclassified. As the data set is labeled by human workers, the model is likely to classify tweets as less hateful or offensive than the human coders did. We also see that only 5% of offensive and 2% of innocuous tweets were misclassified as hate speech. After a review of misclassified tweets, they come to the conclusion that lexical methods are effective for identifying offensive terms but are likely to fail at detecting hate speech, while automated classification methods can achieve relatively high accuracy. Davidson repeatedly points out the uncertain effects of offensive words: according to him, the presence or absence of a particular offensive or hateful term can both help and hinder accurate classification.


A study conducted by Kwok and Wang uses an NBC for detecting tweets against blacks [12]. This study does not make the distinction between hate speech and offensive language and uses a binary classifier. They obtained 76% accuracy and inspected the classified tweets: 86% of the tweets that were racist against blacks were labeled as racist because they contained offensive words. Only unigrams were employed in this study. Their study also points out that small changes in the words used on Twitter have a remarkable effect on the chance of a tweet being classified as hate speech (e.g. the word-level token "Nigg**" is widely used in hate speech, while tweets containing the token "Nigg*" are less likely to be classified as hate speech).

The independence assumption the NBC makes is an interesting area to optimize and improve. Much work has been conducted on relaxing this independence assumption so that some context can be included [22, 10]. Variations of the NBC exist, e.g. Tree Augmented Naive Bayes (TAN), which is considered non-trivial to learn, and Chain Augmented Naive Bayes (CAN), a simplified version of TAN. These variations aim to relax the independence assumption and thereby improve accuracy [22, 10].

One of the main challenges in classifying hate speech is pointed out to be making the distinction between hate speech instances and instances containing offensive language but not hate speech. Lexical detection methods tend to have low precision because of the similarity between the words related to hate speech and those related to offensive language [3].

We have also studied work by Kanaris et al. that experiments with the effects of word and character based n-gram approaches on a spam filter based on an SVM (instead of an NBC, as in our case) [8]. The character based n-gram model is highlighted: it is able to capture information on various levels, such as lexical (|the-|, |word|), word-class (|ed-|, |ing-|) and punctuation mark usage (|!!!|, |f.r.|). It is to some degree robust to grammatical errors (e.g. the word tokens 'assignment' and 'asignment' share the majority of their character n-grams). In addition, it does not require much text preprocessing (no tokenizer, lemmatizer or other 'deep' NLP tools). The results from that study show that 4-grams seem to provide the most reliable representation.


3. Method

3.1 The data set

Since we are using supervised machine learning algorithms, we need a data set with labeled entries, that is, tweets already marked as either containing or not containing hate speech. For this, we found a data set of nearly 12,800 tweets, each classified by 3 independent contributors via crowd sourcing [3]. The data set makes distinctions between tweets containing hate speech, tweets containing offensive language but not considered hate speech, and tweets that do not contain any type of offensive or hateful language. Of the tweets in the data set, 14% were labeled as hateful, 32% as offensive, and 54% as non-offensive.

There is some level of subjectivity that comes with the task of distinguishing hate speech from offensive language, but going by the authors' definition, hate speech is "language that is used to express hatred towards a targeted group or is intended to be derogatory, to humiliate, or to insult the members of the group" [3].

3.2 Experiments

The NBC implementations in scikit-learn, a Python library containing machine learning tools [14], were used in the experiments in order to map out how the different NB variants, Bernoulli and Multinomial, are affected by different n-gram models. Experiments were conducted on two different n-gram models: character based and word based. The reason we also included the character based n-gram approach is that it gave significant results in an earlier study using an SVM [8]. We therefore conducted experiments on both the Bernoulli and the Multinomial classifier using both character and word level n-gram models.
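The thesis does not reproduce its code, but a plausible minimal setup with scikit-learn could look as follows (a sketch; the parameter values are illustrative, and note that BernoulliNB binarizes the n-gram counts by default, matching the presence/absence model of section 2.7):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

def build_classifier(variant="multinomial", analyzer="word", max_n=3):
    # Bag-of-n-grams features followed by a naive Bayes model.
    vectorizer = CountVectorizer(analyzer=analyzer, ngram_range=(1, max_n))
    model = MultinomialNB() if variant == "multinomial" else BernoulliNB()
    return make_pipeline(vectorizer, model)

clf = build_classifier(variant="bernoulli", analyzer="char_wb", max_n=3)
# clf.fit(train_tweets, train_labels); clf.predict(test_tweets)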


Instances of a specific class can be misclassified very often. For example, in Davidson's study [3], 40% of hate speech tweets were misclassified even though the overall accuracy was around 90%. We therefore decided to calculate the percentage of each class being classified as each other class and present the results in the form of confusion matrices (see Appendix A). Observe that, by definition, the main diagonals of the matrices contain the recalls for the three classes, where the upper left cell contains the recall for the "hate" class and the lower right cell the recall for the "not-offensive" class.
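In code, such a row-normalized matrix can be obtained as sketched below (our illustration; the class label strings are assumptions):

import numpy as np
from sklearn.metrics import confusion_matrix

def recall_matrix(y_true, y_pred, labels=("hate", "offensive", "not-offensive")):
    # Cell (i, j) is the share of true class i predicted as class j,
    # so the main diagonal holds the per-class recall.
    cm = confusion_matrix(y_true, y_pred, labels=list(labels)).astype(float)
    return cm / cm.sum(axis=1, keepdims=True)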

When applying the n-gram approach, all preceding n-gram slices were included in the feature vectors. For example, the word level 3-gram feature vector for a tweet includes all words, all 2-word slices and all 3-word slices in the tweet. As an example, the elements in the following vector will be counted for the tweet "I hate you Justin Bieber":

["I", "hate", "you", "Justin", "Bieber", "I hate", "hate you", "you Justin", "Justin Bieber", "I hate you", "hate you Justin", "you Justin Bieber"]

The same analogy holds for the character based n-gram model.

After some tests, we observed that the results converged to specific values as the maximum n-gram size included in the feature vectors was increased. We therefore decided to take 6 (6-grams) as this maximum.

We also observed that the results changed by a few percent when the test and training instances were shuffled. We therefore decided to run each experiment 1000 times and use the average values as results (a sketch of this loop follows the experiment list below). Each single run of an experiment used the same training and test tweets for both the Bernoulli and the Multinomial classifier. After each run, the test and training data were shuffled randomly. The conducted experiments can be listed as follows:


• Bernoulli classifier with word level n-grams of sizes 1-6

• Multinomial classifier with word level n-grams of sizes 1-6

• Bernoulli classifier with character level n-grams of sizes 1-6

• Multinomial classifier with character level n-grams of sizes 1-6
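A sketch of the shuffling-and-averaging loop described above (our illustration; the 80/20 split ratio is an assumption, as the thesis does not state the ratio used):

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def average_accuracy(clf, tweets, labels, runs=1000):
    # Shuffle, split, train and test `runs` times; report the mean accuracy.
    scores = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            tweets, labels, test_size=0.2, shuffle=True, random_state=seed)
        clf.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, clf.predict(X_te)))
    return np.mean(scores)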

3.3 Data preprocessing

The data set contained some duplicate entries, which had to be removed so as not to skew the results. We also transformed all tweets into exclusively lower case strings: the NBC for text classification simply compares strings of characters, so it is often useful not to distinguish between upper and lower case letters, as the case seldom changes the meaning of a word or sentence. Emojis and non-UTF-8 symbols, which in this context were not adding meaningful information to the tweets, as well as links, retweet markers and other text that did not provide evidence of a tweet's class, were removed.
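A sketch of this cleaning step (our illustration; the exact rules and regular expressions used in the experiments are not listed in the thesis):

import re

def preprocess(tweet):
    # Lower-case, then strip links, retweet markers, mentions and non-ASCII symbols.
    t = tweet.lower()
    t = re.sub(r"https?://\S+", " ", t)   # links
    t = re.sub(r"\brt\b|@\w+", " ", t)    # retweet markers and @mentions
    t = re.sub(r"[^\x00-\x7f]", " ", t)   # emojis and other non-ASCII symbols
    return re.sub(r"\s+", " ", t).strip()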


4. Results

4.1 Overall Accuracy

The accuracy figure (not reproduced here) plots the change in overall accuracy of the Bernoulli and Multinomial NBCs with increasing n-gram size.

The overall accuracy of the Multinomial NBC drops by 23% when the word level n-gram range is increased from 1 to 6. The decrease is continuous and significant at each step. The accuracy of the Bernoulli NBC, on the other hand, does not seem to be affected by changing the word level n-gram size.

The character level n-gram approach, however, does seem to have a consistently positive effect on the overall accuracy. With increasing character level n-gram size, the overall accuracy of the Multinomial NBC increases from 56% to 75%, while the Bernoulli NBC's accuracy increases from 62% to 76%.

The overall accuracy alone is not enough to judge the effects of changing the n-gram size; we also need to investigate how well the specific classes were classified.

When we look at the precision and recall values for each class, we get a better picture of the predictions made by the different classifiers and n-gram models.


4.2 Precision and Recall

4.2.1 Word level n-grams

The performance of the word level n-gram model with the Multinomial NBC does not change significantly with n-gram size. There is an 8% improvement in precision for the hate class when we increase the n-gram size from 1 to 3, but the recall drops by 4%. We observe that the recall of the hate class is very low, ranging between 14% and 10%. It is worth pointing out here that the data set contains 14% hate speech instances.

The word level n-gram approach has quite a different effect on the performance measures of the Bernoulli NBC than on the Multinomial NBC. We observe that the precision of the hate speech class is very high with 1-grams and drops dramatically after that. But when we take the recall for hate speech, which ranges between 1% and 0%, into account, we see that very few tweets were classified as hate speech at all. Furthermore, as a decisive factor, we observe that the recall for the not-offensive class reaches 99% already at 2-grams, while its precision decreases steadily with increasing n-gram size. This finding indicates that hate or offensive instances get classified as less offensive than they are as we increase the n-gram size.

4.2.2 Character level n-grams

The character level n-gram approach gave more consistent results than the word level approach. Recall here that character level 1-grams only count the letters in tweets when building feature vectors; the significant increase in the recall values of both the Bernoulli and the Multinomial NBC from 1-grams to (1,2)-grams was therefore expected. Furthermore, we observe that the recall values of both classifiers for the problematic hate class reach their peak at the range (1,3), with results of 48% and 51% respectively. These are the best results obtained for the hate class in our experiments.

Even though there is significant improvement for the not-offensive and offensive classes, the recall of the hate class drops dramatically after 3-grams.


Table 4.1: Precision and recall for the Multinomial naive Bayes classifier. Panels: word level n-grams; character level n-grams. (Plots omitted.)


Table 4.2: Precision and recall for the Bernoulli naive Bayes classifier. Panels: word level n-grams; character level n-grams. (Plots omitted.)

Even though the graphs above give a good picture of the performance of each n-gram approach and classifier combination, it is still worthwhile to look at how instances of the different classes get classified as each other.

4.3 Confusion matrices

Looking at the confusion matrices in Appendix A, we observe that tweets containing hate speech tend to get classified as less hateful than they are, which agrees with the results of Davidson's [3] work using an SVM. As one can follow from the gray-scaled confusion matrices, the black tones dominate as the n-gram size keeps increasing.

We also see that the character level n-gram approach does much better than the word level n-gram approach in general. This result agrees with the work of Kanaris et al. [8] discussed in the related work section. Both classifiers, the Bernoulli NBC more markedly, classify the hate class much better with the character level approach.

Word level n-grams seem to do better when classifying not-offensive instances, but we also know about the tendency to classify tweets as less offensive than they are. Since we did not conduct experiments targeting this point, we cannot judge how this tendency differs between the Bernoulli and the Multinomial classifier. This lack of information affects other judgments of ours as well.


5. Discussion and Conclusion

5.1 Discussion

As the results show, the NBC performs poorly when classifying tweets containing hateful speech. We observe the best results for hateful instances when using character level n-grams of length 2-3, with recalls ranging between 48% and 51% for both the Bernoulli and the Multinomial variant. One possible factor behind these poor results may be the distribution of the data set, with only 14% of the data being hate instances. This could be improved in future experiments by further balancing the data set.

According to our results, the character level n-gram approach performs significantly better than the word level approach for both the Bernoulli and the Multinomial variant. We think the lack of grammatical care in tweets has a strong effect on this. As discussed in the related work section, the character level n-gram approach is robust to misspellings to some extent (e.g. 'Assignment' and 'Asignment' have most of their character n-grams in common), and that might be the reason why the character level approach outperforms the word level approach so significantly.

To improve on the results of the word level n-gram approach, lemmatization and normalization of the words in the tweets should be considered in the preprocessing phase. A custom stop word vocabulary can also be critical for achieving good results with word level n-grams. Because of the previously mentioned anomalies regarding misspelled words on Twitter, no stop word cleaning was done in our experiments; the consequences of an eventual stop word cleaning on the n-gram approach should be investigated separately.

We see an improvement in hate recall for character level n-grams from length 1 to 2, and after that the recall drops off continuously. But even this improvement is of little significance, as character level 1-grams correspond to a feature vector of letters only. Recall values for the other classes improve steadily, but we know that the tendency to classify tweets as less hateful than they are plays a role in these results.

We observe a constant decrease in the recall of the hate class after character level n-gram size 3, and the tendency to classify tweets as less hateful increases. The same pattern can be observed in the word level n-gram results, even from 1-grams.

Since the recall in the word level n-gram experiments is almost 0% for the Bernoulli and around 10% for the Multinomial NBC, we cannot draw conclusions about the effects of changing the word level n-gram size.

Another issue that may have affected the test results is the subjectivity of hate speech. As mentioned before, the data set was labeled by human workers, and their understanding of hate speech might have had an impact on the labeling.

We ignored punctuation symbols in our work, but this has consequences for the n-gram approach. For instance, when we ignore the full stops in a text and let 2-gram slices be built from the last word of one sentence and the first word of the next, we essentially violate the main point of the n-gram approach. We therefore suggest that this aspect be considered in future work.

Emoji symbols are widely used for expressing emotions on Twitter. We did not consider these tokens meaningful and therefore removed them during the data preprocessing phase. However, as a recent trend, many have started to express themselves using emojis instead of words [2]. We think this aspect of Twitter communication can also be crucial in classification tasks like this one.

5.2 Conclusion

We implemented Bernoulli and Multinomial NBCs and performed experiments to observe the effects of changing word level and character level n-gram sizes. We used a data set that had been used in previous research, where it gave consistent results. However, because of the use of incorrect grammar, emojis etc., as discussed in previous sections, the experiments with the word level approach resulted in low classification rates, and the effects of the n-gram approach and the different NBC variants could not be successfully observed there.


On the other hand, we have observed that the character level n-gram approach performs significantly better at classifying hate speech on Twitter than the word level n-gram approach. The best results for classifying hate instances, which are the most problematic ones, were obtained with character level n-grams using combinations of sizes 1 to 3. There was no significant difference in the results between the Bernoulli NBC and the Multinomial NBC.


Bibliography

[1] Jason Brownlee. Supervised and Unsupervised Machine Learning Algorithms. 2016. URL: http://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/ (visited on 05/05/2017).

[2] Paula Cocozza. Crying with laughter: how we learned how to speak emoji. 2015. URL: https://www.theguardian.com/technology/2015/nov/17/crying-with-laughter-how-we-learned-how-to-speak-emoji (visited on 05/10/2017).

[3] Thomas Davidson et al. "Automated Hate Speech Detection and the Problem of Offensive Language". In: (2017).

[4] Brandwatch Research Services / Ditch The Label. "Cyberbullying and Hate Speech". In: (2016), p. 10.

[5] Ciro Donalek. Supervised and Unsupervised Learning. 2011. URL: http://www.astro.caltech.edu/~george/aybi199/Donalek_Classif.pdf (visited on 05/05/2017).

[6] Tom Fawcett. The Basics of Classifier Evaluation Part 1. 2015. URL: https://svds.com/the-basics-of-classifier-evaluation-part-1/ (visited on 05/05/2017).

[7] Daniel Jurafsky and James H. Martin. "Speech and Language Processing". In: (2014), p. 1.

[8] Ioannis Kanaris et al. "Words vs. character n-grams for anti spam filtering". In: (), p. 2.

[9] Susanna Kelley. Texting, Twitter contributing to students' poor grammar skills, profs say. 2010. URL: http://www.theglobeandmail.com/technology/texting-twitter-contributing-to-students-poor-grammar-skills-profs-say/article4304193/ (visited on 05/05/2017).

[10] Eamonn J. Keogh and Michael J. Pazzani. "Learning Augmented Bayesian Classifiers: A Comparison of Distribution-based and Classification-based Approaches". In: (1999).

[11] Ashraf M. Kibriya et al. "Multinomial Naive Bayes for Text Categorization Revisited". In: (2004).

[12] Irene Kwok and Yuzhou Wang. "Locate the Hate: Detecting Tweets against Blacks". In: (2013).

[13] Scikit Learn. The Bag of Words representation. 2017. URL: http://scikit-learn.org/stable/modules/feature_extraction.html (visited on 05/05/2017).

[14] scikit-learn. scikit-learn. 2017. URL: http://scikit-learn.org/stable/ (visited on 05/06/2017).

[15] David D. Lewis. "Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval". In: ().

[16] Andrew McCallum and Kamal Nigam. "A Comparison of Event Models for Naive Bayes Text Classification". In: (1998).

[17] Vangelis Metsis, Ion Androutsopoulos, and Georgios Paliouras. "Spam Filtering with Naive Bayes – Which Naive Bayes?" In: (2006).

[18] Orange3 Text Mining. Bag of Words. 2015. URL: http://orange3-text.readthedocs.io/en/stable/widgets/bagofwords.html (visited on 05/05/2017).

[19] University of Minnesota. "Information Systems: A Manager's Guide to Harnessing Technology". In: (2011).

[20] Andrew Ng. Lecture 61 – Model Selection and Train/Validation/Test Sets. URL: https://www.coursera.org/learn/machine-learning/lecture/QGKbr/model-selection-and-train-validation-test-sets (visited on 05/05/2017).

[21] Bruno A. Olshausen. "Bayesian probability theory". In: (2004).

[22] Fuchun Peng and Dale Schuurmans. "Combining Naive Bayes and n-Gram Language Models for Text Classification". In: (2003).

[23] Sebastian Raschka. "Naive Bayes and Text Classification I". In: (2014).

[24] I. Rish. "An empirical study of the naive Bayes classifier". In: (2001).

[25] Arthur L. Samuel. "Some studies in machine learning using the game of checkers". In: (1959), p. 1.

[26] Rob Schapire. COS 511: Theoretical Machine Learning. 2008. URL: http://www.cs.princeton.edu/courses/archive/spr08/cos511/scribe_notes/0204.pdf (visited on 05/05/2017).

[27] Hiroshi Shimodaira. Text Classification using Naive Bayes. 2015. URL: http://www.inf.ed.ac.uk/teaching/courses/inf2b/learnnotes/inf2b-learn-note07-2up.pdf (visited on 05/05/2017).

[28] Leandro Silva et al. "Analyzing the Targets of Hate in Online Social Media". In: (2016).

[29] Marina Sokolova and Guy Lapalme. "A systematic analysis of performance measures for classification tasks". In: (2009).

[30] StatSoft. Naive Bayes Classifier. URL: http://www.statsoft.com/Textbook/Naive-Bayes-Classifier (visited on 05/05/2017).

[31] Text Mining, Analytics & More. Computing Precision and Recall for Multi-Class Classification Problems. 2014. URL: http://text-analytics101.rxnlp.com/2014/10/computing-precision-and-recall-for.html (visited on 05/05/2017).

[32] Twitter. About Twitter. 2017. URL: https://about.twitter.com/company (visited on 05/04/2017).

[33] Twitter. About your Twitter timeline. 2017. URL: https://support.twitter.com/articles/164083 (visited on 05/05/2017).

[34] Twitter. Hateful conduct policy. 2017. URL: https://support.twitter.com/articles/20175050 (visited on 05/05/2017).

[35] Twitter. Twitter FAQ. 2017. URL: https://support.twitter.com/articles/14019 (visited on 05/05/2017).

[36] Twitter. Using hashtags on Twitter. 2017. URL: https://support.twitter.com/articles/49309# (visited on 05/05/2017).

[37] Jun-Ming Xu et al. "Learning from Bullying Traces in Social Media". In: (2012), p. 3.


Appendices



A. Confusion Matrices



Table A.1: Confusion matrices for the experiments using the Multinomial naive Bayes classifier. Panels: word level n-grams; character level n-grams. (Matrices omitted.)


Table A.2: Confusion matrices for the experiments using the Bernoulli naive Bayes classifier. Panels: word level n-grams; character level n-grams. (Matrices omitted.)