Knowledge Discovery and Data Mining 1 (VO)
TRANSCRIPT
Knowledge Discovery and Data Mining 1 (VO) (706.701)
Classification
Denis Helic
ISDS, TU Graz
January 10, 2019
Big picture: KDDM
[Figure: the knowledge discovery process (preprocessing, transformation, data mining) built on mathematical tools (linear algebra, probability theory, information theory, statistical inference) and on a hardware & programming infrastructure for modeling.]
Outline
1 Introduction
2 Bayes Rule for Classification
3 Naive Bayes Classifier
4 Naive Bayes for Text Classification
Slides are partially based on the “Machine Learning” course from the University of Washington by Pedro Domingos, the “Machine Learning” book by Tom Mitchell, and “Introduction to Information Retrieval” by Manning, Raghavan and Schütze
Introduction
Classification as a Machine Learning task
Classification problems: classify an example into a given set of categories
Typically we have a labeled set of training examples
We apply an algorithm to learn from the training set and devise a classification rule
We apply the rule to classify new examples
Introduction
Types of learning
Supervised (inductive) learning: training data includes the desired outputs
Unsupervised learning: training data does not include the desired outputs
Semi-supervised learning: training data includes a few desired outputs
Reinforcement learning: rewards from a sequence of actions
Introduction
Supervised learning
Given examples of a function (x, f(x))
Predict the function f(x) for new examples x
Depending on f(x) we have:
1 Classification if f(x) is a discrete function
2 Regression if f(x) is a continuous function
3 Probability estimation if f(x) is a probability distribution
Introduction
Classification
Given: training examples (x, f(x)) for some unknown discrete function f
Find: a good approximation for f
x is data, e.g. a vector, text documents, emails, words in text, etc.
Introduction
Classification: example applications
Credit risk assessment:
x: properties of customers and the proposed purchase
f(x): approve the purchase or not

Disease diagnosis:
x: properties of patients, lab tests
f(x): disease
Introduction
Classification: example applications
Email spam filter:
x: email text (or features extracted from emails)
f(x): spam or not spam

Text categorization:
x: documents or words (features) from a document
f(x): thematic category
Introduction
Terminology
Training example: an example of the form (x, f(x))
Target function: the true function f
Hypothesis: a proposed function h believed to be similar to f
Classifier: a discrete-valued function; the possible values f(x) ∈ {1, 2, …, K} are called classes or labels
Hypothesis space: the space of all possible hypotheses
Introduction
Classification approaches
Logistic regression
Decision trees
Support vector machines
Nearest neighbor algorithms
Neural networks
Statistical inference, such as Bayesian learning (e.g. text categorization)
Bayes Rule for Classification
Application of Bayes Rule for classification
We are given some hypothesis space H, the observed data D, and some prior probabilities of the various hypotheses
We are interested in determining the best hypothesis from H
The best means the most probable one, given the data plus the prior
In other words, we are interested in the posterior probabilities of the various hypotheses given the data and their prior probabilities
Bayes Rule for Classification
Application of Bayes Rule for classification
Bayes Theorem
For a given joint probability distribution P(D, h) we have the prior probability of h given as P(h), the probability P(D) of observing the data regardless of which hypothesis holds, and the conditional probability of the data given a hypothesis, P(D|h). The posterior probability of a hypothesis h is given by:

P(h|D) = P(D|h) P(h) / P(D)
Bayes Rule for Classification
Application of Bayes Rule for classification
P(h|D) increases with P(h) and with P(D|h)
Also, P(h|D) decreases as P(D) increases, because the more probable it is that D will be observed independently of h, the less evidence D provides in support of h
In many learning scenarios the learner considers some set of candidate hypotheses H and selects the most probable hypothesis from this set
Such a maximally probable hypothesis is called the maximum a posteriori (MAP) hypothesis
Bayes Rule for Classification
Maximum a posteriori (MAP) hypothesis
MAP
h_MAP ≡ argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h) P(h) / P(D)
      = argmax_{h ∈ H} P(D|h) P(h)

In the last step we dropped P(D) because it is a constant independent of h
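As a minimal sketch in Python, MAP selection over a finite hypothesis space is a one-liner; the hypothesis names, priors, and likelihood table below are hypothetical placeholders, not values from the lecture:

    def h_map(H, prior, likelihood, D):
        # argmax_h P(D|h) P(h); the constant P(D) is dropped
        return max(H, key=lambda h: likelihood(D, h) * prior[h])

    # Hypothetical usage: two hypotheses, one boolean observation
    prior = {"h1": 0.7, "h2": 0.3}
    table = {("h1", True): 0.2, ("h1", False): 0.8,
             ("h2", True): 0.9, ("h2", False): 0.1}
    likelihood = lambda D, h: table[(h, D)]
    print(h_map(["h1", "h2"], prior, likelihood, True))  # h2 (0.27 vs 0.14)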
Bayes Rule for Classification
Maximum likelihood (ML) hypothesis
In some cases we can assume that every hypothesis is equally probable a priori
The MAP equation then simplifies to:

ML

h_ML = argmax_{h ∈ H} P(D|h)

P(D|h) is the likelihood of the data given a hypothesis
This is then the maximum likelihood (ML) hypothesis
Bayes Rule for Classification
MAP: example
Medical diagnosis
Consider a medical diagnosis problem in which there are two alternative hypotheses:
1 The patient has a particular form of cancer
2 The patient does not have cancer
The available data is from a particular laboratory test with two possible outcomes: P (positive) and N (negative). We have prior knowledge that over the entire population only 0.008 of the people have this disease. Furthermore, the test returns a correct positive result in only 98% of the cases in which the disease is actually present and a correct negative result in only 97% of the cases in which the disease is not present. What is h_MAP?
Bayes Rule for Classification
MAP: example
P(cancer) = 0.008
P(¬cancer) = 0.992
P(P|cancer) = 0.98
P(N|cancer) = 0.02
P(P|¬cancer) = 0.03
P(N|¬cancer) = 0.97
h_MAP = ?
Bayes Rule for Classification
MAP: example
If the test is positive
P(P|cancer) P(cancer) = 0.00784
P(P|¬cancer) P(¬cancer) = 0.02976

h_MAP = ¬cancer

If the test is negative

P(N|cancer) P(cancer) = 0.00016
P(N|¬cancer) P(¬cancer) = 0.96224

h_MAP = ¬cancer
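A quick check of these numbers in Python (all values taken from the slide); normalizing the positive-test case also gives the actual posterior probability of cancer:

    # Cancer-test MAP example: unnormalized posteriors
    p_cancer, p_not = 0.008, 0.992
    pos_cancer = 0.98 * p_cancer   # P(P|cancer) P(cancer)   = 0.00784
    pos_not = 0.03 * p_not         # P(P|¬cancer) P(¬cancer) = 0.02976
    print(pos_cancer, pos_not)     # h_MAP = ¬cancer

    # Normalized posterior of cancer given a positive test: ~0.21
    print(pos_cancer / (pos_cancer + pos_not))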
Naive Bayes Classifier
Bayes classifier
We considered the question: “What is the most probable hypothesis given the training data?”
We can also ask: “What is the most probable classification of the new instance given the training data?”
In other words, we can apply Bayes theorem to the classification problem
The most well-known example of such an approach is the Naive Bayes classifier
Naive Bayes Classifier
Naive Bayes Classifier
The Naive Bayes approach is based on the MAP approach applied to the set of possible classifications of new data instances
Given a data instance as a tuple of values x = ⟨x1, x2, …, xn⟩ and a finite set of classes C, v_MAP is given by:

v_MAP = argmax_{c_j ∈ C} P(c_j | x1, x2, …, xn)
      = argmax_{c_j ∈ C} P(x1, x2, …, xn | c_j) P(c_j) / P(x1, x2, …, xn)
      = argmax_{c_j ∈ C} P(x1, x2, …, xn | c_j) P(c_j)
Naive Bayes Classifier
Naive Bayes Classifier
We estimate the two terms from the previous equation based on the training data
P(c_j) is easy: we just count the frequency with which each class occurs in the training data
Estimating P(x1, x2, …, xn | c_j) is difficult unless we have a very, very large dataset
The number of those terms is equal to the number of possible instances times the number of classes
We would need to see every instance in the instance space to obtain reliable estimates
Naive Bayes Classifier
Naive Bayes Classifier
Using the chain rule we can rewrite P(x1, x2, …, xn | c_j) as:

P(x1, …, xn | c_j) = P(x1|c_j) P(x2|x1, c_j) P(x3|x1, x2, c_j) … P(xn|x1, …, xn−1, c_j)
Naive Bayes Classifier
Naive Bayes Classifier
The naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the class:

P(x, y | c) = P(x|c) P(y|c)
P(x | y, c) = P(x|c), ∀x, y, c

P(x1, …, xn | c_j) = P(x1|c_j) P(x2|c_j) P(x3|c_j) … P(xn|c_j)
                   = ∏_i P(xi|c_j)
Naive Bayes Classifier
Naive Bayes Classifier
v_MAP is then given by:

v_MAP = argmax_{c_j ∈ C} P(c_j) ∏_i P(xi|c_j)

Estimating P(xi|c_j) from the training data is much easier than estimating P(x1, x2, …, xn | c_j)
The number of parameters to estimate is the number of distinct attribute values times the number of distinct classes
Estimating a single parameter is typically easy: e.g. based on the frequency of occurrence in the training data
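A minimal sketch of this counting-based estimation and classification in Python (raw frequency estimates, without the smoothing discussed later in the lecture; the function names are ad hoc):

    import math
    from collections import Counter, defaultdict

    def train_nb(examples):
        # examples: list of (attribute-tuple, class) pairs
        class_counts = Counter(c for _, c in examples)
        value_counts = defaultdict(Counter)   # (position, class) -> value counts
        for x, c in examples:
            for i, v in enumerate(x):
                value_counts[(i, c)][v] += 1
        n = len(examples)
        prior = {c: k / n for c, k in class_counts.items()}
        def cond(i, v, c):                    # P(x_i = v | c) by frequency
            return value_counts[(i, c)][v] / class_counts[c]
        return prior, cond

    def classify(x, prior, cond):
        # v_MAP = argmax_c P(c) * prod_i P(x_i | c)
        return max(prior, key=lambda c: prior[c] * math.prod(
            cond(i, v, c) for i, v in enumerate(x)))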
Naive Bayes Classifier
Naive Bayes Classifier: Example
The PlayTennis problem from the book “Machine Learning” by Tom Mitchell
Can we play tennis on a given day with given attributes?
It is a classification problem with two classes: “yes” and “no”
The classification is based on the values of attributes such as temperature, humidity, etc.
We have 14 training examples
Naive Bayes Classifier
Naive Bayes Classifier: example
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   sunny     hot          high      weak    no
D2   sunny     hot          high      strong  no
D3   overcast  hot          high      weak    yes
D4   rain      mild         high      weak    yes
D5   rain      cool         normal    weak    yes
D6   rain      cool         normal    strong  no
D7   overcast  cool         normal    strong  yes
D8   sunny     mild         high      weak    no
D9   sunny     cool         normal    weak    yes
D10  rain      mild         normal    weak    yes
D11  sunny     mild         normal    strong  yes
D12  overcast  mild         high      strong  yes
D13  overcast  hot          normal    weak    yes
D14  rain      mild         high      strong  no
Naive Bayes Classifier
Naive Bayes Classifier: Example
The task is to classify the new instance: ⟨sunny, cool, high, strong⟩
The class is then given by:

v_MAP = argmax_{c_j ∈ C} P(c_j) ∏_i P(xi|c_j)
      = argmax_{c_j ∈ C} P(c_j) P(sunny|c_j) P(cool|c_j) P(high|c_j) P(strong|c_j)
Naive Bayes Classifier
Naive Bayes Classifier: example
PlayTennis
Outlook          sunny  overcast  rain  n
yes              2      4         3     9
no               3      0         2     5

P
Outlook          sunny  overcast  rain
P(Outlook|yes)   2/9    4/9       3/9
P(Outlook|no)    3/5    0         2/5

Table: Frequencies and conditional probabilities of the “Outlook” feature
Naive Bayes Classifier
Naive Bayes Classifier: example
PlayTennis
Temperature          hot  mild  cool  n
yes                  2    4     3     9
no                   2    2     1     5

P
Temperature          hot  mild  cool
P(Temperature|yes)   2/9  4/9   3/9
P(Temperature|no)    2/5  2/5   1/5

Table: Frequencies and conditional probabilities of the “Temperature” feature
Naive Bayes Classifier
Naive Bayes Classifier: example
PlayTennis
Humidity          high  normal  n
yes               3     6       9
no                4     1       5

P
Humidity          high  normal
P(Humidity|yes)   3/9   6/9
P(Humidity|no)    4/5   1/5

Table: Frequencies and conditional probabilities of the “Humidity” feature
Naive Bayes Classifier
Naive Bayes Classifier: example
PlayTennis
Wind          weak  strong  n
yes           6     3       9
no            2     3       5

P
Wind          weak  strong
P(Wind|yes)   6/9   3/9
P(Wind|no)    2/5   3/5

Table: Frequencies and conditional probabilities of the “Wind” feature
Naive Bayes Classifier
Naive Bayes Classifier: Example
We calculate unnormalized a posteriori probabilities for the classes:
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.00529
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.02057

v_MAP = no
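A quick check of the slide's numbers, plugging in the estimates from the tables above:

    # Unnormalized scores for <sunny, cool, high, strong>
    p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9   # ~0.00529
    p_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5   # ~0.02057
    print(p_yes, p_no)                     # v_MAP = no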
Naive Bayes Classifier
Naive Bayes Classifier: Example
Remarks
These are not the probabilities
To obtain conditional probabilities we need to normalize with the sum over the classes:

0.02057 / (0.00529 + 0.02057) ≈ 0.795

In a real example the numbers get very small: danger of underflow
Use logarithms!
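In log space the product becomes a sum, and the normalization can be done stably with the log-sum-exp trick; a minimal sketch using the PlayTennis numbers:

    import math

    # Per-class log-scores: log P(c) + sum_i log P(x_i|c)
    scores = {
        "yes": math.log(9/14) + sum(map(math.log, (2/9, 3/9, 3/9, 3/9))),
        "no":  math.log(5/14) + sum(map(math.log, (3/5, 1/5, 4/5, 3/5))),
    }
    # log-sum-exp normalization avoids underflow for long products
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    print({c: math.exp(s - log_z) for c, s in scores.items()})  # no: ~0.795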
Naive Bayes Classifier
Naive Bayes Classifier: Example
The task is to classify the new instance: ⟨overcast, cool, high, strong⟩

P(yes) P(overcast|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.01058
P(no) P(overcast|no) P(cool|no) P(high|no) P(strong|no) = 0

P(overcast|no) = 0
v_MAP = yes for any new instance with overcast as Outlook
Naive Bayes Classifier
Estimating probabilities
We have estimated probabilities by the fraction of times the event is observed to occur over the total number of opportunities: p = n_c / n
The first difficulty: n_c / n is a biased underestimate of the real probability
Second, when the estimate is 0, the a posteriori probability for that class is always 0
Smoothing! (as discussed in the feature extraction lecture)
Naive Bayes Classifier
Estimating probabilities
We introduce fake counts
One possibility: the m-estimate of probability

p̂ = (n_c + m p) / (n + m)

p is a prior estimate of the probability we wish to determine
m is a constant called the equivalent sample size
Naive Bayes Classifier
Estimating probabilities
In the absence of any other information we will assume uniform priors: p = 1/k, with k being the number of distinct values of a feature
E.g. if a feature has two possible values, p = 0.5
Using this prior we produce m fake counts
E.g. if a feature has two possible values we can produce two fake counts:

p̂ = (n_c + 1) / (n + 2)
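As a Python function this is a direct transcription of the formula above; the example reuses the two-value case:

    def m_estimate(n_c, n, p, m):
        # (n_c + m*p) / (n + m): n_c observed counts, n opportunities,
        # p prior estimate, m equivalent sample size (fake counts)
        return (n_c + m * p) / (n + m)

    # Uniform prior over 2 values with m = 2 gives (n_c + 1) / (n + 2):
    print(m_estimate(0, 5, 0.5, 2))  # 1/7 instead of a hard 0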
Naive Bayes for Text Classification
Classifying text
Text classification is a standard area of classical information retrieval
A query (e.g. “Multicore CPU”) against a document collection divides the collection into two classes:
Documents about “Multicore CPU”
Documents not about “Multicore CPU”
A typical two-class classification problem
Naive Bayes for Text Classification
Classifying text
In the general case, a class is a subject, a topic or a category
Application examples of text classification include:
1 Automatic detection of spam documents or other types of offending content
2 Sentiment detection (positive or negative product reviews)
3 Personal email sorting
4 Topical or vertical search
There are several approaches to text classification, but again we concentrate on statistical text classification, in particular on Naive Bayes
Naive Bayes for Text Classification
Classifying text
We are given a description d ∈ X of a document and a fixed set of classes ℂ = {c1, c2, …, cJ}
X is the document space, typically some high-dimensional space
The classes are in most cases human-defined
We are given a training set H of labeled documents ⟨d, c⟩ ∈ X × ℂ
For example: ⟨Beijing joins the World Trade Organization, China⟩
Naive Bayes for Text Classification
Naive Bayes text classification
The MAP approach applied to the set of possible classifications of a new document d:

c_MAP = argmax_{c_j ∈ ℂ} P(c_j | d)
      = argmax_{c_j ∈ ℂ} P(d | c_j) P(c_j) / P(d)
      = argmax_{c_j ∈ ℂ} P(d | c_j) P(c_j)
Naive Bayes for Text Classification
Naive Bayes text classification
We can interpret the last equation as a description of the generative process
To generate a document, we first select a class c with probability P(c)
In the second step we generate a document given c, according to the conditional probability distribution P(d|c)
What does this conditional probability distribution look like?
First we need to make certain simplifying assumptions
Naive Bayes for Text Classification
Naive Bayes text classification
Similarly to the naive Bayes classifier, we base our models on the simplifying assumption that the attribute values are independent, or conditionally independent given the class:

P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)

This is the unigram language model
Under this model the order of words is irrelevant: “bag of words”
A generative model will assign the same probability to any “bag of words” regardless of their ordering
Naive Bayes for Text Classification
Naive Bayes text classification
Assuming the “bag of words” model, we still need to select a particular conditional distribution over bags
The two most common approaches are:
1 Bernoulli distribution
2 Multinomial distribution
Naive Bayes for Text Classification
Bernoulli generative model
In this model we assume that the documents are generated from a multivariate Bernoulli distribution
The model generates an indicator for each term from the vocabulary:
1 indicates the presence of the term in a document
0 indicates the absence of the term in a document
Different generation of the documents leads to different parameter estimation and different classification rules
Naive Bayes for Text Classification
Bernoulli generative model
In particular, for each term in the vocabulary we perform a Bernoulli trial, and its result decides if the term is present in a document or absent:

P(d|c) = ∏_{t_i ∈ d} P(t_i|c) ∏_{t_i ∉ d} (1 − P(t_i|c))
Naive Bayes for Text Classification
Inference of parameters in the Bernoulli model
We apply MLE again to get the estimates for P(c) and P(t_i|c):

P(c) = N_c / N

P(t_i|c) = ∑_j I_{t_i,d_j} / N_c

N is the total number of documents, N_c is the number of documents in class c
I_{t_i,d_j} is the indicator of the presence of term t_i in document d_j belonging to class c (the sum runs over the documents of class c)
Naive Bayes for Text Classification
Inference of parameters in the Bernoulli model
As before, because of zero counts we need to apply smoothing, e.g. m-estimate smoothing with p = 0.5 and m = 2:

P(t_i|c) = (∑_j I_{t_i,d_j} + 1) / (N_c + 2)
Naive Bayes for Text Classification
MAP in the Bernoulli model
With the estimated parameters, and ignoring the irrelevant constants, we obtain:

c_MAP = argmax_{c_j ∈ ℂ} P(d|c_j) P(c_j)
      = argmax_{c_j ∈ ℂ} P(c_j) ∏_{t_i ∈ d} P(t_i|c_j) ∏_{t_i ∉ d} (1 − P(t_i|c_j))    (1)

Interpretation of P(t_i|c) and (1 − P(t_i|c)): how much evidence the presence or absence of t_i contributes to c being the correct class
Implementation with logarithms! Underflow!
Naive Bayes for Text Classification
NB with Bernoulli model: Example
Document                    Class
Chinese Beijing Chinese     China
Chinese Chinese Shanghai    China
Chinese Macao               China
Tokyo Japan Chinese         Japan

What is the class of d = ⟨Chinese Chinese Chinese Tokyo Japan⟩?
Naive Bayes for Text Classification
NB with Bernoulli model: Example
Class
Term          Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
China         3        1        1         1      0      0
Japan         1        0        0         0      1      1

P
Term          Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
P(t|China)    4/5      2/5      2/5       2/5    1/5    1/5
P(t|Japan)    2/3      1/3      1/3       1/3    2/3    2/3

Table: Frequencies and conditional probabilities of terms (m-estimate smoothing)
Naive Bayes for Text Classification
NB with Bernoulli model: Example
The task is to classify the new instance: d = ⟨Chinese Chinese Chinese Tokyo Japan⟩

P(China) P(Chinese|China) P(Tokyo|China) P(Japan|China)
× (1 − P(Beijing|China)) (1 − P(Shanghai|China)) (1 − P(Macao|China)) = 0.00518

P(Japan) P(Chinese|Japan) P(Tokyo|Japan) P(Japan|Japan)
× (1 − P(Beijing|Japan)) (1 − P(Shanghai|Japan)) (1 − P(Macao|Japan)) = 0.02194

c_MAP = Japan
“Japan” and “Tokyo” are good indicators for class Japan, and since we do not count occurrences they outweigh the “Chinese” indicator
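A quick check of these numbers with the probabilities from the table above:

    # Bernoulli model: P(c) * prod over present terms * prod over absent terms
    p_china = 3/4 * (4/5 * 1/5 * 1/5) * (1 - 2/5) * (1 - 2/5) * (1 - 2/5)
    p_japan = 1/4 * (2/3 * 2/3 * 2/3) * (1 - 1/3) * (1 - 1/3) * (1 - 1/3)
    print(p_china, p_japan)   # ~0.00518, ~0.02194 -> c_MAP = Japan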
Naive Bayes for Text Classification
NB with Bernoulli model
TRAINBERNOULLINB(C, D)
1 V ← EXTRACTVOCABULARY(D)
2 N ← COUNTDOCS(D)
3 for each c ∈ C
4 do N_c ← COUNTDOCSINCLASS(D, c)
5    prior[c] ← N_c / N
6    for each t ∈ V
7    do N_ct ← COUNTDOCSINCLASSCONTAININGTERM(D, c, t)
8       condprob[t][c] ← (N_ct + 1) / (N_c + 2)
9 return V, prior, condprob

APPLYBERNOULLINB(C, V, prior, condprob, d)
1 V_d ← EXTRACTTERMSFROMDOC(V, d)
2 for each c ∈ C
3 do score[c] ← log prior[c]
4    for each t ∈ V
5    do if t ∈ V_d
6       then score[c] += log condprob[t][c]
7       else score[c] += log(1 − condprob[t][c])
8 return argmax_{c ∈ C} score[c]

Figure 13.3: NB algorithm (Bernoulli model): training and testing. The add-one smoothing in Line 8 (top) is in analogy to Equation (13.7) with B = 2.
13.3 The Bernoulli model
There are two different ways we can set up an NB classifier. The model we introduced in the previous section is the multinomial model. It generates one term from the vocabulary in each position of the document, where we assume a generative model that will be discussed in more detail in Section 13.4 (see also page 237).

An alternative to the multinomial model is the multivariate Bernoulli model or Bernoulli model. It is equivalent to the binary independence model of Section 11.3 (page 222), which generates an indicator for each term of the vocabulary, either 1 indicating presence of the term in the document or 0 indicating absence. Figure 13.3 presents training and testing algorithms for the Bernoulli model. The Bernoulli model has the same time complexity as the multinomial model.

The different generation models imply different estimation strategies and different classification rules. The Bernoulli model estimates P̂(t|c) as the fraction of documents of class c that contain term t (Figure 13.3, TRAINBERNOULLI-
Naive Bayes for Text Classification
Multinomial random variable
The model is based on the multinomial distribution, which is a generalization of the binomial distribution
Each of n trials leads to a success in one of k categories, with each category having a fixed success probability p_i
It holds that ∑_{i=1}^k p_i = 1
A multinomial r.v. is given by the vector X = (X_1, X_2, …, X_k), where X_i is the number of successes in the i-th category
It holds that ∑_{i=1}^k X_i = n
Naive Bayes for Text Classification
Multinomial random variable
PMF
p(x_1, …, x_k; n, p_1, …, p_k) = n! / (x_1! ⋯ x_k!) · p_1^{x_1} ⋯ p_k^{x_k}   if ∑_{i=1}^k x_i = n, and 0 otherwise

with ∑_{i=1}^k p_i = 1 and x_i non-negative for all i
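A direct transcription of this PMF in Python, for illustration (factorials and products from the standard library):

    from math import factorial, prod

    def multinomial_pmf(x, p):
        # x = (x_1, ..., x_k) category counts, p = (p_1, ..., p_k) probabilities
        n = sum(x)
        coeff = factorial(n) // prod(factorial(xi) for xi in x)
        return coeff * prod(pi ** xi for pi, xi in zip(p, x))

    # E.g. 10 rolls of a fair die with observed counts (3, 2, 1, 1, 2, 1)
    print(multinomial_pmf((3, 2, 1, 1, 2, 1), (1/6,) * 6))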
Naive Bayes for Text Classification
Multinomial generative model
In this model we assume that the documents are generated from a multinomial distribution, i.e. the number of occurrences of the terms in a document is a multinomial r.v.:

P(d|c) = L_d! / (tf_{t_1,d}! ⋯ tf_{t_M,d}!) · P(t_1|c)^{tf_{t_1,d}} ⋯ P(t_M|c)^{tf_{t_M,d}}   if ∑_{i=1}^M tf_{t_i,d} = L_d, and 0 otherwise

tf_{t_i,d} is the term frequency of term t_i in document d
L_d is the length of the document, i.e. L_d = ∑_{i=1}^M tf_{t_i,d}
M is the size of the document vocabulary
Naive Bayes for Text Classification
Inference of parameters in the multinomial model
The likelihood of all documents belonging to a class is the product of the multinomial probabilities
Simple calculus (setting the derivative of the log-likelihood to 0) gives the maximum likelihood estimates for P(c) and P(t_i|c):

P(c) = N_c / N

P(t_i|c) = tf_{t_i,c} / ∑_{j=1}^M tf_{t_j,c}

N is the total number of documents, N_c is the number of documents in class c
tf_{t_i,c} is the term frequency of term t_i in all documents belonging to class c
Naive Bayes for Text Classification
Inference of parameters in the multinomial model
As before, because of zero counts we need to apply smoothing, e.g. Laplace smoothing:

P(t_i|c) = (tf_{t_i,c} + 1) / ∑_{j=1}^M (tf_{t_j,c} + 1)
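A minimal sketch of multinomial NB training and application with Laplace smoothing and log-space scoring, in the spirit of the IIR pseudocode shown a few slides below (the function names are ad hoc):

    import math
    from collections import Counter

    def train_multinomial_nb(docs):
        # docs: list of (token-list, class) pairs
        vocab = {t for tokens, _ in docs for t in tokens}
        classes = {c for _, c in docs}
        prior = {c: sum(1 for _, k in docs if k == c) / len(docs) for c in classes}
        condprob = {}
        for c in classes:
            tf = Counter(t for tokens, k in docs if k == c for t in tokens)
            denom = sum(tf.values()) + len(vocab)       # add-one denominator
            condprob[c] = {t: (tf[t] + 1) / denom for t in vocab}
        return prior, condprob

    def apply_multinomial_nb(prior, condprob, tokens):
        def score(c):
            s = math.log(prior[c])
            for t in tokens:
                if t in condprob[c]:                    # skip out-of-vocabulary terms
                    s += math.log(condprob[c][t])
            return s
        return max(prior, key=score)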
Naive Bayes for Text Classification
MAP in the multinomial model
With the estimated parameters, and ignoring the irrelevant constants, we obtain:

c_MAP = argmax_{c_j ∈ ℂ} P(d|c_j) P(c_j)
      = argmax_{c_j ∈ ℂ} P(c_j) ∏_{i=1}^{L_d} P(t_i|c_j)    (2)

Interpretation of P(t_i|c): how much evidence t_i contributes to c being the correct class
Implementation with logarithms! Underflow!
Naive Bayes for Text Classification
NB with multinomial model: Example
Document                    Class
Chinese Beijing Chinese     China
Chinese Chinese Shanghai    China
Chinese Macao               China
Tokyo Japan Chinese         Japan

What is the class of d = ⟨Chinese Chinese Chinese Tokyo Japan⟩?
Naive Bayes for Text Classification
NB with multinomial model: Example
Class
Term          Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
China         5        1        1         1      0      0
Japan         1        0        0         0      1      1

P
Term          Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
P(t|China)    6/14     2/14     2/14      2/14   1/14   1/14
P(t|Japan)    2/9      1/9      1/9       1/9    2/9    2/9

Table: Frequencies and conditional probabilities of terms (Laplace smoothing)
Naive Bayes for Text Classification
NB with multinomial model: Example
The task is to classify the new instance: d = ⟨Chinese Chinese Chinese Tokyo Japan⟩

P(China) P(Chinese|China)³ P(Tokyo|China) P(Japan|China) = 0.0003
P(Japan) P(Chinese|Japan)³ P(Tokyo|Japan) P(Japan|Japan) = 0.0001

c_MAP = China
Three occurrences of the positive indicator outweigh two occurrences of the negative indicators
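A quick check of these numbers with the estimates from the table above:

    # Multinomial model: term frequencies matter
    p_china = 3/4 * (6/14)**3 * (1/14) * (1/14)
    p_japan = 1/4 * (2/9)**3 * (2/9) * (2/9)
    print(p_china, p_japan)   # ~0.0003, ~0.0001 -> c_MAP = China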
Naive Bayes for Text Classification
NB with multinomial model
TRAINMULTINOMIALNB(C, D)
1 V ← EXTRACTVOCABULARY(D)
2 N ← COUNTDOCS(D)
3 for each c ∈ C
4 do N_c ← COUNTDOCSINCLASS(D, c)
5    prior[c] ← N_c / N
6    text_c ← CONCATENATETEXTOFALLDOCSINCLASS(D, c)
7    for each t ∈ V
8    do T_ct ← COUNTTOKENSOFTERM(text_c, t)
9    for each t ∈ V
10   do condprob[t][c] ← (T_ct + 1) / ∑_{t′} (T_ct′ + 1)
11 return V, prior, condprob

APPLYMULTINOMIALNB(C, V, prior, condprob, d)
1 W ← EXTRACTTOKENSFROMDOC(V, d)
2 for each c ∈ C
3 do score[c] ← log prior[c]
4    for each t ∈ W
5    do score[c] += log condprob[t][c]
6 return argmax_{c ∈ C} score[c]

Figure 13.2: Naive Bayes algorithm (multinomial model): training and testing.
assign a high probability to the UK class because the term Britain occurs. The problem is that the zero probability for WTO cannot be “conditioned away,” no matter how strong the evidence for the class UK from other features. The estimate is 0 because of sparseness: the training data are never large enough to represent the frequency of rare events adequately, for example, the frequency of WTO occurring in UK documents.

To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count (cf. Section 11.3.2):

P̂(t|c) = (T_ct + 1) / ∑_{t′∈V} (T_ct′ + 1) = (T_ct + 1) / ((∑_{t′∈V} T_ct′) + B),    (13.7)
Naive Bayes for Text Classification
NB with multinomial model: Complexity
Estimation of the parameters is O(|ℂ| |V|):
the product of the number of classes and the size of the vocabulary
This is linear in the size of the vocabulary
Preprocessing is also linear
Applying the classifier is also linear in the size of the document
All together linear in the time it takes to scan the data: very efficient!
Naive Bayes for Text Classification
Multinomial vs. Bernoulli model
The Bernoulli model estimates P(t|c) as the fraction of documents of class c that contain term t
In contrast, the multinomial model estimates P(t|c) as the fraction of tokens (term positions) in the documents of class c that are the term t
When classifying a document, the Bernoulli model uses binary occurrence information, whereas the multinomial model keeps track of multiple occurrences
Naive Bayes for Text Classification
Multinomial vs. Bernoulli model
As a result, the Bernoulli model tends to make more errors when classifying long documents
E.g. it might assign a complete book to the class “China” because of a single occurrence of the term “Chinese”
Non-occurring terms do not affect the classification decision in the multinomial model, whereas in the Bernoulli model they do
Bernoulli works better with fewer features
The complexity is the same, and both models work very well and are accurate in practice
Naive Bayes for Text Classification
NB models
The independence assumption is oversimplified
The “bag of words” model ignores the positions of the words and their context
E.g. the words in “Hong Kong” are not independent of each other, nor are their positions
The Bernoulli model does not even take the term frequencies into account
This leads to very bad estimates of the true document probabilities
Naive Bayes for Text Classification
NB models
However, NB classification decisions are surprisingly good
Suppose the true document probabilities are P(c1|d) = 0.6 and P(c2|d) = 0.4
Assume that d contains many positive indicators for c1 and many negative indicators for c2
Then the NB estimates might be: P̂(c1|d) = 0.00099 and P̂(c2|d) = 0.00001
This is common: the winning class in NB usually has a much larger probability than the others
In classification we care only about the winning class
Naive Bayes for Text Classification
NB performance
[Figure 13.8: Effect of feature set size on accuracy (F1) for the multinomial and Bernoulli models, with features selected by mutual information, chi-square, and frequency; plot omitted.]
the end when we use all features. The reason is that the multinomial takes the number of occurrences into account in parameter estimation and classification and therefore better exploits a larger number of features than the Bernoulli model. Regardless of the differences between the two methods, using a carefully selected subset of the features results in better effectiveness than using all features.
13.5.2 χ² feature selection

Another popular feature selection method is χ². In statistics, the χ² test is applied to test the independence of two events, where two events A and B are defined to be independent if P(AB) = P(A)P(B) or, equivalently, P(A|B) = P(A) and P(B|A) = P(B). In feature selection, the two events are the occurrence of the term and the occurrence of the class. We then rank terms with respect to the following quantity:

X²(D, t, c) = ∑_{e_t ∈ {0,1}} ∑_{e_c ∈ {0,1}} (N_{e_t e_c} − E_{e_t e_c})² / E_{e_t e_c}    (13.18)
Naive Bayes for Text Classification
NB performance
Table 13.8: Macro- and microaveraging. “Truth” is the true class and “call” the decision of the classifier. In this example, macroaveraged precision is [10/(10 + 10) + 90/(10 + 90)]/2 = (0.5 + 0.9)/2 = 0.7. Microaveraged precision is 100/(100 + 20) ≈ 0.83.

class 1        truth: yes   truth: no
call: yes      10           10
call: no       10           970

class 2        truth: yes   truth: no
call: yes      90           10
call: no       10           890

pooled table   truth: yes   truth: no
call: yes      100          20
call: no       20           1860
Table 13.9: Text classification effectiveness numbers on Reuters-21578 for F1 (in percent). Results from Li and Yang (2003) (a), Joachims (1998) (b: kNN) and Dumais et al. (1998) (b: NB, Rocchio, trees, SVM).

(a)                         NB   Rocchio  kNN  SVM
micro-avg-L (90 classes)    80   85       86   89
macro-avg (90 classes)      47   59       60   60

(b)                         NB   Rocchio  kNN  trees  SVM
earn                        96   93       97   98     98
acq                         88   65       92   90     94
money-fx                    57   47       78   66     75
grain                       79   68       82   85     95
crude                       80   70       86   85     89
trade                       64   65       77   73     76
interest                    65   63       74   67     78
ship                        85   49       79   74     86
wheat                       70   69       77   93     92
corn                        65   48       78   92     90
micro-avg (top 10)          82   65       82   88     92
micro-avg-D (118 classes)   75   62       n/a  n/a    87
Rocchio and kNN. In addition, we give numbers for decision trees, an important classification method we do not cover. The bottom part of the table shows that there is considerable variation from class to class. For instance, NB beats kNN on ship, but is much worse on money-fx.

Comparing parts (a) and (b) of the table, one is struck by the degree to which the cited papers' results differ. This is partly due to the fact that the numbers in (b) are break-even scores (cf. page 161) averaged over 118 classes, whereas the numbers in (a) are true F1 scores (computed without any know-
Naive Bayes for Text Classification
NB performance
SVM seems to be better, but it costs more to train
kNN is better for some classes
In theory: SVM better than kNN better than NB
When the data is nice!
In practice, with errors in the data, data drifting, etc., it is very difficult to beat NB