Transcript
Page 1:

Knowledge Discovery and Data Mining 1 (VO) (706.701): Classification

Denis Helic

ISDS, TU Graz

January 10, 2019

Page 2:

Big picture: KDDM

[Diagram: the Knowledge Discovery Process (Preprocessing, Transformation, Data Mining) supported by mathematical tools (Linear Algebra, Probability Theory, Information Theory, Statistical Inference) and by infrastructure (Hardware & Programming, Model).]

Page 3:

Outline

1 Introduction

2 Bayes Rule for Classification

3 Naive Bayes Classifier

4 Naive Bayes for Text Classification

Slides
The slides are partially based on the “Machine Learning” course from the University of Washington by Pedro Domingos, the “Machine Learning” book by Tom Mitchell, and “Introduction to Information Retrieval” by Manning, Raghavan and Schütze

Page 4:

Introduction

Classification as a Machine Learning task

Classification problems: classify an example into a given set of categories
Typically we have a labeled set of training examples
We apply an algorithm to learn from the training set and devise a classification rule
We apply the rule to classify new examples

Page 5:

Introduction

Types of learning

Supervised (inductive) learning: training data includes the desired outputs
Unsupervised learning: training data does not include the desired outputs
Semi-supervised learning: training data includes a few desired outputs
Reinforcement learning: rewards from a sequence of actions

Page 6:

Introduction

Supervised learning

Given examples of a function (x, 𝑓(x))
Predict the function 𝑓(x) for new examples x
Depending on 𝑓(x) we have:

1 Classification if 𝑓(x) is a discrete function
2 Regression if 𝑓(x) is a continuous function
3 Probability estimation if 𝑓(x) is a probability distribution

Page 7:

Introduction

Classification

Given: training examples (x, 𝑓(x)) for some unknown discrete function 𝑓
Find: a good approximation of 𝑓
x is data, e.g. a vector, text documents, emails, words in a text, etc.

Page 8:

Introduction

Classification: example applications

Credit risk assessment:
x: properties of customers and the proposed purchase
𝑓(x): approve the purchase or not

Disease diagnosis:
x: properties of patients, lab tests
𝑓(x): disease

Page 9:

Introduction

Classification: example applications

Email spam filter:
x: email text (or features extracted from emails)
𝑓(x): spam or not spam

Text categorization:
x: documents or words (features) from a document
𝑓(x): thematic category

Page 10:

Introduction

Terminology

Training example: an example of the form (x, 𝑓(x))
Target function: the true function 𝑓
Hypothesis: a proposed function ℎ believed to be similar to 𝑓
Classifier: a discrete-valued function; the possible values 𝑓(x) ∈ {1, 2, …, 𝐾} are called classes or labels
Hypothesis space: the space of all possible hypotheses

Page 11:

Introduction

Classification approaches

Logistic regression
Decision trees
Support vector machines
Nearest neighbor algorithms
Neural networks
Statistical inference such as Bayesian learning (e.g. text categorization)

Page 12:

Bayes Rule for Classification

Application of Bayes Rule for classification

We are given some hypothesis space 𝐻, the observed data 𝐷, and some prior probabilities of the various hypotheses
We are interested in determining the best hypothesis from 𝐻
The best means the most probable given the data plus the prior
In other words, we are interested in the posterior probabilities of the various hypotheses given the data and their prior probabilities

Page 13:

Bayes Rule for Classification

Application of Bayes Rule for classification

Bayes Theorem
For a given joint probability distribution 𝑃(𝐷, ℎ) we have the prior probability of ℎ given as 𝑃(ℎ), the probability 𝑃(𝐷) of observing the data regardless of which hypothesis holds, and the conditional probability 𝑃(𝐷|ℎ) of the data given a hypothesis. The posterior probability of a hypothesis ℎ is given by:

𝑃(ℎ|𝐷) = 𝑃(𝐷|ℎ)𝑃(ℎ) / 𝑃(𝐷)

Page 14:

Bayes Rule for Classification

Application of Bayes Rule for classification

𝑃(ℎ|𝐷) increases with 𝑃(ℎ) and with 𝑃(𝐷|ℎ)
Also, 𝑃(ℎ|𝐷) decreases as 𝑃(𝐷) increases, because the more probable it is that 𝐷 will be observed independently of ℎ, the less evidence 𝐷 provides in support of ℎ
In many learning scenarios the learner considers some set of candidate hypotheses 𝐻 and selects the most probable hypothesis from this set
Such a maximally probable hypothesis is called the maximum a posteriori (MAP) hypothesis

Page 15:

Bayes Rule for Classification

Maximum a posteriori (MAP) hypothesis

MAP

ℎ𝑀𝐴𝑃 ≡ argmax_{ℎ∈𝐻} 𝑃(ℎ|𝐷)
      = argmax_{ℎ∈𝐻} 𝑃(𝐷|ℎ)𝑃(ℎ) / 𝑃(𝐷)
      = argmax_{ℎ∈𝐻} 𝑃(𝐷|ℎ)𝑃(ℎ)

In the last step we dropped 𝑃(𝐷) because it is a constant independent of ℎ

Page 16:

Bayes Rule for Classification

Maximum likelihood (ML) hypothesis

In some cases we can assume that every hypothesis is equally probable a priori
The MAP equation then simplifies to:

ML

ℎ𝑀𝐿 = argmax_{ℎ∈𝐻} 𝑃(𝐷|ℎ)

𝑃(𝐷|ℎ) is the likelihood of the data given a hypothesis
This is then the maximum likelihood (ML) hypothesis

Page 17:

Bayes Rule for Classification

MAP: example

Medical diagnosis
Consider a medical diagnosis problem in which there are two alternative hypotheses:

1 The patient has a particular form of cancer
2 The patient does not have cancer.

The available data is from a particular laboratory test with two possible outcomes: P (positive) and N (negative). We have prior knowledge that over the entire population only 0.008 of the people have this disease. Furthermore, the test returns a correct positive result in only 98% of the cases in which the disease is actually present and a correct negative result in only 97% of the cases in which the disease is not present. What is ℎ𝑀𝐴𝑃?

Page 18:

Bayes Rule for Classification

MAP: example

𝑃(cancer) = 0.008
𝑃(cancerᶜ) = 0.992
𝑃(P|cancer) = 0.98
𝑃(N|cancer) = 0.02
𝑃(P|cancerᶜ) = 0.03
𝑃(N|cancerᶜ) = 0.97
ℎ𝑀𝐴𝑃 = ?

Page 19:

Bayes Rule for Classification

MAP: example

If the test is positive

𝑃(P|cancer)𝑃(cancer) = 0.00784
𝑃(P|cancerᶜ)𝑃(cancerᶜ) = 0.02976

ℎ𝑀𝐴𝑃 = cancerᶜ

If the test is negative

𝑃(N|cancer)𝑃(cancer) = 0.00016
𝑃(N|cancerᶜ)𝑃(cancerᶜ) = 0.96224

ℎ𝑀𝐴𝑃 = cancerᶜ
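A minimal Python sketch (my own, not from the slides) that reproduces these numbers and also shows the normalized posterior, i.e. how unlikely cancer still is even after a positive test:

```python
# Priors and test characteristics from the example
p_cancer, p_no_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_no_cancer = 0.98, 0.03

# Unnormalized posteriors for a positive test result
score_cancer = p_pos_given_cancer * p_cancer            # 0.00784
score_no_cancer = p_pos_given_no_cancer * p_no_cancer   # 0.02976
h_map = "cancer" if score_cancer > score_no_cancer else "no cancer"

# Normalizing gives the true posterior P(cancer | positive) ~ 0.21
posterior = score_cancer / (score_cancer + score_no_cancer)
print(h_map, round(posterior, 2))
```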

Page 20:

Naive Bayes Classifier

Bayes classifier

We considered the question: “What is the most probable hypothesis given the training data?”
We can also ask: “What is the most probable classification of the new instance given the training data?”
In other words, we can apply Bayes theorem for the classification problem
The most well-known example of such an approach is the Naive Bayes Classifier

Page 21:

Naive Bayes Classifier

Naive Bayes Classifier

The Naive Bayes approach is based on the MAP approach applied to the set of possible classifications of new data instances
Given a data instance as a tuple of values x = ⟨𝑥1, 𝑥2, …, 𝑥𝑛⟩ and a finite set of classes 𝐶, 𝑐𝑀𝐴𝑃 is given by:

𝑐𝑀𝐴𝑃 = argmax_{𝑐𝑗∈𝐶} 𝑃(𝑐𝑗|𝑥1, 𝑥2, …, 𝑥𝑛)
      = argmax_{𝑐𝑗∈𝐶} 𝑃(𝑥1, 𝑥2, …, 𝑥𝑛|𝑐𝑗)𝑃(𝑐𝑗) / 𝑃(𝑥1, 𝑥2, …, 𝑥𝑛)
      = argmax_{𝑐𝑗∈𝐶} 𝑃(𝑥1, 𝑥2, …, 𝑥𝑛|𝑐𝑗)𝑃(𝑐𝑗)

Page 22:

Naive Bayes Classifier

Naive Bayes Classifier

We should estimate the two terms from the previous equation based on the training data
𝑃(𝑐𝑗) is easy: we just count the frequency with which each class occurs in the training data
Estimating 𝑃(𝑥1, 𝑥2, …, 𝑥𝑛|𝑐𝑗) is difficult unless we have a very, very large dataset
The number of those terms is equal to the number of possible instances times the number of classes
We need to see every instance in the instance space to obtain reliable estimates

Page 23:

Naive Bayes Classifier

Naive Bayes Classifier

Using the chain rule we can rewrite 𝑃(𝑥1, 𝑥2,… , 𝑥𝑛|𝑐𝑗) as:

𝑃(𝑥1,… , 𝑥𝑛|𝑐𝑗) = 𝑃(𝑥1|𝑐𝑗)𝑃(𝑥2|𝑥1, 𝑐𝑗)𝑃(𝑥3|𝑥1, 𝑥2, 𝑐𝑗)…𝑃(𝑥𝑛|𝑥1,… , 𝑥𝑛−1, 𝑐𝑗)

Page 24:

Naive Bayes Classifier

Naive Bayes Classifier

The naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the class

𝑃(𝑥, 𝑦|𝑐) = 𝑃(𝑥|𝑐)𝑃(𝑦|𝑐)

𝑃(𝑥|𝑦, 𝑐) = 𝑃(𝑥|𝑐), ∀𝑥, 𝑦, 𝑐

𝑃(𝑥1, …, 𝑥𝑛|𝑐𝑗) = 𝑃(𝑥1|𝑐𝑗)𝑃(𝑥2|𝑐𝑗)𝑃(𝑥3|𝑐𝑗)…𝑃(𝑥𝑛|𝑐𝑗) = ∏𝑖 𝑃(𝑥𝑖|𝑐𝑗)

Page 25:

Naive Bayes Classifier

Naive Bayes Classifier

The 𝑐𝑀𝐴𝑃 is then given by:

𝑐𝑀𝐴𝑃 = argmax_{𝑐𝑗∈𝐶} 𝑃(𝑐𝑗) ∏𝑖 𝑃(𝑥𝑖|𝑐𝑗)

Estimating 𝑃(𝑥𝑖|𝑐𝑗) from the training data is much easier than estimating 𝑃(𝑥1, 𝑥2, …, 𝑥𝑛|𝑐𝑗)
The number of parameters to estimate is the number of distinct attribute values times the number of distinct classes
Estimating a single parameter is typically easy: e.g. based on the frequency of its occurrence in the training data

Page 26:

Naive Bayes Classifier

Naive Bayes Classifier: Example

The PlayTennis problem from the book “Machine Learning” by Tom Mitchell
Can we play tennis on a given day with given attributes?
It is a classification problem with two classes: “yes” and “no”
The classification is based on the values of attributes such as temperature, humidity, etc.
We have 14 training examples

Page 27:

Naive Bayes Classifier

Naive Bayes Classifier: example

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   sunny     hot          high      weak    no
D2   sunny     hot          high      strong  no
D3   overcast  hot          high      weak    yes
D4   rain      mild         high      weak    yes
D5   rain      cool         normal    weak    yes
D6   rain      cool         normal    strong  no
D7   overcast  cool         normal    strong  yes
D8   sunny     mild         high      weak    no
D9   sunny     cool         normal    weak    yes
D10  rain      mild         normal    weak    yes
D11  sunny     mild         normal    strong  yes
D12  overcast  mild         high      strong  yes
D13  overcast  hot          normal    weak    yes
D14  rain      mild         high      strong  no
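As a small illustration (my own sketch, not part of the original slides), the class priors and the per-feature conditional probabilities used on the following slides can be derived from this table with a few lines of Python:

```python
from collections import Counter, defaultdict

# The 14 PlayTennis examples: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("sunny", "hot", "high", "weak", "no"),          ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),      ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),       ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),      ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),    ("rain", "mild", "high", "strong", "no"),
]
features = ["Outlook", "Temperature", "Humidity", "Wind"]

class_counts = Counter(row[-1] for row in data)               # {'yes': 9, 'no': 5}
priors = {c: n / len(data) for c, n in class_counts.items()}  # P(c)

# cond[feature][(value, c)] = P(value | c), estimated by relative frequency
cond = defaultdict(dict)
for i, feature in enumerate(features):
    for (value, c), n in Counter((row[i], row[-1]) for row in data).items():
        cond[feature][(value, c)] = n / class_counts[c]

print(priors)                               # P(yes) = 9/14, P(no) = 5/14
print(cond["Outlook"][("sunny", "yes")])    # 2/9
```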

Page 28:

Naive Bayes Classifier

Naive Bayes Classifier: Example

The task is to classify the new instance: ⟨sunny, cool, high, strong⟩
The class is then given by:

𝑐𝑀𝐴𝑃 = argmax_{𝑐𝑗∈𝐶} 𝑃(𝑐𝑗) ∏𝑖 𝑃(𝑥𝑖|𝑐𝑗)
      = argmax_{𝑐𝑗∈𝐶} 𝑃(𝑐𝑗)𝑃(sunny|𝑐𝑗)𝑃(cool|𝑐𝑗)𝑃(high|𝑐𝑗)𝑃(strong|𝑐𝑗)

Page 29:

Naive Bayes Classifier

Naive Bayes Classifier: example

PlayTennis

Outlook          sunny  overcast  rain   𝑛
yes              2      4         3      9
no               3      0         2      5

P(Outlook|yes)   2/9    4/9       3/9
P(Outlook|no)    3/5    0         2/5

Table: Frequencies and conditional probability of “Outlook” feature

Page 30:

Naive Bayes Classifier

Naive Bayes Classifier: example

PlayTennis

Temperature          hot   mild  cool   𝑛
yes                  2     4     3      9
no                   2     2     1      5

P(Temperature|yes)   2/9   4/9   3/9
P(Temperature|no)    2/5   2/5   1/5

Table: Frequencies and conditional probability of “Temperature” feature

Page 31:

Naive Bayes Classifier

Naive Bayes Classifier: example

PlayTennis

Humidity          high  normal   𝑛
yes               3     6        9
no                4     1        5

P(Humidity|yes)   3/9   6/9
P(Humidity|no)    4/5   1/5

Table: Frequencies and conditional probability of “Humidity” feature

Page 32:

Naive Bayes Classifier

Naive Bayes Classifier: example

PlayTennis

Wind          weak  strong   𝑛
yes           6     3        9
no            2     3        5

P(Wind|yes)   6/9   3/9
P(Wind|no)    2/5   3/5

Table: Frequencies and conditional probability of “Wind” feature

Page 33:

Naive Bayes Classifier

Naive Bayes Classifier: Example

We calculate unnormalized a posteriori probabilities for the classes:

𝑃(yes)𝑃(sunny|yes)𝑃(cool|yes)𝑃(high|yes)𝑃(strong|yes) = 0.00529
𝑃(no)𝑃(sunny|no)𝑃(cool|no)𝑃(high|no)𝑃(strong|no) = 0.02057

𝑐𝑀𝐴𝑃 = 𝑛𝑜
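These numbers can be checked with a few lines of Python (my own sketch), plugging in the conditional probabilities from the tables above:

```python
# Unnormalized class scores for x = <sunny, cool, high, strong>
score_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9   # P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes)
score_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5   # P(no)  P(sunny|no)  P(cool|no)  P(high|no)  P(strong|no)

print(round(score_yes, 5), round(score_no, 5))   # 0.00529 0.02057
print("yes" if score_yes > score_no else "no")   # no
```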

Page 34:

Naive Bayes Classifier

Naive Bayes Classifier: Example

Remarks
These are not yet probabilities
To obtain the conditional probabilities we need to normalize with the sum over the classes

0.02057 / (0.00529 + 0.02057) = 0.7949

In a real example the numbers get very small: danger of underflow
Use logarithms!
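A sketch of the logarithm trick (my own illustration, not from the slides): accumulate log scores, then normalize with the log-sum-exp identity so that no intermediate product can underflow:

```python
import math

def normalize_log_scores(log_scores):
    """Turn unnormalized log scores into probabilities without underflow."""
    m = max(log_scores)                          # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Log of the unnormalized scores 0.00529 (yes) and 0.02057 (no)
log_yes = math.log(9/14) + sum(math.log(p) for p in (2/9, 3/9, 3/9, 3/9))
log_no  = math.log(5/14) + sum(math.log(p) for p in (3/5, 1/5, 4/5, 3/5))
print(normalize_log_scores([log_yes, log_no]))   # ~[0.205, 0.795]
```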

Page 35:

Naive Bayes Classifier

Naive Bayes Classifier: Example

The task is to classify the new instance: ⟨overcast, cool, high, strong⟩

𝑃(yes)𝑃(overcast|yes)𝑃(cool|yes)𝑃(high|yes)𝑃(strong|yes) = 0.01058
𝑃(no)𝑃(overcast|no)𝑃(cool|no)𝑃(high|no)𝑃(strong|no) = 0

Since 𝑃(overcast|no) = 0, 𝑐𝑀𝐴𝑃 = yes for any new instance with overcast as Outlook

Page 36:

Naive Bayes Classifier

Estimating probabilities

We have estimated probabilities by the fraction of times the event is observed to occur over the total number of opportunities:

𝑃 = 𝑛𝑐 / 𝑛

The first difficulty: 𝑛𝑐 / 𝑛 is a biased underestimate of the real probability
Second, when the estimate is 0, the a posteriori probability for that class is always 0
Smoothing! (as discussed in the feature extraction lecture)

Page 37:

Naive Bayes Classifier

Estimating probabilities

We introduce fake counts
One possibility: the 𝑚-estimate of probability

𝑃 = (𝑛𝑐 + 𝑚𝑝) / (𝑛 + 𝑚)

𝑝 is a prior estimate of the probability we wish to determine
𝑚 is a constant called the equivalent sample size
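Written out as a tiny helper (my own sketch):

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m * p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# An event never observed in 5 trials, uniform prior p = 0.5, m = 2 fake counts
print(m_estimate(0, 5, 0.5, 2))   # 1/7 ~ 0.143 instead of 0
```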

Page 38:

Naive Bayes Classifier

Estimating probabilities

In the absence of any other information we will assume uniform priors
𝑝 = 1/𝑘, with 𝑘 being the number of distinct values of a feature
E.g. if a feature has two possible values, 𝑝 = 0.5
Using this prior we produce 𝑚 fake counts
E.g. if a feature has two possible values we can produce two fake counts:

𝑃 = (𝑛𝑐 + 1) / (𝑛 + 2)

Page 39:

Naive Bayes for Text Classification

Classifying text

Text classification is a standard area of classical information retrieval
A query (e.g. “Multicore CPU”) against a document collection divides the collection into two classes:
Documents about “Multicore CPU”
Documents not about “Multicore CPU”
A typical two-class classification problem

Page 40:

Naive Bayes for Text Classification

Classifying text

In a general case, a class is a subject, a topic or a category
Application examples of text classification include:

1 Automatic detection of spam documents or other types of offending content
2 Sentiment detection (positive or negative product reviews)
3 Personal email sorting
4 Topical or vertical search

There are several approaches to text classification, but again we concentrate on statistical text classification, in particular on Naive Bayes

Page 41:

Naive Bayes for Text Classification

Classifying text

We are given a description 𝑑 ∈ 𝕏 of a document and a fixed set of classes ℂ = {𝑐1, 𝑐2, …, 𝑐𝑚}
𝕏 is a document space and is typically some high dimensional space
The classes are in most cases human defined
We are given a training set 𝔻 of labeled documents ⟨𝑑, 𝑐⟩ ∈ 𝕏 × ℂ
For example: ⟨ Beijing joins the World Trade Organization, China ⟩

Page 42:

Naive Bayes for Text Classification

Naive Bayes text classification

The MAP approach applied to the set of possible classifications of a new document 𝑑

𝑐𝑀𝐴𝑃 = argmax_{𝑐𝑗∈ℂ} 𝑃(𝑐𝑗|𝑑)
      = argmax_{𝑐𝑗∈ℂ} 𝑃(𝑑|𝑐𝑗)𝑃(𝑐𝑗) / 𝑃(𝑑)
      = argmax_{𝑐𝑗∈ℂ} 𝑃(𝑑|𝑐𝑗)𝑃(𝑐𝑗)

Page 43:

Naive Bayes for Text Classification

Naive Bayes text classification

We can interpret the last equation as a description of the generative process
To generate a document we first select a class 𝑐 with probability 𝑃(𝑐)
In the second step we generate a document given 𝑐, corresponding to the conditional probability distribution 𝑃(𝑑|𝑐)
What does this conditional probability distribution look like?
First we need to make certain simplifying assumptions

Page 44:

Naive Bayes for Text Classification

Naive Bayes text classification

Similarly to the naive Bayes classifier, we base our models on the simplifying assumption that the attribute values are independent, or conditionally independent given the class

𝑃(𝑡1𝑡2𝑡3𝑡4) = 𝑃(𝑡1)𝑃(𝑡2)𝑃(𝑡3)𝑃(𝑡4)

This is the unigram language model
Under this model the order of words is irrelevant: “bag of words”
A generative model assigns the same probability to any “bag of words” regardless of their ordering

Page 45:

Naive Bayes for Text Classification

Naive Bayes text classification

Assuming the “bag of words” model we still need to select a particular conditional distribution over bags
The two most common approaches are:

1 Bernoulli distribution
2 Multinomial distribution

Page 46:

Naive Bayes for Text Classification

Bernoulli generative model

In this model we assume that the documents are generated from a multivariate Bernoulli distribution
The model generates an indicator for each term from the vocabulary
1 indicates the presence of the term in a document
0 indicates the absence of the term in a document
A different generation process for the documents leads to different parameter estimation and different classification rules

Page 47:

Naive Bayes for Text Classification

Bernoulli generative model

In particular, for each term in the vocabulary we perform a Bernoulli trial and its result decides if the term is present in a document or absent

𝑃(𝑑|𝑐) = ∏_{𝑡𝑖∈𝑑} 𝑝(𝑡𝑖|𝑐) ∏_{𝑡𝑖∉𝑑} (1 − 𝑝(𝑡𝑖|𝑐))

Page 48:

Naive Bayes for Text Classification

Inference of parameters in the Bernoulli model

We apply MLE again to get the estimates for 𝑝(𝑐) and 𝑝(𝑡𝑖|𝑐)

𝑝(𝑐) = 𝑛𝑐 / 𝑛

𝑝(𝑡𝑖|𝑐) = (∑_{𝑖,𝑗} 𝑒_{𝑡𝑖,𝑑𝑗}) / 𝑛𝑐

𝑛 is the total number of documents, 𝑛𝑐 is the number of documents in the class 𝑐
𝑒_{𝑡𝑖,𝑑𝑗} is the indicator of the presence of term 𝑡𝑖 in document 𝑑𝑗 belonging to the class 𝑐

Page 49:

Naive Bayes for Text Classification

Inference of parameters in the Bernoulli model

As before, because of zero counts we need to apply smoothing, e.g. 𝑚-estimate smoothing with 𝑝 = 0.5 and 𝑚 = 2

𝑝(𝑡𝑖|𝑐) = (∑_{𝑖,𝑗} 𝑒_{𝑡𝑖,𝑑𝑗} + 1) / (𝑛𝑐 + 2)

Page 50:

Naive Bayes for Text Classification

MAP in the Bernoulli model

With the estimated parameters and ignoring the irrelevant constants we obtain

𝑐𝑀𝐴𝑃 = argmax_{𝑐𝑗∈ℂ} 𝑃(𝑑|𝑐𝑗)𝑃(𝑐𝑗)
      = argmax_{𝑐𝑗∈ℂ} 𝑃(𝑐𝑗) ∏_{𝑡𝑖∈𝑑} 𝑝(𝑡𝑖|𝑐𝑗) ∏_{𝑡𝑖∉𝑑} (1 − 𝑝(𝑡𝑖|𝑐𝑗))   (1)

Interpretation of 𝑝(𝑡𝑖|𝑐) and (1 − 𝑝(𝑡𝑖|𝑐)): how much evidence the presence or absence of 𝑡𝑖 contributes that 𝑐 is the correct class
Implementation with logarithms! Underflow!

Page 51:

Naive Bayes for Text Classification

NB with Bernoulli model: Example

Document                    Class
Chinese Beijing Chinese     China
Chinese Chinese Shanghai    China
Chinese Macao               China
Tokyo Japan Chinese         Japan

What is the class of 𝑑 = ⟨ Chinese Chinese Chinese Tokyo Japan ⟩

Page 52:

Naive Bayes for Text Classification

NB with Bernoulli model: Example

Class   Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
China   3        1        1         1      0      0
Japan   1        0        0         0      1      1

P(t|China)   4/5   2/5   2/5   2/5   1/5   1/5
P(t|Japan)   2/3   1/3   1/3   1/3   2/3   2/3

Table: Frequencies and conditional probability of terms (𝑚-estimate smoothing)

Page 53:

Naive Bayes for Text Classification

NB with Bernoulli model: Example

The task is to classify the new instance: 𝑑 = ⟨ Chinese Chinese Chinese Tokyo Japan ⟩

𝑃(China)𝑃(Chinese|China)𝑃(Tokyo|China)𝑃(Japan|China) × (1 − 𝑃(Beijing|China))(1 − 𝑃(Shanghai|China))(1 − 𝑃(Macao|China)) = 0.00518

𝑃(Japan)𝑃(Chinese|Japan)𝑃(Tokyo|Japan)𝑃(Japan|Japan) × (1 − 𝑃(Beijing|Japan))(1 − 𝑃(Shanghai|Japan))(1 − 𝑃(Macao|Japan)) = 0.02194

𝑐𝑀𝐴𝑃 = Japan
“Japan” and “Tokyo” are good indicators for “Japan”, and since we do not count occurrences they outweigh the “Chinese” indicator
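A few lines of Python (my own check) reproduce these scores from the smoothed table above:

```python
p_china = {"Chinese": 4/5, "Beijing": 2/5, "Shanghai": 2/5, "Macao": 2/5, "Tokyo": 1/5, "Japan": 1/5}
p_japan = {"Chinese": 2/3, "Beijing": 1/3, "Shanghai": 1/3, "Macao": 1/3, "Tokyo": 2/3, "Japan": 2/3}
present = {"Chinese", "Tokyo", "Japan"}   # terms of d; multiple occurrences are ignored

def bernoulli_score(prior, probs):
    score = prior
    for term, p in probs.items():
        score *= p if term in present else (1 - p)
    return score

print(round(bernoulli_score(3/4, p_china), 5))   # 0.00518
print(round(bernoulli_score(1/4, p_japan), 5))   # 0.02195 (0.02194 on the slide, truncated)
```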

Page 54:

Naive Bayes for Text Classification

NB with Bernoulli model

From “Introduction to Information Retrieval” (online edition, © 2009 Cambridge UP), Section 13.3, The Bernoulli model, p. 263:

TrainBernoulliNB(C, D)
 1  V ← ExtractVocabulary(D)
 2  N ← CountDocs(D)
 3  for each c ∈ C
 4  do Nc ← CountDocsInClass(D, c)
 5     prior[c] ← Nc / N
 6     for each t ∈ V
 7     do Nct ← CountDocsInClassContainingTerm(D, c, t)
 8        condprob[t][c] ← (Nct + 1) / (Nc + 2)
 9  return V, prior, condprob

ApplyBernoulliNB(C, V, prior, condprob, d)
 1  Vd ← ExtractTermsFromDoc(V, d)
 2  for each c ∈ C
 3  do score[c] ← log prior[c]
 4     for each t ∈ V
 5     do if t ∈ Vd
 6        then score[c] += log condprob[t][c]
 7        else score[c] += log(1 − condprob[t][c])
 8  return argmax_{c∈C} score[c]

Figure 13.3: NB algorithm (Bernoulli model): training and testing. The add-one smoothing in line 8 (top) is in analogy to Equation (13.7) with B = 2.

There are two different ways we can set up an NB classifier. The model we introduced in the previous section is the multinomial model. It generates one term from the vocabulary in each position of the document, where we assume a generative model that will be discussed in more detail in Section 13.4 (see also page 237).

An alternative to the multinomial model is the multivariate Bernoulli model or Bernoulli model. It is equivalent to the binary independence model of Section 11.3 (page 222), which generates an indicator for each term of the vocabulary, either 1 indicating presence of the term in the document or 0 indicating absence. Figure 13.3 presents training and testing algorithms for the Bernoulli model. The Bernoulli model has the same time complexity as the multinomial model.

The different generation models imply different estimation strategies and different classification rules. The Bernoulli model estimates P̂(t|c) as the fraction of documents of class c that contain term t (Figure 13.3, TrainBernoulliNB).
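A compact Python rendering of the two procedures above (my own sketch, assuming documents are given as lists of tokens; not the book's reference implementation):

```python
import math
from collections import Counter

def train_bernoulli_nb(docs):
    """docs: list of (tokens, class) pairs. Returns vocabulary, priors and cond. probabilities."""
    vocab = {t for tokens, _ in docs for t in tokens}
    n_c = Counter(c for _, c in docs)
    prior = {c: n_c[c] / len(docs) for c in n_c}
    # doc_freq[(c, t)] = number of documents of class c that contain term t
    doc_freq = Counter((c, t) for tokens, c in docs for t in set(tokens))
    condprob = {c: {t: (doc_freq[(c, t)] + 1) / (n_c[c] + 2) for t in vocab} for c in n_c}
    return vocab, prior, condprob

def apply_bernoulli_nb(vocab, prior, condprob, tokens):
    present = set(tokens) & vocab
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in vocab:                      # absent terms contribute log(1 - p) as well
            p = condprob[c][t]
            score += math.log(p) if t in present else math.log(1 - p)
        scores[c] = score
    return max(scores, key=scores.get)

docs = [("Chinese Beijing Chinese".split(), "China"),
        ("Chinese Chinese Shanghai".split(), "China"),
        ("Chinese Macao".split(), "China"),
        ("Tokyo Japan Chinese".split(), "Japan")]
model = train_bernoulli_nb(docs)
print(apply_bernoulli_nb(*model, "Chinese Chinese Chinese Tokyo Japan".split()))  # Japan
```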

Page 56:

Naive Bayes for Text Classification

Multinomial random variable

The model is based on the multinomial distribution, which is a generalization of the binomial distribution
Each of 𝑛 trials leads to a success in one of 𝑘 categories, with each category having a fixed success probability
It holds that ∑_{𝑖=1}^{𝑘} 𝑝𝑖 = 1
A multinomial r.v. is given by the vector 𝑋 = (𝑋1, 𝑋2, …, 𝑋𝑘), where 𝑋𝑖 is the number of successes in the 𝑖th category
It holds that ∑_{𝑖=1}^{𝑘} 𝑋𝑖 = 𝑛

Page 57:

Naive Bayes for Text Classification

Multinomial random variable

PMF

𝑝(𝑥1, …, 𝑥𝑘; 𝑛, 𝑝1, …, 𝑝𝑘) = (𝑛! / (𝑥1! … 𝑥𝑘!)) 𝑝1^{𝑥1} … 𝑝𝑘^{𝑥𝑘}   if ∑_{𝑖=1}^{𝑘} 𝑥𝑖 = 𝑛, and 0 otherwise

With ∑_{𝑖=1}^{𝑘} 𝑝𝑖 = 1 and 𝑥𝑖 non-negative, ∀𝑖
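A tiny pure-Python illustration of this PMF (my own sketch):

```python
from math import factorial, prod

def multinomial_pmf(x, p):
    """PMF of a multinomial with n = sum(x) trials and success probabilities p."""
    coeff = factorial(sum(x)) // prod(factorial(k) for k in x)
    return coeff * prod(pi ** xi for pi, xi in zip(p, x))

# 5 rolls of a fair die: probability of observing the counts (2, 1, 1, 1, 0, 0)
print(multinomial_pmf((2, 1, 1, 1, 0, 0), [1/6] * 6))   # ~0.0077
```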

Page 58:

Naive Bayes for Text Classification

Multinomial generative model

In this model we assume that the documents are generated from a multinomial distribution, i.e. the number of occurrences of the terms in a document is a multinomial r.v.

𝑃(𝑑|𝑐) = (𝐿𝑑! / (𝑡𝑓_{𝑡1,𝑑}! … 𝑡𝑓_{𝑡𝑀,𝑑}!)) 𝑝(𝑡1|𝑐)^{𝑡𝑓_{𝑡1,𝑑}} … 𝑝(𝑡𝑀|𝑐)^{𝑡𝑓_{𝑡𝑀,𝑑}}   if ∑_{𝑖=1}^{𝑀} 𝑡𝑓_{𝑡𝑖,𝑑} = 𝐿𝑑, and 0 otherwise

𝑡𝑓_{𝑡𝑖,𝑑} is the term frequency of the term 𝑡𝑖 in document 𝑑
𝐿𝑑 is the length of the document, i.e. 𝐿𝑑 = ∑_{𝑖=1}^{𝑀} 𝑡𝑓_{𝑡𝑖,𝑑}
𝑀 is the size of the document vocabulary

Page 59:

Naive Bayes for Text Classification

Inference of parameters in the multinomial model

The likelihood of all documents belonging to a class is the product of their multinomial probabilities
Simple calculus (setting the derivative of the log-likelihood to 0) gives the maximum likelihood estimates for 𝑝(𝑐) and 𝑝(𝑡𝑖|𝑐)

𝑝(𝑐) = 𝑛𝑐 / 𝑛

𝑝(𝑡𝑖|𝑐) = 𝑡𝑓_{𝑡𝑖,𝑐} / ∑_{𝑖=1}^{𝑀} 𝑡𝑓_{𝑡𝑖,𝑐}

𝑛 is the total number of documents, 𝑛𝑐 is the number of documents in the class 𝑐
𝑡𝑓_{𝑡𝑖,𝑐} is the term frequency of term 𝑡𝑖 in all documents belonging to the class 𝑐

Page 60:

Naive Bayes for Text Classification

Inference of parameters in the multinomial model

As before, because of zero counts we need to apply smoothing, e.g. Laplace smoothing

𝑝(𝑡𝑖|𝑐) = (𝑡𝑓_{𝑡𝑖,𝑐} + 1) / ∑_{𝑖=1}^{𝑀} (𝑡𝑓_{𝑡𝑖,𝑐} + 1)

Page 61:

Naive Bayes for Text Classification

MAP in the multinomial model

With the estimated parameters and ignoring the irrelevant constants we obtain

𝑐𝑀𝐴𝑃 = argmax_{𝑐𝑗∈ℂ} 𝑃(𝑑|𝑐𝑗)𝑃(𝑐𝑗)
      = argmax_{𝑐𝑗∈ℂ} 𝑃(𝑐𝑗) ∏_{𝑖=1}^{𝐿𝑑} 𝑝(𝑡𝑖|𝑐𝑗)   (2)

Interpretation of 𝑝(𝑡𝑖|𝑐): how much evidence 𝑡𝑖 contributes that 𝑐 is the correct class
Implementation with logarithms! Underflow!

Page 62:

Naive Bayes for Text Classification

NB with multinomial model: Example

Document                    Class
Chinese Beijing Chinese     China
Chinese Chinese Shanghai    China
Chinese Macao               China
Tokyo Japan Chinese         Japan

What is the class of 𝑑 = ⟨ Chinese Chinese Chinese Tokyo Japan ⟩

Page 63:

Naive Bayes for Text Classification

NB with multinomial model: Example

Class   Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
China   5        1        1         1      0      0
Japan   1        0        0         0      1      1

P(t|China)   6/14   2/14   2/14   2/14   1/14   1/14
P(t|Japan)   2/9    1/9    1/9    1/9    2/9    2/9

Table: Frequencies and conditional probability of terms (Laplace smoothing)

Page 64:

Naive Bayes for Text Classification

NB with multinomial model: Example

The task is to classify the new instance: 𝑑 = ⟨ Chinese Chinese Chinese Tokyo Japan ⟩

𝑃(China)𝑃(Chinese|China)³𝑃(Tokyo|China)𝑃(Japan|China) = 0.0003
𝑃(Japan)𝑃(Chinese|Japan)³𝑃(Tokyo|Japan)𝑃(Japan|Japan) = 0.0001

𝑐𝑀𝐴𝑃 = China
Three occurrences of the positive indicator outweigh two occurrences of the negative indicator
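These scores can again be checked with the smoothed probabilities from the previous table (my own check):

```python
# Unnormalized multinomial scores for d = <Chinese Chinese Chinese Tokyo Japan>
score_china = 3/4 * (6/14)**3 * (1/14) * (1/14)
score_japan = 1/4 * (2/9)**3  * (2/9)  * (2/9)

print(round(score_china, 4), round(score_japan, 4))        # 0.0003 0.0001
print("China" if score_china > score_japan else "Japan")   # China
```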

Page 65:

Naive Bayes for Text Classification

NB with multinomial model

From “Introduction to Information Retrieval” (online edition, © 2009 Cambridge UP), Chapter 13, Text classification and Naive Bayes, p. 260:

TrainMultinomialNB(C, D)
 1  V ← ExtractVocabulary(D)
 2  N ← CountDocs(D)
 3  for each c ∈ C
 4  do Nc ← CountDocsInClass(D, c)
 5     prior[c] ← Nc / N
 6     text_c ← ConcatenateTextOfAllDocsInClass(D, c)
 7     for each t ∈ V
 8     do Tct ← CountTokensOfTerm(text_c, t)
 9     for each t ∈ V
10     do condprob[t][c] ← (Tct + 1) / ∑_{t′} (Tct′ + 1)
11  return V, prior, condprob

ApplyMultinomialNB(C, V, prior, condprob, d)
 1  W ← ExtractTokensFromDoc(V, d)
 2  for each c ∈ C
 3  do score[c] ← log prior[c]
 4     for each t ∈ W
 5     do score[c] += log condprob[t][c]
 6  return argmax_{c∈C} score[c]

Figure 13.2: Naive Bayes algorithm (multinomial model): training and testing.

… assign a high probability to the UK class because the term Britain occurs. The problem is that the zero probability for WTO cannot be “conditioned away,” no matter how strong the evidence for the class UK from other features. The estimate is 0 because of sparseness: the training data are never large enough to represent the frequency of rare events adequately, for example, the frequency of WTO occurring in UK documents.

To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count (cf. Section 11.3.2):

P̂(t|c) = (Tct + 1) / ∑_{t′∈V} (Tct′ + 1) = (Tct + 1) / ((∑_{t′∈V} Tct′) + B)   (13.7)

where B = |V| is the number of terms in the vocabulary. Add-one smoothing can be interpreted as a uniform prior (each term occurs once for each class) that is then updated as evidence from the training data comes in. Note that this is a prior probability for the occurrence of a term, as opposed to the prior probability of a class which we estimate in Equation (13.5) on the document level.
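A compact Python rendering of Figure 13.2 (my own sketch, again assuming tokenized documents; not the book's reference implementation):

```python
import math
from collections import Counter

def train_multinomial_nb(docs):
    """docs: list of (tokens, class) pairs. Returns vocabulary, priors and cond. probabilities."""
    vocab = {t for tokens, _ in docs for t in tokens}
    n_c = Counter(c for _, c in docs)
    prior = {c: n_c[c] / len(docs) for c in n_c}
    condprob = {}
    for c in n_c:
        text_c = [t for tokens, cls in docs if cls == c for t in tokens]   # concatenate class text
        tf = Counter(text_c)
        denom = sum(tf[t] + 1 for t in vocab)                              # add-one smoothing
        condprob[c] = {t: (tf[t] + 1) / denom for t in vocab}
    return vocab, prior, condprob

def apply_multinomial_nb(vocab, prior, condprob, tokens):
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in tokens:
            if t in vocab:                     # unknown terms are simply skipped
                score += math.log(condprob[c][t])
        scores[c] = score
    return max(scores, key=scores.get)

docs = [("Chinese Beijing Chinese".split(), "China"),
        ("Chinese Chinese Shanghai".split(), "China"),
        ("Chinese Macao".split(), "China"),
        ("Tokyo Japan Chinese".split(), "Japan")]
model = train_multinomial_nb(docs)
print(apply_multinomial_nb(*model, "Chinese Chinese Chinese Tokyo Japan".split()))  # China
```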

Page 67:

Naive Bayes for Text Classification

NB with multinomial model: Complexity

Estimation of the parameters is 𝑂(|ℂ||𝑉|)
The product of the number of classes and the size of the vocabulary
This is linear in the size of the vocabulary
Preprocessing is also linear
Applying the classifier is also linear in the size of the document
Altogether linear in the time it takes to scan the data: very efficient!

Page 68:

Naive Bayes for Text Classification

Multinomial vs. Bernoulli model

The Bernoulli model estimates 𝑝(𝑡𝑖|𝑐) as the fraction of documents of class 𝑐 that contain term 𝑡
In contrast, the multinomial model estimates 𝑝(𝑡𝑖|𝑐) as the term frequency of term 𝑡 in the documents of class 𝑐 relative to the total number of term occurrences in the documents of class 𝑐
When classifying a document the Bernoulli model uses binary occurrence information, and the multinomial model keeps track of multiple occurrences

Page 69:

Naive Bayes for Text Classification

Multinomial vs. Bernoulli model

As a result, the Bernoulli model tends to make more errors when classifying long documents
E.g. it might assign a complete book to the “China” class because of a single occurrence of the term “Chinese”
Non-occurring terms do not affect the classification decision in the multinomial model, whereas in the Bernoulli model they do
Bernoulli works better with fewer features
The complexity is the same, and both models work very well and accurately in practice

Page 70:

Naive Bayes for Text Classification

NB models

The independence assumption is oversimplified
The “bag of words” model ignores the positions of the words and their context
E.g. the terms in “Hong Kong” are not independent of each other, nor are their positions
The Bernoulli model does not even take the term frequencies into account
This leads to very bad estimates of the true document probabilities

Page 71:

Naive Bayes for Text Classification

NB models

However, NB classification decisions are surprisingly good
Suppose the following true document probabilities: 𝑃(𝑐1|𝑑) = 0.6 and 𝑃(𝑐2|𝑑) = 0.4
Assume that 𝑑 contains many positive indicators for 𝑐1 and many negative indicators for 𝑐2
Then the NB estimation might be: P̂(𝑐1|𝑑) = 0.00099 and P̂(𝑐2|𝑑) = 0.00001

This is common: the winning class in NB usually has a much larger probability than the others
In classification we care only about the winning class

Page 72:

Naive Bayes for Text Classification

NB performance

[Figure 13.8 from “Introduction to Information Retrieval” (online edition, © 2009 Cambridge UP, p. 275): effect of feature set size on accuracy (F1 measure vs. number of features selected) for the multinomial and Bernoulli models, with multinomial feature selection by MI, chi-square and frequency, and binomial (Bernoulli) selection by MI.]

… the end when we use all features. The reason is that the multinomial takes the number of occurrences into account in parameter estimation and classification and therefore better exploits a larger number of features than the Bernoulli model. Regardless of the differences between the two methods, using a carefully selected subset of the features results in better effectiveness than using all features.

13.5.2 χ² Feature selection
Another popular feature selection method is χ². In statistics, the χ² test is applied to test the independence of two events, where two events A and B are defined to be independent if P(AB) = P(A)P(B) or, equivalently, P(A|B) = P(A) and P(B|A) = P(B). In feature selection, the two events are occurrence of the term and occurrence of the class. We then rank terms with respect to the following quantity:

X²(D, t, c) = ∑_{e_t∈{0,1}} ∑_{e_c∈{0,1}} (N_{e_t e_c} − E_{e_t e_c})² / E_{e_t e_c}   (13.18)

Page 73:

Naive Bayes for Text Classification

NB performance

Table 13.8 (Macro- and microaveraging): “Truth” is the true class and “call” the decision of the classifier. In this example, macroaveraged precision is [10/(10 + 10) + 90/(10 + 90)]/2 = (0.5 + 0.9)/2 = 0.7. Microaveraged precision is 100/(100 + 20) ≈ 0.83.

class 1        truth: yes   truth: no
call: yes      10           10
call: no       10           970

class 2        truth: yes   truth: no
call: yes      90           10
call: no       10           890

pooled table   truth: yes   truth: no
call: yes      100          20
call: no       20           1860

Table 13.9 (Text classification effectiveness numbers on Reuters-21578 for F1, in percent). Results from Li and Yang (2003) (a), Joachims (1998) (b: kNN) and Dumais et al. (1998) (b: NB, Rocchio, trees, SVM).

(a)                        NB   Rocchio  kNN   SVM
micro-avg-L (90 classes)   80   85       86    89
macro-avg (90 classes)     47   59       60    60

(b)                        NB   Rocchio  kNN   trees  SVM
earn                       96   93       97    98     98
acq                        88   65       92    90     94
money-fx                   57   47       78    66     75
grain                      79   68       82    85     95
crude                      80   70       86    85     89
trade                      64   65       77    73     76
interest                   65   63       74    67     78
ship                       85   49       79    74     86
wheat                      70   69       77    93     92
corn                       65   48       78    92     90
micro-avg (top 10)         82   65       82    88     92
micro-avg-D (118 classes)  75   62       n/a   n/a    87

… Rocchio and kNN. In addition, we give numbers for decision trees, an important classification method we do not cover. The bottom part of the table shows that there is considerable variation from class to class. For instance, NB beats kNN on ship, but is much worse on money-fx.
Comparing parts (a) and (b) of the table, one is struck by the degree to which the cited papers' results differ. This is partly due to the fact that the numbers in (b) are break-even scores (cf. page 161) averaged over 118 classes, whereas the numbers in (a) are true F1 scores (computed without any know…)

Page 74:

Naive Bayes for Text Classification

NB performance

SVM seems to be better but it costs more to train
kNN is better for some classes
In theory: SVM better than kNN better than NB
When the data is nice!
In practice, with errors in the data, data drifting, etc., it is very difficult to beat NB
