
Knowledge Discovery and Data Mining 1 (VO) (706.701)
Classification

Denis Helic

ISDS, TU Graz

January 10, 2019


Big picture: KDDM

[Diagram: the Knowledge Discovery Process — Preprocessing, Transformation, Data Mining and the resulting Model — supported by mathematical tools (Linear Algebra, Probability Theory, Information Theory, Statistical Inference) and by infrastructure (Hardware & Programming).]


Outline

1 Introduction

2 Bayes Rule for Classification

3 Naive Bayes Classifier

4 Naive Bayes for Text Classification

Slides
Slides are partially based on the "Machine Learning" course from the University of Washington by Pedro Domingos, the "Machine Learning" book by Tom Mitchell, and "Introduction to Information Retrieval" by Manning, Raghavan and Schütze


Introduction

Classification as a Machine Learning task

Classification problems: classify an example into a given set of categories
Typically we have a labeled set of training examples
We apply an algorithm to learn from the training set and devise a classification rule
We apply the rule to classify new examples


Introduction

Types of learning

Supervised (inductive) learning: training data includes the desired outputs
Unsupervised learning: training data does not include the desired outputs
Semi-supervised learning: training data includes a few desired outputs
Reinforcement learning: rewards from a sequence of actions


Introduction

Supervised learning

Given examples of a function (x, f(x))
Predict function f(x) for new examples x
Depending on f(x) we have:

1. Classification if f(x) is a discrete function
2. Regression if f(x) is a continuous function
3. Probability estimation if f(x) is a probability distribution


Introduction

Classification

Given: training examples (x, f(x)) for some unknown discrete function f
Find: a good approximation for f
x is data, e.g. a vector, text documents, emails, words in text, etc.


Introduction

Classification: example applications

Credit risk assessment:
x: properties of customers and the proposed purchase
f(x): approve the purchase or not

Disease diagnosis:
x: properties of patients, lab tests
f(x): disease


Introduction

Classification: example applications

Email spam filter:
x: email text (or features extracted from emails)
f(x): spam or not spam

Text categorization:
x: documents or words (features) from a document
f(x): thematic category


Introduction

Terminology

Training example: an example of the form (x, f(x))
Target function: the true function f
Hypothesis: a proposed function h believed to be similar to f
Classifier: a discrete-valued function; the possible values f(x) ∈ {1, 2, …, K} are called classes or labels
Hypothesis space: the space of all possible hypotheses


Introduction

Classification approaches

Logistic regression
Decision trees
Support vector machines
Nearest neighbor algorithms
Neural networks
Statistical inference such as Bayesian learning (e.g. text categorization)


Bayes Rule for Classification

Application of Bayes Rule for classification

We are given some hypothesis space H and the observed data D, and some prior probabilities of the various hypotheses
We are interested in determining the best hypothesis from H
The best means the most probable given the data plus the prior
In other words, we are interested in the posterior probabilities of the various hypotheses given the data and their prior probabilities


Bayes Rule for Classification

Application of Bayes Rule for classification

Bayes Theorem
For some given joint probability distribution P(D, h) we have the prior probability of h given as P(h), the probability of observing the data P(D) regardless of which hypothesis holds, and the conditional probability of the data given a hypothesis, P(D|h). The posterior probability of a hypothesis h is given by:

P(h|D) = P(D|h) P(h) / P(D)


Bayes Rule for Classification

Application of Bayes Rule for classification

๐‘ƒ(โ„Ž|๐ท) increases with ๐‘ƒ(โ„Ž) and with ๐‘ƒ(๐ท|โ„Ž)Also, ๐‘ƒ(โ„Ž|๐ท) decreases as ๐‘ƒ(๐ท) increases because the more probableit is that ๐ท will be observed independent of โ„Ž, the less evidence ๐ทprovides in support of โ„ŽIn many learning scenarios the learner considers some set of candidatehypotheses ๐ป and selects the most probable hypothesis from this setSuch maximally probably hypotheses is called maximum a posteriori(MAP) hypothesis


Bayes Rule for Classification

Maximum a posteriori (MAP) hypothesis

MAP

h_MAP ≡ argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h) P(h) / P(D)
      = argmax_{h ∈ H} P(D|h) P(h)

In the last step we dropped P(D) because it is a constant independent of h


Bayes Rule for Classification

Maximum likelihood (ML) hypothesis

In some cases we can assume that every hypothesis is equally probable a priori
The MAP equation then simplifies to:

ML

h_ML = argmax_{h ∈ H} P(D|h)

P(D|h) is the likelihood of the data given a hypothesis
This is then the maximum likelihood hypothesis


Bayes Rule for Classification

MAP: example

Medical diagnosis
Consider a medical diagnosis problem in which there are two alternative hypotheses:

1. The patient has a particular form of cancer
2. The patient does not have cancer

The available data is from a particular laboratory test with two possible outcomes: P (positive) and N (negative). We have prior knowledge that over the entire population only 0.008 of all people have this disease. Furthermore, the test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. What is h_MAP?


Bayes Rule for Classification

MAP: example

๐‘ƒ(๐‘๐‘Ž๐‘›๐‘๐‘’๐‘Ÿ) = 0.008๐‘ƒ(๐‘๐‘Ž๐‘›๐‘๐‘’๐‘Ÿ๐‘) = 0.992๐‘ƒ(๐‘ƒ |๐‘๐‘Ž๐‘›๐‘๐‘’๐‘Ÿ) = 0.98๐‘ƒ(๐‘|๐‘๐‘Ž๐‘›๐‘๐‘’๐‘Ÿ) = 0.02๐‘ƒ(๐‘ƒ |๐‘๐‘Ž๐‘›๐‘๐‘’๐‘Ÿ๐‘) = 0.03๐‘ƒ(๐‘|๐‘๐‘Ž๐‘›๐‘๐‘’๐‘Ÿ๐‘) = 0.97โ„Ž๐‘€๐ด๐‘ƒ = ?


Bayes Rule for Classification

MAP: example

If the test is positive:

P(P|cancer) P(cancer) = 0.00784
P(P|¬cancer) P(¬cancer) = 0.02976

h_MAP = ¬cancer

If the test is negative:

P(N|cancer) P(cancer) = 0.00016
P(N|¬cancer) P(¬cancer) = 0.96224

h_MAP = ¬cancer


Naive Bayes Classifier

Bayes classifier

We considered the question: "What is the most probable hypothesis given the training data?"
We can also ask: "What is the most probable classification of the new instance given the training data?"
In other words, we can apply Bayes theorem to the classification problem
The most well-known example of such an approach is the Naive Bayes Classifier


Naive Bayes Classifier

Naive Bayes Classifier

The Naive Bayes approach is based on the MAP approach applied to the set of possible classifications of new data instances
Given a data instance as a tuple of values x = ⟨x_1, x_2, …, x_n⟩ and a finite set of classes C, c_MAP is given by:

c_MAP = argmax_{c_j ∈ C} P(c_j | x_1, x_2, …, x_n)
      = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) P(c_j) / P(x_1, x_2, …, x_n)
      = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) P(c_j)


Naive Bayes Classifier

Naive Bayes Classifier

We should estimate the two terms from the previous equation based on the training data
P(c_j) is easy: we just count the frequency with which each class occurs in the training data
Estimating P(x_1, x_2, …, x_n | c_j) is difficult unless we have a very, very large dataset
The number of those terms is equal to the number of possible instances times the number of classes
We need to see every instance in the instance space to obtain reliable estimates


Naive Bayes Classifier

Naive Bayes Classifier

Using the chain rule we can rewrite P(x_1, x_2, …, x_n | c_j) as:

P(x_1, …, x_n | c_j) = P(x_1|c_j) P(x_2|x_1, c_j) P(x_3|x_1, x_2, c_j) … P(x_n|x_1, …, x_{n-1}, c_j)


Naive Bayes Classifier

Naive Bayes Classifier

The naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the class

P(x, y|c) = P(x|c) P(y|c)
P(x|y, c) = P(x|c),  ∀ x, y, c

P(x_1, …, x_n | c_j) = P(x_1|c_j) P(x_2|c_j) P(x_3|c_j) … P(x_n|c_j) = ∏_i P(x_i|c_j)


Naive Bayes Classifier

Naive Bayes Classifier

The c_MAP is then given by:

c_MAP = argmax_{c_j ∈ C} P(c_j) ∏_i P(x_i|c_j)

Estimating P(x_i|c_j) from the training data is much easier than estimating P(x_1, x_2, …, x_n | c_j)
The number of parameters to estimate is the number of distinct attribute values times the number of distinct classes
Estimation of a single parameter is typically easy: e.g. based on the frequency of its occurrence in the training data


Naive Bayes Classifier

Naive Bayes Classifier: Example

PlayTennis problem from the book "Machine Learning" by Tom Mitchell
Can we play tennis on a given day with given attributes?
It is a classification problem with two classes: "yes" and "no"
The classification is based on the values of attributes such as temperature, humidity, etc.
We have 14 training examples


Naive Bayes Classifier

Naive Bayes Classifier: example

Day   Outlook    Temperature  Humidity  Wind    PlayTennis
D1    sunny      hot          high      weak    no
D2    sunny      hot          high      strong  no
D3    overcast   hot          high      weak    yes
D4    rain       mild         high      weak    yes
D5    rain       cool         normal    weak    yes
D6    rain       cool         normal    strong  no
D7    overcast   cool         normal    strong  yes
D8    sunny      mild         high      weak    no
D9    sunny      cool         normal    weak    yes
D10   rain       mild         normal    weak    yes
D11   sunny      mild         normal    strong  yes
D12   overcast   mild         high      strong  yes
D13   overcast   hot          normal    weak    yes
D14   rain       mild         high      strong  no


Naive Bayes Classifier

Naive Bayes Classifier: Example

The task is to classify the new instance ⟨sunny, cool, high, strong⟩
The class is then given by:

c_MAP = argmax_{c_j ∈ C} P(c_j) ∏_i P(x_i|c_j)
      = argmax_{c_j ∈ C} P(c_j) P(sunny|c_j) P(cool|c_j) P(high|c_j) P(strong|c_j)


Naive Bayes Classifier

Naive Bayes Classifier: example

PlayTennis
Outlook   sunny  overcast  rain  n
yes       2      4         3     9
no        3      0         2     5

P
Outlook          sunny  overcast  rain
P(Outlook|yes)   2/9    4/9       3/9
P(Outlook|no)    3/5    0         2/5

Table: Frequencies and conditional probabilities of the "Outlook" feature


Naive Bayes Classifier

Naive Bayes Classifier: example

PlayTennis
Temperature  hot  mild  cool  n
yes          2    4     3     9
no           2    2     1     5

P
Temperature          hot  mild  cool
P(Temperature|yes)   2/9  4/9   3/9
P(Temperature|no)    2/5  2/5   1/5

Table: Frequencies and conditional probabilities of the "Temperature" feature


Naive Bayes Classifier

Naive Bayes Classifier: example

PlayTennis
Humidity  high  normal  n
yes       3     6       9
no        4     1       5

P
Humidity          high  normal
P(Humidity|yes)   3/9   6/9
P(Humidity|no)    4/5   1/5

Table: Frequencies and conditional probabilities of the "Humidity" feature


Naive Bayes Classifier

Naive Bayes Classifier: example

PlayTennis
Wind  weak  strong  n
yes   6     3       9
no    2     3       5

P
Wind          weak  strong
P(Wind|yes)   6/9   3/9
P(Wind|no)    2/5   3/5

Table: Frequencies and conditional probabilities of the "Wind" feature


Naive Bayes Classifier

Naive Bayes Classifier: Example

We calculate the unnormalized a posteriori probabilities for the classes:

P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.00529
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.02057

c_MAP = no


Naive Bayes Classifier

Naive Bayes Classifier: Example

Remarks
These are not the probabilities
To obtain the conditional probabilities we need to normalize with the sum over the classes:

0.02057 / (0.00529 + 0.02057) ≈ 0.795

In a real example the numbers get very small: danger of underflow
Use logarithms!
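To make the whole pipeline concrete (counting, conditional probabilities, log-space scoring to avoid underflow), here is a small Python sketch of the PlayTennis example. The data is the table above; the function and variable names are ours, and no smoothing is applied, exactly as on the slides:

```python
from collections import Counter, defaultdict
from math import log, exp

# The 14 PlayTennis examples: (Outlook, Temperature, Humidity, Wind) -> label
data = [
    (("sunny", "hot", "high", "weak"), "no"),
    (("sunny", "hot", "high", "strong"), "no"),
    (("overcast", "hot", "high", "weak"), "yes"),
    (("rain", "mild", "high", "weak"), "yes"),
    (("rain", "cool", "normal", "weak"), "yes"),
    (("rain", "cool", "normal", "strong"), "no"),
    (("overcast", "cool", "normal", "strong"), "yes"),
    (("sunny", "mild", "high", "weak"), "no"),
    (("sunny", "cool", "normal", "weak"), "yes"),
    (("rain", "mild", "normal", "weak"), "yes"),
    (("sunny", "mild", "normal", "strong"), "yes"),
    (("overcast", "mild", "high", "strong"), "yes"),
    (("overcast", "hot", "normal", "weak"), "yes"),
    (("rain", "mild", "high", "strong"), "no"),
]

class_counts = Counter(label for _, label in data)
# value_counts[label][feature_index][value] = number of training examples
value_counts = defaultdict(lambda: defaultdict(Counter))
for x, label in data:
    for i, v in enumerate(x):
        value_counts[label][i][v] += 1

def log_score(x, c):
    """log P(c) + sum_i log P(x_i|c), with raw frequency estimates (no smoothing)."""
    s = log(class_counts[c] / len(data))
    for i, v in enumerate(x):
        p = value_counts[c][i][v] / class_counts[c]
        s += log(p) if p > 0 else float("-inf")  # a zero count eliminates the class
    return s

new = ("sunny", "cool", "high", "strong")
scores = {c: log_score(new, c) for c in class_counts}
print({c: round(exp(s), 5) for c, s in scores.items()})  # {'no': 0.02057, 'yes': 0.00529}
print("c_MAP =", max(scores, key=scores.get))            # c_MAP = no
```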


Naive Bayes Classifier

Naive Bayes Classifier: Example

The task is to classify the new instance ⟨overcast, cool, high, strong⟩

P(yes) P(overcast|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.01058
P(no) P(overcast|no) P(cool|no) P(high|no) P(strong|no) = 0

P(overcast|no) = 0
c_MAP = yes for any new instance with overcast as Outlook


Naive Bayes Classifier

Estimating probabilities

We have estimated probabilities by the fraction of times the event is observed to occur over the total number of opportunities:

P = n_c / n

The first difficulty: n_c / n is a biased underestimate of the real probability
Second, when the estimate is 0, the a posteriori probability for that class is always 0
Smoothing! (as discussed in the feature extraction lecture)


Naive Bayes Classifier

Estimating probabilities

We introduce fake counts
One possibility: the m-estimate of probability

P = (n_c + m·p) / (n + m)

p is a prior estimate of the probability we wish to determine
m is a constant called the equivalent sample size


Naive Bayes Classifier

Estimating probabilities

In the absence of any other information we will assume a uniform prior
p = 1/k, with k being the number of distinct values of a feature
E.g. if a feature has two possible values, p = 0.5
Using this prior we produce m fake counts
E.g. if a feature has two possible values we can produce two fake counts:

P = (n_c + 1) / (n + 2)


Naive Bayes for Text Classification

Classifying text

Text classification is a standard area of classical information retrieval
A query (e.g. "Multicore CPU") against a document collection divides the collection into two classes:
Documents about "Multicore CPU"
Documents not about "Multicore CPU"
A typical two-class classification problem


Naive Bayes for Text Classification

Classifying text

In the general case, a class is a subject, a topic or a category
Application examples of text classification include:

1. Automatic detection of spam documents or other types of offending content
2. Sentiment detection (positive or negative product reviews)
3. Personal email sorting
4. Topical or vertical search

There are several approaches to text classification, but again we concentrate on statistical text classification, in particular on Naive Bayes


Naive Bayes for Text Classification

Classifying text

We are given a description d ∈ X of a document and a fixed set of classes C = {c_1, c_2, …, c_m}
X is the document space and is typically some high-dimensional space
The classes are in most cases human-defined
We are given a training set D of labeled documents ⟨d, c⟩ ∈ X × C
For example: ⟨ Beijing joins the World Trade Organization, China ⟩


Naive Bayes for Text Classification

Naive Bayes text classification

The MAP approach applied to the set of possible classifications of a new document d:

c_MAP = argmax_{c_j ∈ C} P(c_j|d)
      = argmax_{c_j ∈ C} P(d|c_j) P(c_j) / P(d)
      = argmax_{c_j ∈ C} P(d|c_j) P(c_j)


Naive Bayes for Text Classification

Naive Bayes text classification

We can interpret the last equation as a description of a generative process
To generate a document we first select a class c with probability P(c)
In the second step we generate a document given c, corresponding to the conditional probability distribution P(d|c)
What does this conditional probability distribution look like?
First we need to make certain simplifying assumptions


Naive Bayes for Text Classification

Naive Bayes text classification

Similarly to the naive Bayes classifier, we will base our models on the simplifying assumption that the attribute values are independent, or conditionally independent given the class

P(t_1 t_2 t_3 t_4) = P(t_1) P(t_2) P(t_3) P(t_4)

This is the unigram language model
Under this model the order of words is irrelevant: "bag of words"
A generative model will assign the same probability to any "bag of words" regardless of the ordering


Naive Bayes for Text Classification

Naive Bayes text classification

Assuming the "bag of words" model we still need to select a particular conditional distribution over bags
The two most common approaches are:

1. Bernoulli distribution
2. Multinomial distribution


Naive Bayes for Text Classification

Bernoulli generative model

In this model we assume that the documents are generated from a multivariate Bernoulli distribution
The model generates an indicator for each term from the vocabulary:
1 indicates the presence of the term in a document
0 indicates the absence of the term in a document
A different generation of the documents leads to different parameter estimation and different classification rules


Naive Bayes for Text Classification

Bernoulli generative model

In particular, for each term in the vocabulary we perform a Bernoulli trial and its result decides whether the term is present in a document or absent

P(d|c) = ∏_{t_i ∈ d} p(t_i|c) · ∏_{t_i ∉ d} (1 − p(t_i|c))


Naive Bayes for Text Classification

Inference of parameters in the Bernoulli model

We apply MLE again to get the estimates for p(c) and p(t_i|c)

p(c) = n_c / n

p(t_i|c) = (∑_j e_{t_i,d_j}) / n_c

n is the total number of documents, n_c is the number of documents in class c
e_{t_i,d_j} is the indicator of the presence of term t_i in document d_j belonging to class c (the sum runs over the documents of class c)


Naive Bayes for Text Classification

Inference of parameters in the Bernoulli model

As before, because of zero counts we need to apply smoothing, e.g. m-estimate smoothing with p = 0.5 and m = 2

p(t_i|c) = (∑_j e_{t_i,d_j} + 1) / (n_c + 2)


Naive Bayes for Text Classification

MAP in the Bernoulli model

With the estimated parameters and ignoring the irrelevant constants we obtain

c_MAP = argmax_{c_j ∈ C} P(d|c_j) P(c_j)
      = argmax_{c_j ∈ C} P(c_j) ∏_{t_i ∈ d} p(t_i|c_j) ∏_{t_i ∉ d} (1 − p(t_i|c_j))    (1)

Interpretation of p(t_i|c) and (1 − p(t_i|c)): how much evidence the presence or absence of t_i contributes to c being the correct class
Implementation with logarithms! Underflow!


Naive Bayes for Text Classification

NB with Bernoulli model: Example

Document                    Class
Chinese Beijing Chinese     China
Chinese Chinese Shanghai    China
Chinese Macao               China
Tokyo Japan Chinese         Japan

What is the class of d = ⟨ Chinese Chinese Chinese Tokyo Japan ⟩?


Naive Bayes for Text Classification

NB with Bernoulli model: Example

Class
Term        Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
China       3        1        1         1      0      0
Japan       1        0        0         0      1      1

P
Term          Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
P(t|China)    4/5      2/5      2/5       2/5    1/5    1/5
P(t|Japan)    2/3      1/3      1/3       1/3    2/3    2/3

Table: Frequencies and conditional probabilities of terms (m-estimate smoothing)


Naive Bayes for Text Classification

NB with Bernoulli model: Example

The task is to classify the new instance d = ⟨ Chinese Chinese Chinese Tokyo Japan ⟩

P(China) P(Chinese|China) P(Tokyo|China) P(Japan|China)
  × (1 − P(Beijing|China)) (1 − P(Shanghai|China)) (1 − P(Macao|China)) = 0.00518

P(Japan) P(Chinese|Japan) P(Tokyo|Japan) P(Japan|Japan)
  × (1 − P(Beijing|Japan)) (1 − P(Shanghai|Japan)) (1 − P(Macao|Japan)) = 0.02194

c_MAP = Japan
"Japan" and "Tokyo" are good indicators for "Japan", and since we do not count occurrences they outweigh the "Chinese" indicator


Naive Bayes for Text Classification

NB with Bernoulli model

TrainBernoulliNB(C, D)
 1  V ← ExtractVocabulary(D)
 2  N ← CountDocs(D)
 3  for each c ∈ C
 4  do N_c ← CountDocsInClass(D, c)
 5     prior[c] ← N_c / N
 6     for each t ∈ V
 7     do N_ct ← CountDocsInClassContainingTerm(D, c, t)
 8        condprob[t][c] ← (N_ct + 1) / (N_c + 2)
 9  return V, prior, condprob

ApplyBernoulliNB(C, V, prior, condprob, d)
 1  V_d ← ExtractTermsFromDoc(V, d)
 2  for each c ∈ C
 3  do score[c] ← log prior[c]
 4     for each t ∈ V
 5     do if t ∈ V_d
 6        then score[c] += log condprob[t][c]
 7        else score[c] += log(1 − condprob[t][c])
 8  return argmax_{c ∈ C} score[c]

Figure 13.3 (from "Introduction to Information Retrieval", online edition 2009): NB algorithm (Bernoulli model), training and testing. The add-one smoothing in line 8 (top) is in analogy to Equation (13.7) with B = 2.

There are two different ways we can set up an NB classifier. The model introduced in the previous section is the multinomial model: it generates one term from the vocabulary in each position of the document. An alternative to the multinomial model is the multivariate Bernoulli model or Bernoulli model. It is equivalent to the binary independence model, which generates an indicator for each term of the vocabulary, either 1 indicating presence of the term in the document or 0 indicating absence. Figure 13.3 presents training and testing algorithms for the Bernoulli model. The Bernoulli model has the same time complexity as the multinomial model.

The different generation models imply different estimation strategies and different classification rules. The Bernoulli model estimates P̂(t|c) as the fraction of documents of class c that contain term t (Figure 13.3, TrainBernoulliNB).
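The pseudocode translates almost line by line into Python. The following is a minimal sketch under our own naming, not the book's reference implementation:

```python
from math import log

def train_bernoulli_nb(classes, docs):
    """docs: list of (token_list, class_label) pairs."""
    vocab = {t for tokens, _ in docs for t in tokens}
    n = len(docs)
    prior, condprob = {}, {t: {} for t in vocab}
    for c in classes:
        class_docs = [set(tokens) for tokens, label in docs if label == c]
        n_c = len(class_docs)
        prior[c] = n_c / n
        for t in vocab:
            n_ct = sum(1 for d in class_docs if t in d)
            condprob[t][c] = (n_ct + 1) / (n_c + 2)  # add-one smoothing, B = 2
    return vocab, prior, condprob

def apply_bernoulli_nb(classes, vocab, prior, condprob, doc_tokens):
    present = set(doc_tokens) & vocab
    score = {}
    for c in classes:
        score[c] = log(prior[c])
        for t in vocab:  # every vocabulary term votes, whether present or absent
            p = condprob[t][c]
            score[c] += log(p) if t in present else log(1 - p)
    return max(score, key=score.get)

docs = [("Chinese Beijing Chinese".split(), "China"),
        ("Chinese Chinese Shanghai".split(), "China"),
        ("Chinese Macao".split(), "China"),
        ("Tokyo Japan Chinese".split(), "Japan")]
V, prior, condprob = train_bernoulli_nb(["China", "Japan"], docs)
new_doc = "Chinese Chinese Chinese Tokyo Japan".split()
print(apply_bernoulli_nb(["China", "Japan"], V, prior, condprob, new_doc))  # Japan
```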


Naive Bayes for Text Classification

Multinomial random variable

The model is based on the multinomial distribution, which is a generalization of the binomial distribution
Each of n trials leads to a success in one of k categories, with each category having a fixed success probability p_i
It holds that ∑_{i=1}^{k} p_i = 1
A multinomial r.v. is given by the vector X = (X_1, X_2, …, X_k), where X_i is the number of successes in the i-th category
It holds that ∑_{i=1}^{k} X_i = n


Naive Bayes for Text Classification

Multinomial random variable

PMF

p(x_1, …, x_k; n, p_1, …, p_k) = n! / (x_1! … x_k!) · p_1^{x_1} … p_k^{x_k}   if ∑_{i=1}^{k} x_i = n
                                 = 0  otherwise

with ∑_{i=1}^{k} p_i = 1 and x_i non-negative, ∀i
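A tiny Python sketch of this PMF (the function name is ours), handy for checking small examples by hand:

```python
from math import factorial, prod

def multinomial_pmf(x, p):
    """PMF of the multinomial distribution for counts x and category probabilities p."""
    n = sum(x)
    coeff = factorial(n)
    for x_i in x:
        coeff //= factorial(x_i)  # multinomial coefficient n! / (x_1! ... x_k!)
    return coeff * prod(p_i ** x_i for p_i, x_i in zip(p, x))

# Roll a fair die six times: probability of seeing every face exactly once
print(multinomial_pmf([1, 1, 1, 1, 1, 1], [1/6] * 6))  # ≈ 0.0154
```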


Naive Bayes for Text Classification

Multinomial generative model

In this model we assume that the documents are generated from a multinomial distribution, i.e. the number of occurrences of terms in a document is a multinomial r.v.

P(d|c) = L_d! / (tf_{t_1,d}! … tf_{t_M,d}!) · p(t_1|c)^{tf_{t_1,d}} … p(t_M|c)^{tf_{t_M,d}}   if ∑_{i=1}^{M} tf_{t_i,d} = L_d
       = 0  otherwise

tf_{t_i,d} is the term frequency of term t_i in document d
L_d is the length of the document, i.e. L_d = ∑_{i=1}^{M} tf_{t_i,d}
M is the size of the document vocabulary


Naive Bayes for Text Classification

Inference of parameters in the multinomial model

The likelihood of all documents belonging to a class is the product of the multinomial probabilities
Simple calculus (setting the derivative of the log-likelihood to 0) gives the maximum likelihood estimates for p(c) and p(t_i|c)

p(c) = n_c / n

p(t_i|c) = tf_{t_i,c} / ∑_{i'=1}^{M} tf_{t_{i'},c}

n is the total number of documents, n_c is the number of documents in class c
tf_{t_i,c} is the term frequency of term t_i in all documents belonging to class c


Naive Bayes for Text Classification

Inference of parameters in the multinomial model

As before, because of zero counts we need to apply smoothing, e.g. Laplace smoothing

p(t_i|c) = (tf_{t_i,c} + 1) / ∑_{i'=1}^{M} (tf_{t_{i'},c} + 1)


Naive Bayes for Text Classification

MAP in the multinomial model

With the estimated parameters and ignoring the irrelevant constants we obtain

c_MAP = argmax_{c_j ∈ C} P(d|c_j) P(c_j)
      = argmax_{c_j ∈ C} P(c_j) ∏_{i=1}^{L_d} p(t_i|c_j)    (2)

Interpretation of p(t_i|c): how much evidence t_i contributes to c being the correct class
Implementation with logarithms! Underflow!


Naive Bayes for Text Classification

NB with multinomial model: Example

Document                    Class
Chinese Beijing Chinese     China
Chinese Chinese Shanghai    China
Chinese Macao               China
Tokyo Japan Chinese         Japan

What is the class of d = ⟨ Chinese Chinese Chinese Tokyo Japan ⟩?


Naive Bayes for Text Classification

NB with multinomial model: Example

Class
Term        Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
China       5        1        1         1      0      0
Japan       1        0        0         0      1      1

P
Term          Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
P(t|China)    6/14     2/14     2/14      2/14   1/14   1/14
P(t|Japan)    2/9      1/9      1/9       1/9    2/9    2/9

Table: Frequencies and conditional probabilities of terms (Laplace smoothing)


Naive Bayes for Text Classification

NB with multinomial model: Example

The task is to classify the new instance d = ⟨ Chinese Chinese Chinese Tokyo Japan ⟩

P(China) P(Chinese|China)^3 P(Tokyo|China) P(Japan|China) ≈ 0.0003
P(Japan) P(Chinese|Japan)^3 P(Tokyo|Japan) P(Japan|Japan) ≈ 0.0001

c_MAP = China
Three occurrences of the positive indicator outweigh two occurrences of the negative indicators


Naive Bayes for Text Classification

NB with multinomial model

TrainMultinomialNB(C, D)
 1  V ← ExtractVocabulary(D)
 2  N ← CountDocs(D)
 3  for each c ∈ C
 4  do N_c ← CountDocsInClass(D, c)
 5     prior[c] ← N_c / N
 6     text_c ← ConcatenateTextOfAllDocsInClass(D, c)
 7     for each t ∈ V
 8     do T_ct ← CountTokensOfTerm(text_c, t)
 9     for each t ∈ V
10     do condprob[t][c] ← (T_ct + 1) / ∑_{t'} (T_ct' + 1)
11  return V, prior, condprob

ApplyMultinomialNB(C, V, prior, condprob, d)
 1  W ← ExtractTokensFromDoc(V, d)
 2  for each c ∈ C
 3  do score[c] ← log prior[c]
 4     for each t ∈ W
 5     do score[c] += log condprob[t][c]
 6  return argmax_{c ∈ C} score[c]

Figure 13.2 (from "Introduction to Information Retrieval", online edition 2009): Naive Bayes algorithm (multinomial model), training and testing.

…assign a high probability to the UK class because the term Britain occurs. The problem is that the zero probability for WTO cannot be "conditioned away", no matter how strong the evidence for the class UK from other features. The estimate is 0 because of sparseness: the training data are never large enough to represent the frequency of rare events adequately, for example the frequency of WTO occurring in UK documents.

To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count (cf. Section 11.3.2):

P̂(t|c) = (T_ct + 1) / ∑_{t' ∈ V} (T_ct' + 1) = (T_ct + 1) / ((∑_{t' ∈ V} T_ct') + B),    (13.7)

where B = |V| is the number of terms in the vocabulary. Add-one smoothing can be interpreted as a uniform prior (each term occurs once for each class) that is then updated as evidence from the training data comes in. Note that this is a prior probability for the occurrence of a term, as opposed to the prior probability of a class, which we estimate in Equation (13.5) on the document level.
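As with the Bernoulli model, the pseudocode maps almost directly onto Python. The following is a minimal sketch with our own names, not the book's code:

```python
from math import log
from collections import Counter

def train_multinomial_nb(classes, docs):
    """docs: list of (token_list, class_label) pairs."""
    vocab = {t for tokens, _ in docs for t in tokens}
    n = len(docs)
    prior, condprob = {}, {t: {} for t in vocab}
    for c in classes:
        n_c = sum(1 for _, label in docs if label == c)
        prior[c] = n_c / n
        counts = Counter(t for tokens, label in docs if label == c for t in tokens)
        denom = sum(counts[t] + 1 for t in vocab)  # Laplace smoothing, B = |V|
        for t in vocab:
            condprob[t][c] = (counts[t] + 1) / denom
    return vocab, prior, condprob

def apply_multinomial_nb(classes, vocab, prior, condprob, doc_tokens):
    tokens = [t for t in doc_tokens if t in vocab]  # ignore out-of-vocabulary terms
    score = {c: log(prior[c]) + sum(log(condprob[t][c]) for t in tokens)
             for c in classes}
    return max(score, key=score.get)

docs = [("Chinese Beijing Chinese".split(), "China"),
        ("Chinese Chinese Shanghai".split(), "China"),
        ("Chinese Macao".split(), "China"),
        ("Tokyo Japan Chinese".split(), "Japan")]
V, prior, condprob = train_multinomial_nb(["China", "Japan"], docs)
new_doc = "Chinese Chinese Chinese Tokyo Japan".split()
print(apply_multinomial_nb(["China", "Japan"], V, prior, condprob, new_doc))  # China
```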


Naive Bayes for Text Classification

NB with multinomial model: Complexity

Estimation of the parameters is O(|C| |V|)
This is the product of the number of classes and the size of the vocabulary
This is linear in the size of the vocabulary
Preprocessing is also linear
Applying the classifier is also linear in the size of the document
Altogether linear in the time it takes to scan the data: very efficient!


Naive Bayes for Text Classification

Multinomial vs. Bernoulli model

The Bernoulli model estimates p(t_i|c) as the fraction of documents of class c that contain term t
In contrast, the multinomial model estimates p(t_i|c) as the frequency of term t in the documents of class c relative to the total number of term occurrences in the documents of class c
When classifying a document, the Bernoulli model uses binary occurrence information, whereas the multinomial model keeps track of multiple occurrences


Naive Bayes for Text Classification

Multinomial vs. Bernoulli model

As a result, the Bernoulli model tends to make more errors when classifying long documents
E.g. it might assign a complete book to the "China" class because of a single occurrence of the term "Chinese"
Non-occurring terms do not affect the classification decision in the multinomial model, whereas in the Bernoulli model they do
The Bernoulli model works better with fewer features
The complexity is the same, and both models work very well and are accurate in practice


Naive Bayes for Text Classification

NB models

The independence assumption is oversimplified
The "bag of words" model ignores the positions of the words and their context
E.g. the terms in "Hong Kong" are not independent of each other, nor are their positions
The Bernoulli model does not even take the term frequencies into account
This leads to very bad estimates of the true document probabilities


Naive Bayes for Text Classification

NB models

However, NB classification decisions are surprisingly good
Suppose the true document probabilities are P(c_1|d) = 0.6 and P(c_2|d) = 0.4
Assume that d contains many positive indicators for c_1 and many negative indicators for c_2
Then the NB estimates might be: P̂(c_1|d) = 0.00099 and P̂(c_2|d) = 0.00001
This is common: the winning class in NB usually has a much larger estimated probability than the others
In classification we care only about the winning class


Naive Bayes for Text Classification

NB performance

[Figure 13.8 (from "Introduction to Information Retrieval", online edition 2009): effect of feature set size on accuracy for multinomial and Bernoulli models. X-axis: number of features selected (1 to 10000); y-axis: F1 measure (0.0 to 0.8); curves: multinomial with MI, multinomial with chi-square, multinomial with frequency-based feature selection, and binomial (Bernoulli) with MI.]

…the end when we use all features. The reason is that the multinomial model takes the number of occurrences into account in parameter estimation and classification and therefore better exploits a larger number of features than the Bernoulli model. Regardless of the differences between the two methods, using a carefully selected subset of the features results in better effectiveness than using all features.

13.5.2 χ² feature selection

Another popular feature selection method is χ². In statistics, the χ² test is applied to test the independence of two events, where two events A and B are defined to be independent if P(AB) = P(A)P(B) or, equivalently, P(A|B) = P(A) and P(B|A) = P(B). In feature selection, the two events are the occurrence of the term and the occurrence of the class. We then rank terms with respect to the following quantity:

X²(D, t, c) = ∑_{e_t ∈ {0,1}} ∑_{e_c ∈ {0,1}} (N_{e_t e_c} − E_{e_t e_c})² / E_{e_t e_c}    (13.18)


Naive Bayes for Text Classification

NB performance

Table 13.8 (from "Introduction to Information Retrieval", online edition 2009): macro- and microaveraging. "Truth" is the true class and "call" the decision of the classifier. In this example, macroaveraged precision is [10/(10 + 10) + 90/(10 + 90)]/2 = (0.5 + 0.9)/2 = 0.7. Microaveraged precision is 100/(100 + 20) ≈ 0.83.

class 1        truth: yes   truth: no
call: yes      10           10
call: no       10           970

class 2        truth: yes   truth: no
call: yes      90           10
call: no       10           890

pooled table   truth: yes   truth: no
call: yes      100          20
call: no       20           1860

Table 13.9: text classification effectiveness numbers on Reuters-21578 for F1 (in percent). Results from Li and Yang (2003) (a), Joachims (1998) (b: kNN) and Dumais et al. (1998) (b: NB, Rocchio, trees, SVM).

(a)                         NB   Rocchio  kNN   SVM
micro-avg-L (90 classes)    80   85       86    89
macro-avg (90 classes)      47   59       60    60

(b)                         NB   Rocchio  kNN   trees  SVM
earn                        96   93       97    98     98
acq                         88   65       92    90     94
money-fx                    57   47       78    66     75
grain                       79   68       82    85     95
crude                       80   70       86    85     89
trade                       64   65       77    73     76
interest                    65   63       74    67     78
ship                        85   49       79    74     86
wheat                       70   69       77    93     92
corn                        65   48       78    92     90
micro-avg (top 10)          82   65       82    88     92
micro-avg-D (118 classes)   75   62       n/a   n/a    87

…Rocchio and kNN. In addition, the table gives numbers for decision trees, an important classification method not covered here. The bottom part of the table shows that there is considerable variation from class to class. For instance, NB beats kNN on ship, but is much worse on money-fx.

Comparing parts (a) and (b) of the table, one is struck by the degree to which the cited papers' results differ. This is partly due to the fact that the numbers in (b) are break-even scores (cf. page 161) averaged over 118 classes, whereas the numbers in (a) are true F1 scores (computed without any know…


Naive Bayes for Text Classification

NB performance

SVM seems to be better, but it costs more to train
kNN is better for some classes
In theory: SVM better than kNN better than NB
When the data is nice!
In practice, with errors in the data, data drifting, etc., it is very difficult to beat NB
