Knowledge Discovery and Data Mining 1 (VO)
TRANSCRIPT
Knowledge Discovery and Data Mining 1 (VO) (706.701)
Classification
Denis Helic
ISDS, TU Graz
January 10, 2019
Big picture: KDDM
[Figure: the knowledge discovery process (preprocessing, transformation, data mining) built on mathematical tools (linear algebra, probability theory, information theory, statistical inference) and on a hardware & programming infrastructure for modeling.]
Outline
1 Introduction
2 Bayes Rule for Classification
3 Naive Bayes Classifier
4 Naive Bayes for Text Classification
Slides are partially based on the “Machine Learning” course from the University of Washington by Pedro Domingos, the “Machine Learning” book by Tom Mitchell, and “Introduction to Information Retrieval” by Manning, Raghavan and Schütze
Introduction
Classification as a Machine Learning task
Classification problems: classify an example into a given set of categories
Typically we have a labeled set of training examples
We apply an algorithm to learn from the training set and devise a classification rule
We apply the rule to classify new examples
Introduction
Types of learning
Supervised (inductive) learning: training data includes the desired outputs
Unsupervised learning: training data does not include the desired outputs
Semi-supervised learning: training data includes a few desired outputs
Reinforcement learning: rewards from a sequence of actions
Introduction
Supervised learning
Given examples of a function (x, f(x))
Predict the function f(x) for new examples x
Depending on f(x) we have:
1 Classification if f(x) is a discrete function
2 Regression if f(x) is a continuous function
3 Probability estimation if f(x) is a probability distribution
Introduction
Classification
Given: training examples (x, f(x)) for some unknown discrete function f
Find: a good approximation for f
x is data, e.g. a vector, text documents, emails, words in text, etc.
Introduction
Classification: example applications
Credit risk assessment:
x: properties of customers and the proposed purchase
f(x): approve the purchase or not

Disease diagnosis:
x: properties of patients, lab tests
f(x): disease
Introduction
Classification: example applications
Email spam filter:
x: email text (or features extracted from emails)
f(x): spam or not spam

Text categorization:
x: documents or words (features) from a document
f(x): thematic category
Introduction
Terminology
Training example: an example of the form (x, f(x))
Target function: the true function f
Hypothesis: a proposed function h believed to be similar to f
Classifier: a discrete-valued function; the possible values f(x) ∈ {1, 2, …, K} are called classes or labels
Hypothesis space: the space of all possible hypotheses
Introduction
Classification approaches
Logistic regression
Decision trees
Support vector machines
Nearest neighbor algorithms
Neural networks
Statistical inference, such as Bayesian learning (e.g. text categorization)
Bayes Rule for Classification
Application of Bayes Rule for classification
We are given some hypothesis space H, the observed data D, and some prior probabilities of the various hypotheses
We are interested in determining the best hypothesis from H
The best means the most probable one, given the data plus the prior
In other words, we are interested in the posterior probabilities of the various hypotheses given the data and their prior probabilities
Bayes Rule for Classification
Application of Bayes Rule for classification
Bayes Theorem
For a given joint probability distribution P(D, h) we have the prior probability of h given as P(h), the probability P(D) of observing the data regardless of which hypothesis holds, and the conditional probability of the data given a hypothesis, P(D|h). The posterior probability of a hypothesis h is given by:

P(h|D) = P(D|h) P(h) / P(D)
Bayes Rule for Classification
Application of Bayes Rule for classification
P(h|D) increases with P(h) and with P(D|h)
Also, P(h|D) decreases as P(D) increases, because the more probable it is that D will be observed independently of h, the less evidence D provides in support of h
In many learning scenarios the learner considers some set of candidate hypotheses H and selects the most probable hypothesis from this set
Such a maximally probable hypothesis is called the maximum a posteriori (MAP) hypothesis
Bayes Rule for Classification
Maximum a posteriori (MAP) hypothesis
MAP
h_MAP ≡ argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h) P(h) / P(D)
      = argmax_{h ∈ H} P(D|h) P(h)

In the last step we dropped P(D) because it is a constant independent of h
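As a minimal sketch in Python, MAP selection over a finite hypothesis space is a one-liner; the hypothesis names, priors, and likelihood table below are hypothetical placeholders, not values from the lecture:

    def h_map(H, prior, likelihood, D):
        # argmax_h P(D|h) P(h); the constant P(D) is dropped
        return max(H, key=lambda h: likelihood(D, h) * prior[h])

    # Hypothetical usage: two hypotheses, one boolean observation
    prior = {"h1": 0.7, "h2": 0.3}
    table = {("h1", True): 0.2, ("h1", False): 0.8,
             ("h2", True): 0.9, ("h2", False): 0.1}
    likelihood = lambda D, h: table[(h, D)]
    print(h_map(["h1", "h2"], prior, likelihood, True))  # h2 (0.27 vs 0.14)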
Bayes Rule for Classification
Maximum likelihood (ML) hypothesis
In some cases we can assume that every hypothesis is equally probable a priori
The MAP equation then simplifies to:

ML

h_ML = argmax_{h ∈ H} P(D|h)

P(D|h) is the likelihood of the data given a hypothesis
This is then the maximum likelihood (ML) hypothesis
Bayes Rule for Classification
MAP: example
Medical diagnosis
Consider a medical diagnosis problem in which there are two alternative hypotheses:
1 The patient has a particular form of cancer
2 The patient does not have cancer
The available data is from a particular laboratory test with two possible outcomes: P (positive) and N (negative). We have prior knowledge that over the entire population only 0.008 of the people have this disease. Furthermore, the test returns a correct positive result in only 98% of the cases in which the disease is actually present and a correct negative result in only 97% of the cases in which the disease is not present. What is h_MAP?
Bayes Rule for Classification
MAP: example
P(cancer) = 0.008
P(¬cancer) = 0.992
P(P|cancer) = 0.98
P(N|cancer) = 0.02
P(P|¬cancer) = 0.03
P(N|¬cancer) = 0.97
h_MAP = ?
Bayes Rule for Classification
MAP: example
If the test is positive
P(P|cancer) P(cancer) = 0.00784
P(P|¬cancer) P(¬cancer) = 0.02976

h_MAP = ¬cancer

If the test is negative

P(N|cancer) P(cancer) = 0.00016
P(N|¬cancer) P(¬cancer) = 0.96224

h_MAP = ¬cancer
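A quick check of these numbers in Python (all values taken from the slide); normalizing the positive-test case also gives the actual posterior probability of cancer:

    # Cancer-test MAP example: unnormalized posteriors
    p_cancer, p_not = 0.008, 0.992
    pos_cancer = 0.98 * p_cancer   # P(P|cancer) P(cancer)   = 0.00784
    pos_not = 0.03 * p_not         # P(P|¬cancer) P(¬cancer) = 0.02976
    print(pos_cancer, pos_not)     # h_MAP = ¬cancer

    # Normalized posterior of cancer given a positive test: ~0.21
    print(pos_cancer / (pos_cancer + pos_not))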
Naive Bayes Classifier
Bayes classifier
We considered the question: “What is the most probable hypothesis given the training data?”
We can also ask: “What is the most probable classification of the new instance given the training data?”
In other words, we can apply Bayes theorem to the classification problem
The most well-known example of such an approach is the Naive Bayes classifier
Naive Bayes Classifier
Naive Bayes Classifier
The Naive Bayes approach is based on the MAP approach applied to the set of possible classifications of new data instances
Given a data instance as a tuple of values x = ⟨x1, x2, …, xn⟩ and a finite set of classes C, v_MAP is given by:

v_MAP = argmax_{c_j ∈ C} P(c_j | x1, x2, …, xn)
      = argmax_{c_j ∈ C} P(x1, x2, …, xn | c_j) P(c_j) / P(x1, x2, …, xn)
      = argmax_{c_j ∈ C} P(x1, x2, …, xn | c_j) P(c_j)
Naive Bayes Classifier
Naive Bayes Classifier
We estimate the two terms from the previous equation based on the training data
P(c_j) is easy: we just count the frequency with which each class occurs in the training data
Estimating P(x1, x2, …, xn | c_j) is difficult unless we have a very, very large dataset
The number of those terms is equal to the number of possible instances times the number of classes
We would need to see every instance in the instance space to obtain reliable estimates
Naive Bayes Classifier
Naive Bayes Classifier
Using the chain rule we can rewrite P(x1, x2, …, xn | c_j) as:

P(x1, …, xn | c_j) = P(x1|c_j) P(x2|x1, c_j) P(x3|x1, x2, c_j) … P(xn|x1, …, xn−1, c_j)
Naive Bayes Classifier
Naive Bayes Classifier
The naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the class:

P(x, y | c) = P(x|c) P(y|c)
P(x | y, c) = P(x|c), ∀x, y, c

P(x1, …, xn | c_j) = P(x1|c_j) P(x2|c_j) P(x3|c_j) … P(xn|c_j)
                   = ∏_i P(xi|c_j)
Naive Bayes Classifier
Naive Bayes Classifier
v_MAP is then given by:

v_MAP = argmax_{c_j ∈ C} P(c_j) ∏_i P(xi|c_j)

Estimating P(xi|c_j) from the training data is much easier than estimating P(x1, x2, …, xn | c_j)
The number of parameters to estimate is the number of distinct attribute values times the number of distinct classes
Estimating a single parameter is typically easy: e.g. based on the frequency of occurrence in the training data
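A minimal sketch of this counting-based estimation and classification in Python (raw frequency estimates, without the smoothing discussed later in the lecture; the function names are ad hoc):

    import math
    from collections import Counter, defaultdict

    def train_nb(examples):
        # examples: list of (attribute-tuple, class) pairs
        class_counts = Counter(c for _, c in examples)
        value_counts = defaultdict(Counter)   # (position, class) -> value counts
        for x, c in examples:
            for i, v in enumerate(x):
                value_counts[(i, c)][v] += 1
        n = len(examples)
        prior = {c: k / n for c, k in class_counts.items()}
        def cond(i, v, c):                    # P(x_i = v | c) by frequency
            return value_counts[(i, c)][v] / class_counts[c]
        return prior, cond

    def classify(x, prior, cond):
        # v_MAP = argmax_c P(c) * prod_i P(x_i | c)
        return max(prior, key=lambda c: prior[c] * math.prod(
            cond(i, v, c) for i, v in enumerate(x)))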
Naive Bayes Classifier
Naive Bayes Classifier: Example
The PlayTennis problem from the book “Machine Learning” by Tom Mitchell
Can we play tennis on a given day with given attributes?
It is a classification problem with two classes: “yes” and “no”
The classification is based on the values of attributes such as temperature, humidity, etc.
We have 14 training examples
Naive Bayes Classifier
Naive Bayes Classifier: example
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   sunny     hot          high      weak    no
D2   sunny     hot          high      strong  no
D3   overcast  hot          high      weak    yes
D4   rain      mild         high      weak    yes
D5   rain      cool         normal    weak    yes
D6   rain      cool         normal    strong  no
D7   overcast  cool         normal    strong  yes
D8   sunny     mild         high      weak    no
D9   sunny     cool         normal    weak    yes
D10  rain      mild         normal    weak    yes
D11  sunny     mild         normal    strong  yes
D12  overcast  mild         high      strong  yes
D13  overcast  hot          normal    weak    yes
D14  rain      mild         high      strong  no
Naive Bayes Classifier
Naive Bayes Classifier: Example
The task is to classify the new instance: ⟨sunny, cool, high, strong⟩
The class is then given by:

v_MAP = argmax_{c_j ∈ C} P(c_j) ∏_i P(xi|c_j)
      = argmax_{c_j ∈ C} P(c_j) P(sunny|c_j) P(cool|c_j) P(high|c_j) P(strong|c_j)
Naive Bayes Classifier
Naive Bayes Classifier: example
PlayTennis
Outlook          sunny  overcast  rain  n
yes              2      4         3     9
no               3      0         2     5

P
Outlook          sunny  overcast  rain
P(Outlook|yes)   2/9    4/9       3/9
P(Outlook|no)    3/5    0         2/5

Table: Frequencies and conditional probabilities of the “Outlook” feature
Naive Bayes Classifier
Naive Bayes Classifier: example
PlayTennis
Temperature          hot  mild  cool  n
yes                  2    4     3     9
no                   2    2     1     5

P
Temperature          hot  mild  cool
P(Temperature|yes)   2/9  4/9   3/9
P(Temperature|no)    2/5  2/5   1/5

Table: Frequencies and conditional probabilities of the “Temperature” feature
Naive Bayes Classifier
Naive Bayes Classifier: example
PlayTennis
Humidity          high  normal  n
yes               3     6       9
no                4     1       5

P
Humidity          high  normal
P(Humidity|yes)   3/9   6/9
P(Humidity|no)    4/5   1/5

Table: Frequencies and conditional probabilities of the “Humidity” feature
Naive Bayes Classifier
Naive Bayes Classifier: example
PlayTennis
Wind          weak  strong  n
yes           6     3       9
no            2     3       5

P
Wind          weak  strong
P(Wind|yes)   6/9   3/9
P(Wind|no)    2/5   3/5

Table: Frequencies and conditional probabilities of the “Wind” feature
Naive Bayes Classifier
Naive Bayes Classifier: Example
We calculate unnormalized a posteriori probabilities for the classes:
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.00529
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.02057

v_MAP = no
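A quick check of the slide's numbers, plugging in the estimates from the tables above:

    # Unnormalized scores for <sunny, cool, high, strong>
    p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9   # ~0.00529
    p_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5   # ~0.02057
    print(p_yes, p_no)                     # v_MAP = no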
Naive Bayes Classifier
Naive Bayes Classifier: Example
Remarks
These are not the probabilities
To obtain conditional probabilities we need to normalize with the sum over the classes:

0.02057 / (0.00529 + 0.02057) ≈ 0.795

In a real example the numbers get very small: danger of underflow
Use logarithms!
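In log space the product becomes a sum, and the normalization can be done stably with the log-sum-exp trick; a minimal sketch using the PlayTennis numbers:

    import math

    # Per-class log-scores: log P(c) + sum_i log P(x_i|c)
    scores = {
        "yes": math.log(9/14) + sum(map(math.log, (2/9, 3/9, 3/9, 3/9))),
        "no":  math.log(5/14) + sum(map(math.log, (3/5, 1/5, 4/5, 3/5))),
    }
    # log-sum-exp normalization avoids underflow for long products
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    print({c: math.exp(s - log_z) for c, s in scores.items()})  # no: ~0.795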
Naive Bayes Classifier
Naive Bayes Classifier: Example
The task is to classify the new instance: ⟨overcast, cool, high, strong⟩

P(yes) P(overcast|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.01058
P(no) P(overcast|no) P(cool|no) P(high|no) P(strong|no) = 0

P(overcast|no) = 0
v_MAP = yes for any new instance with overcast as Outlook
Naive Bayes Classifier
Estimating probabilities
We have estimated probabilities by the fraction of times the event is observed to occur over the total number of opportunities: p = n_c / n
The first difficulty: n_c / n is a biased underestimate of the real probability
Second, when the estimate is 0, the a posteriori probability for that class is always 0
Smoothing! (as discussed in the feature extraction lecture)
Naive Bayes Classifier
Estimating probabilities
We introduce fake counts
One possibility: the m-estimate of probability

p̂ = (n_c + m p) / (n + m)

p is a prior estimate of the probability we wish to determine
m is a constant called the equivalent sample size
Naive Bayes Classifier
Estimating probabilities
In the absence of any other information we will assume uniform priors: p = 1/k, with k being the number of distinct values of a feature
E.g. if a feature has two possible values, p = 0.5
Using this prior we produce m fake counts
E.g. if a feature has two possible values we can produce two fake counts:

p̂ = (n_c + 1) / (n + 2)
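As a Python function this is a direct transcription of the formula above; the example reuses the two-value case:

    def m_estimate(n_c, n, p, m):
        # (n_c + m*p) / (n + m): n_c observed counts, n opportunities,
        # p prior estimate, m equivalent sample size (fake counts)
        return (n_c + m * p) / (n + m)

    # Uniform prior over 2 values with m = 2 gives (n_c + 1) / (n + 2):
    print(m_estimate(0, 5, 0.5, 2))  # 1/7 instead of a hard 0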
Naive Bayes for Text Classification
Classifying text
Text classification is a standard area of classical information retrieval
A query (e.g. “Multicore CPU”) against a document collection divides the collection into two classes:
Documents about “Multicore CPU”
Documents not about “Multicore CPU”
A typical two-class classification problem
Naive Bayes for Text Classification
Classifying text
In the general case, a class is a subject, a topic or a category
Application examples of text classification include:
1 Automatic detection of spam documents or other types of offending content
2 Sentiment detection (positive or negative product reviews)
3 Personal email sorting
4 Topical or vertical search
There are several approaches to text classification, but again we concentrate on statistical text classification, in particular on Naive Bayes
Naive Bayes for Text Classification
Classifying text
We are given a description d ∈ X of a document and a fixed set of classes ℂ = {c1, c2, …, cJ}
X is the document space, typically some high-dimensional space
The classes are in most cases human-defined
We are given a training set H of labeled documents ⟨d, c⟩ ∈ X × ℂ
For example: ⟨Beijing joins the World Trade Organization, China⟩
Naive Bayes for Text Classification
Naive Bayes text classification
The MAP approach applied to the set of possible classifications of a new document d:

c_MAP = argmax_{c_j ∈ ℂ} P(c_j | d)
      = argmax_{c_j ∈ ℂ} P(d | c_j) P(c_j) / P(d)
      = argmax_{c_j ∈ ℂ} P(d | c_j) P(c_j)
Naive Bayes for Text Classification
Naive Bayes text classification
We can interpret the last equation as a description of the generative process
To generate a document, we first select a class c with probability P(c)
In the second step we generate a document given c, according to the conditional probability distribution P(d|c)
What does this conditional probability distribution look like?
First we need to make certain simplifying assumptions
Naive Bayes for Text Classification
Naive Bayes text classification
Similarly to the naive Bayes classifier, we base our models on the simplifying assumption that the attribute values are independent, or conditionally independent given the class:

P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)

This is the unigram language model
Under this model the order of words is irrelevant: “bag of words”
A generative model will assign the same probability to any “bag of words” regardless of their ordering
Naive Bayes for Text Classification
Naive Bayes text classification
Assuming the “bag of words” model, we still need to select a particular conditional distribution over bags
The two most common approaches are:
1 Bernoulli distribution
2 Multinomial distribution
Naive Bayes for Text Classification
Bernoulli generative model
In this model we assume that the documents are generated from a multivariate Bernoulli distribution
The model generates an indicator for each term from the vocabulary:
1 indicates the presence of the term in a document
0 indicates the absence of the term in a document
Different generation of the documents leads to different parameter estimation and different classification rules
Naive Bayes for Text Classification
Bernoulli generative model
In particular, for each term in the vocabulary we perform a Bernoulli trial, and its result decides if the term is present in a document or absent:

P(d|c) = ∏_{t_i ∈ d} P(t_i|c) ∏_{t_i ∉ d} (1 − P(t_i|c))
Naive Bayes for Text Classification
Inference of parameters in the Bernoulli model
We apply MLE again to get the estimates for P(c) and P(t_i|c):

P(c) = N_c / N

P(t_i|c) = ∑_j I_{t_i,d_j} / N_c

N is the total number of documents, N_c is the number of documents in class c
I_{t_i,d_j} is the indicator of the presence of term t_i in document d_j belonging to class c (the sum runs over the documents of class c)
Naive Bayes for Text Classification
Inference of parameters in the Bernoulli model
As before, because of zero counts we need to apply smoothing, e.g. m-estimate smoothing with p = 0.5 and m = 2:

P(t_i|c) = (∑_j I_{t_i,d_j} + 1) / (N_c + 2)
Naive Bayes for Text Classification
MAP in the Bernoulli model
With the estimated parameters, and ignoring the irrelevant constants, we obtain:

c_MAP = argmax_{c_j ∈ ℂ} P(d|c_j) P(c_j)
      = argmax_{c_j ∈ ℂ} P(c_j) ∏_{t_i ∈ d} P(t_i|c_j) ∏_{t_i ∉ d} (1 − P(t_i|c_j))    (1)

Interpretation of P(t_i|c) and (1 − P(t_i|c)): how much evidence the presence or absence of t_i contributes to c being the correct class
Implementation with logarithms! Underflow!
Naive Bayes for Text Classification
NB with Bernoulli model: Example
Document                    Class
Chinese Beijing Chinese     China
Chinese Chinese Shanghai    China
Chinese Macao               China
Tokyo Japan Chinese         Japan

What is the class of d = ⟨Chinese Chinese Chinese Tokyo Japan⟩?
Naive Bayes for Text Classification
NB with Bernoulli model: Example
Class
Term          Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
China         3        1        1         1      0      0
Japan         1        0        0         0      1      1

P
Term          Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
P(t|China)    4/5      2/5      2/5       2/5    1/5    1/5
P(t|Japan)    2/3      1/3      1/3       1/3    2/3    2/3

Table: Frequencies and conditional probabilities of terms (m-estimate smoothing)
Naive Bayes for Text Classification
NB with Bernoulli model: Example
The task is to classify the new instance: d = ⟨Chinese Chinese Chinese Tokyo Japan⟩

P(China) P(Chinese|China) P(Tokyo|China) P(Japan|China)
× (1 − P(Beijing|China)) (1 − P(Shanghai|China)) (1 − P(Macao|China)) = 0.00518

P(Japan) P(Chinese|Japan) P(Tokyo|Japan) P(Japan|Japan)
× (1 − P(Beijing|Japan)) (1 − P(Shanghai|Japan)) (1 − P(Macao|Japan)) = 0.02194

c_MAP = Japan
“Japan” and “Tokyo” are good indicators for class Japan, and since we do not count occurrences they outweigh the “Chinese” indicator
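A quick check of these numbers with the probabilities from the table above:

    # Bernoulli model: P(c) * prod over present terms * prod over absent terms
    p_china = 3/4 * (4/5 * 1/5 * 1/5) * (1 - 2/5) * (1 - 2/5) * (1 - 2/5)
    p_japan = 1/4 * (2/3 * 2/3 * 2/3) * (1 - 1/3) * (1 - 1/3) * (1 - 1/3)
    print(p_china, p_japan)   # ~0.00518, ~0.02194 -> c_MAP = Japan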
Naive Bayes for Text Classification
NB with Bernoulli model
TRAINBERNOULLINB(C, D)
1 V ← EXTRACTVOCABULARY(D)
2 N ← COUNTDOCS(D)
3 for each c ∈ C
4 do N_c ← COUNTDOCSINCLASS(D, c)
5    prior[c] ← N_c / N
6    for each t ∈ V
7    do N_ct ← COUNTDOCSINCLASSCONTAININGTERM(D, c, t)
8       condprob[t][c] ← (N_ct + 1) / (N_c + 2)
9 return V, prior, condprob

APPLYBERNOULLINB(C, V, prior, condprob, d)
1 V_d ← EXTRACTTERMSFROMDOC(V, d)
2 for each c ∈ C
3 do score[c] ← log prior[c]
4    for each t ∈ V
5    do if t ∈ V_d
6       then score[c] += log condprob[t][c]
7       else score[c] += log(1 − condprob[t][c])
8 return argmax_{c ∈ C} score[c]

Figure 13.3: NB algorithm (Bernoulli model): training and testing. The add-one smoothing in Line 8 (top) is in analogy to Equation (13.7) with B = 2.
13.3 The Bernoulli model
There are two different ways we can set up an NB classifier. The model we introduced in the previous section is the multinomial model. It generates one term from the vocabulary in each position of the document, where we assume a generative model that will be discussed in more detail in Section 13.4 (see also page 237).

An alternative to the multinomial model is the multivariate Bernoulli model or Bernoulli model. It is equivalent to the binary independence model of Section 11.3 (page 222), which generates an indicator for each term of the vocabulary, either 1 indicating presence of the term in the document or 0 indicating absence. Figure 13.3 presents training and testing algorithms for the Bernoulli model. The Bernoulli model has the same time complexity as the multinomial model.

The different generation models imply different estimation strategies and different classification rules. The Bernoulli model estimates P̂(t|c) as the fraction of documents of class c that contain term t (Figure 13.3, TRAINBERNOULLI-
Naive Bayes for Text Classification
Multinomial random variable
The model is based on the multinomial distribution, which is a generalization of the binomial distribution
Each of n trials leads to a success in one of k categories, with each category having a fixed success probability p_i
It holds that ∑_{i=1}^k p_i = 1
A multinomial r.v. is given by the vector X = (X_1, X_2, …, X_k), where X_i is the number of successes in the i-th category
It holds that ∑_{i=1}^k X_i = n
Naive Bayes for Text Classification
Multinomial random variable
PMF
p(x_1, …, x_k; n, p_1, …, p_k) = n! / (x_1! ⋯ x_k!) · p_1^{x_1} ⋯ p_k^{x_k}   if ∑_{i=1}^k x_i = n, and 0 otherwise

with ∑_{i=1}^k p_i = 1 and x_i non-negative for all i
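A direct transcription of this PMF in Python, for illustration (factorials and products from the standard library):

    from math import factorial, prod

    def multinomial_pmf(x, p):
        # x = (x_1, ..., x_k) category counts, p = (p_1, ..., p_k) probabilities
        n = sum(x)
        coeff = factorial(n) // prod(factorial(xi) for xi in x)
        return coeff * prod(pi ** xi for pi, xi in zip(p, x))

    # E.g. 10 rolls of a fair die with observed counts (3, 2, 1, 1, 2, 1)
    print(multinomial_pmf((3, 2, 1, 1, 2, 1), (1/6,) * 6))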
Naive Bayes for Text Classification
Multinomial generative model
In this model we assume that the documents are generated from a multinomial distribution, i.e. the number of occurrences of the terms in a document is a multinomial r.v.:

P(d|c) = L_d! / (tf_{t_1,d}! ⋯ tf_{t_M,d}!) · P(t_1|c)^{tf_{t_1,d}} ⋯ P(t_M|c)^{tf_{t_M,d}}   if ∑_{i=1}^M tf_{t_i,d} = L_d, and 0 otherwise

tf_{t_i,d} is the term frequency of term t_i in document d
L_d is the length of the document, i.e. L_d = ∑_{i=1}^M tf_{t_i,d}
M is the size of the document vocabulary
Naive Bayes for Text Classification
Inference of parameters in the multinomial model
The likelihood of all documents belonging to a class is the product of the multinomial probabilities
Simple calculus (setting the derivative of the log-likelihood to 0) gives the maximum likelihood estimates for P(c) and P(t_i|c):

P(c) = N_c / N

P(t_i|c) = tf_{t_i,c} / ∑_{j=1}^M tf_{t_j,c}

N is the total number of documents, N_c is the number of documents in class c
tf_{t_i,c} is the term frequency of term t_i in all documents belonging to class c
Naive Bayes for Text Classification
Inference of parameters in the multinomial model
As before, because of zero counts we need to apply smoothing, e.g. Laplace smoothing:

P(t_i|c) = (tf_{t_i,c} + 1) / ∑_{j=1}^M (tf_{t_j,c} + 1)
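A minimal sketch of multinomial NB training and application with Laplace smoothing and log-space scoring, in the spirit of the IIR pseudocode shown a few slides below (the function names are ad hoc):

    import math
    from collections import Counter

    def train_multinomial_nb(docs):
        # docs: list of (token-list, class) pairs
        vocab = {t for tokens, _ in docs for t in tokens}
        classes = {c for _, c in docs}
        prior = {c: sum(1 for _, k in docs if k == c) / len(docs) for c in classes}
        condprob = {}
        for c in classes:
            tf = Counter(t for tokens, k in docs if k == c for t in tokens)
            denom = sum(tf.values()) + len(vocab)       # add-one denominator
            condprob[c] = {t: (tf[t] + 1) / denom for t in vocab}
        return prior, condprob

    def apply_multinomial_nb(prior, condprob, tokens):
        def score(c):
            s = math.log(prior[c])
            for t in tokens:
                if t in condprob[c]:                    # skip out-of-vocabulary terms
                    s += math.log(condprob[c][t])
            return s
        return max(prior, key=score)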
Naive Bayes for Text Classification
MAP in the multinomial model
With the estimated parameters, and ignoring the irrelevant constants, we obtain:

c_MAP = argmax_{c_j ∈ ℂ} P(d|c_j) P(c_j)
      = argmax_{c_j ∈ ℂ} P(c_j) ∏_{i=1}^{L_d} P(t_i|c_j)    (2)

Interpretation of P(t_i|c): how much evidence t_i contributes to c being the correct class
Implementation with logarithms! Underflow!
Naive Bayes for Text Classification
NB with multinomial model: Example
Document                    Class
Chinese Beijing Chinese     China
Chinese Chinese Shanghai    China
Chinese Macao               China
Tokyo Japan Chinese         Japan

What is the class of d = ⟨Chinese Chinese Chinese Tokyo Japan⟩?
Naive Bayes for Text Classification
NB with multinomial model: Example
Class
Term          Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
China         5        1        1         1      0      0
Japan         1        0        0         0      1      1

P
Term          Chinese  Beijing  Shanghai  Macao  Tokyo  Japan
P(t|China)    6/14     2/14     2/14      2/14   1/14   1/14
P(t|Japan)    2/9      1/9      1/9       1/9    2/9    2/9

Table: Frequencies and conditional probabilities of terms (Laplace smoothing)
Naive Bayes for Text Classification
NB with multinomial model: Example
The task is to classify the new instance: d = ⟨Chinese Chinese Chinese Tokyo Japan⟩

P(China) P(Chinese|China)³ P(Tokyo|China) P(Japan|China) = 0.0003
P(Japan) P(Chinese|Japan)³ P(Tokyo|Japan) P(Japan|Japan) = 0.0001

c_MAP = China
Three occurrences of the positive indicator outweigh two occurrences of the negative indicators
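A quick check of these numbers with the estimates from the table above:

    # Multinomial model: term frequencies matter
    p_china = 3/4 * (6/14)**3 * (1/14) * (1/14)
    p_japan = 1/4 * (2/9)**3 * (2/9) * (2/9)
    print(p_china, p_japan)   # ~0.0003, ~0.0001 -> c_MAP = China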
Naive Bayes for Text Classification
NB with multinomial model
TRAINMULTINOMIALNB(C, D)
1 V ← EXTRACTVOCABULARY(D)
2 N ← COUNTDOCS(D)
3 for each c ∈ C
4 do N_c ← COUNTDOCSINCLASS(D, c)
5    prior[c] ← N_c / N
6    text_c ← CONCATENATETEXTOFALLDOCSINCLASS(D, c)
7    for each t ∈ V
8    do T_ct ← COUNTTOKENSOFTERM(text_c, t)
9    for each t ∈ V
10   do condprob[t][c] ← (T_ct + 1) / ∑_{t′} (T_ct′ + 1)
11 return V, prior, condprob

APPLYMULTINOMIALNB(C, V, prior, condprob, d)
1 W ← EXTRACTTOKENSFROMDOC(V, d)
2 for each c ∈ C
3 do score[c] ← log prior[c]
4    for each t ∈ W
5    do score[c] += log condprob[t][c]
6 return argmax_{c ∈ C} score[c]

Figure 13.2: Naive Bayes algorithm (multinomial model): training and testing.
assign a high probability to the UK class because the term Britain occurs. The problem is that the zero probability for WTO cannot be “conditioned away,” no matter how strong the evidence for the class UK from other features. The estimate is 0 because of sparseness: the training data are never large enough to represent the frequency of rare events adequately, for example, the frequency of WTO occurring in UK documents.

To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count (cf. Section 11.3.2):

P̂(t|c) = (T_ct + 1) / ∑_{t′∈V} (T_ct′ + 1) = (T_ct + 1) / ((∑_{t′∈V} T_ct′) + B),    (13.7)
Naive Bayes for Text Classification
NB with multinomial model: Complexity
Estimation of the parameters is O(|ℂ| |V|):
the product of the number of classes and the size of the vocabulary
This is linear in the size of the vocabulary
Preprocessing is also linear
Applying the classifier is also linear in the size of the document
All together linear in the time it takes to scan the data: very efficient!
Naive Bayes for Text Classification
Multinomial vs. Bernoulli model
The Bernoulli model estimates P(t|c) as the fraction of documents of class c that contain term t
In contrast, the multinomial model estimates P(t|c) as the fraction of tokens (term positions) in the documents of class c that are the term t
When classifying a document, the Bernoulli model uses binary occurrence information, whereas the multinomial model keeps track of multiple occurrences
Naive Bayes for Text Classification
Multinomial vs. Bernoulli model
As a result, the Bernoulli model tends to make more errors when classifying long documents
E.g. it might assign a complete book to the class “China” because of a single occurrence of the term “Chinese”
Non-occurring terms do not affect the classification decision in the multinomial model, whereas in the Bernoulli model they do
Bernoulli works better with fewer features
The complexity is the same, and both models work very well and are accurate in practice
Naive Bayes for Text Classification
NB models
The independence assumption is oversimplified
The “bag of words” model ignores the positions of the words and their context
E.g. the words in “Hong Kong” are not independent of each other, nor are their positions
The Bernoulli model does not even take the term frequencies into account
This leads to very bad estimates of the true document probabilities
Naive Bayes for Text Classification
NB models
However, NB classification decisions are surprisingly good
Suppose the true document probabilities are P(c1|d) = 0.6 and P(c2|d) = 0.4
Assume that d contains many positive indicators for c1 and many negative indicators for c2
Then the NB estimates might be: P̂(c1|d) = 0.00099 and P̂(c2|d) = 0.00001
This is common: the winning class in NB usually has a much larger probability than the others
In classification we care only about the winning class
Naive Bayes for Text Classification
NB performance
[Figure 13.8: Effect of feature set size on accuracy (F1) for the multinomial and Bernoulli models, with features selected by mutual information, chi-square, and frequency; plot omitted.]
the end when we use all features. The reason is that the multinomial takes the number of occurrences into account in parameter estimation and classification and therefore better exploits a larger number of features than the Bernoulli model. Regardless of the differences between the two methods, using a carefully selected subset of the features results in better effectiveness than using all features.
13.5.2 χ² feature selection

Another popular feature selection method is χ². In statistics, the χ² test is applied to test the independence of two events, where two events A and B are defined to be independent if P(AB) = P(A)P(B) or, equivalently, P(A|B) = P(A) and P(B|A) = P(B). In feature selection, the two events are the occurrence of the term and the occurrence of the class. We then rank terms with respect to the following quantity:

X²(D, t, c) = ∑_{e_t ∈ {0,1}} ∑_{e_c ∈ {0,1}} (N_{e_t e_c} − E_{e_t e_c})² / E_{e_t e_c}    (13.18)
Naive Bayes for Text Classification
NB performance
Table 13.8: Macro- and microaveraging. “Truth” is the true class and “call” the decision of the classifier. In this example, macroaveraged precision is [10/(10 + 10) + 90/(10 + 90)]/2 = (0.5 + 0.9)/2 = 0.7. Microaveraged precision is 100/(100 + 20) ≈ 0.83.

class 1        truth: yes   truth: no
call: yes      10           10
call: no       10           970

class 2        truth: yes   truth: no
call: yes      90           10
call: no       10           890

pooled table   truth: yes   truth: no
call: yes      100          20
call: no       20           1860
Table 13.9: Text classification effectiveness numbers on Reuters-21578 for F1 (in percent). Results from Li and Yang (2003) (a), Joachims (1998) (b: kNN) and Dumais et al. (1998) (b: NB, Rocchio, trees, SVM).

(a)                         NB   Rocchio  kNN  SVM
micro-avg-L (90 classes)    80   85       86   89
macro-avg (90 classes)      47   59       60   60

(b)                         NB   Rocchio  kNN  trees  SVM
earn                        96   93       97   98     98
acq                         88   65       92   90     94
money-fx                    57   47       78   66     75
grain                       79   68       82   85     95
crude                       80   70       86   85     89
trade                       64   65       77   73     76
interest                    65   63       74   67     78
ship                        85   49       79   74     86
wheat                       70   69       77   93     92
corn                        65   48       78   92     90
micro-avg (top 10)          82   65       82   88     92
micro-avg-D (118 classes)   75   62       n/a  n/a    87
Rocchio and kNN. In addition, we give numbers for decision trees, an important classification method we do not cover. The bottom part of the table shows that there is considerable variation from class to class. For instance, NB beats kNN on ship, but is much worse on money-fx.

Comparing parts (a) and (b) of the table, one is struck by the degree to which the cited papers' results differ. This is partly due to the fact that the numbers in (b) are break-even scores (cf. page 161) averaged over 118 classes, whereas the numbers in (a) are true F1 scores (computed without any know-
Naive Bayes for Text Classification
NB performance
SVM seems to be better, but it costs more to train
kNN is better for some classes
In theory: SVM better than kNN better than NB
When the data is nice!
In practice, with errors in the data, data drifting, etc., it is very difficult to beat NB