Lecture 07 - Bayesian Learning - 1



Bayesian Classifiers

Bayesian classifiers are statistical classifiers based on Bayes' theorem.

They can calculate the probability that a given sample belongs to a particular class.


Bayesian learning algorithms are among the most practical approaches to certain types of learning problems.

In many cases their results are comparable to the performance of other classifiers, such as decision trees and neural networks.


Bayes' Theorem

Let X be a data sample, e.g. a red and round fruit.

Let H be some hypothesis, such as that X belongs to a specified class C (e.g. X is an apple).

For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X.


Prior & Posterior Probability

The probability P(H) is called the prior probability of H, i.e. the probability that any given data sample is an apple, regardless of how the data sample looks.

The probability P(H|X) is called the posterior probability. It is based on more information than the prior probability P(H), which is independent of X.


Bayes' Theorem

It provides a way of calculating the posterior probability:

P(H|X) = P(X|H) P(H) / P(X)

P(X|H) is the posterior probability of X given H (it is the probability that X is red and round, given that X is an apple).

P(X) is the prior probability of X (the probability that a data sample is red and round).
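As a quick numeric illustration (the three probabilities below are invented for the sketch, not taken from the lecture), the theorem can be applied directly:

```python
# A minimal sketch of Bayes' theorem; the input probabilities are invented.
p_h = 0.30           # P(H): prior probability that a fruit is an apple
p_x_given_h = 0.80   # P(X|H): probability an apple is red and round
p_x = 0.40           # P(X): probability that any fruit is red and round

# Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)   # ~0.6: about 60% of red and round fruits are apples
```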


Bayes' Theorem: Proof

The posterior probability of the fruit being an apple, given that its shape is round and its colour is red, is

P(H|X) = |H ∩ X| / |X|

i.e. the number of apples which are red and round, divided by the total number of red and round fruits.

Since P(H ∩ X) = |H ∩ X| / |total fruits of all sizes and shapes| and P(X) = |X| / |total fruits of all sizes and shapes|, we have

P(H|X) = P(H ∩ X) / P(X)


Similarly, P(X|H) = P(H ∩ X) / P(H).

Since we have P(H ∩ X) = P(H|X) P(X) and also P(H ∩ X) = P(X|H) P(H), therefore

P(H|X) P(X) = P(X|H) P(H)

and hence

P(H|X) = P(X|H) P(H) / P(X)
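The identity is easy to check numerically with counts; the basket sizes below are hypothetical:

```python
# Check P(H|X) = P(X|H) P(H) / P(X) on hypothetical fruit counts.
total = 100   # all fruits of all sizes and shapes
hx = 24       # |H ∩ X|: apples that are red and round
x = 40        # |X|: all red and round fruits
h = 30        # |H|: all apples

p_hx, p_x, p_h = hx / total, x / total, h / total
p_x_given_h = p_hx / p_h            # P(X|H) = P(H ∩ X) / P(H)

# Both routes to the posterior agree.
assert abs(p_hx / p_x - p_x_given_h * p_h / p_x) < 1e-12
print(p_hx / p_x)                   # 0.6
```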


Naïve (Simple) Bayesian Classification

Studies comparing classification algorithms have found that the simple Bayesian classifier is comparable in performance with decision tree and neural network classifiers.

It works as follows:

1. Each data sample is represented by an n-dimensional feature vector, X = (x1, x2, …, xn), depicting n measurements made on the sample from the n attributes A1, A2, …, An respectively.


2. Suppose that there are m classes C1, C2, …, Cm. Given an unknown data sample X (i.e. one having no class label), the classifier will predict that X belongs to the class having the highest posterior probability given X.

Thus, if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i, then X is assigned to Ci.

This is called the Bayes decision rule.


3. We have P(Ci|X) = P(X|Ci) P(Ci) / P(X).

As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be calculated.

The class prior probabilities may be estimated by

P(Ci) = si / s

where si is the number of training samples of class Ci and s is the total number of training samples.

If the class prior probabilities are equal (or are not known and thus assumed to be equal), then we need to calculate only P(X|Ci).
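A small sketch of steps 2 and 3 together (the class names, counts and likelihood values are illustrative, not from the lecture): estimate P(Ci) = si / s from label counts, then apply the Bayes decision rule.

```python
# Estimate class priors P(Ci) = s_i / s from (illustrative) label counts,
# then assign X to the class maximizing P(X|Ci) * P(Ci).
class_counts = {"apple": 30, "tomato": 70}            # s_i per class
s = sum(class_counts.values())                        # s, total samples
priors = {c: n / s for c, n in class_counts.items()}  # P(Ci) = s_i / s

# Assume these likelihoods P(X|Ci) were estimated elsewhere.
likelihoods = {"apple": 0.80, "tomato": 0.35}

# Bayes decision rule: argmax over classes of P(X|Ci) * P(Ci).
predicted = max(priors, key=lambda c: likelihoods[c] * priors[c])
print(predicted)  # tomato: 0.35 * 0.7 = 0.245 beats 0.80 * 0.3 = 0.240
```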


4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci).

For example, assuming the attributes colour and shape to be Boolean, we need to store 4 probabilities for the category apple:

P(red ∧ round | apple)
P(¬red ∧ round | apple)
P(red ∧ ¬round | apple)
P(¬red ∧ ¬round | apple)

If there are 6 attributes and they are Boolean, then we need to store 2^6 = 64 probabilities.


In order to reduce computation, the naïve assumption of class conditional independence is made.

This presumes that the values of the attributes are conditionally independent of one another, given the class label of the sample (we assume that there are no dependence relationships among the attributes).


Thus we assume that

P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci)

Example: P(colour ∧ shape | apple) = P(colour | apple) P(shape | apple)

For 6 Boolean attributes, we would have only 12 probabilities to store instead of 2^6 = 64.

Similarly, for 6 three-valued attributes, we would have 18 probabilities to store instead of 3^6 = 729.
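The savings can be tallied directly; the sketch below just counts parameters and shows the factorization (the two factor values are invented):

```python
from math import prod

n = 6  # number of attributes

# Full-joint versus naive parameter counts per class.
print(2 ** n, 2 * n)   # Boolean attributes: 64 vs 12
print(3 ** n, 3 * n)   # three-valued attributes: 729 vs 18

# The factorization itself: P(X|Ci) is a product of per-attribute terms,
# e.g. P(colour ∧ shape | apple) = P(colour|apple) * P(shape|apple).
p_colour_given_apple = 0.7   # invented values
p_shape_given_apple = 0.9
print(prod([p_colour_given_apple, p_shape_given_apple]))  # ~0.63
```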


The probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) can be estimated from the training samples as follows.

For an attribute Ak, which can take on the values x1k, x2k, … (e.g. colour = red, green, …),

P(xk|Ci) = sik / si

where sik is the number of training samples of class Ci having the value xk for Ak, and si is the number of training samples belonging to Ci.

e.g. P(red|apple) = 7/10, if 7 out of 10 apples are red.
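A sketch of this estimate on invented labeled samples, reproducing the P(red|apple) = 7/10 figure:

```python
# Estimate P(x_k | C_i) = s_ik / s_i from labeled samples.
# The ten (colour, class) pairs below are invented: 7 red apples,
# 3 green apples.
samples = [
    ("red", "apple"), ("red", "apple"), ("green", "apple"),
    ("red", "apple"), ("red", "apple"), ("green", "apple"),
    ("red", "apple"), ("red", "apple"), ("red", "apple"),
    ("green", "apple"),
]

s_i = sum(1 for _, label in samples if label == "apple")
s_ik = sum(1 for colour, label in samples
           if label == "apple" and colour == "red")
print(s_ik / s_i)  # 0.7, i.e. P(red|apple) = 7/10
```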


Example:

[The training data table shown on this slide is not included in the transcript: 14 labeled samples with attributes age, income, student and credit-rating, and class label buy_computer.]


Let C1 = the class buy_computer = yes, and C2 = the class buy_computer = no.

The unknown sample is X = {age <= 30, income = medium, student = yes, credit-rating = fair}.

The prior probability of each class can be computed as

P(buy_computer = yes) = 9/14 = 0.643
P(buy_computer = no) = 5/14 = 0.357


To compute P(X|Ci), we compute the following conditional probabilities:

[The table of conditional probabilities shown on this slide is not included in the transcript.]


Using the above probabilities we obtain P(X|C1) P(C1) and P(X|C2) P(C2), and hence the naïve Bayesian classifier predicts that the student will buy a computer, because P(X|C1) P(C1) > P(X|C2) P(C2). (The numeric values shown on this slide are not included in the transcript.)
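The training table and the intermediate conditional probabilities did not survive in this transcript. The priors above (9/14 and 5/14) match the well-known 14-sample buys_computer table from Han & Kamber, so the sketch below assumes that dataset in order to reproduce the prediction end to end; treat the rows as an assumption, not part of the original slides.

```python
from collections import Counter

# Assumed training data: the standard 14-sample buys_computer table
# (Han & Kamber), consistent with the 9/14 and 5/14 priors above.
# Columns: age, income, student, credit_rating, class.
data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")   # the unknown sample X

class_counts = Counter(row[-1] for row in data)   # {'yes': 9, 'no': 5}
scores = {}
for c, s_i in class_counts.items():
    score = s_i / len(data)                       # prior P(Ci)
    for k, value in enumerate(x):                 # naive product over attrs
        s_ik = sum(1 for row in data if row[-1] == c and row[k] == value)
        score *= s_ik / s_i                       # P(xk|Ci) = s_ik / s_i
    scores[c] = score

print(scores)                        # yes: ~0.0282, no: ~0.0069
print(max(scores, key=scores.get))   # 'yes': the student buys a computer
```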


An Example: Learning to Classify Text

- Instances (training samples) are text documents
- Classification labels can be: like/dislike, etc.
- The task is to learn from these training examples to predict the class of unseen documents

Design issue:

- How to represent a text document in terms of attribute values


One approach:

- The attributes are the word positions
- The value of an attribute is the word found in that position

Note that the number of attributes may be different for each document.

We calculate the prior probabilities of the classes from the training samples. The probability of each word occurring in each position is also calculated, e.g. P("The" in first position | like document).


Second approach:

The frequency with which a word occurs is counted, irrespective of the word's position.

Note that here also the number of attributes may be different for each document.

The probabilities of words are of the form, e.g., P("The" | like document).
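A minimal bag-of-words sketch of this second approach (the two toy documents and their labels are invented; add-one smoothing is included so that unseen words do not zero out the product, a detail the slides do not cover):

```python
from collections import Counter

# Second approach: count word frequencies per class, ignoring position.
# The two training documents and their labels are invented.
docs = [
    ("great film loved the acting", "like"),
    ("the film was dull and slow",  "dislike"),
]

word_counts = {"like": Counter(), "dislike": Counter()}
for text, label in docs:
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def p_word_given_class(word, label):
    # P(word | class) with add-one (Laplace) smoothing.
    counts = word_counts[label]
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))

print(p_word_given_class("the", "like"))   # e.g. P("the" | like document)
```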


    Results

An algorithm based on the second approach was applied to the problem of classifying newsgroup articles:

- 20 newsgroups were considered
- 1,000 articles from each newsgroup were collected (20,000 articles in total)
- The naïve Bayes algorithm was applied using two-thirds of these articles as training samples
- Testing was done over the remaining third


- Given 20 newsgroups, we would expect random guessing to achieve a classification accuracy of 5%
- The accuracy achieved by this program was 89%


    Minor Variant

The algorithm used only a subset of the words appearing in the documents:

- The 100 most frequent words were removed (these include words such as "the", "and" and "of")
- Any word occurring fewer than 3 times was also removed
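A sketch of that vocabulary pruning (the word counts are stand-ins, and the cut-off of 2 most-frequent words is scaled down from the lecture's 100 so the effect is visible on toy data):

```python
from collections import Counter

# Prune the vocabulary: drop the N most frequent words and any word
# occurring fewer than 3 times. N = 100 in the lecture; N = 2 here.
counts = Counter({"the": 50, "and": 40, "goal": 7, "team": 5, "ref": 2})
N = 2

most_frequent = {w for w, _ in counts.most_common(N)}
vocabulary = [w for w, n in counts.items()
              if n >= 3 and w not in most_frequent]
print(vocabulary)  # ['goal', 'team']: too-common and too-rare words dropped
```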


Reference

Chapter 6 of T. Mitchell, Machine Learning, McGraw-Hill, 1997.