Machine Learning
Bayesian Learning
Michael M. Richter, University of Calgary, 2010
Email: [email protected]
Topic
• This is concept learning the probabilistic way.
• That means everything that is stated is done in an exact way, but it is not always true.
• Instead, the learned concept is equipped with a probability of being correct.
History
• Bayesian Decision Theory came long before Version
Spaces, Decision Tree Learning and Neural Networks. It
was studied in the field of Statistical Theory and more
specifically, in the field of Pattern Recognition.
• Bayesian Decision Theory is at the basis of important
learning schemes such as the Naïve Bayes Classifier,
Learning Bayesian Belief Networks and the EM
Algorithm.
• Bayesian Decision Theory is also useful because it provides a
framework within which many non-Bayesian classifiers
can be studied (see [Mitchell, Sections 6.3–6.6]).
Why Bayesian Classification?
• Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
• Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Maximum Likelihood
• Suppose a number of hypotheses are generated and for
each one a probability of being the right one is calculated.
• The maximum likelihood principle says one should
choose the hypothesis h with the highest probability.
• P(h│D) is the a posteriori probability of h (after seeing the
data D).
• P(h) is the a priori probability of h.
• P(D│h) is the likelihood of D under h.
Part 1
The Naïve Bayesian Approach
Basic Formulas for Probabilities
• Product Rule: probability P(A, B) of a conjunction of two events A and B:
  P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• Sum Rule: probability of a disjunction of two events A and B:
  P(A ∨ B) = P(A) + P(B) − P(A, B)
• Theorem of Total Probability: if events A1, …, An are mutually exclusive with ∑i P(Ai) = 1, then
  P(B) = ∑i=1..n P(B | Ai) P(Ai)
A Basic Learning Scenario (1)
• Events:
  – Event Y = y: observed example event
  – Event Z = z: correctness of hypothesis „z“
  – D: data
• Bayes' rule:
  P(h | D) = P(h) · P(D | h) / P(D)
  P(Z = z | Y = y) = P(Z = z) · P(Y = y | Z = z) / P(Y = y)
• Reading of the terms: P(h | D) is the probability that h is a correct hypothesis for the data D, P(h) is the probability that h is a correct hypothesis, P(D | h) is the probability that D is observed if h is correct, and P(D) is the probability that D is observed.
A Basic Learning Scenario (2)
• Notation:
  – P(h) is the a-priori probability of h
  – P(D | h) is the likelihood of D under h
  – P(h | D) is the a-posteriori probability of h given D
• The basic theorem (Bayes' rule):
  P(h | D) = P(h) · P(D | h) / P(D)
• This theorem makes applications possible because it reduces the unknown conditional probability to ones that are known a priori.
A Basic Learning Scenario (3)
• Learner has hypotheses h1, …, hk and uses observed data D.
• Wanted: some h ∈ { h1, …, hk } for which P(h | D) is maximal
  (maximum-a-posteriori hypothesis).
• A posteriori means: after seeing the data.
• Background knowledge: a-priori probability P(h) of h.
• A priori means: before seeing the data.
Bayesian Classification and Decision (1)
• The Bayes decision rule selects the class with minimum conditional risk.
• In the case of minimum-error-rate classification, the rule selects the class with the maximum a posteriori probability.
• Suppose there are k classes, c1, c2, ..., ck.
• Given a feature vector x:
• The minimum-error-rate rule will assign it to the class cj if
  P(cj | x) > P(ci | x) for all i ≠ j.
Bayesian Classification and Decision (2)
• An equivalent but more useful criterion for minimum-error-
rate classification is:
• Choose class cj so that
  P(x | cj)P(cj) > P(x | ci)P(ci) for all i ≠ j
• This relies on Bayes' theorem.
• Note: No method can exist that finds the correct hypothesis with higher probability.
• But: That can change if one has additional knowledge.
Example
• Assume:
• (1) A lab test D for a form of cancer has 98% chance of
giving a positive result if the cancer is present, and 97%
chance of giving a negative result if the cancer is absent.
• (2) 0.8% of population has this cancer: P(cancer)=0.008
and P(~cancer)=0.992
• What is probability that the cancer is present for a positive
result?
• P(cancer | D) = P(D | cancer) P(cancer) / P(D)
  = 0.98 · 0.008 / (0.98 · 0.008 + 0.03 · 0.992) ≈ 0.21
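• A minimal sketch in Python that checks this computation (variable names are illustrative; the numbers are those stated above):

    # Posterior probability of cancer given a positive test result D
    p_cancer = 0.008            # P(cancer), prior from the slide
    p_pos_given_cancer = 0.98   # P(D | cancer), sensitivity
    p_pos_given_healthy = 0.03  # P(D | ~cancer), false-positive rate

    # Total probability of a positive result: P(D)
    p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

    # Bayes' rule: P(cancer | D)
    print(p_pos_given_cancer * p_cancer / p_pos)   # ~0.21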
MAP and ML
• Given some data D and a hypothesis space H, what is the most probable hypothesis h ∈ H, i.e., for which P(h | D) is maximal?
• This hypothesis is called the maximum a posteriori hypothesis hMAP:
  hMAP = argmax h∈H P(h | D) = argmax h∈H P(D | h) P(h)
• Again: hMAP is optimal in the sense that no method can exist that finds the correct hypothesis with higher probability.
• If P(h) = P(h’) for all h, h’ ∈ H, then this reduces to the maximum likelihood principle:
  hML = argmax h∈H P(D | h)
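• As an illustration, a hedged Python sketch that picks hMAP and hML from explicitly given priors and likelihoods (the hypotheses and numbers are made up, not from the slides):

    prior = {"h1": 0.8, "h2": 0.2}        # P(h)
    likelihood = {"h1": 0.2, "h2": 0.5}   # P(D | h) for the observed data D

    # MAP weighs in the prior, ML ignores it
    h_map = max(prior, key=lambda h: likelihood[h] * prior[h])   # -> "h1"
    h_ml = max(likelihood, key=likelihood.get)                   # -> "h2"
    print(h_map, h_ml)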
The Gibbs Classifier
• Bayes Optimal is optimal but expensive; it uses all
hypotheses in H. Non-optimal but much more efficient is
the GIBBS-classifier Algorithm:
• Given:
  – A sample S = {x1, …, xm} (⊆ D) of data, a hypothesis space H with a probability distribution P, and some x to be classified.
• Method:
  1. Select h ∈ H randomly according to P (that is similar to GA!)
  2. Output: h(x)
• Surprisingly: E(errorGibbs) ≤ 2 · E(errorBayesOptimal)
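• A small Python sketch of the Gibbs classifier, assuming the posterior P(h | D) over a handful of hypotheses is already known (hypotheses and weights are illustrative):

    import random

    hypotheses = [lambda x: x > 0, lambda x: x > 1, lambda x: x > 2]   # each h(x) returns a label
    posterior = [0.5, 0.3, 0.2]                                        # P(h | D) for each hypothesis

    def gibbs_classify(x):
        # Step 1: draw a single hypothesis h according to P(h | D)
        h = random.choices(hypotheses, weights=posterior, k=1)[0]
        # Step 2: output h(x)
        return h(x)

    print(gibbs_classify(1.5))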
The Naïve Bayesian Algorithm (1)
• Learning Scenario:
– Examples x1,...,xm, xi = (ai1,...,ain) for attributes A1, ..., An;
– Hypotheses H = { h1,..., hk } for classes
– Class of x is C(x)
• Two ways to proceed:
– 1) Using Bayes optimal classification
– 2) Do not access H for classification.
• Method 2) avoids having to survey all hypotheses in H, which is
often very difficult and impractical.
Estimation of Probabilities from Samples
• How do we estimate P(C)?
  – E.g. simple binomial estimation: count the # of instances with C = -1 and with C = +1.
• How do we estimate P(X1, …, XN | C)?
  – Count instances for P(X1, …, XN | C = +1) and for P(X1, …, XN | C = -1).
  – A very complex task!
• Example data (two classes -1/+1, N boolean attributes):
  X1 X2 … XN | C
   0  1 …  1 | +1
   1  0 …  1 | -1
   1  1 …  0 | +1
   …           …
   0  0 …  0 | +1
Conditional Independence
• Conditional independence is supposed to simplify the estimation task.
• Def.:
• (i) Y is independent of Z if for all y ∈ Y, z ∈ Z
P(Y = y, Z = z) = P(Y = y) · P(Z = z)
(ii) X is conditionally independent of Y given Z if P(X=x,Y=y | Z=z) = P(X=x|Z=z) P(Y=y|Z=z)
Another formulation:
P(X = x | Y = y, Z = z) = P(X = x | Z = z)
This reduces the complexity for n variables from O(2^n) in the product space to O(n)!
The Naïve Bayesian Algorithm (2)
• Given x = (a1,...,an): The (conditional) independence assumption says:
P(a1,… ,an| h) = P(a1| h)P(a2| h)… P(an| h)
• This assumption is what makes the classifier „naive“.
• It reduces the parameter estimation from the product space (which is O(2^n)) to the sum of the attribute spaces (which is O(n)).
• However, it is not always satisfied (e.g. thunder is not independent of rain).
• The goal is now to avoid the knowledge about P(h) for all h ∈ H.
The Naïve Bayesian Algorithm (3)
• Therefore we proceed:
• hMAP = the h ∈ { h1, …, hk } for which
  P(C(x) = h | x = (a1, …, an)) is maximal.
• Equivalent:
  P(x = (a1, …, an) | C(x) = h) · P(C(x) = h) is maximal.
• The probabilities on the right side are estimated from a given set S of examples. Without the independence assumption this would be impractical because S would need to be too large.
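• The following Python sketch puts the pieces together: the class prior and the per-attribute conditional probabilities are estimated by counting from a tiny, made-up sample S, and a new example is classified under the naive independence assumption (no smoothing; the data are illustrative):

    from collections import Counter, defaultdict

    # Training set S: boolean attribute vectors with their class
    data = [((0, 1, 1), "+1"), ((1, 0, 1), "-1"), ((1, 1, 0), "+1"), ((0, 0, 0), "+1")]

    class_counts = Counter(c for _, c in data)            # estimates P(C)
    attr_counts = defaultdict(Counter)                    # (class, attribute index) -> value counts
    for x, c in data:
        for i, a in enumerate(x):
            attr_counts[(c, i)][a] += 1

    def classify(x):
        best, best_score = None, -1.0
        for c, n_c in class_counts.items():
            score = n_c / len(data)                       # P(C = c)
            for i, a in enumerate(x):
                score *= attr_counts[(c, i)][a] / n_c     # P(Ai = a | C = c), naive assumption
            if score > best_score:
                best, best_score = c, score
        return best

    print(classify((1, 1, 1)))                            # -> "+1" for this toy data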
Example
• A naive Bayes classifier adopts the assumption of conditional
independence. Given:
P(pneumonia) = 0.01, P(flu) = 0.05
P(cough | pneumonia) = 0.9, P(fever | pneumonia) = 0.9,
P(chest-pain | pneumonia) = 0.8,
P(cough | flu) = 0.5, P(fever | flu) = 0.9,
P(chest-pain | flu) = 0.1
• Suppose a patient had cough, fever, but no chest pain. What is the
probability ratio between pneumonia and flu? What is the best
diagnosis?
• Solution:
  Probability ratio = (0.01 · 0.9 · 0.9 · (1 − 0.8)) / (0.05 · 0.5 · 0.9 · (1 − 0.1)) = 0.08
• So flu is at least ten times more likely than pneumonia.
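• The same ratio in a few lines of Python (values from the slide; the factor (1 − P(chest-pain | class)) accounts for the absent symptom):

    p_pneumonia = 0.01 * 0.9 * 0.9 * (1 - 0.8)   # prior * cough * fever * no chest pain
    p_flu       = 0.05 * 0.5 * 0.9 * (1 - 0.1)
    print(p_pneumonia / p_flu)                    # = 0.08, so flu is the better diagnosis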
Discussion (1)
• Advantages:
• Tends to work well despite strong assumption of
conditional independence.
• Experiments show it to be quite competitive with other
classification methods on standard UCI datasets.
• Although it does not produce accurate probability
estimates when its independence assumptions are
violated, it may still pick the correct maximum-probability
class in many cases.
– Able to learn conjunctive concepts in any case
Discussion (2)
• Disadvantages:
• Does not perform any search of the hypothesis
space. Directly constructs a hypothesis from parameter
estimates that are easily calculated from the training data.
– Strong bias
• Does not guarantee consistency with the training data.
• Typically handles noise well since it does not even focus
on completely fitting the training data.
Part 2
Belief Networks
Bayesian Belief Networks (1)
• Discussing the independence assumption:
• Positive: makes computation feasible
• Negative: it is often not satisfied.
• Reason: There are causal or influential relations between the attributes.
• Such relations are background knowledge.
• Idea: Make them visible in a graph.
• Conditional independence is now assumed only between subsets of variables.
• Belief networks combine both.
Bayesian Belief Networks (2)
• A Bayesian belief net (BBN) is a directed graph, together
with an associated set of probability tables.
• The nodes represent variables, which can be discrete or
continuous.
• The edges represent causal/influential relationships
between variables.
• Nodes not connected by edges are independent.
Causality (1)
• Although Bayesian networks are often used to represent
causal relationships, this need not be the case: a directed
edge from u to v does not require that Xv is causally
dependent on Xu.
• Example:
• The graphs:
A→B→C and C→B→A
• are equivalent: that is they impose exactly the same
conditional independence requirements.
Causality (2)
• A causal network is a Bayesian network with an explicit
requirement that the relationships be causal.
• The additional semantics of the causal networks specify
that if a node X is actively caused to be in a given state x
(an action written as do(X=x)), then the probability density
function changes to the one of the network obtained by
cutting the links from X's parents to X, and setting X to the
caused value x.
• Using these semantics, one can predict the impact of
external interventions from data obtained prior to
intervention.
Influence Diagrams
• The network can represent influence diagrams.
• Such diagrams are used to represent decision models.
• Therefore they are a method to support decision making.
Example (1)
• Network nodes: Temperature, Cloudiness, Winds, Rain, Umbrella.
• Node values:
  – Temperature: cold, mild, hot
  – Clouds: none, partial, covered
  – Winds: no, mild, strong
• Each node has an associated conditional probability table.
Example (2)
• Network nodes: Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire.
Each node is asserted to be conditionally independent of
its non-descendants, given its immediate parents
Associated with each
node is a conditional
probability table, which
specifies the conditional
distribution for the
variable given its
immediate parents in
the graph
Inference in Bayesian Networks (1)
• In general:
• Calculate conditional probabilities along the directed
edges.
• This can be done in a forward or backward mode.
• Example forward mode:
• Suppose we have the edge A → B; then we get
P(B) = P(B|A)P(A) + P(B|not A)P(not A)
and
P(not B) = P(not B|A)P(A) + P(not B|not A)P(not A)
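• A small sketch of this forward step in Python, with made-up table entries for the edge A → B:

    p_a = 0.3              # P(A)
    p_b_given_a = 0.9      # P(B | A)
    p_b_given_not_a = 0.2  # P(B | not A)

    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    print(p_b, 1 - p_b)    # P(B) and P(not B)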
Inference in Bayesian Networks (2)
• Suppose we want to calculate P(AB | E).
• Using P(A, B) = P(A | B) P(B) we get:
• P(AB | E) = P(A | E) * P(B | AE)
P(AB | E) = P(B | E) * P(A | BE)
• Therefore:
• P(A | BE) = ( P(A | E) * P(B | AE) ) / P(B | E)
(another version of Bayes' Theorem).
Example (1)
• Network nodes: Age, Income, House Owner, Voting Pattern, Newspaper Preference, Living Location.
• How likely are elderly rich people to buy the Sun?
  P(paper = Sun | Age > 60, Income > 60k)
Example (2)
• Same network nodes as before: Age, Income, House Owner, Voting Pattern, Newspaper Preference, Living Location.
• How likely are elderly rich people who voted liberal to buy the Herald?
  P(paper = Herald | Age > 60, Income > 60k, vote = liberal)
Unobserved Variables
• Bayesian networks can be used to answer probabilistic
queries about unobserved variables
• They can be used to find out updated knowledge of the
state of a subset of variables when other variables (the
evidence variables) are observed.
• This process of computing the posterior distribution of
variables given evidence is called probabilistic inference.
A Bayesian network can thus be considered a
mechanism for automatically applying Bayes’ theorem to
complex problems.
Inference in Bayesian Networks (3)
• In the network we can chain over several edges:
• Find the probability of H given that A1, A2, A3 and E have happened:
  P(H | A1A2A3E) = ( P(H | E) * P(A1A2A3 | HE) ) / P(A1A2A3 | E)
  where
  P(A1A2A3 | E) = P(A1 | A2A3E) * P(A2A3 | E) = P(A1 | A2A3E) * P(A2 | A3E) * P(A3 | E).
• With independence this simplifies. E.g. we get:
  P(H | A1A2E) = ( P(H | E) * P(A1 | HE) * P(A2 | HE) ) / ( P(A1 | E) * P(A2 | E) )
Recalculation (1)
• Consider the net in which A and B are parents of C (A → C ← B).
• Given probabilities:
  P(A) = 0.1    P(~A) = 0.9
  P(B) = 0.4    P(~B) = 0.6
• Conditional probability table for C:
  P(C | AB)   = 0.8    P(~C | AB)   = 0.2
  P(C | A~B)  = 0.6    P(~C | A~B)  = 0.4
  P(C | ~AB)  = 0.5    P(~C | ~AB)  = 0.5
  P(C | ~A~B) = 0.5    P(~C | ~A~B) = 0.5
Recalculation (2)
• Calculation of the probability of C:
  P(C) = P(CAB) + P(C~AB) + P(CA~B) + P(C~A~B)
       = P(C | AB) P(AB) + P(C | ~AB) P(~AB) + P(C | A~B) P(A~B) + P(C | ~A~B) P(~A~B)
       = P(C | AB) P(A) P(B) + P(C | ~AB) P(~A) P(B) + P(C | A~B) P(A) P(~B) + P(C | ~A~B) P(~A) P(~B)
       = 0.518
• Recalculation of P(A) and P(B), if we know that C is true, using Bayes' rule:
  P(B | C) = P(C | B) P(B) / P(C)
           = ( (P(C | AB) P(A) + P(C | ~AB) P(~A)) * P(B) ) / P(C)
           = ( (0.8 * 0.1 + 0.5 * 0.9) * 0.4 ) / 0.518 = 0.409
  P(A | C) = P(C | A) P(A) / P(C)
           = ( (P(C | AB) P(B) + P(C | A~B) P(~B)) * P(A) ) / P(C)
           = ( (0.8 * 0.4 + 0.6 * 0.6) * 0.1 ) / 0.518 = 0.131
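• A short Python sketch, assuming the net and tables given above, that reproduces these numbers:

    p_a, p_b = 0.1, 0.4
    p_c = {("A", "B"): 0.8, ("~A", "B"): 0.5, ("A", "~B"): 0.6, ("~A", "~B"): 0.5}  # P(C | parents)

    # P(C) by total probability over the four parent configurations
    prob_c = (p_c[("A", "B")] * p_a * p_b + p_c[("~A", "B")] * (1 - p_a) * p_b
              + p_c[("A", "~B")] * p_a * (1 - p_b) + p_c[("~A", "~B")] * (1 - p_a) * (1 - p_b))

    # Recalculated parent probabilities given that C is true (Bayes' rule)
    p_b_given_c = (p_c[("A", "B")] * p_a + p_c[("~A", "B")] * (1 - p_a)) * p_b / prob_c
    p_a_given_c = (p_c[("A", "B")] * p_b + p_c[("A", "~B")] * (1 - p_b)) * p_a / prob_c
    print(round(prob_c, 3), round(p_b_given_c, 3), round(p_a_given_c, 3))   # 0.518 0.409 0.131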
Complete and Incomplete Information
1. The network structure is given in advance and all the
variables are fully observable in the training examples.
==> Trivial Case: just estimate the conditional
probabilities.
2. The network structure is given in advance but only
some of the variables are observable in the training
data. ==> Similar to learning the weights for the hidden
units of a Neural Net: Gradient Ascent Procedure
3. The network structure is not known in advance. ==>
Use a heuristic search or constraint-based technique
to search through potential structures.
Parameter Learning
• In order to fully specify the Bayesian network and thus fully represent the joint probability distribution, it is necessary to specify for each node X the probability distribution for X conditional upon X's parents. The distribution of X conditional upon its parents may have any form.
Expectation Maximization:
Unobservable Relevant Variables.
• Example: Assume that data points have been generated uniformly from k distinct Gaussians with the same known variance.
• Problem: find a hypothesis h = <μ1, μ2, …, μk> that describes the means of each of the k distributions. In particular, we are looking for a maximum likelihood hypothesis for these means.
• We extend the problem description as follows: for each point xi there are k hidden variables zi1, …, zik such that zil = 1 if xi was generated by the l-th normal distribution and ziq = 0 for all q ≠ l.
EM Algorithm
• Initially: an arbitrary initial hypothesis h = <μ1, μ2, …, μk> is chosen.
• The EM Algorithm contains two steps:
  – Step 1 (Estimation, E): Calculate the expected value E[zij] of each hidden variable zij, assuming that the current hypothesis h = <μ1, μ2, …, μk> holds.
  – Step 2 (Maximization, M): Calculate a new maximum likelihood hypothesis h’ = <μ1’, μ2’, …, μk’>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated in step 1. Then replace the hypothesis h = <μ1, μ2, …, μk> by the new hypothesis h’ = <μ1’, μ2’, …, μk’> and iterate.
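• A compact Python sketch of this two-step loop for the Gaussian-means setting above; the data, the choice of k = 2 distributions, the common variance, and the simple deterministic initialization are illustrative assumptions:

    import math

    def em_means(xs, sigma=1.0, iters=50):
        # k = 2 Gaussians with a common known variance; arbitrary initial hypothesis <mu_1, mu_2>
        mus = [min(xs), max(xs)]
        for _ in range(iters):
            # E-step: expected values E[z_ij] (responsibilities) under the current means
            resp = []
            for x in xs:
                w = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
                s = sum(w)
                resp.append([wi / s for wi in w])
            # M-step: new maximum likelihood means, weighting each point by E[z_ij]
            mus = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
                   for j in range(len(mus))]
        return mus

    print(em_means([0.1, -0.2, 0.3, 4.9, 5.2, 5.1]))   # means end up near 0 and 5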
Problems and Limitations (1)
• A computational problem is exploring a previously
unknown network.
• To calculate the probability of any branch of the network,
all branches must be calculated.
• This process of network discovery is an NP-hard task
which might either be too costly to perform, or impossible
given the number and combination of variables.
Problems and Limitations (2)
• The network relies on the quality and coverage of the
prior beliefs (which is knowledge!!) used in the inference
processing.
• The network is only as useful as this background
knowledge is reliable.
• An expectation of the quality of these prior beliefs that is either too optimistic or too pessimistic will invalidate the results.
• Related to this is the selection of the statistical distribution
used in modeling the data. Selecting the proper
distribution model to describe the data has a notable
effect on the quality of the resulting network.
Dependency Networks
• Dependency networks are a generalization of, and an alternative to, Bayesian networks.
• They also have a graph component and a probability component, but the graph can be cyclic.
• The probability component is as in a Bayesian network.
Loops
• If belief propagation (BP) is used on graphs with loops, messages may circulate indefinitely.
• Empirically, a good approximation is still achievable
– Stop after fixed # of iterations
– Stop when no significant change in beliefs
– If solution is not oscillatory but converges, it usually is a good
approximation
Applications
• Bayesian learning is a standard method in many
application areas like
– Medicine (classification, prediction)
– Image retrieval and pattern recognition
– Quality control for material
• Some competitors are e.g.
– Support vector machines
– Clustering methods
Tools
• Hugin tool: Implements the propagation algorithm of Lauritzen and
Spiegelhalter.
• A more modern and powerful BBN tool is AgenaRisk. With this tool it is possible to perform fast propagation in large BBNs (with hundreds of nodes and millions of state combinations).
• GeNIe: http://www2.sis.pitt.edu/~genie/
• WinMine Toolkit,
http://research.microsoft.com/~dmax/winmine/tooldoc.htm
• Weka, Matlab
Summary
• Bayes theorem
• Bayesian decision
• Maximum a posteriori and maximum likelihood
• The naïve Bayesian method and conditional independence
• Gibbs classifier
• Belief nets and inference in nets and belief revision
• Estimating unknown parameters: EM algorithm
• Limitations
Some References (1)
• Bernardo, J. M. and Smith, A. F. M. (1994) Bayesian
Theory, New York: John Wiley.
• Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B.
(1995) Bayesian Data Analysis, London: Chapman &
Hall, ISBN 0-412-03991-5.
• Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
• David W. Aha: Machine Learning tools.
home.earthlink.net/~dwaha/research/machine-
learning.html
Some References (2)
• Heckerman, David: Tutorial on Learning with Bayesian Networks. In: Jordan, Michael Irwin (ed.), Learning in Graphical Models, Adaptive Computation and Machine Learning, MIT Press 1998, pp. 301-354.
• Borgelt, Christian; Kruse, Rudolf (2002): Graphical Models for Data Analysis and Mining. Chichester.
• Heckerman, D.; Chickering, D. M.; Meek, C.; Rounthwaite, R.; Kadie, C.: Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. Journal of Machine Learning Research, Vol. 1, 2000, pp. 49-75. http://research.microsoft.com/en-us/um/people/dmax/WinMine/Tutorial/Tutorial.html