Machine Learning
Bayesian Learning
Michael M. Richter, University of Calgary, 2010
Email: [email protected]
Topic
• This is concept learning the probabilistic way.
• That means everything that is stated is done in an exact way, but it is not always true.
• Instead, the learned concept is equipped with a probability of being correct.
History
• Bayesian Decision Theory came long before Version
Spaces, Decision Tree Learning and Neural Networks. It
was studied in the field of Statistical Theory and more
specifically, in the field of Pattern Recognition.
• Bayesian Decision Theory is at the basis of important
learning schemes such as the Naïve Bayes Classifier,
Learning Bayesian Belief Networks and the EM
Algorithm.
• Bayesian Decision Theory is also useful because it provides a
framework within which many non-Bayesian classifiers
can be studied (see [Mitchell, Sections 6.3–6.6]).
Why Bayesian Classification?
• Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
• Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Maximum Likelihood
• Suppose a number of hypotheses are generated and for
each one a probability of being the right one is calculated.
• The maximum likelihood principle says one should
choose the hypothesis h with the highest probability.
• P(h│D) is the a posteriori probability of h (after seeing the
data D).
• P(h) is the a priori probability of h.
• P(D│h) is the likelihood of D under h.
Part 1
The Naïve Bayesian Approach
Basic Formulas for Probabilities
• Product Rule: probability P(A, B) of a conjunction of two events A and B:
  P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• Sum Rule: probability of a disjunction of two events A and B:
  P(A ∨ B) = P(A) + P(B) − P(A, B)
• Theorem of Total Probability: if events A1, …, An are mutually exclusive with ∑i P(Ai) = 1, then
  P(B) = ∑i=1..n P(B | Ai) P(Ai)
A Basic Learning Scenario (1)
• Events:
  – Event Y = y: observed example event
  – Event Z = z: correctness of hypothesis „z“
  – D: data
• Bayes' rule:
  P(h | D) = P(h) · P(D | h) / P(D)
  P(Z = z | Y = y) = P(Z = z) · P(Y = y | Z = z) / P(Y = y)
• Reading of the terms: P(h | D) is the probability that h is a correct hypothesis for the data D, P(h) is the probability that h is a correct hypothesis, P(D | h) is the probability that D is observed if h is correct, and P(D) is the probability that D is observed.
A Basic Learning Scenario (2)
• Notation:
  – P(h) is the a-priori probability of h
  – P(D | h) is the likelihood of D under h
  – P(h | D) is the a-posteriori probability of h given D
• The basic theorem (Bayes' rule):
  P(h | D) = P(h) · P(D | h) / P(D)
• This theorem makes applications possible because it reduces the unknown conditional probability to ones that are known a priori.
A Basic Learning Scenario (3)
• Learner has hypotheses h1, …, hk and uses observed data D.
• Wanted: some h ∈ { h1, …, hk } for which P(h | D) is maximal
  (maximum-a-posteriori hypothesis).
• A posteriori means: after seeing the data.
• Background knowledge: a-priori probability P(h) of h.
• A priori means: before seeing the data.
Bayesian Classification and Decision (1)
• The Bayes decision rule selects the class with minimum conditional risk.
• In the case of minimum-error-rate classification, the rule selects the class with the maximum a posteriori probability.
• Suppose there are k classes, c1, c2, ..., ck.
• Given a feature vector x:
• The minimum-error-rate rule will assign it to the class cj if
  P(cj | x) > P(ci | x) for all i ≠ j.
Bayesian Classification and Decision (2)
• An equivalent but more useful criterion for minimum-error-
rate classification is:
• Choose class cj so that
  P(x | cj)P(cj) > P(x | ci)P(ci) for all i ≠ j
• This relies on Bayes' theorem.
• Note: No method can exist that finds the correct hypothesis with higher probability.
• But: That can change if one has additional knowledge.
Example
• Assume:
• (1) A lab test D for a form of cancer has 98% chance of
giving a positive result if the cancer is present, and 97%
chance of giving a negative result if the cancer is absent.
• (2) 0.8% of population has this cancer: P(cancer)=0.008
and P(~cancer)=0.992
• What is probability that the cancer is present for a positive
result?
• P(cancer | D) = P(D | cancer) P(cancer) / P(D)
  = 0.98 · 0.008 / (0.98 · 0.008 + 0.03 · 0.992) ≈ 0.21
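• A minimal sketch in Python that checks this computation (variable names are illustrative; the numbers are those stated above):

    # Posterior probability of cancer given a positive test result D
    p_cancer = 0.008            # P(cancer), prior from the slide
    p_pos_given_cancer = 0.98   # P(D | cancer), sensitivity
    p_pos_given_healthy = 0.03  # P(D | ~cancer), false-positive rate

    # Total probability of a positive result: P(D)
    p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

    # Bayes' rule: P(cancer | D)
    print(p_pos_given_cancer * p_cancer / p_pos)   # ~0.21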
MAP and ML
• Given some data D and a hypothesis space H, what is the most probable hypothesis h ∈ H, i.e., for which P(h | D) is maximal?
• This hypothesis is called the maximum a posteriori hypothesis hMAP:
  hMAP = argmax h∈H P(h | D) = argmax h∈H P(D | h) P(h)
• Again: hMAP is optimal in the sense that no method can exist that finds the correct hypothesis with higher probability.
• If P(h) = P(h’) for all h, h’ ∈ H, then this reduces to the maximum likelihood principle:
  hML = argmax h∈H P(D | h)
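• As an illustration, a hedged Python sketch that picks hMAP and hML from explicitly given priors and likelihoods (the hypotheses and numbers are made up, not from the slides):

    prior = {"h1": 0.8, "h2": 0.2}        # P(h)
    likelihood = {"h1": 0.2, "h2": 0.5}   # P(D | h) for the observed data D

    # MAP weighs in the prior, ML ignores it
    h_map = max(prior, key=lambda h: likelihood[h] * prior[h])   # -> "h1"
    h_ml = max(likelihood, key=likelihood.get)                   # -> "h2"
    print(h_map, h_ml)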
The Gibbs Classifier
• Bayes Optimal is optimal but expensive; it uses all
hypotheses in H. Non-optimal but much more efficient is
the GIBBS-classifier Algorithm:
• Given:
  – A sample S = {x1, …, xm} (⊆ D) of data, a hypothesis space H with a probability distribution P, and some x to be classified.
• Method:
  1. Select h ∈ H randomly according to P (that is similar to GA!)
  2. Output: h(x)
• Surprisingly: E(errorGibbs) ≤ 2 · E(errorBayesOptimal)
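• A small Python sketch of the Gibbs classifier, assuming the posterior P(h | D) over a handful of hypotheses is already known (hypotheses and weights are illustrative):

    import random

    hypotheses = [lambda x: x > 0, lambda x: x > 1, lambda x: x > 2]   # each h(x) returns a label
    posterior = [0.5, 0.3, 0.2]                                        # P(h | D) for each hypothesis

    def gibbs_classify(x):
        # Step 1: draw a single hypothesis h according to P(h | D)
        h = random.choices(hypotheses, weights=posterior, k=1)[0]
        # Step 2: output h(x)
        return h(x)

    print(gibbs_classify(1.5))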
The Naïve Bayesian Algorithm (1)
• Learning Scenario:
– Examples x1,...,xm, xi = (ai1,...,ain) for attributes A1, ..., An;
– Hypotheses H = { h1,..., hk } for classes
– Class of x is C(x)
• Two ways to proceed:
– 1) Using Bayes optimal classification
– 2) Do not access H for classification.
• Method 2) avoids having to survey all hypotheses in H, which is
often very difficult and impractical.
Estimation of Probabilities from Samples
• How do we estimate P(C)?
  – E.g. simple binomial estimation: count the # of instances with C = -1 and with C = +1.
• How do we estimate P(X1, …, XN | C)?
  – Count instances for P(X1, …, XN | C = +1) and for P(X1, …, XN | C = -1).
  – A very complex task!
• Example data (two classes -1/+1, N boolean attributes):
  X1 X2 … XN | C
   0  1 …  1 | +1
   1  0 …  1 | -1
   1  1 …  0 | +1
   …           …
   0  0 …  0 | +1
Conditional Independence
• Conditional independence is supposed to simplify the estimation task.
• Def.:
• (i) Y is independent of Z if for all y ∈ Y, z ∈ Z
P(Y = y, Z = z) = P(Y = y) · P(Z = z)
(ii) X is conditionally independent of Y given Z if P(X=x,Y=y | Z=z) = P(X=x|Z=z) P(Y=y|Z=z)
Another formulation:
P(X = x | Y = y, Z = z) = P(X = x | Z = z)
This reduces the complexity for n variables from O(2^n) in the product space to O(n)!
The Naïve Bayesian Algorithm (2)
• Given x = (a1,...,an): The (conditional) independence assumption says:
P(a1,… ,an| h) = P(a1| h)P(a2| h)… P(an| h)
• This assumption is what makes the classifier „naive“.
• It reduces the parameter estimation from the product space (which is O(2^n)) to the sum of the attribute spaces (which is O(n)).
• However, it is not always satisfied (e.g. thunder is not independent of rain).
• The goal is now to avoid the knowledge about P(h) for all h ∈ H.
The Naïve Bayesian Algorithm (3)
• Therefore we proceed:
• hMAP = the h ∈ { h1, …, hk } for which
  P(C(x) = h | x = (a1, …, an)) is maximal.
• Equivalent:
  P(x = (a1, …, an) | C(x) = h) · P(C(x) = h) is maximal.
• The probabilities on the right side are estimated from a given set S of examples. Without the independence assumption this would be impractical because S would need to be too large.
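• The following Python sketch puts the pieces together: the class prior and the per-attribute conditional probabilities are estimated by counting from a tiny, made-up sample S, and a new example is classified under the naive independence assumption (no smoothing; the data are illustrative):

    from collections import Counter, defaultdict

    # Training set S: boolean attribute vectors with their class
    data = [((0, 1, 1), "+1"), ((1, 0, 1), "-1"), ((1, 1, 0), "+1"), ((0, 0, 0), "+1")]

    class_counts = Counter(c for _, c in data)            # estimates P(C)
    attr_counts = defaultdict(Counter)                    # (class, attribute index) -> value counts
    for x, c in data:
        for i, a in enumerate(x):
            attr_counts[(c, i)][a] += 1

    def classify(x):
        best, best_score = None, -1.0
        for c, n_c in class_counts.items():
            score = n_c / len(data)                       # P(C = c)
            for i, a in enumerate(x):
                score *= attr_counts[(c, i)][a] / n_c     # P(Ai = a | C = c), naive assumption
            if score > best_score:
                best, best_score = c, score
        return best

    print(classify((1, 1, 1)))                            # -> "+1" for this toy data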
Example
• A naive Bayes classifier adopts the assumption of conditional
independence. Given:
P(pneumonia) = 0.01, P(flu) = 0.05
P(cough | pneumonia) = 0.9, P(fever | pneumonia) = 0.9,
P(chest-pain | pneumonia) = 0.8,
P(cough | flu) = 0.5, P(fever | flu) = 0.9,
P(chest-pain | flu) = 0.1
• Suppose a patient had cough, fever, but no chest pain. What is the
probability ratio between pneumonia and flu? What is the best
diagnosis?
• Solution:
  Probability ratio = (0.01 · 0.9 · 0.9 · (1 − 0.8)) / (0.05 · 0.5 · 0.9 · (1 − 0.1)) = 0.08
• So flu is at least ten times more likely than pneumonia.
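• The same ratio in a few lines of Python (values from the slide; the factor (1 − P(chest-pain | class)) accounts for the absent symptom):

    p_pneumonia = 0.01 * 0.9 * 0.9 * (1 - 0.8)   # prior * cough * fever * no chest pain
    p_flu       = 0.05 * 0.5 * 0.9 * (1 - 0.1)
    print(p_pneumonia / p_flu)                    # = 0.08, so flu is the better diagnosis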
Discussion (1)
• Advantages:
• Tends to work well despite strong assumption of
conditional independence.
• Experiments show it to be quite competitive with other
classification methods on standard UCI datasets.
• Although it does not produce accurate probability
estimates when its independence assumptions are
violated, it may still pick the correct maximum-probability
class in many cases.
– Able to learn conjunctive concepts in any case
Discussion (2)
• Disadvantages:
• Does not perform any search of the hypothesis
space. Directly constructs a hypothesis from parameter
estimates that are easily calculated from the training data.
– Strong bias
• Does not guarantee consistency with the training data.
• Typically handles noise well since it does not even focus
on completely fitting the training data.
Part 2
Belief Networks
Bayesian Belief Networks (1)
• Discussing the independence assumption:
• Positive: makes computation feasible
• Negative: it is often not satisfied.
• Reason: There are causal or influential relations between the attributes.
• Such relations are background knowledge.
• Idea: Make them visible in a graph.
• Conditional independence is now assumed only between subsets of variables.
• Belief networks combine both.
Bayesian Belief Networks (2)
• A Bayesian belief net (BBN) is a directed graph, together
with an associated set of probability tables.
• The nodes represent variables, which can be discrete or
continuous.
• The edges represent causal/influential relationships
between variables.
• Nodes not connected by edges are independent.
Causality (1)
• Although Bayesian networks are often used to represent
causal relationships, this need not be the case: a directed
edge from u to v does not require that Xv is causally
dependent on Xu.
• Example:
• The graphs:
A→B→C and C→B→A
• are equivalent: that is they impose exactly the same
conditional independence requirements.
Causality (2)
• A causal network is a Bayesian network with an explicit
requirement that the relationships be causal.
• The additional semantics of the causal networks specify
that if a node X is actively caused to be in a given state x
(an action written as do(X=x)), then the probability density
function changes to the one of the network obtained by
cutting the links from X's parents to X, and setting X to the
caused value x.
• Using these semantics, one can predict the impact of
external interventions from data obtained prior to
intervention.
Influence Diagrams
• The network can represent influence diagrams.
• Such diagrams are used to represent decision models.
• Therefore they are a method to support decision making.
Example (1)
• Network nodes: Temperature, Cloudiness, Winds, Rain, Umbrella.
• Node values:
  – Temperature: cold, mild, hot
  – Clouds: none, partial, covered
  – Winds: no, mild, strong
• Each node has an associated conditional probability table.
Example (2)
• Network nodes: Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire.
Each node is asserted to be conditionally independent of
its non-descendants, given its immediate parents
Associated with each
node is a conditional
probability table, which
specifies the conditional
distribution for the
variable given its
immediate parents in
the graph
Inference in Bayesian Networks (1)
• In general:
• Calculate conditional probabilities along the directed
edges.
• This can be done in a forward or backward mode.
• Example forward mode:
• Suppose we have the edge A → B; then we get
P(B) = P(B|A)P(A) + P(B|not A)P(not A)
and
P(not B) = P(not B|A)P(A) + P(not B|not A)P(not A)
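• A small sketch of this forward step in Python, with made-up table entries for the edge A → B:

    p_a = 0.3              # P(A)
    p_b_given_a = 0.9      # P(B | A)
    p_b_given_not_a = 0.2  # P(B | not A)

    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    print(p_b, 1 - p_b)    # P(B) and P(not B)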
Inference in Bayesian Networks (2)
• Suppose we want to calculate P(AB | E).
• Using P(A, B) = P(A | B) P(B) we get:
• P(AB | E) = P(A | E) * P(B | AE)
P(AB | E) = P(B | E) * P(A | BE)
• Therefore:
• P(A | BE) = ( P(A | E) * P(B | AE) ) / P(B | E)
(another version of Bayes' Theorem).
Example (1)
• Network nodes: Age, Income, House Owner, Voting Pattern, Newspaper Preference, Living Location.
• How likely are elderly rich people to buy the Sun?
  P(paper = Sun | Age > 60, Income > 60k)
Example (2)
• Same network nodes as before: Age, Income, House Owner, Voting Pattern, Newspaper Preference, Living Location.
• How likely are elderly rich people who voted liberal to buy the Herald?
  P(paper = Herald | Age > 60, Income > 60k, vote = liberal)
Unobserved Variables
• Bayesian networks can be used to answer probabilistic
queries about unobserved variables
• They can be used to find out updated knowledge of the
state of a subset of variables when other variables (the
evidence variables) are observed.
• This process of computing the posterior distribution of
variables given evidence is called probabilistic inference.
A Bayesian network can thus be considered a
mechanism for automatically applying Bayes’ theorem to
complex problems.
Inference in Bayesian Networks (3)
• In the network we can chain over several edges:
• Find the probability of H given that A1, A2, A3 and E have happened:
  P(H | A1A2A3E) = ( P(H | E) * P(A1A2A3 | HE) ) / P(A1A2A3 | E)
  where
  P(A1A2A3 | E) = P(A1 | A2A3E) * P(A2A3 | E) = P(A1 | A2A3E) * P(A2 | A3E) * P(A3 | E).
• With independence this simplifies. E.g. we get:
  P(H | A1A2E) = ( P(H | E) * P(A1 | HE) * P(A2 | HE) ) / ( P(A1 | E) * P(A2 | E) )
Recalculation (1)
• Consider the net in which A and B are parents of C (A → C ← B).
• Given probabilities:
  P(A) = 0.1    P(~A) = 0.9
  P(B) = 0.4    P(~B) = 0.6
• Conditional probability table for C:
  P(C | AB)   = 0.8    P(~C | AB)   = 0.2
  P(C | A~B)  = 0.6    P(~C | A~B)  = 0.4
  P(C | ~AB)  = 0.5    P(~C | ~AB)  = 0.5
  P(C | ~A~B) = 0.5    P(~C | ~A~B) = 0.5
Recalculation (2)
• Calculation of the probability of C:
  P(C) = P(CAB) + P(C~AB) + P(CA~B) + P(C~A~B)
       = P(C | AB) P(AB) + P(C | ~AB) P(~AB) + P(C | A~B) P(A~B) + P(C | ~A~B) P(~A~B)
       = P(C | AB) P(A) P(B) + P(C | ~AB) P(~A) P(B) + P(C | A~B) P(A) P(~B) + P(C | ~A~B) P(~A) P(~B)
       = 0.518
• Recalculation of P(A) and P(B), if we know that C is true, using Bayes' rule:
  P(B | C) = P(C | B) P(B) / P(C)
           = ( (P(C | AB) P(A) + P(C | ~AB) P(~A)) * P(B) ) / P(C)
           = ( (0.8 * 0.1 + 0.5 * 0.9) * 0.4 ) / 0.518 = 0.409
  P(A | C) = P(C | A) P(A) / P(C)
           = ( (P(C | AB) P(B) + P(C | A~B) P(~B)) * P(A) ) / P(C)
           = ( (0.8 * 0.4 + 0.6 * 0.6) * 0.1 ) / 0.518 = 0.131
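• A short Python sketch, assuming the net and tables given above, that reproduces these numbers:

    p_a, p_b = 0.1, 0.4
    p_c = {("A", "B"): 0.8, ("~A", "B"): 0.5, ("A", "~B"): 0.6, ("~A", "~B"): 0.5}  # P(C | parents)

    # P(C) by total probability over the four parent configurations
    prob_c = (p_c[("A", "B")] * p_a * p_b + p_c[("~A", "B")] * (1 - p_a) * p_b
              + p_c[("A", "~B")] * p_a * (1 - p_b) + p_c[("~A", "~B")] * (1 - p_a) * (1 - p_b))

    # Recalculated parent probabilities given that C is true (Bayes' rule)
    p_b_given_c = (p_c[("A", "B")] * p_a + p_c[("~A", "B")] * (1 - p_a)) * p_b / prob_c
    p_a_given_c = (p_c[("A", "B")] * p_b + p_c[("A", "~B")] * (1 - p_b)) * p_a / prob_c
    print(round(prob_c, 3), round(p_b_given_c, 3), round(p_a_given_c, 3))   # 0.518 0.409 0.131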
Complete and Incomplete Information
1. The network structure is given in advance and all the
variables are fully observable in the training examples.
==> Trivial Case: just estimate the conditional
probabilities.
2. The network structure is given in advance but only
some of the variables are observable in the training
data. ==> Similar to learning the weights for the hidden
units of a Neural Net: Gradient Ascent Procedure
3. The network structure is not known in advance. ==>
Use a heuristic search or constraint-based technique
to search through potential structures.
Parameter Learning
• In order to fully specify the Bayesian network and thus fully represent the joint probability distribution, it is necessary to specify for each node X the probability distribution for X conditional upon X's parents. The distribution of X conditional upon its parents may have any form.
Expectation Maximization:
Unobservable Relevant Variables.
• Example: Assume that data points have been generated uniformly from k distinct Gaussians with the same known variance.
• Problem: find a hypothesis h = <μ1, μ2, …, μk> that describes the means of each of the k distributions. In particular, we are looking for a maximum likelihood hypothesis for these means.
• We extend the problem description as follows: for each point xi there are k hidden variables zi1, …, zik such that zil = 1 if xi was generated by the l-th normal distribution and ziq = 0 for all q ≠ l.
EM Algorithm
• Initially: an arbitrary initial hypothesis h = <μ1, μ2, …, μk> is chosen.
• The EM Algorithm contains two steps:
  – Step 1 (Estimation, E): Calculate the expected value E[zij] of each hidden variable zij, assuming that the current hypothesis h = <μ1, μ2, …, μk> holds.
  – Step 2 (Maximization, M): Calculate a new maximum likelihood hypothesis h’ = <μ1’, μ2’, …, μk’>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated in step 1. Then replace the hypothesis h = <μ1, μ2, …, μk> by the new hypothesis h’ = <μ1’, μ2’, …, μk’> and iterate.
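• A compact Python sketch of this two-step loop for the Gaussian-means setting above; the data, the choice of k = 2 distributions, the common variance, and the simple deterministic initialization are illustrative assumptions:

    import math

    def em_means(xs, sigma=1.0, iters=50):
        # k = 2 Gaussians with a common known variance; arbitrary initial hypothesis <mu_1, mu_2>
        mus = [min(xs), max(xs)]
        for _ in range(iters):
            # E-step: expected values E[z_ij] (responsibilities) under the current means
            resp = []
            for x in xs:
                w = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
                s = sum(w)
                resp.append([wi / s for wi in w])
            # M-step: new maximum likelihood means, weighting each point by E[z_ij]
            mus = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
                   for j in range(len(mus))]
        return mus

    print(em_means([0.1, -0.2, 0.3, 4.9, 5.2, 5.1]))   # means end up near 0 and 5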
Problems and Limitations (1)
• A computational problem is exploring a previously
unknown network.
• To calculate the probability of any branch of the network,
all branches must be calculated.
• This process of network discovery is an NP-hard task
which might either be too costly to perform, or impossible
given the number and combination of variables.
Problems and Limitations (2)
• The network relies on the quality and coverage of the
prior beliefs (which is knowledge!!) used in the inference
processing.
• The network is only as useful as this background
knowledge is reliable.
• An expectation of the quality of these prior beliefs that is either too optimistic or too pessimistic will invalidate the results.
• Related to this is the selection of the statistical distribution
used in modeling the data. Selecting the proper
distribution model to describe the data has a notable
effect on the quality of the resulting network.
Dependency Networks
• Dependency networks are a generalization of, and an alternative to, Bayesian networks.
• They also have a graph component and a probability component, but the graph can be cyclic.
• The probability component is as in a Bayesian network.
Loops
• If belief propagation (BP) is used on graphs with loops, messages may circulate indefinitely.
• Empirically, a good approximation is still achievable
– Stop after fixed # of iterations
– Stop when no significant change in beliefs
– If solution is not oscillatory but converges, it usually is a good
approximation
Applications
• Bayesian learning is a standard method in many
application areas like
– Medicine (classification, prediction)
– Image retrieval and pattern recognition
– Quality control for material
• Some competitors are e.g.
– Support vector machines
– Clustering methods
Tools
• Hugin tool: Implements the propagation algorithm of Lauritzen and
Spiegelhalter.
• A more modern and powerful BBN tool is AgenaRisk. With this tool it is possible to perform fast propagation in large BBNs (with hundreds of nodes and millions of state combinations).
• GeNIe: http://www2.sis.pitt.edu/~genie/
• WinMine Toolkit,
http://research.microsoft.com/~dmax/winmine/tooldoc.htm
• Weka, Matlab
Summary
• Bayes theorem
• Bayesian decision
• Maximum a posteriori and maximum likelihood
• The naïve Bayesian method and conditional independence
• Gibbs classifier
• Belief nets and inference in nets and belief revision
• Estimating unknown parameters: EM algorithm
• Limitations
Some References (1)
• Bernardo, J. M. and Smith, A. F. M. (1994) Bayesian
Theory, New York: John Wiley.
• Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B.
(1995) Bayesian Data Analysis, London: Chapman &
Hall, ISBN 0-412-03991-5.
• Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
• David W. Aha: Machine Learning tools.
home.earthlink.net/~dwaha/research/machine-
learning.html
Some References (2)
• Heckerman, David: Tutorial on Learning with Bayesian Networks. In: Jordan, Michael Irwin (ed.), Learning in Graphical Models, Adaptive Computation and Machine Learning, MIT Press 1998, pp. 301-354.
• Borgelt, Christian; Kruse, Rudolf (2002): Graphical Models for Data Analysis and Mining. Chichester.
• Heckerman, D.; Chickering, D. M.; Meek, C.; Rounthwaite, R.; Kadie, C.: Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. Journal of Machine Learning Research, Vol. 1, 2000, pp. 49-75. http://research.microsoft.com/en-us/um/people/dmax/WinMine/Tutorial/Tutorial.html