
CMPT 882 - Machine Learning

Bayesian Learning
Lecture Scribe for Week 4

Jan 30th & Feb 4th

Stephen Fagan
sfagan@sfu.ca

Overview:

- Introduction
  - Who was Bayes?
  - Bayesian Statistics Versus Classical Statistics
  - Bayesian Learning (in a nutshell)
- Prerequisites From Probability Theory
  - Basic Terms
  - Basic Formulas
- Bayes' Theorem
  - Derivation
  - Significance
  - Example
- Naive Bayes Classifier
- Bayesian Belief Networks
  - Conditional Independence
  - Representation
  - Inference
  - Causation
  - A Brief History of Causation


Introduction:

Who was Bayes?

"It is impossible to understand a man’s work unlessyou understand something of his character and unless you

understand something of his environment."

Thomas Bayes? (possible, but not probable)

Reverend Thomas Bayes (1702-1761) was an English theologian and mathematician. Motivated by his religious beliefs, he proposed the well known argument for the existence of God known as the argument by design. Basically, the argument is: without assuming the existence of God, the operation of the universe is extremely unlikely; therefore, since the operation of the universe is a fact, it is very likely that God exists. To back up this argument, Bayes produced a general mathematical theory which introduced probabilistic inferences (a method for calculating the probability that an event will occur in the future from the frequency with which it has occurred in prior trials). Central to this theory was a theorem, now known as Bayes' Theorem (1764), which states that one's evidence confirms the likelihood of an hypothesis only to the degree that the appearance of this evidence would be more probable with the assumption of the hypothesis than without it (see below for its formal statement).

Bayesian Statistics Versus Classical Statistics:

The central difference between Bayesian and Classical statistics is that in Bayesian statistics we assume that we know the probability of any event (before any calculations) and the classical statistician does not. The probabilities that the Bayesian assumes we know are called prior probabilities.

Bayesian Learning (in a nutshell):

As Bayesians, we assume that we have a prior probability distribution for all events. This gives us a quantitative method to weight the evidence that we come across during learning. Such methods allow us to construct a more detailed ranking of the alternative hypotheses than if we were only concerned with the consistency of the hypotheses with the evidence (though, as we will see, consistency-based learning is a subclass of Bayesian learning). As a result, Bayesian methods provide practical learning algorithms (though they require prior probabilities - see below) such as naive Bayes learning and Bayesian belief network learning. In addition to this, Bayesian methods are thought to provide a useful conceptual framework with which we can get a standard for evaluating other learning algorithms.

Prerequisites From Probability Theory

In order to confidently use Bayesian learning methods, we will need to be familiar with a few basic terms and formulas from Probability Theory.

Basic Terms

a. Random Variable

Since our concern is with machine learning, we can think of random variables as being like attributes which can take various values.

e.g. Weather ∈ {sunny, rain, cloudy, snow}

b. Domain

This is the set of possible values that a random variable can take. It could be finite or infinite.

c. Probability Distribution

This is a mapping from a domain (see above) to values in [0,1]. When the domain has a finite or countably infinite number of distinct elements, the sum of all of the probabilities given by the probability distribution equals 1.

e.g. P(Weather) = ⟨0.7, 0.2, 0.08, 0.02⟩

so, P(Weather = sunny) = 0.7, and so on.

d. Event

Each assignment of a domain value to a random variable is called an event.

e.g. Weather = rain

Basic Formulas

a. Conditional Probability

This formula allows you to calculate the probability of an event A given that event B is assumed to have been obtained. This probability is denoted by P(A|B):

P(A|B) = P(A ∧ B) / P(B)

b. Product Rule

This rule, derived from a, gives the probability of a conjunction of events A and B:

P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)

c. Sum Rule

This rule gives the probability of the disjunction of events A and B:

P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

d. Theorem of Total Probability

If events A1, ..., An are mutually exclusive with Σ_{i=1}^n P(Ai) = 1, then

P(B) = Σ_{i=1}^n P(B|Ai) P(Ai).
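The basic formulas can be checked on a small example. Below is a minimal sketch using a joint distribution over two binary events A and B stored as a dict; all of the numbers are hypothetical.

```python
joint = {  # P(A=a ∧ B=b)
    (True, True): 0.2,
    (True, False): 0.3,
    (False, True): 0.1,
    (False, False): 0.4,
}

def p_a(a):  # marginal probability P(A=a)
    return sum(p for (x, _), p in joint.items() if x == a)

def p_b(b):  # marginal probability P(B=b)
    return sum(p for (_, y), p in joint.items() if y == b)

def p_a_given_b(a, b):  # conditional probability: P(A|B) = P(A ∧ B) / P(B)
    return joint[(a, b)] / p_b(b)

def p_b_given_a(b, a):  # P(B|A) = P(A ∧ B) / P(A)
    return joint[(a, b)] / p_a(a)

# Product rule: P(A ∧ B) = P(A|B) P(B)
assert abs(joint[(True, True)] - p_a_given_b(True, True) * p_b(True)) < 1e-12

# Sum rule: P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
p_a_or_b = p_a(True) + p_b(True) - joint[(True, True)]

# Theorem of total probability: P(B) = Σ_i P(B|A_i) P(A_i),
# with the mutually exclusive events A and ¬A.
p_b_total = sum(p_b_given_a(True, a) * p_a(a) for a in (True, False))

print(round(p_a_or_b, 10), round(p_b_total, 10))  # 0.6 0.3
```

Note that the theorem of total probability recovers exactly the marginal P(B) computed directly from the joint distribution.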

Bayes' Theorem

The informal statement of Bayes' Theorem is that one's evidence confirms the likelihood of an hypothesis only to the degree that the appearance of this evidence would be more probable with the assumption of the hypothesis than without it. Formally, in the special case for machine learning we get:

P(h|D) = P(D|h) P(h) / P(D)

where:

- D is a set of training data.
- h is a hypothesis.
- P(h|D) is the posterior probability, i.e. the conditional probability of h after the training data (evidence) is presented.
- P(h) is the prior probability of hypothesis h. This non-classical quantity is often found by looking at data from the past (or in the training data).
- P(D) is the prior probability of the training data D. This quantity is often a constant value, P(D) = P(D|h) P(h) + P(D|¬h) P(¬h), which can be computed easily when we insist that P(h|D) and P(¬h|D) sum to 1.
- P(D|h) is the probability of D given h, and is called the likelihood. This quantity is often easy to calculate since we sometimes assign it the value 1 when D and h are consistent, and assign it 0 when they are inconsistent.

It should be noted that Bayes' Theorem is completely general and can be applied to any situation where one wants to calculate a conditional probability and one has knowledge of prior probabilities. Its generality is demonstrated through its derivation, which is very simple.

To help our intuitive understanding of Bayes' theorem, consider the example where we see some clouds in the sky and we are wondering what the chances of rain are. That is, we want to know P(rain|clouds). By Bayes' Theorem, we know that this is equal to

P(clouds|rain) P(rain) / P(clouds).

Here are some properties of Bayes' theorem which make this formula more intuitive:

- The more likely P(clouds|rain) is, the more likely P(rain|clouds) is.
- If P(clouds|rain) = 0, then P(rain|clouds) = 0. If we take all of the probabilities to be 0 or 1, then we get the propositional calculus.
- Bayes' theorem is only usable when P(clouds) > 0. However, there is research about extending Bayes' theorem to handle cases like P(clouds) = 0 (e.g. belief revision).
- The more likely P(rain) is, the more likely P(rain|clouds) is.
- If P(clouds) = 1, then P(rain|clouds) = P(rain).
- The more surprising your evidence (the smaller P(clouds) is), the larger its effect (the larger P(rain|clouds) is).

Derivation of Bayes' Theorem

The derivation of this famous theorem is quite trivial. It is short and only uses the definition of conditional probability and the commutativity of conjunction.

P(D|h) P(h) / P(D) = [P(D ∧ h) / P(h)] · P(h) · [1 / P(D)]
                   = P(D ∧ h) / P(D)
                   = P(h ∧ D) / P(D)
                   = P(h|D)

Despite this formal simplicity, Bayes' Theorem is still considered an important result.

Significance

Bayes' Theorem is important for several reasons:

1. Bayesians regard the theorem as a rule for updating beliefs in response to new evidence.

2. The posterior probability, P(h|D), is a quantity that people find hard to assess (they are more used to calculating P(D|h)). The theorem expresses this quantity in terms that are more accessible.

3. It forms the basis for some practical learning algorithms (see below).

The general Bayesian learning strategy is:

1. Start with your prior probabilities, P(H).
2. Use data D to form P(H|D).
3. Adopt the most likely hypothesis given P(H|D).

Page 6: CMPT 882 -Machine Learning Bayesian Learningoschulte/fagan.pdf · CMPT 882 -Machine Learning Bayesian Learning ... Stephen Fagan sfagan@sfu.ca Overview: Introduction ... has a finite

Bayes' Theorem is used to choose the hypothesis that has the highest probability of being correct, given some set of training data. We call such an hypothesis a maximum a posteriori (MAP) hypothesis, and denote it by hMAP.

hMAP = argmax_{h ∈ H} P(h|D)
     = argmax_{h ∈ H} P(D|h) P(h) / P(D)
     = argmax_{h ∈ H} P(D|h) P(h)

Notice that P(D) is omitted from the denominator in the last line because it is essentially a constant with respect to the class of hypotheses, H. If all of the hypotheses have the same prior probability, (∀ i, j) P(hi) = P(hj), then we define the maximum likelihood (ML) hypothesis as:

hML = argmax_{hi ∈ H} P(D|hi).
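The MAP and ML selections amount to a brute-force argmax over the hypothesis space. Here is a minimal sketch; the hypothesis names, priors P(h), and likelihoods P(D|h) are hypothetical placeholders.

```python
priors = {"h1": 0.7, "h2": 0.2, "h3": 0.1}        # P(h), hypothetical
likelihoods = {"h1": 0.05, "h2": 0.6, "h3": 0.9}  # P(D|h), hypothetical

# MAP hypothesis: argmax_h P(D|h) P(h); P(D) is constant over H and dropped.
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

# ML hypothesis: argmax_h P(D|h), i.e. what MAP reduces to under a uniform prior.
h_ml = max(priors, key=lambda h: likelihoods[h])

print(h_map, h_ml)  # h2 h3
```

Note that the two answers differ here: the strong prior on h1 is not enough to save its tiny likelihood, while the prior pulls the MAP choice from h3 to h2.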

Example

Given:

P(cancer) = 0.008          P(¬cancer) = 0.992
P(+|cancer) = 0.98         P(−|cancer) = 0.02
P(+|¬cancer) = 0.03        P(−|¬cancer) = 0.97

where + and − represent positive and negative cancer-test results, respectively.

Question: What is the probability that I have cancer given that my cancer-test result is positive? (i.e. what is P(cancer|+)?)

Calculation: By Bayes' Theorem,

P(cancer|+) = P(+|cancer) P(cancer) / P(+).

We know P(+|cancer) and P(cancer), but we must calculate P(+) as follows:

P(+) = P(+ ∧ cancer) + P(+ ∧ ¬cancer)
     = P(+|cancer) P(cancer) + P(+|¬cancer) P(¬cancer)

So,

P(cancer|+) = P(+|cancer) P(cancer) / P(+)
            = P(+|cancer) P(cancer) / [P(+|cancer) P(cancer) + P(+|¬cancer) P(¬cancer)]
            = (0.98 × 0.008) / (0.98 × 0.008 + 0.03 × 0.992)
            = 0.0078 / (0.0078 + 0.0298)
            ≈ 0.21

We see that P(cancer|+) is still less than 1/2; however, this does not imply any particular action. For different people, actions will vary (even with the same information) depending on their subjective utility judgements.

Another thing to notice is that this value, P(cancer|+) ≈ 0.21, is often confused (even by doctors) with P(+|cancer) = 0.98.
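The cancer calculation above can be checked numerically in a few lines:

```python
# Quantities from the example's table.
p_cancer = 0.008            # P(cancer)
p_pos_given_cancer = 0.98   # P(+|cancer)
p_pos_given_not = 0.03      # P(+|¬cancer)

# Total probability: P(+) = P(+|cancer)P(cancer) + P(+|¬cancer)P(¬cancer)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_not * (1 - p_cancer)

# Bayes' theorem: P(cancer|+) = P(+|cancer)P(cancer) / P(+)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(round(p_cancer_given_pos, 2))  # 0.21
```

Despite the test's 98% sensitivity, the low prior P(cancer) keeps the posterior near 0.21, which is exactly the confusion the example warns about.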

Naive Bayes Classifier

The naive Bayes classifier is a highly practical Bayesian learning method that can be used when:

1. the amount of training data is moderate or large (so that the frequency of the events in the data accurately reflects their probability of occurring outside of the training data), and

2. the attribute values that describe instances are independent given the classification (see below). That is, given the target value (i.e. classification), v, of an instance that has attributes a1, a2, ..., an,

P(a1 ∧ a2 ∧ ... ∧ an | v) = Π_i P(ai|v).

Here is what the naive Bayes classifier does:

- Let x be an instance described by a conjunction of attribute values and let f(x) be a target function whose range is some finite set V (representing the classes).
- The learner is provided with a set of training examples of the target function and is then asked to classify (i.e. predict the target value of) a new instance which is described by a tuple of attributes, ⟨a1, a2, ..., an⟩.
- The learner assigns to the new instance the most probable target value, vMAP, given the attribute values that describe it, where


vMAP = argmax_{vj ∈ V} P(vj|a1, a2, ..., an)

     = argmax_{vj ∈ V} P(a1, a2, ..., an|vj) P(vj) / P(a1, a2, ..., an)
       (by Bayes' theorem)

     = argmax_{vj ∈ V} P(a1, a2, ..., an|vj) P(vj)
       (because P(a1, a2, ..., an) is a constant given the instance)

     = argmax_{vj ∈ V} P(vj) Π_i P(ai|vj)
       (by the assumption of conditional independence of the attributes given the target value)

     = vNB
       (denoting the target value outputted by the naive Bayes classifier)

To calculate this value, the learner first estimates the P(vj) values from the training data by simply counting the frequency with which each vj occurs in the data. The P(ai|vj) values are then calculated by counting the frequency with which ai occurs in the training examples that get the target value vj. Thus, the importance of having a large training set is due to the fact that it determines the accuracy of these critical values.

If the number of attribute values is n and the number of distinct target values is k, then the learner only needs to calculate n × k such P(ai|vj) values. Computationally, this is very cheap compared with the number of P(a1, a2, ..., an|vj) values that would have to be calculated if we did not have the assumption of conditional independence.

Another interesting thing to note about the naive Bayes learning method is that it doesn't perform an explicit search through the hypothesis space. Instead it merely counts the frequency of various data combinations in the training set to calculate probabilities.

For examples of the naive Bayes classifier, see section 6.9.1 and section 6.10 of the text.
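The counting procedure described above can be sketched in a few lines of Python. The tiny weather-style dataset (attributes: outlook, wind; target: play) is hypothetical and only illustrates the mechanics; a real application would also want smoothing for zero counts.

```python
from collections import Counter, defaultdict

data = [  # (attribute tuple, target value) -- made-up training examples
    (("sunny", "weak"), "no"),
    (("sunny", "strong"), "no"),
    (("rain", "weak"), "yes"),
    (("cloudy", "weak"), "yes"),
    (("rain", "strong"), "no"),
    (("cloudy", "strong"), "yes"),
]

# Estimate P(v) and P(a_i|v) by counting frequencies in the training data.
class_counts = Counter(v for _, v in data)
attr_counts = defaultdict(Counter)  # attr_counts[(i, v)][a] = #(a_i = a, target = v)
for attrs, v in data:
    for i, a in enumerate(attrs):
        attr_counts[(i, v)][a] += 1

def predict(attrs):
    # v_NB = argmax_v P(v) Π_i P(a_i|v)
    def score(v):
        p = class_counts[v] / len(data)  # estimate of P(v)
        for i, a in enumerate(attrs):
            p *= attr_counts[(i, v)][a] / class_counts[v]  # estimate of P(a_i|v)
        return p
    return max(class_counts, key=score)

print(predict(("rain", "weak")))  # yes
```

As the text notes, there is no explicit search through a hypothesis space here: training is nothing more than filling in the two count tables.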

Bayesian Belief Networks

In many cases, the condition of complete conditional independence cannot be met, and so the naive Bayes classifier will not learn successfully. However, as we have seen, to remove this condition completely is computationally very expensive: we would have to find a number of conditional probabilities equal to the number of instances times the number of target values (as opposed to merely n × k). Bayesian belief networks (aka Bayes nets, belief nets, probability nets, causal nets) offer us a compromise:

"A Bayesian belief network describes the probability distribution governing a set of variables by specifying a set of conditional independence assumptions along with a set of conditional probabilities." (page 184 of text)


To summarize, Bayes' nets provide compact representations of joint probability distributions in systems with many independencies (but some dependencies).

In addition to the information contained in the training data, Bayesian nets allow us to incorporate any prior knowledge we have about the dependencies (and independencies) among the variables. This method of stating conditional independencies that apply to subsets of the variables is less constraining than the global assumption of conditional independence made by the naive Bayes classifier.

Conditional Independence

Let X, Y, and Z be three discrete-valued random variables where each can take on values from the domains V(X), V(Y), and V(Z), respectively. We say that X is conditionally independent of Y given Z provided

(∀ xi, yj, zk) P(X = xi | Y = yj ∧ Z = zk) = P(X = xi | Z = zk)

where xi ∈ V(X), yj ∈ V(Y), and zk ∈ V(Z). This expression is abbreviated as P(X|Y ∧ Z) = P(X|Z). This definition easily extends to sets of variables as well (see text, page 185).

Representation

Bayesian belief networks are graphically represented by a directed acyclic graph and associated probability matrices which describe the prior and conditional probabilities of the variables. In the graph (network), each variable is represented by a node. The directed arcs between the nodes indicate that each variable is conditionally independent of its non-descendants given its immediate predecessors in the network. X is a descendant of Y if there is a directed path from Y to X. Associated with every node is a probability matrix which describes the probability distribution for that variable given the values of its immediate predecessors.

Example of a Bayesian Belief Network


from http://www.gpfn.sk.ca/~daryle/papers/bayesian_networks/bayes.html

For the above graph, the probability matrix for the Alarm node given the events of Earthquake and Burglary might look like this:

Earthquake   Burglary   P(A|V(E) ∧ V(B))   P(¬A|V(E) ∧ V(B))
yes          yes        0.90               0.10
yes          no         0.20               0.80
no           yes        0.90               0.10
no           no         0.01               0.99

We can use the information in such tables to calculate the probability for any desired assignment ⟨y1, ..., yn⟩ to the tuple of network variables ⟨Y1, ..., Yn⟩ using

P(y1, ..., yn) = Π_{i=1}^n P(yi | Parents(Yi)).
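This factorization can be sketched for an alarm-style network with the Alarm conditional probability table given above. The priors for Earthquake and Burglary are hypothetical numbers chosen for illustration only.

```python
p_earthquake = {"yes": 0.01, "no": 0.99}  # P(E), hypothetical
p_burglary = {"yes": 0.02, "no": 0.98}    # P(B), hypothetical
p_alarm_yes = {  # P(Alarm=yes | Earthquake, Burglary), per the table above
    ("yes", "yes"): 0.90,
    ("yes", "no"): 0.20,
    ("no", "yes"): 0.90,
    ("no", "no"): 0.01,
}

def joint(e, b, a):
    """P(E=e ∧ B=b ∧ A=a) = P(e) P(b) P(a|e,b), the network factorization."""
    p_a = p_alarm_yes[(e, b)] if a == "yes" else 1 - p_alarm_yes[(e, b)]
    return p_earthquake[e] * p_burglary[b] * p_a

# Sanity check: the joint distribution over all 8 assignments sums to 1.
total = sum(joint(e, b, a)
            for e in ("yes", "no")
            for b in ("yes", "no")
            for a in ("yes", "no"))
print(round(total, 10))  # 1.0
```

Note how only 2 + 2 + 4 numbers specify all 8 joint probabilities; the savings grow rapidly in larger networks with sparse dependencies.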

Inference

Given a specified Bayes network, we may want to infer the value of a specific variable given the observed values of the other variables. Since we are dealing with probabilities, we will likely not get a specific value. Instead we will calculate a probability distribution for the variable and then output the most probable value(s).

In the above example, we were wondering whether an apple tree is sick given that it is losing its leaves. The result is that chances are slightly in favor of the tree not being sick. This calculation is straightforward when the values of all of the other nodes are known, but when only a subset of the variables is known the problem becomes NP-hard. There is a lot of research being done on methods of probabilistic inference in Bayesian nets, as well as on devising effective algorithms for learning Bayesian networks from training data.


Causation

Another reason for the popularity of Bayesian belief networks is that they are thought to be a convenient way to represent causal knowledge. We already know that the arrows in a Bayes net indicate that the variables are conditionally independent of their non-descendants given their immediate predecessors in the network. This statement is known as the Markov condition. What does this condition imply about causation? Consider the following simple graph:

This graph might represent the fact that the switch causes the light to be on. However, the issue is not this simple. Since the graph only represents statistical information, we might infer that there is some kind of causal relationship, but we don't know the details of such a relationship. For example, we consistently find that when we turn the switch on, the light comes on; however, with similar consistency we find that when the light is on, so is the switch. More specifically, there are two issues concerning causation in Bayesian nets:

1. Are there unobserved common causes? That is, are there other variables, not represented in the network, that are causally related to both a node and that node's parent? For example, consider the following two graphs:

In the first (two-node) graph, the fact that smoking causes cancer is represented. In the second graph, a gene which causes both smoking and cancer is represented. If such a common cause existed, but was not represented in our Bayesian network, then our inferences from the network would likely be inaccurate. In cases where we are unsure whether there is an unobserved common cause, we have two options:

- Fisher suggested that controlled (randomized) experiments would help to uncover unobserved common causes. However, sometimes such experiments are impossible due to ethical constraints or where data is uncontrollable.
- Or, we could assume there are no unobserved common causes (as long as inferences appear accurate).

Question: Are there ways, other than controlled experiments, to determine if there might be common causes? (see work from CMU and by Pearl)

2. What way do the causal relationships go? That is, on the assumption that there are no unobservable common causes, how do we determine what is a cause and what is an effect? For example, given our training data, can we distinguish between the following two graphs:

Graph A          Graph B

Notation: Let A ⊥ B denote that A is independent of B.

Yes, we can tell the difference using the independence relation. In Graph A we find that (sick ⊥ dry), but in Graph B we find that (sick ⊥ dry | loses). So in Graph A, if we alter P(sick) then dry wouldn't change, but in Graph B if we change P(sick), then dry would change. Essentially, we want to determine whether or not P(sick|dry) = P(sick) holds. To do so we could look at our data and determine whether the percentage of dry trees that are sick is the same as the percentage of non-dry trees that are sick.
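The percentage comparison just described can be sketched as a frequency check on observed records. The (sick, dry) observations below are made-up data purely for illustration; a real test would also need to account for sampling noise (e.g. with a chi-squared test).

```python
records = [  # (sick, dry) observations -- hypothetical tree records
    (True, True), (True, False), (False, True), (False, False),
    (True, True), (False, False), (False, True), (False, False),
]

# Estimate P(sick) from all records.
p_sick = sum(s for s, _ in records) / len(records)

# Estimate P(sick|dry) from the dry records only.
dry_records = [s for s, d in records if d]
p_sick_given_dry = sum(dry_records) / len(dry_records)

# If sick ⊥ dry, the two estimates should agree (up to sampling noise);
# a clear gap suggests a dependence between sick and dry.
print(p_sick, p_sick_given_dry)
```

On this toy sample the two estimates differ (0.375 versus 0.5), which with enough data would count as evidence against (sick ⊥ dry).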


A Brief History of Causation

18th Century, Philosophy, Hume: - Causality isn’t real; we can’t see it.

- We can't distinguish causation from correlation.

19th Century, Philosophy, Mill: - Mill’s methods for causal inference:

- There are no unobserved causes.

- An effect is caused by a conjunction of variables.

- Methods fail for disjunctions of causes.

- (1970: Winston reinvents Mill’s methods)

20th Century, Philosophy, Positivism: - Causation isn’t real.

- Causation isn’t scientific (not definable in FO Logic).

- Causation is defeasible (situation dependent).

Statistics, Pearson: - No cause from correlation.

Fisher: - Randomized experiments can determine causes.

1980s Philosophy, Lewis-Stalnaker: - Theory of Counterfactuals and Causation

- logic of "If I were to ..."

- We can reason about things yet to happen.

Tetrad Group: - Causal Graphs

Comp. Sci., Pearl: - Bayesian graphs as compact representations

of probability distributions.

- Causality wins 2001 Lakatos Award.

- How to behave given causal information.