TRANSCRIPT
Bayes Nets: Representing and Reasoning about Uncertainty (Continued)
Combining the Two Examples
• I am at work, my neighbor John calls to say that my alarm went off, my neighbor Mary doesn't call. Sometimes the alarm is set off by a minor earthquake. Is there a burglar?
[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls]
Earthquake Example
• I am at work, my neighbor John calls to say that my alarm went off, neighbor Mary doesn't call. Sometimes the alarm is set off by a minor earthquake. Is there a burglar?
1: Define the variables that completely describe the problem.
2: Define the links between variables.
• The resulting directed graph must be acyclic.
• If node X has parents Y1,...,Yn, any variable that is not a descendant of X is conditionally independent of X given (Y1,...,Yn).
P(B=True) = 0.001    P(E=True) = 0.002

B E | P(A=True | B=b, E=e)
F F | 0.001
F T | 0.29
T F | 0.94
T T | 0.95

A | P(J=True | A=a)
F | 0.05
T | 0.90

A | P(M=True | A=a)
F | 0.01
T | 0.70
3: Add a probability table for each node. The table for node X contains P(X|Parent Values) for
each possible combination of parent values
Computing a Joint Entry
• Any entry in the joint probability table can be computed. Example: the probability that both John and Mary call, the alarm goes off, but there is no earthquake or burglar.
Computing a Joint Entry
P(J ^ M ^ A ^ ¬B ^ ¬E)
  = P(J | M ^ A ^ ¬B ^ ¬E) P(M ^ A ^ ¬B ^ ¬E)
  = P(J | A) P(M ^ A ^ ¬B ^ ¬E)
  = P(J | A) P(M | A ^ ¬B ^ ¬E) P(A ^ ¬B ^ ¬E)
  = P(J | A) P(M | A) P(A | ¬B ^ ¬E) P(¬B ^ ¬E)
  = P(J | A) P(M | A) P(A | ¬B ^ ¬E) P(¬B) P(¬E)
  = 0.90 x 0.70 x 0.001 x 0.999 x 0.998 ≈ 0.0006
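The derivation above can be checked numerically from the CPTs on the slides. This is a from-scratch sketch (the variable names are our own, not lecture code):

```python
# CPTs for the alarm network, taken from the slides.
P_B = 0.001                                         # P(Burglary = True)
P_E = 0.002                                         # P(Earthquake = True)
P_A = {(False, False): 0.001, (False, True): 0.29,  # P(Alarm = True | B, E)
       (True, False): 0.94, (True, True): 0.95}
P_J = {False: 0.05, True: 0.90}                     # P(JohnCalls = True | A)
P_M = {False: 0.01, True: 0.70}                     # P(MaryCalls = True | A)

# P(J ^ M ^ A ^ ¬B ^ ¬E) = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
p = P_J[True] * P_M[True] * P_A[(False, False)] * (1 - P_B) * (1 - P_E)
print(p)  # about 0.00063
```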
We would need 2^5 = 32 entries to store the entire joint distribution table.
But we need to store only 10 values by representing the dependencies between variables.
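The storage comparison can be verified with a couple of lines (a quick sketch; the parent counts are read off the network):

```python
# Each binary node with k binary parents needs 2**k stored values:
# B and E have no parents, A has two (B, E), J and M have one (A).
num_parents = {"B": 0, "E": 0, "A": 2, "J": 1, "M": 1}
stored = sum(2 ** k for k in num_parents.values())
full_joint = 2 ** len(num_parents)  # entries in the full joint table
print(stored, full_joint)  # 10 32
```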
Inference
• Any inference operation of the form P(values of some variables | values of the other variables) can be computed. Example: the probability that both John and Mary call given that there was a burglar.
Inference
P(J, M | B) = P(J, M, B) / P(B)

            = [ Σ P(X) over all entries X that contain J ^ M ^ B ]
              / [ Σ P(Y) over all entries Y that contain B ]
We know how to compute these sums because we know how to compute the joint → as written, we still need to compute most of the entire joint table.
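The enumeration can be sketched directly: sum joint entries computed by the chain rule over the network. This is our own layout, not lecture code:

```python
from itertools import product

# CPTs for the alarm network, from the slides.
P_B, P_E = 0.001, 0.002
P_A = {(False, False): 0.001, (False, True): 0.29,
       (True, False): 0.94, (True, True): 0.95}
P_J = {False: 0.05, True: 0.90}
P_M = {False: 0.01, True: 0.70}

def pick(p_true, val):
    """P(var = val) given P(var = True)."""
    return p_true if val else 1.0 - p_true

def joint(b, e, a, j, m):
    """One entry of the joint, via the chain rule over the network."""
    return (pick(P_B, b) * pick(P_E, e) * pick(P_A[(b, e)], a)
            * pick(P_J[a], j) * pick(P_M[a], m))

# Numerator: all entries that contain J ^ M ^ B; denominator: all with B.
num = sum(joint(True, e, a, True, True)
          for e, a in product([False, True], repeat=2))
den = sum(joint(True, e, a, j, m)
          for e, a, j, m in product([False, True], repeat=4))
print(num / den)  # P(J ^ M | B) ≈ 0.59
```

The denominator sums to exactly P(B) = 0.001, as it should.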
Bayes Net: Formal Definition
• Bayes Net = directed acyclic graph represented by:
– Set of vertices V
– Set of directed edges E joining vertices. No cycles are
allowed.
• With each vertex is associated:
– The name of a random variable
– A probability distribution table indicating how the
probability of the variable’s values depends on all the
possible combinations of values of its parents
Bayes Nets are also called Belief Networks. The tables associated with the vertices are called Conditional Probability Tables (CPTs).
All the definitions can be extended to using continuous random variables instead of discrete variables.
Bayes Net Construction
• Choose a set of variables and an ordering {X1,..,Xm}
• For each variable Xi for i = 1 to m:
1. Add the variable Xi to the network
2. Set Parents(Xi) to be the minimal subset of {X1,..,Xi-1} such that Xi is conditionally independent of all the other members of {X1,..,Xi-1} given Parents(Xi)
3. Define the probability table describing
P(Xi | Parents(Xi))
If Xi has k parents, we need to store 2^k entries to represent the CPT → storage is exponential in the number of parents, not in the total number of variables m. In many problems k << m.
The structure of the network depends on the initial ordering of the variables.
Example: Symptoms & Diagnosis
• The diagnosis problem: what is the most likely disease given the observed symptoms?
• Variables V = {Flu, Measles, Fever, Spots}
Flu Measles
Fever Spots
Try creating the network by using a different ordering of the variables…..
Another Example
• The lawn may be wet because the sprinkler was on or because it was raining (or both).
[Network: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler → Wet Grass, Rain → Wet Grass]
P(C=True) = 0.5

C | P(S=True | C)
F | 0.50
T | 0.10

C | P(R=True | C)
F | 0.20
T | 0.80

S R | P(W=True | S, R)
F F | 0.01
F T | 0.90
T F | 0.90
T T | 0.99
Computing the Joint: The General Case
• Any entry in the joint distribution table can be computed
• Consequently, any conditional probability can be computed
P(X1=x1, X2=x2, …, Xm=xm)
  = P(Xm=xm | X1=x1, …, Xm-1=xm-1) x P(X1=x1, …, Xm-1=xm-1)
  = P(Xm=xm | X1=x1, …, Xm-1=xm-1)
    x P(Xm-1=xm-1 | X1=x1, …, Xm-2=xm-2) x P(X1=x1, …, Xm-2=xm-2)
  = …
  = ∏ (i = 1 to m) P(Xi=xi | X1=x1, …, Xi-1=xi-1)
  = ∏ (i = 1 to m) P(Xi=xi | assignments to Parents(Xi))
We can do this because, by
construction, Xi is independent of all
the other variables given Parents(Xi)
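The product formula takes only a few lines of code. A minimal sketch for the sprinkler network from the earlier slide; the dictionary layout (parents plus a table of P(var=True | parent values)) is our own choice:

```python
# Each node maps to (parents, CPT); the CPT gives P(var = True | parent values).
# Nodes are listed in topological order (parents before children).
net = {
    "C": ((), {(): 0.5}),
    "S": (("C",), {(False,): 0.50, (True,): 0.10}),
    "R": (("C",), {(False,): 0.20, (True,): 0.80}),
    "W": (("S", "R"), {(False, False): 0.01, (False, True): 0.90,
                       (True, False): 0.90, (True, True): 0.99}),
}

def joint(assignment):
    """P(X1=x1,...,Xm=xm) = product over i of P(Xi=xi | Parents(Xi))."""
    p = 1.0
    for var, (parents, cpt) in net.items():
        p_true = cpt[tuple(assignment[pa] for pa in parents)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

print(joint({"C": True, "S": False, "R": True, "W": True}))
# = 0.5 x 0.9 x 0.8 x 0.9 ≈ 0.324
```

The work is linear in the number of variables: one CPT lookup and one multiply per node.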
Inference: The General Case
• Inference = Computing a conditional probability:
P(Value for some variable(s) | Values for other variables)
P(E1 | E2)
• E1: "Query" variables (example: Disease)
• E2: "Evidence" variables (example: Symptoms)
• Also called "belief updating"
P(E1 | E2) = P(E1, E2) / P(E2)

           = [ Σ P(X) over all joint entries X that contain E1 ^ E2 ]
             / [ Σ P(Y) over all joint entries Y that contain E2 ]
We can compute any conditional probability, so in principle we can solve any inference problem.
So Far…
• Methodology for building Bayes nets.
• Requires exponential storage in the maximum
number of parents of any node, not in the total number of nodes.
• We can compute the value of any assignment to the variables (entry in the joint distribution) in time linear in the number of variables.
• We can compute the answer to any question (any conditional probability)
Problem: if E2 involves k binary variables and we have a total of m variables, what is the complexity of this computation?
Inference: The Bad News
• Computing the conditional probabilities by
enumerating all relevant entries in the joint
is expensive:
Exponential in the number of variables!
• Even worse:
Solving for general queries in Bayes nets is
NP-hard!
Possible Solutions
• Approximate methods
– Approximate the joint distributions by drawing samples
• Exact methods
– Factorization and variable elimination
– Exploit special network structure (e.g., trees)
– Transform the network structure
Approximate Method: Sampling
• Sampling = very powerful technique in many probabilistic problems (stochastic simulation)
• General idea:
  – It is often difficult to compute and represent exactly the probability distribution of a set of variables
  – But it is often easy to generate examples from the distribution
x1 x2 … xm | P(X1=x1, X2=x2, …, Xm=xm)
[full joint table: the number of rows is too large for the table to be computed explicitly]
x1 x2 … xm
F F F T F T … F T
F T T F F F … T F
T F T T T F … T T
T T F T F F … T F
…
For a large number of samples,
P(X1=x1,X2=x2,…,Xm = xm)
is approximately equal to:
# of samples with
X1=x1 and X2=x2 …and Xm = xm
Total # of samples
Sampling Example
• Generate a set of variable assignments with the same distribution as the joint distribution represented by the network
Sampling
1. Randomly choose C. C = True with probability 0.5 → C = True
2. Randomly choose S. S = True with probability 0.10 → S = False
3. Randomly choose R. R = True with probability 0.80 → R = True
4. Randomly choose W. W = True with probability 0.90 → W = True
New sample: C = T, S = F, R = T, W = T
Sampling for Inference: Example
• Suppose that we want to compute P(W = True | C = True)
  (In words: how likely is it that the grass will be wet given that the sky is cloudy?)
• Compute lots of samples of (C, S, R, W)
  – Nc = number of samples for which C = True
  – Ns = number of samples for which W = True and C = True
  – N = total number of samples
• Nc/N approximates P(C = True)
• Ns/N approximates P(W = True and C = True)
• Therefore Ns/Nc approximates:
  P(W = True and C = True) / P(C = True) = P(W = True | C = True)
Sampling for Inference: General Case
• Suppose that we want to compute P(E1 | E2)
  (In words: how likely is it that the variable assignments in E1 are satisfied given the assignments in E2?)
• Compute lots of samples
  – Nc = number of samples for which the assignments in E2 are satisfied
  – Ns = number of samples for which the assignments in both E1 and E2 are satisfied
  – N = total number of samples
• Nc/N approximates P(E2)
• Ns/N approximates P(E1 and E2)
• Therefore Ns/Nc approximates:
  P(E1 and E2) / P(E2) = P(E1 | E2)
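Putting the recipe together for the sprinkler network: draw ancestral samples and form Ns/Nc. A from-scratch sketch (the sample count and seed are arbitrary choices of ours):

```python
import random

# CPTs for the sprinkler network, from the slides.
P_C = 0.5
P_S = {False: 0.50, True: 0.10}    # P(S = True | C)
P_R = {False: 0.20, True: 0.80}    # P(R = True | C)
P_W = {(False, False): 0.01, (False, True): 0.90,
       (True, False): 0.90, (True, True): 0.99}  # P(W = True | S, R)

def sample():
    """Draw (C, S, R, W) in topological order, each given its parents."""
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    w = random.random() < P_W[(s, r)]
    return c, s, r, w

random.seed(0)
Nc = Ns = 0
for _ in range(100_000):
    c, s, r, w = sample()
    if c:              # evidence E2: C = True
        Nc += 1
        if w:          # query E1: W = True
            Ns += 1
print(Ns / Nc)  # approximates P(W = True | C = True); exact value is 0.747
```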
Problem with Sampling
• Probability is so low for some assignments of variables that they will likely never be seen in the samples (unless a very large number of samples is drawn).
• Example: P(JohnCalls = True | Earthquake = True)
P(E=True) is so small that it is unlikely that we will draw many samples with E=True → Ns/Nc is a poor approximation of P(JohnCalls = True | Earthquake = True)
Solution: Likelihood Weighting
• Suppose that E2 contains a variable assignment of the form Xi = v
• Current approach:
  – Generate samples until enough of them contain Xi = v
  – Such samples are generated with probability p = P(Xi = v | Parents(Xi))
• Likelihood Weighting:
  – Generate only samples with Xi = v
  – Weight each sample by ω = p
Likelihood Weighting
• Example: Suppose that we want to compute an inference with E2 = (Sprinkler = True, Wet Grass = True)
Likelihood Weighting
1. Randomly choose C. C = True with probability 0.5 → C = True. ω = 1.0
   (C is not one of the evidence variables, so we take a random sample as before.)
2. Set S = True. ω = 1.0 x 0.10
   (S is one of the evidence variables, so we fix its value without sampling; at the same time, we update the current weight of the sample by P(S = True | C).)
3. Randomly choose R. R = True with probability 0.80 → R = True
4. Set W = True. ω = 1.0 x 0.10 x 0.99
   (W is one of the evidence variables, so we fix its value without sampling and update the weight by P(W = True | S, R).)
New sample: C = T, S = T, R = T, W = T, with weight ω = 0.099.
We increment the "number" of samples by ω: Nc ← Nc + ω
Likelihood Weighting
• Nc = 0; Ns = 0
1. Generate a random assignment of the variables, fixing the variables assigned in E2
2. Assign the sample a weight ω = probability that this sample would have been generated if we did not fix the value of the variables in E2
3. Nc ← Nc + ω
4. If the sample matches E1, Ns ← Ns + ω
5. Repeat until we have "enough" samples
• Ns/Nc is an estimate of P(E1 | E2)
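A sketch of the algorithm for the sprinkler network with evidence E2 = (S = True, W = True) and query E1 = (R = True); the choice of query, sample count, and seed are ours, not the lecture's:

```python
import random

# CPTs for the sprinkler network, from the slides.
P_C = 0.5
P_S = {False: 0.50, True: 0.10}    # P(S = True | C)
P_R = {False: 0.20, True: 0.80}    # P(R = True | C)
P_W = {(False, False): 0.01, (False, True): 0.90,
       (True, False): 0.90, (True, True): 0.99}  # P(W = True | S, R)

def weighted_sample():
    """Fix the evidence variables; weight by the probability of the fixed values."""
    w = 1.0
    c = random.random() < P_C       # not evidence: sample as usual
    s = True                        # evidence: fix S = True ...
    w *= P_S[c]                     # ... and multiply in P(S = True | C = c)
    r = random.random() < P_R[c]    # not evidence: sample as usual
    wet = True                      # evidence: fix W = True ...
    w *= P_W[(s, r)]                # ... and multiply in P(W = True | S, R)
    return (c, s, r, wet), w

random.seed(0)
Nc = Ns = 0.0
for _ in range(100_000):
    (c, s, r, wet), w = weighted_sample()
    Nc += w                         # every sample matches the evidence E2
    if r:                           # sample also matches the query E1
        Ns += w
print(Ns / Nc)  # approximates P(R = True | S = True, W = True) ≈ 0.32
```

Unlike rejection-style counting, no sample is wasted: every draw is consistent with the evidence and contributes its weight ω to the counts.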
Possible Solutions
• Approximate methods
– Approximate the joint distributions by drawing samples
• Exact methods
– Factorization and variable elimination
– Exploit special network structure (e.g., trees)
– Transform the network structure
Before that, let’s first look at
other independence information
that can be extracted
More Powerful Statements about Conditional Independence
Windshield wipers
working Headlights working
Battery charged
Assume that these nodes are part
of a much larger network and may be very far away from each other
(slightly contrived example)
For this type of engine, once we know B, the
knowledge of W is irrelevant to H.
H and W are conditionally independent given B:
P(H|W,B) = P(H|B)
How can we find out this independence relation?
More General
• Given 2 sets of nodes S1 and S2
• Given a set of nodes E to which we have assigned values (the evidence set)
• Are S1 and S2 conditionally independent given E?
• Why is it important and useful?
  P(assignments to S1 | E and assignments to S2) = P(assignments to S1 | E)
More General
• How can we find out if S1 and S2 are conditionally independent given E?
• Why is it important and useful?
  We can simplify any computation by replacing a term like P(S1 | E, S2) with P(S1 | E):
  P(assignments to S1 | E and assignments to S2) = P(assignments to S1 | E)
  Intuitively, E stands in between or "blocks" S1 from S2.
Blockage: Formal Definition
• A path from a node X to a node Y is blocked by a set E if either:
  (1) the path passes through one node from E, or
  (2) the path passes through one node from E with diverging arrows, or
  (3) the path has converging arrows on a node such that the node is not in E and neither are its descendants.
D-Separation Theorem
• If every (undirected) path from a node in a set S1 to one in a set S2 is blocked by E, then E d-separates S1 and S2.
D-Separation Theorem
• If E d-separates S1 and S2, then S1 and S2 are conditionally independent given E:
  P(S2 | S1, E) = P(S2 | E)
  P(S1 | S2, E) = P(S1 | E)
Aches and Fever?
Flu and Malaria?
Subway and ExoticTrip?
So Far…
• Methodology for building Bayes nets.
• Requires exponential storage in the maximum number of parents of any node.
• We can compute the value of any assignment to the variables (entry in the joint distribution) in time linear in the number of variables.
• We can compute the answer to any question (any conditional probability)
• But inference is NP-hard in general
• Sampling (stochastic sampling) for approximate inference
• D-separation criterion to be used for extracting additional independence relations (and simplifying inference)