Inference Algorithms for Bayes Networks

Upload: maximillian-green

Post on 21-Jan-2016


Page 1: Inference Algorithms for Bayes Networks

Inference Algorithms for Bayes Networks

Page 2

Outline

Bayes Nets are popular representations in AI, and researchers have developed many inference techniques for them.

We will consider two types of algorithms:

1) Exact inference
– Enumeration
– Variable elimination
– Other techniques not covered: junction tree, loop cutset conditioning, …

2) Approximate inference (sampling)
– Rejection sampling
– Likelihood weighting
– Gibbs sampling

Page 3

First: Notation

I’m going to assume all variables are binary.

For a random variable A, I will write the event that A is true as +a, and the event that A is false as –a.

Similarly for the other variables.

[Network diagram: A → C ← B, with C → D and C → E]

Page 4

Technique 1: Enumeration

This is the “brute-force” approach to BN inference.

[Network diagram: A → C ← B, with C → D and C → E]

Suppose I want to know P(+a | +b, +e).

Algorithm:
1) If the query is conditional (yes in this case), rewrite it with the definition of conditional probability.

2) Use marginalization to rewrite marginal probabilities in terms of the joint probability. e.g.,

3) Use the Bayes Net equation to determine the joint probability.

P(+a | +b, +e) = P(+a, +b, +e) / P(+b, +e)

P(+a, +b, +e) = ∑_c ∑_d P(+a, +b, c, d, +e)
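As a concrete sketch of the three steps: the deck gives no numeric CPTs for this five-node network, so the code below borrows the four-node network and CPTs from the sampling slides (Pages 25-29) and answers the analogous query P(+a | +b, +d) by brute-force enumeration. The function names are mine, not the deck's.

```python
from itertools import product

# CPTs from the sampling example on Pages 25-29:
# A is a root; A -> B, A -> C; B and C -> D.  All variables are binary.
P_A = 0.6                                        # P(+a)
P_B = {True: 0.7, False: 0.6}                    # P(+b | a)
P_C = {True: 0.4, False: 0.9}                    # P(+c | a)
P_D = {(True, True): 0.5, (True, False): 0.6,
       (False, True): 0.2, (False, False): 0.3}  # P(+d | b, c)

def joint(a, b, c, d):
    """Step 3: the Bayes Net equation gives the full joint probability."""
    p = P_A if a else 1 - P_A
    p *= P_B[a] if b else 1 - P_B[a]
    p *= P_C[a] if c else 1 - P_C[a]
    p *= P_D[(b, c)] if d else 1 - P_D[(b, c)]
    return p

def marginal(**fixed):
    """Step 2: marginalize, i.e. sum the joint over every unfixed variable."""
    names = ("a", "b", "c", "d")
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if all(assignment[var] == val for var, val in fixed.items()):
            total += joint(**assignment)
    return total

# Step 1: rewrite the conditional query with the definition of
# conditional probability: P(+a | +b, +d) = P(+a, +b, +d) / P(+b, +d).
answer = marginal(a=True, b=True, d=True) / marginal(b=True, d=True)
print(round(answer, 4))  # 0.6577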

Page 5

Speeding up Enumeration

Pulling out terms:

∑_c ∑_d P(+a) P(+b) P(c | +a, +b) P(d | c) P(+e | c)
= P(+a) P(+b) ∑_c ∑_d P(c | +a, +b) P(d | c) P(+e | c)
= P(+a) P(+b) ∑_c P(c | +a, +b) P(+e | c) ∑_d P(d | c)

Each term in the sum is faster to compute. But the total number of terms (things to add up) remains the same. In the worst case, this is still exponential in the number of nodes.

Page 6

Maximize Independence

If you can, it helps to create the BN so that it has as few edges as possible.

[Network diagram: Burglary, Earthquake → Alarm → John calls, Mary calls]

Let’s re-create the network above, but start with the “John calls” node and gradually add more nodes and edges.

Let’s see how many edges/dependencies we end up with.

Page 7

Maximize Independence

If you can, it helps to create the BN so that it has as few edges as possible.

[Diagrams: the causal network (Burglary, Earthquake → Alarm → John calls, Mary calls), and a new network started from just John calls and Mary calls. Is there an edge between them?]

Page 8

Maximize Independence

If you can, it helps to create the BN so that it has as few edges as possible.

[Diagrams: the causal network, and the new network after adding Alarm. Which edges does Alarm need to John calls and Mary calls?]

Page 9

Maximize Independence

If you can, it helps to create the BN so that it has as few edges as possible.

[Diagrams: the causal network, and the new network after adding Burglary. Which edges does Burglary need?]

Page 10

Maximize Independence

If you can, it helps to create the BN so that it has as few edges as possible.

[Diagrams: the causal network, and the new network after adding Earthquake. Which edges does Earthquake need?]

Page 11

Maximize Independence

If you can, it helps to create the BN so that it has as few edges as possible.

[Diagrams: the causal network, and the completed network built in the John-calls-first order]

Page 12

Causal Direction

Moral: Bayes Nets tend to be the most compact, and most efficient, when edges go from causes to effects.

[Diagrams: causal direction (Burglary, Earthquake → Alarm → John calls, Mary calls) vs. non-causal direction (the same variables with edges built from John calls and Mary calls upward)]

Page 13

Technique 2: Variable Elimination

[Network diagram: A → C ← B, with C → D and C → E]

Suppose I want to know P(+a | +b, +e).

Algorithm:
1) If the query is conditional (yes in this case), rewrite it with the definition of conditional probability.

2) For each marginal probability, apply variable elimination to find its value. E.g., for P(+a, +b, +e):

a. Join C & D (multiplication)
b. Eliminate D (marginalization)
c. Join C & +e (multiplication)
d. Eliminate C (marginalization)
e. Join +a & +e (multiplication)
f. Join +b & (+a, +e) (multiplication)
g. Done.

P(+a | +b, +e) = P(+a, +b, +e) / P(+b, +e)

𝑃 (+𝑎 ,+𝑏 ,+𝑒)

Page 14

Joining D & C

[Diagram: original network A → C ← B, C → D, C → E]

The Bayes Net provides: P(C | +a, +b) and P(D | C).

Joining D & C will compute P(D, C | +a, +b)

[Diagram: after the join, C and D are merged into one node “C, D”: A → (C, D) ← B, (C, D) → E]

For each c and each d, compute: P(d, c | +a, +b) = P(d | c) * P(c | +a, +b)

Page 15

Eliminating D

The Bayes Net now provides: P(D, C | +a, +b). Eliminating D will compute P(C | +a, +b).

[Diagram: after eliminating D: A → C ← B, C → E]

For each c, compute: P(c | +a, +b) = ∑_d P(d, c | +a, +b)

[Diagram: before the elimination, merged node “C, D”: A → (C, D) ← B, (C, D) → E]

Page 16

Joining C and +e

The Bayes Net now provides: P(C | +a, +b) and P(+e | C). Joining C and +e will compute P(+e, C | +a, +b).

[Diagram: after the join, merged node “C, E”: A → (C, E) ← B]

For each c, compute: P(+e, c | +a, +b) = P(c | +a, +b)*P(+e | c)

[Diagram: before the join: A → C ← B, C → E]

Page 17

Eliminating C

The Bayes Net now provides: P(+e, C | +a, +b). Eliminating C will compute P(+e | +a, +b).

[Diagram: after eliminating C: A → E ← B]

Compute: P(+e | +a, +b) = ∑_c P(+e, c | +a, +b)

[Diagram: before the elimination, merged node “C, E”: A → (C, E) ← B]

Page 18

Joining +a, +b, and +e

The Bayes Net now provides: P(+e | +a, +b), P(+a), and P(+b). Joining +a, +b, and +e will compute P(+e, +a, +b).

[Diagram: after the join, a single node “A, B, E”]

Compute: P(+e, +a, +b) = P(+e | +a, +b) * P(+a) * P(+b)

[Diagram: before the join: A → E ← B]
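The join/eliminate steps above can be sketched with a tiny factor library. The factor representation and function names here are my own choices, not the slides'; since the five-node network has no numeric CPTs, this sketch again uses the CPTs from the sampling slides (Pages 25-29) and computes P(+a | +b, +d), folding the evidence +b, +d into the factors up front by keeping only the matching CPT rows.

```python
from itertools import product

# A factor is (variables, table): `table` maps a tuple of bool values
# (ordered like `variables`) to a number.

def join(f1, f2):
    """Pointwise product of two factors over the union of their variables."""
    v1, t1 = f1
    v2, t2 = f2
    vs = v1 + [v for v in v2 if v not in v1]
    table = {}
    for vals in product([True, False], repeat=len(vs)):
        a = dict(zip(vs, vals))
        table[vals] = (t1[tuple(a[v] for v in v1)]
                       * t2[tuple(a[v] for v in v2)])
    return (vs, table)

def eliminate(f, var):
    """Sum a variable out of a factor (marginalization)."""
    vs, t = f
    i = vs.index(var)
    out = {}
    for vals, p in t.items():
        key = vals[:i] + vals[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return (vs[:i] + vs[i + 1:], out)

# CPTs from the sampling slides (Pages 25-29), with evidence rows selected:
f_A = (["A"], {(True,): 0.6, (False,): 0.4})                  # P(A)
f_B = (["A"], {(True,): 0.7, (False,): 0.6})                  # P(+b | A)
f_C = (["A", "C"], {(True, True): 0.4, (True, False): 0.6,
                    (False, True): 0.9, (False, False): 0.1}) # P(C | A)
f_D = (["C"], {(True,): 0.5, (False,): 0.6})                  # P(+d | +b, C)

f = eliminate(join(f_C, f_D), "C")   # join C's factors, then sum C out
f = join(join(f_A, f_B), f)          # fold in the remaining factors
vs, table = f
norm = sum(table.values())           # = P(+b, +d)
print(round(table[(True,)] / norm, 4))  # P(+a | +b, +d): 0.6577
```

Brute-force enumeration gives the same 98/149 ≈ 0.6577, as it must: variable elimination only reorders the sums and products.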

Page 19

Notes on Time Complexity

For graphs that are trees with N nodes, variable elimination can perform inference in time O(N).

For general graphs, variable elimination can perform inference in time O(2^w), where w is the “tree-width” of the graph. (However, this depends on the order in which variables are eliminated, and finding the best order is itself hard.)

Intuitively, tree-width is a measure of how close a graph is to an actual tree.

In the worst case, this can mean a time complexity that is exponential in the size of the graph.

Exact inference in BNs is known to be NP-hard.

Page 20

Approximate Inference via Sampling

Penny   Nickel   Count   Probability
Heads   Heads    0       ?
Heads   Tails    0       ?
Tails   Heads    0       ?
Tails   Tails    0       ?

Page 21

Approximate Inference via Sampling

Penny   Nickel   Count   Probability
Heads   Heads
Heads   Tails    1       1
Tails   Heads
Tails   Tails

Page 22

Penny   Nickel   Count   Probability
Heads   Heads
Heads   Tails    1       1
Tails   Heads
Tails   Tails

Approximate Inference via Sampling

Penny   Nickel   Count   Probability
Heads   Heads    1       .5
Heads   Tails    1       .5
Tails   Heads    0
Tails   Tails    0

Page 23

Penny   Nickel   Count   Probability
Heads   Heads    1       .5
Heads   Tails    1       .5
Tails   Heads    0
Tails   Tails    0

Approximate Inference via Sampling

Penny   Nickel   Count   Probability
Heads   Heads    2       .67
Heads   Tails    1       .33
Tails   Heads    0
Tails   Tails    0

Page 24

Approximate Inference via Sampling

Penny   Nickel   Count   Probability
Heads   Heads    53      .2465
Heads   Tails    56      .2605
Tails   Heads    52      .2419
Tails   Tails    54      .2512

As the number of samples increases, our estimates should approach the true joint distribution.

Conveniently, we get to decide how long we want to spend to figure out the probabilities.
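This two-coin experiment is easy to simulate. A quick sketch (the seed and sample count are arbitrary choices of mine):

```python
import random
from collections import Counter

# Flip a fair penny and a fair nickel N times, tallying the four outcomes.
rng = random.Random(0)
N = 20000
counts = Counter()
for _ in range(N):
    penny = "Heads" if rng.random() < 0.5 else "Tails"
    nickel = "Heads" if rng.random() < 0.5 else "Tails"
    counts[(penny, nickel)] += 1

for outcome in sorted(counts):
    # Each estimate drifts toward the true joint probability, 0.25.
    print(outcome, counts[outcome] / N)
```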

Page 25

Generating Samples from a BN

[Network diagram: A → B, A → C; B, C → D]

A    P(A)
+a   .6

A    C    P(C|A)
+a   +c   .4
-a   +c   .9

A    B    P(B|A)
+a   +b   .7
-a   +b   .6

Sample generation algorithm:

For each variable X that has not been assigned, but whose parents have all been assigned:

1. r ← a random number in the range [0, 1]
2. If r < P(+x | parents(X)), then assign X ← +x
3. Else, X ← -x

For this example:

At first, A is the only variable whose parents have been assigned (since it has no parents).

r ← 0.3
0.3 < P(+a), so we assign A ← +a

B    C    D    P(D|B,C)
+b   +c   +d   .5
+b   -c   +d   .6
-b   +c   +d   .2
-b   -c   +d   .3

Page 26

Generating Samples from a BN

(Same network, CPTs, and sample generation algorithm as on Page 25.)

For this example: Current Sample: +a

Next, both B and C have all their parents assigned. Let’s choose B.

r ← .9
.9 >= P(+b | +a), so we set B ← -b


Page 27

Generating Samples from a BN

(Same network, CPTs, and sample generation algorithm as on Page 25.)

For this example: Current Sample: +a, -b

Quiz: what variable would be assigned next?

If r ← .4, what would this variable be assigned?


Page 28

Generating Samples from a BN

(Same network, CPTs, and sample generation algorithm as on Page 25.)

For this example: Current Sample: +a, -b, -c

Now D has all its parents assigned.

If r ← .2, what would D be assigned?

Page 29

Generating Samples from a BN

(Same network, CPTs, and sample generation algorithm as on Page 25.)

For this example: Current Sample: +a, -b, -c, +d

That completes this sample.

We can now increase the count of (+a, -b, -c, +d) by 1,and move on to the next sample.
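The whole sample-generation loop can be sketched in code. The CPT values are the ones from the tables on Page 25; the helper name, seed, and sample count are my own choices:

```python
import random
from collections import Counter

# CPTs from the slides; A -> B, A -> C; B, C -> D.
P_A = 0.6
P_B = {True: 0.7, False: 0.6}
P_C = {True: 0.4, False: 0.9}
P_D = {(True, True): 0.5, (True, False): 0.6,
       (False, True): 0.2, (False, False): 0.3}

def one_sample(rng):
    """Assign each variable only after its parents, exactly as in the slides:
    draw r in [0, 1] and assign +x when r < P(+x | parents)."""
    a = rng.random() < P_A
    b = rng.random() < P_B[a]
    c = rng.random() < P_C[a]
    d = rng.random() < P_D[(b, c)]
    return (a, b, c, d)

rng = random.Random(0)
N = 20000
counts = Counter(one_sample(rng) for _ in range(N))

est_p_a = sum(n for sample, n in counts.items() if sample[0]) / N
print(round(est_p_a, 2))  # close to P(+a) = 0.6
```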

Page 30

Quiz: Approximating Queries

Suppose I generate a bunch of samples for a BN with variables A, B, C, and get these counts.

What are these probabilities?

P(+a, -b, -c)?

P(+a, -c)?

P(-a | -b, -c)?

P(-b | +a)?

A    B    C    Count
+a   +b   +c   20
+a   +b   -c   30
+a   -b   +c   50
+a   -b   -c   30
-a   +b   +c   30
-a   +b   -c   20
-a   -b   +c   80
-a   -b   -c   40

Total: 300
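One way to grind through such a count table mechanically (the `prob` helper is my own, and the comments give away the quiz answers):

```python
# The counts from the table above, keyed by (A, B, C); total is 300.
counts = {
    ("+a", "+b", "+c"): 20, ("+a", "+b", "-c"): 30,
    ("+a", "-b", "+c"): 50, ("+a", "-b", "-c"): 30,
    ("-a", "+b", "+c"): 30, ("-a", "+b", "-c"): 20,
    ("-a", "-b", "+c"): 80, ("-a", "-b", "-c"): 40,
}
total = sum(counts.values())  # 300

def prob(**events):
    """Fraction of samples matching the given events, e.g. prob(a='+a')."""
    index = {"a": 0, "b": 1, "c": 2}
    matching = sum(n for key, n in counts.items()
                   if all(key[index[v]] == val for v, val in events.items()))
    return matching / total

print(prob(a="+a", b="-b", c="-c"))                         # 30/300 = 0.1
print(prob(a="+a", c="-c"))                                 # 60/300 = 0.2
print(prob(a="-a", b="-b", c="-c") / prob(b="-b", c="-c"))  # 40/70
print(prob(a="+a", b="-b") / prob(a="+a"))                  # 80/130
```

Note that for the conditional queries the division by `total` cancels, so the ratios of `prob` calls are just ratios of counts.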

Page 31

Technique 3: Rejection Sampling

Rejection sampling is the fancy name given to the procedure you just used to compute, e.g., P(-a | -b, -c).

To compute this, you ignore (or “reject”) samples where B = +b or C = +c, since they don’t match the evidence in the query.
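A sketch of rejection sampling on the network from the sampling slides (Pages 25-29); the particular query P(+a | +b, +d), the seed, and the sample count are my choices:

```python
import random

# CPTs from the sampling slides.
P_A = 0.6
P_B = {True: 0.7, False: 0.6}
P_C = {True: 0.4, False: 0.9}
P_D = {(True, True): 0.5, (True, False): 0.6,
       (False, True): 0.2, (False, False): 0.3}

rng = random.Random(0)
accepted = positives = 0
for _ in range(100_000):
    a = rng.random() < P_A
    b = rng.random() < P_B[a]
    c = rng.random() < P_C[a]
    d = rng.random() < P_D[(b, c)]
    if not (b and d):       # doesn't match the evidence +b, +d ...
        continue            # ... so reject the sample
    accepted += 1
    positives += a
print(positives / accepted)  # approaches P(+a | +b, +d) = 98/149 ~ 0.658
print(accepted / 100_000)    # only about 36% of the samples were kept
```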

Page 32

Consistency

Rejection sampling is a consistent approximate inference technique.

Consistency means that as the number of samples increases, the estimated value of the probability for a query approaches its true value.

In the limit of infinite samples, consistent sampling techniques give the correct probabilities.

Page 33

Room for Improvement

Efficiency of Rejection Sampling: if you’re interested in a query like P(+a | +b, +c), you’ll reject 5 out of every 6 samples, since only 1 in 6 samples has the right evidence (+b and +c).

So most samples are useless for your query.

Page 34

Technique 4: Likelihood Weighting

(Same network and CPTs as on Page 25.)

Sample generation algorithm:

Initialize: sample ← {}, P(sample) ← 1
For each variable X that has not been assigned, but whose parents have all been assigned:

1. If X is an evidence node:
a. assign X ← the value from the query
b. P(sample) ← P(sample) * P(X | parents(X))
2. Otherwise, assign X as normal; P(sample) unchanged

For this example: Sample: {}  P(sample): 1

At first, A is the only variable whose parents have been assigned (since it has no parents).

r ← 0.3
0.3 < P(+a), so we assign A ← +a

Query of interest: P(+c | +b, +d)


Page 35

Likelihood Weighting

(Same network, CPTs, and algorithm as on Pages 25 and 34.)

For this example: Sample: {+a}  P(sample): 1

B and C have their parents assigned. Let’s do B next.

B is an evidence node, so we choose B ← +b (from the query).
Also, P(+b | +a) = .7, so we update P(sample) ← 0.7.

Query of interest: P(+c | +b, +d)


Page 36

Likelihood Weighting

(Same network, CPTs, and algorithm as on Pages 25 and 34.)

For this example: Sample: {+a, +b}  P(sample): 0.7

C has its parents assigned. It is NOT an evidence node.

r ← .8
.8 >= P(+c | +a), so C ← -c.
P(sample) is NOT updated.

Query of interest: P(+c | +b, +d)


Page 37

Likelihood Weighting

(Same network, CPTs, and algorithm as on Pages 25 and 34.)

For this example: Sample: {+a, +b, -c}  P(sample): 0.7

D has its parents assigned.

How do the sample and P(sample) change?

Query of interest: P(+c | +b, +d)


Page 38

Likelihood Weighting

(Same network, CPTs, and algorithm as on Pages 25 and 34.)

For this example: Sample: {+a, +b, -c, +d}  P(sample): 0.42

Query of interest: P(+c | +b, +d)
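Looping the walkthrough above many times gives the likelihood-weighting estimate. A sketch (variable names, seed, and sample count are mine; from the CPTs, the exact answer works out to 80/149 ≈ 0.537):

```python
import random

# CPTs from the slides.  Evidence: B = +b, D = +d; query: P(+c | +b, +d).
P_A = 0.6
P_B = {True: 0.7, False: 0.6}
P_C = {True: 0.4, False: 0.9}
P_D = {(True, True): 0.5, (True, False): 0.6,
       (False, True): 0.2, (False, False): 0.3}

rng = random.Random(0)
weight_total = weight_with_c = 0.0
for _ in range(50_000):
    w = 1.0
    a = rng.random() < P_A     # A: not evidence -> sample as normal
    b = True                   # B: evidence -> fix to +b ...
    w *= P_B[a]                # ... and multiply P(+b | a) into the weight
    c = rng.random() < P_C[a]  # C: not evidence -> sample as normal
    d = True                   # D: evidence -> fix to +d ...
    w *= P_D[(b, c)]           # ... and multiply P(+d | b, c) into the weight
    weight_total += w
    if c:
        weight_with_c += w
print(weight_with_c / weight_total)  # approaches P(+c | +b, +d) ~ 0.537
```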

Page 39

Likelihood Weighting vs. Rejection Sampling

A    B    C    Count
+a   +b   +c   20
+a   +b   -c   30
+a   -b   +c   50
+a   -b   -c   30
-a   +b   +c   30
-a   +b   -c   20
-a   -b   +c   80
-a   -b   -c   40

Total: 300

Rejection Sampling

A    B    C    Probabilistic Count
-a   +b   +c   23.58
-a   +b   -c   68.3
-a   -b   +c   90.6
-a   -b   -c   40.6

Total: 223.08

Likelihood Weighting, for the query P(+c | -a)

Both are consistent.

Likelihood weighting: requires fewer samples to get good estimates, but solves just one query at a time.

Rejection sampling: needs LOTS of samples, but can answer any query.

Page 40

Further room for improvement

(Same network and CPTs as on Page 25.)

Example query of interest: P(+d | +b, +c)


If we generate samples using likelihood weighting, the choice of sample for D takes into account the evidence.

However, the choice of sample for A does NOT take into account the evidence.

So we may generate lots of samples that are very unlikely, and don’t contribute much to our overall counts.

Quiz: what is P(+a | +b, +c)? And P(-a | +b, +c)?

Page 41

Technique 5: Gibbs Sampling

Named after physicist Josiah Gibbs (you may have heard of Gibbs Free Energy).

This is a special case of a more general algorithm called Metropolis-Hastings, which is itself a special case of Markov-Chain Monte Carlo (MCMC) estimation.

Page 42

Gibbs Sampling

(Same network and CPTs as on Page 25.)

Sample generation algorithm:

Initialize: sample ← {A ← random, +b, -c, D ← random}
Repeat:
1. Pick a non-evidence variable X
2. Get a random number r in the range [0, 1]
3. If r < P(+x | all other variables), set X ← +x
4. Otherwise, set X ← -x
5. Add 1 to the count for this new sample

For this example: Sample: {-a, +b, -c, +d}

A and D are non-evidence. Randomly choose D to re-set.

r ← 0.7
P(+d | -a, +b, -c) = P(+d | +b, -c) = .6
r >= .6, so D ← -d

Query of interest: P(-d | +b, -c)


Page 43

Gibbs Sampling

(Same network, CPTs, and algorithm as on Page 42.)

For this example: Sample: {-a, +b, -c, -d}

A and D are non-evidence. Randomly choose D to re-set.

r ← 0.9
P(+d | -a, +b, -c) = P(+d | +b, -c) = .6
r >= .6, so D ← -d (no change)

Query of interest: P(-d | +b, -c)


Page 44

Gibbs Sampling

(Same network, CPTs, and algorithm as on Page 42.)

For this example: Sample: {-a, +b, -c, -d}

A and D are non-evidence. Randomly choose A to re-set.

r ← 0.3
P(+a | +b, -c, -d) = P(+a | +b, -c) = ?
What is A after this step?

Query of interest: P(-d | +b, -c)


Page 45

Details of Gibbs Sampling

1. To compute P(X | all other variables), it is enough to consider only the Markov Blanket of X:
– X’s parents, X’s children, and the parents of X’s children.
– Everything else will be conditionally independent of X, given its Markov Blanket.

2. Unlike Rejection Sampling and Likelihood Weighting, samples in Gibbs Sampling are NOT independent.

3. Nevertheless, Gibbs Sampling is consistent.

4. It is very common to discard the first N (often N ~= 1000) samples from a Gibbs sampler. The first N samples are called the “burn-in” period.
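Putting the Gibbs slides together into a runnable sketch for the running query P(-d | +b, -c). The helper names, seed, burn-in length, and iteration count are my choices; note that D's Markov blanket is just its parents B and C, and A's is its children B and C.

```python
import random

# CPTs from the slides.  Evidence: B = +b, C = -c; query: P(-d | +b, -c).
P_A = 0.6
P_B = {True: 0.7, False: 0.6}
P_C = {True: 0.4, False: 0.9}
P_D = {(True, True): 0.5, (True, False): 0.6,
       (False, True): 0.2, (False, False): 0.3}

def p_a_given_rest(b, c):
    """P(+a | everything else).  Only A's Markov blanket matters, so this is
    proportional to P(A) * P(b | A) * P(c | A); the P(d | b, c) term cancels."""
    def score(a):
        return ((P_A if a else 1 - P_A)
                * (P_B[a] if b else 1 - P_B[a])
                * (P_C[a] if c else 1 - P_C[a]))
    return score(True) / (score(True) + score(False))

rng = random.Random(0)
b, c = True, False              # evidence variables: fixed, never resampled
a = rng.random() < 0.5          # non-evidence variables: initialized randomly
d = rng.random() < 0.5

burn_in, steps = 1000, 40000
count_not_d = kept = 0
for t in range(steps):
    if rng.random() < 0.5:      # pick a non-evidence variable at random
        a = rng.random() < p_a_given_rest(b, c)
    else:                       # D's blanket is just its parents B and C
        d = rng.random() < P_D[(b, c)]
    if t >= burn_in:            # discard the burn-in period
        kept += 1
        count_not_d += not d
print(count_not_d / kept)       # approaches P(-d | +b, -c) = 1 - .6 = 0.4
```

Unlike the earlier samplers, consecutive samples here differ in at most one variable, which is why they are not independent, yet the long-run counts still converge.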