Directed Graphical Models (wcohen/10-601/networks-1.pdf, 2016-03-21, Machine Learning 10-601)
TRANSCRIPT
Directed Graphical Models
William W. Cohen Machine Learning 10-601
MOTIVATION FOR GRAPHICAL MODELS
Recap: A paradox of induction
• A black crow seems to support the hypothesis “all crows are black”.
• A pink highlighter supports the hypothesis “all non-black things are non-crows”
• Thus, a pink highlighter supports the hypothesis “all crows are black”.
∀x : CROW(x) ⇒ BLACK(x)
or equivalently
∀x : ¬BLACK(x) ⇒ ¬CROW(x)
whut?
B = black C = crow
          crows   non-crows
black
not black

collect statistics for P(B=b|C=c)
Logical reasoning versus common-sense reasoning
BLACK(jim) FLY(jim) BIRD(jim) EATS(jim,carrion) …
Another difficult problem: common-sense reasoning • Tweety is a bird. • Most birds can fly.
• Opus is a penguin. • Penguins are birds. • Penguins cannot fly.
• We’d like to be able to conclude: – Opus cannot fly, and Tweety can
B(tweety)
∀*x : B(x) ⇒ F(x)
Pg(opus)
∀x : Pg(x) ⇒ B(x)
∀x : Pg(x) ⇒ ¬F(x)
default reasoning: F(tweety) ✔   F(opus) ∧ ¬F(opus) ✖
Logically…
Another difficult problem: common-sense reasoning • Tweety is a bird. • Most birds can fly.
• Opus is a penguin. • Penguins are birds. • Penguins cannot fly.
• We’d like to be able to conclude: – Opus cannot fly, and Tweety can
B(tweety)
∀*x : B(x) ∧ ¬Pg(x) ⇒ F(x)
Pg(opus)
∀x : Pg(x) ⇒ B(x)
∀x : Pg(x) ⇒ ¬F(x)
default reasoning: F(tweety) ?   ¬F(opus) ✔
NO: F(tweety) only provable if he’s provably NOT a penguin
Another difficult problem: common-sense reasoning • Tweety is a bird. • Most birds can fly.
• Opus is a penguin. • Penguins are birds. • Penguins cannot fly.
• We’d like to be able to conclude: – Opus cannot fly, and Tweety can
B(tweety)
∀*x : B(x) ∧ ¬Pg(x) ∧ ¬Dodo(x) ∧ ¬Dead(x) ∧ ¬... ⇒ F(x)
default reasoning: F(tweety) ?   ¬F(opus) ✔
NO: F(tweety) is only provable if Tweety is provably not a penguin and not a dodo and not dead and …
Recap: The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
Example: Boolean variables A, B, C
A B C Prob 0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
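As a sketch (not part of the original slides), the recipe above can be written directly in Python, using the example table for A, B, C:

```python
from itertools import product

# Joint distribution over Boolean A, B, C from the example table:
# one probability per row of the truth table (2**3 = 8 rows).
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# A valid joint must cover every combination and sum to 1.
assert set(joint) == set(product([0, 1], repeat=3))
assert abs(sum(joint.values()) - 1.0) < 1e-9

def marginal_A(a):
    """P(A=a): sum the joint over all values of the other variables."""
    return sum(p for (av, b, c), p in joint.items() if av == a)

print(marginal_A(1))  # 0.05 + 0.10 + 0.25 + 0.10 = 0.5
```

Any query about A, B, C can be answered by summing rows like this; the catch, as the slides note later, is that the table grows exponentially in M.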
Another difficult problem: common-sense reasoning
B(tweety)
∀*x : B(x) ∧ ¬Pg(x) ⇒ F(x)
Pg(opus)
∀x : Pg(x) ⇒ B(x)
∀x : Pg(x) ⇒ ¬F(x)
¬F(opus)   F(tweety)
Another difficult problem: common-sense reasoning
Pr(F,B,Pg) = Pr(F | B,Pg)Pr(B | Pg)Pr(Pg)
Pr(F | B,Pg)
Pr(B | Pg)Pr(Pg)
• Tweety is a bird. • Most non-penguin birds can fly.
• Opus is a penguin. • Penguins are birds. • Penguins cannot fly.
Pg B Pr(F=0|Pg,B) Pr(F=1|Pg,B)
0 0 0.90 0.10
0 1 0.01 0.99
1 0 0.5 0.5
1 1 0.9999 0.0001
Pg Pr(B=0|Pg) Pr(B=1|Pg)
0 0.90 0.10
1 0 1
∀*x : B(x) ∧ ¬Pg(x) ⇒ F(x)
∀x : Pg(x) ⇒ B(x)
Pr(Pg=0) Pr(Pg=1)
0.90 0.10
Pr(F,B,Pg) = Pr(F | B,Pg) Pr(B | Pg) Pr(Pg)
The joint for an experiment: I pick an object, say in Frick Park, and measure: can it fly, is it a bird, is it a penguin.
• Tweety is a bird. • Most birds can fly.
• Opus is a penguin. • Penguins are birds. • Penguins cannot fly.
Pr(F,B,Pg) = Pr(F | B,Pg)Pr(B | Pg)Pr(Pg)
Can Opus fly?
Pr(F=1 | Pg=1)
= Σ_{b=0,1} Pr(F=1 | B=b, Pg=1) Pr(B=b | Pg=1)
= Pr(F=1 | B=1, Pg=1) Pr(B=1 | Pg=1) + Pr(F=1 | B=0, Pg=1) Pr(B=0 | Pg=1)
where Pr(B=1 | Pg=1) = 1 and Pr(B=0 | Pg=1) = 0
Pg B Pr(F=0|Pg,B) Pr(F=1|Pg,B)
0 0 0.90 0.10
0 1 0.01 0.99
1 0 0.5 0.5
1 1 0.9999 0.0001
∀* x :B(x)∧¬Pg(x)⇒ F(x)
Unlikely: Pr(F=1|B=1,Pg=1)
• Tweety is a bird. • Most birds can fly.
• Opus is a penguin. • Penguins are birds. • Penguins cannot fly.
Pr(F,B,Pg) = Pr(F | B,Pg)Pr(B | Pg)Pr(Pg)
Can Tweety fly?  Pr(F=1 | B=1)
∝ Σ_{pg=0,1} Pr(F=1 | B=1, Pg=pg) Pr(B=1 | Pg=pg) Pr(Pg=pg)
= Pr(F=1 | B=1, Pg=1) Pr(B=1 | Pg=1) Pr(Pg=1) + Pr(F=1 | B=1, Pg=0) Pr(B=1 | Pg=0) Pr(Pg=0)
(the sum is Pr(F=1, B=1); dividing by Pr(B=1) gives the conditional)
Tweety is a flying penguin
Tweety is a flying non-penguin bird
If flying penguins are rare, it depends: • do non-penguin birds fly? Pr(F|B=1,Pg=0) • are all or most birds non-penguins? Pr(B=1|Pg=0) • are non-penguins common? Pr(Pg=0)
Pr(F=1 | B=1)
∝ Σ_{pg=0,1} Pr(F=1 | B=1, Pg=pg) Pr(B=1 | Pg=pg) Pr(Pg=pg)
= Pr(F=1 | B=1, Pg=1) Pr(B=1 | Pg=1) Pr(Pg=1) + Pr(F=1 | B=1, Pg=0) Pr(B=1 | Pg=0) Pr(Pg=0)
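The two queries above can be checked numerically. This is a sketch (not from the slides) using the CPT numbers given earlier for Pg (penguin), B (bird), and F (flies):

```python
# CPTs from the slides.
P_Pg = {0: 0.9, 1: 0.1}
P_B_given_Pg = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.0, 1: 1.0}}   # P(B=b | Pg)
P_F1_given = {(0, 0): 0.10, (0, 1): 0.99,                    # P(F=1 | Pg, B)
              (1, 0): 0.50, (1, 1): 0.0001}

# Can Opus fly?  Pr(F=1 | Pg=1) = sum_b Pr(F=1|B=b,Pg=1) Pr(B=b|Pg=1)
p_fly_penguin = sum(P_F1_given[(1, b)] * P_B_given_Pg[1][b] for b in (0, 1))

# Can Tweety fly?  Pr(F=1 | B=1) = Pr(F=1, B=1) / Pr(B=1)
p_fly_and_bird = sum(P_F1_given[(pg, 1)] * P_B_given_Pg[pg][1] * P_Pg[pg]
                     for pg in (0, 1))
p_bird = sum(P_B_given_Pg[pg][1] * P_Pg[pg] for pg in (0, 1))
p_fly_bird = p_fly_and_bird / p_bird
```

With these numbers, Opus almost certainly cannot fly (0.0001), while Tweety flies with probability about 0.47: being a bird raises the chance of being a penguin, which drags the flying probability down from 0.99.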
• Quiz:
• https://piazza.com/class/ij382zqa2572hc?cid=421
• https://piazza.com/class/ij382zqa2572hc?cid=420
Another difficult problem: common-sense reasoning
Pr(F,B,Pg) = Pr(F | B,Pg)Pr(B | Pg)Pr(Pg)
Pr(F | B,Pg)
Pr(B | Pg)Pr(Pg)
Have we solved the common-sense reasoning problem?
No: how do we (1) choose the conditional probabilities we need to model a task and (2) use them algorithmically to answer questions?
No: How do we invent numbers for all the rows of the CPTs?
Another difficult problem: common-sense reasoning
Pr(F,B,Pg) = Pr(F | B,Pg)Pr(B | Pg)Pr(Pg)
Pr(F | B,Pg)
Pr(B | Pg)Pr(Pg)
Have we solved the common-sense reasoning problem?
Yes: We use directed graphical models. • Semantics: how to specify them • Inference: how to use them • Learning: how to find parameters
Probabilities and probabilistic inference • Why is logic attractive?
- There are well-understood algorithms for reasoning with a logical theory.
• E.g.: we can use a computer to determine if B(x) ⇒ F(x).
• What about probabilities?
- We can do some math manually and answer many questions.
• Not really satisfying
- We can answer questions algorithmically with the joint
• E.g.: we can compute Pr(F=1|B=1) • But: this is not tractable for large models.
- How can we answer questions algorithmically and efficiently? • Answer: Graphical models
Probabilities and probabilistic inference • Directed graphical models
- Today: examples and semantics - Wednesday: inference algorithms - Next week:
• learning in graphical models • using graphical models to specify learning algorithms:
Naïve Bayes, LDA, HMMs, …
GRAPHICAL MODELS: SEMANTICS AND DEFINITIONS
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (A) pick a d20 uniformly at random then (B) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die. What is P(B)?
A B P(A=fair)=0.75 P(A=loaded)=0.25
P(B=critical | A=fair)=0.1 P(B=noncritical | A=fair)=0.9 P(B=critical | A=loaded)=0.5 P(B=noncritical | A=loaded)=0.5 A P(A)
Fair 0.75
Loaded 0.25
A B P(B|A)
Fair Critical 0.1
Fair Noncritical 0.9
Loaded Critical 0.5
Loaded Noncritical 0.5
Example: practical problem 1 made easy
P(A,B)=P(B|A)P(A)
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die. What is P(B)?
A B P(A=fair)=0.75 P(A=loaded)=0.25
P(B=critical | A=fair)=0.1 P(B=noncritical | A=fair)=0.9 P(B=critical | A=loaded)=0.5 P(B=noncritical | A=loaded)=0.5 A P(A)
Fair 0.75
Loaded 0.25
A B P(B|A)
Fair Critical 0.1
Fair Noncritical 0.9
Loaded Critical 0.5
Loaded Noncritical 0.5
Example: practical problem 1 made easy
What is Pr(A=1|B=1)?
P(A,B)=P(B|A)P(A)
We have the information we need to answer other questions as well
Example of inference
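As an illustration (not from the slides), both Pr(B=critical) and the reverse query Pr(A=fair | B=critical) follow from the two CPTs by marginalization and Bayes' rule:

```python
# CPTs from the dice example.
P_A = {"fair": 0.75, "loaded": 0.25}
P_B_given_A = {"fair": {"critical": 0.1, "noncritical": 0.9},
               "loaded": {"critical": 0.5, "noncritical": 0.5}}

# P(B=critical) by marginalizing out A.
p_crit = sum(P_B_given_A[a]["critical"] * P_A[a] for a in P_A)

# Bayes' rule: P(A=fair | B=critical) = P(B=critical|A=fair) P(A=fair) / P(B=critical)
p_fair_given_crit = P_B_given_A["fair"]["critical"] * P_A["fair"] / p_crit
```

Here Pr(B=critical) = 0.075 + 0.125 = 0.2, and a critical roll drops the probability that the die was fair from 0.75 to 0.375.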
In general: any chain-rule decomposition gives a DGM G:
• G has one node per random variable
• If P(X|Y1,…,Yk) is a factor in the decomposition, then
• G has edges from Y1 → X, …, Yk → X
• X is annotated with a conditional probability table (CPT) encoding P(X=x|Y1=y1,…,Yk=yk) for each tuple (x,y1,…,yk)
A B
P(A=fair)=0.75 P(A=loaded)=0.25 P(B=critical | A=fair)=0.1
P(B=noncritical | A=fair)=0.9 P(B=critical | A=loaded)=0.5 P(B=noncritical | A=loaded)=0.5
A P(A)
Fair 0.75
Loaded 0.25 A B P(B|A)
Fair Critical 0.1
Fair Noncritical 0.9
Loaded Critical 0.5
Loaded Noncritical 0.5
Example: practical problem 1 made easy
P(A,B)=P(B|A)P(A)
Applied chain rule
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die.
B A
B P(B)
critical …
non-critical …
B A P(A|B)
Critical Fair …
Noncritical Fair …
Critical Loaded …
Noncritical Loaded …
There’s more than one network for any distribution
P(A,B)=P(A|B)P(B)
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die. What is P(B)?
B A
There’s more than one network for any distribution
A B
• The moral: we have two things here
• a “generative story”, “causal model”, …
• a joint probability distribution e.g. P(A,B)
• a decomposition: P(A,B)=P(B|A)P(A)
• another decomposition: P(A,B)=P(A|B)P(B) -- totally valid!
• it’s usually cleaner to pick one that fits a “generative story”
P(A,B)=P(A|B)P(B)
P(A,B)=P(B|A)P(A)
There’s more than one network for any distribution
• There are lots of decompositions of a model with N variables
• They are all correct
• Some are better than others….
P(A,B,C,D) = P(A | B,C,D) P(B,C,D)
           = P(A | B,C,D) P(B | C,D) P(C,D)
           = P(A | B,C,D) P(B | C,D) P(C | D) P(D)

P(A,B,C,D) = P(D | A,B,C) P(A,B,C)
           = P(D | A,B,C) P(C | A,B) P(A,B)
           = P(D | A,B,C) P(C | A,B) P(B | A) P(A)
There’s more than one network for any distribution
Suppose there are some conditional independencies
• P(A|B,C,D)=P(A|B)
• P(B|C,D)=P(B|C)
Then the first decomposition can be simplified and compressed, the second can’t
P(A,B,C,D) = P(A | B,C,D) P(B,C,D)
           = P(A | B,C,D) P(B | C,D) P(C,D)
           = P(A | B,C,D) P(B | C,D) P(C | D) P(D)
           = P(A | B) P(B | C) P(C | D) P(D)

P(A,B,C,D) = P(D | A,B,C) P(A,B,C)
           = P(D | A,B,C) P(C | A,B) P(A,B)
           = P(D | A,B,C) P(C | A,B) P(B | A) P(A)
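The compression is easy to quantify. This sketch (assuming all four variables are Boolean; not from the slides) counts CPT rows for the full first decomposition versus the simplified one:

```python
from functools import reduce

arity = {"A": 2, "B": 2, "C": 2, "D": 2}  # assume all variables are Boolean

def cpt_rows(var, parents):
    """Rows in the CPT for P(var | parents): arity(var) * product of parent arities."""
    return arity[var] * reduce(lambda n, p: n * arity[p], parents, 1)

# Full first decomposition: P(A|B,C,D) P(B|C,D) P(C|D) P(D)
full = sum(cpt_rows(v, ps) for v, ps in
           [("A", ["B", "C", "D"]), ("B", ["C", "D"]), ("C", ["D"]), ("D", [])])

# Simplified using P(A|B,C,D)=P(A|B) and P(B|C,D)=P(B|C): P(A|B) P(B|C) P(C|D) P(D)
simplified = sum(cpt_rows(v, ps) for v, ps in
                 [("A", ["B"]), ("B", ["C"]), ("C", ["D"]), ("D", [])])
```

The full decomposition needs 30 CPT rows; the simplified chain needs only 14, and the gap widens exponentially as variables are added.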
The (highly practical) Monty Hall problem
• You’re in a game show. Behind one door is a prize. Behind the others, goats.
• You pick one of three doors, say #1
• The host, Monty Hall, opens one door, revealing…a goat!
You now can either
• stick with your guess
• always change doors
• flip a coin and pick a new door randomly according to the coin
Example: practical problem 2
Slide 31
The (highly practical) Monty Hall problem
• You’re in a game show. Behind one door is a prize. Behind the others, goats.
• You pick one of three doors, say #1 • The host, Monty Hall, opens one
door, revealing…a goat! • You now can either stick with your
guess or change doors
A B
First guess The money
C The revealed goat D
Stick, or swap?
E Second guess
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
D P(D)
Stick 0.5
Swap 0.5
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
P(C=c | A=a, B=b) =
  1.0 if (c ∉ {a,b}) ∧ (a ≠ b)
  0.5 if (c ∉ {a,b}) ∧ (a = b)
  0   otherwise
Slide 32
The (highly practical) Monty Hall problem
A B
First guess The money
C The goat D
Stick or swap?
E Second guess
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
P(C=c | A=a, B=b) =
  1.0 if (c ∉ {a,b}) ∧ (a ≠ b)
  0.5 if (c ∉ {a,b}) ∧ (a = b)
  0   otherwise
A C D P(E|A,C,D)
… … … …
P(E=e | A=a, C=c, D=d) =
  1.0 if (e = a) ∧ (d = stick)
  1.0 if (e ∉ {a,c}) ∧ (d = swap)
  0   otherwise
If you stick: you win if your first guess was right. If you swap: you win if your first guess was wrong.
Slide 33
The (highly practical) Monty Hall problem
A B
First guess The money
C The goat D
Stick or swap?
E Second guess
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
A C D P(E|A,C,D)
… … … …
…again by the chain rule:
P(A,B,C,D,E) =
P(E|A,C,D) *
P(D) *
P(C | A,B ) *
P(B ) *
P(A)
We could construct the joint and compute P(E=B|D=swap)
Slide 34
The (highly practical) Monty Hall problem
A B
First guess The money
C The goat D
Stick or swap?
E Second guess
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
A C D P(E|A,C,D)
… … … …
…again by the chain rule:
P(A,B,C,D,E) =
P(E | A,B,C,D) *
P(D | A,B,C) *
P(C | A,B ) *
P(B | A) *
P(A)
We could construct the joint and compute P(E=B|D=swap)
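Constructing the joint and answering the query can be done by brute-force enumeration. This is a sketch (not from the slides) using exact fractions so the classic 1/3 vs 2/3 answer comes out cleanly:

```python
from itertools import product
from fractions import Fraction

doors = (1, 2, 3)

def wins(policy):
    """P(second guess = money) under 'stick' or 'swap', summing the joint over A, B, C."""
    third = Fraction(1, 3)
    total = Fraction(0)
    for a, b in product(doors, repeat=2):                    # first guess A, money B
        openable = [c for c in doors if c != a and c != b]   # doors Monty may open
        for c in openable:
            p = third * third * Fraction(1, len(openable))   # P(A) P(B) P(C|A,B)
            if policy == "stick":
                e = a
            else:                                            # swap to the unopened door
                e = next(d for d in doors if d not in (a, c))
            if e == b:
                total += p
    return total

print(wins("stick"), wins("swap"))  # 1/3 2/3
```

Sticking wins exactly when the first guess was right (1/3); swapping wins whenever it was wrong (2/3), matching the slide's "follow the money" argument.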
Slide 35
The (highly practical) Monty Hall problem
A B
First guess The money
C The goat D
Stick or swap?
E Second guess
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
A C D P(E|A,C,D)
… … … …
The joint table has…?
3*3*3*2*3 = 162 rows
The conditional probability tables (CPTs)
shown have … ?
3 + 3 + 3*3*3 + 2*3*3 = 51 rows < 162 rows
Big questions:
• why are the CPTs smaller?
• how much smaller are the CPTs than the joint?
• can we compute the answers to queries like P(E=B|d) without building the joint probability tables, just using the CPTs?
Slide 36
The (highly practical) Monty Hall problem
A B
First guess The money
C The goat D
Stick or swap?
E Second guess
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
A C D P(E|A,C,D)
… … … …
Why is the CPTs representation smaller? Follow the money! (B)
P(E=e | A=a, C=c, D=d) =
  1.0 if (e = a) ∧ (d = stick)
  1.0 if (e ∉ {a,c}) ∧ (d = swap)
  0   otherwise

∀a,b,c,d,e : P(E=e | A=a, B=b, C=c, D=d) = P(E=e | A=a, C=c, D=d)

E is conditionally independent of B given A, C, D

E ⊥ B | A, C, D, i.e. I<E, {A,C,D}, B>
Slide 37
Conditional Independence formalized
Definition: R and L are conditionally independent given M if for all x,y,z in {T,F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally:
Let S1, S2, and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets,
P(S1's assignments | S2's assignments ^ S3's assignments) = P(S1's assignments | S3's assignments)
Slide 38
The (highly practical) Monty Hall problem
A B
First guess The money
C The goat D
Stick or swap?
E Second guess
What are the conditional independencies? • I<A, {B}, C>? • I<A, {C}, B>? • I<E, {A,C}, B>? • I<D, {E}, B>? • …
Slide 39
Recap: Bayes Nets Formalized A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair V , E where:
- V is a set of vertices. - E is a set of directed edges joining vertices. No loops of
any length are allowed.
Each vertex in V contains the following information: - The name of a random variable - A probability distribution table indicating how the
probability of this variable’s values depends on all possible combinations of parent values.
Building a Bayes Net • Choose a set of relevant variables. • Choose an ordering for them • Assume they’re called X1 .. Xm (where X1 is first in ordering, etc) • For i = 1 to m:
- Add the Xi node to the network - Set Parents(Xi ) to be a minimal subset of {X1…Xi-1} such that we
have conditional independence of Xi and all other members of {X1…Xi-1} given Parents(Xi )
- Define the probability table • P(Xi=k | Assignments of Parents(Xi)).
The general case
P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X1=x1) * P(Xn-1=xn-1 | … ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X1=x1)
= …
= ∏_{i=1}^{n} P(Xi=xi | Xi-1=xi-1 ^ … ^ X1=x1)
= ∏_{i=1}^{n} P(Xi=xi | Assignments of Parents(Xi))
So any entry in joint pdf table can be computed. And so any conditional probability can be computed.
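That product formula translates directly to code. This sketch (not from the slides) represents a net as a parents map plus CPTs, reusing the F/B/Pg numbers from earlier:

```python
# A Bayes net as (parents, CPT) per variable; CPT rows keyed by
# (own value, parent values...).  Numbers from the F/B/Pg example.
parents = {"Pg": (), "B": ("Pg",), "F": ("B", "Pg")}
cpt = {
    "Pg": {(0,): 0.9, (1,): 0.1},
    "B":  {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.0, (1, 1): 1.0},
    "F":  {(0, 0, 0): 0.90, (1, 0, 0): 0.10, (0, 1, 0): 0.01, (1, 1, 0): 0.99,
           (0, 0, 1): 0.50, (1, 0, 1): 0.50, (0, 1, 1): 0.9999, (1, 1, 1): 0.0001},
}

def joint_entry(assignment):
    """P(assignment) = product over variables of P(x_i | assignments of Parents(x_i))."""
    p = 1.0
    for var, ps in parents.items():
        key = (assignment[var],) + tuple(assignment[q] for q in ps)
        p *= cpt[var][key]
    return p
```

For example joint_entry({"Pg": 0, "B": 1, "F": 1}) = 0.9 * 0.1 * 0.99 = 0.0891, and summing joint_entry over all assignments gives 1, so any conditional probability follows by summing the right entries.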
Question: given a network can I find a chain-rule decomposition of the joint?
GRAPHICAL MODELS: DETERMINING CONDITIONAL
INDEPENDENCIES
What Independencies does a Bayes Net Model?
• In order for a Bayesian network to model a probability distribution, the following must be true: Each variable is conditionally independent of all its non-
descendants in the graph given the value of all its parents.
• This follows from
• But what else does it imply?
P(X1…Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi)) = ∏_{i=1}^{n} P(Xi | X1…Xi−1)
What Independencies does a Bayes Net Model?
• Example:
Z
Y
X
Given Y, does learning the value of Z tell us nothing new about X? I.e., is P(X|Y, Z) equal to P(X | Y)? Yes. Since we know the value of all of X’s parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z. Also, since independence is symmetric, P(Z|Y, X) = P(Z|Y).
Quick proof that independence is symmetric
• Assume: P(X|Y, Z) = P(X|Y) • Then:

P(Z | X,Y) = P(X,Y | Z) P(Z) / P(X,Y)                    (Bayes' rule)
           = P(X | Y,Z) P(Y | Z) P(Z) / (P(X | Y) P(Y))  (chain rule)
           = P(X | Y) P(Y | Z) P(Z) / (P(X | Y) P(Y))    (by assumption)
           = P(Y | Z) P(Z) / P(Y) = P(Z | Y)             (Bayes' rule)
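The symmetry can also be checked numerically. This sketch (not from the slides; all numbers are made up) builds a joint where X ⊥ Z | Y holds by construction and verifies both directions:

```python
from itertools import product

# Construct a joint where X is independent of Z given Y by design:
# P(x,y,z) = P(y) P(x|y) P(z|y).
P_Y = {0: 0.4, 1: 0.6}
P_X_given_Y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
P_Z_given_Y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}
joint = {(x, y, z): P_Y[y] * P_X_given_Y[y][x] * P_Z_given_Y[y][z]
         for x, y, z in product((0, 1), repeat=3)}

def cond(target, given):
    """P(target | given) from the joint; target/given map row-index -> value."""
    def match(row, spec):
        return all(row[i] == v for i, v in spec.items())
    num = sum(p for row, p in joint.items() if match(row, {**given, **target}))
    den = sum(p for row, p in joint.items() if match(row, given))
    return num / den

# Rows are (x, y, z): index 0 is X, 1 is Y, 2 is Z.
lhs_x = cond({0: 1}, {1: 0, 2: 1})   # P(X=1 | Y=0, Z=1)
rhs_x = cond({0: 1}, {1: 0})         # P(X=1 | Y=0)
lhs_z = cond({2: 1}, {1: 0, 0: 1})   # P(Z=1 | Y=0, X=1)
rhs_z = cond({2: 1}, {1: 0})         # P(Z=1 | Y=0)
```

Both equalities hold: knowing Z adds nothing about X given Y, and symmetrically knowing X adds nothing about Z.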
What Independencies does a Bayes Net Model?
• Let I<X,Y,Z> represent X and Z being conditionally independent given Y.
• I<X,Y,Z>? Yes, just as in previous example: All X’s parents given, and Z is not a descendant.
Y
X Z
What Independencies does a Bayes Net Model?
• I<X,{U},Z>? No. • I<X,{U,V},Z>? Yes. • Maybe I<X, S, Z> iff S acts as a cutset
between X and Z in an undirected version of the graph…?
Z
V U
X
Things get a little more confusing • X has no parents, so we know all its parents' values trivially • Z is not a descendant of X
• So, I<X,{},Z>, even though there's an undirected path from X to Z through an unknown variable Y.
• What if we do know the value of Y, though? Or one of its descendants?
Z X
Y
The “Burglar Alarm” example • Your house has a twitchy burglar
alarm that is also sometimes triggered by earthquakes.
• Earth arguably doesn’t care whether your house is currently being burgled
• While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!
Burglar Earthquake
Alarm
Phone Call
Things get a lot more confusing • But now suppose you learn
that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all.
• Earthquake “explains away” the hypothetical burglar.
• But then it must not be the case that
I<Burglar,{Phone Call}, Earthquake>, even though
I<Burglar,{}, Earthquake>!
Burglar Earthquake
Alarm
Phone Call
d-separation to the rescue • Fortunately, there is a relatively simple algorithm for
determining whether two variables in a Bayesian network are conditionally independent: d-separation.
• Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is “blocked”, where a path is “blocked” iff one or more of the following conditions is true: ...
i.e., X and Z are dependent iff there exists an unblocked path
A path is “blocked” when...
• There exists a variable Y on the path such that - it is in the evidence set E
- the arcs putting Y in the path are “tail-to-tail”
• Or, there exists a variable Y on the path such that - it is in the evidence set E
- the arcs putting Y in the path are “tail-to-head”
• Or, ...
Y
Y
unknown “common causes” of X and Z impose dependency
unknown “causal chains” connecting X an Z impose dependency
A path is “blocked” when… (the funky case)
• … Or, there exists a variable Y on the path such that - it is NOT in the evidence set E
- neither are any of its descendants - the arcs putting Y on the path are "head-to-head"
Y Known “common symptoms” of X and Z impose dependencies… X may “explain away” Z
d-separation to the rescue, cont’d • Theorem [Verma & Pearl, 1998]:
- If a set of evidence variables E d-separates X and Z in a Bayesian network’s graph, then I<X, E, Z>.
• d-separation can be computed in linear time using a depth-first-search-like algorithm.
• Be careful: d-separation finds what must be conditionally independent - "Might": variables may actually be independent even when they're not d-separated, depending on the actual probabilities involved
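For small graphs, the blocking rules can be applied directly by checking every undirected path. This is a sketch (not from the slides; a simple path-enumeration approach rather than the linear-time algorithm mentioned above), tested on the burglar-alarm network:

```python
def d_separated(edges, x, z, evidence):
    """Check d-separation by testing every undirected simple path from x to z.

    edges: set of directed (parent, child) pairs.  A path is blocked iff it
    contains a non-collider node in `evidence`, or a collider ("head-to-head")
    node with neither itself nor any descendant in `evidence`.
    """
    children, nodes = {}, {u for e in edges for u in e}
    for u, v in edges:
        children.setdefault(u, set()).add(v)

    def descendants(n):
        out, stack = set(), [n]
        while stack:
            for c in children.get(stack.pop(), ()):
                if c not in out:
                    out.add(c)
                    stack.append(c)
        return out

    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    def paths(cur, path):
        if cur == z:
            yield path
            return
        for nxt in neighbors[cur]:
            if nxt not in path:
                yield from paths(nxt, path + [nxt])

    def blocked(path):
        for a, b, c in zip(path, path[1:], path[2:]):
            collider = (a, b) in edges and (c, b) in edges
            if collider:
                if b not in evidence and not (descendants(b) & evidence):
                    return True          # collider with no evidence below it
            elif b in evidence:
                return True              # chain or common cause, observed
        return False

    return all(blocked(p) for p in paths(x, [x]))

# Burglar-alarm net: Burglar -> Alarm <- Earthquake, Alarm -> PhoneCall.
alarm_net = {("B", "A"), ("E", "A"), ("A", "P")}
```

With no evidence, Burglar and Earthquake are d-separated; observing the phone call (a descendant of the collider Alarm) unblocks the path, reproducing the "explaining away" effect above.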
d-separation example
A B
C D
E F
G
I
H
J
• I<C, {}, D>? • I<C, {A}, D>? • I<C, {A, B}, D>? • I<C, {A, B, J}, D>? • I<C, {A, B, E, J}, D>?