Directed Graphical Models (wcohen/10-601/networks-1.pdf, 2016-03-21, Machine Learning 10-601)
TRANSCRIPT
Directed Graphical Models
William W. Cohen Machine Learning 10-601
MOTIVATION FOR GRAPHICAL MODELS
Recap: A paradox of induction
• A black crow seems to support the hypothesis “all crows are black”.
• A pink highlighter supports the hypothesis “all non-black things are non-crows”
• Thus, a pink highlighter supports the hypothesis “all crows are black”.
∀x : CROW(x) ⇒ BLACK(x)
or equivalently
∀x : ¬BLACK(x) ⇒ ¬CROW(x)
whut?
B = black C = crow
          crows   non-crows
black
not black

collect statistics for P(B=b|C=c)
Logical reasoning versus common-sense reasoning
BLACK(jim) FLY(jim) BIRD(jim) EATS(jim,carrion) …
Another difficult problem: common-sense reasoning • Tweety is a bird. • Most birds can fly.
• Opus is a penguin. • Penguins are birds. • Penguins cannot fly.
• We’d like to be able to conclude: – Opus cannot fly, and Tweety can
B(tweety)
∀*x : B(x) ⇒ F(x)
Pg(opus)
∀x : Pg(x) ⇒ B(x)
∀x : Pg(x) ⇒ ¬F(x)
default reasoning: F(tweety) ✔   F(opus) ∧ ¬F(opus) ✖
Logically…
Another difficult problem: common-sense reasoning • Tweety is a bird. • Most birds can fly.
• Opus is a penguin. • Penguins are birds. • Penguins cannot fly.
• We’d like to be able to conclude: – Opus cannot fly, and Tweety can
B(tweety)
∀*x : B(x) ∧ ¬Pg(x) ⇒ F(x)
Pg(opus)
∀x : Pg(x) ⇒ B(x)
∀x : Pg(x) ⇒ ¬F(x)
default reasoning: F(tweety) ?   ¬F(opus) ✔
NO: F(tweety) only provable if he’s provably NOT a penguin
Another difficult problem: common-sense reasoning • Tweety is a bird. • Most birds can fly.
• Opus is a penguin. • Penguins are birds. • Penguins cannot fly.
• We’d like to be able to conclude: – Opus cannot fly, and Tweety can
B(tweety)
∀*x : B(x) ∧ ¬Pg(x) ∧ ¬Dodo(x) ∧ ¬Dead(x) ∧ ¬... ⇒ F(x)
default reasoning: F(tweety) ?   ¬F(opus) ✔
NO: F(tweety) is only provable if Tweety is provably not a penguin and not a dodo and not dead and …
Recap: The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
Example: Boolean variables A, B, C
A B C Prob 0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
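As a sketch (not part of the original slides), the recipe above can be written directly in Python, using the example table for A, B, C:

```python
from itertools import product

# Joint distribution over Boolean A, B, C from the example table:
# one probability per row of the truth table (2**3 = 8 rows).
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# A valid joint must cover every combination and sum to 1.
assert set(joint) == set(product([0, 1], repeat=3))
assert abs(sum(joint.values()) - 1.0) < 1e-9

def marginal_A(a):
    """P(A=a): sum the joint over all values of the other variables."""
    return sum(p for (av, b, c), p in joint.items() if av == a)

print(marginal_A(1))  # 0.05 + 0.10 + 0.25 + 0.10 = 0.5
```

Any query about A, B, C can be answered by summing rows like this; the catch, as the slides note later, is that the table grows exponentially in M.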
Another difficult problem: common-sense reasoning
B(tweety)
∀*x : B(x) ∧ ¬Pg(x) ⇒ F(x)
Pg(opus)
∀x : Pg(x) ⇒ B(x)
∀x : Pg(x) ⇒ ¬F(x)
¬F(opus)   F(tweety)
Another difficult problem: common-sense reasoning
Pr(F,B,Pg) = Pr(F | B,Pg)Pr(B | Pg)Pr(Pg)
Pr(F | B,Pg)
Pr(B | Pg)Pr(Pg)
• Tweety is a bird. • Most non-penguin birds can fly.
• Opus is a penguin. • Penguins are birds. • Penguins cannot fly.
Pg B Pr(F=0|Pg,B) Pr(F=1|Pg,B)
0 0 0.90 0.10
0 1 0.01 0.99
1 0 0.5 0.5
1 1 0.9999 0.0001
Pg Pr(B=0|Pg) Pr(B=1|Pg)
0 0.90 0.10
1 0 1
∀*x : B(x) ∧ ¬Pg(x) ⇒ F(x)
∀x : Pg(x) ⇒ B(x)
Pr(Pg=0) Pr(Pg=1)
0.90 0.10
Pr(F,B,Pg) = Pr(F | B,Pg) Pr(B | Pg) Pr(Pg)
The joint for an experiment: I pick an object, say in Frick Park, and measure: can it fly, is it a bird, is it a penguin.
• Tweety is a bird. • Most birds can fly.
• Opus is a penguin. • Penguins are birds. • Penguins cannot fly.
Pr(F,B,Pg) = Pr(F | B,Pg)Pr(B | Pg)Pr(Pg)
Can Opus fly?
Pr(F=1 | Pg=1)
= Σ_{b=0,1} Pr(F=1 | B=b, Pg=1) Pr(B=b | Pg=1)
= Pr(F=1 | B=1, Pg=1) Pr(B=1 | Pg=1) + Pr(F=1 | B=0, Pg=1) Pr(B=0 | Pg=1)
where Pr(B=1 | Pg=1) = 1 and Pr(B=0 | Pg=1) = 0
Pg B Pr(F=0|Pg,B) Pr(F=1|Pg,B)
0 0 0.90 0.10
0 1 0.01 0.99
1 0 0.5 0.5
1 1 0.9999 0.0001
∀* x :B(x)∧¬Pg(x)⇒ F(x)
Unlikely: Pr(F=1|B=1,Pg=1)
• Tweety is a bird. • Most birds can fly.
• Opus is a penguin. • Penguins are birds. • Penguins cannot fly.
Pr(F,B,Pg) = Pr(F | B,Pg)Pr(B | Pg)Pr(Pg)
Can Tweety fly?  Pr(F=1 | B=1)
∝ Σ_{pg=0,1} Pr(F=1 | B=1, Pg=pg) Pr(B=1 | Pg=pg) Pr(Pg=pg)
= Pr(F=1 | B=1, Pg=1) Pr(B=1 | Pg=1) Pr(Pg=1) + Pr(F=1 | B=1, Pg=0) Pr(B=1 | Pg=0) Pr(Pg=0)
(the sum is Pr(F=1, B=1); dividing by Pr(B=1) gives the conditional)
Tweety is a flying penguin
Tweety is a flying non-penguin bird
If flying penguins are rare, it depends: • do non-penguin birds fly? Pr(F|B=1,Pg=0) • are all or most birds non-penguins? Pr(B=1|Pg=0) • are non-penguins common? Pr(Pg=0)
Pr(F=1 | B=1)
∝ Σ_{pg=0,1} Pr(F=1 | B=1, Pg=pg) Pr(B=1 | Pg=pg) Pr(Pg=pg)
= Pr(F=1 | B=1, Pg=1) Pr(B=1 | Pg=1) Pr(Pg=1) + Pr(F=1 | B=1, Pg=0) Pr(B=1 | Pg=0) Pr(Pg=0)
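The two queries above can be checked numerically. This is a sketch (not from the slides) using the CPT numbers given earlier for Pg (penguin), B (bird), and F (flies):

```python
# CPTs from the slides.
P_Pg = {0: 0.9, 1: 0.1}
P_B_given_Pg = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.0, 1: 1.0}}   # P(B=b | Pg)
P_F1_given = {(0, 0): 0.10, (0, 1): 0.99,                    # P(F=1 | Pg, B)
              (1, 0): 0.50, (1, 1): 0.0001}

# Can Opus fly?  Pr(F=1 | Pg=1) = sum_b Pr(F=1|B=b,Pg=1) Pr(B=b|Pg=1)
p_fly_penguin = sum(P_F1_given[(1, b)] * P_B_given_Pg[1][b] for b in (0, 1))

# Can Tweety fly?  Pr(F=1 | B=1) = Pr(F=1, B=1) / Pr(B=1)
p_fly_and_bird = sum(P_F1_given[(pg, 1)] * P_B_given_Pg[pg][1] * P_Pg[pg]
                     for pg in (0, 1))
p_bird = sum(P_B_given_Pg[pg][1] * P_Pg[pg] for pg in (0, 1))
p_fly_bird = p_fly_and_bird / p_bird
```

With these numbers, Opus almost certainly cannot fly (0.0001), while Tweety flies with probability about 0.47: being a bird raises the chance of being a penguin, which drags the flying probability down from 0.99.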
• Quiz:
• https://piazza.com/class/ij382zqa2572hc?cid=421
• https://piazza.com/class/ij382zqa2572hc?cid=420
Another difficult problem: common-sense reasoning
Pr(F,B,Pg) = Pr(F | B,Pg)Pr(B | Pg)Pr(Pg)
Pr(F | B,Pg)
Pr(B | Pg)Pr(Pg)
Have we solved the common-sense reasoning problem?
No: how do we (1) choose the conditional probabilities we need to model a task and (2) use them algorithmically to answer questions?
No: How do we invent numbers for all the rows of the CPTs?
Another difficult problem: common-sense reasoning
Pr(F,B,Pg) = Pr(F | B,Pg)Pr(B | Pg)Pr(Pg)
Pr(F | B,Pg)
Pr(B | Pg)Pr(Pg)
Have we solved the common-sense reasoning problem?
Yes: We use directed graphical models. • Semantics: how to specify them • Inference: how to use them • Learning: how to find parameters
Probabilities and probabilistic inference • Why is logic attractive?
- There are well-understood algorithms for reasoning with a logical theory.
• E.g.: we can use a computer to determine if B(x) ⇒ F(x).
• What about probabilities?
- We can do some math manually and answer many questions.
• Not really satisfying
- We can answer questions algorithmically with the joint
• E.g.: we can compute Pr(F=1|B=1) • But: this is not tractable for large models.
- How can we answer questions algorithmically and efficiently? • Answer: Graphical models
Probabilities and probabilistic inference • Directed graphical models
- Today: examples and semantics - Wednesday: inference algorithms - Next week:
• learning in graphical models • using graphical models to specify learning algorithms:
Naïve Bayes, LDA, HMMs, …
GRAPHICAL MODELS: SEMANTICS AND DEFINITIONS
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (A) pick a d20 uniformly at random then (B) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die. What is P(B)?
A B P(A=fair)=0.75 P(A=loaded)=0.25
P(B=critical | A=fair)=0.1 P(B=noncritical | A=fair)=0.9 P(B=critical | A=loaded)=0.5 P(B=noncritical | A=loaded)=0.5 A P(A)
Fair 0.75
Loaded 0.25
A B P(B|A)
Fair Critical 0.1
Fair Noncritical 0.9
Loaded Critical 0.5
Loaded Noncritical 0.5
Example: practical problem 1 made easy
P(A,B)=P(B|A)P(A)
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die. What is P(B)?
A B P(A=fair)=0.75 P(A=loaded)=0.25
P(B=critical | A=fair)=0.1 P(B=noncritical | A=fair)=0.9 P(B=critical | A=loaded)=0.5 P(B=noncritical | A=loaded)=0.5 A P(A)
Fair 0.75
Loaded 0.25
A B P(B|A)
Fair Critical 0.1
Fair Noncritical 0.9
Loaded Critical 0.5
Loaded Noncritical 0.5
Example: practical problem 1 made easy
What is Pr(A=1|B=1)?
P(A,B)=P(B|A)P(A)
We have the information we need to answer other questions as well
Example of inference
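As an illustration (not from the slides), both Pr(B=critical) and the reverse query Pr(A=fair | B=critical) follow from the two CPTs by marginalization and Bayes' rule:

```python
# CPTs from the dice example.
P_A = {"fair": 0.75, "loaded": 0.25}
P_B_given_A = {"fair": {"critical": 0.1, "noncritical": 0.9},
               "loaded": {"critical": 0.5, "noncritical": 0.5}}

# P(B=critical) by marginalizing out A.
p_crit = sum(P_B_given_A[a]["critical"] * P_A[a] for a in P_A)

# Bayes' rule: P(A=fair | B=critical) = P(B=critical|A=fair) P(A=fair) / P(B=critical)
p_fair_given_crit = P_B_given_A["fair"]["critical"] * P_A["fair"] / p_crit
```

Here Pr(B=critical) = 0.075 + 0.125 = 0.2, and a critical roll drops the probability that the die was fair from 0.75 to 0.375.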
In general: any chain-rule decomposition gives a DGM G:
• G has one node per random variable
• If P(X|Y1,…,Yk) is a factor in the decomposition, then
• G has edges from Y1 → X, …, Yk → X
• X is annotated with a conditional probability table (CPT) encoding P(X=x|Y1=y1,…,Yk=yk) for each tuple (x,y1,…,yk)
A B
P(A=fair)=0.75 P(A=loaded)=0.25 P(B=critical | A=fair)=0.1
P(B=noncritical | A=fair)=0.9 P(B=critical | A=loaded)=0.5 P(B=noncritical | A=loaded)=0.5
A P(A)
Fair 0.75
Loaded 0.25 A B P(B|A)
Fair Critical 0.1
Fair Noncritical 0.9
Loaded Critical 0.5
Loaded Noncritical 0.5
Example: practical problem 1 made easy
P(A,B)=P(B|A)P(A)
Applied chain rule
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die.
B A
B P(B)
critical …
non-critical …
B A P(A|B)
Critical Fair …
Noncritical Fair …
Critical Loaded …
Noncritical Loaded …
There’s more than one network for any distribution
P(A,B)=P(A|B)P(B)
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die. What is P(B)?
B A
There’s more than one network for any distribution
A B
• The moral: we have two things here
• a “generative story”, “causal model”, …
• a joint probability distribution e.g. P(A,B)
• a decomposition: P(A,B)=P(B|A)P(A)
• another decomposition: P(A,B)=P(A|B)P(B) -- totally valid!
• it’s usually cleaner to pick one that fits a “generative story”
P(A,B)=P(A|B)P(B)
P(A,B)=P(B|A)P(A)
There’s more than one network for any distribution
• There are lots of decompositions of a model with N variables
• They are all correct
• Some are better than others….
P(A,B,C,D) = P(A | B,C,D) P(B,C,D)
           = P(A | B,C,D) P(B | C,D) P(C,D)
           = P(A | B,C,D) P(B | C,D) P(C | D) P(D)

P(A,B,C,D) = P(D | A,B,C) P(A,B,C)
           = P(D | A,B,C) P(C | A,B) P(A,B)
           = P(D | A,B,C) P(C | A,B) P(B | A) P(A)
There’s more than one network for any distribution
Suppose there are some conditional independencies
• P(A|B,C,D)=P(A|B)
• P(B|C,D)=P(B|C)
Then the first decomposition can be simplified and compressed, the second can’t
P(A,B,C,D) = P(A | B,C,D) P(B,C,D)
           = P(A | B,C,D) P(B | C,D) P(C,D)
           = P(A | B,C,D) P(B | C,D) P(C | D) P(D)
           = P(A | B) P(B | C) P(C | D) P(D)

P(A,B,C,D) = P(D | A,B,C) P(A,B,C)
           = P(D | A,B,C) P(C | A,B) P(A,B)
           = P(D | A,B,C) P(C | A,B) P(B | A) P(A)
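The compression is easy to quantify. This sketch (assuming all four variables are Boolean; not from the slides) counts CPT rows for the full first decomposition versus the simplified one:

```python
from functools import reduce

arity = {"A": 2, "B": 2, "C": 2, "D": 2}  # assume all variables are Boolean

def cpt_rows(var, parents):
    """Rows in the CPT for P(var | parents): arity(var) * product of parent arities."""
    return arity[var] * reduce(lambda n, p: n * arity[p], parents, 1)

# Full first decomposition: P(A|B,C,D) P(B|C,D) P(C|D) P(D)
full = sum(cpt_rows(v, ps) for v, ps in
           [("A", ["B", "C", "D"]), ("B", ["C", "D"]), ("C", ["D"]), ("D", [])])

# Simplified using P(A|B,C,D)=P(A|B) and P(B|C,D)=P(B|C): P(A|B) P(B|C) P(C|D) P(D)
simplified = sum(cpt_rows(v, ps) for v, ps in
                 [("A", ["B"]), ("B", ["C"]), ("C", ["D"]), ("D", [])])
```

The full decomposition needs 30 CPT rows; the simplified chain needs only 14, and the gap widens exponentially as variables are added.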
The (highly practical) Monty Hall problem
• You’re in a game show. Behind one door is a prize. Behind the others, goats.
• You pick one of three doors, say #1
• The host, Monty Hall, opens one door, revealing…a goat!
You now can either
• stick with your guess
• always change doors
• flip a coin and pick a new door randomly according to the coin
Example: practical problem 2
Slide 31
The (highly practical) Monty Hall problem
• You’re in a game show. Behind one door is a prize. Behind the others, goats.
• You pick one of three doors, say #1 • The host, Monty Hall, opens one
door, revealing…a goat! • You now can either stick with your
guess or change doors
A B
First guess The money
C The revealed goat D
Stick, or swap?
E Second guess
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
D P(D)
Stick 0.5
Swap 0.5
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
P(C=c | A=a, B=b) =
  1.0 if (c ∉ {a,b}) ∧ (a ≠ b)
  0.5 if (c ∉ {a,b}) ∧ (a = b)
  0   otherwise
Slide 32
The (highly practical) Monty Hall problem
A B
First guess The money
C The goat D
Stick or swap?
E Second guess
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
P(C=c | A=a, B=b) =
  1.0 if (c ∉ {a,b}) ∧ (a ≠ b)
  0.5 if (c ∉ {a,b}) ∧ (a = b)
  0   otherwise
A C D P(E|A,C,D)
… … … …
P(E=e | A=a, C=c, D=d) =
  1.0 if (e = a) ∧ (d = stick)
  1.0 if (e ∉ {a,c}) ∧ (d = swap)
  0   otherwise
If you stick: you win if your first guess was right. If you swap: you win if your first guess was wrong.
Slide 33
The (highly practical) Monty Hall problem
A B
First guess The money
C The goat D
Stick or swap?
E Second guess
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
A C D P(E|A,C,D)
… … … …
…again by the chain rule:
P(A,B,C,D,E) =
P(E|A,C,D) *
P(D) *
P(C | A,B ) *
P(B ) *
P(A)
We could construct the joint and compute P(E=B|D=swap)
Slide 34
The (highly practical) Monty Hall problem
A B
First guess The money
C The goat D
Stick or swap?
E Second guess
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
A C D P(E|A,C,D)
… … … …
…again by the chain rule:
P(A,B,C,D,E) =
P(E | A,B,C,D) *
P(D | A,B,C) *
P(C | A,B ) *
P(B | A) *
P(A)
We could construct the joint and compute P(E=B|D=swap)
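Constructing the joint and answering the query can be done by brute-force enumeration. This is a sketch (not from the slides) using exact fractions so the classic 1/3 vs 2/3 answer comes out cleanly:

```python
from itertools import product
from fractions import Fraction

doors = (1, 2, 3)

def wins(policy):
    """P(second guess = money) under 'stick' or 'swap', summing the joint over A, B, C."""
    third = Fraction(1, 3)
    total = Fraction(0)
    for a, b in product(doors, repeat=2):                    # first guess A, money B
        openable = [c for c in doors if c != a and c != b]   # doors Monty may open
        for c in openable:
            p = third * third * Fraction(1, len(openable))   # P(A) P(B) P(C|A,B)
            if policy == "stick":
                e = a
            else:                                            # swap to the unopened door
                e = next(d for d in doors if d not in (a, c))
            if e == b:
                total += p
    return total

print(wins("stick"), wins("swap"))  # 1/3 2/3
```

Sticking wins exactly when the first guess was right (1/3); swapping wins whenever it was wrong (2/3), matching the slide's "follow the money" argument.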
Slide 35
The (highly practical) Monty Hall problem
A B
First guess The money
C The goat D
Stick or swap?
E Second guess
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
A C D P(E|A,C,D)
… … … …
The joint table has…?
3*3*3*2*3 = 162 rows
The conditional probability tables (CPTs)
shown have … ?
3 + 3 + 3*3*3 + 2*3*3 = 51 rows < 162 rows
Big questions:
• why are the CPTs smaller?
• how much smaller are the CPTs than the joint?
• can we compute the answers to queries like P(E=B|d) without building the joint probability tables, just using the CPTs?
Slide 36
The (highly practical) Monty Hall problem
A B
First guess The money
C The goat D
Stick or swap?
E Second guess
A P(A)
1 0.33
2 0.33
3 0.33
B P(B)
1 0.33
2 0.33
3 0.33
A B C P(C|A,B)
1 1 2 0.5
1 1 3 0.5
1 2 3 1.0
1 3 2 1.0
… … … …
A C D P(E|A,C,D)
… … … …
Why is the CPTs representation smaller? Follow the money! (B)
P(E=e | A=a, C=c, D=d) =
  1.0 if (e = a) ∧ (d = stick)
  1.0 if (e ∉ {a,c}) ∧ (d = swap)
  0   otherwise

∀a,b,c,d,e : P(E=e | A=a, B=b, C=c, D=d) = P(E=e | A=a, C=c, D=d)

E is conditionally independent of B given A, C, D

E ⊥ B | A, C, D, i.e. I<E, {A,C,D}, B>
Slide 37
Conditional Independence formalized
Definition: R and L are conditionally independent given M if for all x,y,z in {T,F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally:
Let S1, S2, and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets,
P(S1's assignments | S2's assignments ^ S3's assignments) = P(S1's assignments | S3's assignments)
Slide 38
The (highly practical) Monty Hall problem
A B
First guess The money
C The goat D
Stick or swap?
E Second guess
What are the conditional independencies? • I<A, {B}, C>? • I<A, {C}, B>? • I<E, {A,C}, B>? • I<D, {E}, B>? • …
Slide 39
Recap: Bayes Nets Formalized A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair V , E where:
- V is a set of vertices. - E is a set of directed edges joining vertices. No loops of
any length are allowed.
Each vertex in V contains the following information: - The name of a random variable - A probability distribution table indicating how the
probability of this variable’s values depends on all possible combinations of parent values.
Building a Bayes Net • Choose a set of relevant variables. • Choose an ordering for them • Assume they’re called X1 .. Xm (where X1 is first in ordering, etc) • For i = 1 to m:
- Add the Xi node to the network - Set Parents(Xi ) to be a minimal subset of {X1…Xi-1} such that we
have conditional independence of Xi and all other members of {X1…Xi-1} given Parents(Xi )
- Define the probability table • P(Xi=k | Assignments of Parents(Xi)).
The general case
P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X1=x1) * P(Xn-1=xn-1 | … ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X1=x1)
= …
= ∏_{i=1}^{n} P(Xi=xi | Xi-1=xi-1 ^ … ^ X1=x1)
= ∏_{i=1}^{n} P(Xi=xi | Assignments of Parents(Xi))
So any entry in joint pdf table can be computed. And so any conditional probability can be computed.
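That product formula translates directly to code. This sketch (not from the slides) represents a net as a parents map plus CPTs, reusing the F/B/Pg numbers from earlier:

```python
# A Bayes net as (parents, CPT) per variable; CPT rows keyed by
# (own value, parent values...).  Numbers from the F/B/Pg example.
parents = {"Pg": (), "B": ("Pg",), "F": ("B", "Pg")}
cpt = {
    "Pg": {(0,): 0.9, (1,): 0.1},
    "B":  {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.0, (1, 1): 1.0},
    "F":  {(0, 0, 0): 0.90, (1, 0, 0): 0.10, (0, 1, 0): 0.01, (1, 1, 0): 0.99,
           (0, 0, 1): 0.50, (1, 0, 1): 0.50, (0, 1, 1): 0.9999, (1, 1, 1): 0.0001},
}

def joint_entry(assignment):
    """P(assignment) = product over variables of P(x_i | assignments of Parents(x_i))."""
    p = 1.0
    for var, ps in parents.items():
        key = (assignment[var],) + tuple(assignment[q] for q in ps)
        p *= cpt[var][key]
    return p
```

For example joint_entry({"Pg": 0, "B": 1, "F": 1}) = 0.9 * 0.1 * 0.99 = 0.0891, and summing joint_entry over all assignments gives 1, so any conditional probability follows by summing the right entries.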
Question: given a network can I find a chain-rule decomposition of the joint?
GRAPHICAL MODELS: DETERMINING CONDITIONAL
INDEPENDENCIES
What Independencies does a Bayes Net Model?
• In order for a Bayesian network to model a probability distribution, the following must be true: Each variable is conditionally independent of all its non-
descendants in the graph given the value of all its parents.
• This follows from
• But what else does it imply?
P(X1…Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi)) = ∏_{i=1}^{n} P(Xi | X1…Xi−1)
What Independencies does a Bayes Net Model?
• Example:
Z
Y
X
Given Y, does learning the value of Z tell us nothing new about X? I.e., is P(X|Y, Z) equal to P(X | Y)? Yes. Since we know the value of all of X’s parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z. Also, since independence is symmetric, P(Z|Y, X) = P(Z|Y).
Quick proof that independence is symmetric
• Assume: P(X|Y, Z) = P(X|Y) • Then:

P(Z | X,Y) = P(X,Y | Z) P(Z) / P(X,Y)                    (Bayes' rule)
           = P(X | Y,Z) P(Y | Z) P(Z) / (P(X | Y) P(Y))  (chain rule)
           = P(X | Y) P(Y | Z) P(Z) / (P(X | Y) P(Y))    (by assumption)
           = P(Y | Z) P(Z) / P(Y) = P(Z | Y)             (Bayes' rule)
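The symmetry can also be checked numerically. This sketch (not from the slides; all numbers are made up) builds a joint where X ⊥ Z | Y holds by construction and verifies both directions:

```python
from itertools import product

# Construct a joint where X is independent of Z given Y by design:
# P(x,y,z) = P(y) P(x|y) P(z|y).
P_Y = {0: 0.4, 1: 0.6}
P_X_given_Y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
P_Z_given_Y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}
joint = {(x, y, z): P_Y[y] * P_X_given_Y[y][x] * P_Z_given_Y[y][z]
         for x, y, z in product((0, 1), repeat=3)}

def cond(target, given):
    """P(target | given) from the joint; target/given map row-index -> value."""
    def match(row, spec):
        return all(row[i] == v for i, v in spec.items())
    num = sum(p for row, p in joint.items() if match(row, {**given, **target}))
    den = sum(p for row, p in joint.items() if match(row, given))
    return num / den

# Rows are (x, y, z): index 0 is X, 1 is Y, 2 is Z.
lhs_x = cond({0: 1}, {1: 0, 2: 1})   # P(X=1 | Y=0, Z=1)
rhs_x = cond({0: 1}, {1: 0})         # P(X=1 | Y=0)
lhs_z = cond({2: 1}, {1: 0, 0: 1})   # P(Z=1 | Y=0, X=1)
rhs_z = cond({2: 1}, {1: 0})         # P(Z=1 | Y=0)
```

Both equalities hold: knowing Z adds nothing about X given Y, and symmetrically knowing X adds nothing about Z.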
What Independencies does a Bayes Net Model?
• Let I<X,Y,Z> represent X and Z being conditionally independent given Y.
• I<X,Y,Z>? Yes, just as in previous example: All X’s parents given, and Z is not a descendant.
Y
X Z
What Independencies does a Bayes Net Model?
• I<X,{U},Z>? No. • I<X,{U,V},Z>? Yes. • Maybe I<X, S, Z> iff S acts as a cutset
between X and Z in an undirected version of the graph…?
Z
V U
X
Things get a little more confusing • X has no parents, so we know all its parents' values trivially • Z is not a descendant of X
• So, I<X,{},Z>, even though there's an undirected path from X to Z through an unknown variable Y.
• What if we do know the value of Y, though? Or one of its descendants?
Z X
Y
The “Burglar Alarm” example • Your house has a twitchy burglar
alarm that is also sometimes triggered by earthquakes.
• Earth arguably doesn’t care whether your house is currently being burgled
• While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!
Burglar Earthquake
Alarm
Phone Call
Things get a lot more confusing • But now suppose you learn
that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all.
• Earthquake “explains away” the hypothetical burglar.
• But then it must not be the case that
I<Burglar,{Phone Call}, Earthquake>, even though
I<Burglar,{}, Earthquake>!
Burglar Earthquake
Alarm
Phone Call
d-separation to the rescue • Fortunately, there is a relatively simple algorithm for
determining whether two variables in a Bayesian network are conditionally independent: d-separation.
• Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is “blocked”, where a path is “blocked” iff one or more of the following conditions is true: ...
i.e., X and Z are dependent iff there exists an unblocked path
A path is “blocked” when...
• There exists a variable Y on the path such that - it is in the evidence set E
- the arcs putting Y in the path are “tail-to-tail”
• Or, there exists a variable Y on the path such that - it is in the evidence set E
- the arcs putting Y in the path are “tail-to-head”
• Or, ...
Y
Y
unknown “common causes” of X and Z impose dependency
unknown “causal chains” connecting X an Z impose dependency
A path is “blocked” when… (the funky case)
• … Or, there exists a variable Y on the path such that - it is NOT in the evidence set E
- neither are any of its descendants - the arcs putting Y on the path are "head-to-head"
Y Known “common symptoms” of X and Z impose dependencies… X may “explain away” Z
d-separation to the rescue, cont’d • Theorem [Verma & Pearl, 1998]:
- If a set of evidence variables E d-separates X and Z in a Bayesian network’s graph, then I<X, E, Z>.
• d-separation can be computed in linear time using a depth-first-search-like algorithm.
• Be careful: d-separation finds what must be conditionally independent - "Might": variables may actually be independent even when they're not d-separated, depending on the actual probabilities involved
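For small graphs, the blocking rules can be applied directly by checking every undirected path. This is a sketch (not from the slides; a simple path-enumeration approach rather than the linear-time algorithm mentioned above), tested on the burglar-alarm network:

```python
def d_separated(edges, x, z, evidence):
    """Check d-separation by testing every undirected simple path from x to z.

    edges: set of directed (parent, child) pairs.  A path is blocked iff it
    contains a non-collider node in `evidence`, or a collider ("head-to-head")
    node with neither itself nor any descendant in `evidence`.
    """
    children, nodes = {}, {u for e in edges for u in e}
    for u, v in edges:
        children.setdefault(u, set()).add(v)

    def descendants(n):
        out, stack = set(), [n]
        while stack:
            for c in children.get(stack.pop(), ()):
                if c not in out:
                    out.add(c)
                    stack.append(c)
        return out

    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    def paths(cur, path):
        if cur == z:
            yield path
            return
        for nxt in neighbors[cur]:
            if nxt not in path:
                yield from paths(nxt, path + [nxt])

    def blocked(path):
        for a, b, c in zip(path, path[1:], path[2:]):
            collider = (a, b) in edges and (c, b) in edges
            if collider:
                if b not in evidence and not (descendants(b) & evidence):
                    return True          # collider with no evidence below it
            elif b in evidence:
                return True              # chain or common cause, observed
        return False

    return all(blocked(p) for p in paths(x, [x]))

# Burglar-alarm net: Burglar -> Alarm <- Earthquake, Alarm -> PhoneCall.
alarm_net = {("B", "A"), ("E", "A"), ("A", "P")}
```

With no evidence, Burglar and Earthquake are d-separated; observing the phone call (a descendant of the collider Alarm) unblocks the path, reproducing the "explaining away" effect above.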
d-separation example
A B
C D
E F
G
I
H
J
• I<C, {}, D>? • I<C, {A}, D>? • I<C, {A, B}, D>? • I<C, {A, B, J}, D>? • I<C, {A, B, E, J}, D>?