Graphical models, Bayesian networks: semantics & ...

Daphne Koller

Semantics & Factorization

Probabilistic Graphical Models: Bayesian Networks

Representation

•  Grade
•  Course Difficulty
•  Student Intelligence
•  Student SAT
•  Reference Letter

P(G,D,I,S,L)

[Figure: the student network. Difficulty → Grade ← Intelligence, Intelligence → SAT, Grade → Letter]

P(D):
  d0     d1
  0.6    0.4

P(I):
  i0     i1
  0.7    0.3

P(G | I, D):
           g1 (A)   g2 (B)   g3 (C)
  i0,d0    0.3      0.4      0.3
  i0,d1    0.05     0.25     0.7
  i1,d0    0.9      0.08     0.02
  i1,d1    0.5      0.3      0.2

P(S | I):
        s0      s1
  i0    0.95    0.05
  i1    0.2     0.8

P(L | G):
        l0      l1
  g1    0.1     0.9
  g2    0.4     0.6
  g3    0.99    0.01


P(d0, i1, g3, s1, l1) = P(d0) P(i1) P(g3 | i1, d0) P(s1 | i1) P(l1 | g3) = 0.6 · 0.3 · 0.02 · 0.8 · 0.01 = 0.0000288
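As a worked check of the chain rule for the assignment on this slide, the sketch below multiplies one entry from each CPD of the student network. The dictionary names and the `joint` helper are illustrative choices, not part of the lecture; the numbers are the CPD entries from the student-network slides.

```python
# CPDs of the student network, keyed by the value names used on the slides.
P_D = {"d0": 0.6, "d1": 0.4}
P_I = {"i0": 0.7, "i1": 0.3}
P_G = {  # P(G | I, D)
    ("i0", "d0"): {"g1": 0.3, "g2": 0.4, "g3": 0.3},
    ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.7},
    ("i1", "d0"): {"g1": 0.9, "g2": 0.08, "g3": 0.02},
    ("i1", "d1"): {"g1": 0.5, "g2": 0.3, "g3": 0.2},
}
P_S = {"i0": {"s0": 0.95, "s1": 0.05}, "i1": {"s0": 0.2, "s1": 0.8}}  # P(S | I)
P_L = {  # P(L | G)
    "g1": {"l0": 0.1, "l1": 0.9},
    "g2": {"l0": 0.4, "l1": 0.6},
    "g3": {"l0": 0.99, "l1": 0.01},
}


def joint(d, i, g, s, l):
    """Chain rule for Bayesian networks: one CPD entry per node, multiplied."""
    return P_D[d] * P_I[i] * P_G[(i, d)][g] * P_S[i][s] * P_L[g][l]


p = joint("d0", "i1", "g3", "s1", "l1")
print(p)  # 0.6 * 0.3 * 0.02 * 0.8 * 0.01, about 2.88e-05
```

Each factor depends only on a node and its parents, so evaluating any full assignment takes five table lookups.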

Bayesian Network
•  A Bayesian network is:
   – A directed acyclic graph (DAG) G whose nodes represent the random variables X1,…,Xn
   – For each node Xi, a CPD P(Xi | ParG(Xi))
•  The BN represents a joint distribution via the chain rule for Bayesian networks:

   P(X1,…,Xn) = Πi P(Xi | ParG(Xi))


BN Is a Legal Distribution: P ≥ 0
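The slide title states only the non-negativity half of the argument. A short sketch of both halves follows, written for the student network as an illustration; the order in which the sums are pushed inward is one convenient choice, not the only one.

```latex
% Non-negativity: P is a product of CPD entries, each of which is >= 0.
\[
P(D,I,G,S,L) = P(D)\,P(I)\,P(G \mid I,D)\,P(S \mid I)\,P(L \mid G) \;\ge\; 0
\]

% Normalization: sum out one variable at a time; each CPD sums to 1 over its child.
\begin{align*}
\sum_{D,I,G,S,L} P(D,I,G,S,L)
  &= \sum_{D,I,G,S} P(D)\,P(I)\,P(G \mid I,D)\,P(S \mid I) \sum_{L} P(L \mid G) \\
  &= \sum_{D,I,G} P(D)\,P(I)\,P(G \mid I,D) \sum_{S} P(S \mid I) \\
  &= \sum_{D,I} P(D)\,P(I) \sum_{G} P(G \mid I,D) \\
  &= \sum_{D} P(D) \sum_{I} P(I) = 1
\end{align*}
```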

P Factorizes over G
•  Let G be a graph over X1,…,Xn.
•  P factorizes over G if

   P(X1,…,Xn) = Πi P(Xi | ParG(Xi))

Genetic Inheritance

[Figure: family tree of Clancy, Jackie, Selma, Marge, Homer, Bart, Lisa, Maggie]

Genotype: AA, AB, AO, BO, BB, OO
Phenotype: A, B, AB, O

BNs for Genetic Inheritance

[Figure: Bayesian network with a genotype node G and a phenotype node B for each of Clancy, Jackie, Selma, Homer, Marge, Bart, Lisa, Maggie]

Reasoning Patterns

Probabilistic Graphical Models: Bayesian Networks

Representation

The Student Network


[Figure: the student network (Difficulty → Grade ← Intelligence, Intelligence → SAT, Grade → Letter) with the same CPDs as on the earlier slide]

Causal Reasoning

[Figure: student network, reasoning from Difficulty and Intelligence down to Letter]

P(l1) ≈ 0.5
P(l1 | i0) ≈ 0.39
P(l1 | i0, d0) ≈ 0.51

Evidential Reasoning

Student gets a C

P(d1) = 0.4     P(d1 | g3) ≈ 0.63
P(i1) = 0.3     P(i1 | g3) ≈ 0.08

Intercausal Reasoning

Student gets a C
Class is hard!

P(d1) = 0.4     P(d1 | g3) ≈ 0.63
P(i1) = 0.3     P(i1 | g3) ≈ 0.08
P(i1 | g3, d1) ≈ 0.11

Intercausal Reasoning Explained

[Figure: X1 → Y ← X2, where Y is the OR of X1 and X2]

X1   X2   Y    Prob
0    0    0    0.25
0    1    1    0.25
1    0    1    0.25
1    1    1    0.25
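The OR table can be used to work through explaining away with exact arithmetic. The sketch below assumes, as the table does, that X1 and X2 are independent fair bits and that Y is their deterministic OR; the helper names `prob` and `cond` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Joint over (x1, x2, y): four equally likely rows, with y = x1 OR x2.
joint = {(x1, x2, x1 | x2): Fraction(1, 4) for x1, x2 in product((0, 1), repeat=2)}


def prob(pred):
    """Probability of the event described by pred(x1, x2, y)."""
    return sum(p for row, p in joint.items() if pred(*row))


def cond(target, given):
    """Conditional probability P(target | given)."""
    return prob(lambda *row: target(*row) and given(*row)) / prob(given)


p_x1 = prob(lambda x1, x2, y: x1 == 1)
p_x1_given_y = cond(lambda x1, x2, y: x1 == 1, lambda x1, x2, y: y == 1)
p_x1_given_y_x2 = cond(lambda x1, x2, y: x1 == 1,
                       lambda x1, x2, y: y == 1 and x2 == 1)

print(p_x1, p_x1_given_y, p_x1_given_y_x2)  # 1/2 2/3 1/2
```

Observing Y = 1 raises the probability of X1 = 1 from 1/2 to 2/3; additionally observing X2 = 1 explains the observation and brings it back to 1/2.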

Intercausal Reasoning II

Student gets a B
Class is hard!

P(i1) = 0.3
P(i1 | g2) ≈ 0.175
P(i1 | g2, d1) ≈ 0.34

Student Aces the SAT
•  What happens to the posterior probability that the class is hard?

Student gets a C
Student aces the SAT

Student Aces the SAT

Student gets a C
Student aces the SAT

P(d1) = 0.4     P(d1 | g3) ≈ 0.63     P(d1 | g3, s1) ≈ 0.76
P(i1) = 0.3     P(i1 | g3) ≈ 0.08     P(i1 | g3, s1) ≈ 0.58
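The posteriors quoted on the reasoning-pattern slides can be checked by brute-force enumeration over the joint distribution. This is a minimal sketch, not an efficient inference algorithm; the `query` helper and the dictionary layout are illustrative, and the CPD values are those of the student network.

```python
from itertools import product

P_D = {"d0": 0.6, "d1": 0.4}
P_I = {"i0": 0.7, "i1": 0.3}
P_G = {  # P(G | I, D)
    ("i0", "d0"): {"g1": 0.3, "g2": 0.4, "g3": 0.3},
    ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.7},
    ("i1", "d0"): {"g1": 0.9, "g2": 0.08, "g3": 0.02},
    ("i1", "d1"): {"g1": 0.5, "g2": 0.3, "g3": 0.2},
}
P_S = {"i0": {"s0": 0.95, "s1": 0.05}, "i1": {"s0": 0.2, "s1": 0.8}}
P_L = {"g1": {"l0": 0.1, "l1": 0.9}, "g2": {"l0": 0.4, "l1": 0.6},
       "g3": {"l0": 0.99, "l1": 0.01}}

# Value names are unique across variables, so "g3" identifies both G and its value.
DOMAINS = [("d0", "d1"), ("i0", "i1"), ("g1", "g2", "g3"), ("s0", "s1"), ("l0", "l1")]


def query(target, evidence=()):
    """P(target | evidence), summing the joint over all 48 full assignments."""
    num = den = 0.0
    for d, i, g, s, l in product(*DOMAINS):
        row = (d, i, g, s, l)
        if all(e in row for e in evidence):
            p = P_D[d] * P_I[i] * P_G[(i, d)][g] * P_S[i][s] * P_L[g][l]
            den += p
            if target in row:
                num += p
    return num / den


print(round(query("l1", ["i0"]), 2))        # causal: 0.39
print(round(query("d1", ["g3"]), 2))        # evidential: 0.63
print(round(query("i1", ["g3", "d1"]), 2))  # intercausal: 0.11
print(round(query("d1", ["g3", "s1"]), 2))  # after the SAT result: 0.76
```

Enumeration costs time exponential in the number of variables, which is acceptable for five variables and is used here only to confirm the slide values.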

Flow of Probabilistic Influence

Representation

When can X influence Y?
•  X → Y
•  X ← Y
•  X → W → Y
•  X ← W ← Y
•  X ← W → Y
•  X → W ← Y

Active Trails
•  A trail X1 ─ … ─ Xn is active if it has no v-structures Xi-1 → Xi ← Xi+1

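When can X influence Y given evidence about Z?

•  X → Y: yes
•  X ← Y: yes

                   W ∉ Z    W ∈ Z
•  X → W → Y       yes      no
•  X ← W ← Y       yes      no
•  X ← W → Y       yes      no
•  X → W ← Y       no*      yes

* Influence can also flow through the v-structure when W ∉ Z if one of W's descendants is in Z.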

When can X influence Y given evidence about Z?

•  S ― I ― G ― D allows influence to flow when I is not observed and G (or its descendant L) is observed

[Figure: student network]

Active Trails
•  A trail X1 ─ … ─ Xn is active given Z if:
   – for any v-structure Xi-1 → Xi ← Xi+1, Xi or one of its descendants is in Z
   – no other Xi is in Z
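The two conditions above translate directly into a check over consecutive triples of a trail. The sketch below encodes the student network as a child-to-parents map and tests the trail S ― I ― G ― D. The function names are illustrative, and the code assumes the caller passes a valid trail (adjacent nodes joined by an edge) whose endpoints are not observed.

```python
# Student network structure: each node mapped to its set of parents.
PARENTS = {"D": set(), "I": set(), "G": {"I", "D"}, "S": {"I"}, "L": {"G"}}


def descendants(node):
    """All nodes reachable from `node` along parent -> child edges."""
    found, stack = set(), [node]
    while stack:
        cur = stack.pop()
        for child, parents in PARENTS.items():
            if cur in parents and child not in found:
                found.add(child)
                stack.append(child)
    return found


def is_active(trail, observed):
    """True if `trail` (a list of adjacent nodes) is active given `observed`."""
    observed = set(observed)
    for prev, mid, nxt in zip(trail, trail[1:], trail[2:]):
        is_v_structure = prev in PARENTS[mid] and nxt in PARENTS[mid]
        if is_v_structure:
            # A v-structure passes influence only if mid or a descendant is observed.
            if not (({mid} | descendants(mid)) & observed):
                return False
        elif mid in observed:
            # Any other intermediate node blocks the trail once it is observed.
            return False
    return True


trail = ["S", "I", "G", "D"]
print(is_active(trail, set()))       # False: v-structure at G is not activated
print(is_active(trail, {"G"}))       # True
print(is_active(trail, {"L"}))       # True: L is a descendant of G
print(is_active(trail, {"G", "I"}))  # False: observed I blocks the trail
```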

Preliminaries

Probabilistic Graphical Models: Independencies

Representation

Independence
•  For events α, β, P ⊨ (α ⊥ β) if:
   – P(α, β) = P(α) P(β)
   – P(α | β) = P(α)
   – P(β | α) = P(β)
•  For random variables X, Y, P ⊨ (X ⊥ Y) if:
   – P(X, Y) = P(X) P(Y)
   – P(X | Y) = P(X)
   – P(Y | X) = P(Y)

Independence

[Figure: I → G ← D]

P(I, D, G):
I    D    G    Prob.
i0   d0   g1   0.126
i0   d0   g2   0.168
i0   d0   g3   0.126
i0   d1   g1   0.009
i0   d1   g2   0.045
i0   d1   g3   0.126
i1   d0   g1   0.252
i1   d0   g2   0.0224
i1   d0   g3   0.0056
i1   d1   g1   0.06
i1   d1   g2   0.036
i1   d1   g3   0.024

P(I, D):
I    D    Prob
i0   d0   0.42
i0   d1   0.18
i1   d0   0.28
i1   d1   0.12

P(I):
I    Prob
i0   0.6
i1   0.4

P(D):
D    Prob
d0   0.7
d1   0.3
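A quick numerical check that the P(I, D) table on this slide satisfies the product definition of independence. The variable names are illustrative; the marginals are recomputed from the joint rather than copied from the slide.

```python
from math import isclose

# P(I, D) as given on the slide.
P_ID = {("i0", "d0"): 0.42, ("i0", "d1"): 0.18,
        ("i1", "d0"): 0.28, ("i1", "d1"): 0.12}

# Marginals obtained by summing out the other variable.
P_I, P_D = {}, {}
for (i, d), p in P_ID.items():
    P_I[i] = P_I.get(i, 0.0) + p
    P_D[d] = P_D.get(d, 0.0) + p

# I and D are independent iff every joint entry equals the product of marginals.
independent = all(isclose(p, P_I[i] * P_D[d]) for (i, d), p in P_ID.items())
print(independent)  # True
```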

Conditional Independence
•  For (sets of) random variables X, Y, Z, P ⊨ (X ⊥ Y | Z) if:
   – P(X, Y | Z) = P(X | Z) P(Y | Z)
   – P(X | Y, Z) = P(X | Z)
   – P(Y | X, Z) = P(Y | Z)
   – P(X, Y, Z) ∝ φ1(X, Z) φ2(Y, Z)

Conditional Independence

[Figure: a Coin node with two children, X1 and X2]

Conditional Independence

[Figure: S ← I → G]

P(I, S, G):
I    S    G    Prob.
i0   s0   g1   0.114
i0   s0   g2   0.1938
i0   s0   g3   0.2622
i0   s1   g1   0.006
i0   s1   g2   0.0102
i0   s1   g3   0.0138
i1   s0   g1   0.252
i1   s0   g2   0.0224
i1   s0   g3   0.0056
i1   s1   g1   0.108
i1   s1   g2   0.0096
i1   s1   g3   0.0024

P(S, G | i0):
S    G    Prob.
s0   g1   0.19
s0   g2   0.323
s0   g3   0.437
s1   g1   0.01
s1   g2   0.017
s1   g3   0.023

P(S | i0):
S    Prob
s0   0.95
s1   0.05

P(G | i0):
G    Prob.
g1   0.2
g2   0.34
g3   0.46
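The tables on this slide can be checked mechanically: condition the joint P(I, S, G) on a value of I, then compare each entry with the product of the conditional marginals. The sketch also tests S and G without conditioning, to show the contrast. Function names are illustrative.

```python
from collections import defaultdict
from math import isclose

P_ISG = {
    ("i0", "s0", "g1"): 0.114, ("i0", "s0", "g2"): 0.1938, ("i0", "s0", "g3"): 0.2622,
    ("i0", "s1", "g1"): 0.006, ("i0", "s1", "g2"): 0.0102, ("i0", "s1", "g3"): 0.0138,
    ("i1", "s0", "g1"): 0.252, ("i1", "s0", "g2"): 0.0224, ("i1", "s0", "g3"): 0.0056,
    ("i1", "s1", "g1"): 0.108, ("i1", "s1", "g2"): 0.0096, ("i1", "s1", "g3"): 0.0024,
}


def factorizes(table):
    """True if a normalized table over (s, g) equals the product of its marginals."""
    p_s, p_g = defaultdict(float), defaultdict(float)
    for (s, g), p in table.items():
        p_s[s] += p
        p_g[g] += p
    return all(isclose(p, p_s[s] * p_g[g], rel_tol=1e-6) for (s, g), p in table.items())


def s_g_given(i_value):
    """P(S, G | I = i_value): select matching rows and renormalize."""
    rows = {(s, g): p for (i, s, g), p in P_ISG.items() if i == i_value}
    z = sum(rows.values())  # P(I = i_value)
    return {key: p / z for key, p in rows.items()}


# Marginal P(S, G), obtained by summing out I.
P_SG = defaultdict(float)
for (i, s, g), p in P_ISG.items():
    P_SG[(s, g)] += p

print(factorizes(s_g_given("i0")), factorizes(s_g_given("i1")))  # True True
print(factorizes(dict(P_SG)))                                    # False
```

S and G are dependent marginally but become independent once I is fixed, which is the pattern the common-parent structure predicts.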

Conditioning can Lose Independences

P(I, D, G):
I    D    G    Prob.
i0   d0   g1   0.126
i0   d0   g2   0.168
i0   d0   g3   0.126
i0   d1   g1   0.009
i0   d1   g2   0.045
i0   d1   g3   0.126
i1   d0   g1   0.252
i1   d0   g2   0.0224
i1   d0   g3   0.0056
i1   d1   g1   0.06
i1   d1   g2   0.036
i1   d1   g3   0.024

P(I, D | g1):
I    D    Prob.
i0   d0   0.282
i0   d1   0.02
i1   d0   0.564
i1   d1   0.134
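The P(I, D | g1) table can be reproduced from the joint by selecting the g1 rows and renormalizing. The sketch then measures how far the conditioned table is from a product of its own marginals. Variable names are illustrative.

```python
P_IDG = {
    ("i0", "d0", "g1"): 0.126, ("i0", "d0", "g2"): 0.168, ("i0", "d0", "g3"): 0.126,
    ("i0", "d1", "g1"): 0.009, ("i0", "d1", "g2"): 0.045, ("i0", "d1", "g3"): 0.126,
    ("i1", "d0", "g1"): 0.252, ("i1", "d0", "g2"): 0.0224, ("i1", "d0", "g3"): 0.0056,
    ("i1", "d1", "g1"): 0.06, ("i1", "d1", "g2"): 0.036, ("i1", "d1", "g3"): 0.024,
}

# Condition on G = g1: keep matching rows, divide by their total mass P(g1).
rows = {(i, d): p for (i, d, g), p in P_IDG.items() if g == "g1"}
p_g1 = sum(rows.values())
P_ID_given_g1 = {key: p / p_g1 for key, p in rows.items()}

# Conditional marginals of I and D given g1.
p_i0 = sum(p for (i, d), p in P_ID_given_g1.items() if i == "i0")
p_d0 = sum(p for (i, d), p in P_ID_given_g1.items() if d == "d0")

# If I and D were independent given g1, this gap would be zero.
gap = P_ID_given_g1[("i0", "d0")] - p_i0 * p_d0

print({key: round(p, 3) for key, p in P_ID_given_g1.items()})
print(round(gap, 3))  # 0.026, so I and D are dependent given g1
```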

Bayesian Networks

Probabilistic Graphical Models: Independencies

Representation

Independence & Factorization

P(X,Y) = P(X) P(Y):             X, Y independent
P(X,Y,Z) ∝ φ1(X,Z) φ2(Y,Z):     (X ⊥ Y | Z)

•  Factorization of a distribution P implies independencies that hold in P
•  If P factorizes over G, can we read these independencies from the structure of G?

Flow of influence & d-separation

Definition: X and Y are d-separated in G given Z if there is no active trail in G between X and Y given Z

Notation: d-sepG(X, Y | Z)

[Figure: student network over D, I, G, S, L]
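One way to test d-separation in code, sketched here for the student network. Rather than enumerating trails, it uses the standard moralized-ancestral-graph criterion, which is equivalent to the active-trail definition above but is not the formulation used on the slide. The function names are illustrative, and `x` and `y` are assumed to be single nodes that are not in `z`.

```python
from itertools import combinations

# Student network structure: each node mapped to its set of parents.
PARENTS = {"D": set(), "I": set(), "G": {"I", "D"}, "S": {"I"}, "L": {"G"}}


def with_ancestors(nodes):
    """The given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(PARENTS[n])
    return seen


def d_separated(x, y, z=()):
    """True if nodes x and y are d-separated given the set of nodes z."""
    z = set(z)
    keep = with_ancestors({x, y} | z)
    # Moralize the ancestral subgraph: undirected parent-child edges,
    # plus an edge between every pair of parents that share a child.
    nbrs = {n: set() for n in keep}
    for child in keep:
        for p in PARENTS[child]:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for a, b in combinations(PARENTS[child], 2):
            nbrs[a].add(b)
            nbrs[b].add(a)
    # x and y are d-separated iff no undirected path connects them once z is removed.
    seen, stack = {x}, [x]
    while stack:
        n = stack.pop()
        if n == y:
            return False
        for m in nbrs[n] - z - seen:
            seen.add(m)
            stack.append(m)
    return True


print(d_separated("D", "I"))                # True
print(d_separated("D", "I", {"G"}))         # False: observing G activates D → G ← I
print(d_separated("S", "L", {"I"}))         # True
print(d_separated("D", "S", {"G", "I"}))    # True
```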

Factorization ⇒ Independence: BNs

Theorem: If P factorizes over G, and d-sepG(X, Y | Z), then P satisfies (X ⊥ Y | Z)

[Figure: student network over D, I, G, S, L]

Any node is d-separated from its non-descendants given its parents

[Figure: extended student network with nodes Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy]

If P factorizes over G, then in P, any variable is independent of its non-descendants given its parents

I-maps
•  d-separation in G ⇒ P satisfies the corresponding independence statement

   I(G) = {(X ⊥ Y | Z) : d-sepG(X, Y | Z)}

•  Definition: If P satisfies I(G), we say that G is an I-map (independency map) of P

I-maps

[Figure: two graphs, G1 and G2, over I and D]

P1:
I    D    Prob
i0   d0   0.42
i0   d1   0.18
i1   d0   0.28
i1   d1   0.12

P2:
I    D    Prob.
i0   d0   0.282
i0   d1   0.02
i1   d0   0.564
i1   d1   0.134

Factorization ⇒ Independence: BNs

Theorem: If P factorizes over G, then G is an I-map for P

Independence ⇒ Factorization

Theorem: If G is an I-map for P, then P factorizes over G

[Figure: student network over D, I, G, S, L]
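A sketch of why the theorem holds, worked on the student network: expand P with the ordinary chain rule of probability in a topological order of G, then use independencies in I(G) to drop the non-parents from each conditioning set. The ordering D, I, G, S, L is one valid choice.

```latex
\begin{align*}
P(D,I,G,S,L)
  &= P(D)\,P(I \mid D)\,P(G \mid D,I)\,P(S \mid D,I,G)\,P(L \mid D,I,G,S) \\
  &= P(D)\,P(I)\,P(G \mid D,I)\,P(S \mid I)\,P(L \mid G)
\end{align*}
% Independencies used, each of which follows from d-separation in G:
%   (I \perp D), \quad (S \perp D, G \mid I), \quad (L \perp D, I, S \mid G)
```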

Summary

Two equivalent views of graph structure:
•  Factorization: G allows P to be represented
•  I-map: Independencies encoded by G hold in P

If P factorizes over a graph G, we can read from the graph independencies that must hold in P (an independency map)

Naïve Bayes

Representation

Naïve Bayes Model

[Figure: Class → X1, X2, …, Xn]

(Xi ⊥ Xj | C) for all Xi, Xj

Naïve Bayes Classifier

[Figure: Class → X1, X2, …, Xn]
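For reference, the factorization implied by the naïve Bayes structure, and the posterior odds it gives for comparing two classes. The class labels c^1 and c^2 are notation introduced here for illustration.

```latex
% Joint distribution: the class variable C is the only parent of every feature.
\[
P(C, X_1, \dots, X_n) = P(C) \prod_{i=1}^{n} P(X_i \mid C)
\]

% Posterior odds of two classes given observed features x_1, ..., x_n:
% the shared normalizer P(x_1, ..., x_n) cancels.
\[
\frac{P(C = c^1 \mid x_1, \dots, x_n)}{P(C = c^2 \mid x_1, \dots, x_n)}
  = \frac{P(C = c^1)}{P(C = c^2)}
    \prod_{i=1}^{n} \frac{P(x_i \mid C = c^1)}{P(x_i \mid C = c^2)}
\]
```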

Bernoulli Naïve Bayes for Text

[Figure: Document Label → cat, dog, buy, …]

P(“cat” appears | Label)

             cat      dog      buy      sell     …
Financial    0.001    0.001    0.2      0.3      …
Pets         0.3      0.4      0.02     0.0001   …
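A minimal sketch of Bernoulli naïve Bayes scoring using the four word probabilities in the table. Two things are assumptions not given on the slide: the uniform prior over labels, and restricting the vocabulary to the four listed words. Each vocabulary word contributes a factor whether it is present or absent, which is what makes the model Bernoulli.

```python
from math import exp, log

# P(word appears | label), from the slide's table.
P_WORD = {
    "Financial": {"cat": 0.001, "dog": 0.001, "buy": 0.2, "sell": 0.3},
    "Pets": {"cat": 0.3, "dog": 0.4, "buy": 0.02, "sell": 0.0001},
}
PRIOR = {"Financial": 0.5, "Pets": 0.5}  # assumption: the slide gives no prior


def posterior(words_present):
    """P(label | document) for a document described by the set of words it contains."""
    present = set(words_present)
    log_score = {}
    for label, probs in P_WORD.items():
        s = log(PRIOR[label])
        for word, p in probs.items():
            # Present words contribute p; absent words contribute (1 - p).
            s += log(p) if word in present else log(1.0 - p)
        log_score[label] = s
    # Normalize in a numerically stable way.
    top = max(log_score.values())
    unnorm = {label: exp(v - top) for label, v in log_score.items()}
    total = sum(unnorm.values())
    return {label: v / total for label, v in unnorm.items()}


post = posterior({"cat", "dog"})
print(max(post, key=post.get))  # Pets
```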

Multinomial Naïve Bayes for Text

[Figure: Document Label → W1, W2, …, Wn]

             cat      dog      buy      sell     …
Financial    0.001    0.001    0.02     0.02     …
Pets         0.02     0.03     0.003    0.001    …
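A comparable sketch for the multinomial model, where each word position Wi draws from the label's word distribution, so repeated words count once per occurrence. The uniform prior is an assumption not given on the slide, and the example handles only the four words listed in the table.

```python
from math import log

# P(W_i = word | label), from the slide's table (only the four listed words).
P_WORD = {
    "Financial": {"cat": 0.001, "dog": 0.001, "buy": 0.02, "sell": 0.02},
    "Pets": {"cat": 0.02, "dog": 0.03, "buy": 0.003, "sell": 0.001},
}
LOG_PRIOR = {"Financial": log(0.5), "Pets": log(0.5)}  # assumption: uniform prior


def classify(words):
    """Return (best label, log scores) for a document given as a word sequence."""
    scores = {
        label: LOG_PRIOR[label] + sum(log(probs[w]) for w in words)
        for label, probs in P_WORD.items()
    }
    return max(scores, key=scores.get), scores


label_a, _ = classify(["buy", "sell", "buy"])
label_b, _ = classify(["dog", "cat", "dog", "buy"])
print(label_a, label_b)  # Financial Pets
```

Unlike the Bernoulli model, absent words contribute nothing to the score, and the number of factors grows with document length rather than vocabulary size.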

Summary
•  Simple approach for classification
   – Computationally efficient
   – Easy to construct
•  Surprisingly effective in domains with many weakly relevant features
•  Strong independence assumptions reduce performance when many features are strongly correlated

Application: Diagnosis

Probabilistic Graphical Models: Bayesian Networks

Representation

Medical Diagnosis: Pathfinder (1992)
•  Help pathologist diagnose lymph node pathologies (60 different diseases)
•  Pathfinder I: Rule-based system
•  Pathfinder II used naïve Bayes and got superior performance

Heckerman et al.

Medical Diagnosis: Pathfinder (1992)
•  Pathfinder III: Naïve Bayes with better knowledge engineering
•  No incorrect zero probabilities
•  Better calibration of conditional probabilities
   – P(finding | disease1) to P(finding | disease2)
   – Not P(finding1 | disease) to P(finding2 | disease)

Heckerman et al.

Medical Diagnosis: Pathfinder (1992)
•  Pathfinder IV: Full Bayesian network
   – Removed incorrect independencies
   – Additional parents led to more accurate estimation of probabilities
•  BN model agreed with expert panel in 50/53 cases, vs 47/53 for naïve Bayes model
•  Accuracy as high as expert that designed the model

Heckerman et al.

CPCS

# of parameters: 2^1000 to 133,931,430 to 8254

M. Pradhan, G. Provan, B. Middleton, M. Henrion, UAI 94


Medical Diagnosis (Microsoft)

Thanks to: Eric Horvitz, Microsoft Research


Fault Diagnosis
•  Many examples:
   – Microsoft troubleshooters
   – Car repair
•  Benefits:
   – Flexible user interface
   – Easy to design and maintain
