
Page 1:

Learning Bayesian networks

Most Slides by Nir Friedman

Some by Dan Geiger

Page 2:

Known Structure -- Incomplete Data

[Figure: an Inducer takes as input the network structure E -> A <- B with an unknown CPT P(A | E, B) (all entries "?") together with data containing missing values, and outputs the same structure with estimated CPT entries (e.g. rows .9/.1, .7/.3, .99/.01, .8/.2 for the four parent configurations).]

Network structure is specified; data contains missing values.

We consider assignments to the missing values.

Data over (E, B, A): <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>
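As a small illustration of what "assignments to missing values" means (a sketch of my own, not part of the original slides), the snippet below enumerates every completion of a partially observed record; the record layout over (E, B, A) and the Y/N domain mirror the example above, while the function name and code are assumptions made only for this illustration.

```python
from itertools import product

# A record over (E, B, A); None marks a missing value ("?").
def completions(record, domain=("Y", "N")):
    """Yield every way of filling in the missing entries of a record."""
    slots = [i for i, v in enumerate(record) if v is None]
    for values in product(domain, repeat=len(slots)):
        filled = list(record)
        for i, v in zip(slots, values):
            filled[i] = v
        yield tuple(filled)

for r in completions(("N", "Y", None)):   # the case <N,Y,?> from the data above
    print(r)                              # ('N', 'Y', 'Y') and ('N', 'Y', 'N')
```

In EM, these completions are not enumerated and counted equally; each one is weighted by its probability under the current parameters, as the following slides show.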

Page 3:

Learning Parameters from Incomplete Data

Incomplete data: the posterior distributions over the parameters become interdependent. Consequences:

ML parameters cannot be computed separately for each multinomial.

The posterior is not a product of independent posteriors.

[Figure: plate model with parameter nodes θ_X, θ_Y|X=H and θ_Y|X=T, and a plate over cases m containing X[m] and Y[m].]

Page 4:

Learning Parameters from Incomplete Data (cont.)

In the presence of incomplete data, the likelihood can have multiple global maxima.

Example: we can rename the values of a hidden variable H. If H has two values, the likelihood has two global maxima.

Similarly, local maxima are also replicated; many hidden variables make this a serious problem.

[Figure: two-node network H -> Y with H hidden.]
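The relabeling argument is easy to check numerically. The sketch below (my own, with arbitrary assumed parameter values) builds the two-node model H -> Y with H hidden and verifies that swapping the two values of H in both P(H) and P(Y|H) leaves the likelihood of the observed Y values unchanged, so the two parameter settings always score identically.

```python
import math

# Hidden H in {0, 1}, observed Y in {0, 1}; P(Y) = sum_h P(H=h) P(Y|H=h).
p_h = [0.3, 0.7]                      # P(H), assumed values
p_y_given_h = [[0.9, 0.1],            # P(Y | H=0), assumed values
               [0.2, 0.8]]            # P(Y | H=1), assumed values

def loglik(p_h, p_y_given_h, data):
    """Log-likelihood of the observed Y values, with H marginalized out."""
    return sum(math.log(sum(p_h[h] * p_y_given_h[h][y] for h in (0, 1)))
               for y in data)

data = [0, 1, 1, 0, 1]

# Rename the values of the hidden variable: swap the entries of P(H)
# and the corresponding rows of P(Y | H).
p_h_swapped = [p_h[1], p_h[0]]
p_y_swapped = [p_y_given_h[1], p_y_given_h[0]]

print(loglik(p_h, p_y_given_h, data))          # same value ...
print(loglik(p_h_swapped, p_y_swapped, data))  # ... as this one
```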

Page 5:

Expectation Maximization (EM)

A general-purpose method for learning from incomplete data.

Intuition: if we had access to the counts, we could estimate the parameters. However, missing values do not allow us to perform the counts. Instead, we "complete" the counts using the current parameter assignment.

Page 6:

Expectation Maximization (EM)

[Figure: a data set over (X, Y, Z) in which some entries, in particular some Y values, are missing, shown next to the current model over X, Y, Z. Using the current parameters, each missing Y is completed probabilistically, e.g. P(Y=H | X=H, Z=T, θ) = 0.3 and P(Y=H | X=T, Z=T, θ) = 0.4, and these completions are accumulated into a table of expected counts N(X, Y) with entries such as 1.3, 0.4, 1.7, 1.6. These numbers are placed for illustration; they have not been computed.]
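A minimal sketch of this count completion (my own code, with made-up parameters and data): each record with a missing Y contributes fractional counts to N(X, Y) according to P(Y | X, Z, θ) under the current model, while fully observed records contribute a count of one. I assume a structure in which Y's posterior given X and Z is directly available.

```python
from collections import defaultdict

# Current model: P(Y=H | X, Z), indexed by (X, Z).  The values are assumptions
# for illustration only.
p_y_H_given_xz = {("H", "T"): 0.3, ("T", "T"): 0.4,
                  ("H", "H"): 0.6, ("T", "H"): 0.5}

# Data over (X, Y, Z); None marks a missing Y.
data = [("T", None, "T"), ("H", "H", "H"), ("T", "T", "T"), ("H", None, "T")]

expected_counts = defaultdict(float)   # expected counts N(X, Y)
for x, y, z in data:
    if y is not None:                  # fully observed case: count of 1
        expected_counts[(x, y)] += 1.0
    else:                              # missing Y: fractional counts
        p_H = p_y_H_given_xz[(x, z)]
        expected_counts[(x, "H")] += p_H
        expected_counts[(x, "T")] += 1.0 - p_H

for key, n in sorted(expected_counts.items()):
    print(key, round(n, 2))
```

The M-step would then renormalize these expected counts into new CPT entries, which is what the next slide's cycle spells out.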

Page 7:

EM (cont.)

[Figure: the EM cycle. The training data together with an initial network (G, Θ_0), whose structure has X1, X2, X3 as parents of a hidden variable H with children Y1, Y2, Y3, feed the E-step, which computes the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H). The M-step reparameterizes from these counts, yielding an updated network (G, Θ_1), and the cycle is reiterated.]
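To make the E-step/M-step cycle concrete, here is a compact EM sketch of my own for a reduced version of the network above: a hidden variable H with two observed children Y1 and Y2 (the X parents and Y3 are dropped to keep the code short). The E-step computes P(H | Y1, Y2, θ) for each case and accumulates expected counts; the M-step renormalizes them into updated CPTs. All names and numbers are made up for the example.

```python
import random

VALS = (0, 1)   # all variables binary, for simplicity

def random_params():
    """Random initial parameters: P(H) and P(Yj | H) for j = 1, 2."""
    def rand_dist():
        p = random.random()
        return [p, 1.0 - p]
    return {"H": rand_dist(),
            "Y1": {h: rand_dist() for h in VALS},
            "Y2": {h: rand_dist() for h in VALS}}

def em_step(theta, data):
    """One EM iteration: expected counts (E-step), then reparameterize (M-step)."""
    n_h = [1e-6, 1e-6]                 # tiny pseudo-counts avoid division by zero
    n_y = {"Y1": {h: [1e-6, 1e-6] for h in VALS},
           "Y2": {h: [1e-6, 1e-6] for h in VALS}}
    for y1, y2 in data:
        # E-step for this case: posterior over the hidden H
        joint = [theta["H"][h] * theta["Y1"][h][y1] * theta["Y2"][h][y2]
                 for h in VALS]
        norm = sum(joint)
        for h in VALS:
            w = joint[h] / norm
            n_h[h] += w
            n_y["Y1"][h][y1] += w
            n_y["Y2"][h][y2] += w
    # M-step: normalize the expected counts into new CPTs
    return {"H": [c / sum(n_h) for c in n_h],
            "Y1": {h: [c / sum(n_y["Y1"][h]) for c in n_y["Y1"][h]] for h in VALS},
            "Y2": {h: [c / sum(n_y["Y2"][h]) for c in n_y["Y2"][h]] for h in VALS}}

# Observed (Y1, Y2) pairs; H is never observed.
data = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 1), (1, 1)]
theta = random_params()
for _ in range(50):
    theta = em_step(theta, data)
print(theta["H"])
```

Because H is never observed, different random initializations converge to one of the relabeling-symmetric solutions discussed on the earlier slide.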

Page 8:

MLE from Incomplete Data

Finding the MLE parameters is a nonlinear optimization problem.

[Figure: the likelihood surface L(Θ|D) as a function of the parameters.]

Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice"). Guarantee: the maximum of the new function scores better than the current point.

Page 9:

EM in Practice

Initial parameters: a random parameter setting, or a "best" guess from some other source.

Stopping criteria: small change in the likelihood of the data, or small change in the parameter values.

Avoiding bad local maxima: multiple restarts, with early "pruning" of unpromising ones.
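A hedged sketch of such a driver (my own, not from the slides): it runs EM from several random initializations and stops each run once the improvement in log-likelihood or the change in the parameters becomes small. The callables init_params, em_step, and loglik are placeholders for model-specific code (for instance, the em_step above together with a matching log-likelihood function); early pruning of unpromising restarts is omitted to keep the sketch short.

```python
def flatten(params):
    """Flatten a nested dict/list parameter structure into a list of floats."""
    if isinstance(params, dict):
        return [x for k in sorted(params) for x in flatten(params[k])]
    if isinstance(params, (list, tuple)):
        return [x for p in params for x in flatten(p)]
    return [float(params)]

def run_em(init_params, em_step, loglik, data,
           max_iters=200, ll_tol=1e-6, param_tol=1e-6):
    """Iterate EM until the likelihood or the parameters stop changing."""
    theta = init_params()
    ll = loglik(theta, data)
    for _ in range(max_iters):
        new_theta = em_step(theta, data)
        new_ll = loglik(new_theta, data)
        improvement = new_ll - ll
        param_change = max(abs(a - b)
                           for a, b in zip(flatten(new_theta), flatten(theta)))
        theta, ll = new_theta, new_ll
        if improvement < ll_tol or param_change < param_tol:
            break
    return theta, ll

def best_of_restarts(n_restarts, init_params, em_step, loglik, data):
    """Multiple restarts: keep the run with the highest final log-likelihood."""
    runs = [run_em(init_params, em_step, loglik, data) for _ in range(n_restarts)]
    return max(runs, key=lambda run: run[1])
```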

Page 10:

The setup of the EM algorithm

We start with a likelihood function parameterized by θ.

The observed quantity is denoted X=x. It is often a vector x1,…,xL of observations (e.g., evidence for some nodes in a Bayesian network).

The hidden quantity is a vector Y=y (e.g. states of unobserved variables in a Bayesian network). The quantity y is defined such that if it were known, the likelihood of the completed data point P(x,y|θ) is easy to maximize.

The log-likelihood of an observation x has the form: log P(x|θ) = log P(x,y|θ) - log P(y|x,θ)

(Because P(x,y|θ) = P(x|θ) P(y|x,θ).)

Page 11:

The goal of the EM algorithm

The log-likelihood of an observation x has the form: log P(x|θ) = log P(x,y|θ) - log P(y|x,θ)

The goal: starting with a current parameter vector θ', EM's goal is to find a new vector θ such that P(x|θ) > P(x|θ') with the highest possible difference.

The result: after enough iterations, EM reaches a local maximum of the likelihood P(x|θ).

For independent points (xi, yi), i=1,…,m, we can similarly write:

Σ_i log P(xi|θ) = Σ_i log P(xi,yi|θ) - Σ_i log P(yi|xi,θ)

We will stick to one observation in our derivation, recalling that all derived equations can be modified by summing over the observations.

Page 12:

The Mathematics involved

Recall that the expectation of a random variable Y with distribution p(y) is given by E[Y] = Σ_y y p(y).

The expectation of a function L(Y) is given by E[L(Y)] = Σ_y L(y) p(y).

A slightly harder example: E_θ'[log p(x,y|θ)] = Σ_y p(y|x,θ') log p(x,y|θ). This expression is denoted Q(θ|θ').

The expectation operator E is linear: for two random variables X, Y and constants a, b, we have E[aX+bY] = a E[X] + b E[Y].
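To ground the notation, the following toy computation (my own, with arbitrary probabilities) evaluates Q(θ|θ') = Σ_y p(y|x,θ') log p(x,y|θ) for a model with one observed binary x and one hidden binary y, representing each joint distribution p(x,y|θ) simply as a table.

```python
import math

# Two toy joint distributions p(x, y | theta) over binary x, y;
# the entries are arbitrary assumptions that sum to one.
theta_prime = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}
theta       = {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.3}

def posterior(joint, x):
    """p(y | x) under a joint table p(x, y)."""
    p_x = joint[(x, 0)] + joint[(x, 1)]
    return {y: joint[(x, y)] / p_x for y in (0, 1)}

def Q(theta, theta_prime, x):
    """Q(theta | theta') = sum_y p(y | x, theta') log p(x, y | theta)."""
    post = posterior(theta_prime, x)
    return sum(post[y] * math.log(theta[(x, y)]) for y in (0, 1))

print(Q(theta, theta_prime, x=1))
```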

Page 13:

The Mathematics involved (cont.)

Starting with log P(x|θ) = log P(x,y|θ) - log P(y|x,θ), multiplying both sides by P(y|x,θ'), and summing over y (using Σ_y P(y|x,θ') = 1), yields

log P(x|θ) = Σ_y P(y|x,θ') log P(x,y|θ) - Σ_y P(y|x,θ') log P(y|x,θ)

The first term is E_θ'[log P(x,y|θ)] = Q(θ|θ'). We now observe that

Δ = log P(x|θ) - log P(x|θ') = Q(θ|θ') - Q(θ'|θ') + Σ_y P(y|x,θ') log [P(y|x,θ') / P(y|x,θ)]

The last term is a relative entropy and hence ≥ 0. So choosing θ* = argmax_θ Q(θ|θ') maximizes the difference Δ, and repeating this process leads to a local maximum of log P(x|θ).
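The identity above is easy to check numerically. This sketch (mine, reusing the same kind of toy joint tables as before) verifies that Δ = log P(x|θ) - log P(x|θ') equals Q(θ|θ') - Q(θ'|θ') plus a relative entropy term that is never negative.

```python
import math

# Two toy joint distributions p(x, y | theta) over binary x, y (assumed values).
theta_prime = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}
theta       = {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.3}
x = 1

def p_x(joint):
    return joint[(x, 0)] + joint[(x, 1)]

def posterior(joint):
    return {y: joint[(x, y)] / p_x(joint) for y in (0, 1)}

def Q(t, t_prime):
    return sum(posterior(t_prime)[y] * math.log(t[(x, y)]) for y in (0, 1))

delta = math.log(p_x(theta)) - math.log(p_x(theta_prime))
q_diff = Q(theta, theta_prime) - Q(theta_prime, theta_prime)
rel_entropy = sum(posterior(theta_prime)[y] *
                  math.log(posterior(theta_prime)[y] / posterior(theta)[y])
                  for y in (0, 1))

print(round(delta, 12), round(q_diff + rel_entropy, 12))   # identical
print(rel_entropy >= 0)                                    # relative entropy >= 0
```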

Page 14:

The EM algorithm itself

Input: a likelihood function P(x,y|θ) parameterized by θ.

Initialization: fix an arbitrary starting value θ'.

Repeat:

E-step: compute Q(θ|θ') = E_θ'[log P(x,y|θ)]

M-step: θ' ← argmax_θ Q(θ|θ')

Until Δ = log P(x|θ) - log P(x|θ') < ε

Comment: at the M-step one can actually choose any new θ' as long as Δ > 0. This change yields the so-called Generalized EM algorithm. It is important when the argmax is hard to compute.
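A minimal Python rendering of this pseudocode (my own sketch): the E-step packages Q(·|θ') as a function, the M-step maximizes it, and the loop stops once the improvement Δ in log-likelihood drops below ε. The callables make_Q, argmax_Q, and loglik are placeholders to be supplied by a concrete model.

```python
def em(theta, make_Q, argmax_Q, loglik, eps=1e-8, max_iters=1000):
    """Generic EM loop: E-step builds Q(. | theta'), M-step maximizes it,
    and iteration stops when the log-likelihood improvement falls below eps."""
    ll = loglik(theta)
    for _ in range(max_iters):
        Q = make_Q(theta)           # E-step: Q(theta | theta') as a function of theta
        theta = argmax_Q(Q)         # M-step: theta' <- argmax_theta Q(theta | theta')
        new_ll = loglik(theta)
        if new_ll - ll < eps:       # Delta = log P(x|theta) - log P(x|theta') < eps
            break
        ll = new_ll
    return theta
```

A Generalized EM variant would only require argmax_Q to return some θ that increases Q, rather than the exact maximizer.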

Page 15:

Expectation Maximization (EM)

In practice, EM converges rather quickly at the start but slowly near the (possibly local) maximum.

Hence, EM is often run for a few iterations and then gradient ascent steps are applied.

Page 16:

MLE from Incomplete Data

Finding the MLE parameters is a nonlinear optimization problem.

[Figure: the likelihood surface L(Θ|D), as on the previous MLE slide.]

Gradient Ascent: follow the gradient of the likelihood with respect to the parameters.

Page 17:

MLE from Incomplete Data

Both ideas find local maxima only, and both require multiple restarts to find an approximation to the global maximum.

Page 18:

Gradient Ascent

Main result

Theorem GA:

∂ log P(D|Θ) / ∂θ_{x_i,pa_i} = (1 / θ_{x_i,pa_i}) Σ_m P(x_i, pa_i | o[m], Θ)

Requires computing P(x_i, pa_i | o[m], Θ) for all i, m.

Inference replaces taking derivatives.
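As a sanity check of Theorem GA (my own sketch, not from the slides), the code below sets up the small E -> A <- B network with assumed CPT values, evaluates P(o[m]|Θ) for partially observed cases by enumerating completions, and compares the theorem's expression for one CPT entry against a numerical derivative obtained by perturbing that entry directly, i.e. treating it as a free parameter without renormalizing the CPT.

```python
import copy
import math
from itertools import product

# Assumed CPTs for the network E -> A <- B (all variables binary, values 0/1).
theta = {
    "E": [0.9, 0.1],                                   # P(E)
    "B": [0.8, 0.2],                                   # P(B)
    "A": {(0, 0): [0.99, 0.01], (0, 1): [0.2, 0.8],    # P(A | E, B)
          (1, 0): [0.3, 0.7],   (1, 1): [0.1, 0.9]},
}

# Observations o[m] over (E, B, A); None marks a missing value.
data = [(1, None, 1), (0, 0, 0), (None, 1, 1), (0, None, 1)]

def joint(theta, e, b, a):
    return theta["E"][e] * theta["B"][b] * theta["A"][(e, b)][a]

def completions(obs):
    return [c for c in product((0, 1), repeat=3)
            if all(o is None or o == v for o, v in zip(obs, c))]

def p_obs(theta, obs):
    return sum(joint(theta, *c) for c in completions(obs))

def loglik(theta):
    return sum(math.log(p_obs(theta, obs)) for obs in data)

# Theorem GA for the entry theta_{A=1 | E=1, B=1}:
# gradient = (1/theta) * sum_m P(A=1, E=1, B=1 | o[m], Theta).
e_val, b_val, a_val = 1, 1, 1
grad = sum(
    sum(joint(theta, *c) for c in completions(obs) if c == (e_val, b_val, a_val))
    / p_obs(theta, obs)
    for obs in data
) / theta["A"][(e_val, b_val)][a_val]

# Numerical check: perturb the single CPT entry as an unconstrained parameter.
eps = 1e-6
tp, tm = copy.deepcopy(theta), copy.deepcopy(theta)
tp["A"][(e_val, b_val)][a_val] += eps
tm["A"][(e_val, b_val)][a_val] -= eps
numeric = (loglik(tp) - loglik(tm)) / (2 * eps)

print(grad, numeric)   # the two values should agree closely
```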

Page 19:

Gradient Ascent (cont)

Proof:

∂ log P(D|Θ) / ∂θ_{x_i,pa_i} = Σ_m ∂ log P(o[m]|Θ) / ∂θ_{x_i,pa_i} = Σ_m (1 / P(o[m]|Θ)) ∂P(o[m]|Θ) / ∂θ_{x_i,pa_i}

How do we compute ∂P(o[m]|Θ) / ∂θ_{x_i,pa_i}?

Page 20:

Gradient Ascent (cont)

Since:

P(o[m] | Θ) = Σ_{x'_i, pa'_i} P(x'_i, pa'_i, o[m] | Θ)
            = Σ_{x'_i, pa'_i} P(o_d[m] | x'_i, pa'_i, o_nd[m], Θ) P(x'_i | pa'_i, Θ) P(pa'_i, o_nd[m] | Θ),

where o_nd[m] and o_d[m] denote the parts of o[m] on non-descendants and descendants of X_i, and P(x'_i | pa'_i, Θ) = θ_{x'_i,pa'_i}. The other two factors do not depend on the entries of X_i's CPT, so only the term with x'_i = x_i and pa'_i = pa_i involves θ_{x_i,pa_i}, and it does so linearly. Therefore

∂P(o[m] | Θ) / ∂θ_{x_i,pa_i} = P(o_d[m] | x_i, pa_i, o_nd[m], Θ) P(pa_i, o_nd[m] | Θ) = P(x_i, pa_i, o[m] | Θ) / θ_{x_i,pa_i}

Page 21:

Gradient Ascent (cont)

Putting it all together we get

∂ log P(D|Θ) / ∂θ_{x_i,pa_i} = Σ_m (1 / P(o[m]|Θ)) ∂P(o[m]|Θ) / ∂θ_{x_i,pa_i}

                             = Σ_m P(x_i, pa_i, o[m] | Θ) / (P(o[m]|Θ) θ_{x_i,pa_i})

                             = (1 / θ_{x_i,pa_i}) Σ_m P(x_i, pa_i | o[m], Θ),

which is exactly the statement of Theorem GA.