Learning Bayesian Networks

Dimensions of Learning

Model: Bayes net / Markov net

Data: Complete / Incomplete

Structure: Known / Unknown

Objective: Generative / Discriminative

Learning Bayes nets from data

data (one column per variable, one row per case):

X1      X2    X3      ...
true    1     0.7     ...
false   5     -1.6    ...
false   3     5.9     ...
true    2     6.3     ...
...

[Figure: data + prior/expert information → Bayes-net learner → Bayes net(s) over X1, ..., X9]

From thumbtacks to Bayes nets

Thumbtack problem can be viewed as learning the probability for a very simple BN:

[Figure: a single node X (heads/tails), unrolled over the data as X1, X2, ..., XN for toss 1, toss 2, ..., toss N]

The next simplest Bayes net

[Figure: two nodes, X (heads/tails) and Y (heads/tails), unrolled over the data as X1, X2, ..., XN and Y1, Y2, ..., YN for case 1, case 2, ..., case N; a "?" asks whether the X part and the Y part are linked]

With "parameter independence": two separate thumbtack-like learning problems

A bit more difficult...


Three probabilities to learn:

• X=heads

• Y=heads|X=heads

• Y=heads|X=tails

[Figure: X with instances X1, X2; Y|X=heads and Y|X=tails with instances Y1, Y2; case 1, case 2; example values heads, tails]

3 separate thumbtack-like problems
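The decomposition into three thumbtack-like problems can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the arc X -> Y, a Beta(1,1) prior on each of the three parameters, and the posterior mean as the estimate. The function name `learn_two_node_net` and the example cases are hypothetical.

```python
from collections import Counter

def learn_two_node_net(cases, alpha_h=1.0, alpha_t=1.0):
    """Posterior-mean estimates for X -> Y from complete (x, y) cases.

    Assumes the same Beta(alpha_h, alpha_t) prior for each of the three
    parameters (an illustrative choice, not taken from the slides).
    """
    x_counts = Counter(x for x, _ in cases)   # counts for the X problem
    xy_counts = Counter(cases)                # counts for the two Y|X problems

    def posterior_mean(n_heads, n_tails):
        return (alpha_h + n_heads) / (alpha_h + alpha_t + n_heads + n_tails)

    return {
        "X=heads": posterior_mean(x_counts["heads"], x_counts["tails"]),
        "Y=heads|X=heads": posterior_mean(
            xy_counts[("heads", "heads")], xy_counts[("heads", "tails")]),
        "Y=heads|X=tails": posterior_mean(
            xy_counts[("tails", "heads")], xy_counts[("tails", "tails")]),
    }

# Hypothetical complete data: each case is (x, y).
cases = [("heads", "heads"), ("heads", "tails"),
         ("tails", "tails"), ("heads", "heads")]
estimates = learn_two_node_net(cases)
```

Each estimate uses only the counts relevant to its own problem, which is what parameter independence plus complete data buys.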

In general …

Learning probabilities in a Bayes net is straightforward if

• Complete data

• Local distributions from the exponential family (binomial, Poisson, gamma, ...)

• Parameter independence

• Conjugate priors

Incomplete data makes parameters dependent


[Figure: X with instances X1, X2; Y|X=heads and Y|X=tails with instances Y1, Y2; case 1, case 2]

Solution: Use EM

• Initialize parameters ignoring missing data

• E step: Infer missing values using current parameters

• M step: Estimate parameters using completed data

• Can also use gradient descent
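A minimal sketch of the EM steps listed above, for the two-node net with X sometimes missing. It assumes the arc X -> Y, binary values coded as 1 = heads and 0 = tails, Y always observed, and a small pseudo-count added in the M step to avoid zero probabilities. The pseudo-count and all function names are illustrative assumptions, not part of the slides.

```python
def m_step(rows, pseudo=1.0):
    """Estimate parameters from (w, y) rows, where w = P(X=heads) for the case.

    The pseudo-count is an assumed smoothing term that keeps estimates off 0 and 1.
    """
    n_h = sum(w for w, _ in rows)               # expected # of X=heads
    n_t = sum(1.0 - w for w, _ in rows)         # expected # of X=tails
    n_hh = sum(w * y for w, y in rows)          # expected # of X=heads, Y=heads
    n_th = sum((1.0 - w) * y for w, y in rows)  # expected # of X=tails, Y=heads
    p_x = (n_h + pseudo) / (n_h + n_t + 2 * pseudo)
    p_y_h = (n_hh + pseudo) / (n_h + 2 * pseudo)
    p_y_t = (n_th + pseudo) / (n_t + 2 * pseudo)
    return p_x, p_y_h, p_y_t

def e_step(cases, params):
    """Fill in each missing x with its posterior probability of heads."""
    p_x, p_y_h, p_y_t = params
    rows = []
    for x, y in cases:
        if x is None:
            like_h = p_x * (p_y_h if y else 1.0 - p_y_h)
            like_t = (1.0 - p_x) * (p_y_t if y else 1.0 - p_y_t)
            w = like_h / (like_h + like_t)
        else:
            w = float(x)
        rows.append((w, y))
    return rows

def em(cases, n_iters=100):
    # Initialize from the complete cases only, ignoring cases with missing x.
    complete = [(float(x), y) for x, y in cases if x is not None]
    params = m_step(complete)
    for _ in range(n_iters):
        params = m_step(e_step(cases, params))
    return params

# Hypothetical data: (x, y) with x = None when X was not observed.
cases = [(1, 1), (1, 1), (0, 0), (None, 1), (None, 0)]
p_x, p_y_given_heads, p_y_given_tails = em(cases)
```

A fixed iteration count keeps the sketch short; a real implementation would more likely stop when the parameters or the likelihood stop changing.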

Learning Bayes-net structure

Given data, which model is correct?

model 1: [graph over X and Y]

model 2: [graph over X and Y]

Bayesian approach

Given data, which model is more likely?

model 1: [graph over X and Y], $p(m_1) = 0.7$

model 2: [graph over X and Y], $p(m_2) = 0.3$

Data d

$p(m_1 \mid d) = 0.1$

$p(m_2 \mid d) = 0.9$

Bayesian approach: Model averaging

model 1: [graph over X and Y], $p(m_1) = 0.7$

model 2: [graph over X and Y], $p(m_2) = 0.3$

Data d

$p(m_1 \mid d) = 0.1$

$p(m_2 \mid d) = 0.9$

average predictions
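Averaging predictions means weighting each model's prediction by that model's posterior probability. A small sketch, using the posteriors 0.1 and 0.9 from the slide; the per-model predictions are made-up numbers for illustration only, and `model_average` is a hypothetical helper name.

```python
def model_average(posteriors, predictions):
    """Return sum over models m of p(m | d) * (prediction under m)."""
    return sum(posteriors[m] * predictions[m] for m in posteriors)

# Posterior model probabilities from the slide.
posteriors = {"m1": 0.1, "m2": 0.9}

# Hypothetical per-model predictions of some event, e.g. p(Y=heads | X=heads, d, m).
predictions = {"m1": 0.5, "m2": 0.8}

averaged = model_average(posteriors, predictions)
```

With these numbers the averaged prediction is 0.1 * 0.5 + 0.9 * 0.8 = 0.77, close to but not identical to the prediction of the more probable model.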

Bayesian approach: Model selection

model 1: [graph over X and Y], $p(m_1) = 0.7$

model 2: [graph over X and Y], $p(m_2) = 0.3$

Data d

$p(m_1 \mid d) = 0.1$

$p(m_2 \mid d) = 0.9$

Keep the best model:

• Explanation

• Understanding

• Tractability

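To score a model, use Bayes' theorem

Given data d:

$$p(m \mid d) \propto p(m)\, p(d \mid m)$$

Here $p(m \mid d)$ is the model score, and $p(d \mid m)$ is the "marginal likelihood":

$$p(d \mid m) = \int p(d \mid \theta, m)\, p(\theta \mid m)\, d\theta$$

where $p(d \mid \theta, m)$ is the likelihood.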
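As a worked check, Bayes' theorem in odds form can be applied to the numbers on the earlier slides (priors 0.7 and 0.3, posteriors 0.1 and 0.9) to see how strongly the data must favor model 2. This is only arithmetic on those four numbers; the slides do not state the marginal likelihoods themselves.

```latex
\frac{p(m_2 \mid d)}{p(m_1 \mid d)}
  = \frac{p(m_2)}{p(m_1)} \cdot \frac{p(d \mid m_2)}{p(d \mid m_1)}
\quad\Longrightarrow\quad
\frac{p(d \mid m_2)}{p(d \mid m_1)}
  = \frac{0.9 / 0.1}{0.3 / 0.7}
  = 9 \cdot \frac{7}{3}
  = 21
```

So the marginal likelihood of the data under model 2 would have to be 21 times that under model 1 to turn a 0.7 prior for model 1 into a 0.1 posterior.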

Thumbtack example

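[Figure: single node X (heads/tails)]

With a conjugate (Beta) prior on $\theta$ with hyperparameters $\alpha_h$ and $\alpha_t$, and $\#h$ heads and $\#t$ tails in the data:

$$p(d \mid m) = \int \theta^{\#h} (1-\theta)^{\#t}\, p(\theta \mid m)\, d\theta = \frac{\Gamma(\alpha_h + \alpha_t)}{\Gamma(\alpha_h + \alpha_t + \#h + \#t)} \cdot \frac{\Gamma(\alpha_h + \#h)}{\Gamma(\alpha_h)} \cdot \frac{\Gamma(\alpha_t + \#t)}{\Gamma(\alpha_t)}$$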
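The closed-form marginal likelihood for the thumbtack with a Beta prior can be evaluated directly with log-gamma functions. A minimal sketch, assuming a Beta(alpha_h, alpha_t) prior with default Beta(1,1); the function name is hypothetical, and working in log space is a numerical convenience rather than something the slides prescribe.

```python
from math import exp, lgamma

def log_marginal_likelihood(n_heads, n_tails, alpha_h=1.0, alpha_t=1.0):
    """log p(d | m) for a thumbtack with a Beta(alpha_h, alpha_t) prior."""
    return (lgamma(alpha_h + alpha_t)
            - lgamma(alpha_h + alpha_t + n_heads + n_tails)
            + lgamma(alpha_h + n_heads) - lgamma(alpha_h)
            + lgamma(alpha_t + n_tails) - lgamma(alpha_t))

# Two heads and one tail under a uniform prior: the integral of
# theta^2 * (1 - theta) over [0, 1], which equals 1/12.
score = exp(log_marginal_likelihood(2, 1))
```

Using `lgamma` instead of the gamma function avoids overflow once the counts reach the hundreds.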

More complicated graphs


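3 separate thumbtack-like learning problems (X, Y|X=heads, Y|X=tails), so the marginal likelihood is a product of three thumbtack terms:

$$p(d \mid m) = \prod_{s} \frac{\Gamma(\alpha^s_h + \alpha^s_t)}{\Gamma(\alpha^s_h + \alpha^s_t + \#h^s + \#t^s)} \cdot \frac{\Gamma(\alpha^s_h + \#h^s)}{\Gamma(\alpha^s_h)} \cdot \frac{\Gamma(\alpha^s_t + \#t^s)}{\Gamma(\alpha^s_t)}$$

where $s$ ranges over the three problems X, Y|X=heads, and Y|X=tails; $\#h^s$ and $\#t^s$ are the heads and tails counts relevant to problem $s$; and $\alpha^s_h$, $\alpha^s_t$ are its Beta hyperparameters.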

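Model score for a discrete Bayes net

$$p(d \mid m) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$

$N_{ijk}$: # cases where $X_i = x_i^k$ and $\mathrm{Pa}_i = \mathrm{pa}_i^j$

$r_i$: number of states of $X_i$

$q_i$: number of instances of parents of $X_i$

$\alpha_{ij} = \sum_{k=1}^{r_i} \alpha_{ijk}$, $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$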
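The score above is a sum of log-gamma terms over variables, parent configurations, and states, so it is short to compute from counts. The sketch below assumes complete data given as a list of dicts and the same hyperparameter `alpha` for every (i, j, k), which is an illustrative simplification; `log_bd_score` and the example structures are hypothetical names.

```python
from collections import defaultdict
from math import lgamma

def log_bd_score(data, parents, states, alpha=1.0):
    """log p(d | m) for a discrete Bayes net with alpha_ijk = alpha for all i, j, k.

    data:    list of dicts mapping variable name -> observed state
    parents: dict mapping variable name -> tuple of parent names
    states:  dict mapping variable name -> list of possible states
    """
    score = 0.0
    for var, pa in parents.items():
        # counts[parent configuration][state] = N_ijk
        counts = defaultdict(lambda: defaultdict(int))
        for row in data:
            counts[tuple(row[p] for p in pa)][row[var]] += 1
        r = len(states[var])
        # Parent configurations never observed contribute a factor of 1,
        # so only the observed ones need to be visited.
        for n_ijk in counts.values():
            n_ij = sum(n_ijk.values())
            score += lgamma(r * alpha) - lgamma(r * alpha + n_ij)
            for k in states[var]:
                score += lgamma(alpha + n_ijk[k]) - lgamma(alpha)
    return score

# Hypothetical data in which X and Y always agree.
data = [{"X": "h", "Y": "h"}, {"X": "h", "Y": "h"},
        {"X": "t", "Y": "t"}, {"X": "t", "Y": "t"}]
states = {"X": ["h", "t"], "Y": ["h", "t"]}
no_arc = log_bd_score(data, {"X": (), "Y": ()}, states)
x_to_y = log_bd_score(data, {"X": (), "Y": ("X",)}, states)
```

On this data the structure with the arc scores higher, since knowing X removes all uncertainty about Y.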

Computation of marginal likelihood

Efficient closed form if

• Local distributions from the exponential family (binomial, Poisson, gamma, ...)

• Parameter independence

• Conjugate priors

• No missing data (including no hidden variables)

Structure search

• Finding the BN structure with the highest score among those structures with at most k parents is NP-hard for k>1 (Chickering, 1995)

• Heuristic methods

– Greedy

– Greedy with restarts

– MCMC methods

[Flowchart: initialize structure → score all possible single changes → any changes better? → yes: perform best change, then score again → no: return saved structure]
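The greedy loop in the flowchart can be sketched as hill climbing over arc additions, deletions, and reversals. Several things here are assumptions made for the sake of a runnable example: binary variables, a uniform hyperparameter of 1 for every count, an empty starting structure, and the three kinds of single-arc change. All function names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import lgamma

def family_score(data, var, pa, n_states=2, alpha=1.0):
    """Log marginal likelihood of one variable given its parents (uniform alpha)."""
    n_ij = Counter(tuple(row[p] for p in pa) for row in data)
    n_ijk = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
    score = sum(lgamma(n_states * alpha) - lgamma(n_states * alpha + n)
                for n in n_ij.values())
    # Unobserved (j, k) pairs contribute lgamma(alpha) - lgamma(alpha) = 0.
    return score + sum(lgamma(alpha + n) - lgamma(alpha) for n in n_ijk.values())

def total_score(data, parents):
    return sum(family_score(data, v, sorted(pa)) for v, pa in parents.items())

def is_acyclic(parents):
    visited, on_path = set(), set()
    def visit(v):
        if v in on_path:
            return False
        if v in visited:
            return True
        on_path.add(v)
        ok = all(visit(p) for p in parents[v])
        on_path.discard(v)
        visited.add(v)
        return ok
    return all(visit(v) for v in parents)

def neighbors(parents):
    """Yield every structure one arc addition, deletion, or reversal away."""
    for a, b in permutations(parents, 2):
        new = {v: set(pa) for v, pa in parents.items()}
        if a in parents[b]:            # arc a -> b exists
            new[b].discard(a)
            yield new                  # deletion
            rev = {v: set(pa) for v, pa in new.items()}
            rev[a].add(b)
            yield rev                  # reversal to b -> a
        elif b not in parents[a]:      # no arc between a and b yet
            new[b].add(a)
            yield new                  # addition of a -> b

def greedy_search(data, variables):
    current = {v: set() for v in variables}        # initialize structure
    current_score = total_score(data, current)
    while True:
        # Score all possible single changes that keep the graph acyclic.
        scored = [(total_score(data, s), s)
                  for s in neighbors(current) if is_acyclic(s)]
        best_score, best = max(scored, key=lambda t: t[0],
                               default=(float("-inf"), None))
        if best_score <= current_score:            # no change is better
            return current, current_score
        current, current_score = best, best_score  # perform best change

# Hypothetical data in which X and Y always agree.
data = [{"X": 0, "Y": 0}, {"X": 0, "Y": 0}, {"X": 1, "Y": 1}, {"X": 1, "Y": 1}]
structure, score = greedy_search(data, ["X", "Y"])
```

Requiring a strict improvement at every step guarantees termination, because there are finitely many structures. Greedy with restarts would wrap `greedy_search` in a loop over random starting structures and keep the best result.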

Structure priors

1. All possible structures equally likely

2. Partial ordering, required / prohibited arcs

3. Prior(m) ∝ Similarity(m, prior BN)

Parameter priors

• All uniform: Beta(1,1)

• Use a prior Bayes net

Parameter priors

Recall the intuition behind the Beta prior for the thumbtack:

• The hyperparameters $\alpha_h$ and $\alpha_t$ can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"

• Equivalent sample size = $\alpha_h + \alpha_t$

• The larger the equivalent sample size, the more confident we are about the long-run fraction
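The effect of the equivalent sample size shows up clearly in the posterior mean, where imaginary counts are simply added to observed counts. A small sketch with made-up numbers; the function name and the split of the equivalent sample size by a prior fraction are illustrative assumptions.

```python
def posterior_mean_heads(n_heads, n_tails, prior_fraction, equivalent_sample_size):
    """Posterior mean of the heads probability under a Beta prior.

    The prior is specified by a guessed long-run fraction of heads and an
    equivalent sample size, which together give the imaginary counts.
    """
    alpha_h = prior_fraction * equivalent_sample_size
    alpha_t = (1.0 - prior_fraction) * equivalent_sample_size
    return (alpha_h + n_heads) / (alpha_h + alpha_t + n_heads + n_tails)

# Same hypothetical data (8 heads, 2 tails) and same prior guess of 0.5,
# with a weak prior (2 imaginary tosses) and a strong one (100 imaginary tosses).
weak = posterior_mean_heads(8, 2, prior_fraction=0.5, equivalent_sample_size=2)
strong = posterior_mean_heads(8, 2, prior_fraction=0.5, equivalent_sample_size=100)
```

The weak prior gives 9/12 = 0.75, close to the observed fraction 0.8. The strong prior gives 58/110, which stays near the prior guess of 0.5.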

Parameter priors

[Figure: prior network over x1, ..., x9 + equivalent sample size → imaginary count for any variable configuration → (parameter modularity) → parameter priors for any Bayes net structure for X1…Xn]
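One common way to realize this pipeline is to set each imaginary count to the equivalent sample size times the probability that the prior network assigns to the corresponding configuration. The sketch below assumes that construction and computes the joint by brute-force enumeration, which is only practical for tiny networks. The two-node prior network, its probabilities, and all names are hypothetical.

```python
from itertools import product

def joint_probability(assignment, cpts, parents):
    """p(assignment) under the prior network: product of local probabilities."""
    p = 1.0
    for var, table in cpts.items():
        pa_values = tuple(assignment[q] for q in parents[var])
        p *= table[pa_values][assignment[var]]
    return p

def imaginary_counts(var, new_parents, cpts, parents, states, ess):
    """alpha[(parent config, state)] = ess * p(var=state, new_parents=config).

    new_parents may differ from var's parents in the prior network, so the
    same prior network yields parameter priors for any candidate structure.
    """
    variables = list(cpts)
    alpha = {}
    for values in product(*(states[v] for v in variables)):
        assignment = dict(zip(variables, values))
        key = (tuple(assignment[q] for q in new_parents), assignment[var])
        alpha[key] = alpha.get(key, 0.0) + ess * joint_probability(
            assignment, cpts, parents)
    return alpha

# Hypothetical prior network X -> Y.
states = {"X": ["h", "t"], "Y": ["h", "t"]}
parents = {"X": (), "Y": ("X",)}
cpts = {
    "X": {(): {"h": 0.5, "t": 0.5}},
    "Y": {("h",): {"h": 0.8, "t": 0.2}, ("t",): {"h": 0.2, "t": 0.8}},
}

# Imaginary counts for X in a candidate structure where Y is the parent of X.
alpha = imaginary_counts("X", ("Y",), cpts, parents, states, ess=10.0)
```

The imaginary counts for a variable always sum to the equivalent sample size, whatever parent set the candidate structure gives it.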

Combining knowledge & data

[Figure: prior network over x1, ..., x9 + equivalent sample size, combined with data, yields improved network(s) over x1, ..., x9]

data (one column per variable, one row per case):

x1      x2      x3      ...
true    false   true    ...
false   false   true    ...
false   false   false   ...
true    true    false   ...
...
