Learning Bayesian Networks
Post on 20-Dec-2015
Dimensions of Learning
• Model: Bayes net / Markov net
• Data: Complete / Incomplete
• Structure: Known / Unknown
• Objective: Generative / Discriminative
Learning Bayes nets from data
[Figure: data + prior/expert information → Bayes-net learner → Bayes net(s) over X1 ... X9]
Example data (one row per case):
X1      X2   X3     ...
true    1    0.7    ...
false   5    -1.6   ...
false   3    5.9    ...
true    2    6.3    ...
...
From thumbtacks to Bayes nets
The thumbtack problem can be viewed as learning the probability for a very simple BN:
[Figure: a single node X (heads/tails), with one copy X1, X2, ..., XN per toss (toss 1, toss 2, ..., toss N)]
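The next simplest Bayes net
[Figure: two nodes, X (heads/tails) and Y (heads/tails); the parameter for X generates X1, X2, ..., XN and the parameter for Y generates Y1, Y2, ..., YN over case 1, case 2, ..., case N. The two parameters are independent: "parameter independence".]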
"Parameter independence" ⇒ two separate thumbtack-like learning problems
A bit more difficult...
[Figure: X (heads/tails) → Y (heads/tails)]
Three probabilities to learn:
• X=heads
• Y=heads|X=heads
• Y=heads|X=tails
[Figure: parameters for X, Y|X=heads, and Y|X=tails over case 1 (X1, Y1) and case 2 (X2, Y2); the value of X in each case (heads or tails) determines which of the two Y parameters generates that case's Y]
3 separate thumbtack-like problems
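The decomposition can be illustrated with a minimal sketch. It assumes complete data for the net X → Y, a Beta(1, 1) prior on each of the three parameters, and a small made-up data set; the names `posterior_mean` and `cases` are illustrative and not part of the slides.

```python
from collections import Counter

def posterior_mean(heads, tails, alpha_h=1.0, alpha_t=1.0):
    """Posterior mean of P(heads) under a Beta(alpha_h, alpha_t) prior."""
    return (alpha_h + heads) / (alpha_h + alpha_t + heads + tails)

# Hypothetical complete data: one (x, y) pair per case, "h" = heads, "t" = tails.
cases = [("h", "h"), ("h", "t"), ("t", "t"), ("h", "h"), ("t", "h"), ("t", "t")]

x_counts = Counter(x for x, _ in cases)   # counts for the X problem
xy_counts = Counter(cases)                # counts for the two Y|X problems

# Each parameter is learned from its own counts, independently of the others.
p_x = posterior_mean(x_counts["h"], x_counts["t"])
p_y_given_xh = posterior_mean(xy_counts[("h", "h")], xy_counts[("h", "t")])
p_y_given_xt = posterior_mean(xy_counts[("t", "h")], xy_counts[("t", "t")])

print(p_x, p_y_given_xh, p_y_given_xt)
```

Each estimate uses only the cases relevant to its own parameter, which is what makes the three problems separate.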
In general …
Learning probabilities in a Bayes net is straightforward if
• Complete data
• Local distributions from the exponential family (binomial, Poisson, gamma, ...)
• Parameter independence
• Conjugate priors
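Incomplete data makes parameters dependent
[Figure: the network X (heads/tails) → Y (heads/tails) with parameters for X, Y|X=heads, and Y|X=tails over case 1 (X1, Y1) and case 2 (X2, Y2); when a value is missing, the parameters are no longer independent given the data]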
Solution: Use EM
• Initialize parameters ignoring missing data
• E step: Infer missing values using current parameters
• M step: Estimate parameters using completed data
• Can also use gradient descent
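The EM steps can be sketched for the two-node net X → Y. This is a minimal illustration under stated assumptions: both variables are binary, only X may be missing (encoded as `None`), and at least one fully observed case exists for each value of X so that the initialization is well defined. The function name `em_x_to_y` and the data are hypothetical.

```python
def em_x_to_y(cases, n_iter=50):
    """EM for X -> Y with binary values "h"/"t"; x may be None (missing).

    Returns estimates of P(X=h), P(Y=h | X=h), P(Y=h | X=t).
    """
    # Initialize parameters from the fully observed cases only.
    full = [(x, y) for x, y in cases if x is not None]
    n_h = sum(1 for x, _ in full if x == "h")
    px = n_h / len(full)
    py_h = sum(1 for x, y in full if x == "h" and y == "h") / n_h
    py_t = sum(1 for x, y in full if x == "t" and y == "h") / (len(full) - n_h)

    for _ in range(n_iter):
        # E step: expected counts, using P(X=h | y) for cases with missing X.
        c_h = c_hh = c_th = 0.0
        for x, y in cases:
            if x is None:
                lik_h = px * (py_h if y == "h" else 1 - py_h)
                lik_t = (1 - px) * (py_t if y == "h" else 1 - py_t)
                w = lik_h / (lik_h + lik_t)
            else:
                w = 1.0 if x == "h" else 0.0
            c_h += w
            if y == "h":
                c_hh += w
                c_th += 1 - w
        # M step: re-estimate parameters from the expected counts.
        px = c_h / len(cases)
        py_h = c_hh / c_h
        py_t = c_th / (len(cases) - c_h)
    return px, py_h, py_t

# Hypothetical data: six complete cases and two cases with X missing.
cases = [("h", "h"), ("h", "h"), ("h", "t"), ("t", "t"), ("t", "t"), ("t", "h"),
         (None, "h"), (None, "t")]
px, py_h, py_t = em_x_to_y(cases)
print(px, py_h, py_t)
```

The sketch computes maximum-likelihood estimates; a Bayesian variant would add prior counts in the M step.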
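Bayesian approach
Given data, which model is correct? More likely?
[Figure: two candidate structures over X and Y, labeled model 1 and model 2]
Prior: $p(m_1) = 0.7$, $p(m_2) = 0.3$
After observing data d: $p(m_1 \mid \mathbf{d}) = 0.1$, $p(m_2 \mid \mathbf{d}) = 0.9$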
Bayesian approach: Model averaging
Average the predictions of the models, weighting each model by its posterior probability $p(m \mid \mathbf{d})$.
Bayesian approach: Model selection
Keep the best model:
- Explanation
- Understanding
- Tractability
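The difference between averaging and selection can be shown with a small sketch. The posterior probabilities 0.1 and 0.9 are the ones quoted on the slides; the per-model predictions are made-up numbers used only for illustration.

```python
# Posterior model probabilities p(m | d), as quoted on the slides.
posterior = {"m1": 0.1, "m2": 0.9}

# Hypothetical prediction of each model for some event of interest.
prediction = {"m1": 0.5, "m2": 0.8}

# Model averaging: weight each model's prediction by its posterior.
averaged = sum(posterior[m] * prediction[m] for m in posterior)

# Model selection: keep only the most probable model and use its prediction.
best_model = max(posterior, key=posterior.get)
selected = prediction[best_model]

print(averaged, best_model, selected)
```

When one model dominates the posterior, as here, the two answers are close, which is part of the case for selection.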
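To score a model, use Bayes' theorem
Given data d:
$$p(m \mid \mathbf{d}) \propto p(m)\, p(\mathbf{d} \mid m)$$
$$p(\mathbf{d} \mid m) = \int p(\mathbf{d} \mid \theta, m)\, p(\theta \mid m)\, d\theta$$
Here $p(m \mid \mathbf{d})$ is the model score, $p(\mathbf{d} \mid m)$ is the "marginal likelihood", and $p(\mathbf{d} \mid \theta, m)$ is the likelihood.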
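Thumbtack example
[Figure: a single node X (heads/tails)]
With a conjugate prior $p(\theta \mid m) = \mathrm{Beta}(\theta \mid \alpha_h, \alpha_t)$ and counts $\#h$, $\#t$ of heads and tails in d:
$$p(\mathbf{d} \mid m) = \int p(\theta \mid m)\, \theta^{\#h} (1-\theta)^{\#t}\, d\theta$$
$$= \frac{\Gamma(\alpha_h+\alpha_t)}{\Gamma(\alpha_h)\,\Gamma(\alpha_t)} \int \theta^{\alpha_h+\#h-1} (1-\theta)^{\alpha_t+\#t-1}\, d\theta$$
$$= \frac{\Gamma(\alpha_h+\alpha_t)}{\Gamma(\alpha_h+\alpha_t+\#h+\#t)} \cdot \frac{\Gamma(\alpha_h+\#h)}{\Gamma(\alpha_h)} \cdot \frac{\Gamma(\alpha_t+\#t)}{\Gamma(\alpha_t)}$$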
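As a quick numerical check, the closed form for the thumbtack can be evaluated in log space. This is a minimal sketch: the function name is illustrative, and the default Beta(1, 1) (uniform) prior is an assumption chosen so that the results are easy to verify by hand.

```python
from math import exp, lgamma

def log_marginal_likelihood(n_h, n_t, alpha_h=1.0, alpha_t=1.0):
    """Log marginal likelihood of n_h heads and n_t tails under a Beta prior."""
    return (lgamma(alpha_h + alpha_t) - lgamma(alpha_h + alpha_t + n_h + n_t)
            + lgamma(alpha_h + n_h) - lgamma(alpha_h)
            + lgamma(alpha_t + n_t) - lgamma(alpha_t))

# With a uniform prior, one head alone has marginal likelihood 1/2,
# and a sequence with two heads and one tail has 1/12.
print(exp(log_marginal_likelihood(1, 0)))
print(exp(log_marginal_likelihood(2, 1)))
```

Working with `lgamma` rather than the gamma function itself avoids overflow once the counts grow large.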
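More complicated graphs
[Figure: X (heads/tails) → Y (heads/tails)]
3 separate thumbtack-like learning problems: X, Y|X=heads, and Y|X=tails. The marginal likelihood is the product of one thumbtack term per problem c, each with its own hyperparameters and counts:
$$p(\mathbf{d} \mid m) = \prod_{c \,\in\, \{X,\; Y|X=\text{heads},\; Y|X=\text{tails}\}} \frac{\Gamma(\alpha_h^{c}+\alpha_t^{c})}{\Gamma(\alpha_h^{c}+\alpha_t^{c}+\#h^{c}+\#t^{c})} \cdot \frac{\Gamma(\alpha_h^{c}+\#h^{c})}{\Gamma(\alpha_h^{c})} \cdot \frac{\Gamma(\alpha_t^{c}+\#t^{c})}{\Gamma(\alpha_t^{c})}$$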
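Model score for a discrete Bayes net
$$p(\mathbf{d} \mid m) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij}+N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk}+N_{ijk})}{\Gamma(\alpha_{ijk})}$$
where
• $N_{ijk}$: # cases where $X_i = x_i^k$ and $\mathbf{Pa}_i = \mathbf{pa}_i^j$
• $r_i$: number of states of $X_i$
• $q_i$: number of instances of parents of $X_i$
• $\alpha_{ij} = \sum_{k=1}^{r_i} \alpha_{ijk}$, $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$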
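The score can be computed directly from the count tables. The sketch below assumes the counts $N_{ijk}$ and hyperparameters $\alpha_{ijk}$ are supplied as nested lists indexed by variable i, parent configuration j, and state k; the function name and the example numbers are hypothetical.

```python
from math import exp, lgamma

def log_bd_score(counts, alphas):
    """Log marginal likelihood of a discrete Bayes net.

    counts[i][j][k] = N_ijk and alphas[i][j][k] = alpha_ijk, where i indexes
    variables, j parent configurations, and k states of variable i.
    """
    total = 0.0
    for n_i, a_i in zip(counts, alphas):
        for n_ij, a_ij in zip(n_i, a_i):
            # Term for parent configuration j: Gamma(a_ij) / Gamma(a_ij + N_ij).
            total += lgamma(sum(a_ij)) - lgamma(sum(a_ij) + sum(n_ij))
            for n_ijk, a_ijk in zip(n_ij, a_ij):
                total += lgamma(a_ijk + n_ijk) - lgamma(a_ijk)
    return total

# Hypothetical counts for X -> Y with binary variables:
# X has one (empty) parent configuration; Y has one per state of X.
counts = [[[3, 3]], [[2, 1], [1, 2]]]
alphas = [[[1, 1]], [[1, 1], [1, 1]]]
score = log_bd_score(counts, alphas)
print(exp(score))
```

For this example the three factors are 1/140, 1/12, and 1/12, so the marginal likelihood is 1/20160.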
Computation of marginal likelihood
Efficient closed form if
• Local distributions from the exponential family (binomial, Poisson, gamma, ...)
• Parameter independence
• Conjugate priors
• No missing data (including no hidden variables)
Structure search
• Finding the BN structure with the highest score among those structures with at most k parents is NP-hard for k>1 (Chickering, 1995)
• Heuristic methods
– Greedy
– Greedy with restarts
– MCMC methods
[Flowchart, greedy search: initialize structure → score all possible single changes → any changes better? If yes: perform best change, then score again. If no: return saved structure.]
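The greedy loop can be sketched as follows. This is an illustration under assumptions, not the slides' implementation: a structure is a frozenset of (parent, child) arcs, a single change is an arc addition, deletion, or reversal that keeps the graph acyclic, and `score` is any caller-supplied function (in practice the log model score). The toy score at the end is made up so the example runs on its own.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    indegree = {v: 0 for v in nodes}
    for _, child in edges:
        indegree[child] += 1
    ready = [v for v in nodes if indegree[v] == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for parent, child in edges:
            if parent == v:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return removed == len(nodes)

def neighbors(nodes, edges):
    """Yield every DAG reachable by one arc addition, deletion, or reversal."""
    for a, b in permutations(nodes, 2):
        if (a, b) in edges:
            yield edges - {(a, b)}                      # deletion
            candidate = (edges - {(a, b)}) | {(b, a)}   # reversal
        elif (b, a) not in edges:
            candidate = edges | {(a, b)}                # addition
        else:
            continue
        if is_acyclic(nodes, candidate):
            yield candidate

def greedy_search(nodes, score, start=frozenset()):
    """Apply the best single change until no change improves the score."""
    current = frozenset(start)
    current_score = score(current)
    while True:
        best, best_score = None, current_score
        for candidate in neighbors(nodes, current):
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
        if best is None:
            return current, current_score
        current, current_score = best, best_score

# Toy score (hypothetical): negative distance to a fixed target structure.
nodes = ["X", "Y", "Z"]
target = frozenset({("X", "Y"), ("Y", "Z")})
structure, final_score = greedy_search(nodes, lambda e: -len(e ^ target))
print(sorted(structure), final_score)
```

Greedy search can stop at a local maximum, which is why the slides also list restarts and MCMC methods.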
Structure priors
1. All possible structures equally likely
2. Partial ordering, required / prohibited arcs
3. Prior(m) ∝ Similarity(m, prior BN)
Parameter priors
Recall the intuition behind the Beta prior for the thumbtack:
• The hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"
• Equivalent sample size = α_h + α_t
• The larger the equivalent sample size, the more confident we are about the long-run fraction
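The effect of the equivalent sample size can be seen in a short sketch. It assumes the Beta prior is specified by a prior mean and an equivalent sample size, so that the imaginary counts are their product; the function name and the data (8 heads, 2 tails) are made up for illustration.

```python
def posterior_mean_with_ess(n_h, n_t, prior_mean, ess):
    """Posterior mean of P(heads) for a Beta prior given by mean and ESS."""
    alpha_h = prior_mean * ess          # imaginary heads
    alpha_t = (1 - prior_mean) * ess    # imaginary tails
    return (alpha_h + n_h) / (alpha_h + alpha_t + n_h + n_t)

# Same data and prior mean, different confidence in the prior.
weak = posterior_mean_with_ess(8, 2, prior_mean=0.5, ess=2)
strong = posterior_mean_with_ess(8, 2, prior_mean=0.5, ess=100)
print(weak, strong)
```

With a small equivalent sample size the estimate follows the data (0.75 here); with a large one it stays near the prior mean (about 0.53).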
Parameter priors
[Figure: a Bayes net over x1 ... x9 + an equivalent sample size → an imaginary count for any variable configuration → (parameter modularity) → parameter priors for any Bayes net structure for X1…Xn]