all of graphical models
Post on 03-Jun-2018
236 Views
Preview:
TRANSCRIPT
-
8/12/2019 All of Graphical Models
1/135
All of Graphical Models
Xiaojin Zhu
Department of Computer SciencesUniversity of WisconsinMadison, USA
Tutorial at ICMLA 2011
http://find/http://goback/ -
8/12/2019 All of Graphical Models
2/135
The Whole Tutorial in One Slide
Given GM = joint distribution p(x1, . . . , xn)
Do inference = p(XQ| XE), in generalXQ XE {x1 . . . xn}
Ifp(x1, . . . , xn) not given, estimate it from data
http://find/http://goback/ -
8/12/2019 All of Graphical Models
3/135
Outline
Life without Graphical Models
RepresentationDirected Graphical Models (Bayesian Networks)Undirected Graphical Models (Markov Random Fields)
Inference
Exact InferenceMarkov Chain Monte CarloVariational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential FamilyMaximizing Problems
Parameter Learning
Structure Learning
http://find/ -
8/12/2019 All of Graphical Models
4/135
Outline
Life without Graphical Models
RepresentationDirected Graphical Models (Bayesian Networks)Undirected Graphical Models (Markov Random Fields)
Inference
Exact InferenceMarkov Chain Monte CarloVariational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential FamilyMaximizing Problems
Parameter Learning
Structure Learning
http://find/ -
8/12/2019 All of Graphical Models
5/135
Life without Graphical Models
. . . is fine mathematically:
The universe is reduced to a set of random variablesx1, . . . , xn e.g., x1, . . . , xn1 can be the discrete or continuous features e.g., xn y can be the discrete class label
The joint p(x1, . . . , xn) completely describes how the universe
works
Machine learning: estimate p(x1, . . . , xn) from training
data X(1), . . . , X (N), where X(i) = (x(i)1 , . . . , x
(i)n )
Prediction: y = argmaxp(xn| x1, . . . , x
n), a.k.a.
inference by the definition of conditional probability
p(xn| x
1, . . . , x
n) =
p(x1, . . . , xn
, xn)
vp(x1, . . . , x
n
, xn=v)
http://find/ -
8/12/2019 All of Graphical Models
6/135
Conclusion
Life without graphical models is just fine So why are we still here?
http://find/http://goback/ -
8/12/2019 All of Graphical Models
7/135
Life can be Better for Computer Scientists
Given GM = joint distribution p(x1, . . . , xn) exponential nave storage (2n for binary r.v.) hard to interpret (conditional independence)
Do inference = p(XQ| XE), in generalXQ XE {x1 . . . xn} Often cant do it computationally
Ifp(x1, . . . , xn) not given, estimate it from data
Cant do it either
http://find/http://goback/ -
8/12/2019 All of Graphical Models
8/135
Acknowledgments Before We Start
Much of this tutorial is based on
Koller & Friedman, Probabilistic Graphical Models. MIT 2009
Wainwright & Jordan, Graphical Models, ExponentialFamilies, and Variational Inference. FTML 2008
Bishop, Pattern Recognition and Machine Learning. Springer2006.
http://find/ -
8/12/2019 All of Graphical Models
9/135
Outline
Life without Graphical Models
RepresentationDirected Graphical Models (Bayesian Networks)Undirected Graphical Models (Markov Random Fields)
Inference
Exact InferenceMarkov Chain Monte CarloVariational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential FamilyMaximizing Problems
Parameter Learning
Structure Learning
http://find/http://goback/ -
8/12/2019 All of Graphical Models
10/135
Graphical-Model-Nots
Graphical model is the study ofprobabilistic models
Just because there is a graph with nodes and edges doesntmean its GM
These are not graphical models
neural network decision tree network flow HMM template
http://find/http://goback/ -
8/12/2019 All of Graphical Models
11/135
Outline
Life without Graphical Models
RepresentationDirected Graphical Models (Bayesian Networks)Undirected Graphical Models (Markov Random Fields)
Inference
Exact InferenceMarkov Chain Monte CarloVariational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential FamilyMaximizing Problems
Parameter Learning
Structure Learning
http://find/http://goback/ -
8/12/2019 All of Graphical Models
12/135
Bayesian Network
A directed graph has nodes X= (x1, . . . , xn), some of themconnected by directed edges xi xj
A cycle is a directed path x1 . . . xk where x1=xk A directed acyclic graph (DAG) contains no cycles
A Bayesian network on the DAG is a family of distributionssatisfying
{p | p(X) =i
p(xi| P a(xi))}
where P a(xi) is the set of parents ofxi.
p(xi| P a(xi)) is the conditional probability distribution(CPD) at xi
By specifying the CPDs for all i, we specify a particulardistribution p(X)
http://find/ -
8/12/2019 All of Graphical Models
13/135
Example: Alarm
Binary variables
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001
P(J | A) = 0.9
P(J | ~A) = 0.05
P(M | A) = 0.7
P(M | ~A) = 0.01
A
J M
B E
P(E)=0.002P(B)=0.001
P(B, E,A,J, M)
= P(B)P( E)P(A | B, E)P(J | A)P( M | A)= 0.001 (1 0.002) 0.94 0.9 (1 0.7)
.000253
http://find/http://goback/ -
8/12/2019 All of Graphical Models
14/135
Example: Naive Bayes
y y
x x. . .1 d x
d
p(y, x1, . . . xd) =p(y)di=1p(xi| y)
Used extensively in natural language processing Plate representation on the right
http://find/ -
8/12/2019 All of Graphical Models
15/135
No Causality Whatsoever
P(A)=a
P(B|A)=b
P(B|~A)=c
A
B
B
A
P(B)=ab+(1a)c
P(A|B)=ab/(ab+(1a)c)
P(A|~B)=a(1b)/(1ab(1a)c)
The two BNs are equivalent in all respects
Bayesian networks imply no causality at all
They only encode the joint probability distribution (hencecorrelation)
However, people tend to design BNs based on causal relations
( )
http://find/ -
8/12/2019 All of Graphical Models
16/135
Example: Latent Dirichlet Allocation (LDA)
Nd
w
D
T
z
A generative model for p(,,z,w| , ):For each topic t
t Dirichlet()For each document d
Dirichlet()For each word position in d
topic z Multinomial()word w Multinomial(z)
Inference goals: p(z| w,,), argmax,p(, | w,,)
E l L Di i hl All i (LDA)
http://find/ -
8/12/2019 All of Graphical Models
17/135
Example: Latent Dirichlet Allocation (LDA)
Nd
w
D
T
z
A generative model for p(,,z,w| , ):For each topic t
t Dirichlet()For each document d
Dirichlet()For each word position in d
topic z Multinomial()word w Multinomial(z)
Inference goals: p(z| w,,), argmax,p(, | w,,)
E l L Di i hl All i (LDA)
http://find/ -
8/12/2019 All of Graphical Models
18/135
Example: Latent Dirichlet Allocation (LDA)
Nd
w
D
T
z
A generative model for p(,,z,w| , ):For each topic t
t Dirichlet()For each document d
Dirichlet()For each word position in d
topic z Multinomial()word w Multinomial(z)
Inference goals: p(z| w,,), argmax,p(, | w,,)
E l L Di i hl All i (LDA)
http://find/ -
8/12/2019 All of Graphical Models
19/135
Example: Latent Dirichlet Allocation (LDA)
Nd
w
D
T
z
A generative model for p(,,z,w| , ):For each topic t
t Dirichlet()For each document d
Dirichlet()For each word position in d
topic z Multinomial()word w Multinomial(z)
Inference goals: p(z| w,,), argmax,p(, | w,,)
E l L t t Di i hl t All ti (LDA)
http://find/ -
8/12/2019 All of Graphical Models
20/135
Example: Latent Dirichlet Allocation (LDA)
Nd
w
D
T
z
A generative model for p(,,z,w| , ):For each topic t
t Dirichlet()For each document d
Dirichlet()For each word position in dtopic z Multinomial()word w Multinomial(z)
Inference goals: p(z| w,,), argmax,p(, | w,,)
E l L t t Di i hl t All ti (LDA)
http://find/ -
8/12/2019 All of Graphical Models
21/135
Example: Latent Dirichlet Allocation (LDA)
Nd
w
D
T
z
A generative model for p(,,z,w| , ):For each topic t
t Dirichlet()For each document d
Dirichlet()For each word position in dtopic z Multinomial()word w Multinomial(z)
Inference goals: p(z| w,,), argmax,p(, | w,,)
Example: Latent Dirichlet Allocation (LDA)
http://find/ -
8/12/2019 All of Graphical Models
22/135
Example: Latent Dirichlet Allocation (LDA)
Nd
w
D
T
z
A generative model for p(,,z,w| , ):For each topic t
t Dirichlet()For each document d
Dirichlet()For each word position in dtopic z Multinomial()word w Multinomial(z)
Inference goals: p(z| w,,), argmax,p(, | w,,)
Example: Latent Dirichlet Allocation (LDA)
http://goforward/http://find/http://goback/ -
8/12/2019 All of Graphical Models
23/135
Example: Latent Dirichlet Allocation (LDA)
Nd
w
D
T
z
A generative model for p(,,z,w| , ):For each topic t
t Dirichlet()For each document d
Dirichlet()For each word position in dtopic z Multinomial()word w Multinomial(z)
Inference goals: p(z| w,,), argmax,p(, | w,,)
Some Topics by LDA on the Wish Corpus
http://find/ -
8/12/2019 All of Graphical Models
24/135
Some Topics by LDA on the Wish Corpus
p(word | topic)
troops election love
Conditional Independence
http://find/ -
8/12/2019 All of Graphical Models
25/135
Conditional Independence
Two r.v.s A, B are independent ifP(A, B) =P(A)P(B) orP(A|B) =P(A) (the two are equivalent)
Two r.v.s A, B are conditionally independent given C if
P(A, B| C) =P(A | C)P(B| C) orP(A | B, C) =P(A | C) (the two are equivalent)
This extends to groups of r.v.s
Conditional independence in a BN is precisely specified by
d-separation(directed separation)
d-Separation Case 1: Tail-to-Tail
http://find/ -
8/12/2019 All of Graphical Models
26/135
d-Separation Case 1: Tail-to-Tail
C
A B
C
A B
A, B in general dependent
A, B conditionally independent given C
C is a tail-to-tail node, blocks the undirected path A-B
d-Separation Case 2: Head-to-Tail
http://find/ -
8/12/2019 All of Graphical Models
27/135
d-Separation Case 2: Head-to-Tail
A C B A C B
A, B in general dependent A, B conditionally independent given C
C is a head-to-tail node, blocks the path A-B
d-Separation Case 3: Head-to-Head
http://find/http://goback/ -
8/12/2019 All of Graphical Models
28/135
d Separation Case 3: Head to Head
A B A B
C C
A, B in general independent
A, B conditionallydependentgiven C,or any of Csdescendants
C is a head-to-head node,unblocksthe path A-B
d-Separation
http://find/ -
8/12/2019 All of Graphical Models
29/135
d Separation
Any groups of nodes A and B are conditionally independentgiven another group C, if all undirected paths from any node
in A to any node in B are blocked A path is blocked if it includes a node x such that either
The path is head-to-tail or tail-to-tail at x and x C, or The path is head-to-head at x, and neither x nor any of its
descendants is in C.
d-Separation Example 1
http://find/ -
8/12/2019 All of Graphical Models
30/135
d Separation Example 1
The path from A to B not blocked by either E or F A, B dependent given C
A
C
B
F
E
d-Separation Example 2
http://find/ -
8/12/2019 All of Graphical Models
31/135
d Separation Example 2
The path from A to B is blocked both at E and F A, B conditionally independent given F
A
B
F
E
C
Outline
http://find/ -
8/12/2019 All of Graphical Models
32/135
Life without Graphical Models
RepresentationDirected Graphical Models (Bayesian Networks)Undirected Graphical Models (Markov Random Fields)
Inference
Exact InferenceMarkov Chain Monte CarloVariational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Markov Random Fields
http://find/ -
8/12/2019 All of Graphical Models
33/135
The efficiency of directed graphical model (acyclic graph,locally normalized CPDs) also makes it restrictive
A clique C in an undirected graph is a fully connected set ofnodes (note: full of loops!)
Define a nonnegative potential function C :XC R+
An undirected graphical model (aka Markov Random Field)on the graph is a family of distributions satisfying
p|p(X) =
1
ZC
C
(XC
) Z=
CC(XC)dX is the partition function
Example: A Tiny Markov Random Field
http://find/ -
8/12/2019 All of Graphical Models
34/135
p y
x x1 2
C
x1, x2 {1, 1} A single clique C(x1, x2) =e
ax1x2
p(x1, x2) = 1Ze
ax1x2
Z= (ea + ea + ea + ea)
p(1, 1) =p(1, 1) =ea/(2ea + 2ea) p(1, 1) =p(1, 1) =ea/(2ea + 2ea)
When the parameter a >0, favor homogeneous chains
When the parameter a
-
8/12/2019 All of Graphical Models
35/135
g
Real-valued feature functions f1(X), . . . , f k(X)
Real-valued weights w1, . . . , wk
p(X) = 1
Zexp
ki=1
wifi(X)
Example: The Ising Model
http://find/ -
8/12/2019 All of Graphical Models
36/135
s xs xtst
This is an undirected model with x {0, 1}.
p(x) = 1
Zexp
sV
sxs+
(s,t)E
stxsxt
fs(X) =xs, fst(X) =xsxt ws= s, wst= st
http://find/http://goback/ -
8/12/2019 All of Graphical Models
37/135
Example: Gaussian Random Field
-
8/12/2019 All of Graphical Models
38/135
p(X) N(, ) = 1
(2)n/2||1/2exp
1
2(X )1(X )
Multivariate Gaussian
The n n covariance matrix positive semi-definite
Let = 1 be the precision matrix
xi, xj are conditionally independent given all other variables, if
and only ifij = 0
When ij = 0, there is an edge between xi, xj
Conditional Independence in Markov Random Fields
http://find/ -
8/12/2019 All of Graphical Models
39/135
Two group of variables A, B are conditionally independent
given another group C, if Remove C and all edges involving C A, B beome disconnected
A
C
B
Factor Graph
http://find/ -
8/12/2019 All of Graphical Models
40/135
For both directed and undirected graphical models
Bipartite: edges between a variable node and a factor node Factors represent computation
A B
C
(A,B,C)
A B
C
A B
C
(A,B,C)f
A B
C
fP(A)P(B)P(C|A,B)
Outline
http://find/ -
8/12/2019 All of Graphical Models
41/135
Life without Graphical Models
RepresentationDirected Graphical Models (Bayesian Networks)Undirected Graphical Models (Markov Random Fields)
Inference
Exact InferenceMarkov Chain Monte CarloVariational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Outline
http://find/ -
8/12/2019 All of Graphical Models
42/135
Life without Graphical Models
RepresentationDirected Graphical Models (Bayesian Networks)Undirected Graphical Models (Markov Random Fields)
Inference
Exact InferenceMarkov Chain Monte CarloVariational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Inference by Enumeration
http://find/ -
8/12/2019 All of Graphical Models
43/135
Let X= (XQ, XE, XO) for query, evidence, and othervariables.
Infer P(XQ| XE)
By definition
P(XQ| XE) =P(XQ, XE)
P(XE) =
XO
P(XQ, XE, XO)XQ,XO
P(XQ, XE, XO)
Summing exponential number of terms: with k variables inXO each taking r values, there are r
k terms
Details of the summing problem
http://find/ -
8/12/2019 All of Graphical Models
44/135
There are a bunch of other variables x1, . . . , xk We sum over r values each variable can take
vrxi=v1
This is exponential (rk): x1...x
k We want
x1...xk
p(X)
For a graphical model, the joint probability factorsp(X) =
mj=1 fj(X(j))
Each factor fj operates on X(j) X
Eliminating a Variable
http://find/ -
8/12/2019 All of Graphical Models
45/135
Rearrange factors
x1...xkf1 . . . f
l f
+l+1 . . . f
+m by whether
x1 X(j)
Obviously equivalent:
x2...xk
f1 . . . f l
x1f+l+1 . . . f
+m
Introduce a new factor fm+1= x
1
f+l+1 . . . f +m
fm+1 contains the union of variables in f+l+1 . . . f
+m except x1
In fact, x1 disappears altogether in
x2...xkf1 . . . f
l f
m+1
Dynamic programming: compute fm+1 once, use it thereafter
Hope: fm+1 contains very few variables
Recursively eliminate other variables in turn
Example: Chain Graph
http://find/ -
8/12/2019 All of Graphical Models
46/135
A B C D
Binary variables
Say we want P(D) =
A,B,CP(A)P(B|A)P(C|B)P(D|C)
Let f1(A) =P(A). Note f1 is an array of size two:P(A= 0)P(A= 1)
f2(A, B) is a table of size four:P(B= 0|A= 0)P(B= 0|A= 1)
P(B= 1|A= 0)P(B= 1|A= 1))
A,B,Cf1(A)f2(A, B)f3(B, C)f4(C, D) =
B,Cf3(B, C)f4(C, D)(A
f1(A)f2(A, B))
Example: Chain Graph
http://find/ -
8/12/2019 All of Graphical Models
47/135
A B C D
f1(A)f2(A, B) an array of size four: match AvaluesP(A= 0)P(B = 0|A= 0)P(A= 1)P(B = 0|A= 1)P(A= 0)P(B = 1|A= 0)
P(A= 1)P(B = 1|A= 1) f5(B)
A f1(A)f2(A, B) an array of size two
P(A= 0)P(B = 0|A= 0) + P(A= 1)P(B = 0|A= 1)P(A= 0)P(B = 1|A= 0) + P(A= 1)P(B = 1|A= 1)
For this example, f5(B) happens to be P(B)
B,Cf3(B, C)f4(C, D)f5(B) =Cf4(C, D)(
Bf3(B, C)f5(B)), and so on
In the end, f7(D) = (P(D= 0), P(D= 1))
Example: Chain Graph
http://find/ -
8/12/2019 All of Graphical Models
48/135
A B C D
Computation for P(D): 12, 6 +
Enumeration: 48, 14 + Saving depends on elimination order. Finding optimal order
NP-hard; there are heuristic methods.
Saving depends more critically on the graph structure (treewidth), can be intractable
Handling Evidence
http://find/ -
8/12/2019 All of Graphical Models
49/135
For evidence variables XE, simply plug in their value e
Eliminate variables XO =X XE XQ The final factor will be the jointf(XQ) =P(XQ, XE=e)
Normalize to answer query:
P(XQ| XE=e) = f(XQ)
XQf(XQ)
Summary: Exact Inference
http://find/ -
8/12/2019 All of Graphical Models
50/135
Enumeration
Variable elimination
Not covered: junction tree (aka clique tree)
Exact, but intractable for large graphs
Outline
http://find/ -
8/12/2019 All of Graphical Models
51/135
Life without Graphical Models
RepresentationDirected Graphical Models (Bayesian Networks)Undirected Graphical Models (Markov Random Fields)
Inference
Exact InferenceMarkov Chain Monte CarloVariational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Inference by Monte Carlo
http://find/http://goback/ -
8/12/2019 All of Graphical Models
52/135
Consider the inference problem p(XQ=cQ| XE) where
XQ XE {x1 . . . xn}
p(XQ=cQ| XE) =
1(xQ=cQ)p(xQ| XE)dxQ
If we can draw samples x(1)Q , . . . x(m)Q p(xQ| XE), anunbiased estimator is
p(XQ=cQ| XE) 1
m
m
i=1
1(x
(i)Q=cQ)
The variance of the estimator decreases as V/m
Inference reduces to sampling from p(xQ| XE)
Forward Sampling Example
http://find/ -
8/12/2019 All of Graphical Models
53/135
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001
P(J | A) = 0.9
P(J | ~A) = 0.05
P(M | A) = 0.7
P(M | ~A) = 0.01
A
J M
B E
P(E)=0.002P(B)=0.001
To generate a sample X= (B,E,A,J,M):1. Sample B Ber(0.001): r U(0, 1). If(r
-
8/12/2019 All of Graphical Models
54/135
Say the inference task is P(B = 1 | E= 1, M= 1)
Throw awayall samples except those with (E= 1, M= 1)
p(B= 1 | E= 1, M= 1) 1
m
mi=1
1(B(i)=1)
where m is the number of surviving samples
Can be highly inefficient (note P(E= 1) tiny)
Does not work for Markov Random Fields
Gibbs Sampler Example: P(B = 1 | E= 1,M= 1)
http://find/ -
8/12/2019 All of Graphical Models
55/135
Gibbs sampler is a Markov Chain Monte Carlo (MCMC)
method. Directly sample from p(xQ| XE)
Works for both graphical models Initialization:
Fix evidence; randomly set other variables e.g. X(0) = (B= 0, E= 1, A= 0, J= 0, M= 1)
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001
P(J | A) = 0.9
P(J | ~A) = 0.05
P(M | A) = 0.7
P(M | ~A) = 0.01
A
J M
B E
P(E)=0.002P(B)=0.001
Gibbs Update
http://find/ -
8/12/2019 All of Graphical Models
56/135
For each non-evidence variable xi, fixing all other nodes Xi,resample its value xi P(xi| Xi)
This is equivalent to xi P(xi| MarkovBlanket(xi)) For a Bayesian network MarkovBlanket(xi) includes xis
parents, spouses, and children
P(xi| MarkovBlanket(xi)) P(xi| P a(xi)) yC(xi)
P(y| P a(y))
where P a(x) are the parents ofx, and C(x) the children ofx. For many graphical models the Markov Blanket is small. For example,
B P(B| E= 1, A= 0) P(B)P(A= 0 | B, E= 1)
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001
A
J M
B E
P(E)=0.002P(B)=0.001
Gibbs Update
http://find/ -
8/12/2019 All of Graphical Models
57/135
Say we sampled B= 1. Then
X(1) = (B = 1, E= 1, A= 0, J= 0, M= 1) Starting from X(1), sample
A P(A | B= 1, E= 1, J= 0, M= 1) to get X(2)
Move on toJ, then repeat B,A, J, B, A, J . . .
Keep all latersamples. P(B= 1 | E= 1, M= 1) is thefraction of samples with B = 1.
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001
P(J | A) = 0.9
P(J | ~A) = 0.05
P(M | A) = 0.7
P(M | ~A) = 0.01
A
J M
B E
P(E)=0.002P(B)=0.001
Gibbs Example 2: The Ising Model
http://find/ -
8/12/2019 All of Graphical Models
58/135
xs
A
B
C
D
This is an undirected model with x {0, 1}.
p(x) = 1
Z
expsV
sxs+ (s,t)E
stxsxt
Gibbs Example 2: The Ising Model
http://find/ -
8/12/2019 All of Graphical Models
59/135
xs
A
B
C
D
The Markov blanket ofxs is A, B, C, D
In general for undirected graphical models
p(xs| xs) =p(xs| xN(s))
N(s) is the neighbors ofs.
The Gibbs update is
p(xs= 1 | xN(s)) = 1
exp((s+ tN(s) stxt)) + 1
Gibbs Sampling as a Markov Chain
http://find/ -
8/12/2019 All of Graphical Models
60/135
A Markov chain is defined by a transition matrix T(X
| X) Certain Markov chains have a stationary distribution such
that =T
Gibbs sampler is such a Markov chain with
Ti((Xi, x
i) | (Xi, xi)) =p(x
i| Xi), and stationarydistribution p(xQ| XE) But it takes time for the chain to reach stationary distribution
(mix) Can be difficult to assert mixing
In practice burn in: discard X(0)
, . . . , X (T)
UseallofX(T+1), . . . for inference (they are correlated) Do not thin
Collapsed Gibbs Sampling
http://find/ -
8/12/2019 All of Graphical Models
61/135
In general, Ep[f(X)] 1mmi=1 f(X(i)) ifX(i) p
Sometimes X= (Y, Z) where Zhas closed-form operations
If so,
Ep[f(X)] = Ep(Y)Ep(Z|Y)[f(Y, Z)]
1m
mi=1
Ep(Z|Y(i))[f(Y(i), Z)]
ifY(i) p(Y)
No need to sample Z: it is collapsed Collapsed Gibbs sampler Ti((Yi, y
i) | (Yi, yi)) =p(y
i| Yi)
Note p(yi| Yi) =
p(yi, Z | Yi)dZ
Example: Collapsed Gibbs Sampling for LDA
D
http://find/ -
8/12/2019 All of Graphical Models
62/135
Nd
w
D
T
z
Collapse , , Gibbs update:
P(zi=j|z
i,w
)
n(wi)i,j+ n
(di)i,j+
n()i,j+ W n(di)i,+ T
n(wi)i,j : number of times word wi has been assigned to topic j,
excluding the current position
n(di)i,j: number of times a word from document di has been
assigned to topic j, excluding the current position
n()i,j: number of times any word has been assigned to topic j,
excluding the current position
n
(di)
i,: length of document di, excluding the current position
http://find/ -
8/12/2019 All of Graphical Models
63/135
Outline
-
8/12/2019 All of Graphical Models
64/135
Life without Graphical Models
RepresentationDirected Graphical Models (Bayesian Networks)Undirected Graphical Models (Markov Random Fields)
InferenceExact InferenceMarkov Chain Monte CarloVariational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Outline
http://find/ -
8/12/2019 All of Graphical Models
65/135
Life without Graphical Models
RepresentationDirected Graphical Models (Bayesian Networks)Undirected Graphical Models (Markov Random Fields)
InferenceExact InferenceMarkov Chain Monte CarloVariational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
The Sum-Product Algorithm
http://find/ -
8/12/2019 All of Graphical Models
66/135
Also known as belief propagation (BP)
Exact if the graph is a tree; otherwise known as loopy BP,
approximate The algorithm involves passing messageson the factor graph
Alternative view: variational approximation (more later)
Example: A Simple HMM
The Hidden Markov Model template (not a graphical model)
http://find/http://goback/ -
8/12/2019 All of Graphical Models
67/135
The Hidden Markov Model template (not a graphical model)
= = 1/21 2
P(x | z=1)=(1/2, 1/4, 1/4) P(x | z=2)=(1/4, 1/2, 1/4)
1 2
1/4 1/2
R G B R G B
Observing x1=R, x2=G, the directed graphical model
z1
x =G2
z2
x =R1
Factor graphz1f1 z2f2
P(z )P(x | z ) P(z | z )P(x | z )1 1 1 2 1 2 2
Messages
http://find/ -
8/12/2019 All of Graphical Models
68/135
A message is a vector of length K, where Kis the number ofvalues x takes.There are two types of messages:
1. fx: message from a factor node fto a variable node xfx(i) is the ith element, i= 1 . . . K .
2. xf: message from a variable node x to a factor node f
Leaf Messages
http://find/ -
8/12/2019 All of Graphical Models
69/135
Assume tree factor graph. Pick an arbitrary root, say z2
Start messages at leaves. If a leaf is a factor node f, fx(x) =f(x)
If a leaf is a variable node x, xf(x) = 1
z1f1 z2f2
P(z )P(x | z ) P(z | z )P(x | z )1 1 1 2 1 2 2
f1z1(z1= 1) =P(z1= 1)P(R|z1= 1) = 1/2 1/2 = 1/4f1z1(z1= 2) =P(z1= 2)P(R|z1= 2) = 1/2 1/4 = 1/8
= = 1/21 2
P(x | z=1)=(1/2, 1/4, 1/4) P(x | z=2)=(1/4, 1/2, 1/4)
1 2
1/4 1/2
R G B R G B
Message from Variable to Factor
A node (factor or variable) can send out a message if all other
http://find/ -
8/12/2019 All of Graphical Models
70/135
( ) gincoming messages have arrived
Let x be in factor fs.xfs(x) =
fne(x)\fs
fx(x)
ne(x)\fs are factors connected to x excluding fs.
z1f1 z2f2
P(z )P(x | z ) P(z | z )P(x | z )1 1 1 2 1 2 2
z1f2(z1= 1) = 1/4z1f2
(z1
= 2) = 1/8
= = 1/21 2
P(x | z=1)=(1/2, 1/4, 1/4) P(x | z=2)=(1/4, 1/2, 1/4)
1 2
1/4 1/2
R G B R G B
Message from Factor to Variable
Let x be in factor fs Let the other variables in fs be x1:M
http://find/http://goback/ -
8/12/2019 All of Graphical Models
71/135
Let x be in factor fs. Let the other variables in fs be x1:M.
fsx(x) =x1
. . .xM
fs(x, x1, . . . , xM)
Mm=1
xmfs(xm)
z1f1 z2f2
P(z )P(x | z ) P(z | z )P(x | z )1 1 1 2 1 2 2
f2z2(s) =2
s=1
z1f2(s)f2(z1=s
, z2=s)
= 1/4P(z2
=s|z1
= 1)P(x2
=G|z2
=s)
+1/8P(z2=s|z1= 2)P(x2=G|z2=s)
We getf2z2(z2= 1) = 1/32
f2z2(z2 = 2) = 1/8
Up to Root, Back Down
http://find/http://goback/ -
8/12/2019 All of Graphical Models
72/135
The message has reached the root, pass it back down
z1f1 z2f2
P(z )P(x | z ) P(z | z )P(x | z )1 1 1 2 1 2 2
z2f2(z2= 1) = 1z2f2(z2= 2) = 1
= = 1/21 2
P(x | z=1)=(1/2, 1/4, 1/4) P(x | z=2)=(1/4, 1/2, 1/4)
1 2
1/4 1/2
R G B R G B
Keep Passing Down
http://find/ -
8/12/2019 All of Graphical Models
73/135
z1f1 z2f2
P(z )P(x | z ) P(z | z )P(x | z )1 1 1 2 1 2 2
f2z1(s) =2
s=1 z2f2(s)f2(z1=s, z2=s
)= 1P(z2= 1|z1=s)P(x2=G|z2= 1)
+ 1P(z2= 2|z1=s)P(x2=G|z2= 2). We getf2z1(z1= 1) = 7/16f2z1(z1= 2) = 3/8
= = 1/21 2
P(x | z=1)=(1/2, 1/4, 1/4) P(x | z=2)=(1/4, 1/2, 1/4)
1 2
1/4 1/2
R G B R G B
From Messages to Marginals
http://find/ -
8/12/2019 All of Graphical Models
74/135
Once a variable receives all incoming messages, we compute its
marginal asp(x)
fne(x)
fx(x)
In this example
P(z1|x1, x2) f1z1 f2z1 = 1/41/8
7/16
3/8
= 7/643/64
0.70.3
P(z2|x1, x2) f2z2 = 1/32
1/8
0.20.8
One can also compute the marginal of the set ofvariables xsinvolved in a factor fs
p(xs) fs(xs)
xne(f)
xf(x)
Handling Evidence
Ob i
http://find/ -
8/12/2019 All of Graphical Models
75/135
Observing x= v,
we can absorb it in the factor (as we did); or
set messages xf(x) = 0 for all x =v
Observing XE,
multiplying the incoming messages to x / XEgives the joint(not p(x|X
E)):
p(x, XE)
fne(x)
fx(x)
The conditional is easily obtained by normalization
p(x|XE) = p(x, XE)xp(x
, XE)
http://find/ -
8/12/2019 All of Graphical Models
76/135
Outline
Life without Graphical Models
-
8/12/2019 All of Graphical Models
77/135
Life without Graphical Models
RepresentationDirected Graphical Models (Bayesian Networks)Undirected Graphical Models (Markov Random Fields)
InferenceExact InferenceMarkov Chain Monte CarloVariational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Example: The Ising Model
http://find/ -
8/12/2019 All of Graphical Models
78/135
s
xs xt
st
The random variables x take values in{0, 1}.
p(x) = 1
Zexp
sV
sxs+ (s,t)E
stxsxt
The Conditional
st
http://find/ -
8/12/2019 All of Graphical Models
79/135
s
xs xt
st
Markovian: the conditional distribution for xs is
p(xs| xs) =p(xs| xN(s))
N(s) is the neighbors ofs.
This reduces to
p(xs= 1 | xN(s)) = 1
exp((s+
tN(s) stxt)) + 1
Gibbs sampling would draw xs like this.
The Mean Field Algorithm for Ising Model
http://find/ -
8/12/2019 All of Graphical Models
80/135
p(xs= 1 | xN(s)) = 1exp((s+
tN(s) stxt)) + 1
Instead of Gibbs sampling, let s be the estimated marginalp(xs = 1)
s 1
exp((s+
tN(s) stt)) + 1
The s are updated iteratively
The Mean Field algorithm is coordinate ascent andguaranteed to converge to a local optimal (more later).
Outline
Life without Graphical Models
http://find/http://goback/ -
8/12/2019 All of Graphical Models
81/135
p
RepresentationDirected Graphical Models (Bayesian Networks)Undirected Graphical Models (Markov Random Fields)
InferenceExact Inference
Markov Chain Monte CarloVariational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Exponential Family
http://find/http://goback/ -
8/12/2019 All of Graphical Models
82/135
Let (X) = (1(X), . . . , d(X)) be d sufficient statistics,
where i: X R
Note Xis all the nodes in a Graphical model
i(X) sometimes called a feature function
Let = (1, . . . , d) Rd be canonical parameters.
The exponential family is a family of probability densities:
p(x) = exp(x) A()
Exponential Family
http://find/ -
8/12/2019 All of Graphical Models
83/135
p(x) = exp
(x) A()
The key is the inner product between parameters and
sufficient statistics . A is the log partition function,
A() = log
exp
(x)
(dx)
A= log Z
http://find/http://goback/ -
8/12/2019 All of Graphical Models
84/135
Exponential Family Example 1: Bernoulli
-
8/12/2019 All of Graphical Models
85/135
p(x) =x(1 )1x forx {0, 1} and (0, 1). Does not look like an exponential family!
Can be rewritten as
p(x) = exp (x log + (1 x) log(1 )) Now in exponential family form with
1(x) =x, 2(x) = 1 x, 1= log , 2= log(1 ), andA() = 0.
Overcomplete: 1=2= 1 makes (x) = 1 for all x
Exponential Family Example 1: Bernoulli
http://find/http://goback/ -
8/12/2019 All of Graphical Models
86/135
p(x) = exp (x log + (1 x) log(1 ))
Can be further rewritten as
p(x) = exp (x log(1 + exp()))
Minimal exponential family with(x) =x, = log 1, A() = log(1 + exp()).
Many distributions (e.g., Gaussian, exponential, Poisson, Beta) are
in the exponential family, but not all (e.g., the Laplacedistribution).
Exponential Family Example 2: Ising Model
http://find/ -
8/12/2019 All of Graphical Models
87/135
s
xs xt
st
p(x) = exp
sV
sxs+
(s,t)E
stxsxt A()
Binary random variable xs {0, 1}
d= |V| + |E| sufficient statistics: (x) = (. . . xs . . . xst . . .)
This is a regular ( = Rd), minimal exponential family.
Exponential Family Example 3: Potts Model
s
xs xt
st
http://find/http://goback/ -
8/12/2019 All of Graphical Models
88/135
Similar to Ising model but generalizing xs {0, . . . , r 1}. Indicator functions fsj(x) = 1 ifxs=j and 0 otherwise, and
fstjk(x) = 1 ifxs=j xt=k, and 0 otherwise.
p(x) = exp
sj
sjfsj(x) +stjk
stjkfstjk(x) A()
d= r|V| + r2|E| Regular but overcomplete, because
r1j=0 sj(x) = 1 for any
s V and all x. The Potts model is a special case where the parameters are
tied: stkk
=, and stjk
= forj=k.
http://find/ -
8/12/2019 All of Graphical Models
89/135
Mean Parameters
Let p be any density (not necessarily in exponential family).
-
8/12/2019 All of Graphical Models
90/135
Let p be any density (not necessarily in exponential family).
Given sufficient statistics , the mean parameters= (1, . . . , d) is
i= Ep[i(x)] =
i(x)p(x)dx
The set of mean parameters
M = { Rd | p s.t. Ep[(x)] =}
If(1), (2) M, there must exist p(1), p(2)
The convex combinations ofp(1), p(2) leads to another meanparameter in M
Therefore M is convex
Example: The First Two Moments
http://find/ -
8/12/2019 All of Graphical Models
91/135
Let 1(x) =x, 2(x) =x2
For any p (not necessarily Gaussian) on x, the meanparameters= (1, 2) = (E(x),E(x
2)).
Note V(x) = E(x2) E2(x) =2 21 0 for any p
M is not R2
but rather the subset 1 R, 2 21.
The Marginal Polytope
http://find/ -
8/12/2019 All of Graphical Models
92/135
The marginal polytope is defined for discrete xs Recall M = { Rd | =
x
(x)p(x) for some p}
p can be a point mass function on a particular x.
In fact any p is a convex combination of such point massfunctions.
M =conv{(x), x} is a convex hull, called the marginalpolytope.
http://find/ -
8/12/2019 All of Graphical Models
93/135
-
8/12/2019 All of Graphical Models
94/135
Conjugate Duality
The conjugate dual function A to A is defined as
-
8/12/2019 All of Graphical Models
95/135
A() = sup
A()
Such definition, where a quantity is expressed as the solution to anoptimization problem, is called a variationaldefinition.
For any Ms interior, let () satisfyE()[(x)] = A(()) =.
Then A() = H(p()) the negative entropy.
The dual of the dual gives back A:
A() = supM
A
() For all , the supremum is attained uniquely at the
M0 by the moment matching conditions = E[(x)].
Example: Conjugate Dual for Bernoulli
http://find/ -
8/12/2019 All of Graphical Models
96/135
Recall the minimal exponential family for Bernoulli with(x) =x, A() = log(1 + exp()), = R.
By definition
A() = sup
R
log(1 + exp())
Taking derivative and solve
A() = log + (1 ) log(1 )
i.e., the negative entropy.
http://find/ -
8/12/2019 All of Graphical Models
97/135
The Difficulties with Variational Representation
-
8/12/2019 All of Graphical Models
98/135
A() = supM
A()
Difficult to solve even though it is a convex problem
Two issues:
Although the marginal polytope M is convex, it can be quitecomplex (exponential number of vertices) The dual function A() usually does not admit an explicit
form.
Variational approximationmodifies the optimization problem
so that it is tractable, at the price of an approximate solution. Next, we cast mean field and sum-product algorithms as
variational approximations.
http://find/ -
8/12/2019 All of Graphical Models
99/135
The Geometry ofM(F)
LetM(F) be the mean parameters of the fully factorizedsub-family. In general, M(F) M
-
8/12/2019 All of Graphical Models
100/135
Recall M is the convex hull of extreme points {(x)}. It turns out the extreme points {(x)} M(F). Example:
The tiny Ising model x1, x2 {0, 1} with = (x1, x2, x1x2)
The point mass distribution p(x= (0, 1)) = 1 is realized as a
limit to the series p(x) = exp(1x1+ 2x2 A()) where1 and 2 .
This series is in F because 12= 0. Hence the extreme point (x) = (0, 1, 0) is in M(F).
The Geometry ofM(F)
http://find/ -
8/12/2019 All of Graphical Models
101/135
Because the extreme points ofM are in M(F), ifM(F)were convex, we would have M = M(F).
But in general M(F) is a true subset ofM
Therefore, M(F) is a nonconvex inner set ofM
M(F)M
( )x
http://find/ -
8/12/2019 All of Graphical Models
102/135
Example: Mean Field for Ising Model
The mean parameters for the Ising model are the node andedge marginals: s = p(xx = 1), st = p(xs = 1, xt = 1)
-
8/12/2019 All of Graphical Models
103/135
edge marginals: s p(xx 1), st p(xs 1, xt 1)
Fully factorized M(F) means no edge. st=st For M(F), the dual function A() has the simple form
A() = sV
H(s) = sV
slog s+ (1 s) log(1 s)
Thus the mean field problem is
L() = supM(F)
sV(slog s+ (1 s) log(1 s))
= max(1...m)[0,1]m
sV
ss+
(s,t)E
stst+sV
H(s)
Example: Mean Field for Ising Model
L() a + + H( )
http://goforward/http://find/http://goback/ -
8/12/2019 All of Graphical Models
104/135
L() = max(1...m)[0,1]m
sV
ss+ (s,t)E
stst+ sV
H(s) Bilinear in , not jointly concave
But concave in a single dimension s, fixing others.
Iterative coordinate-wise maximization: fixing t fort =s andoptimizing s.
Setting the partial derivative w.r.t. s to 0 yields:
s= 1
1 + exp
(s+
(s,t)Estt)
as weve seen before.
Caution: mean field converges to a local maximum dependingon the initialization of1 . . . m.
The Sum-Product Algorithm as Variational Approximation
http://find/ -
8/12/2019 All of Graphical Models
105/135
A() = supM
A()
The sum-product algorithm makes two approximations:
it relaxes M to an outerset L it replaces the dual A with an approximation.
A() = supL
A()
The Outer Relaxation
For overcomplete exponential families on discrete nodes, themean parameters are node and edge marginals j = p(x = j) k = p(x = j x = k)
http://goforward/http://find/http://goback/ -
8/12/2019 All of Graphical Models
106/135
sj =p(xs=j), stjk =p(xs=j, xt =k).
The marginal polytope is M = { | p with marginals }. Now consider Rd+ satisfying node normalization and
edge-node marginal consistency conditions:
r1j=0
sj = 1 s V
r1
k=0stjk =sj s, t V, j = 0 . . . r 1
r1j=0
stjk=tk s, t V, k= 0 . . . r 1
Define L= { satisfying the above conditions}.
http://find/ -
8/12/2019 All of Graphical Models
107/135
-
8/12/2019 All of Graphical Models
108/135
Approximating A
-
8/12/2019 All of Graphical Models
109/135
Define the Bethe entropyfor L on loopy graphs in the sameway:
HBethe
(p) = sV
r1
j=0
sjlog sj (s,t)E
j,k
stjk
log stjk
sjtk
Note HBethe is not a true entropy. The second approximation insum-product is to replace A() with HBethe(p).
http://find/ -
8/12/2019 All of Graphical Models
110/135
Summary: Variational Inference
-
8/12/2019 All of Graphical Models
111/135
The sum-product algorithm (loopy belief propagation)
The mean field method
Not covered: Expectation Propagation
Efficient computation. But often unknown bias in solution.
http://find/ -
8/12/2019 All of Graphical Models
112/135
Maximizing ProblemsRecall the HMM example
1/4 1/2
-
8/12/2019 All of Graphical Models
113/135
= = 1/21 2
P(x | z=1)=(1/2, 1/4, 1/4) P(x | z=2)=(1/4, 1/2, 1/4)
1 2
R G B R G B
There are two senses of best states z1:N given x1:N:1. So far we computed the marginal p(zn|x1:N)
We can define best as zn= arg maxkp(zn =k|x1:N)
Howeverz1:Nas a whole may not be the best In fact z1:Ncan even have zero probability!
2. An alternative is to find
z
1:N= arg maxz1:N p(z1:N|x1:N)
finds the most likely state configuration as a whole The max-sum algorithm solves this Generalizes the Viterbi algorithm for HMMs
Intermediate: The Max-Product Algorithm
Simple modification to the sum-product algorithm: replace with
http://find/ -
8/12/2019 All of Graphical Models
114/135
p p g p max in the factor-to-variable messages.
fsx(x) = maxx1. . . max
xMfs(x, x1, . . . , xM)
M
m=1xmfs(xm)
xmfs(xm) =
fne(xm)\fs
fxm(xm)
xleaff(x) = 1
f
leaf
x(x) = f(x)
Intermediate: The Max-Product Algorithm
As in sum-product, pick an arbitrary variable node x as the
http://find/ -
8/12/2019 All of Graphical Models
115/135
root Pass messages up from leaves until they reach the root
Unlike sum-product, do not pass messages back from root toleaves
At the root, multiply incoming messages
pmax = maxx
fne(x)
fx(x)
This is the probability of the most likely state configuration
Intermediate: The Max-Product Algorithm
http://find/ -
8/12/2019 All of Graphical Models
116/135
To identify the configuration itself, keep back pointers: When creating the message
fsx(x) = maxx1
. . . maxxM
fs(x, x1, . . . , xM)M
m=1
xmfs(xm)
for each x value, we separately create Mpointers back to thevalues ofx1, . . . , xM that achieve the maximum.
At the root, backtrack the pointers.
Intermediate: The Max-Product Algorithm
z1f1 z2f2
http://goforward/http://find/http://goback/ -
8/12/2019 All of Graphical Models
117/135
P(z )P(x | z ) P(z | z )P(x | z )1 1 1 2 1 2 2
= = 1/21 2
P(x | z=1)=(1/2, 1/4, 1/4) P(x | z=2)=(1/4, 1/2, 1/4)
1 2
1/4 1/2
R G B R G B
Message from leaff1f1z1(z1= 1) =P(z1= 1)P(R|z1= 1) = 1/2 1/2 = 1/4f1z1(z1= 2) =P(z1= 2)P(R|z1= 2) = 1/2 1/4 = 1/8
The second messagez1f2(z1= 1) = 1/4z1f2(z1= 2) = 1/8
Intermediate: The Max-Product Algorithm
z1f1 z2f2
http://find/ -
8/12/2019 All of Graphical Models
118/135
P(z )P(x | z ) P(z | z )P(x | z )1 1 1 2 1 2 2
= = 1/21 2
P(x | z=1)=(1/2, 1/4, 1/4) P(x | z=2)=(1/4, 1/2, 1/4)
1 2
1/4 1/2
R G B R G B
f2z2(z2= 1)
= maxz1
f2(z1, z2)z1f2(z1)
= maxz1 P(z2= 1 | z1)P(x2=G | z2= 1)z1f2(z1)= max(1/4 1/4 1/4, 1/2 1/4 1/8) = 1/64
Back pointer for z2= 1: either z1= 1 orz1= 2
Intermediate: The Max-Product Algorithm
z1f1 z2f2
http://goforward/http://find/http://goback/ -
8/12/2019 All of Graphical Models
119/135
P(z )P(x | z ) P(z | z )P(x | z )1 1 1 2 1 2 2
= = 1/21 2
P(x | z=1)=(1/2, 1/4, 1/4) P(x | z=2)=(1/4, 1/2, 1/4)
1 2
/ /
R G B R G B
The other element of the same message:
f2z2(z2= 2)
= maxz1
f2(z1, z2)z1f2(z1)
= maxz1 P(z2= 2 | z1)P(x2=G | z2= 2)z1f2(z1)
= max(3/4 1/2 1/4, 1/2 1/2 1/8) = 3/32
Back pointer for z2= 2: z1= 1
Intermediate: The Max-Product Algorithm
z1f1 z2f2
P(z )P(x | z ) P(z | z )P(x | z )1 1 1 2 1 2 2
http://goforward/http://find/http://goback/ -
8/12/2019 All of Graphical Models
120/135
= = 1/21 2
P(x | z=1)=(1/2, 1/4, 1/4) P(x | z=2)=(1/4, 1/2, 1/4)
1 2
1/4 1/2
R G B R G B
f2z2 =1/64 z1=1,23/32 z1=1
At root z2,
maxs=1,2
f2z2(s) = 3/32
z2= 2 z1= 1
z1:2= arg maxz1:2p(z1:2|x1:2) = (1, 2)
In this example, sum-product and max-product produce the samebest sequence; In general they differ.
From Max-Product to Max-SumThe max-sum algorithm is equivalent to the max-productalgorithm, but work in log space to avoid underflow.
M
http://goforward/http://find/http://goback/ -
8/12/2019 All of Graphical Models
121/135
fsx(x) = maxx1...xMlog fs(x, x1, . . . , xM) +
m=1
xmfs(xm)
xmfs(xm) =
fne(xm)\fs
fxm(xm)
xleaff(x) = 0
fleafx(x) = log f(x)
When at the root,
logpmax = maxx
fne(x)
fx(x)
The back pointers are the same.
Outline
Life without Graphical Models
Representation
http://goforward/http://find/http://goback/ -
8/12/2019 All of Graphical Models
122/135
Directed Graphical Models (Bayesian Networks)Undirected Graphical Models (Markov Random Fields)
InferenceExact Inference
Markov Chain Monte CarloVariational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Parameter Learning
http://find/ -
8/12/2019 All of Graphical Models
123/135
Assume the graph structure is given
Learning in exponential family: estimate from iid datax1 . . .xn.
Principle: maximum likelihood Distinguish two cases:
fully observed data: all dimensions of xare observed partially observed data: some dimensions of xare unobserved.
http://find/http://goback/ -
8/12/2019 All of Graphical Models
124/135
Partially Observed Data
Each item (x, z) where xobserved, z unobserved
-
8/12/2019 All of Graphical Models
125/135
Full data (x1, z1) . . . (xn, zn), but we only observe x1 . . .xn The incomplete likelihood () = 1n
ni=1logp(xi) where
p(xi) =
p(xi, z)dz
Can be written as () = 1nni=1 Axi() A() New log partition function ofp(z | xi), one per item:
Axi() = log
exp((xi, z
))dz
Expectation-Maximization (EM) algorithm: lower bound Axi
EM as Variational Lower Bound
Mean parameter realizable by any distribution on z whileholding xi fixed:
Mx
i {Rd
|E
[(x
iz
)] f }
http://find/ -
8/12/2019 All of Graphical Models
126/135
M i = { | = p[( i, )] for some p} The variational definition Axi() = supMxi
Axi
()
Trivial variational lower bound:Axi()
i Axi
(i), i Mxi
Lower bound L on the incomplete log likelihood:
() = 1
n
ni=1
Axi() A()
1n
ni=1
i A
xi(i)
A()
L(1, . . . , n, )
http://find/ -
8/12/2019 All of Graphical Models
127/135
Exact EM: The M-Step
I h M i i h ldi h fi d
-
8/12/2019 All of Graphical Models
128/135
In the M-step, maximize holding the s fixed:
arg max
L(1, . . . , n, ) = arg max
A()
= 1nn
i=1 i
The solution () satisfies E()[(x)] =
Standard fully observed maximum likelihood problem, hencethe name M-step
Variational EM
For loopy graphs E-step often intractable.
Cant maximize
i A ( i)
http://find/ -
8/12/2019 All of Graphical Models
129/135
maxiMxi
i Axi
(i)
Improve but not necessarily maximize: generalized EM
The mean field method maximizes
maxiMxi(F)
i Axi
(i)
up to local maximum recallMxi(F) is an inner approximation to Mxi
Mean field E-step leads to generalized EM
The sum-product algorithm does not lead to generalized EM
http://find/ -
8/12/2019 All of Graphical Models
130/135
Score-Based Structure Learning
Let M be all allowed candidate features
-
8/12/2019 All of Graphical Models
131/135
LetM be all allowed candidate features Let M M be a log-linear model structure
P(X | M, ) = 1
Zexp
iM
ifi(X) A score for the model Mcan be maxln P(Data | M, )
The score is always better for larger M needs regularization
M and treated separately
Structure Learning for Gaussian Random Fields
Consider a p-dimensional multivariate GaussianN(, )
The graphical model has p nodes x1 xp
http://find/ -
8/12/2019 All of Graphical Models
132/135
The graphical model has p nodes x1, . . . , xp The edge between xi, xj is absent if and only ifij = 0,
where = 1
Equivalently, xi, xj are conditionally independent given other
variables
xx
xx
1
2
3
4
http://find/ -
8/12/2019 All of Graphical Models
133/135
Structure Learning for Gaussian Random Fields
-
8/12/2019 All of Graphical Models
134/135
For centered data, minimize a regularized problem instead:
log || +1
n
n
i=1
X(i)
X(i) + i=j
|ij |
Known as glasso
http://find/ -
8/12/2019 All of Graphical Models
135/135
top related