
Page 1:

CS242: Probabilistic Graphical Models
Lecture 7: Exponential Families, Directed Graphs

Instructor: Erik Sudderth, Brown University Computer Science

September 25, 2014

Some figures and materials courtesy David Barber, Bayesian Reasoning and Machine Learning, http://www.cs.ucl.ac.uk/staff/d.barber/brml/

Page 2:

Graphical Models as Exponential Family Distributions

[Figure: three graphical representations of a distribution over variables x1–x5: a directed graph, a factor graph, and an undirected graph]

Page 3:

Exponential Families of Distributions

p(x \mid \theta) = \frac{1}{Z(\theta)} \nu(x) \exp\{\theta^T \phi(x)\} = \nu(x) \exp\{\theta^T \phi(x) - \Phi(\theta)\}, \qquad \Phi(\theta) = \log Z(\theta)

\phi(x) \in \mathbb{R}^d: fixed vector of sufficient statistics (features), specifying the family of distributions

\theta \in \Theta \subseteq \mathbb{R}^d: unknown vector of natural parameters, determining the particular distribution in this family

\Theta = \{\theta \in \mathbb{R}^d \mid Z(\theta) < \infty\} ensures we construct a valid distribution

Z(\theta) = \int_{\mathcal{X}} \nu(x) \exp\{\theta^T \phi(x)\}\, dx > 0: normalization constant or partition function (for discrete variables, the integral becomes a sum)

\nu(x) > 0: reference measure independent of the parameters (for many models, we simply have \nu(x) = 1)
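As a concrete numerical illustration (not from the slides), the definitions above can be sketched for a small discrete family; the three-state space and the two-dimensional statistic φ(x) below are hypothetical choices:

```python
import numpy as np

# A discrete exponential family p(x | theta) = nu(x) exp{theta^T phi(x)} / Z(theta)
# over the finite space X = {0, 1, 2}, with an illustrative statistic phi(x).

X = np.array([0, 1, 2])                      # finite sample space
phi = np.array([[0.0, 0.0],                  # phi(x) for each x in X
                [1.0, 0.0],
                [1.0, 1.0]])
nu = np.ones(len(X))                         # reference measure nu(x) = 1

def log_partition(theta):
    # Phi(theta) = log Z(theta) = log sum_x nu(x) exp{theta^T phi(x)}
    scores = np.log(nu) + phi @ theta
    m = scores.max()                         # log-sum-exp for stability
    return m + np.log(np.exp(scores - m).sum())

def prob(theta):
    # p(x | theta) = nu(x) exp{theta^T phi(x) - Phi(theta)}
    return np.exp(np.log(nu) + phi @ theta - log_partition(theta))

theta = np.array([0.5, -1.0])
p = prob(theta)
print(p, p.sum())                            # probabilities sum to 1
```

Any θ gives a finite Z here because the space is finite, so Θ = R² for this family.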

Page 4:

Factor Graphs & Exponential Families

p(x) = \frac{1}{Z(\theta)} \prod_{f \in F} \psi_f(x_f \mid \theta_f)

V = \{1, 2, \ldots, N\}: set of N nodes or vertices
F: set of hyperedges f \subseteq V linking subsets of nodes
Z(\theta): normalization constant (partition function)

•  A factor graph is created from non-negative potential functions, often defined as local exponential families:

50 CHAPTER 2. NONPARAMETRIC AND GRAPHICAL MODELS

[Figure 2.4. Three graphical representations of a distribution over five random variables (see [175]). (a) Directed graph G depicting a causal, generative process. (b) Factor graph expressing the factorization underlying G. (c) A "moralized" undirected graph capturing the Markov structure of G.]

For example, in the factor graph of Fig. 2.5(c), there are 5 variable nodes, and the joint distribution has one potential for each of the 3 hyperedges:

p(x) \propto \psi_{123}(x_1, x_2, x_3)\, \psi_{234}(x_2, x_3, x_4)\, \psi_{35}(x_3, x_5)

Often, these potentials can be interpreted as local dependencies or constraints. Note, however, that \psi_f(x_f) does not typically correspond to the marginal distribution p_f(x_f), due to interactions with the graph's other potentials.

In many applications, factor graphs are used to impose structure on an exponential family of densities. In particular, suppose that each potential function is described by the following unnormalized exponential form:

\psi_f(x_f \mid \theta_f) = \nu_f(x_f) \exp\Big\{ \sum_{a \in A_f} \theta_{fa} \phi_{fa}(x_f) \Big\} \qquad (2.67)

Here, \theta_f \triangleq \{\theta_{fa} \mid a \in A_f\} are the canonical parameters of the local exponential family for hyperedge f. From eq. (2.66), the joint distribution can then be written as

p(x \mid \theta) = \Big( \prod_{f \in F} \nu_f(x_f) \Big) \exp\Big\{ \sum_{f \in F} \sum_{a \in A_f} \theta_{fa} \phi_{fa}(x_f) - \Phi(\theta) \Big\} \qquad (2.68)

Comparing to eq. (2.1), we see that factor graphs define regular exponential families [104, 311], with parameters \theta = \{\theta_f \mid f \in F\}, whenever local potentials are chosen from such families. The results of Sec. 2.1 then show that local statistics, computed over the support of each hyperedge, are sufficient for learning from training data.


Local exponential family:


\Phi(\theta) = \log Z(\theta)
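The three-hyperedge example above can be evaluated by brute force for binary variables; the potential tables below are arbitrary illustrative values, not from the slides:

```python
import numpy as np
from itertools import product

# Brute-force evaluation of the factor-graph joint
# p(x) ∝ psi_123(x1,x2,x3) psi_234(x2,x3,x4) psi_35(x3,x5), all x_i binary.

rng = np.random.default_rng(0)
psi_123 = rng.uniform(0.5, 2.0, size=(2, 2, 2))   # psi_123(x1, x2, x3)
psi_234 = rng.uniform(0.5, 2.0, size=(2, 2, 2))   # psi_234(x2, x3, x4)
psi_35  = rng.uniform(0.5, 2.0, size=(2, 2))      # psi_35(x3, x5)

def unnorm(x):
    x1, x2, x3, x4, x5 = x
    return psi_123[x1, x2, x3] * psi_234[x2, x3, x4] * psi_35[x3, x5]

# Partition function Z by summing over all 2^5 joint configurations
Z = sum(unnorm(x) for x in product([0, 1], repeat=5))

def p(x):
    return unnorm(x) / Z

total = sum(p(x) for x in product([0, 1], repeat=5))
print(Z, total)   # total equals 1 after normalization
```

This exhaustive sum is exponential in the number of variables; the point of the graph structure is that smarter inference algorithms can exploit the factorization.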

Page 5:

Undirected Graphs & Exponential Families


•  Pick features to define an exponential family of distributions
•  Use factor graph to represent structure of chosen statistics
•  Create undirected graph with a clique for every factor node
•  Result: Visualization of Markov properties of your family

Page 6:

Directed Graphs & Exponential Families

p(x) = \prod_{i=1}^{N} p(x_i \mid x_{\Gamma(i)}, \theta_i)

•  For each node, pick an appropriate exponential family
•  Pick features of parent nodes relevant to child variable. Most generally, indicators for all joint configurations of parents.
•  Child parameters are a (learned) linear function of parent features
•  Result: Node-specific conditional exponential family models

p(x_i \mid x_{\Gamma(i)}, \theta_i) = \nu(x_i) \exp\big\{\theta_i^T \phi(x_i, x_{\Gamma(i)}) - \Phi(\theta_i, x_{\Gamma(i)})\big\}

\Phi(\theta_i, x_{\Gamma(i)}) = \log \int_{\mathcal{X}_i} \nu(x_i) \exp\big\{\theta_i^T \phi(x_i, x_{\Gamma(i)})\big\}\, dx_i
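A minimal sketch of such a node-specific conditional model, assuming a binary child whose natural parameter is a linear function of hypothetical parent features (a logistic CPD):

```python
import numpy as np

# A Bernoulli child node whose natural parameter is theta^T phi(parents),
# so p(x_i = 1 | parents) = sigmoid(theta^T phi(parents)).

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def child_cpd(theta, parent_feats):
    """p(x_i | x_parents, theta) as a length-2 vector [p(x_i=0), p(x_i=1)]."""
    t = theta @ parent_feats                 # natural parameter of the child
    p1 = sigmoid(t)
    return np.array([1.0 - p1, p1])

theta = np.array([1.0, -2.0, 0.5])           # illustrative learned weights
phi_parents = np.array([1.0, 0.0, 1.0])      # e.g. bias + parent indicators
p = child_cpd(theta, phi_parents)
print(p)                                      # a valid conditional distribution
```

With indicator features for every joint parent configuration, this construction recovers an arbitrary conditional probability table.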

Page 7:

Learning Directed Graphical Models

[Figure: (a) a directed graphical model over variables X1, X2, X3; (b) plate notation replicating the same model across N independent training examples x_{i,n}, n = 1, …, N]

Page 8:

Terminology: Inference versus Learning

•  Inference: Given a model with known parameters, estimate the "hidden" variables for some data instance, or find their marginal distribution
•  Learning: Given multiple data instances, estimate parameters for a graphical model of their structure: Maximum Likelihood, Maximum a Posteriori, … Training instances may be completely or partially observed

Example: Expert systems for medical diagnosis
•  Inference: Given observed symptoms for a particular patient, infer probabilities that they have contracted various diseases
•  Learning: Given a database of many patient diagnoses, learn the relationships between diseases and symptoms

Example: Markov random fields for semantic image segmentation
•  Inference: What object category is depicted at each pixel in some image?
•  Learning: Across a database of images, in which some example objects have been labeled, how do those labels relate to low-level image features?

Page 9:

Learning Directed Graphical Models

[Figure: (a) a directed graph over X1, X2, X3, X4; (b) its decomposition into per-node conditional models, each child node shown with its parents]

Intuition: Must learn a good predictive model of each node, given its parent nodes

p(x) = \prod_{i \in V} p(x_i \mid x_{\Gamma(i)}, \theta_i)

•  Directed factorization allows the likelihood to locally decompose:
p(x \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)
\log p(x \mid \theta) = \log p(x_1 \mid \theta_1) + \log p(x_2 \mid x_1, \theta_2) + \log p(x_3 \mid x_1, \theta_3) + \log p(x_4 \mid x_2, x_3, \theta_4)

•  We often assume a similarly factorized (meta-independent) prior:
p(\theta) = p(\theta_1)\, p(\theta_2)\, p(\theta_3)\, p(\theta_4)
\log p(\theta) = \log p(\theta_1) + \log p(\theta_2) + \log p(\theta_3) + \log p(\theta_4)

•  We thus have independent ML or Bayesian learning problems at each node
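The decomposition above can be checked numerically; the conditional probability tables below are hypothetical values for the four-node graph, not from the slides:

```python
import numpy as np

# The DAG X1 -> X2, X1 -> X3, (X2, X3) -> X4 with binary variables.
# The joint log-probability is a sum of one term per node given its parents.

p1 = np.array([0.6, 0.4])                        # p(x1)
p2 = np.array([[0.7, 0.3], [0.2, 0.8]])          # p(x2 | x1)
p3 = np.array([[0.5, 0.5], [0.1, 0.9]])          # p(x3 | x1)
p4 = np.array([[[0.9, 0.1], [0.4, 0.6]],         # p(x4 | x2, x3)
               [[0.3, 0.7], [0.05, 0.95]]])

def log_joint(x1, x2, x3, x4):
    return (np.log(p1[x1]) + np.log(p2[x1, x2])
            + np.log(p3[x1, x3]) + np.log(p4[x2, x3, x4]))

# Because each CPT is locally normalized, the product of the per-node
# terms is automatically a normalized joint distribution:
total = sum(np.exp(log_joint(a, b, c, d))
            for a in (0, 1) for b in (0, 1)
            for c in (0, 1) for d in (0, 1))
print(total)   # sums to 1
```

Maximizing the total log-likelihood therefore splits into four separate problems, one per CPT.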

Page 10:

Learning with Complete Observations

[Figure: a directed graphical model over X1, X2, X3, shown both replicated across N independent, identically distributed training examples and compactly in plate notation]

•  Directed graph encodes statistical structure of single training examples:

D = \{x_{V,1}, \ldots, x_{V,N}\}, \qquad p(D \mid \theta) = \prod_{n=1}^{N} \prod_{i \in V} p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_i)

•  Given N independent, identically distributed, completely observed samples:

\log p(D \mid \theta) = \sum_{n=1}^{N} \sum_{i \in V} \log p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_i) = \sum_{i \in V} \Big[ \sum_{n=1}^{N} \log p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_i) \Big]
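The equality of the two summation orders is what licenses per-node learning; a small sketch with a hypothetical two-node chain and simulated complete data:

```python
import numpy as np

# Binary chain X1 -> X2 with fixed CPTs; N complete i.i.d. observations.
# Summing the log-likelihood over examples-then-nodes or nodes-then-examples
# gives the same total, so each node can be fit from its own local data.

rng = np.random.default_rng(1)
p1 = np.array([0.3, 0.7])                    # p(x1)
p2 = np.array([[0.8, 0.2], [0.4, 0.6]])      # p(x2 | x1)

N = 50
x1 = rng.choice(2, size=N, p=p1)
x2 = np.array([rng.choice(2, p=p2[a]) for a in x1])

per_example = sum(np.log(p1[x1[n]]) + np.log(p2[x1[n], x2[n]])
                  for n in range(N))
per_node = (np.log(p1[x1]).sum()             # node 1 term, all examples
            + np.log(p2[x1, x2]).sum())      # node 2 term, all examples
print(per_example, per_node)                 # identical up to rounding
```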

Page 11:

Prior Distributions and Tied Parameters

\log p(D \mid \theta) = \sum_{n=1}^{N} \sum_{i \in V} \log p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_i) = \sum_{i \in V} \Big[ \sum_{n=1}^{N} \log p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_i) \Big]

A "meta-independent" factorized prior: \log p(\theta) = \sum_{i \in V} \log p(\theta_i)

•  Factorized posterior allows independent learning for each node:
\log p(\theta \mid D) = C + \sum_{i \in V} \Big[ \log p(\theta_i) + \sum_{n=1}^{N} \log p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_i) \Big]

•  Learning remains easy when groups of nodes are tied to use identical, shared parameters, with b_i the type or class for variable node i:
\log p(D \mid \theta) = \sum_{i \in V} \Big[ \sum_{n=1}^{N} \log p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_{b_i}) \Big]
\log p(\theta_b \mid D) = C + \log p(\theta_b) + \sum_{i \mid b_i = b} \sum_{n=1}^{N} \log p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_b)

Page 12:

Example: Tied Temporal Parameters

[Figure: three hidden Markov chains of different lengths (Sequences 1–3), each with hidden states z_0, z_1, …, z_{T_n} and observations x_1, …, x_{T_n}; all sequences share the same transition and observation parameters]

p(x, z \mid \theta) = \prod_{n=1}^{N} \prod_{t=1}^{T_n} p(z_{t,n} \mid z_{t-1,n}, \theta^{\mathrm{time}})\, p(x_{t,n} \mid z_{t,n}, \theta^{\mathrm{obs}})
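The tied-parameter likelihood above can be sketched for fully observed sequences; the transition matrix, emission matrix, sequences, and fixed initial state below are all illustrative assumptions:

```python
import numpy as np

# Complete-data log-likelihood of several state/observation sequences of
# different lengths that share one transition matrix (theta_time) and one
# emission matrix (theta_obs).

A = np.array([[0.9, 0.1],       # p(z_t | z_{t-1}), shared across sequences
              [0.3, 0.7]])
B = np.array([[0.8, 0.2],       # p(x_t | z_t), shared across sequences
              [0.1, 0.9]])
z0 = 0                           # fixed initial state, for simplicity

sequences = [                    # (z_{1:T}, x_{1:T}) pairs, varying lengths
    ([0, 0, 1], [0, 0, 1]),
    ([0, 1, 1, 1], [0, 1, 1, 0]),
    ([1, 1, 0, 0, 0], [1, 1, 0, 0, 0]),
]

def log_lik(sequences):
    total = 0.0
    for z, x in sequences:
        prev = z0
        for zt, xt in zip(z, x):
            total += np.log(A[prev, zt]) + np.log(B[zt, xt])
            prev = zt
    return total

print(log_lik(sequences))        # one scalar pooling evidence from all chains
```

Because every time step contributes to the same shared θ, even short sequences pool their evidence when estimating A and B.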

Page 13:

Exponential Families of Probability Distributions

Page 14:

Why the Exponential Family?

•  Allows simultaneous study of many popular probability distributions: Bernoulli (binary), Categorical, Poisson (counts), Exponential (positive), Gaussian (real), …
•  Maximum likelihood (ML) learning is simple: moment matching of sufficient statistics
•  Bayesian learning is simple: conjugate priors are available (Beta, Dirichlet, Gamma, Gaussian, Wishart, …)
•  The maximum entropy interpretation: Among all distributions with certain moments of interest, the exponential family is the most random (makes fewest assumptions)
•  Parametric and predictive sufficiency: For arbitrarily large datasets, optimal learning is possible from a finite-dimensional set of statistics (streaming, big data)

p(x \mid \theta) = \nu(x) \exp\{\theta^T \phi(x) - \Phi(\theta)\}, \qquad \Phi(\theta) = \log \int_{\mathcal{X}} \nu(x) \exp\{\theta^T \phi(x)\}\, dx

\phi(x) \in \mathbb{R}^d: vector of sufficient statistics (features) defining the family
\theta \in \Theta \subseteq \mathbb{R}^d: vector of natural parameters indexing particular distributions
\Theta = \{\theta \in \mathbb{R}^d \mid \Phi(\theta) < \infty\}

Page 15:

Example: Bernoulli Distribution

Bernoulli Distribution: Single toss of a (possibly biased) coin

\mathrm{Ber}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}, \qquad x \in \{0, 1\}, \quad 0 \leq \mu \leq 1, \quad \mathbb{E}[x \mid \mu] = \mathbb{P}[x = 1] = \mu

Exponential Family Form (derivation on board):

\mathrm{Ber}(x \mid \theta) = \exp\{\theta x - \Phi(\theta)\}, \qquad \theta \in \Theta = \mathbb{R}

Logistic Function: \mu = (1 + \exp(-\theta))^{-1}, \qquad \theta = \log(\mu) - \log(1 - \mu)

[Plot: the logistic function \mu(\theta) for \theta \in [-10, 10]]

Log Normalization: \Phi(\theta) = \log(1 + e^{\theta})

[Plot: the log-normalizer \Phi(\theta) for \theta \in [-10, 10]; the log-normalizer is convex]
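These identities are easy to verify numerically; a small sketch of the Bernoulli family in natural form:

```python
import numpy as np

# Bernoulli in natural (exponential family) form:
# Ber(x | theta) = exp{theta*x - Phi(theta)}, Phi(theta) = log(1 + e^theta).

def Phi(theta):
    return np.log1p(np.exp(theta))         # log-normalizer, convex in theta

def ber(x, theta):
    return np.exp(theta * x - Phi(theta))

def mean(theta):
    return 1.0 / (1.0 + np.exp(-theta))    # logistic function mu(theta)

theta = 1.3
# The two outcomes sum to one, and p(x=1) equals the logistic map:
print(ber(0, theta) + ber(1, theta))       # sums to 1
print(ber(1, theta), mean(theta))          # equal: p(x=1) = mu
# Phi'(theta) equals E[x] = mu (the first cumulant), via finite differences:
eps = 1e-6
print((Phi(theta + eps) - Phi(theta - eps)) / (2 * eps))
```

The last line previews the next slides: derivatives of Φ are moments of the sufficient statistics.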

Page 16:

Log Partition Function (Normalizer)

p(x \mid \theta) = \nu(x) \exp\{\theta^T \phi(x) - \Phi(\theta)\}, \qquad \Phi(\theta) = \log \int_{\mathcal{X}} \nu(x) \exp\{\theta^T \phi(x)\}\, dx

Sec. 2.1. Exponential Families 31

Proposition 2.1.1. The log partition function \Phi(\theta) of eq. (2.2) is convex (strictly so for minimal representations) and continuously differentiable over its domain \Theta. Its derivatives are the cumulants of the sufficient statistics \{\phi_a \mid a \in A\}, so that

\frac{\partial \Phi(\theta)}{\partial \theta_a} = \mathbb{E}_\theta[\phi_a(x)] \triangleq \int_{\mathcal{X}} \phi_a(x)\, p(x \mid \theta)\, dx \qquad (2.4)

\frac{\partial^2 \Phi(\theta)}{\partial \theta_a \partial \theta_b} = \mathbb{E}_\theta[\phi_a(x)\phi_b(x)] - \mathbb{E}_\theta[\phi_a(x)]\, \mathbb{E}_\theta[\phi_b(x)] \qquad (2.5)

Proof. For a detailed proof of this classic result, see [15, 36, 311]. The cumulant generating properties follow from the chain rule and algebraic manipulation. From eq. (2.5), \nabla^2 \Phi(\theta) is a positive semi-definite covariance matrix, implying convexity of \Phi(\theta). For minimal families, \nabla^2 \Phi(\theta) must be positive definite, guaranteeing strict convexity.

Due to this result, the log partition function is also known as the cumulant generating function of the exponential family. The convexity of \Phi(\theta) has important implications for the geometry of exponential families [6, 15, 36, 74].

Entropy, Information, and Divergence

Concepts from information theory play a central role in the study of learning and inference in exponential families. Given a probability distribution p(x) defined on a discrete space \mathcal{X}, Shannon's measure of entropy (in natural units, or nats) equals

H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \qquad (2.6)

In such diverse fields as communications, signal processing, and statistical physics, entropy arises as a natural measure of the inherent uncertainty in a random variable [49]. The differential entropy extends this definition to continuous spaces:

H(p) = -\int_{\mathcal{X}} p(x) \log p(x)\, dx \qquad (2.7)

In both discrete and continuous domains, the (differential) entropy H(p) is concave, continuous, and maximal for uniform densities. However, while the discrete entropy is guaranteed to be non-negative, differential entropy is sometimes less than zero.

For problems of model selection and approximation, we need a measure of the distance between probability distributions. The relative entropy or Kullback-Leibler (KL) divergence between two probability distributions p(x) and q(x) equals

D(p \,\|\, q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, dx \qquad (2.8)

Important properties of the KL divergence follow from Jensen's inequality [49], which bounds the expectation of convex functions:

\mathbb{E}[f(x)] \geq f(\mathbb{E}[x]) \quad \text{for any convex } f : \mathcal{X} \to \mathbb{R} \qquad (2.9)

•  Proof: Chain rule of derivatives and algebraic manipulation (board)
•  Derivatives of the log partition function have an intuitive form:

\Theta = \{\theta \in \mathbb{R}^d \mid \Phi(\theta) < \infty\}

Page 17:

Reminder: Gradients & Hessians

Gradient Vector: for f : \mathbb{R}^d \to \mathbb{R}, \quad \nabla f : \mathbb{R}^d \to \mathbb{R}^d, \qquad (\nabla f(\theta))_k = \frac{\partial f(\theta)}{\partial \theta_k}

Hessian Matrix: \nabla^2 f : \mathbb{R}^d \to \mathbb{R}^{d \times d}, \qquad (\nabla^2 f(\theta))_{k\ell} = \frac{\partial^2 f(\theta)}{\partial \theta_k \partial \theta_\ell}, \qquad (\nabla^2 f(\theta))_{kk} = \frac{\partial^2 f(\theta)}{\partial \theta_k^2}

•  The Hessian is always symmetric
•  For linear functions, the Hessian is zero
•  For quadratic functions, the Hessian is constant over the input space
•  The function f is convex if and only if the Hessian is positive semidefinite

Page 18:

Reminder: Convex Functions

[Plot: a convex function of \theta, with the chord between (\theta, f(\theta)) and (\theta', f(\theta')) lying above the curve]

f(\lambda\theta + (1 - \lambda)\theta') \leq \lambda f(\theta) + (1 - \lambda) f(\theta')

•  From any initialization, gradient descent (eventually) finds the global minimum
•  Newton's method, which uses the Hessian, often converges faster
•  A sum of convex functions remains convex
•  The negative of a convex function is said to be concave
•  A function is convex if and only if its epigraph (the set of points lying above the curve) is a convex set
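A quick sketch of the first bullet, using the convex Bernoulli negative log-likelihood f(θ) = Φ(θ) − μθ from the earlier slide (the observed mean μ and the step size are illustrative choices):

```python
import numpy as np

# Gradient descent on the convex function f(theta) = log(1 + e^theta) - mu*theta.
# Its unique global minimum is at the logit theta* = log(mu / (1 - mu)).

mu = 0.8                                            # observed mean (illustrative)
f_grad = lambda t: 1.0 / (1.0 + np.exp(-t)) - mu    # f'(theta) = mu(theta) - mu

theta = -5.0                                        # arbitrary initialization
for _ in range(2000):
    theta -= 0.5 * f_grad(theta)                    # fixed step size

print(theta, np.log(mu / (1 - mu)))                 # both approximately 1.386
```

Because f is convex, any starting point reaches the same minimizer; only the number of iterations changes.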

Page 19:

Log Partition Function (Normalizer)

p(x \mid \theta) = \nu(x) \exp\{\theta^T \phi(x) - \Phi(\theta)\}, \qquad \Phi(\theta) = \log \int_{\mathcal{X}} \nu(x) \exp\{\theta^T \phi(x)\}\, dx, \qquad \Theta = \{\theta \in \mathbb{R}^d \mid \Phi(\theta) < \infty\}

•  Derivatives of the log partition function have an intuitive form:

\nabla_\theta \Phi(\theta) = \mathbb{E}_\theta[\phi(x)]

\nabla^2_\theta \Phi(\theta) = \mathrm{Cov}_\theta[\phi(x)] = \mathbb{E}_\theta[\phi(x)\phi(x)^T] - \mathbb{E}_\theta[\phi(x)]\, \mathbb{E}_\theta[\phi(x)]^T

•  Important consequences for learning with exponential families:
  - Finding gradients is equivalent to finding expected sufficient statistics, or moments, of some current model. This is an inference problem!
  - The Hessian is a positive semidefinite covariance matrix, so the log partition function \Phi(\theta) is convex
  - This in turn implies that the parameter space \Theta is convex
  - Learning is a convex problem: no local optima! (At least when we have complete observations…)
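The gradient identity can be confirmed numerically; the sketch below reuses a small discrete family with a hypothetical two-dimensional statistic φ(x) over three states:

```python
import numpy as np

# Check that the gradient of Phi(theta) equals E_theta[phi(x)] for
# p(x | theta) ∝ exp{theta^T phi(x)} over x in {0, 1, 2}.

phi = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])   # phi(x) per state

def Phi(theta):
    s = phi @ theta
    m = s.max()
    return m + np.log(np.exp(s - m).sum())              # stable log-sum-exp

def moments(theta):
    p = np.exp(phi @ theta - Phi(theta))                # p(x | theta)
    return p @ phi                                      # E_theta[phi(x)]

theta = np.array([0.7, -0.4])
eps = 1e-6
grad = np.array([(Phi(theta + eps * e) - Phi(theta - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
print(grad, moments(theta))                             # agree closely
```

Computing `moments` exactly required enumerating the model, which is precisely why the slide calls gradient evaluation "an inference problem."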

Page 20:

A Little Information Theory

•  The entropy is a natural measure of the inherent uncertainty (difficulty of compression) of some random variable:


discrete entropy (concave, non-negative): H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)

differential entropy (concave, real-valued): H(p) = -\int_{\mathcal{X}} p(x) \log p(x)\, dx

•  The relative entropy or Kullback-Leibler (KL) divergence is a non-negative, but asymmetric, "distance" between a given pair of probability distributions:

D(p \,\|\, q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, dx

The KL divergence equals zero if and only if p(x) = q(x) almost everywhere.


32 CHAPTER 2. NONPARAMETRIC AND GRAPHICAL MODELS

Applying Jensen’s inequality to the logarithm of eq. (2.8), which is concave, it is eas-ily shown that the KL divergence D(p || q) ≥ 0, with D(p || q) = 0 if and only ifp(x) = q(x) almost everywhere. However, it is not a true distance metric becauseD(p || q) "= D(q || p). Given a target density p(x) and an approximation q(x), D(p || q)can be motivated as the information gain achievable by using p(x) in place of q(x) [49].Interestingly, the alternate KL divergence D(q || p) also plays an important role in thedevelopment of variational methods for approximate inference (see Sec. 2.3).

An important special case arises when we consider the dependency between two random variables x and y. Let pxy(x, y) denote their joint distribution, px(x) and py(y) their corresponding marginals, and X and Y their sample spaces. The mutual information between x and y then equals

I(pxy) ≜ D(pxy || px py) = ∫_X ∫_Y pxy(x, y) log [pxy(x, y) / (px(x) py(y))] dy dx    (2.10)
       = H(px) + H(py) − H(pxy)    (2.11)

where eq. (2.11) follows from algebraic manipulation. The mutual information can be interpreted as the expected reduction in uncertainty about one random variable from observation of another [49]. In particular, I(pxy) ≥ 0, with equality if and only if x and y are independent.
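Both characterizations can be checked on small joint tables; the sketch below (hypothetical 2×2 joints) computes I(pxy) via eq. (2.11) for an independent pair, where it vanishes, and for perfectly correlated bits, where it equals log 2:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def mutual_information(joint):
    """I(x; y) for a joint table joint[x][y], computed via eq. (2.11)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    pxy = [pij for row in joint for pij in row]
    return entropy(px) + entropy(py) - entropy(pxy)

# Independent pair: the joint factorizes as px(x) py(y), so I = 0.
indep = [[0.6 * 0.3, 0.6 * 0.7],
         [0.4 * 0.3, 0.4 * 0.7]]
assert abs(mutual_information(indep)) < 1e-9

# Perfectly correlated bits: observing y removes all uncertainty, so I = log 2.
corr = [[0.5, 0.0],
        [0.0, 0.5]]
assert abs(mutual_information(corr) - math.log(2)) < 1e-12
```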

Projections onto Exponential Families

In many cases, learning problems can be posed as a search for the best approximation of an empirically derived target density p̃(x). As discussed in the previous section, the KL divergence D(p̃ || q) is a natural measure of the accuracy of an approximation q(x). For exponential families, the optimal approximating density is elegantly characterized by the following moment-matching conditions:

Proposition 2.1.2. Let p̃ denote a target probability density, and pθ an exponential family. The approximating density minimizing D(p̃ || pθ) then has canonical parameters θ̂ chosen to match the expected values of that family's sufficient statistics:

Eθ̂[φa(x)] = ∫_X φa(x) p̃(x) dx,    a ∈ A    (2.12)

For minimal families, these optimal parameters θ̂ are uniquely determined.

Proof. From the definition of KL divergence (eq. (2.8)), we have

D(p̃ || pθ) = ∫_X p̃(x) log [p̃(x) / p(x | θ)] dx
           = ∫_X p̃(x) log p̃(x) dx − ∫_X p̃(x) [log ν(x) + ∑_{a∈A} θa φa(x) − Φ(θ)] dx
           = −H(p̃) − ∫_X p̃(x) log ν(x) dx − ∑_{a∈A} θa ∫_X φa(x) p̃(x) dx + Φ(θ)

The first two terms do not depend on θ. Differentiating the remainder with respect to θa, and using the cumulant property ∂Φ(θ)/∂θa = Eθ[φa(x)], the divergence is stationary precisely when eq. (2.12) holds. Because Φ(θ) is convex, and strictly convex for minimal families, this stationary point is the unique global minimizer.
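The proposition can be illustrated numerically. The sketch below (a hypothetical exponential family on X = {0, 1, 2} with sufficient statistic φ(x) = x, and a hypothetical target p̃) solves the moment-matching condition of eq. (2.12) by bisection and confirms that perturbing θ̂ only increases the divergence:

```python
import math

# Hypothetical empirical target density over X = {0, 1, 2}.
p_tilde = [0.5, 0.2, 0.3]
target_mean = sum(x * p for x, p in enumerate(p_tilde))  # E_ptilde[phi(x)]

def family(theta):
    """Exponential family on {0, 1, 2} with phi(x) = x and nu(x) = 1."""
    w = [math.exp(theta * x) for x in range(3)]
    z = sum(w)  # partition function Z(theta)
    return [wi / z for wi in w]

def mean(theta):
    return sum(x * p for x, p in enumerate(family(theta)))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Solve the moment-matching condition E_theta[x] = E_ptilde[x] by bisection;
# mean(theta) is increasing in theta (its derivative is a variance), so a
# sign change brackets the unique root.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean(mid) < target_mean:
        lo = mid
    else:
        hi = mid
theta_hat = 0.5 * (lo + hi)
assert abs(mean(theta_hat) - target_mean) < 1e-9

# The matched parameter minimizes D(p_tilde || p_theta) over the family.
d_hat = kl(p_tilde, family(theta_hat))
for dt in (-0.5, -0.1, 0.1, 0.5):
    assert kl(p_tilde, family(theta_hat + dt)) > d_hat
```

Here p̃ lies outside the family, so the minimum divergence d_hat is positive; moment matching finds the closest family member rather than an exact representation.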
