

Course Notes for Advanced Probabilistic Machine Learning

John Paisley
Department of Electrical Engineering

Columbia University

Fall 2014

Abstract

These are lecture notes for the seminar ELEN E9801 Topics in Signal Processing: “Advanced Probabilistic Machine Learning” taught at Columbia University in Fall 2014. They are transcribed almost verbatim from the handwritten lecture notes, and so they preserve the original bulleted structure and are light on the exposition. Some lectures therefore also have a certain amount of reiterating in them, so some statements may be repeated a few times throughout the notes.

The purpose of these notes is to (1) have a cleaner version for the next time it’s taught, and (2) make them public so they may be helpful to others. Since the exposition comes during class, often via student questions, these lecture notes may come across as too fragmented depending on the reader’s preference. Still, I hope they will be useful to those not in the class who want a streamlined way to learn the material at a fairly rigorous level, but not yet at the hyper-rigorous level of many textbooks, which also mix the fundamental results with the fine details. I hope these notes can be a good primer towards that end. As with the handwritten lectures, this document does not contain any references.

The twelve lectures are split into two parts. The first eight deal with several stochastic processes fundamental to Bayesian nonparametrics: Poisson, gamma, Dirichlet and beta processes. The last four lectures deal with some advanced techniques for posterior inference in Bayesian models. Each lecture was between 2 and 2-1/2 hours long.


Contents

I  Topics in Bayesian Nonparametrics

1  Poisson distribution and process, superposition and marking theorems

2  Completely random measures, Campbell’s theorem, gamma process

3  Beta processes and the Poisson process

4  Beta processes and size-biased constructions

5  Dirichlet processes and a size-biased construction

6  Dirichlet process extensions, count processes

7  Exchangeability, Dirichlet processes and the Chinese restaurant process

8  Exchangeability, beta processes and the Indian buffet process

II  Topics in Bayesian Inference

9  EM and variational inference

10  Stochastic variational inference

11  Variational inference for non-conjugate models

12  Scalable MCMC inference


Part I

Topics in Bayesian Nonparametrics


Chapter 1

Poisson distribution and process, superposition and marking theorems

• The Poisson distribution is perhaps the fundamental discrete distribution and, along with the Gaussian distribution, one of the two fundamental distributions of probability.

Importance: Poisson → discrete r.v.’s; Gaussian → continuous r.v.’s

Definition: A random variable X ∈ {0, 1, 2, ...} is Poisson distributed with parameter λ > 0 if

P(X = n | λ) = (λ^n / n!) e^{-λ},   (1.1)

denoted X ∼ Pois(λ).

Moments of Poisson

E[X] = Σ_{n=1}^∞ n P(X = n | λ) = Σ_{n=1}^∞ (λ^n / (n-1)!) e^{-λ} = λ Σ_{n=1}^∞ (λ^{n-1} / (n-1)!) e^{-λ} = λ,   (1.2)

since the last sum equals 1.

E[X²] = Σ_{n=1}^∞ n² P(X = n | λ) = λ Σ_{n=1}^∞ (n λ^{n-1} / (n-1)!) e^{-λ} = λ Σ_{n=0}^∞ ((n+1) λ^n / n!) e^{-λ} = λ(E[X] + 1) = λ² + λ   (1.3)

V[X] = E[X²] - E[X]² = λ   (1.4)

Sums of Poisson r.v.’s (take 1)

• Sums of Poisson r.v.’s are also Poisson. Let X₁ ∼ Pois(λ₁) and X₂ ∼ Pois(λ₂). Then X₁ + X₂ ∼ Pois(λ₁ + λ₂).


Important interlude: Laplace transforms and sums of r.v.’s

• Laplace transforms give a very easy way to calculate the distribution of sums of r.v.’s (among other things).

Laplace transforms: Let X ∼ p(X) be a positive random variable and let t > 0. The Laplace transform of X is

E[e^{-tX}] = ∫_X e^{-tx} p(x) dx   (sums when appropriate)   (1.5)

Important property

• There is a one-to-one mapping between p(x) and E[e^{-tX}]. That is, if p(x) and q(x) are two distributions and E_p[e^{-tX}] = E_q[e^{-tX}], then p(x) = q(x) for all x (p and q are the same distribution).

Sums of r.v.’s

• Let X₁ ∼ p(x) and X₂ ∼ q(x) independently, and let Y = X₁ + X₂. What is the distribution of Y?

• Approach: Take the Laplace transform of Y and see what happens.

E[e^{-tY}] = E[e^{-t(X₁+X₂)}] = E[e^{-tX₁} e^{-tX₂}] = E[e^{-tX₁}] E[e^{-tX₂}]   (by independence of X₁ and X₂)   (1.6)

• So we can multiply the Laplace transforms of X₁ and X₂ and see if we recognize the result.

Sums of Poisson r.v.’s (take 2)

• The Laplace transform of a Poisson random variable has a very important form that should be memorized:

E[e^{-tX}] = Σ_{n=0}^∞ e^{-tn} (λ^n / n!) e^{-λ} = e^{-λ} Σ_{n=0}^∞ (λ e^{-t})^n / n! = e^{-λ} e^{λ e^{-t}} = e^{λ(e^{-t} - 1)}   (1.7)

• Back to the problem: X₁ ∼ Pois(λ₁), X₂ ∼ Pois(λ₂) independently, Y = X₁ + X₂.

E[e^{-tY}] = E[e^{-tX₁}] E[e^{-tX₂}] = e^{λ₁(e^{-t}-1)} e^{λ₂(e^{-t}-1)} = e^{(λ₁+λ₂)(e^{-t}-1)}   (1.8)

We recognize that the last term is the Laplace transform of a Pois(λ₁ + λ₂) random variable. We can therefore conclude that Y ∼ Pois(λ₁ + λ₂).

• Another way of saying this: if we draw Y₁ ∼ Pois(λ₁ + λ₂), X₁ ∼ Pois(λ₁) and X₂ ∼ Pois(λ₂), and define Y₂ = X₁ + X₂, then Y₁ is equal to Y₂ in distribution. (i.e., they may not be equal, but they have the same distribution. We write this as Y₁ =_d Y₂.)

• The idea extends to sums of more than two. Let X_i ∼ Pois(λ_i) independently. Then Σ_i X_i ∼ Pois(Σ_i λ_i), since

E[e^{-t Σ_i X_i}] = Π_i E[e^{-tX_i}] = e^{(Σ_i λ_i)(e^{-t} - 1)}.   (1.9)


A conjugate prior for λ

• What if we have X₁, ..., X_N that we believe to be generated by a Pois(λ) distribution, but we don’t know λ?

• One answer: Put a prior distribution on λ, p(λ), and calculate the posterior of λ using Bayes’ rule.

Bayes’ rule (review)

P(A, B) = P(A|B)P(B) = P(B|A)P(A)   (1.10)

⇒  P(A|B) = P(B|A)P(A) / P(B)   (1.11)

posterior = likelihood × prior / evidence   (1.12)

Gamma prior

• Let λ ∼ Gam(a, b), where p(λ|a, b) = (b^a / Γ(a)) λ^{a-1} e^{-bλ} is a gamma distribution. Then the posterior of λ is

p(λ | X₁, ..., X_N) ∝ p(X₁, ..., X_N | λ) p(λ) = [Π_{i=1}^N (λ^{X_i} / X_i!) e^{-λ}] (b^a / Γ(a)) λ^{a-1} e^{-bλ} ∝ λ^{a + Σ_{i=1}^N X_i - 1} e^{-(b+N)λ}   (1.13)

⇒  p(λ | X₁, ..., X_N) = Gam(a + Σ_{i=1}^N X_i, b + N)   (1.14)

Note that E[λ | X₁, ..., X_N] = (a + Σ_{i=1}^N X_i) / (b + N) ≈ the empirical average of the X_i (makes sense because E[X|λ] = λ), and

V[λ | X₁, ..., X_N] = (a + Σ_{i=1}^N X_i) / (b + N)² ≈ the empirical average divided by N (we get more confident as we see more X_i).

• The gamma distribution is said to be the conjugate prior for the parameter of the Poisson distribution because the posterior is also gamma.
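A quick numerical sanity check of this conjugate update (a minimal sketch in Python, assuming numpy is available; the prior values a, b and the rate used to simulate data are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

a, b = 2.0, 1.0          # illustrative Gam(a, b) prior on lambda (b is a rate)
lam_true = 3.5           # "unknown" rate used only to simulate data
N = 100
X = rng.poisson(lam_true, size=N)

# Conjugate update: posterior is Gam(a + sum(X), b + N)
a_post = a + X.sum()
b_post = b + N

post_mean = a_post / b_post        # ~ empirical average of X for large N
post_var = a_post / b_post**2      # shrinks roughly like 1/N

print(post_mean, X.mean(), post_var)
```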

Poisson–Multinomial

• A sequence of Poisson r.v.’s is closely related to the multinomial distribution as follows:

Let X_i ∼ Pois(λ_i) independently and let Y = Σ_{i=1}^N X_i. Then what is the distribution of X = (X₁, ..., X_N) given Y?


We can use basic rules of probability:

P(X₁, ..., X_N) = P(X₁, ..., X_N, Y = Σ_{i=1}^N X_i)   (Y is a deterministic function of X_{1:N})
 = P(X₁, ..., X_N | Y = Σ_{i=1}^N X_i) P(Y = Σ_{i=1}^N X_i)   (1.15)

And so

P(X₁, ..., X_N | Y = Σ_{i=1}^N X_i) = P(X₁, ..., X_N) / P(Y = Σ_{i=1}^N X_i) = Π_i P(X_i) / P(Y = Σ_{i=1}^N X_i)   (1.16)

• We know that P(Y) = Pois(Y; Σ_{i=1}^N λ_i), so

P(X₁, ..., X_N | Y = Σ_i X_i) = [Π_{i=1}^N (λ_i^{X_i} / X_i!) e^{-λ_i}] / [((Σ_{i=1}^N λ_i)^{Σ_i X_i} / (Σ_i X_i)!) e^{-Σ_i λ_i}]
 = (Y! / (X₁! ··· X_N!)) Π_{i=1}^N (λ_i / Σ_j λ_j)^{X_i}   (1.17)

⇒  Mult(Y; p₁, ..., p_N),   p_i = λ_i / Σ_j λ_j

• What is this saying?

1. Given the sum of N independent Poisson r.v.’s, the individual values are distributed as a multinomial using the normalized parameters.

2. We can sample X₁, ..., X_N in two equivalent ways:

a) Sample X_i ∼ Pois(λ_i) independently

b) First sample Y ∼ Pois(Σ_j λ_j), then (X₁, ..., X_N) ∼ Mult(Y; λ₁/Σ_j λ_j, ..., λ_N/Σ_j λ_j)

Poisson as a limiting case distribution

• The Poisson distribution arises as a limiting case of the sum over many binary events, each having small probability.

Binomial distribution and Bernoulli process

• Imagine we have an array of random variables X_{nm}, where X_{nm} ∼ Bern(λ/n) for m = 1, ..., n and fixed 0 ≤ λ ≤ n. Let Y_n = Σ_{m=1}^n X_{nm} and Y = lim_{n→∞} Y_n. Then Y_n ∼ Bin(n, λ/n) and Y ∼ Pois(λ).


Picture: Have n coins with bias λ/n evenly spaced on [0, 1]. Go down the line and flip each independently. In the limit n → ∞, the total # of 1’s is a Pois(λ) r.v.

Proof:

lim_{n→∞} P(Y_n = k | λ) = lim_{n→∞} (n! / ((n-k)! k!)) (λ/n)^k (1 - λ/n)^{n-k}   (1.18)

 = lim_{n→∞} [n(n-1)···(n-k+1) / n^k] [(1 - λ/n)^{-k}] [(λ^k / k!)(1 - λ/n)^n]

The first two bracketed terms → 1 and the third → (λ^k / k!) e^{-λ} = Pois(k; λ).

So lim_{n→∞} Bin(n, λ/n) = Pois(λ).

A more general statement

• Let λ_{nm} be an array of positive numbers such that

1. Σ_{m=1}^n λ_{nm} = λ < ∞
2. λ_{nm} < 1 and lim_{n→∞} λ_{nm} = 0 for all m

Let X_{nm} ∼ Bern(λ_{nm}) for m = 1, ..., n. Let Y_n = Σ_{m=1}^n X_{nm} and Y = lim_{n→∞} Y_n. Then Y ∼ Pois(λ).

Proof (using the Laplace transform):

1. E[e^{-tY}] = lim_{n→∞} E[e^{-tY_n}] = lim_{n→∞} E[e^{-t Σ_{m=1}^n X_{nm}}] = lim_{n→∞} Π_{m=1}^n E[e^{-tX_{nm}}]

2. E[e^{-tX_{nm}}] = λ_{nm} e^{-t·1} + (1 - λ_{nm}) e^{-t·0} = 1 - λ_{nm}(1 - e^{-t})

3. So E[e^{-tY}] = lim_{n→∞} Π_{m=1}^n (1 - λ_{nm}(1 - e^{-t})) = lim_{n→∞} e^{Σ_{m=1}^n ln(1 - λ_{nm}(1 - e^{-t}))}

4. ln(1 - λ_{nm}(1 - e^{-t})) = -Σ_{s=1}^∞ (1/s) λ_{nm}^s (1 - e^{-t})^s because 0 ≤ 1 - λ_{nm}(1 - e^{-t}) < 1

5. Σ_{m=1}^n ln(1 - λ_{nm}(1 - e^{-t})) = -(Σ_{m=1}^n λ_{nm})(1 - e^{-t}) - Σ_{s=2}^∞ (1/s)(Σ_{m=1}^n λ_{nm}^s)(1 - e^{-t})^s, where the first sum equals λ and the second term → 0 as n → ∞.

So E[e^{-tY}] = e^{-λ(1 - e^{-t})}. Therefore Y ∼ Pois(λ).


Poisson process

• In many ways, the Poisson process is no more complicated than the previous discussion on the Poisson distribution. In fact, the Poisson process can be thought of as a “structured” Poisson distribution, which should hopefully be more clear below.

Intuitions and notations

• S: a space (think R^d or part of R^d)
• Π: a random countable subset of S (i.e., a random # of points and their locations)
• A ⊂ S: a subset of S
• N(A): a counting measure. Counts how many points in Π fall in A (i.e., N(A) = |Π ∩ A|)
• µ(·): a measure on S, with µ(A) ≥ 0 for |A| > 0; µ(·) is non-atomic, so µ(A) → 0 as A ↓ ∅

Think of µ as a scaled probability distribution that is continuous, so that µ({x}) = 0 for all points x ∈ S.

Poisson processes

• A Poisson process Π is a countable subset of S such that

1. For A ⊂ S, N(A) ∼ Pois(µ(A))

2. For disjoint sets A₁, ..., A_k, N(A₁), ..., N(A_k) are independent Poisson random variables.

N(·) is called a “Poisson random measure”. (See above for the mapping from Π to N(·).)

Some basic properties

• The most basic properties follow from the properties of a Poisson distribution.

a) E[N(A)] = µ(A), so µ is sometimes referred to as a “mean measure”.

b) If A₁, ..., A_k are disjoint, then

N(∪_{i=1}^k A_i) = Σ_{i=1}^k N(A_i) ∼ Pois(Σ_{i=1}^k µ(A_i)) = Pois(µ(∪_{i=1}^k A_i))   (1.19)

Since this holds for k → ∞ and µ(A_i) ↘ 0, N(·) is “infinitely divisible”.


c) Let A₁, ..., A_k be disjoint subsets of S. Then

P(N(A₁) = n₁, ..., N(A_k) = n_k | N(∪_{i=1}^k A_i) = n) = P(N(A₁) = n₁, ..., N(A_k) = n_k) / P(N(∪_{i=1}^k A_i) = n)   (1.20)

Notice from earlier that N(A_i) ⇔ X_i. Following the same exact calculations,

P(N(A₁) = n₁, ..., N(A_k) = n_k | N(∪_{i=1}^k A_i) = n) = (n! / (n₁!···n_k!)) Π_{i=1}^k (µ(A_i) / µ(∪_{j=1}^k A_j))^{n_i}   (1.21)

Drawing from a Poisson process (break in the basic properties)

• Property (c) above gives a very simple way of drawing Π ∼ PP(µ), though some thought is required to see why.

1. Draw the total number of points N(S) ∼ Pois(µ(S)).

2. For i = 1, ..., N(S) draw X_i iid ∼ µ/µ(S). In other words, normalize µ to get a probability distribution on S.

3. Define Π = {X₁, ..., X_{N(S)}}.
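The recipe above translates directly into code. A minimal sketch for a homogeneous mean measure µ(A) = c·area(A) on the unit square, where the intensity c is an arbitrary illustrative value:

```python
import numpy as np

def draw_poisson_process(c, rng):
    """Draw Pi ~ PP(mu) on [0,1]^2 with mean measure mu(A) = c * area(A)."""
    n_points = rng.poisson(c * 1.0)              # N(S) ~ Pois(mu(S)), mu(S) = c
    # X_i iid ~ mu / mu(S): here the normalized measure is uniform on the square
    return rng.uniform(0.0, 1.0, size=(n_points, 2))

rng = np.random.default_rng(2)
Pi = draw_poisson_process(c=50.0, rng=rng)
print(len(Pi))    # a Pois(50) number of points, uniformly scattered
```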

d) Final basic property (of these notes):

E[e^{-tN(A)}] = e^{µ(A)(e^{-t} - 1)},   an obvious result since N(A) ∼ Pois(µ(A))   (1.22)

Some more advanced properties

Superposition Theorem: Let Π₁, Π₂, ... be a countable collection of independent Poisson processes with Π_i ∼ PP(µ_i). Let Π = ∪_{i=1}^∞ Π_i. Then Π ∼ PP(µ) with µ = Σ_{i=1}^∞ µ_i.

Proof: Remember from the definition of a Poisson process we have to show two things.

1. Let N(A) be the PRM (Poisson random measure) associated with the PP (Poisson process) Π and N_i(A) with Π_i. Clearly N(A) = Σ_{i=1}^∞ N_i(A), and since N_i(A) ∼ Pois(µ_i(A)) by definition, it follows that N(A) ∼ Pois(Σ_{i=1}^∞ µ_i(A)).

2. Let A₁, ..., A_k be disjoint. Then N(A₁), ..., N(A_k) are independent because the N_i(A_j) are independent for all i and j.

Restriction Theorem: If we restrict Π to a subset of S, we still have a Poisson process. Let S₁ ⊂ S and Π₁ = Π ∩ S₁. Then Π₁ ∼ PP(µ₁), where µ₁(A) = µ(S₁ ∩ A). This can be thought of as setting µ = 0 outside of S₁, or just looking at the subspace S₁ and ignoring the rest of S.

Mapping Theorem: This says that a one-to-one function y = f(x) preserves the Poisson process. That is, if Π_x ∼ PP(µ) and Π_y = f(Π_x), then Π_y is also a Poisson process with the proper transformation made to µ. (See Kingman for details. We won’t use this in this class.)


Example

Let N be a Poisson random measure on R² with mean measure µ(A) = c|A|, where |A| is the area of A (Lebesgue measure). Intuition: think trees in a forest.

Question 1: What is the distance R from the origin to the nearest point?

Answer: We know that R > r if N(B_r) = 0, where B_r is the ball of radius r. Since these are equivalent events, we know that

P(R > r) = P(N(B_r) = 0) = e^{-µ(B_r)} = e^{-cπr²}.   (1.23)

Question 2: Let each atom be the center of a disk of radius a. Take our line of sight as the x-axis. How far can we see?

Answer: The distance V is equivalent to the farthest we can extend a rectangle D_x with y-axis boundaries of [-a, a]. We know that V > x if N(D_x) = 0. Therefore

P(V > x) = P(N(D_x) = 0) = e^{-µ(D_x)} = e^{-2acx}.   (1.24)

Marked Poisson processes

• The other major theorem of these notes relates to “marking” the points of a Poisson process with a random variable.

• Let Π ∼ PP(µ). For each x ∈ Π, associate a r.v. y ∼ p(y|x). We say that y has “marked” x. The result is also a Poisson process.

Theorem: Let µ be a measure on a space S and p(y|x) a probability distribution on a space M. For each x ∈ Π ∼ PP(µ) draw y ∼ p(y|x) and define Π* = {(x_i, y_i)}. Then Π* is a Poisson process on S × M with mean measure µ*(dx, dy) = µ(dx)p(y|x)dy.

Comment: If N*(C) = |Π* ∩ C| for C ⊂ S × M, this says that N*(C) ∼ Pois(µ*(C)), where µ*(C) = ∫_C µ(dx)p(y|x)dy.


Proof: We need to show that E[e^{-tN*(C)}] = exp{∫_C (e^{-t} - 1) µ(dx)p(y|x)dy}.

1. Note that N*(C) = Σ_{i=1}^{N(S)} 1{(x_i, y_i) ∈ C}, where N(S) is the PRM associated with Π ∼ PP(µ).

2. Recall the tower property: E[f(A, B)] = E[E[f(A, B) | B]].

3. Therefore, E[e^{-tN*(C)}] = E[E[exp{-t Σ_{i=1}^{N(S)} 1{(x_i, y_i) ∈ C}} | Π]].

4. Manipulating:

E[e^{-tN*(C)}] = E[ Π_{i=1}^{N(S)} E[e^{-t 1{(x_i, y_i) ∈ C}} | Π] ]   (1.25)
 = E[ Π_{i=1}^{N(S)} ∫_M {e^{-t·1} 1{(x_i, y_i) ∈ C} + e^{-t·0} 1{(x_i, y_i) ∉ C}} p(y_i|x_i) dy_i ]

5. Continuing, using 1{(x_i, y_i) ∉ C} = 1 - 1{(x_i, y_i) ∈ C}:

E[e^{-tN*(C)}] = E[ Π_{i=1}^{N(S)} [1 + ∫_M (e^{-t} - 1) 1{(x_i, y_i) ∈ C} p(y_i|x_i) dy_i] ]
 = E[ Π_{i=1}^n [1 + ∫_S ∫_M (e^{-t} - 1) 1{(x, y) ∈ C} p(y|x) dy µ(dx)/µ(S)] | N(S) = n ]   (tower property again, using the Poisson–multinomial representation)
 = E[ (1 + ∫_C (e^{-t} - 1) p(y|x) dy µ(dx)/µ(S))^{N(S)} ],   N(S) ∼ Pois(µ(S))   (1.26)

6. Recall that if n ∼ Pois(λ), then

E[z^n] = Σ_{n=0}^∞ z^n (λ^n / n!) e^{-λ} = e^{-λ} Σ_{n=0}^∞ (zλ)^n / n! = e^{λ(z-1)}.

Therefore, this last expectation shows that

E[e^{-tN*(C)}] = exp{µ(S) ∫_C (e^{-t} - 1) p(y|x) dy µ(dx)/µ(S)} = exp{∫_C (e^{-t} - 1) µ(dx)p(y|x)dy},   (1.27)

thus N*(C) ∼ Pois(µ*(C)).


Example 1: Coloring

• Let Π ∼ PP(µ) and let each x ∈ Π be randomly colored from among K colors. Denote the color by y with P(y = i) = p_i. Then Π* = {(x_i, y_i)} is a PP on S × {1, ..., K} with mean measure µ*(dx · {y}) = µ(dx) Π_{i=1}^K p_i^{1(y=i)}. If we restrict Π* to the ith color, called Π*_i, then we know that Π*_i ∼ PP(p_i µ). We can also restrict to two colors, etc.

Example 2: Using the extended space

• Imagine we have a 24-hour store and customers arrive according to a PP with mean measure µ on R (time). In this case, let µ(R) = ∞, but µ([a, b]) < ∞ for finite a, b (which gives the expected number of arrivals between times a and b). Imagine a customer arriving at time x stays for duration y ∼ p(y|x). At time t, what can we say about the customers in the store?

• Π ∼ PP(µ), and Π* = {(x_i, y_i)} for x_i ∈ Π is PP(µ(dx)p(y|x)dy) because it’s a marked Poisson process.

• We can construct the marked Poisson process in the (arrival time, duration) plane. The counting measure N*(C) ∼ Pois(µ*(C)), with µ* = µ(dx)p(y|x)dy.

• The points in the set C_t = {(x, y) : x ≤ t, x + y > t} are those that arrive before time t and are still there at time t. It follows that

N*(C_t) ∼ Pois( ∫_0^t µ(dx) ∫_{t-x}^∞ p(y|x) dy ).

N*(C_t) is the number of customers in the store at time t.
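A simulation sketch of this store example under the simplifying (and purely illustrative) assumptions of a constant arrival rate and exponential durations independent of arrival time; the empirical count at time t should match the Poisson mean given by the integral above:

```python
import numpy as np

rng = np.random.default_rng(3)
rate = 5.0          # arrivals per hour: mu(dx) = rate * dx (illustrative)
mean_stay = 0.5     # durations y ~ Exp(mean 0.5 hours), independent of x
t = 10.0
trials = 20_000

counts = np.empty(trials, dtype=int)
for s in range(trials):
    n = rng.poisson(rate * t)                # arrivals in [0, t]
    x = rng.uniform(0.0, t, size=n)          # arrival times
    y = rng.exponential(mean_stay, size=n)   # marks: durations
    counts[s] = np.sum(x + y > t)            # still in the store at time t

# Theory: N*(C_t) ~ Pois( integral_0^t rate * P(y > t - x) dx )
theory_mean = rate * mean_stay * (1 - np.exp(-t / mean_stay))
print(counts.mean(), theory_mean)
```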


Chapter 2

Completely random measures, Campbell’s theorem, gamma process

Poisson process review

• Recall the definition of a Poisson random measure.

PRM definition: Let S be a space and µ a non-atomic (i.e., diffuse, continuous) measure on it (think of a positive function). A random measure N on S is a PRM with mean measure µ if

a) For every subset A ⊂ S, N(A) ∼ Pois(µ(A)).

b) For disjoint sets A₁, ..., A_k, N(A₁), ..., N(A_k) are independent r.v.’s.

Poisson process definition: Let X₁, ..., X_{N(S)} be the N(S) points in N (a random number), each of measure equal to one. Then the point process Π = {X₁, ..., X_{N(S)}} is a Poisson process, denoted Π ∼ PP(µ).

Recall that to draw this (when µ(S) < ∞) we can

a) Draw N(S) ∼ Pois(µ(S))

b) For i = 1, ..., N(S), draw X_i iid ∼ µ/µ(S)   (normalize µ to get a probability measure)

Functions of Poisson processes

• Often models will take the form of a function of an underlying Poisson process: Σ_{x∈Π} f(x).

Example (see Sec. 5.3 of Kingman): Imagine that star locations are distributed as Π ∼ PP(µ). They’re marked independently with a mass m ∼ p(m), giving a marked PP Π* ∼ PP(µ × p dm). The gravitational field at, e.g., the origin is

Σ_{(x,m)∈Π*} f((x, m)) = Σ_{(x,m)∈Π*} (Gm / ‖x‖₂³) x   (G is a constant from physics)


• We can analyze these sorts of problems using PP techniques.

• We will be more interested in the context of “completely random measures” in this class.

Functions: The finite case

• Let Π ∼ PP(µ) and |Π| < ∞ (with probability 1). Let f(x) be a positive function. Let M = Σ_{x∈Π} f(x). We calculate its Laplace transform (for t < 0):

E[e^{tM}] = E[e^{t Σ_{x∈Π} f(x)}] = E[ Π_{i=1}^{|Π|} e^{tf(x_i)} ]   (recall the two facts below)   (2.1)

Recall:

1. |Π| = N(S), the Poisson random measure for Π evaluated on all of S

2. Tower property: E[g(x, y)] = E[E[g(x, y) | y]].

So:

E[ Π_{i=1}^{N(S)} e^{tf(x_i)} ] = E[ E[ Π_{i=1}^{N(S)} e^{tf(x_i)} | N(S) ] ] = E[ E[e^{tf(x)}]^{N(S)} ].   (2.2)

Since N(S) ∼ Pois(µ(S)), we use the last term to conclude

E[ Π_{i=1}^{N(S)} e^{tf(x_i)} ] = exp{µ(S)(E[e^{tf(x)}] - 1)} = exp ∫_S µ(dx)(e^{tf(x)} - 1)

(since E[e^{tf(x)}] = ∫_S (µ(dx)/µ(S)) e^{tf(x)}; recall that E[(e^t)^{N(A)}] = exp ∫_A µ(dx)(e^t - 1), which is the case f(x) = 1).

And so for functions M = Σ_{x∈Π} f(x) of Poisson processes Π with an almost surely finite number of points,

E[e^{tM}] = exp ∫_S µ(dx)(e^{tf(x)} - 1).   (2.3)


The infinite case

• In the case where µ(S) = ∞, N(S) = ∞ with probability one. We therefore use a different proof technique that gives the same result.

• Approximate f(x) with simple functions: f_k = Σ_i a_i 1_{A_i} using k2^k equally spaced levels in the interval (0, k), with A_i = [i/2^k, (i+1)/2^k) and a_i = i/2^k. In the limit k → ∞, the width of those levels 1/2^k ↘ 0, and so f_k → f.

• That’s not a proof, but consider that f_k(x) = max_{i ≤ k2^k} (i/2^k) 1(i/2^k < x), so f_k(x) ↗ x as k → ∞.

• Important notation change: M = Σ_{x∈Π} f(x) ⇔ ∫_S N(dx)f(x).

• Approximate f with f_k. Then with the notation change, M_k = Σ_{x∈Π} f_k(x) ⇔ Σ_i a_i N(A_i).

• The Laplace functional is

E[e^{tM}] = lim_{k→∞} E[e^{tM_k}] = lim_{k→∞} E[ Π_i e^{t a_i N(A_i)} ]   (N(A_i) and N(A_j) are independent)
 = lim_{k→∞} exp Σ_i µ(A_i)(e^{t a_i} - 1)   (N(A_i) ∼ Pois(µ(A_i)))
 = exp ∫_S µ(dx)(e^{tf(x)} - 1)   (the integral as a limit of infinitesimal sums)   (2.4)

Mean and variance of M: Using ideas from moment generating functions, it follows that

E(M) = ∫_S µ(dx)f(x) = (d/dt) E[e^{tM}] |_{t=0},   V(M) = ∫_S µ(dx)f(x)² = (d²/dt²) E[e^{tM}] |_{t=0} - E(M)²   (2.5)


Finiteness of ∫_S N(dx)f(x)

• The next obvious question when µ(S) = ∞ (and thus N(S) = ∞) is whether ∫_S N(dx)f(x) < ∞. Campbell’s theorem gives the necessary and sufficient conditions for this to be true.

• Campbell’s Theorem: Let Π ∼ PP(µ) and N the PRM. Let f(x) be a non-negative function on S. Then with probability one,

M = ∫_S N(dx)f(x) is finite if ∫_S min(f(x), 1) µ(dx) < ∞, and infinite otherwise.   (2.6)

Proof: For u > 0, e^{-uM} = e^{-uM} 1(M < ∞) ↗ 1(M < ∞) as u ↘ 0. By dominated convergence, as u ↘ 0

E[e^{-uM}] = E[e^{-uM} 1(M < ∞)] ↗ E[1(M < ∞)] = P(M < ∞)   (2.7)

In our case: P(M < ∞) = lim_{u↘0} exp ∫_S µ(dx)(e^{-uf(x)} - 1).

Sufficiency

1. For P(M < ∞) = 1 as in the theorem, we need lim_{u↘0} ∫_S µ(dx)(e^{-uf(x)} - 1) = 0.

2. For 0 < u < 1 we have 1 - f(x) < 1 - uf(x) < e^{-uf(x)}   (e^{-uf(x)} is convex in f(x) and 1 - uf(x) is a first-order Taylor expansion of e^{-uf(x)} at f(x) = 0).

3. Therefore 1 - e^{-uf(x)} < f(x). Also, 1 - e^{-uf(x)} < 1 trivially.

4. So:  0 ≤ ∫_S µ(dx)(1 - e^{-uf(x)}) ≤ ∫_S µ(dx) min(f(x), 1)   (2.8)

5. If ∫_S µ(dx) min(f(x), 1) < ∞, then by dominated convergence

lim_{u↘0} ∫_S µ(dx)(1 - e^{-uf(x)}) = ∫_S µ(dx)(1 - exp{lim_{u↘0} -uf(x)}) = 0   (2.9)

• This proves sufficiency. For necessity we can show that (see, e.g., Çinlar II.2.13)

∫_S min(f(x), 1)µ(dx) = ∞  ⇒  ∫_S µ(dx)(1 - e^{-uf(x)}) = ∞  ⇒  E[e^{-uM}] = 0, ∀u   (2.10)


Completely random measures (CRM)

• Definition of measure: The set function µ is a measure on the space S if

1. µ(∅) = 0
2. µ(A) ≥ 0 for all A ⊂ S
3. µ(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ(A_i) when A_i ∩ A_j = ∅, i ≠ j

• Definition of completely random measure: The set function M is a completely random measure on the space S if it satisfies #1–#3 above and

1. M(A) is a random variable
2. M(A₁), ..., M(A_k) are independent for disjoint sets A_i

• Example: Let N be the counting measure associated with Π ∼ PP(µ). It’s a CRM.

• We will be interested in the following situation: Let Π ∼ PP(µ) and mark each θ ∈ Π with a r.v. π ∼ λ(π), π ∈ R⁺. Then Π* = {(θ, π)} is a PP on S × R⁺ with mean measure µ(dθ)λ(π)dπ.

• If N(dθ, dπ) is the counting measure for Π*, then N(C) ∼ Pois(∫_C µ(dθ)λ(π)dπ).

• For A ⊂ S, let M(A) = ∫_A ∫_0^∞ N(dθ, dπ) π. Then M is a CRM on S.

• M is a special case of sums of functions of Poisson processes with f(θ, π) = π. Therefore we know that

E[e^{tM(A)}] = exp ∫_A ∫_0^∞ (e^{tπ} - 1) µ(dθ)λ(π)dπ.   (2.11)

• This works both ways: If we define M and show it has this Laplace transform, then we know there is a marked Poisson process “underneath” it with mean measure equal to µ(dθ)λ(π)dπ.

(Figure: a marked PP on S × R⁺ with mean measure µ(dθ)λ(π)dπ corresponds to the completely random measure on S given by M(A) = ∫_A ∫_0^∞ N(dθ, dπ)π.)


Gamma process

• Definition: Let µ be a non-atomic measure on S. Then G is a gamma process if for all A ⊂ S, G(A) ∼ Gam(µ(A), c) and G(A₁), ..., G(A_k) are independent for disjoint A₁, ..., A_k. We write G ∼ GaP(µ, c). (c > 0)

• Before trying to intuitively understand G, let’s calculate its Laplace transform. For t < 0,

E[e^{tG(A)}] = ∫_0^∞ (c^{µ(A)} / Γ(µ(A))) G(A)^{µ(A)-1} e^{-G(A)(c-t)} dG(A) = (c / (c-t))^{µ(A)}   (2.12)

• Manipulate this term as follows (and watch the magic!)

(c / (c-t))^{µ(A)} = exp{ -µ(A) ln((c-t)/c) }
 = exp{ -µ(A) ∫_c^{c-t} (1/s) ds }
 = exp{ -µ(A) ∫_c^{c-t} ds ∫_0^∞ e^{-πs} dπ }
 = exp{ -µ(A) ∫_0^∞ ∫_c^{c-t} e^{-πs} ds dπ }   (switched integrals)
 = exp{ µ(A) ∫_0^∞ (e^{tπ} - 1) π^{-1} e^{-cπ} dπ }   (2.13)

Therefore, G has an underlying Poisson random measure on S × R⁺:

G(A) = ∫_A ∫_0^∞ N(dθ, dπ) π,   N(dθ, dπ) ∼ Pois(µ(dθ)π^{-1}e^{-cπ}dπ)   (2.14)

• The mean measure of N is µ(dθ)π^{-1}e^{-cπ}dπ on S × R⁺. We can use this to answer questions about G ∼ GaP(µ, c) using the Poisson process perspective.

1. How many total atoms?  ∫_S ∫_0^∞ µ(dθ)π^{-1}e^{-cπ}dπ = ∞  ⇒  an infinite # w.p. 1.
(Tells us that there are an infinite number of points in any subset A ⊂ S that have nonzero mass according to G.)

2. How many atoms ≥ ε > 0?  ∫_S ∫_ε^∞ µ(dθ)π^{-1}e^{-cπ}dπ < ∞  ⇒  a finite # w.p. 1.
(With respect to #1, this further tells us only a finite number have mass greater than ε.)

3. Campbell’s theorem: with f(θ, π) = π,  ∫_S ∫_0^∞ min(π, 1) µ(dθ)π^{-1}e^{-cπ}dπ < ∞, therefore

G(A) = ∫_A ∫_0^∞ N(dθ, dπ)π < ∞ w.p. 1.

(Tells us that if we summed up the infinite number of nonzero masses in any set A, we would get a finite number even though we have an infinite number of nonzero things to add.)

• Aside: We already knew #3 by the definition of G, but this isn’t always the order in which CRMs are defined. Imagine starting the definition with a mean measure on S × R⁺.

• #3 shouldn’t feel mysterious at all. Consider Σ_{n=1}^∞ 1/n². It’s finite, but for each n, 1/n² > 0 and there are an infinite number of settings for n. “Infinite jump processes” such as the gamma process replace deterministic sequences like (1, 1/2², 1/3², ...) with something random.


Gamma process as a limiting case

• Is there a more intuitive way to understand the gamma process?

• Definition: Let µ be a non-atomic measure on S with µ(S) < ∞. Let π_i iid ∼ Gam(µ(S)/K, c) and θ_i iid ∼ µ/µ(S) for i = 1, ..., K. If G_K = Σ_{i=1}^K π_i δ_{θ_i}, then lim_{K→∞} G_K = G ∼ GaP(µ, c).

Picture: In the limit K → ∞, we have more and more atoms with smaller and smaller weights:

G_K(A) = Σ_{i=1}^K π_i δ_{θ_i}(A) = Σ_{i=1}^K π_i 1(θ_i ∈ A)

Proof: Use the Laplace transform.

E[e^{tG_K(A)}] = E[e^{t Σ_{i=1}^K π_i 1(θ_i ∈ A)}] = E[ Π_{i=1}^K e^{tπ_i 1(θ_i ∈ A)} ] = E[e^{tπ 1(θ ∈ A)}]^K
 = E[ e^{tπ} 1(θ ∈ A) + 1(θ ∉ A) ]^K
 = [ E[e^{tπ}] P(θ ∈ A) + P(θ ∉ A) ]^K
 = [ (c/(c-t))^{µ(S)/K} µ(A)/µ(S) + 1 - µ(A)/µ(S) ]^K
 = [ 1 + (µ(A)/µ(S)) ((c/(c-t))^{µ(S)/K} - 1) ]^K
 = [ 1 + (µ(A)/µ(S)) Σ_{n=1}^∞ ((ln(c/(c-t)))^n / n!) (µ(S)/K)^n ]^K   (exponential power series)
 = [ 1 + (µ(A)/µ(S)) ((µ(S)/K) ln(c/(c-t)) + O(1/K²)) ]^K   (2.15)

• In the limit K → ∞, this last equation converges to

exp{ (µ(A)/µ(S)) µ(S) ln(c/(c-t)) } = (c/(c-t))^{µ(A)}

(recall that lim_{K→∞}(1 + a/K + O(K^{-2}))^K = e^a).

• This is the Laplace transform of a Gam(µ(A), c) random variable.

• Therefore, G_K(A) → G(A) ∼ Gam(µ(A), c), which we’ve already defined and analyzed as a gamma process.
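A sketch of the finite approximation G_K on S = [0, 1] with µ equal to γ times the uniform distribution (γ, c, K and the test set A are illustrative); for large K, G_K(A) should be close to Gam(µ(A), c) in distribution:

```python
import numpy as np

rng = np.random.default_rng(4)
gamma_total, c, K = 3.0, 1.0, 2000    # mu(S) = 3, base measure uniform on [0,1]
trials = 2000

A = (0.0, 0.4)                        # a test set with mu(A) = 3 * 0.4
GK_A = np.empty(trials)
for s in range(trials):
    pi = rng.gamma(gamma_total / K, 1.0 / c, size=K)   # pi_i ~ Gam(mu(S)/K, c)
    theta = rng.uniform(0.0, 1.0, size=K)              # theta_i ~ mu / mu(S)
    GK_A[s] = pi[(theta >= A[0]) & (theta < A[1])].sum()

# Compare with the Gam(mu(A), c) limit
print(GK_A.mean(), gamma_total * 0.4 / c)    # means should roughly match
print(GK_A.var(), gamma_total * 0.4 / c**2)  # and variances
```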


Chapter 3

Beta processes and the Poisson process

A sparse coding latent factor model

• We have a d × n matrix Y. We want to factorize it as follows (figure omitted), where

θ_i ∼ p(θ),   i = 1, ..., K
w_j ∼ p(w),   j = 1, ..., n
z_j ∈ {0, 1}^K,   j = 1, ..., n

This is called “sparse coding” because each vector Y_j only possesses the columns of Θ indicated by z_j (we want Σ_i z_{ji} ≪ K).

• Example: Y could be

a) gene data of n people,
b) patches extracted from an image for denoising (called “dictionary learning”).

• We want to define a “Bayesian nonparametric” prior for this problem. By this we mean that

1. The prior can allow K → ∞ and remain well defined
2. As K → ∞, the effective rank is finite (and relatively small)
3. The model somehow learns this rank from the data (not discussed)


A “beta sieves” prior

• Let θ_i ∼ µ/µ(S) and w_j be drawn as above. Continue this generative model by letting

z_{ji} iid ∼ Bern(π_i),   j = 1, ..., n   (3.1)
π_i ∼ Beta(αγ/K, α(1 - γ/K)),   i = 1, ..., K   (3.2)

• The (θ_i, π_i) are paired: π_i gives the probability that an observation picks θ_i. Notice that we expect π_i → 0 as K → ∞.

• Construct a completely random measure H_K = Σ_{i=1}^K π_i δ_{θ_i}.

• We want to analyze what happens when K → ∞. We’ll see that it converges to a beta process.

Asymptotic analysis of beta sieves

• We have H_K = Σ_{i=1}^K π_i δ_{θ_i}, with π_i iid ∼ Beta(αγ/K, α(1 - γ/K)) and θ_i iid ∼ µ/µ(S), where γ = µ(S) < ∞. We want to understand lim_{K→∞} H_K.

• Look at the Laplace transform of H_K(A). Let H(A) = lim_{K→∞} H_K(A). Then

E[e^{tH(A)}] = lim_{K→∞} E[e^{tH_K(A)}] = lim_{K→∞} E[e^{t Σ_{i=1}^K π_i 1(θ_i ∈ A)}] = lim_{K→∞} E[e^{tπ 1(θ ∈ A)}]^K   (sum → product, using the i.i.d. fact)   (3.3)

• Focus on E[e^{tπ 1(θ ∈ A)}] for a particular K-level approximation. We have the following (long) sequence of equalities:

E[e^{tπ 1(θ ∈ A)}] = E[e^{tπ} 1(θ ∈ A) + 1(θ ∉ A)] = P(θ ∈ A) E[e^{tπ}] + P(θ ∉ A)
 = 1 + (µ(A)/µ(S)) (E[e^{tπ}] - 1),   where E[e^{tπ}] = 1 + Σ_{s=1}^∞ (t^s / s!) Π_{r=0}^{s-1} (αγ/K + r)/(α + r)
 = 1 + (µ(A)/µ(S)) Σ_{s=1}^∞ (t^s / s!) Π_{r=0}^{s-1} (αγ/K + r)/(α + r)   (plugging in)
 = 1 + (µ(A)/K) Σ_{s=1}^∞ (t^s / s!) Π_{r=1}^{s-1} r/(α + r) + O(1/K²)   (separate out the r = 0 factor)
 = 1 + (µ(A)/K) Σ_{s=1}^∞ (t^s / s!) αΓ(α)Γ(s)/Γ(α + s) + O(1/K²)   (since Γ(α + 1) = αΓ(α))
 = 1 + (µ(A)/K) Σ_{s=1}^∞ (t^s / s!) ∫_0^1 απ^{s-1}(1 - π)^{α-1} dπ + O(1/K²)   (normalizer of Beta(s, α))
 = 1 + (µ(A)/K) ∫_0^1 Σ_{s=1}^∞ ((tπ)^s / s!) απ^{-1}(1 - π)^{α-1} dπ + O(1/K²)
 = 1 + (µ(A)/K) ∫_0^1 (e^{tπ} - 1) απ^{-1}(1 - π)^{α-1} dπ + O(1/K²)

Again using lim_{K→∞}(1 + a/K + O(K^{-2}))^K = e^a, it follows that

lim_{K→∞} E[e^{tπ 1(θ ∈ A)}]^K = exp{ µ(A) ∫_0^1 (e^{tπ} - 1) απ^{-1}(1 - π)^{α-1} dπ }.   (3.4)

• We therefore know that

1. H is a completely random measure.
2. It has an associated underlying Poisson random measure N(dθ, dπ) on S × [0, 1] with mean measure µ(dθ)απ^{-1}(1 - π)^{α-1}dπ.
3. We can write H(A) as ∫_A ∫_0^1 N(dθ, dπ)π.

Beta process (as a CRM)

• Definition: Let N(dθ, dπ) be a Poisson random measure on S × [0, 1] with mean measure µ(dθ)απ^{-1}(1 - π)^{α-1}dπ, where µ is a non-atomic measure. Define the CRM H(A) = ∫_A ∫_0^1 N(dθ, dπ)π. Then H is called a beta process, H ∼ BP(α, µ), and

E[e^{tH(A)}] = exp{ µ(A) ∫_0^1 (e^{tπ} - 1) απ^{-1}(1 - π)^{α-1} dπ }.

• (We just saw how we can think of H as the limit of a finite collection of random variables. This time we’re just starting from the definition, which we could proceed to analyze regardless of the beta sieves discussion above.)

• Properties of H: Since H has a Poisson process representation, we can use the mean measure to calculate its properties (and therefore the asymptotic properties of the beta sieves approximation).

• Finiteness: Using Campbell’s theorem, H(A) is finite with probability one, since

∫_A ∫_0^1 min(π, 1) απ^{-1}(1 - π)^{α-1} dπ µ(dθ) = µ(A) < ∞   (by assumption about µ; here min(π, 1) = π)   (3.5)

• Infinite jump process: H(A) is constructed from an infinite number of jumps, almost all infinitesimally small, since

N(A × (0, 1]) ∼ Pois( µ(A) ∫_0^1 απ^{-1}(1 - π)^{α-1} dπ ) = Pois(∞)   (3.6)

• Finite number of “big” jumps: There are only a finite number of jumps greater than any ε > 0, since

N(A × [ε, 1]) ∼ Pois( µ(A) ∫_ε^1 απ^{-1}(1 - π)^{α-1} dπ ),   where the integral is finite for ε > 0.   (3.7)

As ε → 0, the value in the Poisson goes to infinity, so the infinite jump process arises in this limit. Since the integral over the magnitudes is finite, this infinite number of atoms is being introduced in a “controlled” way as a function of ε (i.e., not “too quickly”).


• Reminder and intuitions: All of these properties are over instantiations of a beta process, and so all statements are made with probability one.

– It’s not absurd to talk about beta processes that don’t have an infinite number of jumps, or that integrate to something infinite (“not absurd” in the way that it is absurd to talk about a negative value drawn from a beta distribution).

– The support of the beta process includes these events, but they have probability zero, so any H ∼ BP(α, µ) is guaranteed to have the properties discussed above.

– It’s easy to think of H as one random variable, but as the beta sieves approximation shows, H is really a collection of an infinite number of random variables.

– The statements we are making about H above aren’t like asking whether a beta random variable is greater than 0.5. They are larger-scale statements about properties of this infinite collection of random variables as a whole.

• Another definition of the beta process links it to the beta distribution and our finite approximation:

• Definition II: Let µ be a non-atomic measure on S. For all infinitesimal sets dθ ∈ S, let

H(dθ) ∼ Beta(αµ(dθ), α(1 - µ(dθ))),

then H ∼ BP(α, µ).

• We aren’t going to prove this, but the proof is actually very similar to the beta sieves proof.

• Note the difference from the gamma process, where G(A) ∼ Gam(µ(A), c) for any A ⊂ S. The beta distribution only comes in the infinitesimal limit. That is,

H(A) is not distributed as Beta(αµ(A), α(1 - µ(A)))

when µ(A) > 0. Therefore, we can only write beta distributions on things that equal zero with probability one. Compare this with the limit of the beta sieves prior.

• Observation: While µ({θ}) = µ(dθ) = 0, ∫_A µ({θ})dθ = 0 but ∫_A µ(dθ) = µ(A) > 0.

• This is a major difference between a measure and a function: µ is a measure, not a function. It also seems to me a good example of why these additional concepts and notations are necessary, e.g., why we can’t just combine things like µ(A) = ∫_A p(θ)dθ into one single notation, but instead talk about the “measure” µ and its associated density p(θ) such that µ(dθ) = p(θ)dθ.

• This leads to discussions involving the Radon–Nikodym theorem, etc.

• The level of our discussion stops at an appreciation for why these types of theorems exist and are necessary (as overly obsessive as they may feel the first time they’re encountered), but we won’t re-derive them.


Bernoulli process

• The Bernoulli process is constructed from the infinite limit of the “z” sequence in the process z_{ji} ∼ Bern(π_i), π_i iid ∼ Beta(αγ/K, α(1 - γ/K)), i = 1, ..., K. The random measure X_j^{(K)} = Σ_{i=1}^K z_{ji} δ_{θ_i} converges to a “Bernoulli process” as K → ∞.

• Definition: Let H ∼ BP(α, µ). For each atom of H (the θ for which H({θ}) > 0), let X_j({θ}) | H ind ∼ Bern(H({θ})). Then X_j is a Bernoulli process, denoted X_j | H ∼ BeP(H).

• Observation: We know that H has an infinite number of locations θ where H({θ}) > 0 because it’s a discrete measure (the Poisson process proves that). Therefore, X is infinite as well.

Some questions about X

1. How many 1’s are in X_j?

2. For X₁, ..., X_n | H ∼ BeP(H), how many total locations are there with at least one X_j equaling one there (marginally speaking, with H integrated out)? i.e., what is |{θ : Σ_{j=1}^n X_j({θ}) > 0}|?

• The Poisson process representation of the BP makes calculating this relatively easy. We start by observing that the X_j are marking the atoms, and so we have a marked Poisson process (or a “doubly marked” process, since we can view (θ, π) as a marked PP as well).

Beta process marked by a Bernoulli process

• Definition: Let Π ∼ PP(µ(dθ)απ^{-1}(1 - π)^{α-1}dπ) on S × [0, 1] be a Poisson process underlying a beta process. For each (θ, π) ∈ Π draw a binary vector z ∈ {0, 1}^n where z_i | π iid ∼ Bern(π) for i = 1, ..., n. Denote the distribution on z as Q(z|π). Then Π* = {(θ, π, z)} is a marked Poisson process with mean measure µ(dθ)απ^{-1}(1 - π)^{α-1}dπ Q(z|π).

• There is therefore a Poisson process underlying the joint distribution of the hierarchical process H ∼ BP(α, µ), X_i | H iid ∼ BeP(H), i = 1, ..., n.

• We next answer the two questions about X asked above, starting with the second one.


Question: What is K_n^+ = |{θ : Σ_{j=1}^n X_j({θ}) > 0}| ?

Answer: The transition distribution Q(z|π) gives the probability of a vector z at a particular location (θ, π) ∈ Π (notice Q doesn’t depend on θ). All we care about is whether z ∈ C = {0, 1}^n \ {0} (i.e., z has a 1 in it). We make the following observations:

– The probability Q(C|π) = P(z ∈ C|π) = 1 - P(z ∉ C|π) = 1 - (1 - π)^n.

– If we restrict the marked PP to C, we get the distribution on the value of K_n^+:

K_n^+ = N(S, [0, 1], C) ∼ Pois( ∫_S ∫_0^1 µ(dθ) απ^{-1}(1 - π)^{α-1} Q(C|π) dπ ),   Q(C|π) = 1 - (1 - π)^n.   (3.8)

– It’s worth stopping to remember that N is a counting measure, and to think about what exactly N(S, [0, 1], C) is counting.

  * N(S, [0, 1], C) is asking for the number of times event C happens (an event related to z), not caring about what the corresponding θ or π are (hence the S and [0, 1]).
  * i.e., it’s counting the thing we’re asking for, K_n^+.

– We can show that 1 - (1 - π)^n = Σ_{i=0}^{n-1} π(1 - π)^i   (geometric series).

– It follows that

∫_S ∫_0^1 µ(dθ) απ^{-1}(1 - π)^{α-1}(1 - (1 - π)^n) dπ = µ(S) Σ_{i=0}^{n-1} α ∫_0^1 (1 - π)^{α+i-1} dπ = µ(S) Σ_{i=0}^{n-1} αΓ(1)Γ(α + i)/Γ(α + i + 1) = Σ_{i=0}^{n-1} αµ(S)/(α + i)   (3.9)

• Therefore K_n^+ ∼ Pois( Σ_{i=0}^{n-1} αµ(S)/(α + i) ).

• Notice that as n → ∞, K_n^+ → ∞ with probability one, and that E[K_n^+] ≈ αµ(S) ln n.

• Also notice that we get the answer to the first question for free. Since the X_j are i.i.d., we can treat each one marginally as if it were the first one.

• If n = 1, X(S) ∼ Pois(µ(S)). That is, the number of ones in each Bernoulli process is Pois(µ(S))-distributed.


Chapter 4

Beta processes and size-biased constructions

The beta process

• Definition (review): Let α > 0 and µ be a finite non-atomic measure on S. Let C ⊂ S × [0, 1] and N be a Poisson random measure with N(C) ∼ Pois(∫_C µ(dθ)απ^{-1}(1 - π)^{α-1}dπ). For A ⊂ S define H(A) = ∫_A ∫_0^1 N(dθ, dπ)π. Then H is a beta process, denoted H ∼ BP(α, µ).

Intuitive picture (review)

Figure 4.1: (left) Poisson process; (right) CRM constructed from the Poisson process. If (dθ, dπ) is a point in the PP, N(dθ, dπ) = 1 and N(dθ, dπ)π = π. H(A) adds up the π’s in the set A × [0, 1].

Drawing from this prior

• In general, we know that if Π ∼ PP(µ), we can draw N(S) ∼ Pois(µ(S)) and X₁, ..., X_{N(S)} iid ∼ µ/µ(S) and construct Π from the X_i’s.

• Similarly, we have the reverse property: if N ∼ Pois(γ) and X₁, ..., X_N iid ∼ p(X), then the set Π = {X₁, ..., X_N} ∼ PP(γ p(X)dX). (This inverse property will be useful later.)


• Since ∫_S ∫_0^1 µ(dθ)απ^{-1}(1 - π)^{α-1}dπ = ∞, this approach obviously won’t work for drawing H ∼ BP(α, µ).

• The method of partitioning [0, 1] and drawing N(S × (a, b]) followed by

θ_i^{(a)} ∼ µ/µ(S),   π_i^{(a)} ∼ [απ^{-1}(1 - π)^{α-1} / ∫_a^b απ^{-1}(1 - π)^{α-1}dπ] 1(a < π_i^{(a)} ≤ b)   (4.1)

is possible (using the Restriction theorem from Lecture 1, independence of PPs on disjoint sets, and the first bullet of this section), but not as useful for Bayesian models.

• The goal is to find size-biased representations for H that are more straightforward (i.e., that involve sampling from standard distributions, which will hopefully make inference easier).

Size-biased representation I (a “restricted beta process”)

• Definition: Let α = 1 and µ be a non-atomic measure on S with µ(S) = γ < ∞. Generate the following independent set of random variables:

V_i iid ∼ Beta(γ, 1),   θ_i iid ∼ µ/µ(S),   i = 1, 2, ...   (4.2)

Let H = Σ_{i=1}^∞ (Π_{j=1}^i V_j) δ_{θ_i}. Then H ∼ BP(1, µ).

Proof: The proof uses the limiting case of the following finite approximation.

• Let π_i ∼ Beta(γ/K, 1), θ_i ∼ µ/µ(S) for i = 1, ..., K. Let H_K = Σ_{i=1}^K π_i δ_{θ_i}. Then lim_{K→∞} H_K ∼ BP(1, µ). The proof is similar to the one last lecture.

Question 1: As K → ∞, what is π_{(1)} = max{π₁, ..., π_K}?

Answer: Look at the CDFs. We want the function P(π_{(1)} < V₁) for a V₁ ∈ [0, 1]. Because the π_i are independent,

P(π_{(1)} < V₁) = lim_{K→∞} P(π₁ < V₁, ..., π_K < V₁) = lim_{K→∞} Π_{i=1}^K P(π_i < V₁)   (4.3)

– P(π_i < V₁) = ∫_0^{V₁} (γ/K) π_i^{γ/K - 1} dπ_i = V₁^{γ/K}

– lim_{K→∞} Π_{i=1}^K P(π_i < V₁) = V₁^γ

– Therefore, π_{(1)} = V₁ with V₁ ∼ Beta(γ, 1).


Question 2: What is the second largest, denoted π_{(2)} = lim_{K→∞} max{π₁, ..., π_K} \ {π_{(1)}}?

Answer: This is a little more complicated, but answering how to get π_{(2)} shows how to get the remaining π_{(i)}.

P(π_{(2)} < t | π_{(1)} = V₁) = Π_{π_i ≠ π_{(1)}} P(π_i < t | π_i < V₁)   (the condition is that each π_i < π_{(1)} = V₁)
 = Π_{π_i ≠ π_{(1)}} P(π_i < t, π_i < V₁) / P(π_i < V₁)
 = Π_{π_i ≠ π_{(1)}} P(π_i < t) / P(π_i < V₁)   (since t < V₁, the first event contains the second)
 = lim_{K→∞} Π_{π_i ≠ π_{(1)}} [ ∫_0^t (γ/K)π_i^{γ/K-1}dπ_i / ∫_0^{V₁} (γ/K)π_i^{γ/K-1}dπ_i ]
 = lim_{K→∞} [ (t/V₁)^{γ/K} ]^{K-1} = (t/V₁)^γ   (4.4)

– So the density p(π_{(2)} | π_{(1)} = V₁) = V₁^{-1} γ (π_{(2)}/V₁)^{γ-1}. π_{(2)} has support [0, V₁].

– Change of variables: V₂ := π_{(2)}/V₁, so π_{(2)} = V₁V₂ and dπ_{(2)} = V₁ dV₂.

– Plugging in, p(V₂ | π_{(1)} = V₁) = V₁^{-1} γ V₂^{γ-1} · V₁ (the Jacobian) = γ V₂^{γ-1} = Beta(γ, 1).

– The above calculation has shown two things:

1. V₂ is independent of V₁ (this is an instance of a “neutral-to-the-right” process)
2. V₂ ∼ Beta(γ, 1)

• Since π_{(2)} | {π_{(1)} = V₁} = V₁V₂ and V₁, V₂ are independent, we can get the value of π_{(2)} using the previously drawn V₁ and then drawing V₂ from a Beta(γ, 1) distribution.

• The same exact reasoning follows for π_{(3)}, π_{(4)}, ...

• For example, for π_{(3)}, we have P(π_{(3)} < t | π_{(2)} = V₁V₂, π_{(1)} = V₁) = P(π_{(3)} < t | π_{(2)} = V₁V₂), because conditioning on π_{(3)} < π_{(2)} = V₁V₂ automatically enforces the condition π_{(3)} < π_{(1)}.

• In other words, if we force π_{(3)} < π_{(2)} by conditioning, we get the additional requirement π_{(3)} < π_{(1)} for free, so we can condition only on the π_{(i)} immediately before.

• Think of V₁V₂ as a single non-random (i.e., already known) value by the time we get to π_{(3)}. We can exactly follow the above sequence after making the correct substitutions and re-indexing.


Size-biased representation II

Definition: Let α > 0 and µ be a non-atomic measure on S with µ(S) < ∞. Generate the following random variables:

C_i ∼ Pois( αµ(S)/(α + i - 1) ),   i = 1, 2, ...
π_{ij} ∼ Beta(1, α + i - 1),   j = 1, ..., C_i
θ_{ij} ∼ µ/µ(S),   j = 1, ..., C_i   (4.5)

Define H = Σ_{i=1}^∞ Σ_{j=1}^{C_i} π_{ij} δ_{θ_{ij}}. Then H ∼ BP(α, µ).

Proof:

• We can use Poisson processes to prove this. This is a good example of how easy a proof can become when we recognize a hidden Poisson process and calculate its mean measure.

– Let H_i = Σ_{j=1}^{C_i} π_{ij} δ_{θ_{ij}}. Then the set Π_i = {(θ_{ij}, π_{ij})} is a Poisson process because it contains a Poisson-distributed number of i.i.d. random variables.

– As a result, the mean measure of Π_i is

[αµ(S)/(α + i - 1)]   (Poisson # part)  ×  (α + i - 1)(1 - π)^{α+i-2}dπ   (distribution on π)  ×  µ(dθ)/µ(S)   (distribution on θ)   (4.6)

We can simplify this to αµ(dθ)(1 - π)^{α+i-2}dπ. We can justify this with the marking theorem (π marks θ), or by just thinking about the joint distribution of (θ, π).

– H = Σ_{i=1}^∞ H_i by definition. Equivalently, Π = ∪_{i=1}^∞ Π_i.

– By the superposition theorem, we know that Π is a Poisson process with mean measure equal to the sum of the mean measures of each Π_i.

– We can calculate this directly:

Σ_{i=1}^∞ αµ(dθ)(1 - π)^{α+i-2}dπ = αµ(dθ)(1 - π)^{α-2} Σ_{i=1}^∞ (1 - π)^i dπ = αµ(dθ)(1 - π)^{α-2} [(1 - π)/π] dπ   (4.7)

– Therefore, we’ve shown that Π is a Poisson process with mean measure απ^{-1}(1 - π)^{α-1}dπ µ(dθ).

– In other words, this second size-biased construction is the CRM constructed from integrating a PRM with this mean measure against the function f(θ, π) = π along the π dimension. This is the definition of a beta process.


Size-biased representation III

• Definition: Let α > 0 and µ be a non-atomic measure on S with µ(S) < ∞. The following is a constructive definition of H ∼ BP(α, µ):

C_i ∼ Pois(µ(S)),   V_{ij}^{(ℓ)} ∼ Beta(1, α),   θ_{ij} ∼ µ/µ(S)   (4.8)

H = Σ_{i=1}^∞ Σ_{j=1}^{C_i} V_{ij}^{(i)} Π_{ℓ=1}^{i-1} (1 - V_{ij}^{(ℓ)}) δ_{θ_{ij}}   (4.9)

• Like the last construction, the weights are decreasing in expectation as a function of i:

H = Σ_{j=1}^{C₁} V_{1j}^{(1)} δ_{θ_{1j}} + Σ_{j=1}^{C₂} V_{2j}^{(2)}(1 - V_{2j}^{(1)}) δ_{θ_{2j}} + Σ_{j=1}^{C₃} V_{3j}^{(3)}(1 - V_{3j}^{(2)})(1 - V_{3j}^{(1)}) δ_{θ_{3j}} + ···   (4.10)

• The structure is also very similar. We have a Poisson-distributed number of atoms in each group, and they’re marked with an independent random variable.

• Therefore, we can write H = Σ_{i=1}^∞ H_i, where each H_i has a corresponding Poisson process Π_i with mean measure µ(S) × (µ(dθ)/µ(S)) × λ_i(π)dπ, where λ_i(π) is the distribution of

π = V_i Π_{j=1}^{i-1} (1 - V_j),   V_i ∼ Beta(1, α).

• By the superposition theorem, H has an underlying Poisson process Π with mean measure equal to the sum of each H_i’s mean measure: µ(dθ) Σ_{i=1}^∞ λ_i(π)dπ.

• Therefore, all that remains is to calculate this sum (which is a little complicated).

Proof:

• We focus on λ_i(π), which is the distribution of π = f(V₁, ..., V_i), where f(V₁, ..., V_i) = V_i Π_{j=1}^{i-1}(1 - V_j) and V_j ∼ Beta(1, α).

• Lemma: Let T ∼ Gam(i - 1, α). Then e^{-T} is equal in distribution to Π_{j=1}^{i-1}(1 - V_j).

• Proof: Define ξ_j = -ln(1 - V_j). We can show by a change of variables that ξ_j ∼ Exp(α). The function -ln Π_{j=1}^{i-1}(1 - V_j) = Σ_{j=1}^{i-1} ξ_j. Since the V_j are independent, the ξ_j are independent. We know that sums of i.i.d. exponential r.v.’s are gamma distributed, so T = Σ_{j=1}^{i-1} ξ_j is distributed as Gam(i - 1, α). That is, -ln Π_{j=1}^{i-1}(1 - V_j) =_d T ∼ Gam(i - 1, α), and the result follows because the same function of two equally distributed r.v.’s is also equally distributed.

• We split the proof into two cases, i = 1 and i > 1.

• Case i = 1: V_{1j} ∼ Beta(1, α), therefore π_{1j} = V_{1j} ∼ λ₁(π)dπ = α(1 - π)^{α-1}dπ.


• Case i > 1: V_{ij} ∼ Beta(1, α), T_{ij} ∼ Gam(i - 1, α), π_{ij} = V_{ij} e^{-T_{ij}}. We need to find the density of π_{ij}. Let W_{ij} = e^{-T_{ij}}. Then changing variables (plug T_{ij} = -ln W_{ij} into the gamma distribution and multiply by the Jacobian),

p_{W_i}(w|α) = (α^{i-1} / (i - 2)!) w^{α-1}(-ln w)^{i-2}.   (4.11)

• Therefore π_{ij} = V_{ij}W_{ij} and, using the product distribution formula,

λ_i(π|α) = ∫_π^1 w^{-1} p_V(π/w|α) p_{W_i}(w|α) dw
 = (α^i / (i - 2)!) ∫_π^1 w^{α-2}(-ln w)^{i-2}(1 - π/w)^{α-1} dw
 = (α^i / (i - 2)!) ∫_π^1 w^{-1}(-ln w)^{i-2}(w - π)^{α-1} dw   (4.12)

• This integral doesn’t have a closed-form solution. However, recall that we only need to calculate µ(dθ) Σ_{i=1}^∞ λ_i(π)dπ to find the mean measure of the underlying Poisson process.

µ(dθ) Σ_{i=1}^∞ λ_i(π)dπ = µ(dθ)λ₁(π)dπ + µ(dθ) Σ_{i=2}^∞ λ_i(π)dπ   (4.13)

µ(dθ) Σ_{i=2}^∞ λ_i(π)dπ = µ(dθ)dπ Σ_{i=2}^∞ (α^i / (i - 2)!) ∫_π^1 w^{-1}(-ln w)^{i-2}(w - π)^{α-1} dw
 = µ(dθ)dπ α² ∫_π^1 w^{-1}(w - π)^{α-1} Σ_{i=2}^∞ (-α ln w)^{i-2}/(i - 2)! dw,   where the sum equals e^{-α ln w} = w^{-α}
 = µ(dθ)dπ α² ∫_π^1 w^{-(α+1)}(w - π)^{α-1} dw,   and ∫_π^1 w^{-(α+1)}(w - π)^{α-1} dw = (w - π)^α / (απ w^α) evaluated from π to 1
 = µ(dθ) [α(1 - π)^α / π] dπ   (4.14)

• Adding µ(dθ)λ₁(π)dπ from Case i = 1 to this last value,

µ(dθ) Σ_{i=1}^∞ λ_i(π)dπ = µ(dθ) απ^{-1}(1 - π)^{α-1}dπ.   (4.15)

• Therefore, the construction corresponds to a Poisson process with mean measure equal to that of a beta process. It is therefore a beta process.


Chapter 5

Dirichlet processes and a size-biased construction

• We saw how beta processes can be useful as a Bayesian nonparametric prior for latent factor (matrix factorization) models.

• We’ll next discuss BNP priors for mixture models.

Quick review

• 2-dimensional data generated from Gaussians with unknown means and known variance.
• There is a small set of possible means, and an observation picks one of them using a probability distribution.
• Let G = Σ_{i=1}^K π_i δ_{θ_i} be the mixture distribution on mean parameters: θ_i is the ith mean and π_i the probability of picking it.
• For the nth observation,
1. c_n ∼ Disc(π) picks the mean index
2. x_n ∼ N(θ_{c_n}, Σ) generates the observation

Priors on G

• Let µ be a non-atomic probability measure on the parameter space.

• Since π is a K-dimensional probability vector, a natural prior is the Dirichlet distribution.

Dirichlet distribution: A distribution on probability vectors

• Definition: Let α₁, ..., α_K be K positive numbers. The Dirichlet distribution density function is defined as

Dir(π | α₁, ..., α_K) = (Γ(Σ_i α_i) / Π_i Γ(α_i)) Π_{i=1}^K π_i^{α_i - 1}   (5.1)


• Goals: The goals are very similar to the beta process.

1. We want K → ∞.
2. We want the parameters α₁, ..., α_K to be such that, as K → ∞, things are well defined.
3. It would be nice to link this to the Poisson process somehow.

Dirichlet random vectors and gamma random variables

s Theorem: Let Zi ∼ Gam(αi, b) for i = 1, . . . , K. Define πi = Zi/∑K

j=1 Zj Then

(π1, . . . , πK) ∼ Dir(α1, . . . , αK). (5.2)

Furthermore, π and Y =∑K

j=1 Zj are independent random variables.

s Proof: This is just a change of variables.

– p(Z1, . . . , ZK) =K∏i=1

p(Zi) =K∏i=1

bαi

Γ(αi)Zαi−1i e−bZi

– (Z1, . . . , ZK) := f(Y, π) = (Y π1, . . . , Y πK−1, Y (1−∑K−1

i=1 πi))

– PY,π(Y, π) = PZ(f(Y, π)) · |J(f)| ← J(·) = Jacobian

– J(f) =

∂f1

∂π1· · · ∂f1

∂πK−1

∂f1

∂Y

. . .∂fK∂π1

· · · ∂fK∂πK−1

∂fK∂Y

=

Y 0 · · · π1

0 Y 0 π2

0 0. . . ...

−Y −Y · · · 1−∑K−1

i=1 πi

– And so |J(f)| = Y K−1

– Therefore

PZ(f(Y, π))|J(f)| =K∏i=1

bαi

Γ(αi)(Y πi)

αi−1e−bY πiY K−1, (πK := 1−K−1∑i=1

πi)

=

[b∑i αi

Γ(∑

i αi)Y

∑i αi−1e−bY

]︸ ︷︷ ︸

Gam(∑

i αi, b)

[Γ(∑

i αi)∏i Γ(αi)

K∏i=1

παi−1i

]︸ ︷︷ ︸

Dir(α1, . . . , αK)

(5.3)

s We’ve shown that:

1. A Dirichlet distributed probability vector is a normalized sequence of independent gammarandom variables with a constant scale parameter.

2. The sum of these gamma random variables is independent of the normalization becausetheir joint distribution can be written as a product of two distributions.

3. This works in reverse: If we want to draw an independent sequence of gamma r.v.’s, wecan draw a Dirichlet vector and scale it by an independent gamma random variable withfirst parameter equal to the sum of the Dirichlet parameters (and second parameter set towhatever we want).
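As a quick sanity check of the theorem, the following sketch (illustrative code, not part of the notes; the parameter values are arbitrary) draws a probability vector both as normalized gammas and directly from the Dirichlet and compares their means.

```python
# Hedged sketch: normalized gammas with a shared rate give a Dirichlet vector.
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 0.5, 1.5])   # Dirichlet parameters (hypothetical values)
b = 3.0                             # shared rate; any positive value works

# Z_i ~ Gam(alpha_i, b); pi = Z / sum(Z)
Z = rng.gamma(shape=alpha, scale=1.0 / b, size=(100000, 3))
pi_gamma = Z / Z.sum(axis=1, keepdims=True)
pi_dir = rng.dirichlet(alpha, size=100000)

print(pi_gamma.mean(axis=0))        # both close to alpha / alpha.sum()
print(pi_dir.mean(axis=0))
# The normalizer Y = sum(Z) ~ Gam(sum(alpha), b), independent of pi.
print(Z.sum(axis=1).mean(), alpha.sum() / b)
```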


Dirichlet process

• Definition: Let $\alpha > 0$ and $\mu$ a non-atomic probability measure on $S$. For all partitions $A_1,\ldots,A_K$ of $S$, where $A_i\cap A_j = \emptyset$ for $i\neq j$ and $\cup_{i=1}^K A_i = S$, define the random measure $G$ on $S$ such that
$$(G(A_1),\ldots,G(A_K)) \sim \mathrm{Dir}(\alpha\mu(A_1),\ldots,\alpha\mu(A_K)). \tag{5.4}$$
Then $G$ is a Dirichlet process, denoted $G \sim \mathrm{DP}(\alpha\mu)$.

Dirichlet processes via the gamma process

• Pick a partition $A_1,\ldots,A_K$ of $S$. We can represent $G \sim \mathrm{DP}(\alpha\mu)$ as the normalization of gamma-distributed random variables,
$$(G(A_1),\ldots,G(A_K)) = \left(\frac{G'(A_1)}{G'(S)},\ldots,\frac{G'(A_K)}{G'(S)}\right), \tag{5.5}$$
$$G'(A_i) \sim \mathrm{Gam}(\alpha\mu(A_i), b), \qquad G'(S) = G'(\cup_{i=1}^K A_i) = \sum_{i=1}^K G'(A_i) \tag{5.6}$$

s Looking at the definition and how G′ is defined, we realize that G′ ∼ GaP(αµ, b). Therefore, aDirichlet process is simply a normalized gamma process.

s Note that G′(S) ∼ Gam(α, b). So G′(S) <∞ with probability one and so the normalizationG is well-defined.

Gamma processes and the Poisson process

s Recall that the gamma process is constructed from a Poisson process.

• Gamma process: Let $N$ be a Poisson random measure on $S\times\mathbb{R}_+$ with mean measure $\alpha\mu(d\theta)z^{-1}e^{-bz}dz$. Define $G'(A_i) = \int_{A_i}\int_0^\infty N(d\theta,dz)\,z$. Then $G' \sim \mathrm{GaP}(\alpha\mu, b)$.

• Since the DP is a rescaled GaP, it shares the same properties (from Campbell's theorem).
– For example, it's an infinite jump process.
– However, the DP is not a CRM like the GaP, since $G(A_i)$ and $G(A_j)$ are not independent for disjoint sets $A_i$ and $A_j$. This should be clear since $G$ has to integrate to 1.


Dirichlet process as limit of finite approximation

s This is very similar to the previous discussion on limits of finite approximations to the gammaand beta process.

• Definition: Let $\alpha > 0$ and $\mu$ a non-atomic probability measure on $S$. Let
$$G_K = \sum_{i=1}^K \pi_i\delta_{\theta_i}, \qquad \pi \sim \mathrm{Dir}(\alpha/K,\ldots,\alpha/K), \qquad \theta_i \overset{iid}{\sim} \mu \tag{5.7}$$
Then $\lim_{K\to\infty} G_K = G \sim \mathrm{DP}(\alpha\mu)$.

• Rough proof: We can equivalently write
$$G_K = \sum_{i=1}^K\left(\frac{Z_i}{\sum_{j=1}^K Z_j}\right)\delta_{\theta_i}, \qquad Z_i \sim \mathrm{Gam}(\alpha/K, b), \qquad \theta_i \sim \mu \tag{5.8}$$
• If $G'_K = \sum_{i=1}^K Z_i\delta_{\theta_i}$, we've already proven that $G'_K \to G' \sim \mathrm{GaP}(\alpha\mu, b)$. $G_K$ is thus the limit of the normalization of $G'_K$. Since $\lim_{K\to\infty}G'_K(S)$ is finite almost surely, we can take the limit of the numerator and denominator of the gamma representation of $G_K$ separately. The numerator converges to a gamma process and the denominator to its normalization. Therefore, $G_K$ converges to a Dirichlet process.

Some comments

• This infinite limit of the finite approximation results in an infinite vector, but the original definition was of a $K$-dimensional vector, so is a Dirichlet process infinite or finite dimensional? Actually, the finite vector of the definition is constructed from an infinite process:
$$G(A_j) = \lim_{K\to\infty} G_K(A_j) = \lim_{K\to\infty}\sum_{i=1}^K \pi_i\delta_{\theta_i}(A_j). \tag{5.9}$$

• Since the partition $A_1,\ldots,A_K$ is of a continuous space $S$, we have to be able to let $K\to\infty$, so there has to be an infinite-dimensional process underneath $G$.

• The Dirichlet process gives us a way of defining priors on infinite discrete probability distributions on this continuous space $S$.

• As an intuitive example, if $S$ is a space corresponding to the mean of a Gaussian, the Dirichlet process gives us a way to assign a probability to every possible value of this mean.

• Of course, by thinking of the DP in terms of the gamma and Poisson processes, we know that an infinite number of means will have probability zero, an infinite number will also have non-zero probability, but only a small handful of points in the space will have substantial probability. The number and locations of these atoms are random and learned during inference.

s Therefore, as with the beta process, size-biased representations of G ∼ DP(αµ) are needed.


A “stick-breaking” construction of G ∼ DP(αµ)

• Definition: Let $\alpha > 0$ and $\mu$ be a non-atomic probability measure on $S$. Let
$$V_i \sim \mathrm{Beta}(1,\alpha), \qquad \theta_i \sim \mu \tag{5.10}$$
independently for $i = 1, 2, \ldots$ Define
$$G = \sum_{i=1}^\infty V_i\prod_{j=1}^{i-1}(1-V_j)\,\delta_{\theta_i}. \tag{5.11}$$
Then $G \sim \mathrm{DP}(\alpha\mu)$.

• Intuitive picture: We start with a unit-length stick and break off proportions.
$$G = V_1\delta_{\theta_1} + V_2(1-V_1)\delta_{\theta_2} + V_3(1-V_2)(1-V_1)\delta_{\theta_3} + \cdots$$
$(1-V_2)(1-V_1)$ is what's left after the first two breaks. We take proportion $V_3$ of that for $\theta_3$ and leave $(1-V_3)(1-V_2)(1-V_1)$.
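A small hedged sketch (illustrative only, not from the notes): a truncated stick-breaking draw of $G \sim \mathrm{DP}(\alpha\mu)$, using a standard normal as a stand-in base measure $\mu$.

```python
# Truncated stick-breaking draw of a Dirichlet process.
import numpy as np

def stick_breaking_dp(alpha, base_sampler, truncation, rng):
    """Return (atoms, weights) of a truncated stick-breaking draw."""
    V = rng.beta(1.0, alpha, size=truncation)                      # V_i ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))  # prod_{j<i} (1 - V_j)
    weights = V * remaining
    atoms = base_sampler(truncation, rng)                          # theta_i ~ mu, iid
    return atoms, weights

rng = np.random.default_rng(1)
atoms, weights = stick_breaking_dp(
    alpha=5.0,
    base_sampler=lambda n, r: r.normal(0.0, 1.0, size=n),
    truncation=1000,
    rng=rng,
)
print(weights.sum())    # close to 1; the truncated tail accounts for the gap
print(weights[:5])      # a few large weights, the rest decay geometrically
```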

Getting back to finite Dirichlets

• Recall from the definition that $(G(A_1),\ldots,G(A_K)) \sim \mathrm{Dir}(\alpha\mu(A_1),\ldots,\alpha\mu(A_K))$ for all partitions $A_1,\ldots,A_K$ of $S$.

• Using the stick-breaking construction, we need to show that the vector formed by
$$G(A_k) = \sum_{i=1}^\infty V_i\prod_{j=1}^{i-1}(1-V_j)\,\delta_{\theta_i}(A_k) \tag{5.12}$$
for $k = 1,\ldots,K$ is distributed as $\mathrm{Dir}(\alpha\mu_1,\ldots,\alpha\mu_K)$, where $\mu_k = \mu(A_k)$.

• Since $P(\theta_i \in A_k) = \mu(A_k)$, $\delta_{\theta_i}(A_k)$ can be equivalently represented by a $K$-dimensional vector $e_{Y_i} = (0,\ldots,1,\ldots,0)$, with the 1 in position $Y_i$, the rest 0, and $Y_i \sim \mathrm{Disc}(\mu_1,\ldots,\mu_K)$.

• Letting $\pi_k = G(A_k)$, we therefore need to show that if
$$\pi = \sum_{i=1}^\infty V_i\prod_{j=1}^{i-1}(1-V_j)\,e_{Y_i}, \qquad V_i \overset{iid}{\sim} \mathrm{Beta}(1,\alpha), \qquad Y_i \overset{iid}{\sim} \mathrm{Disc}(\mu_1,\ldots,\mu_K) \tag{5.13}$$
then $\pi \sim \mathrm{Dir}(\alpha\mu_1,\ldots,\alpha\mu_K)$.


• Lemma: Let $\pi \sim \mathrm{Dir}(a_1+b_1,\ldots,a_K+b_K)$. We can equivalently represent this as
$$\pi = VY + (1-V)W, \qquad V \sim \mathrm{Beta}\Big(\sum_k a_k, \sum_k b_k\Big), \qquad Y \sim \mathrm{Dir}(a_1,\ldots,a_K), \qquad W \sim \mathrm{Dir}(b_1,\ldots,b_K) \tag{5.14}$$

Proof: Use the normalized gamma representation: $\pi_i = Z_i/\sum_j Z_j$, $Z_i \sim \mathrm{Gam}(a_i+b_i, c)$.

– We can use the equivalence
$$Z_i^Y \sim \mathrm{Gam}(a_i, c),\ \ Z_i^W \sim \mathrm{Gam}(b_i, c) \iff Z_i^Y + Z_i^W \sim \mathrm{Gam}(a_i+b_i, c) \tag{5.15}$$

– Splitting into two random variables this way, we have the following normalized gamma representation for $\pi$:
$$\pi = \left(\frac{Z_1^Y + Z_1^W}{\sum_i Z_i^Y + Z_i^W},\ \ldots,\ \frac{Z_K^Y + Z_K^W}{\sum_i Z_i^Y + Z_i^W}\right) \tag{5.16}$$
$$= \underbrace{\left(\frac{\sum_i Z_i^Y}{\sum_i Z_i^Y + Z_i^W}\right)}_{V\,\sim\,\mathrm{Beta}(\sum_i a_i,\,\sum_i b_i)}\underbrace{\left(\frac{Z_1^Y}{\sum_i Z_i^Y},\ \ldots,\ \frac{Z_K^Y}{\sum_i Z_i^Y}\right)}_{Y\,\sim\,\mathrm{Dir}(a_1,\ldots,a_K)} + \underbrace{\left(\frac{\sum_i Z_i^W}{\sum_i Z_i^Y + Z_i^W}\right)}_{1-V}\underbrace{\left(\frac{Z_1^W}{\sum_i Z_i^W},\ \ldots,\ \frac{Z_K^W}{\sum_i Z_i^W}\right)}_{W\,\sim\,\mathrm{Dir}(b_1,\ldots,b_K)}$$

– From the previous proof about normalized gamma r.v.'s, we know that the sums are independent of the normalized values. So $V$, $Y$, and $W$ are all independent.

s Proof of stick-breaking construction:

– Start with π ∼ Dir(αµ1, . . . , αµK).

– Step 1:
$$\frac{\Gamma(\sum_i\alpha_i)}{\prod_i\Gamma(\alpha_i)}\prod_{i=1}^K\pi_i^{\alpha_i-1} = \left(\sum_{j=1}^K\pi_j\right)\frac{\Gamma(\sum_i\alpha_i)}{\prod_i\Gamma(\alpha_i)}\prod_{i=1}^K\pi_i^{\alpha_i-1} \tag{5.17}$$
$$= \sum_{j=1}^K\frac{\alpha\mu_j}{\alpha\mu_j}\,\frac{\Gamma(\sum_i\alpha_i)}{\prod_i\Gamma(\alpha_i)}\prod_{i=1}^K\pi_i^{\alpha_i+e_j(i)-1}$$
$$= \sum_{j=1}^K\mu_j\underbrace{\frac{\Gamma(1+\sum_i\alpha_i)}{\Gamma(1+\alpha\mu_j)\prod_{i\neq j}\Gamma(\alpha_i)}\prod_{i=1}^K\pi_i^{\alpha_i+e_j(i)-1}}_{=\,\mathrm{Dir}(\alpha\mu+e_j)}$$

– Therefore, a hierarchical representation of Dir(αµ1, . . . , αµK) is

Y ∼ Discrete(µ1, . . . , µK), π ∼ Dir(αµ+ eY ). (5.18)


– Step 2: From the lemma we have that $\pi \sim \mathrm{Dir}(\alpha\mu + e_Y)$ can be expanded into the equivalent hierarchical representation $\pi = VY' + (1-V)\pi'$, where
$$V \sim \mathrm{Beta}\Big(\underbrace{\textstyle\sum_i e_Y(i)}_{=\,1},\ \underbrace{\textstyle\sum_i\alpha\mu_i}_{=\,\alpha}\Big), \qquad Y' \sim \underbrace{\mathrm{Dir}(e_Y)}_{=\,e_Y \text{ with probability } 1}, \qquad \pi' \sim \mathrm{Dir}(\alpha\mu_1,\ldots,\alpha\mu_K) \tag{5.19}$$

– Combining Steps 1 & 2: We will use these steps to recursively break down a Dirichlet-distributed random vector an infinite number of times. If
$$\pi = Ve_Y + (1-V)\pi', \tag{5.20}$$
$$V \sim \mathrm{Beta}(1,\alpha), \qquad Y \sim \mathrm{Disc}(\mu_1,\ldots,\mu_K), \qquad \pi' \sim \mathrm{Dir}(\alpha\mu_1,\ldots,\alpha\mu_K),$$
then from Steps 1 & 2, $\pi \sim \mathrm{Dir}(\alpha\mu_1,\ldots,\alpha\mu_K)$.

– Notice that there are independent $\mathrm{Dir}(\alpha\mu_1,\ldots,\alpha\mu_K)$ r.v.'s on both sides. We "broke down" the one on the left; we can continue by "breaking down" the one on the right:
$$\pi = V_1 e_{Y_1} + (1-V_1)\big(V_2 e_{Y_2} + (1-V_2)\pi''\big) \tag{5.21}$$
$$V_i \overset{iid}{\sim} \mathrm{Beta}(1,\alpha), \qquad Y_i \overset{iid}{\sim} \mathrm{Disc}(\mu_1,\ldots,\mu_K), \qquad \pi'' \sim \mathrm{Dir}(\alpha\mu_1,\ldots,\alpha\mu_K),$$
and $\pi$ is still distributed as $\mathrm{Dir}(\alpha\mu_1,\ldots,\alpha\mu_K)$.

– Continue this an infinite number of times:
$$\pi = \sum_{i=1}^\infty V_i\prod_{j=1}^{i-1}(1-V_j)\,e_{Y_i}, \qquad V_i \overset{iid}{\sim} \mathrm{Beta}(1,\alpha), \qquad Y_i \overset{iid}{\sim} \mathrm{Disc}(\mu_1,\ldots,\mu_K). \tag{5.22}$$
Each time the right-hand Dirichlet is expanded we still get a $\mathrm{Dir}(\alpha\mu_1,\ldots,\alpha\mu_K)$ random variable. Since $\lim_{T\to\infty}\prod_{j=1}^T(1-V_j) = 0$, the term pre-multiplying this RHS Dirichlet vector goes to zero and the limit above results, which completes the proof.

• Corollary: If $G$ is drawn from $\mathrm{DP}(\alpha\mu)$ using the stick-breaking construction and $\beta \sim \mathrm{Gam}(\alpha, b)$ independently, then $\beta G \sim \mathrm{GaP}(\alpha\mu, b)$. Writing this out,
$$\beta G = \sum_{i=1}^\infty \beta\Big(V_i\prod_{j=1}^{i-1}(1-V_j)\Big)\delta_{\theta_i} \tag{5.23}$$

• We therefore get a method for drawing a gamma process almost for free. Notice that $\alpha$ appears in both the DP and the gamma distribution on $\beta$. These parameters must be the same value for $\beta G$ to be a gamma process.
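A hedged, self-contained sketch of this corollary (illustrative only; the base measure and parameter values are placeholders): scale a truncated stick-breaking DP draw by an independent $\mathrm{Gam}(\alpha, b)$ total mass.

```python
# Truncated gamma process draw as beta * G, beta ~ Gam(alpha, b), G ~ DP(alpha*mu).
import numpy as np

rng = np.random.default_rng(2)
alpha, b, T = 5.0, 2.0, 1000

V = rng.beta(1.0, alpha, size=T)
weights = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))   # DP stick weights
atoms = rng.normal(0.0, 1.0, size=T)                               # theta_i ~ mu (stand-in base)
beta = rng.gamma(shape=alpha, scale=1.0 / b)                       # Gam(alpha, b), independent of G

jumps = beta * weights               # atom sizes of the (truncated) gamma process
print(jumps.sum(), alpha / b)        # total mass is approx Gam(alpha, b); its mean is alpha/b
```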


Chapter 6

Dirichlet process extensions, count processes

Gamma process to Dirichlet process

• Gamma process: Let $\alpha > 0$ and $\mu$ a non-atomic probability measure on $S$. Let $N(d\theta, dw)$ be a Poisson random measure on $S\times\mathbb{R}_+$ with mean measure $\alpha\mu(d\theta)w^{-1}e^{-cw}dw$, $c > 0$. For $A \subset S$, let $G'(A) = \int_A\int_0^\infty N(d\theta,dw)\,w$. Then $G'$ is a gamma process, $G' \sim \mathrm{GaP}(\alpha\mu, c)$, and $G'(A) \sim \mathrm{Gam}(\alpha\mu(A), c)$.

• Normalizing a gamma process: Let's take $G'$ and normalize it. That is, define $G(d\theta) = G'(d\theta)/G'(S)$. ($G'(S) \sim \mathrm{Gam}(\alpha, c)$, so it's finite w.p. 1.) Then $G$ is called a Dirichlet process, written $G \sim \mathrm{DP}(\alpha\mu)$.

Why? Take $S$ and partition it into $K$ disjoint regions $(A_1,\ldots,A_K)$, $A_i\cap A_j = \emptyset$ for $i\neq j$, $\cup_i A_i = S$. Construct the vector
$$(G(A_1),\ldots,G(A_K)) = \left(\frac{G'(A_1)}{G'(S)},\ldots,\frac{G'(A_K)}{G'(S)}\right). \tag{6.1}$$
Since each $G'(A_i) \sim \mathrm{Gam}(\alpha\mu(A_i), c)$ and $G'(S) = \sum_{i=1}^K G'(A_i)$, it follows that
$$(G(A_1),\ldots,G(A_K)) \sim \mathrm{Dir}(\alpha\mu(A_1),\ldots,\alpha\mu(A_K)). \tag{6.2}$$
This is the definition of a Dirichlet process.

s The Dirichlet process has many extensions to suit the structure of different problems.

s We’ll look at four, two that are related to the underlying normalized gamma process, and twofrom the perspective of the stick-breaking construction.

s The purpose is to illustrate how the basic framework of Dirichlet process mixture modeling canbe easily built into more complicated models that address problems not perfectly suited to thebasic construction.

s Goal is to make it clear how to continue these lines of thinking to form new models.


Example 1: Spatially and temporally normalized gamma processes

s Imagine we wanted a temporally evolving Dirichlet process. Clusters (i.e., atoms, θ) may ariseand die out at different times (or exist in geographical regions)

Time-evolving model: Let $N(d\theta, dw, dt)$ be a Poisson random measure on $S\times\mathbb{R}_+\times\mathbb{R}$ with mean measure $\alpha\mu(d\theta)w^{-1}e^{-cw}dw\,dt$. Let $G'(d\theta, dt) = \int_0^\infty N(d\theta, dw, dt)\,w$. Then $G'$ is a gamma process with an added "time" dimension $t$.

s (There’s nothing new from what we’ve studied: Let θ∗ = (θ, t) and αµ(dθ∗) = αµ(dθ)dt.)

• For each atom $(\theta, t)$ with $G'(d\theta, dt) > 0$, add a marking $y_t(\theta) \overset{ind}{\sim} \mathrm{Exp}(\lambda)$.

s We can think of yt(θ) as the lifetime of parameter θ born at time t.

• By the marking theorem, $N^*(d\theta, dw, dt, dy) \sim \mathrm{Pois}\big(\alpha\mu(d\theta)\,w^{-1}e^{-cw}dw\,dt\,\lambda e^{-\lambda y}dy\big)$.

• At time $t'$, construct the Dirichlet process $G_{t'}$ by normalizing over all atoms "alive" at time $t'$. (Therefore, ignore atoms already dead or yet to be born.)

Spatial model: Instead of giving each atom $\theta$ a time-stamp and "lifetime," we might want to give it a location and "region of influence."

s Replace t ∈ R with x ∈ R2 (e.g., latitude-longitude). Replace dt with dx.

s Instead of yt(θ) ∼ Exp(λ) = lifetime, yx(θ) ∼ Exp(λ) = radius of ball at x.

• $G_{x'}$ is the DP at location $x'$. It is formed by normalizing over all atoms $\theta$ for which $x'$ lies in the ball of radius $y_x(\theta)$ centered at $x$.


Example 2: Another time-evolving formulation

s We can think of other formulations. Here’s one where time is discrete. (We will build up to thiswith the following two properties).

s Even though the DP doesn’t have an underlying PRM, the fact that it’s constructed from a PRMmeans we can still benefit from its properties.

Superposition and the Dirichlet process: Let $G'_1 \sim \mathrm{GaP}(\alpha_1\mu_1, c)$ and $G'_2 \sim \mathrm{GaP}(\alpha_2\mu_2, c)$. Then $G'_{1+2} = G'_1 + G'_2 \sim \mathrm{GaP}(\alpha_1\mu_1 + \alpha_2\mu_2, c)$. Therefore,
$$G_{1+2} = \frac{G'_{1+2}}{G'_{1+2}(S)} \sim \mathrm{DP}(\alpha_1\mu_1 + \alpha_2\mu_2). \tag{6.3}$$

We can equivalently write
$$G_{1+2} = \underbrace{\frac{G'_1(S)}{G'_{1+2}(S)}}_{\mathrm{Beta}(\alpha_1,\alpha_2)}\times\underbrace{\frac{G'_1}{G'_1(S)}}_{\mathrm{DP}(\alpha_1\mu_1)} + \frac{G'_2(S)}{G'_{1+2}(S)}\times\underbrace{\frac{G'_2}{G'_2(S)}}_{\mathrm{DP}(\alpha_2\mu_2)} \ \sim\ \mathrm{DP}(\alpha_1\mu_1 + \alpha_2\mu_2) \tag{6.4}$$

From the lemma last week, these two DP’s and the beta r.v. are all independent.

s Therefore,

G = πG1 + (1− π)G2, π ∼ Beta(α1, α2), G1 ∼ DP(α1µ1), G2 ∼ DP(α2µ2) (6.5)

is equal in distribution to G ∼ DP(α1µ1 + α2µ2).

Thinning of gamma processes (a special case of the marking theorem)

• We know that we can construct $G' \sim \mathrm{GaP}(\alpha\mu, c)$ from the Poisson random measure $N(d\theta, dw) \sim \mathrm{Pois}(\alpha\mu(d\theta)w^{-1}e^{-cw}dw)$. Mark each point $(\theta, w)$ in $N$ with a binary variable $z \sim \mathrm{Bern}(p)$. Then
$$N(d\theta, dw, z) \sim \mathrm{Pois}\big(p^z(1-p)^{1-z}\,\alpha\mu(d\theta)w^{-1}e^{-cw}dw\big). \tag{6.6}$$

• If we view $z=1$ as "survival" and $z=0$ as "death," then if we only care about the atoms that survive, we have
$$N_1(d\theta, dw) = N(d\theta, dw, z=1) \sim \mathrm{Pois}(p\,\alpha\mu(d\theta)w^{-1}e^{-cw}dw). \tag{6.7}$$

s This is called “thinning.” We see that p ∈ (0, 1) down-weights the mean measure, so we onlyexpect to see a fraction p of what we saw before.

• Still, a normalized thinned gamma process is a Dirichlet process:
$$G' \sim \mathrm{GaP}(p\alpha\mu, c), \qquad G = \frac{G'}{G'(S)} \sim \mathrm{DP}(p\alpha\mu). \tag{6.8}$$


s What happens if we thin twice? We’re marking with z ∈ {0, 1}2 and restricting to z = [1, 1],

G′ ∼ GaP(p2αµ, c) → G ∼ DP(p2αµ). (6.9)

s Back to the example, we again want a time-evolving Dirichlet process where new atoms areborn and old atoms die out.

s We can easily achieve this by introducing new gamma processes and thinning old ones.

A dynamic Dirichlet process (a small simulation sketch follows equation (6.12) below): At time $t$,
  1. Draw $G^*_t \sim \mathrm{GaP}(\alpha_t\mu_t, c)$.
  2. Construct $G'_t = G^*_t + G'_{t-1}$, where $G'_{t-1}$ is the gamma process at time $t-1$ thinned with parameter $p$.
  3. Normalize $G_t = G'_t/G'_t(S)$.

• Why is $G_t$ still a Dirichlet process? Just look at $G'_t$:
  – Let $G'_{t-1} \sim \mathrm{GaP}(\alpha_{t-1}\mu_{t-1}, c)$.
  – Then the thinned $G'_{t-1} \sim \mathrm{GaP}(p\alpha_{t-1}\mu_{t-1}, c)$ and $G'_t \sim \mathrm{GaP}(\alpha_t\mu_t + p\alpha_{t-1}\mu_{t-1}, c)$.
  – So $G_t \sim \mathrm{DP}(\alpha_t\mu_t + p\alpha_{t-1}\mu_{t-1})$.
• By induction,
$$G_t \sim \mathrm{DP}\big(\alpha_t\mu_t + p\alpha_{t-1}\mu_{t-1} + p^2\alpha_{t-2}\mu_{t-2} + \cdots + p^{t-1}\alpha_1\mu_1\big). \tag{6.10}$$

• If we consider the special case where $\alpha_t\mu_t = \alpha\mu$ for all $t$, we can simplify this Dirichlet process to
$$G_t \sim \mathrm{DP}\Big(\frac{1-p^t}{1-p}\,\alpha\mu\Big). \tag{6.11}$$
In the limit $t\to\infty$, this has the steady state
$$G_\infty \sim \mathrm{DP}\Big(\frac{1}{1-p}\,\alpha\mu\Big). \tag{6.12}$$
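A hedged simulation sketch of the dynamic DP above (illustrative only): each gamma process is replaced by its finite $\mathrm{Gam}(\alpha/K, c)$ approximation, and thinning keeps each old atom with probability $p$.

```python
# Dynamic DP via finite gamma-process approximations and Bernoulli(p) thinning.
import numpy as np

rng = np.random.default_rng(3)
alpha, c, p, K, T = 5.0, 1.0, 0.7, 200, 50

weights, atoms = np.array([]), np.array([])
for t in range(T):
    # 1. new gamma process (finite approximation), atoms from a N(0,1) stand-in base
    new_w = rng.gamma(shape=alpha / K, scale=1.0 / c, size=K)
    new_a = rng.normal(size=K)
    # 2. thin the previous process: each old atom survives with probability p
    keep = rng.random(weights.size) < p
    weights = np.concatenate([new_w, weights[keep]])
    atoms = np.concatenate([new_a, atoms[keep]])
    # 3. normalize to get the Dirichlet process at time t
    G_t = weights / weights.sum()

print(G_t.size, G_t.sum())   # number of surviving atoms; total probability is 1
```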

• Stick-breaking construction (review for the next process): We saw that if $\alpha > 0$ and $\mu$ is any probability measure (atomic, non-atomic, or mixed), then we can draw $G \sim \mathrm{DP}(\alpha\mu)$ as follows:
$$V_i \overset{iid}{\sim} \mathrm{Beta}(1,\alpha), \qquad \theta_i \overset{iid}{\sim} \mu, \qquad G = \sum_{i=1}^\infty V_i\prod_{j=1}^{i-1}(1-V_j)\,\delta_{\theta_i} \tag{6.13}$$


s It’s often the case where we have grouped data. For example, groups of documents where eachdocument is a set of words.

s We might want to model each group (indexed by d) as a mixture Gd. Then, for observation nin group d, θ(d)

n ∼ Gd, x(d)n ∼ p(x|θ(d)

n ).

s We might think that each group shares the same set of highly probable atoms, but has differentdistributions on them.

s The result is called a mixed-membership model.

Mixed-membership models and the hierarchical Dirichlet process (HDP)

• As the stick-breaking construction makes clear, when $\mu$ is non-atomic, simply drawing each $G_d \overset{iid}{\sim} \mathrm{DP}(\alpha\mu)$ won't work because each $G_d$ places all of its probability mass on its own disjoint set of atoms.

s The HDP fixes this by “discretizing the base distribution.”

$$G_d\,|\,G_0 \overset{iid}{\sim} \mathrm{DP}(\beta G_0), \qquad G_0 \sim \mathrm{DP}(\alpha\mu). \tag{6.14}$$

• Since $G_0$ is discrete, each $G_d$ has probability on the same subset of atoms. This is very obvious by writing the process with the stick-breaking construction:
$$G_d = \sum_{i=1}^\infty \pi^{(d)}_i\delta_{\theta_i}, \qquad (\pi^{(d)}_1, \pi^{(d)}_2, \ldots) \sim \mathrm{Dir}(\beta p_1, \beta p_2, \ldots) \tag{6.15}$$
$$p_i = V_i\prod_{j=1}^{i-1}(1-V_j), \qquad V_i \overset{iid}{\sim} \mathrm{Beta}(1,\alpha), \qquad \theta_i \overset{iid}{\sim} \mu.$$

• Nested Dirichlet processes: The stick-breaking construction is totally general: $\mu$ can be any distribution. What if $\mu \to \mathrm{DP}(\alpha\mu)$? That is, we define the base distribution to be a Dirichlet process:
$$G = \sum_{i=1}^\infty V_i\prod_{j=1}^{i-1}(1-V_j)\,\delta_{G_i}, \qquad V_i \overset{iid}{\sim} \mathrm{Beta}(1,\alpha), \qquad G_i \overset{iid}{\sim} \mathrm{DP}(\alpha\mu). \tag{6.16}$$
(We write $G_i$ to link to the DP, but we could have written $\theta_i \overset{iid}{\sim} \mathrm{DP}(\alpha\mu)$ since that's what we've been using.)

• We now have a mixture model of mixture models. For example:

1. A group selects $G^{(d)} \sim G$ (picks mixture $G_i$ according to probability $V_i\prod_{j<i}(1-V_j)$).


2. Generates all of its data using this mixture. For the $n$th observation in group $d$, $\theta^{(d)}_n \sim G^{(d)}$, $X^{(d)}_n \sim p(X|\theta^{(d)}_n)$.

s In this case we have all-or-nothing sharing. Two groups either share the atoms and the distribu-tion on them, or they share nothing.

Nested Dirichlet process trees

s We can nest this further. Why not let µ in the nDP be a Dirichlet process also? Then we wouldhave a three level tree.

s We can then pick paths down this tree to a leaf node where we get an atom.

Count Processes

• We briefly introduce count processes. With the Dirichlet process, we often have the generative structure
$$G \sim \mathrm{DP}(\alpha\mu), \qquad \theta^*_j\,|\,G \overset{iid}{\sim} G, \qquad G = \sum_{i=1}^\infty \pi_i\delta_{\theta_i}, \qquad j = 1,\ldots,N \tag{6.17}$$

• What can we say about the count process $n(\theta) = \sum_{j=1}^N \mathbf{1}(\theta^*_j = \theta)$?

• Recall the following equivalent processes:
$$G' \sim \mathrm{GaP}(\alpha\mu, c), \qquad n(\theta)\,|\,G' \sim \mathrm{Pois}(G'(\theta)) \tag{6.18, 6.19}$$
and
$$G' \sim \mathrm{GaP}(\alpha\mu, c), \qquad n(S) \sim \mathrm{Pois}(G'(S)), \qquad \theta^*_{1:n(S)} \sim G'/G'(S) \tag{6.20, 6.21, 6.22}$$

s We can therefore analyze this using the underlying marked Poisson process. However, noticethat we have to let the data size be random and Poisson distributed.

Marking theorem: Let $G' \sim \mathrm{GaP}(\alpha\mu, c)$ and mark each $(\theta, w)$ for which $G'(\theta) = w > 0$ with the random variable $n\,|\,w \sim \mathrm{Pois}(w)$. Then $(\theta, w, n)$ is a marked Poisson process with mean measure $\alpha\mu(d\theta)\,w^{-1}e^{-cw}dw\,\frac{w^n}{n!}e^{-w}$.


s We can restrict this to n by integrating over θ and w.

Theorem: The number of atoms having $k$ counts is
$$\#_k \sim \mathrm{Pois}\left(\int_S\int_0^\infty \alpha\mu(d\theta)\,w^{-1}e^{-cw}\,\frac{w^k}{k!}e^{-w}\,dw\right) = \mathrm{Pois}\left(\frac{\alpha}{k}\Big(\frac{1}{1+c}\Big)^k\right) \tag{6.23}$$

Theorem: The total number of uniquely observed atoms is also Poisson:
$$\#_{\mathrm{unique}} = \sum_{k=1}^\infty \#_k \sim \mathrm{Pois}\left(\sum_{k=1}^\infty\frac{\alpha}{k}\Big(\frac{1}{1+c}\Big)^k\right) = \mathrm{Pois}\left(\alpha\ln\Big(1+\frac{1}{c}\Big)\right) \tag{6.24}$$

Theorem: The total number of counts is $n(S)\,|\,G' \sim \mathrm{Pois}(G'(S))$, $G' \sim \mathrm{GaP}(\alpha\mu, c)$. So $\mathbb{E}[n(S)] = \frac{\alpha}{c}$ $\big(= \sum_{k=1}^\infty k\,\mathbb{E}\#_k\big)$.

Final statement: Let $c = \frac{\alpha}{N}$. If we expect a dataset of size $N$ to be drawn from $G \sim \mathrm{DP}(\alpha\mu)$, we expect that dataset to use $\alpha\ln(\alpha+N) - \alpha\ln\alpha$ unique atoms from $G$.

A quick final count process

• Instead of gamma process $\longrightarrow$ Poisson counts, we could have beta process $\longrightarrow$ negative binomial counts.

• Let $H = \sum_{i=1}^\infty \pi_i\delta_{\theta_i} \sim \mathrm{BP}(\alpha, \mu)$.

• Let $n(\theta) \sim \mathrm{negBin}(r, H(\theta))$, where the negative binomial random variable counts how many "successes" there are, with $P(\text{success}) = H(\theta)$, until there are $r$ "failures" with $P(\text{failure}) = 1 - H(\theta)$.

s This is another count process that can be analyzed using the underlying Poisson process.


Chapter 7

Exchangeability, Dirichlet processes and the Chinese restaurant process

DP’s, finite approximations and mixture models

• DP: We saw how, if $\alpha > 0$ and $\mu$ is a probability measure on $S$, for every finite partition $(A_1,\ldots,A_K)$ of $S$, the random measure satisfying
$$(G(A_1),\ldots,G(A_K)) \sim \mathrm{Dir}(\alpha\mu(A_1),\ldots,\alpha\mu(A_K))$$
defines a Dirichlet process.

• Finite approximation: We also saw how we can approximate $G \sim \mathrm{DP}(\alpha\mu)$ with a finite Dirichlet distribution,
$$G_K = \sum_{i=1}^K \pi_i\delta_{\theta_i}, \qquad \pi \sim \mathrm{Dir}\Big(\frac{\alpha}{K},\ldots,\frac{\alpha}{K}\Big), \qquad \theta_i \overset{iid}{\sim} \mu.$$

• Mixture models: Finally, the most common setting for these priors is in mixture models, where we have the added layers
$$\theta^*_j\,|\,G \sim G, \qquad X_j\,|\,\theta^*_j \sim p(X|\theta^*_j), \qquad j = 1,\ldots,n. \qquad \big(\Pr(\theta^*_j = \theta_i\,|\,G) = \pi_i\big)$$

s The values of θ∗1, . . . , θ∗n induce a clustering of the data.

• If $\theta^*_j = \theta^*_{j'}$ for $j \neq j'$, then $X_j$ and $X_{j'}$ are "clustered" together since they come from the same distribution.

s We’ve thus far focused on G. We now focus on the clustering of X1, . . . , Xn induced by G.

Polya’s Urn model (finite Dirichlet)

• To simplify things, we work in the finite setting and replace the parameter $\theta_i$ with its index $i$. We let the indicator variables $c_1,\ldots,c_n$ represent $\theta^*_1,\ldots,\theta^*_n$ such that $\theta^*_j = \theta_{c_j}$.


s Polya’s Urn is the following process for generating c1, . . . , cn:

1. For the first indicator, c1 ∼∑K

i=11Kδi

2. For the nth indicator, cn|c1, . . . , cn−1 ∼∑n−1

j=11

α+n−1δcj + α

α+n−1

∑Ki=1

1Kδi

• In words, we start with an urn having $\frac{\alpha}{K}$ balls of color $i$ for each of the $K$ colors. We randomly pick a ball, put it back in the urn, and put another ball of the same color in the urn.

• Another way to write #2 above is to define $n^{(n-1)}_i = \sum_{j=1}^{n-1}\mathbf{1}(c_j = i)$. Then
$$c_n\,|\,c_1,\ldots,c_{n-1} \sim \sum_{i=1}^K \frac{\frac{\alpha}{K} + n^{(n-1)}_i}{\alpha+n-1}\,\delta_i.$$
To put it most simply, we're just sampling the next color from the empirical distribution of the urn at step $n$.

s What can we say about p(c1 = i1, . . . , cn = in) (write as p(c1, . . . , cn)) under this prior?

1. By the chain rule of probability, $p(c_1,\ldots,c_n) = \prod_{j=1}^n p(c_j|c_1,\ldots,c_{j-1})$.

2. $p(c_j = i\,|\,c_1,\ldots,c_{j-1}) = \dfrac{\frac{\alpha}{K} + n^{(j-1)}_i}{\alpha + j - 1}$

3. Therefore,
$$p(c_{1:n}) = p(c_1)p(c_2|c_1)p(c_3|c_1,c_2)\cdots = \prod_{j=1}^n \frac{\frac{\alpha}{K} + n^{(j-1)}_{c_j}}{\alpha + j - 1} \tag{7.1}$$

• A few things to notice about $p(c_1,\ldots,c_n)$:

1. The denominator is simply $\prod_{j=1}^n(\alpha+j-1)$.

2. $n^{(j-1)}_{c_j}$ increments by one. That is, after $c_{1:n}$ we have the counts $(n^{(n)}_1,\ldots,n^{(n)}_K)$. For each $n^{(n)}_i$ the numerator will contain $\prod_{s=1}^{n^{(n)}_i}(\frac{\alpha}{K}+s-1)$.

3. Therefore,
$$p(c_1,\ldots,c_n) = \frac{\prod_{i=1}^K\prod_{s=1}^{n^{(n)}_i}(\frac{\alpha}{K}+s-1)}{\prod_{j=1}^n(\alpha+j-1)} \tag{7.2}$$

• Key: The key thing to notice is that this does not depend on the order of $c_1,\ldots,c_n$. That is, if we permute $c_1,\ldots,c_n$ such that $c_j = i_{\rho(j)}$, where $\rho(\cdot)$ is a permutation of $(1,\ldots,n)$, then
$$p(c_1 = i_1,\ldots,c_n = i_n) = p(c_1 = i_{\rho(1)},\ldots,c_n = i_{\rho(n)}).$$

s The sequence c1, . . . , cn is said to be “exchangeable” in this case.


Exchangeability and independent and identically distributed (iid) sequences

• Independent sequences are exchangeable:
$$p(c_1,\ldots,c_n) = \prod_{i=1}^n p(c_i) = \prod_{i=1}^n p(c_{\rho(i)}) = p(c_{\rho(1)},\ldots,c_{\rho(n)}). \tag{7.3}$$

s Exchangeable sequences aren’t necessarily independent (exchangeability is “weaker”). Thinkof the urn. cj is clearly not independent of c1, . . . , cj−1.

Exchangeability and de Finetti’s

s de Finetti’s theorem: A sequence is exchangeable if and only if there is a parameter π withdistribution p(π) for which the sequence is independent and identically distributed given π.

• In other words, for our problem there is a probability vector $\pi$ such that $p(c_{1:n}|\pi) = \prod_{j=1}^n p(c_j|\pi)$.

• The problem is to find $p(\pi)$:
$$p(c_1,\ldots,c_n) = \int p(c_1,\ldots,c_n|\pi)p(\pi)\,d\pi = \int\prod_{j=1}^n p(c_j|\pi)\,p(\pi)\,d\pi = \int\prod_{j=1}^n \pi_{c_j}\,p(\pi)\,d\pi = \int\prod_{i=1}^K \pi_i^{n^{(n)}_i}\,p(\pi)\,d\pi$$
The left-hand side was computed above, so
$$\frac{\prod_{i=1}^K\prod_{s=1}^{n^{(n)}_i}(\frac{\alpha}{K}+s-1)}{\prod_{j=1}^n(\alpha+j-1)} = \mathbb{E}_{p(\pi)}\left[\prod_{i=1}^K \pi_i^{n^{(n)}_i}\right] \tag{7.4}$$

• Above, the first equality is always true. The second one is by de Finetti's theorem, since $c_1,\ldots,c_n$ is exchangeable. (We won't prove this theorem, we'll just use it.) In the last equality, the left-hand side was previously shown, and the right-hand side is an equivalent way of writing the second-to-last line.

• By de Finetti and the exchangeability of $c_1,\ldots,c_n$, we therefore arrive at an expression for the moments of $\pi$ according to the still-unknown distribution $p(\pi)$.


• Because the moments of a distribution are unique to that distribution (like the Laplace transform), $p(\pi)$ has to be $\mathrm{Dir}(\frac{\alpha}{K},\ldots,\frac{\alpha}{K})$, since plugging this in for $p(\pi)$ we get
$$\mathbb{E}_{p(\pi)}\left[\prod_{i=1}^K \pi_i^{n^{(n)}_i}\right] = \int\prod_{i=1}^K \pi_i^{n_i}\,\frac{\Gamma(\alpha)}{\Gamma(\frac{\alpha}{K})^K}\prod_{i=1}^K \pi_i^{\frac{\alpha}{K}-1}\,d\pi$$
$$= \frac{\Gamma(\alpha)\prod_{i=1}^K\Gamma(\frac{\alpha}{K}+n_i)}{\Gamma(\frac{\alpha}{K})^K\,\Gamma(\alpha+n)}\underbrace{\int\frac{\Gamma(\alpha+n)}{\prod_{i=1}^K\Gamma(\frac{\alpha}{K}+n_i)}\prod_{i=1}^K \pi_i^{n_i+\frac{\alpha}{K}-1}\,d\pi}_{=\,1,\ \text{the integral of }\mathrm{Dir}(\frac{\alpha}{K}+n_1,\ldots,\frac{\alpha}{K}+n_K)}$$
$$= \frac{\Gamma(\alpha)\prod_{i=1}^K\Gamma(\frac{\alpha}{K})\prod_{s=1}^{n_i}(\frac{\alpha}{K}+s-1)}{\Gamma(\frac{\alpha}{K})^K\,\Gamma(\alpha)\prod_{j=1}^n(\alpha+j-1)} = \frac{\prod_{i=1}^K\prod_{s=1}^{n^{(n)}_i}(\frac{\alpha}{K}+s-1)}{\prod_{j=1}^n(\alpha+j-1)} \tag{7.5}$$

s This holds for all n and (n1, . . . , nk). Since a distribution is defined by its moments, the resultfollows.

s Notice that we didn’t need de Finetti since we could just hypothesize the existence of a π forwhich p(c1:n|π) =

∏i p(ci|π). It’s more useful when the distribution is more “non-standard,”

or to prove that a π doesn’t exist.

• Final statement: As $n\to\infty$, the distribution $\sum_{i=1}^K \frac{n^{(n)}_i + \frac{\alpha}{K}}{\alpha+n}\delta_i \to \sum_{i=1}^K \pi^*_i\delta_i$.
  – This is because there exists a $\pi$ for which $c_1,\ldots,c_n$ are iid, and so by the law of large numbers the point $\pi^*$ exists and $\pi^* = \pi$.
  – Since $\pi \sim \mathrm{Dir}(\frac{\alpha}{K},\ldots,\frac{\alpha}{K})$, it follows that the empirical distribution converges to a random vector that is distributed as $\mathrm{Dir}(\frac{\alpha}{K},\ldots,\frac{\alpha}{K})$.

The infinite limit (Chinese restaurant process)

s Let’s go back to the original notation:

θ∗j |GK ∼ GK , GK =K∑i=1

πiδθi , π ∼ Dir(α

K, . . . ,

α

K), θi

iid∼ µ.

• We follow the exact same ideas, only changing notation. The urn process is
$$\theta^*_n\,|\,\theta^*_1,\ldots,\theta^*_{n-1} \sim \sum_{i=1}^K \frac{\frac{\alpha}{K} + n^{(n-1)}_i}{\alpha+n-1}\,\delta_{\theta_i}, \qquad \theta_i \overset{iid}{\sim} \mu.$$

• We've proven that $\lim_{K\to\infty}G_K = G \sim \mathrm{DP}(\alpha\mu)$. We now take the limit of the corresponding urn process.


• Re-indexing: At observation $n$, re-index the atoms so that $n^{(n-1)}_j > 0$ for $j = 1,\ldots,K^+_{n-1}$ and $n^{(n-1)}_j = 0$ for $j > K^+_{n-1}$. ($K^+_{n-1}$ = number of unique values in $\theta^*_{1:n-1}$.) Then
$$\theta^*_n\,|\,\theta^*_1,\ldots,\theta^*_{n-1} \sim \sum_{i=1}^{K^+_{n-1}}\frac{\frac{\alpha}{K}+n^{(n-1)}_i}{\alpha+n-1}\,\delta_{\theta_i} + \frac{\alpha}{\alpha+n-1}\sum_{i=1+K^+_{n-1}}^{K}\frac{1}{K}\,\delta_{\theta_i}. \tag{7.6}$$

• Obviously for $n \gg K$, $K^+_{n-1} = K$ with high probability, and just the left term remains. However, we're interested in $K\to\infty$ before we let $n$ grow. In this case
  1. $\dfrac{\frac{\alpha}{K}+n^{(n-1)}_i}{\alpha+n-1} \longrightarrow \dfrac{n^{(n-1)}_i}{\alpha+n-1}$
  2. $\displaystyle\sum_{i=1+K^+_{n-1}}^{K}\frac{1}{K}\,\delta_{\theta_i} \longrightarrow \mu.$

• For #2, if you sample $K$ times from a distribution and create a uniform measure on those samples, then in the infinite limit you get the original distribution back. Removing $K^+_{n-1} < \infty$ of those atoms doesn't change this (we won't prove this).

The Chinese restaurant process

• Let $\alpha > 0$ and $\mu$ a probability measure on $S$. Sample the sequence $\theta^*_1,\ldots,\theta^*_n$, $\theta^* \in S$, as follows:
  1. Set $\theta^*_1 \sim \mu$
  2. Sample $\theta^*_n\,|\,\theta^*_1,\ldots,\theta^*_{n-1} \sim \sum_{j=1}^{n-1}\frac{1}{\alpha+n-1}\delta_{\theta^*_j} + \frac{\alpha}{\alpha+n-1}\mu$
Then the sequence $\theta^*_1,\ldots,\theta^*_n$ is a Chinese restaurant process.
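A hedged sketch of this sampling scheme (illustrative only), tracking table indices rather than the atoms $\theta^*$ themselves:

```python
import numpy as np

def sample_crp(alpha, n, rng):
    """Return an integer table assignment for each of n customers."""
    tables = []                           # tables[k] = number of customers at table k
    assignments = np.empty(n, dtype=int)
    for j in range(n):                    # j customers already seated
        probs = np.array(tables + [alpha], dtype=float) / (alpha + j)
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(0)              # open a new table
        tables[k] += 1
        assignments[j] = k
    return assignments

rng = np.random.default_rng(7)
z = sample_crp(alpha=2.0, n=500, rng=rng)
print(len(np.unique(z)))                  # number of occupied tables, roughly alpha*ln(n)
print(np.bincount(z)[:10])                # a few big tables, many small ones
```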

• Equivalently, define $n^{(n-1)}_i = \sum_{j=1}^{n-1}\mathbf{1}(\theta^*_j = \theta_i)$. Then
$$\theta^*_n\,|\,\theta^*_1,\ldots,\theta^*_{n-1} \sim \sum_{i=1}^{K^+_{n-1}}\frac{n^{(n-1)}_i}{\alpha+n-1}\,\delta_{\theta_i} + \frac{\alpha}{\alpha+n-1}\,\mu.$$

• As $n\to\infty$, $\frac{\alpha}{\alpha+n-1}\to 0$ and
$$\sum_{i=1}^{K^+_{n-1}}\frac{n^{(n-1)}_i}{\alpha+n-1}\,\delta_{\theta_i} \longrightarrow G = \sum_{i=1}^\infty \pi_i\delta_{\theta_i} \sim \mathrm{DP}(\alpha\mu). \tag{7.7}$$

• Notice with the limits that $K$ first went to infinity and then $n$ went to infinity. The resulting empirical distribution is a Dirichlet process because for finite $K$ the de Finetti mixing measure is Dirichlet, and the infinite limit of this finite Dirichlet is a Dirichlet process (as we've shown).


Chinese restaurant analogy

• An infinite sequence of tables each has a dish (parameter) placed on it that is drawn iid from $\mu$. The $n$th customer sits at an occupied table with probability proportional to the number of customers seated there, or selects the first unoccupied table with probability proportional to $\alpha$.

• "Dishes" are parameters for distributions; a "customer" is a data point that uses its dish to create its value.

Some properties of the CRP

• Cluster growth: What does the number of unique clusters $K^+_n$ look like as a function of $n$?
$$K^+_n = \sum_{j=1}^n \mathbf{1}(\theta^*_j \neq \theta^*_{\ell}\ \forall\,\ell<j) \quad\leftarrow\ \text{this event occurs when we select a new table}$$
$$\mathbb{E}[K^+_n] = \sum_{j=1}^n \mathbb{E}\big[\mathbf{1}(\theta^*_j \neq \theta^*_{\ell}\ \forall\,\ell<j)\big] = \sum_{j=1}^n P(\theta^*_j \neq \theta^*_{\ell}\ \forall\,\ell<j) = \sum_{j=1}^n \frac{\alpha}{\alpha+j-1} \approx \alpha\ln n$$
Where does this come from? Each $\theta^*_j$ can pick a new table, and does so with probability $\frac{\alpha}{\alpha+j-1}$.

• Cluster sizes: We saw last lecture that if we let the number of observations $n$ be random, where
$$n\,|\,y \sim \mathrm{Pois}(y), \qquad y \sim \mathrm{Gam}(\alpha, \alpha/n),$$
then we can analyze cluster size and number using the Poisson process.
  – $\mathbb{E}[n] = \mathbb{E}[\mathbb{E}[n|y]] = \mathbb{E}[y] = n$ $\leftarrow$ expected number of observations
  – $K^+_n \sim \mathrm{Pois}(\alpha\ln(\alpha+n) - \alpha\ln\alpha)$ $\leftarrow$ total number of clusters
  – Therefore $\mathbb{E}[K^+_n] = \alpha\ln((\alpha+n)/\alpha)$ (compare with above, where $n$ is not random)
  – $\sum_{i=1}^{K^+_n}\mathbf{1}(n^{(n)}_i = k) \sim \mathrm{Pois}\big(\frac{\alpha}{k}\big(\frac{n}{\alpha+n}\big)^k\big)$ $\leftarrow$ number of clusters with $k$ observations

• It's important to remember that $n$ is random here. So in #2 and #3 above, we first generate $n$ and then sample this many times from the CRP. For example, $\mathbb{E}[K^+_n]$ is slightly different depending on whether $n$ is random or not.


Inference for the CRP

• Generative process: $X_n\,|\,\theta^*_n \sim p(X|\theta^*_n)$, $\theta^*_n\,|\,\theta^*_{1:n-1} \sim \sum_{i=1}^{n-1}\frac{1}{\alpha+n-1}\delta_{\theta^*_i} + \frac{\alpha}{\alpha+n-1}\mu$. Here $\alpha > 0$ is the "concentration" parameter and we assume $\mu$ is a non-atomic probability measure.

• Posterior inference: Given the data $X_1,\ldots,X_N$ and parameters $\alpha$ and $\mu$, the goal of inference is to solve the inverse problem of finding $\theta^*_1,\ldots,\theta^*_N$. This gives the unique parameters $\theta_{1:K_N} = \mathrm{unique}(\theta^*_{1:N})$ and the partition of the data into clusters.

• Using Bayes' rule doesn't get us far (recall that $p(B|A) = \frac{p(A|B)p(B)}{p(A)}$):
$$p(\theta_1,\theta_2,\ldots,\theta^*_{1:N}\,|\,X_{1:N}) = \left[\prod_{j=1}^N p(X_j|\theta^*_j)\right]p(\theta^*_{1:N})\prod_{i=1}^\infty p(\theta_i)\Big/\ \text{intractable normalizer} \tag{7.8}$$

• Gibbs sampling: We can't calculate the posterior analytically, but we can sample from it: iterate between sampling the atoms given the assignments and then sampling the assignments given the atoms.

Sampling the atoms $\theta_1, \theta_2, \ldots$

• This is the easier of the two. For the $K_N$ unique clusters in $\theta^*_1,\ldots,\theta^*_N$ at iteration $t$, we need to sample $\theta_1,\ldots,\theta_{K_N}$.

Sample $\theta_i$: Use Bayes' rule,
$$p(\theta_i\,|\,\theta_{-i},\theta^*_{1:N},X_{1:N}) \propto \underbrace{\prod_{j:\theta^*_j=\theta_i} p(X_j|\theta_i)}_{\text{likelihood}}\ \times\ \underbrace{p(\theta_i)}_{\text{prior }(\mu)} \tag{7.9}$$

• In words, the posterior of $\theta_i$ depends only on the data assigned to the $i$th cluster according to $\theta^*_1,\ldots,\theta^*_N$.

• We simply select this subset of data and calculate the posterior of $\theta_i$ on this subset. When $\mu$ and $p(X|\theta)$ are conjugate, this is easy.

Sampling $\theta^*_j$ (seating assignment for $X_j$)

• Use exchangeability of $\theta^*_1,\ldots,\theta^*_N$ to treat $X_j$ as if it were the last observation,
$$p(\theta^*_j\,|\,X_{1:N},\Theta,\theta^*_{-j}) \propto p(X_j|\theta^*_j,\Theta)\,p(\theta^*_j|\theta^*_{-j}) \quad\leftarrow\ \text{also conditions on "future" }\theta^*_n \tag{7.10}$$

• Below is the sampling algorithm, followed by the mathematical derivation:
$$\text{set } \theta^*_j = \begin{cases} \theta_i & \text{w.p.} \propto p(X_j|\theta_i)\sum_{n\neq j}\mathbf{1}(\theta^*_n=\theta_i), \quad \theta_i \in \mathrm{unique}\{\theta^*_{-j}\} \\[4pt] \theta_{\mathrm{new}} \sim p(\theta|X_j) & \text{w.p.} \propto \alpha\int p(X_j|\theta)p(\theta)\,d\theta \end{cases} \tag{7.11}$$


• The first line should be straightforward from Bayes' rule. The second line is trickier because we have to account for the infinitely many remaining parameters. We'll discuss the second line next.

• First, return to the finite model and then take the limit (assuming the appropriate re-indexing).

• Define: $n^{-j}_i = \#\{n \neq j : \theta^*_n = \theta_i\}$, $K^{-j} = \#\mathrm{unique}\{\theta^*_{-j}\}$.

• Then the prior on $\theta^*_j$ is
$$\theta^*_j\,|\,\theta^*_{-j} \sim \sum_{i=1}^{K^{-j}}\frac{n^{-j}_i}{\alpha+n-1}\,\delta_{\theta_i} + \frac{\alpha}{\alpha+n-1}\sum_{i=1}^K\frac{1}{K}\,\delta_{\theta_i} \tag{7.12}$$

• The term $\sum_{i=1}^K\frac{1}{K}\delta_{\theta_i}$ overlaps with the $K^{-j}$ atoms in the first term, but we observe that $K^{-j}/K \to 0$ as $K\to\infty$.

s First: What’s the probability a new atom is used in the infinite limit (K →∞)?

p(θ∗j = θnew|Xj, θ∗−j) ∝ lim

K→∞α

K∑i=1

1

Kp(Xj|θi) (7.13)

Since $\theta_i \overset{iid}{\sim} \mu$,
$$\lim_{K\to\infty}\sum_{i=1}^K\frac{1}{K}\,p(X_j|\theta_i) = \mathbb{E}_\mu[p(X_j|\theta)] = \int p(X_j|\theta)\,\mu(d\theta). \tag{7.14}$$
Technically, this is the probability that an atom is selected from the second part of (7.12) above. We'll see next why this atom is therefore "new."

s Second: Why is θnew ∼ p(θ|Xj)? (And why is it new to begin with?)

s Given that θ∗j = θnew, we need to find the index i so that θnew = θi from the second half of(7.12).

$$p(\theta_{\mathrm{new}} = \theta_i\,|\,X_j,\theta^*_j=\theta_{\mathrm{new}}) \propto p(X_j|\theta_{\mathrm{new}}=\theta_i)\,p(\theta_{\mathrm{new}}=\theta_i|\theta^*_j=\theta_{\mathrm{new}}) \propto \lim_{K\to\infty} p(X_j|\theta_i)\,\frac{1}{K} \ \Rightarrow\ p(X_j|\theta)\,\mu(d\theta) \tag{7.15}$$
So $p(\theta_{\mathrm{new}}|X_j) \propto p(X_j|\theta)\,\mu(d\theta)$.

Therefore, given that the atom associated with $X_j$ is selected from the second half of (7.12), the probability it coincides with an atom in the first half equals zero (and so it's "new" with probability one). Also, the atom itself is distributed according to the posterior given $X_j$.
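A hedged end-to-end sketch of this Gibbs sampler (illustrative only, not the notes' code): steps (7.9) and (7.11) for 1-D data with likelihood $N(x|\theta,\sigma^2)$ and conjugate base measure $\mu = N(0,\tau^2)$. All settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)
sigma2, tau2, alpha = 1.0, 9.0, 1.0
X = np.concatenate([rng.normal(-4, 1, 50), rng.normal(4, 1, 50)])
N = X.size

z = np.zeros(N, dtype=int)            # cluster assignments (start with one cluster)
theta = {0: 0.0}                      # cluster atoms

def post_params(xs):
    # posterior of theta given data xs: N(m, v) by normal-normal conjugacy
    v = 1.0 / (1.0 / tau2 + len(xs) / sigma2)
    return v * xs.sum() / sigma2, v

for sweep in range(50):
    # (7.9): resample each occupied atom from its conditional posterior
    for k in list(theta):
        m, v = post_params(X[z == k])
        theta[k] = rng.normal(m, np.sqrt(v))
    # (7.11): resample each assignment, treating X[j] as the last observation
    for j in range(N):
        counts = {k: (z == k).sum() - (z[j] == k) for k in theta}
        ks = [k for k in theta if counts[k] > 0]
        lik = [counts[k] * np.exp(-(X[j] - theta[k])**2 / (2 * sigma2)) for k in ks]
        # marginal likelihood of X[j] under a brand-new atom: N(0, tau2 + sigma2)
        new = alpha * np.exp(-X[j]**2 / (2 * (tau2 + sigma2))) * np.sqrt(sigma2 / (tau2 + sigma2))
        probs = np.array(lik + [new]); probs /= probs.sum()
        pick = rng.choice(len(probs), p=probs)
        if pick == len(ks):
            k_new = max(theta) + 1
            m, v = post_params(X[j:j+1])
            theta[k_new] = rng.normal(m, np.sqrt(v))   # theta_new ~ p(theta | X_j)
            z[j] = k_new
        else:
            z[j] = ks[pick]
    theta = {k: theta[k] for k in theta if (z == k).any()}   # drop empty clusters

print(len(theta))        # typically settles near 2 clusters for this data
```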


Chapter 8

Exchangeability, beta processes and the Indian buffet process

Marginalizing (integrating out) stochastic processes

• We saw how the Dirichlet process gives a discrete distribution on model parameters in a clustering setting. When the Dirichlet process is integrated out, the cluster assignments form a Chinese restaurant process:
$$\underbrace{p(\theta^*_1,\ldots,\theta^*_N)}_{\text{Chinese restaurant process}} = \int\prod_{n=1}^N\underbrace{p(\theta^*_n|G)}_{\text{i.i.d. from discrete dist.}}\ \underbrace{p(G)}_{\text{DP}}\ dG \tag{8.1}$$

• There is a direct parallel between the beta-Bernoulli process and the "Indian buffet process":
$$\underbrace{p(Z_1,\ldots,Z_N)}_{\text{Indian buffet process}} = \int\prod_{n=1}^N\underbrace{p(Z_n|H)}_{\text{Bernoulli process}}\ \underbrace{p(H)}_{\text{BP}}\ dH \tag{8.2}$$

s As with the DP→CRP transition, the BP→IBP transition can be understood from the limitingcase of the finite BP model.

Beta process (finite approximation)

• Let $\alpha > 0$, $\gamma > 0$ and $\mu$ a non-atomic probability measure. Define
$$H_K = \sum_{i=1}^K \pi_i\delta_{\theta_i}, \qquad \pi_i \sim \mathrm{Beta}\Big(\alpha\frac{\gamma}{K},\ \alpha\Big(1-\frac{\gamma}{K}\Big)\Big), \qquad \theta_i \sim \mu. \tag{8.3}$$
Then $\lim_{K\to\infty} H_K = H \sim \mathrm{BP}(\alpha,\gamma\mu)$. (See Lecture 3 for the proof.)


Bernoulli process using $H_K$

• Given $H_K$, we can draw the Bernoulli process $Z^K_n\,|\,H_K \sim \mathrm{BeP}(H_K)$ as follows:
$$Z^K_n = \sum_{i=1}^K b_{in}\delta_{\theta_i}, \qquad b_{in} \sim \mathrm{Bernoulli}(\pi_i). \tag{8.4}$$
Notice that $b_{in}$ should also be marked with $K$, which we suppress. Again, we are particularly interested in $\lim_{K\to\infty}(Z^K_1,\ldots,Z^K_N)$.

• To derive the IBP, we first consider
$$\lim_{K\to\infty} p(Z^K_{1:N}) = \lim_{K\to\infty}\int\prod_{n=1}^N p(Z^K_n|H_K)\,p(H_K)\,dH_K. \tag{8.5}$$

• We can think of $Z^K_1,\ldots,Z^K_N$ in terms of a binary matrix, $B_K = [b_{in}]$, where
  1. each row corresponds to atom $\theta_i$ and $b_{in} \overset{iid}{\sim} \mathrm{Bern}(\pi_i)$
  2. each column corresponds to a Bernoulli process, $Z^K_n$
Important: The rows of $B_K$ are independent processes.

• Consider the process $b_{in}\,|\,\pi_i \overset{iid}{\sim} \mathrm{Bern}(\pi_i)$, $\pi_i \sim \mathrm{Beta}(\alpha\frac{\gamma}{K}, \alpha(1-\frac{\gamma}{K}))$. The marginal process $b_{i1},\ldots,b_{iN}$ follows an urn model with two colors.

Polya’s urn (two-color special case)

1. Start with an urn having αγ/K balls of color 1 and α(1− γ/K) balls of color 2.

2. Pick a ball at random, pit it back and put a second one of the same color

Mathematically:

1. bi1 ∼γ

Kδ1 + (1− γ

K)δ0

2. bi,N+1|bi1, . . . , biN ∼αγK

+ n(N)i

α +Nδ1 +

α(1− γK

) +N − n(N)i

α +Nδ0

where n(N)i =

∑Nj=1 bij . Recall from exchangeability and deFinetti that

limK→∞

n(N)i

N−→ πi ∼ Beta(α γ

K, α(1− γ

K)) (8.6)

• Last week we proved this in the context of the finite symmetric Dirichlet, $\pi \sim \mathrm{Dir}(\frac{\alpha}{K},\ldots,\frac{\alpha}{K})$. The beta distribution is the two-dimensional special case of the Dirichlet, and the proof applies to any parameterization, not just the symmetric one.


• In the Dirichlet→CRP limiting case, $K$ corresponds to the number of colors in the urn, and $K\to\infty$ with the starting number $\frac{\alpha}{K}$ of each color going to 0.

• In this case, there are always only two colors. However, the number of urns equals $K$, so the number of urn processes goes to infinity as the first parameter of each beta goes to zero.

An intuitive derivation of the IBP: Work with the urn representation of $B_K$. Again, $b_{in}$ is entry $(i,n)$ of $B_K$, and the generative process for $B_K$ is
$$b_{i,N+1}\,|\,b_{i1},\ldots,b_{iN} \sim \frac{\frac{\alpha\gamma}{K}+n^{(N)}_i}{\alpha+N}\,\delta_1 + \frac{\alpha(1-\frac{\gamma}{K})+N-n^{(N)}_i}{\alpha+N}\,\delta_0 \tag{8.7}$$
where $n^{(N)}_i = \sum_{j=1}^N b_{ij}$. Each row of $B_K$ is associated with a $\theta_i \overset{iid}{\sim} \mu$, so we can reconstruct $Z^K_n = \sum_{i=1}^K b_{in}\delta_{\theta_i}$ using what we have.

s Let’s break down limK→∞BK into two cases.

Case $n=1$: We ask how many ones are in the first column of $B_K$.
$$\lim_{K\to\infty}\sum_{i=1}^K b_{i1} \sim \lim_{K\to\infty}\mathrm{Bin}(K,\gamma/K) = \mathrm{Pois}(\gamma) \tag{8.8}$$
So $Z_1$ has $\mathrm{Pois}(\gamma)$ ones. Since the $\theta$ associated with these ones are i.i.d., we can "ex post facto" draw them i.i.d. from $\mu$ and re-index.

Case n > 1: For the remaining Zn, we break this into two subcases.

• Subcase $n^{(n-1)}_i > 0$: $\quad b_{in}\,|\,b_{i1},\ldots,b_{i,n-1} \sim \dfrac{n^{(n-1)}_i}{\alpha+n-1}\,\delta_1 + \dfrac{\alpha+n-1-n^{(n-1)}_i}{\alpha+n-1}\,\delta_0$

• Subcase $n^{(n-1)}_i = 0$: $\quad b_{in}\,|\,b_{i1},\ldots,b_{i,n-1} \sim \lim_{K\to\infty}\dfrac{\alpha\frac{\gamma}{K}}{\alpha+n-1}\,\delta_1 + \dfrac{\alpha(1-\frac{\gamma}{K})+n-1}{\alpha+n-1}\,\delta_0$

• For each $i$, $\big(\frac{\alpha\gamma}{\alpha+n-1}\big)\frac{1}{K}\delta_1 \to 0\cdot\delta_1$, but there are also an infinite number of these indices $i$ for which $n^{(n-1)}_i = 0$. Is there a limiting argument we can again make to just ask how many ones there are in total for these indices with $n^{(n-1)}_i = 0$?

• Let $K_n = \#\{i : n^{(n-1)}_i > 0\}$, which is finite almost surely. Then
$$\lim_{K\to\infty}\sum_{i=1}^K b_{in}\mathbf{1}(n^{(n-1)}_i = 0) \sim \lim_{K\to\infty}\mathrm{Bin}\Big(K-K_n,\ \Big(\frac{\alpha\gamma}{\alpha+n-1}\Big)\frac{1}{K}\Big) = \mathrm{Pois}\Big(\frac{\alpha\gamma}{\alpha+n-1}\Big) \tag{8.9}$$


• So there are $\mathrm{Pois}\big(\frac{\alpha\gamma}{\alpha+n-1}\big)$ new locations for which $b_{in} = 1$. Again, since the atoms are i.i.d. regardless of the index, we can simply draw them i.i.d. from $\mu$ and re-index.

Putting it all together: The Indian buffet process

• For $n=1$: Draw $C_1 \sim \mathrm{Pois}(\gamma)$, $\theta_1,\ldots,\theta_{C_1} \overset{iid}{\sim} \mu$ and set $Z_1 = \sum_{i=1}^{C_1}\delta_{\theta_i}$.

• For $n>1$: Let $K_{n-1} = \sum_{j=1}^{n-1}C_j$. For $i = 1,\ldots,K_{n-1}$, draw
$$b_{in}\,|\,b_{i,1:n-1} \sim \frac{n^{(n-1)}_i}{\alpha+n-1}\,\delta_1 + \frac{\alpha+n-1-n^{(n-1)}_i}{\alpha+n-1}\,\delta_0. \tag{8.10}$$
Then draw $C_n \sim \mathrm{Pois}\big(\frac{\alpha\gamma}{\alpha+n-1}\big)$ and $\theta_{K_{n-1}+1},\ldots,\theta_{K_{n-1}+C_n} \overset{iid}{\sim} \mu$ and set
$$Z_n = \sum_{i=1}^{K_{n-1}} b_{in}\delta_{\theta_i} + \sum_{i'=K_{n-1}+1}^{K_{n-1}+C_n}\delta_{\theta_{i'}}. \tag{8.11}$$

• By exchangeability and de Finetti, $\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^N Z_i \to H \sim \mathrm{BP}(\alpha,\gamma\mu)$.

The IBP story: Start with $\alpha$ customers not eating anything.
  1. A customer walks into an Indian buffet with an infinite number of dishes and samples $\mathrm{Pois}(\gamma)$ of them.
  2. The $n$th customer arrives and samples each previously sampled dish with probability proportional to the number of previous customers who sampled it, and then samples $\mathrm{Pois}\big(\frac{\alpha\gamma}{\alpha+n-1}\big)$ new dishes.

s In modeling scenarios, each dish corresponds to a factor (e.g., a one-dimensional subspace)and a customer samples a subset of factors.

• Clearly, after $n$ customers there are $\sum_{i=1}^n C_i \sim \mathrm{Pois}\big(\sum_{i=1}^n \frac{\alpha\gamma}{\alpha+i-1}\big)$ dishes that have been sampled. (See Lecture 3 for another derivation of this quantity.)

BP constructions and the IBP

• From Lecture 4: Let $\alpha, \gamma > 0$ and $\mu$ a non-atomic probability measure. Let
$$C_i \sim \mathrm{Pois}\Big(\frac{\alpha\gamma}{\alpha+i-1}\Big), \qquad \pi_{ij} \sim \mathrm{Beta}(1, \alpha+i-1), \qquad \theta_{ij} \sim \mu \tag{8.12}$$
and define $H = \sum_{i=1}^\infty\sum_{j=1}^{C_i}\pi_{ij}\delta_{\theta_{ij}}$. Then $H \sim \mathrm{BP}(\alpha,\gamma\mu)$.


s We proved this using the Poisson process theory. Now we’ll show how this construction can bearrived at in the first place.

• Imagine an urn with $\beta_1$ balls of one color and $\beta_2$ of another. The distribution on the first draw from this urn is
$$\frac{\beta_1}{\beta_1+\beta_2}\,\delta_1 + \frac{\beta_2}{\beta_1+\beta_2}\,\delta_0.$$
Let $(b_1, b_2, \ldots)$ be the urn process with this initial configuration. Then, as we have seen, in the limit $N\to\infty$, $\frac{1}{N}\sum_{i=1}^N b_i \to \pi \sim \mathrm{Beta}(\beta_1,\beta_2)$.

• Key: We can skip the whole urn procedure if we're only interested in $\pi$. That is, if we draw from the urn once, look at it, and see it is color one, then the urn distribution is
$$\frac{\beta_1+1}{1+\beta_1+\beta_2}\,\delta_1 + \frac{\beta_2}{1+\beta_1+\beta_2}\,\delta_0.$$
The questions to ask are: what will happen if we continue? What is the difference between this "posterior" configuration and another urn where this is defined to be the initial setting? It shouldn't be hard to convince yourself that, in this case, as $N\to\infty$,
$$\frac{1}{N}\sum_{i=1}^N b_i\,\Big|\,b_1 = 1 \ \to\ \pi \sim \mathrm{Beta}(\beta_1+1,\beta_2).$$

• This is because the sequence $(b_1 = 1, b_2, b_3, \ldots)$ is equivalent to an urn process $(b_2, b_3, \ldots)$ whose initial configuration is $\beta_1 + 1$ balls of color one and $\beta_2$ of color two.

• Furthermore, we know from de Finetti and exchangeability that if $Z_1,\ldots,Z_N$ are from an IBP, then $\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^N Z_i \to H \sim \mathrm{BP}(\alpha,\gamma\mu)$. Right now we're only interested in this limit.

Derivation of the construction via the IBP

• For each $Z_n$, there are $C_n \sim \mathrm{Pois}\big(\frac{\alpha\gamma}{\alpha+n-1}\big)$ new "urns" introduced with the initial configuration
$$\frac{1}{\alpha+n}\,\delta_1 + \frac{\alpha+n-1}{\alpha+n}\,\delta_0.$$
From this point on, each of these new urns can be treated as an independent process. Notice that this is the case for every instantiation of the IBP.

• With the IBP, we continue this urn process. Instead, we can ask for the limiting distribution of the urn immediately after instantiating it. We know from above that the new urns created at step $n$ will converge to random variables drawn independently from a $\mathrm{Beta}(1, \alpha+n-1)$ distribution. We can draw this probability directly.

• The result is the construction written above.


Part II

Topics in Bayesian Inference


Chapter 9

EM and variational inference

s We’ve been discussing Bayesian nonparametric priors and their relationship to the Poissonprocess up to this point (first 2/3 of the course). We now turn to the last 1/3 which will be aboutmodel inference (with tie-ins to Bayesian nonparametrics).

s In this lecture, we’ll review two deterministic optimization-based approaches to inference: theEM algorithm for MAP inference and an extension of EM to approximate Bayesian posteriorinference called variational inference.

General setup

1. Prior: We define a model with a set of parameters $\theta$ and a prior distribution on them, $p(\theta)$.

2. Likelihood: Conditioned on the model parameters, we have a way of generating the set of data $X$ according to the distribution $p(X|\theta)$.

3. Posterior: By Bayes' rule, the posterior distribution of the parameters $\theta$ is
$$p(\theta|X) = \frac{p(X|\theta)p(\theta)}{\int p(X|\theta)p(\theta)\,d\theta} \tag{9.1}$$

• Of course, the problem is that for most models the denominator is intractable, and so we can't actually calculate $p(\theta|X)$ analytically.

• The first, most obvious solution is to ask for the most probable $\theta$ according to $p(\theta|X)$. Since
$$p(\theta|X) = p(X|\theta)p(\theta)\Big/\underbrace{\int p(X|\theta)p(\theta)\,d\theta}_{\text{constant w.r.t. }\theta}$$
we know that
$$\arg\max_\theta\, p(\theta|X) = \arg\max_\theta\, p(X|\theta)p(\theta) = \arg\max_\theta\, \underbrace{\ln p(X|\theta) + \ln p(\theta)}_{\text{for computational convenience}} \tag{9.2}$$

• Therefore, we can come up with an algorithm for finding a locally optimal solution to the (usually) non-convex problem $\arg\max_\theta p(X,\theta)$.


• This often leads to a problem (that will be made clearer in an example later): the function $\ln p(X,\theta)$ is sometimes hard to optimize with respect to $\theta$ (e.g., it might require messy gradients).

• The EM algorithm is a very general method for optimizing $\ln p(X,\theta)$ w.r.t. $\theta$ in a "nice" way (e.g., closed-form updates).

• With EM, we expand the joint likelihood $p(X,\theta)$ using some auxiliary variables $c$ so that we have $p(X,\theta,c)$ and $p(X,\theta) = \int p(X,\theta,c)\,dc$. Here $c$ is something picked entirely out of convenience. For now, we keep it abstract and highlight this with an example later.

Build-up to EM

• Observe that by basic probability, $p(c|X,\theta)p(X,\theta) = p(X,\theta,c)$. Therefore,
$$\underbrace{\ln p(X,\theta)}_{\text{thing we want}} = \underbrace{\ln p(X,\theta,c) - \ln p(c|X,\theta)}_{\text{expanded version equal to the thing we want}} \tag{9.3}$$

• We then introduce any probability distribution on $c$, $q(c)$, and follow the sequence of steps:

1. $\ln p(X,\theta) = \ln p(X,\theta,c) - \ln p(c|X,\theta)$

2. $q(c)\ln p(X,\theta) = q(c)\ln p(X,\theta,c) - q(c)\ln p(c|X,\theta)$

3. $\int q(c)\ln p(X,\theta)\,dc = \int q(c)\ln p(X,\theta,c)\,dc - \int q(c)\ln p(c|X,\theta)\,dc$

4. $\ln p(X,\theta) = \int q(c)\ln p(X,\theta,c)\,dc - \int q(c)\ln q(c)\,dc - \int q(c)\ln p(c|X,\theta)\,dc + \int q(c)\ln q(c)\,dc$ (add and subtract the same thing)

5. $\ln p(X,\theta) = \underbrace{\int q(c)\ln\dfrac{p(X,\theta,c)}{q(c)}\,dc}_{\mathcal{L}(q,\theta)} + \underbrace{\int q(c)\ln\dfrac{q(c)}{p(c|X,\theta)}\,dc}_{\mathrm{KL}(q\|p)}$

• We have decomposed the objective function $\ln p(X,\theta)$ into a sum of two terms:

1. $\mathcal{L}(q,\theta)$: a function we can (hopefully) calculate.

2. $\mathrm{KL}(q\|p)$: the KL-divergence between $q$ and $p$. Recall that $\mathrm{KL} \geq 0$, with $\mathrm{KL} = 0$ only when $q = p$.

• The practical usefulness isn't immediately obvious. For now, we analyze it as an algorithm for optimizing $\ln p(X,\theta)$ w.r.t. $\theta$. That is, how can we use the right side to find a locally optimal solution of the left side?


• We will see that the following algorithm gives a sequence of values for $\theta$ that monotonically increase $p(X,\theta)$ and can be shown to converge to a local optimum of $p(X,\theta)$.

The Expectation-Maximization Algorithm

E-step: Change $q(c)$ by setting $q(c) = p(c|X,\theta)$. Obviously, we're assuming $p(c|X,\theta)$ can be calculated in closed form. What does this do?

1. Sets $\mathrm{KL}(q\|p) = 0$.

2. Since $\theta$ is fixed, $\ln p(X,\theta)$ isn't changed here. Therefore, since $\mathrm{KL} \geq 0$ in general, by changing $q$ so that $\mathrm{KL}(q\|p) = 0$ we increase $\mathcal{L}(q,\theta)$ so that $\mathcal{L}(q,\theta) = \ln p(X,\theta)$.

M-step: $\mathcal{L}(q,\theta) = \mathbb{E}_q[\ln p(X,\theta,c)] - \mathbb{E}_q[\ln q(c)]$ can now be treated as a function of $\theta$. Therefore, we maximize $\mathcal{L}(q,\theta)$ w.r.t. $\theta$. What does this do?

1. Increases $\mathcal{L}(q,\theta)$ (obvious).

2. Since $\theta$ has changed, $q(c) \neq p(c|X,\theta)$ anymore, so $\mathrm{KL}(q\|p) > 0$.

3. Therefore,
$$\ln p(X,\theta_{\mathrm{old}}) = \mathbb{E}_q\Big[\ln\frac{p(X,\theta_{\mathrm{old}},c)}{q(c)}\Big] < \mathbb{E}_q\Big[\ln\frac{p(X,\theta_{\mathrm{new}},c)}{q(c)}\Big] + \mathrm{KL}(q\|p) = \ln p(X,\theta_{\mathrm{new}}) \tag{9.4}$$

Example: Probit classification

s We’re given data pairs D = {(xi, zi)} for i = 1, . . . , N , xi ∈ Rd and zi ∈ {−1,+1}. We wantto build a linear classifier f(xTw − w0) : x→ {−1,+1}.

xTi w−w0

> 0 for all points to upper right< 0 for all points to lower left= 0 all points on line

Classify using the probit function:

f(xTw−w0) = Φ

(xTw − w0

σ

)= p(z = 1|w, x)

Φ is the CDF of a standard normal distribution

Simplification of notation: Let $w \leftarrow [w_0\ w^T]^T$ and $x_i \leftarrow [1\ x_i^T]^T$.
  – The model variable is $\theta = w$ in this case. Let $w \sim N(0, cI)$ be the prior.
  – The joint likelihood is $p(z, w|x) = p(w)\prod_{i=1}^N p(z_i|w, x_i)$.


• So we would like to optimize
$$\ln p(w) + \sum_{i=1}^N \ln p(z_i|w,x_i) = -\frac{1}{2c}w^Tw + \sum_{i=1}^N z_i\ln\Phi\Big(\frac{x_i^Tw}{\sigma}\Big) + \sum_{i=1}^N (1-z_i)\ln\Big[1-\Phi\Big(\frac{x_i^Tw}{\sigma}\Big)\Big] \tag{9.5}$$
with respect to $w$, which is hard because of $\Phi(\cdot)$.

Solution via EM

• Introduce the latent variables ("hidden data") $y_1,\ldots,y_N$ such that $y_i|x_i,w \sim N(x_i^Tw, \sigma^2)$ and let $z_i = \mathbf{1}(y_i > 0)$. Marginalizing this expanded joint likelihood,
$$p(z_i = 1|x_i, w) = \int_{-\infty}^\infty p(y_i, z_i=1|x_i,w)\,dy_i = \int_{-\infty}^\infty \mathbf{1}(y_i > 0)\,N(y_i|x_i^Tw,\sigma^2)\,dy_i = \Phi\Big(\frac{x_i^Tw}{\sigma}\Big) \tag{9.6}$$
so the expanded joint likelihood $p(z, y, w|x)$ has the correct marginal distribution.

• From the EM algorithm we know that
$$\ln p(z,w|x) = \mathbb{E}_q\Big[\ln\frac{p(z,y,w|x)}{q(y)}\Big] + \mathrm{KL}\big(q(y)\,\|\,p(y|z,w,x)\big) \tag{9.7}$$
$$= \sum_{i=1}^N\int q(y)\ln\frac{p(z_i,y_i,w|x_i)}{q(y)}\,dy + \int q(y)\ln\frac{q(y)}{p(y|z,w,x)}\,dy$$

Q: Can we simplify $q(y) = q(y_1,\ldots,y_N)$?

A: We need to set $q(y) = p(y|z,w,x)$. Since $p(y|z,w,x) = \prod_{i=1}^N p(y_i|z_i,w,x_i)$, we know that we can write $q(y) = \prod_{i=1}^N q(y_i)$ (but we still don't know what it is).

• E-step

1. For each $i$, set $q(y_i) = p(y_i|z_i,w,x_i) \propto p(z_i|y_i)p(y_i|x_i,w)$. Since $p(z_i|y_i) = \mathbf{1}(y_i \in \mathbb{R}_{z_i})$ and $p(y_i|x_i,w) = N(y_i|x_i^Tw,\sigma^2)$, it follows that $q(y_i) = \mathrm{TN}_{z_i}(x_i^Tw, \sigma^2)$, a truncated normal on the $z_i$ half of $\mathbb{R}$.

2. Construct the objective $\sum_i \mathbb{E}_{q(y_i)}[\ln p(z_i,y_i,w|x_i)]$. We can ignore $\sum_i \mathbb{E}_{q(y_i)}[\ln q(y_i)]$ because it doesn't depend on $w$, which is what we want to optimize over next.
$$\sum_{i=1}^N \mathbb{E}_q\ln p(z_i,y_i,w|x_i) = \sum_{i=1}^N \mathbb{E}_q\big[\underbrace{-\tfrac{1}{2\sigma^2}(y_i - x_i^Tw)^2}_{\ln p(y_i|x_i,w)\,+\,\text{const.}}\big] + \sum_{i=1}^N \mathbb{E}_q\ln p(z_i|y_i) \underbrace{- \tfrac{1}{2c}w^Tw + \text{const.}}_{\ln p(w)} \tag{9.8}$$
$$= \sum_{i=1}^N -\tfrac{1}{2\sigma^2}\big(\mathbb{E}_q[y_i] - x_i^Tw\big)^2 - \tfrac{1}{2c}w^Tw + \text{const. w.r.t. } w$$


• M-step

1. Optimize the objective from the E-step with respect to $w$. We can take the gradient w.r.t. $w$, set it to zero, and see that the solution is
$$w = \Big(\frac{1}{c}I + \frac{1}{\sigma^2}\sum_{i=1}^N x_ix_i^T\Big)^{-1}\Big(\sum_{i=1}^N x_i\,\mathbb{E}_q[y_i]/\sigma^2\Big) \tag{9.9}$$

• The expectation $\mathbb{E}_q[y_i]$ is of the truncated normal r.v. with distribution $q(y_i)$, which can be looked up in a book or on Wikipedia. The expectation looks complicated, but it's simply an equation that can be evaluated using what we have. The final algorithm can be seen in this equation: after updating $w$, update $q$. This updates $\mathbb{E}_q[y_i]$, and so we can again update $w$. The code would iterate between updating $w$ and updating $\mathbb{E}_q[y_i]$ for each $i$. Notice that this is a nice, closed-form iterative algorithm. The output $w$ maximizes $p(z, w|x)$.
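A hedged sketch of this EM loop (illustrative only, not the notes' code; it uses synthetic data and labels coded as $z \in \{0,1\}$ to match (9.5)):

```python
# EM for MAP probit regression: alternate E_q[y_i] (truncated-normal means)
# with the closed-form M-step update (9.9).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(10)
N, d, sigma, c = 500, 3, 1.0, 10.0
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])   # prepend the intercept
w_true = np.array([-0.5, 2.0, -1.0])
z = (X @ w_true + sigma * rng.normal(size=N) > 0).astype(float)

w = np.zeros(d)
A = np.eye(d) / c + X.T @ X / sigma**2          # fixed matrix in the M-step
for it in range(100):
    # E-step: E_q[y_i] under q(y_i) = TN_{z_i}(x_i^T w, sigma^2)
    m = X @ w
    a = m / sigma
    Ey = np.where(z == 1,
                  m + sigma * norm.pdf(a) / np.clip(norm.cdf(a), 1e-12, None),
                  m - sigma * norm.pdf(a) / np.clip(norm.cdf(-a), 1e-12, None))
    # M-step: closed-form update (9.9)
    w = np.linalg.solve(A, X.T @ Ey / sigma**2)

print(w)          # should point in roughly the same direction as w_true
```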

Extension of EM to a BNP modification of the probit model

Model extension: Imagine that $x$ is very high dimensional. We might think that only a few dimensions of $x$ are relevant for predicting $z$. We can use a finite approximation to the gamma process to address this.

Gamma process prior: Let $c_j \overset{iid}{\sim} \mathrm{Gam}(\frac{\alpha}{d}, b)$, $w_j \sim N(0, c_j)$. From the GaP theory we know that when $d \gg \alpha$, most values of $c_j \approx 0$, but some will be large. In this context the $c_j$ are variances, so the model says that most dimensions of $w$ should be zero, in which case the corresponding dimension of $x$ is not used in predicting $z = f(x^Tw)$.

Log joint likelihood: The EM equation is
$$\ln p(z,w|x) = \int q(y,c)\ln\frac{p(z,y,w,c|x)}{q(y,c)}\,dy\,dc + \int q(y,c)\ln\frac{q(y,c)}{p(y,c|z,w,x)}\,dy\,dc \tag{9.10}$$
We can think of $y$ again as the "hidden data" that doesn't have an interpretation as a model variable. $c$ can be thought of as a model variable, and we want to integrate out the uncertainty in this variable to infer a point estimate of $w$.

E-step:

1. Set $q(y,c) = p(y,c|z,w,x)$. In this case $y$ and $c$ are conditionally independent of each other,
$$p(y,c|z,w,x) = p(y|z,w,x)\,p(c|w).$$
We can calculate these as before:
$$q(y_i) = p(y_i|z_i,w,x_i) = \mathrm{TN}_{z_i}(x_i^Tw,\sigma^2) \tag{9.11}$$
$$q(c_j) = p(c_j|w_j) = \mathrm{GiG}(w_j^2,\alpha,b) \quad\leftarrow\ \text{generalized inverse Gaussian} \tag{9.12}$$


2. Construct the objective function
$$\sum_i \mathbb{E}_q\ln p(z_i,y_i,w,c|x_i) = \sum_i \mathbb{E}_q\ln p(z_i|y_i) + \sum_i \mathbb{E}_q\ln p(y_i|x_i,w) + \mathbb{E}_q\ln p(w|c) + \text{const.}$$

M-step:

1. Optimize the objective from the E-step with respect to $w$. We can take the gradient with respect to $w$ and set it to zero to find that
$$w = \Big(\mathrm{diag}\big(\mathbb{E}_q[c_j^{-1}]\big) + \frac{1}{\sigma^2}\sum_{i=1}^N x_ix_i^T\Big)^{-1}\Big(\sum_{i=1}^N x_i\,\mathbb{E}_q[y_i]/\sigma^2\Big) \tag{9.13}$$

Variational inference

s We motivate variational inference in the context of the model we’ve been discussing and thengive a general review of it.

• Generative process:
$$z_i = \mathrm{sign}(y_i), \qquad y_i \sim N(x_i^Tw,\sigma^2), \qquad w_j \sim N(0,c_j), \qquad c_j \sim \mathrm{Gam}\Big(\frac{\alpha}{d}, b\Big) \tag{9.14}$$

• We saw how we could maximize $\ln p(z,w|x)$ easily by including distributions on $y$ and $c$. Could we do the same for $w$?
$$\ln p(z|x) = \int q(y,c,w)\ln\frac{p(z,y,w,c|x)}{q(y,c,w)}\,dy\,dc\,dw + \int q(y,c,w)\ln\frac{q(y,c,w)}{p(y,c,w|z,x)}\,dy\,dc\,dw \tag{9.15}$$

• Short answer: Probably not. What has changed?

1. Before, we were maximizing $\ln p(z,w|x)$ w.r.t. $w$. This meant we needed to calculate $p(y,c|z,x,w)$. Fortunately, since $y$ and $c$ are independent conditioned on $w$, we can calculate $p(y,c|z,x,w) = p(y|z,x,w)p(c|w)$ analytically using the point estimate of $w$.

2. Therefore, we could set $q(y) = p(y|z,x,w)$ and $q(c) = p(c|w)$ to make $\mathrm{KL} = 0$ and optimize $\mathcal{L}$.

3. This time, we can't factorize $p(y,c,w|z,x)$, so we can't find $q(y,c,w) = p(y,c,w|z,x)$ in closed form to make $\mathrm{KL} = 0$.

• Some other observations

1. $\ln p(z|x)$ is a constant since it depends only on the data $z, x$ and the underlying model.

2. Therefore, it doesn't make sense to talk about maximizing $\ln p(z|x)$ like before, when we were using EM to maximize $\ln p(z,w|x)$ with respect to $w$.

3. If we could find $q(y,c,w)$ to make $\mathrm{KL}(q\|p) = 0$, we would have the full posterior of the model! (That's the whole purpose of inference.)

4. What made EM for $\ln p(z,w|x)$ easy was that we could write $q(y,c) = q(y)q(c)$ and still find optimal distributions.


• Variational inference idea

1. We can't write q(y, c, w) = q(y)q(c)q(w) and still have KL(q‖p) = 0, i.e., we can't have q(y)q(c)q(w) = p(y, c, w|z, x).

2. What if we make an approximation that p(y, c, w|z, x) ≈ q(y)q(c)q(w)?

3. We still have that
\[
\ln p(z|x) = \underbrace{\int q(y)q(c)q(w)\ln\frac{p(z,y,w,c|x)}{q(y)q(c)q(w)}\,dy\,dc\,dw}_{\mathcal{L}(q)} + \underbrace{\int q(y)q(c)q(w)\ln\frac{q(y)q(c)q(w)}{p(y,c,w|z,x)}\,dy\,dc\,dw}_{\mathrm{KL}(q\|p)} \qquad (9.16)
\]

4. However, KL ≠ 0 (ever).

5. Notice though that, since ln p(z|x) = const. and KL > 0, if we can maximize L wrt q, we can minimize KL between q and p.

Variational inference (in general)

• Have data x and a model with variables θ_1, . . . , θ_m.

• Choose probability distributions q(θ_i) for each θ_i.

• Construct the equality
\[
\underbrace{\ln p(x)}_{\text{log evidence (constant)}} = \underbrace{\int\Big(\prod_{i=1}^m q(\theta_i)\Big)\ln\frac{p(x,\theta_1,\ldots,\theta_m)}{\prod_{i=1}^m q(\theta_i)}\,d\theta_1\cdots d\theta_m}_{\mathcal{L}:\ \text{variational lower bound on log evidence}} + \underbrace{\int\Big(\prod_{i=1}^m q(\theta_i)\Big)\ln\frac{\prod_{i=1}^m q(\theta_i)}{p(\theta_1,\ldots,\theta_m|x)}\,d\theta_1\cdots d\theta_m}_{\text{KL divergence between }\prod_i q(\theta_i)\text{ and } p(\theta_1,\ldots,\theta_m|x)} \qquad (9.17)
\]

Observations

1. As before, ln p(x) is a constant (that we don’t know)

2. KL(q‖p) ≥ 0 (as always), therefore L ≤ ln p(x)

3. As with EM, p(x, θ1, . . . , θm) is something we can write (joint likelihood)

4. \[
\mathcal{L} = \underbrace{\mathbb{E}_q\ln p(x,\theta_1,\ldots,\theta_m)}_{\text{expected log joint likelihood}} - \underbrace{\sum_{i=1}^m\int q(\theta_i)\ln q(\theta_i)\,d\theta_i}_{\text{individual entropies of each } q}
\]
is a closed-form objective function (currently by assumption, not always true!)

5. We’ve defined the distribution q(θi) but not its parameters. L is a function of theseparameters

6. By finding parameters of each q(θi) to maximizeL, we are equivalently finding parametersto minimize KL(q‖p).


Chapter 10

Stochastic variational inference

Bayesian modeling review

• Recall that we have a set of data X and of model parameters Θ = {θ_i} for i = 1, . . . , M.

• We have defined a model for the data, p(X|Θ), and a prior for the model, p(Θ).

• The goal is to learn Θ and capture a level of uncertainty, so by Bayes rule, we want
\[
p(\Theta|X) = \frac{p(X|\Theta)\,p(\Theta)}{p(X)}. \qquad (10.1)
\]

• We can't calculate p(X), so we need approximate methods.

Variational inference

• Using the EM algorithm as motivation, we set up the "master equation"
\[
\underbrace{\ln p(X)}_{\text{constant marginal likelihood}} = \underbrace{\int q(\Theta)\ln\frac{p(X,\Theta)}{q(\Theta)}\,d\Theta}_{\text{variational lower bound, }\mathcal{L}} + \underbrace{\int q(\Theta)\ln\frac{q(\Theta)}{p(\Theta|X)}\,d\Theta}_{\text{KL-divergence between } q \text{ and desired } p(\Theta|X)} \qquad (10.2)
\]

• We have freedom to pick q(Θ), but that means we have to define it.

• There is a trade-off:

1. Pick q(Θ) to minimize KL (ideally set q = p(Θ|X)!)

2. We don't know p(Θ|X), so pick a q that's easy to optimize.

• What do we mean by "optimize"? Answer: We can pick a q(Θ) so that we have a closed-form function L. Since ln p(X) is a constant, by maximizing L wrt the parameters of q, we are minimizing the non-negative KL-divergence. Each increase in L finds a q closer to p(Θ|X) according to KL.


Picking q(θi)

• The variational objective function is
\[
\mathcal{L} = \int\Big(\prod_{i=1}^m q(\theta_i)\Big)\ln p(X,\Theta)\,d\Theta - \sum_{i=1}^m\int q(\theta_i)\ln q(\theta_i)\,d\theta_i \qquad (10.3)
\]

• Imagine we have q(θ_j) for j ≠ i. Then we can isolate q(θ_i) as follows:
\[
\mathcal{L} = \int q(\theta_i)\,\mathbb{E}_{-q_i}[\ln p(X,\Theta)]\,d\theta_i - \int q(\theta_i)\ln q(\theta_i)\,d\theta_i - \sum_{j\neq i}\int q(\theta_j)\ln q(\theta_j)\,d\theta_j = \int q(\theta_i)\ln\frac{\frac{1}{Z}\exp\{\mathbb{E}_{-q_i}[\ln p(X,\Theta)]\}}{q(\theta_i)}\,d\theta_i + \text{constant wrt } \theta_i \qquad (10.4)
\]

• Z is the normalizing constant: Z = ∫ exp{E_{−q_i}[ln p(X,Θ)]} dθ_i. It's a constant wrt θ_i, so we can cancel it out by including the appropriate term in the constant.

• Therefore, L = −KL(q(θ_i) ‖ exp{E_{−q_i}[ln p(X,Θ)]}/Z) + const wrt θ_i.

• To maximize this over all q(θ_i), we need to set KL = 0 (since KL ≥ 0). Therefore,
\[
q(\theta_i) \propto \exp\{\mathbb{E}_{-q_i}[\ln p(X,\Theta)]\}. \qquad (10.5)
\]

• In words: We can find q(θ_i) by calculating the expected log joint likelihood using the other q's and exponentiating the result (and normalizing it).

• Notice that we get both the form and the parameters of q(θ_i) this way.
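As a quick sanity check of this recipe, consider a toy conjugate example (assumed here purely for illustration; it is not one of the models in these notes). With a single unknown, x_n|θ ∼ N(θ, 1) and θ ∼ N(0, 1), there are no "other q's" and the update reduces to exact Bayes:
\[
q(\theta) \propto \exp\Big\{-\tfrac{1}{2}\sum_{n=1}^N(x_n-\theta)^2 - \tfrac{1}{2}\theta^2\Big\} \propto \exp\Big\{-\tfrac{N+1}{2}\theta^2 + \theta\sum_n x_n\Big\},
\]
i.e., q(θ) = N(Σ_n x_n/(N+1), 1/(N+1)): both the Gaussian form and its parameters fall out of exponentiating the (expected) log joint.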

Example: Sparse regression

– Data: Labeled data (xn, zn), xn ∈ Rd, zn ∈ {−1, 1}, d large

– Model: z_n = sign(y_n), y_n ∼ N(x_n^T w, σ²), w_j ∼ N(0, c_j), c_j ∼ Gam(a/d, b)

– Recall: Last week, we did EM MAP for w, meaning we got a point estimate of w and integrated out y and c. In that case, we could set q(c_j) and q(y_n) to their exact posteriors conditioned on w because they were independent.

Variational inference

Step 1: Pick a factorization, say q(y, c, w) = [∏_n q(y_n)][∏_j q(c_j)] q(w)

Step 2: Find optimal q distributions

Step 2: Find optimal q distributions


– q(y_n) ∝ exp{ E_{−q}[ln 1(z_n = sign(y_n))] + E_{−q}[ln p(y_n|x_n, w)] } = TN_{z_n}(x_n^T E_q[w], σ²)
  (the first term determines the support of y_n, R_+ or R_−; the second determines the functional form and its parameters)

This would have been much harder by taking derivatives of L

– q(c_j) ∝ exp{ E_{−q}[ln p(w_j|c_j)] + ln p(c_j) } ∝ c_j^{−1/2} exp{−E_q[w_j²]/(2c_j)} · c_j^{a/d − 1} exp{−b c_j} = GiG(2b, E_q[w_j²], a/d − 1/2)

Again, it would have been harder to define GiG(τ1, τ2, τ3) and take derivatives of L

– q(w) ∝ exp{E−q[ln p(~y|X,w)] + E−q[ln p(w|~c)]} = N(µ,Σ) (a standard calculation)

Problem: w ∈ R^d and we're assuming d is large (e.g., 10,000). Therefore, calculating the d × d matrix Σ is impractical.

A solution: Use further factorization. Let q(w) = ∏_j q(w_j). We can again find the optimal form of q(w_j) to be N(μ_j, σ_j²).

More tricks: In this case, updating q(w_j) depends on q(w_{i≠j}). This can make things slow to converge. Instead, it works better to work directly with L and update the vectors μ = (μ_1, . . . , μ_d) and σ² = (σ_1², . . . , σ_d²) by setting, e.g., ∇_μ L = 0.

Conjugate exponential family models (CEF)

• Variational inference in CEF models has a nice generic form. Here, conjugate means the prior/likelihood pairs of all model variables are conjugate. Exponential means all distributions are in the exponential family.

• Review and notation: Pick a model variable, θ_i. We have a likelihood and prior for it:

– Likelihood: Assume a conditionally independent likelihood
\[
p(w_1,\ldots,w_N \mid \eta(\theta_i)) = \prod_{n=1}^N p(w_n|\eta_i) = \Big[\prod_{n=1}^N h(w_n)\Big]\exp\Big\{\eta_i^T\sum_{n=1}^N t(w_n) - N A(\eta_i)\Big\} \qquad (10.6)
\]

η(θ_i) ≡ η_i is the natural parameter, t(w_n) is the sufficient statistic vector, A(η_i) is the log-normalizer

– Prior: A conjugate prior is p(η_i) = f(χ, ν) exp{η_i^T χ − ν A(η_i)}

– Posterior: p(η_i|w) = f(χ′, ν′) exp{η_i^T χ′ − ν′ A(η_i)}, with χ′ = χ + Σ_n t(w_n), ν′ = ν + N

Variational inference

• We have variational distributions q(w_n) on each w_n and q(θ_i) on θ_i. Using the log → expectation → exponentiate "trick",
\[
q(\theta_i) \propto \exp\Big\{\eta(\theta_i)^T\Big(\chi + \sum_{n=1}^N \mathbb{E}_q[t(w_n)]\Big) - (\nu + N)A(\eta(\theta_i))\Big\}. \qquad (10.7)
\]


• Therefore, we immediately see that the optimal q distribution for a model variable with a prior conjugate to the likelihood is in the same family as the prior.

• In fact, we see that with variational inference, we take the expected value of the sufficient statistics that would be used to calculate the true conditional posterior.

• Contrast this with Gibbs sampling, where we have samples of t(w_1), . . . , t(w_N) and we sample a new value of θ_i directly using these sufficient statistics (which are themselves changing with each iteration because they are also sampled).

The more direct calculation for CEF models

• Imagine instead that we wanted to work directly with the variational objective function, L.

– Generically, L = E_q[ln p(X, θ)] − E_q[ln q(θ)]. In terms of the CEF form we've been talking about, this is

L = E_q[ln p(w_{1:N}, θ_i)] − E_q[ln q(θ_i)] + things that don't depend on q(θ_i)

• We've defined q(θ_i) = f(χ′, ν′) exp{η(θ_i)^T χ′ − ν′ A(η(θ_i))}. Therefore, focusing just on θ_i,
\[
\mathcal{L}_{\theta_i} = \mathbb{E}_q\Big[\eta_i^T\Big(\chi + \sum_n t(w_n)\Big) - (\nu+N)A(\eta_i)\Big] - \mathbb{E}_q[\eta_i^T\chi' - \nu' A(\eta_i)] - \ln f(\chi',\nu') = \mathbb{E}_q[\eta_i]^T\Big(\chi + \sum_{n=1}^N \mathbb{E}_q[t(w_n)] - \chi'\Big) - (\nu + N - \nu')\,\mathbb{E}_q[A(\eta_i)] - \ln f(\chi',\nu') \qquad (10.8)
\]

• We need some equalities from the conjugate exponential family:
\[
\int q(\theta_i)\,d\theta_i = 1 \;\Rightarrow\; \int\nabla_{\chi'}q(\theta_i)\,d\theta_i = 0 = \int\big(\nabla_{\chi'}\ln f(\chi',\nu') + \eta_i\big)q(\theta_i)\,d\theta_i \;\Rightarrow\; \mathbb{E}_q[\eta_i] = -\nabla_{\chi'}\ln f(\chi',\nu') \qquad (10.9)
\]
\[
\int q(\theta_i)\,d\theta_i = 1 \;\Rightarrow\; \int\frac{d}{d\nu'}q(\theta_i)\,d\theta_i = 0 = \int\Big(\frac{d}{d\nu'}\ln f(\chi',\nu') - A(\eta_i)\Big)q(\theta_i)\,d\theta_i \;\Rightarrow\; \mathbb{E}_q[A(\eta_i)] = \frac{d}{d\nu'}\ln f(\chi',\nu') \qquad (10.10)
\]

• Therefore, the general form of the objective for θ_i is
\[
\mathcal{L}_{\theta_i} = -\nabla_{\chi'}\ln f(\chi',\nu')^T\Big(\chi + \sum_n\mathbb{E}_q[t(w_n)] - \chi'\Big) - \frac{d}{d\nu'}\ln f(\chi',\nu')\,(\nu + N - \nu') - \ln f(\chi',\nu') \qquad (10.11)
\]


• Our goal is to maximize this with respect to (χ′, ν′). If we take the gradient of L with respect to these parameters and set it to zero,
\[
\nabla_{(\chi',\nu')}\mathcal{L} = -\begin{bmatrix}\frac{\partial^2\ln f}{\partial\chi'\partial\chi'^T} & \frac{\partial^2\ln f}{\partial\chi'\partial\nu'}\\ \frac{\partial^2\ln f}{\partial\nu'\partial\chi'^T} & \frac{\partial^2\ln f}{\partial\nu'^2}\end{bmatrix}\begin{bmatrix}\chi + \sum_n\mathbb{E}_q[t(w_n)] - \chi'\\ \nu + N - \nu'\end{bmatrix} = 0 \qquad (10.12)
\]
we see that we get exactly the same updates that we derived before using the "log-expectation-exponentiate" approach. We can simply ignore the matrix and set the right vector equal to zero.

Variational inference from the gradient perspective

• The most general way to optimize an objective function is via gradient ascent.

• In the variational inference setting, this means updating q(θ_i|ψ_i) according to ψ_i ← ψ_i + ρM∇_{ψ_i}L, where ρ is a step size and M is a positive definite matrix.

• Ideally, we want to maximize with respect to ψ_i, meaning set ψ_i such that ∇_{ψ_i}L = 0. We saw how we could do that in closed form with CEF models.

• Equivalently, if [χ′; ν′] ← [χ′; ν′] + ρM∇_{(χ′,ν′)}L, we can set ρ = 1 and M = −[∂²ln f / ∂(·)∂(·)^T]^{−1}. Then this step would move directly to the closed-form solution for ψ_i (i.e., where ∇_{ψ_i}L = 0).

• We observe that M = −[∂²ln f / ∂(·)∂(·)^T]^{−1} = −[∂²ln q(θ_i) / ∂(·)∂(·)^T]^{−1}. This second matrix is the inverse of the Fisher information matrix. When M is set to this value, we are said to be moving in the direction of the natural gradient.

• Therefore, in sum, when we update χ′ and ν′ in closed form as previously discussed, we are implicitly taking a full step (since ρ = 1) in the direction of the natural gradient (because of the setting of M) and landing directly on the optimum. The important point being stressed here is that there is an implied gradient algorithm being executed when we calculate our closed-form parameter updates in variational inference for CEF models.

• This last point motivates the following discussion on scalable extensions of variational inference for CEF models.

• In many data modeling problems we have two issues:

– N is huge (# documents in a corpus, # users, etc.)

– Calculating q(w_n) requires a non-trivial amount of work

Both of these issues lead to slow algorithms. (How slow? e.g., one day = one iteration)


s We’ll next show how stochastic optimization can be applied to this problem in a straightforwardway for CEF models.

Simple notation

s Let θ be a “global variable,” meaning it’s something that interacts with each unit of data.

s Let wn be a set of “local variables” and associated data. Given θ, wn and wn′ are independentof each other.

Examples:

– The Gaussian mixture model, θ = { π︸︷︷︸mixingweights

, µ1:K ,Σ1:K︸ ︷︷ ︸Gaussian params

}, wn = { xn︸︷︷︸data

, cn︸︷︷︸cluster

}.

– LDA, θ = {β1, . . . , βk︸ ︷︷ ︸topics

}, wn = { θn︸︷︷︸topic dist

, ~cn︸︷︷︸word alloc

, ~xn︸︷︷︸word obs

}

• The variational objective function is
\[
\mathcal{L} = \sum_{n=1}^N\mathbb{E}_q\Big[\ln\frac{p(w_n|\theta)}{q(w_n)}\Big] + \mathbb{E}_q[\ln p(\theta)] - \mathbb{E}_q[\ln q(\theta)]. \qquad (10.13)
\]

• Imagine if we subsampled S_t ⊂ {1, . . . , N} and created
\[
\mathcal{L}_t = \frac{N}{|S_t|}\sum_{n\in S_t}\mathbb{E}_q\Big[\ln\frac{p(w_n|\theta)}{q(w_n)}\Big] + \mathbb{E}_q[\ln p(\theta)] - \mathbb{E}_q[\ln q(\theta)]. \qquad (10.14)
\]

• There are C(N, |S_t|) possible subsets of {1, . . . , N} of size |S_t|, each having probability 1/C(N, |S_t|).

• Each w_n appears in C(N−1, |S_t|−1) of the C(N, |S_t|) subsets.

• Therefore, we can calculate E[L_t], summing over the random subsets, as follows:
\[
\mathbb{E}[\mathcal{L}_t] = \binom{N}{|S_t|}^{-1}\sum_{S_t}\mathcal{L}_t = \frac{N}{|S_t|}\sum_{n=1}^N\binom{N}{|S_t|}^{-1}\binom{N-1}{|S_t|-1}\,\mathbb{E}_q\Big[\ln\frac{p(w_n|\theta)}{q(w_n)}\Big] + \mathbb{E}_q[\ln p(\theta)] - \mathbb{E}_q[\ln q(\theta)]. \qquad (10.15)
\]

• Observe that C(N, |S_t|)^{−1} C(N−1, |S_t|−1) = |S_t|/N. Therefore, E[L_t] = L.

Stochastic variational inference (SVI)


• Idea: What if for each iteration we subsampled a random subset of the w_n indexed by S_t, optimized their q(w_n) only, and then took a gradient step on the parameters of q(θ) using L_t instead of L? We would have SVI.

• Stochastic variational inference: Though the technique can be made more general, we restrict our discussion to CEF models.

• Method:

1. At iteration t, subsample w_1, . . . , w_N according to a randomly generated index set S_t.

2. Construct L_t = (N/|S_t|) Σ_{n∈S_t} E_q[ln p(w_n|θ)/q(w_n)] + E_q[ln p(θ)] − E_q[ln q(θ)].

3. Optimize q(w_n) for n ∈ S_t.

4. Update q(θ|ψ) by setting ψ ← ψ + ρ_t M ∇_ψ L_t.

A closer look at Step 4 (the key step)

• By following the same calculations as before (and letting ψ = [χ′, ν′]),
\[
\nabla_{(\chi',\nu')}\mathcal{L}_t = -\begin{bmatrix}\frac{\partial^2\ln f}{\partial\chi'\partial\chi'^T} & \frac{\partial^2\ln f}{\partial\chi'\partial\nu'}\\ \frac{\partial^2\ln f}{\partial\nu'\partial\chi'^T} & \frac{\partial^2\ln f}{\partial\nu'^2}\end{bmatrix}\begin{bmatrix}\chi + \frac{N}{|S_t|}\sum_{n\in S_t}\mathbb{E}_q[t(w_n)] - \chi'\\ \nu + N - \nu'\end{bmatrix} \qquad (10.16)
\]

• If we set M to be the negative inverse of the ∂²ln f matrix above, then this simplifies nicely:
\[
\begin{bmatrix}\chi'\\ \nu'\end{bmatrix} \leftarrow (1-\rho_t)\begin{bmatrix}\chi'\\ \nu'\end{bmatrix} + \rho_t\begin{bmatrix}\chi + \frac{N}{|S_t|}\sum_{n\in S_t}\mathbb{E}_q[t(w_n)]\\ \nu + N\end{bmatrix} \qquad (10.17)
\]

• What has changed?

– We are now looking at subsets of data.

– Therefore, we don't set ρ_t = 1 like before. In general, stochastic optimization requires Σ_t ρ_t = ∞ and Σ_t ρ_t² < ∞ for convergence.

• What is this doing?

– Before, we were making a full update using all the data (ρ_t = 1). Now, since we only look at a subset, we average the "full update" (restricted to the subset) with the most recent value of the parameters χ′ and ν′. Since ρ_t → 0, we weight this new information less and less.
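To make Step 4 concrete, here is a minimal sketch of the resulting natural-gradient update (10.17) for a generic CEF global variable. The function expected_suff_stat is a hypothetical user-supplied routine returning Eq[t(wn)] for a freshly optimized local factor; nothing here is tied to a specific model:

```python
import numpy as np

def svi_step(chi, nu, chi0, nu0, N, minibatch, expected_suff_stat, rho):
    """One SVI step on the global variational parameters (chi, nu); (chi0, nu0) are the prior's parameters."""
    S = len(minibatch)
    stats = sum(expected_suff_stat(n) for n in minibatch)   # sum of E_q[t(w_n)] over the minibatch
    chi_hat = chi0 + (N / S) * stats                        # subset-based estimate of the full-data update
    nu_hat = nu0 + N
    chi = (1.0 - rho) * chi + rho * chi_hat                 # weighted average, as in (10.17)
    nu = (1.0 - rho) * nu + rho * nu_hat
    return chi, nu

# rho_t should satisfy sum rho_t = inf, sum rho_t^2 < inf, e.g. rho_t = (t0 + t) ** (-kappa), kappa in (0.5, 1].
```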


Chapter 11

Variational inference for non-conjugate models

• Variational inference

Have: Data X, model variables Θ = {θ_i}, factorized distribution q(Θ) = ∏_i q(θ_i).

Setup:
\[
\underbrace{\ln p(X)}_{\text{constant}} = \underbrace{\int\Big(\prod_i q(\theta_i)\Big)\ln\frac{p(X,\Theta)}{\prod_i q(\theta_i)}\,d\Theta}_{\text{variational objective }\mathcal{L}} + \underbrace{\int\Big(\prod_i q(\theta_i)\Big)\ln\frac{\prod_i q(\theta_i)}{p(\Theta|X)}\,d\Theta}_{\text{KL divergence}\ \geq\ 0}
\]

Variational inference: Find parameters of each q(θ_i) to maximize L, which simultaneously minimizes KL(q‖p(Θ|X)).

Potential problems: This assumes that we can actually compute L = E_q[ln p(X,Θ)] − E_q[ln q(Θ)]. Even if we use the "trick" from before, where we set q(θ_i) ∝ exp{E_{−q}[ln p(X,Θ)]}, we still need to be able to take these expectations.

s Examples: We’ll give three examples where this problem arises, followed by possible solutions.

1. Poisson matrix factorization has a generative portion containing (loosely) something likeY ∼ Pois(a1b1 + · · ·+ akbk). Y is the data nad ai, bi are r.v.’s we want to learn.

– Log-joint: ln p(Y, a, b) = Y ln∑

i aibi −∑

i aibi +∑

i ln p(ai)p(bi) + const.

– Problem: E ln∑

i aibi is intractable for continuous distribution q(a, b) =∏

i q(ai)q(bi).

2. Logistic normal models look like cn ∼ Disc(p), with pi ∝ exi and x ∼ N(µ,Σ).

– Log-joint: ln p(c, x) =∑

i xi∑

n 1(cn = i)−N ln∑

i exi + ln p(x).

– Problem: −E ln∑

i exi is intractable, where q(x) = N(m,S).

3. Logistic regression has the process yn ∼ σ(xTnw)δ1 + (1− σ(xTnw))δ−1, σ(a) = 11+e−a

.

– Log-joint: ln p(y, w|X) = −∑

n ln(1 + e−ynxTnw) + ln p(w).

– Problem: −E ln(1 + e−ynxTnw) is intractable using continuous q(w).


• Instant solution: Setting q(θ) = δ_θ corresponds to a point estimate of θ. We don't need to do the intractable integral in that case.

• One solution to intractable objectives is to find a more tractable lower bound of the function that's intractable (since we're in the maximization regime).

– The tighter the lower bound, the better the function is being approximated. Therefore, not all lower bounds are equally good.

– Lower bounds often use additional parameters that control how tight they are. These are auxiliary variables that are optimized along with q.

• Solutions to the three problems

1. ln(·) is concave, therefore ln E[x] ≥ E[ln x]. If we introduce p ∈ ∆_K, then
\[
\mathbb{E}\ln\sum_i a_i b_i = \mathbb{E}\ln\sum_i p_i\frac{a_i b_i}{p_i} \geq \mathbb{E}\Big[\sum_i p_i\ln\frac{a_i b_i}{p_i}\Big] = \sum_i p_i\,\mathbb{E}[\ln a_i + \ln b_i] - \sum_i p_i\ln p_i \qquad (11.1)
\]
Often E[ln x] is tractable. After updating q(a_i) and q(b_i), we also need to update the vector p to tighten the bound.

2. −ln(·) is convex, therefore we can use a first-order Taylor expansion:
\[
-\mathbb{E}\ln\sum_i e^{x_i} \geq -\ln\xi - \mathbb{E}\Big[\Big(\sum_i e^{x_i} - \xi\Big)\frac{d\ln z}{dz}\Big|_{z=\xi}\Big] = -\ln\xi - \frac{\sum_i\mathbb{E}[e^{x_i}] - \xi}{\xi} \qquad (11.2)
\]
Usually we can calculate E[e^{x_i}] (the MGF). We then update ξ after q(x).

3. Finding a lower bound for −ln(1 + e^{−y_n x_n^T w}) is more challenging. One bound, due to Jaakkola and Jordan in 2000, is
\[
-\ln(1 + e^{-y_n x_n^T w}) \geq \ln\sigma(\xi_n) + \frac{1}{2}(y_n x_n^T w - \xi_n) - \lambda(\xi_n)\big(w^T x_n x_n^T w - \xi_n^2\big) \qquad (11.3)
\]
where λ(ξ_n) = (2σ(ξ_n) − 1)/(4ξ_n).

– In this case there is a ξ_n for each x_n, since the functions we want to bound are different for each observation.

– Though it's more complicated, we can take all expectations now since the bound is quadratic in w.

– In fact, when q(w) and p(w) are Gaussian, the solution for q is in closed form.

– After solving for q(w), we can get closed-form updates of ξ_n.

• The way we update all auxiliary variables, p, ξ and ξ_n, is by differentiating and setting to zero.
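As one concrete piece of that, here is a minimal sketch of λ(ξ) and of the ξ_n update commonly used with the bound (11.3); the closed-form rule ξ_n² = E_q[(x_n^T w)²] is the standard Jaakkola–Jordan-style choice and is assumed here, since the notes do not spell it out, as is q(w) = N(μ, Σ):

```python
import numpy as np

def jj_lambda(xi):
    """lambda(xi) = (2*sigma(xi) - 1) / (4*xi) from (11.3); its limit at xi = 0 is 1/8."""
    sig = 1.0 / (1.0 + np.exp(-xi))
    return (2.0 * sig - 1.0) / (4.0 * xi)

def update_xi(X, mu, Sigma):
    """xi_n^2 = E_q[(x_n^T w)^2] = x_n^T (Sigma + mu mu^T) x_n, assuming q(w) = N(mu, Sigma); X is (N, d)."""
    M = Sigma + np.outer(mu, mu)
    return np.sqrt(np.einsum('nd,de,ne->n', X, M, X))
```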

• We next give a more concrete example found in Bayesian nonparametric models.


• A size-biased representation of the beta process (review)

• Definition: Let γ > 0 and μ a non-atomic probability measure. For k = 1, 2, . . . , draw θ_k ~iid μ and V_j ~iid Beta(γ, 1). If we define H = Σ_{k=1}^∞ (∏_{j=1}^k V_j) δ_{θ_k}, then H ∼ BP(1, γμ).

Proof: See notes from Lecture 4 for the proof.

• Bernoulli process (review)

• Recall that if we have H, then we can use it to draw a sequence of Bernoulli processes Z_n|H ~iid BeP(H), where Z_n = Σ_{k=1}^∞ z_{nk} δ_{θ_k}, z_{nk} ∼ Bern(∏_{j=1}^k V_j).

(Z_n is also often treated as latent and part of data generation)

Variational inference for the beta process (using this representation)

• We simply focus on inference for V_1, V_2, . . . . Inference for θ_k and Z_n is problem-specific (and possibly doesn't have non-conjugacy issues).

• Let everything else in the model be represented by X. X contains all the data and any variables besides Z and Θ (I'll switch to Z instead of the vector notation now).

• Joint likelihood: The joint likelihood is the first thing we need to write out. This can factorize as
\[
p(X,\Theta,Z,V) = p(X|\Theta,Z)\,p(Z|V)\,p(V)\,p(\Theta). \qquad (11.4)
\]

• Therefore, to update the variational distribution q(V) given all other q distributions, we only need to focus on p(Z|V)p(V).

• For now, we assume q(Z) is a delta function. That is, we have a 0-or-1 point estimate for the values in Z.

• So we focus on p(Z|V)p(V) with Z a point estimate. According to the prior/likelihood definition, this further factorizes as
\[
p(Z,V) = \underbrace{\Big[\prod_{n=1}^N\prod_{k=1}^\infty p(z_{nk}|V_1,\ldots,V_k)\Big]}_{\text{likelihood}}\;\underbrace{\Big[\prod_{k=1}^\infty p(V_k)\Big]}_{\text{prior}} \quad\leftarrow\ \text{truncate at } T \text{ atoms}
\]
\[
\Big[\prod_{n=1}^N\prod_{k=1}^T\Big(\prod_{j=1}^k V_j\Big)^{z_{nk}}\Big(1 - \prod_{j=1}^k V_j\Big)^{1-z_{nk}}\Big]\Big[\prod_{k=1}^T V_k^{\gamma-1}\Big] \qquad (11.5)
\]

• We pick q(V) = ∏_{k=1}^T q(V_k) and leave the form undefined for now.


• The variational objective related to V is
\[
\mathcal{L}_V = \sum_{n=1}^N\sum_{k=1}^T\Big[z_{nk}\sum_{j=1}^k\mathbb{E}_q\ln V_j + (1-z_{nk})\,\mathbb{E}_q\ln\Big(1-\prod_{j=1}^k V_j\Big)\Big] + \sum_{k=1}^T(\gamma-1)\mathbb{E}_q\ln V_k - \sum_{k=1}^T\mathbb{E}_q\ln q(V_k) \qquad (11.6)
\]

Problem: The expectation E_q ln(1 − ∏_{j=1}^k V_j) is intractable.

Question: How can we update q(V_k)?

A solution: Try lower bounding ln(1 − ∏_{j=1}^k V_j) in such a way that it works.

Recall: An aside from Lecture 3 stated that 1 − ∏_{j=1}^k V_j = Σ_{i=1}^k (1 − V_i)∏_{j=1}^{i−1} V_j. Therefore, we can try looking for a lower bound using this instead.

• Since ln is a concave function, ln E[X] ≥ E[ln X]. Therefore, if we introduce a k-dimensional probability vector p, we have
\[
\ln\Big(\sum_{i=1}^k p_i\,\frac{(1-V_i)\prod_{j=1}^{i-1}V_j}{p_i}\Big) \geq \sum_{i=1}^k p_i\ln\Big[\frac{(1-V_i)\prod_{j=1}^{i-1}V_j}{p_i}\Big] = \sum_{i=1}^k p_i\Big(\ln(1-V_i) + \sum_{j=1}^{i-1}\ln V_j\Big) - \sum_{i=1}^k p_i\ln p_i \qquad (11.7)
\]

Therefore,
\[
\mathbb{E}_q\ln\Big(1-\prod_{j=1}^k V_j\Big) \geq \sum_{i=1}^k p_i\Big(\mathbb{E}_q\ln(1-V_i) + \sum_{j=1}^{i-1}\mathbb{E}_q\ln V_j\Big) - \sum_{i=1}^k p_i\ln p_i \qquad (11.8)
\]

(Aside) This use of p is the variational lower bound idea and is often how variational inference is presented in papers:
\[
\ln p(X) = \ln\int p(X,\Theta)\,d\Theta = \ln\int q(\Theta)\frac{p(X,\Theta)}{q(\Theta)}\,d\Theta \geq \int q(\Theta)\ln\frac{p(X,\Theta)}{q(\Theta)}\,d\Theta \equiv \mathcal{L}. \qquad (11.9)
\]

• We simply replace E_q ln(1 − ∏_{i=1}^k V_i) with this lower bound and hope it fixes the problem. Since we have this problematic term for all k = 1, . . . , T, we have T auxiliary probability vectors p^{(1)}, . . . , p^{(T)}.

• We can still use the method of setting q(V_k) ∝ exp{E_{−q} ln p(Z, V)} to find the best q(V_k), but replacing p(Z, V) with the lower bound of it. Therefore, q(V_k) is the best q for the lower-bound version of L (the function we're now optimizing), not the true L.


• In this case, we find that we get a (very complicated) analytical solution for q(V_k):
\[
q(V_k) \propto V_k^{\,\gamma + \sum_{m=k}^T\sum_{n=1}^N z_{nm} + \sum_{m=k+1}^T\sum_{n=1}^N(1-z_{nm})\big(\sum_{i=k+1}^m p_i^{(m)}\big) - 1}\;\times\;(1-V_k)^{\,1 + \sum_{m=k}^T\sum_{n=1}^N(1-z_{nm})\,p_k^{(m)} - 1} \qquad (11.10)
\]

But this simply has the form of a beta distribution with parameters that can be calculated "instantaneously" by a computer.

Unaddressed question: What do we set the probability vector p^{(m)} equal to?

Answer: We can treat it like a new model parameter and optimize over it along with the variational parameters. In general, if we want to optimize p in Σ_{i=1}^k p_i E[y_i] − Σ_{i=1}^k p_i ln p_i, we can set
\[
p_i \propto \exp\{\mathbb{E}[y_i]\}. \qquad (11.11)
\]
In this case, E[y_i] = E ln(1 − V_i) + Σ_{j=1}^{i−1} E ln V_j, which is fast to compute.
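A minimal sketch of this update, assuming (as is typical for this construction, though not stated explicitly here) that each q(V_k) = Beta(a_k, b_k), so that the expectations are digamma differences:

```python
import numpy as np
from scipy.special import digamma

def update_p(a, b, m):
    """Tightening vector p^(m) for the bound on E_q ln(1 - prod_{j<=m} V_j).

    a, b : length-T arrays of Beta parameters for q(V_1), ..., q(V_T); returns p of length m.
    """
    E_ln_V = digamma(a) - digamma(a + b)       # E_q[ln V_k]
    E_ln_1mV = digamma(b) - digamma(a + b)     # E_q[ln(1 - V_k)]
    # E[y_i] = E ln(1 - V_i) + sum_{j < i} E ln V_j, for i = 1, ..., m
    Ey = E_ln_1mV[:m] + np.concatenate(([0.0], np.cumsum(E_ln_V[:m - 1])))
    Ey -= Ey.max()                             # stabilize before exponentiating
    p = np.exp(Ey)
    return p / p.sum()
```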

• We can iterate between updating p^{(1)}, . . . , p^{(T)} and q(V_1), . . . , q(V_T) before moving on to the other q distributions, or update each once and move to the other q's before returning.

Directly optimizing L

• Rather than introduce approximations, we would like to find other ways to optimize L. Specifically, is there a way to optimize it without taking expectations at all?

Setup: Coordinate ascent

Given q(Θ) = ∏_i q(θ_i), the following is the general way to optimize L.

1. Pick a q(θ_i|φ_i) ← the φ_i are variational parameters

2. Set ∇_{φ_i}L = 0 w.r.t. φ_i (if possible) or φ_i ← φ_i + ρ∇_{φ_i}L (if not)

3. After updating each q(θi) this way, repeat.

Problem: For q(θi|φi), Eq ln p(X,Θ) isn’t (totally) in closed form.

Detail: Let
\[
\ln p(X,\Theta) = \underbrace{f_1(X,\theta_i)}_{\text{problem}} + \underbrace{f_2(X,\Theta)}_{\text{no problem}} \;\rightarrow\; \mathcal{L} = \mathbb{E}f_1 + \underbrace{\mathbb{E}f_2 - \mathbb{E}\ln q}_{=\,h(X,\Psi)} \qquad (11.12)
\]
f_1 isn't necessarily the log of an entire distribution, so f_1 and f_2 could both contain θ_i. Ψ contains all the variational parameters of q.

• What we need is ∇_{ψ_i}L = ∇_{ψ_i}E[f_1(X, θ_i)] (hard) + ∇_{ψ_i}h(X, Ψ) (easy).


Solution (version 1): Create an unbiased estimate of ∇_{ψ_i}E[f_1]:
\[
\nabla_{\psi_i}\mathbb{E}[f_1] = \nabla_{\psi_i}\int f_1(X,\theta_i)\,q(\theta_i|\psi_i)\,d\theta_i = \int f_1(X,\theta_i)\,q(\theta_i|\psi_i)\,\nabla_{\psi_i}\ln q(\theta_i|\psi_i)\,d\theta_i \qquad (11.13)
\]
We use the "log trick" for this: q∇_ψ ln q = q (∇_ψ q)/q = ∇_ψ q.

• Use Monte Carlo integration:
\[
\mathbb{E}_q[X] \approx \frac{1}{S}\sum_{s=1}^S X_s,\quad X_s \stackrel{iid}{\sim} q \;\rightarrow\; \lim_{S\to\infty}\frac{1}{S}\sum_{s=1}^S X_s = \mathbb{E}_q[X] \qquad (11.14)
\]

Therefore, we can use the following approximation:
\[
\int f_1(X,\theta_i)\,q(\theta_i|\psi_i)\,\nabla_{\psi_i}\ln q(\theta_i|\psi_i)\,d\theta_i = \mathbb{E}_q[f_1(X,\theta_i)\,\nabla_{\psi_i}\ln q(\theta_i|\psi_i)] \approx \underbrace{\frac{1}{S}\sum_{s=1}^S f_1(X,\theta_i^{(s)})\,\nabla_{\psi_i}\ln q(\theta_i^{(s)})}_{\equiv\,\widehat{\nabla}_{\psi_i}\mathbb{E}f_1},\qquad \theta_i^{(s)} \stackrel{iid}{\sim} q(\theta_i|\psi_i). \qquad (11.15)
\]
Therefore, ∇_{ψ_i}L ≈ ∇̂_{ψ_i}E[f_1] + ∇_{ψ_i}h(X, Ψ), and the RHS is an unbiased approximation, meaning the expectation of the RHS equals the LHS.
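A minimal sketch of this estimator for one simple choice of q (a univariate Gaussian q(θ_i|ψ_i) with ψ_i = (m, log s) is assumed purely for illustration; f1 is any black-box function of θ):

```python
import numpy as np

def score_function_grad(f1, m, log_s, S=100, seed=0):
    """Monte Carlo estimate of grad_{(m, log s)} E_q[f1(theta)] for q = N(m, s^2),
    via (1/S) sum_s f1(theta_s) * grad log q(theta_s)."""
    rng = np.random.default_rng(seed)
    s = np.exp(log_s)
    theta = m + s * rng.standard_normal(S)               # theta_s ~ q(theta | psi)
    d_m = (theta - m) / s**2                             # d/dm log q(theta_s)
    d_log_s = (theta - m)**2 / s**2 - 1.0                # d/d(log s) log q(theta_s)
    f_vals = np.array([f1(t) for t in theta])
    return np.mean(f_vals * d_m), np.mean(f_vals * d_log_s)
```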

Problem with this

• Even though the expectation of this approximation equals the truth, the variance might be huge, meaning S should be large.

Review: If X̄_S = (1/S)Σ_{s=1}^S x_s with x_s ~iid q, then E[X̄_S] = E_q[X] and Var(X̄_S) = Var(X)/S. If we want Var(X̄_S) < ε, then S > Var(X)/ε (possibly huge!)

Need: Variance reduction, which is a way of approximating E[f(θ)] that is unbiased but has smaller variance. One such method is to use a control variate.

Control variates: A way of approximating E_q[f(θ)] with less variance than using Monte Carlo integration as described above.

Introduce a function g(θ) of the same variables such that

– g(θ) and f(θ) are highly correlated

– Eg(θ) can be calculated (unlike Ef(θ))


Procedure:

– Replace f(θ) with f̃(θ) = f(θ) − a(g(θ) − E[g(θ)]) in L, with a ∈ R

– Replace (1/S)Σ_{s=1}^S f(θ_s)∇ln q(θ_s) with (1/S)Σ_{s=1}^S f̃(θ_s)∇ln q(θ_s)

Why this works better:

– E_q[f̃(θ)] = E[f(θ)] − a(E[g(θ)] − E[g(θ)]) = E[f(θ)] ← unbiased

– Var(f̃(θ)) = E[(f̃(θ) − E[f̃(θ)])²] = Var(f(θ)) − 2a Cov(f(θ), g(θ)) + a² Var(g(θ))

• Setting a = Cov(f, g)/Var(g) minimizes Var(f̃(θ)) above. In this case, we can see that Var(f̃(θ))/Var(f(θ)) = 1 − Corr(f, g)².

• Therefore, when f and g are highly correlated, the variance of f̃ is much less than that of f. This translates to fewer samples needed to approximate E[f̃(θ)] within an accuracy threshold than are needed for E[f(θ)] using the same threshold.

• In practice, the optimal value a = Cov(f, g)/Var(g) can be approximated each iteration using samples.
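A minimal sketch of the control-variate estimator (f and g are arbitrary black-box functions here, and a is estimated from the same samples, as suggested above):

```python
import numpy as np

def controlled_estimate(f, g, Eg, samples):
    """Estimate E_q[f(theta)] from samples of q, using g as a control variate with known mean Eg."""
    f_vals = np.array([f(t) for t in samples])
    g_vals = np.array([g(t) for t in samples])
    a = np.cov(f_vals, g_vals)[0, 1] / np.var(g_vals)    # a = Cov(f, g) / Var(g)
    f_tilde = f_vals - a * (g_vals - Eg)                 # same expectation as f, smaller variance
    return f_tilde.mean()
```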

Modified coordinate ascent

1. Pick a q(θ_i|ψ_i).

2. If L is closed form w.r.t. q(θ_i|ψ_i), optimize the normal way. Else, approximate ∇_{ψ_i}L by ∇*_{ψ_i}L = ∇̂_{ψ_i}E[f_1] + ∇_{ψ_i}h(X, Ψ) and set ψ_i ← ψ_i + ρ∇*_{ψ_i}L, possibly taking multiple steps.

3. After each q(θi|ψi) has been updated, repeat.

Picking control variates

– Lower bounds: By definition, a tight lower bound is highly correlated with the function it approximates. It's also picked to give closed-form expectations.

– Taylor expansions (2nd order): A second-order Taylor expansion is not a bound, but often it is a better approximation than a bound. In this case,
\[
g(\theta) = f(\mu) + (\theta-\mu)^T\nabla f(\mu) + \frac{1}{2}(\theta-\mu)^T\nabla^2 f(\mu)(\theta-\mu). \qquad (11.16)
\]
If the second moments of q(θ) can be calculated, then this is a possible control variate. μ could be the mean of q(θ|ψ) given the current value of ψ.


Chapter 12

Scalable MCMC inference

Abstract

In the last class, I present two recent papers that address scalability for Markov chain Monte Carlo (MCMC) inference:

1. M. Welling and Y.W. Teh, "Bayesian Learning via Stochastic Gradient Langevin Dynamics," International Conference on Machine Learning, 2011.

2. D. Maclaurin and R.P. Adams, "Firefly Monte Carlo: Exact MCMC with Subsets of Data," Uncertainty in Artificial Intelligence, 2014.

These presentations were given using slides, which I transcribe verbatim.

• Tons of digitized data can potentially be modeled:

– Online "documents" (blogs, reviews, tweets, comments, etc.)
– Audio, video & images (Pandora, YouTube, Flickr, etc.)
– Social media, medicine, economics, astronomy, etc., etc.

• It used to be that the problem was turning a mathematical model into reality in the most basic sense:

– How do we learn it? (computer innovations)
– Yes, but how do we learn it? (MCMC, optimization methods)

• Given the massive amount of data now readily accessible by a computer, all of which we want to use, the new question is:

– Yes, but how do we learn it in a reasonable amount of time?

• Up to (and often including) now, the default solution has been to sub-sample a manageable amount and throw the rest away.


• Justification:

– The large data we have is itself a small subsample of the set of all possible data.

– So if we think that's a reasonable statistical representation, what's wrong with a little less (relatively speaking)?

• Example 1: Netflix has 17K movies, 480K users, 100M ratings.

– There are 8,160M potential ratings (1.2% measured), so if we remove users to learn about the movies, what can it hurt?

• Example 2: We have 2M newspaper articles out of ~1000× more existing articles. Why not subsample to learn the topics?

• We want to (and should) use all the data, since the more data we have, the more information we can extract.

• This translates to more "expressive" (i.e., complicated) probabilistic models in the context of this class.

• This hinges on having the computational resources to do it:

– More powerful computers,
– Recognizing and exploiting parallelization,
– Using inference tricks to process all the data more efficiently

• Recent successes in large-scale machine learning have mostly been optimization-based approaches:

– Stochastic optimization has a huge literature
– Stochastic variational inference ports this over to the problem of approximate posterior learning

• The common denominator of these methods is that they process small subsets of the data at a time.

• Example: With SVI we saw how each piece of data had its own local variables and all data shared global variables.

– Pick a subset of the data,
– update their local variables,
– then perform a gradient step on the global variables.

• This idea of processing different subsets of data for each iteration is a general one.


• Can it be extended to MCMC inference?

• We'll review two recent ML papers on this:

– M. Welling and Y.W. Teh, "Bayesian Learning via Stochastic Gradient Langevin Dynamics," ICML 2011.
– D. Maclaurin and R.P. Adams, "Firefly Monte Carlo: Exact MCMC with Subsets of Data," UAI 2014.

Bayesian Learning via Stochastic Gradient Langevin Dynamics

• This SGLD algorithm combines two things:

– Stochastic gradient optimization
– Langevin dynamics, which incorporates noise into the updates

• The stochastic gradients and Langevin dynamics are combined in such a way that MCMC sampling results.

• Considers a simple modeling framework with approximations (Alert: possible future research!)

• Let θ denote the model vector and p(θ) its prior distribution.

• Let p(x|θ) be the probability of data x given model variable θ.

• The posterior distribution of N observations X = {x_1, . . . , x_N} is assumed to have the form
\[
p(\theta|X) \propto p(\theta)\prod_{i=1}^N p(x_i|\theta).
\]

• How do we work with this? Rough outline of the following slides:

– Maximum a posteriori with gradient methods
– Stochastic gradient methods for MAP (big data extension)
– Metropolis-Hastings MCMC sampling
– Langevin diffusions (gradient + MH-type algorithm)
– Stochastic gradient for Langevin diffusions

• A maximum a posteriori estimate of θ is
\[
\theta_{\mathrm{MAP}} = \arg\max_\theta\; \ln p(\theta) + \sum_{i=1}^N\ln p(x_i|\theta).
\]


• This can be (locally) optimized using gradient ascent

θ_{t+1} = θ_t + Δθ_t

where Δθ_t is the weighted gradient of the log joint likelihood
\[
\Delta\theta_t = \frac{\varepsilon_t}{2}\Big(\nabla\ln p(\theta_t) + \sum_{i=1}^N\nabla\ln p(x_i|\theta_t)\Big).
\]

• Notice that this requires evaluating p(x_i|θ_t) for every x_i.

• Stochastic gradient ascent optimizes exactly the same objective function (or finds a locally optimal solution).

• The only difference is that for each step t, a subset of n ≪ N observations is used in the step θ_{t+1} = θ_t + Δθ_t:
\[
\Delta\theta_t = \frac{\varepsilon_t}{2}\Big(\nabla\ln p(\theta_t) + \frac{N}{n}\sum_{i=1}^n\nabla\ln p(x_{ti}|\theta_t)\Big).
\]

• x_{ti} is the ith observation in subset t:

– This set is usually talked of as being randomly generated
– In practice, contiguous chunks of n are often used

• For the stochastic gradient
\[
\Delta\theta_t = \frac{\varepsilon_t}{2}\Big(\nabla\ln p(\theta_t) + \frac{N}{n}\sum_{i=1}^n\nabla\ln p(x_{ti}|\theta_t)\Big)
\]
there are two requirements on ε_t to provably ensure convergence (see Leon Bottou's papers, e.g., one in 1998),
\[
\sum_{t=1}^\infty\varepsilon_t = \infty,\qquad \sum_{t=1}^\infty\varepsilon_t^2 < \infty.
\]

• Intuitively:

– The first one ensures that the parameters can move as far as necessary to find the local optimum.
– The second one ensures that the "noise" from using n ≪ N does not have "infinite impact" and the algorithm converges.

• Often used: ε_t = (a/(b + t))^γ with γ ∈ (0.5, 1].

• MAP (and ML) approaches are good tools to have, but they don't capture uncertainty and can lead to overfitting.

• A typical Bayesian approach is to use MCMC sampling, which is a set of techniques that gives samples from the posterior.


• One MCMC approach is "random walk Metropolis-Hastings":

– Sample η_t ∼ N(0, εI) and set θ*_{t+1} = θ_t + η_t, i.e., θ*_{t+1} ∼ q(θ*_{t+1}|θ_t) = N(θ_t, εI).

– Set θ_{t+1} = θ*_{t+1} with probability
\[
\frac{p(X,\theta^*_{t+1})\,q(\theta_t|\theta^*_{t+1})}{p(X,\theta_t)\,q(\theta^*_{t+1}|\theta_t)} = \frac{p(X,\theta^*_{t+1})}{p(X,\theta_t)},
\]
otherwise θ_{t+1} = θ_t.

• We evaluate the complete joint likelihood to calculate the probability of acceptance.

• After t > T, we can treat the θ_t as samples from the posterior distribution p(θ|X) and use them for empirical averages.

• Langevin dynamics combines gradient methods with the idea of random walk MH sampling.

• Instead of treating t as an index, think of it as a continuous time and let
\[
\theta_{t+\Delta t} = \theta_t + \Delta\theta_t,\qquad \Delta\theta_t = \frac{\varepsilon\,\Delta t}{2}\nabla\ln p(X,\theta_t) + \sqrt{\varepsilon}\,\eta_t,\quad \eta_t \sim N(0,\Delta t\, I).
\]

• We have:

– A random Gaussian component, like random walk MH
– A derivative component, like gradient ascent.

• We could then have an accept/reject step similar to the one on the previous slide (but q doesn't cancel out).

• What happens to θ_{t+Δt} − θ_t = Δθ_t when Δt → 0?
\[
d\theta_t = \frac{\varepsilon}{2}\nabla\ln p(X,\theta_t)\,dt + \sqrt{\varepsilon}\,dW_t.
\]

• We can show that we will always accept the new location.

• On the RHS: The right term randomly explores the space, while the left term pulls in the direction of higher density.

• The path that θ maps out mimics the posterior density (beyond the scope of class to prove this).

• What does this mean? If we wait and collect samples of θ every Δt amount of time (sufficiently large), we can treat them as samples from p(θ|X).

• We would naturally want to think of discretizing this.


• I.e., set θ_{t+Δt} = θ_t + Δθ_t,
\[
\Delta\theta_t = \frac{\varepsilon\,\Delta t}{2}\nabla\ln p(X,\theta_t) + \eta_t,\qquad \eta_t \sim N(0,\varepsilon\,\Delta t\, I),
\]
for 0 < Δt ≪ 1, and accept with probability 1.

• (Notice an equivalent representation is used above.)

• The authors consider replacing εΔt with ε_t such that Σ_{t=1}^∞ ε_t = ∞, Σ_{t=1}^∞ ε_t² < ∞.

• This leads to the insight and contribution of the paper:

Replace ∇ln p(X, θ_t) with its stochastic gradient.

• That is, let θ_{t+1} = θ_t + Δθ_t with
\[
\Delta\theta_t = \frac{\varepsilon_t}{2}\Big(\nabla\ln p(\theta_t) + \frac{N}{n}\sum_{i=1}^n\nabla\ln p(x_{ti}|\theta_t)\Big) + \eta_t,\qquad \eta_t \sim N(0,\varepsilon_t I),
\]
again with Σ_{t=1}^∞ ε_t = ∞, Σ_{t=1}^∞ ε_t² < ∞.
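A minimal sketch of the resulting sampler (grad_log_prior and grad_log_lik are hypothetical model-specific functions; the step-size schedule is one Robbins–Monro choice consistent with the conditions above):

```python
import numpy as np

def sgld(theta, X, grad_log_prior, grad_log_lik, n=32, T=10000,
         a=1e-2, b=10.0, gamma=0.55, seed=0):
    """Stochastic Gradient Langevin Dynamics: returns the trajectory of theta."""
    rng = np.random.default_rng(seed)
    N = len(X)
    samples = []
    for t in range(1, T + 1):
        eps = a * (b + t) ** (-gamma)                    # sum eps_t = inf, sum eps_t^2 < inf
        idx = rng.choice(N, size=n, replace=False)       # minibatch of size n << N
        grad = grad_log_prior(theta) + (N / n) * sum(grad_log_lik(X[i], theta) for i in idx)
        theta = theta + 0.5 * eps * grad + np.sqrt(eps) * rng.standard_normal(theta.shape)
        samples.append(theta.copy())                     # no accept/reject step
    return samples
```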

• Notice that the only difference between this and stochastic gradient for MAP is the addition of η_t.

• The authors give intuition for why we can skip the accept/reject step by setting Σ_{t=1}^∞ ε_t = ∞, Σ_{t=1}^∞ ε_t² < ∞, and also why the following converges to the Langevin dynamics:
\[
\Delta\theta_t = \frac{\varepsilon_t}{2}\Big(\nabla\ln p(\theta_t) + \frac{N}{n}\sum_{i=1}^n\nabla\ln p(x_{ti}|\theta_t)\Big) + \sqrt{\varepsilon_t}\,\eta_t,\qquad \eta_t \sim N(0, I) \;\leftarrow\; \text{notice the equivalent representation}
\]

• It's not intended to be a proof, so we won't go over it.

• However, notice that:

– √ε_t is huge (relatively) compared to ε_t for 0 < ε_t ≪ 1.
– The sums of both terms individually are infinite, so both provide an "infinite" amount of information.


– The sum of the variances of the two terms is finite for the first and infinite for the second, so sub-sampling has finite impact.

• The paper considers the toy problem
\[
\theta_1 \sim N(0,\sigma_1^2),\quad \theta_2 \sim N(0,\sigma_2^2),\qquad x_i \sim \tfrac{1}{2}N(\theta_1,\sigma_x^2) + \tfrac{1}{2}N(\theta_1+\theta_2,\sigma_x^2).
\]

• Set θ_1 = 0, θ_2 = 1, sample N = 100 points.

• Set n = 1 and process the data set 10,000 times.

• (Showed figures and experimental details from the paper.)

Firefly Monte Carlo: Exact MCMC with Subsets of Data

s We’re again in the setting where we have a lot of data that’s conditionally independent given amodel variable θ.

s MCMC algorithms such as random walk Metropolis-Hastings are time consuming:

Set θt+1 ← θ∗t+1 ∼ N(θt, εI) w.p. p(θ∗t+1)∏i p(xi|θ∗t+1)

p(θt)∏i p(xi|θt)

.

s This requires many evaluations in the joint likelihood.

s Evaluating subsets for each t is promising, but previous methods are only approximate orasymptotically correct.

s This paper presents a method for performing exact MCMC sampling by working with subsets.

s High level intuition: Introduce a binary switch for each observation and only process those forwhich the switch is on.

s The switches change randomly each iteration, so they are like fireflies (1 = light, 0 = dark).Hence “Firefly Monte Carlo”.

s This does require nice properties, tricks and simplifications which make this a non-trivialmethod for each new model.

s Again, we have a parameter of interest θ with prior p(θ).


• We have N observations X = {x_1, . . . , x_N} with N huge.

• The likelihood factorizes, therefore the posterior
\[
p(\theta|X) \propto p(\theta)\prod_{n=1}^N p(x_n|\theta).
\]

• For notational convenience let L_n(θ) = p(x_n|θ), so
\[
p(\theta|X) \propto p(\theta)\prod_{n=1}^N L_n(\theta).
\]

• The initial setup for FlyMC is remarkably straightforward:

– Introduce a new function B, with B_n(θ) the evaluation on x_n with parameter θ, such that 0 < B_n(θ) ≤ L_n(θ) for all θ.

– With each x_n associate an auxiliary variable z_n ∈ {0, 1} distributed as
\[
p(z_n|x_n,\theta) = \Big[\frac{L_n(\theta)-B_n(\theta)}{L_n(\theta)}\Big]^{z_n}\Big[\frac{B_n(\theta)}{L_n(\theta)}\Big]^{1-z_n}
\]

• The augmented posterior is
\[
p(\theta,Z|X) \propto p(\theta)\prod_{n=1}^N p(x_n|\theta)\,p(z_n|x_n,\theta).
\]

• Summing out z_n from p(θ)∏_{n=1}^N p(x_n|θ)p(z_n|x_n, θ) leaves the original joint likelihood intact, as required.

• Therefore, this is an auxiliary variable sampling method:

– Before: Sample θ with Metropolis-Hastings
– Now: Iterate sampling each z_n and then θ with MH.

• As with all MCMC methods, to get samples from the desired p(θ|X), keep the θ samples and throw away the Z samples.

• What has been done so far is always possible as a general rule. The question is whether it makes things easier (here, faster).


• To this end, we need to look at the augmented joint likelihood. For a particular term,
\[
p(x_n|\theta)\,p(z_n|x_n,\theta) = L_n(\theta)\Big[\frac{L_n(\theta)-B_n(\theta)}{L_n(\theta)}\Big]^{z_n}\Big[\frac{B_n(\theta)}{L_n(\theta)}\Big]^{1-z_n} = \begin{cases}L_n(\theta)-B_n(\theta) & \text{if } z_n = 1\\ B_n(\theta) & \text{if } z_n = 0\end{cases}
\]

• For each iteration, we only need to evaluate the likelihood L_n(θ) at points where z_n = 1. Alarm bells: What about B_n(θ)?

• What about all the B_n(θ), which appear regardless of z_n?

• Look at the augmented joint likelihood:
\[
p(X,Z,\theta) = \tilde{p}(\theta)\prod_{n:z_n=1}\tilde{L}_n(\theta),\qquad \tilde{p}(\theta) = p(\theta)\prod_{n=1}^N B_n(\theta),\qquad \tilde{L}_n(\theta) = \frac{L_n(\theta)-B_n(\theta)}{B_n(\theta)}.
\]

• When we sample a new θ using MH, we compute p(X, Z, θ*_{t+1}) / p(X, Z, θ_t).

• We have to evaluate:

– p(θ) → fast
– L̃_n(θ) for each n : z_n = 1 → fast if Σ_n z_n ≪ N
– ∏_{n=1}^N B_n(θ) → fast?

• This therefore indicates the two requirements we have on B_n(θ) for this to work well:

– B_n(θ) is close to L_n(θ) for many n and θ, so that few z_n = 1.
– ∏_{n=1}^N B_n(θ) has to be quickly evaluated.

• Condition 1 states that B_n(θ) is a tight lower bound of L_n(θ). Recall that p(z_n = 1|x_n, θ) = (L_n(θ) − B_n(θ))/L_n(θ).

• Condition 2 essentially means that B allows us to only have to evaluate a function of θ and statistics of X = {x_1, . . . , x_N}.

• (Showed intuitive figure from paper.)


• This leaves two issues:

– Actually choosing the function B_n(θ).
– Sampling z_n for each n.

• Notice that to sample z_n ∼ Bern((L_n(θ) − B_n(θ))/L_n(θ)), we have to evaluate L_n(θ) and B_n(θ) for each n.

• So our computations have actually increased.

• The paper presents methods for dealing with this as well.

• Problem 1: Choosing B_n(θ). Solution: Problem specific.

• Example: Logistic regression. For x_n ∈ R^d and y_n ∈ {−1, 1},
\[
L_n(\theta) = (1 + e^{-y_n x_n^T\theta})^{-1} \geq e^{a(\xi)(y_n x_n^T\theta)^2 + \frac{1}{2}(y_n x_n^T\theta) + c(\xi)} = B_n(\theta).
\]

• a(ξ) and c(ξ) are functions of an auxiliary parameter ξ.

• The product is
\[
\prod_{n=1}^N B_n(\theta) = e^{a(\xi)\,\theta^T(\sum_n x_n x_n^T)\theta + \frac{1}{2}\theta^T(\sum_n y_n x_n) + N c(\xi)}
\]

• We can calculate the necessary statistics from the data in advance. ∏_{n=1}^N B_n(θ) is then extremely fast to compute.
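A minimal sketch of that precomputation and of evaluating log ∏_n B_n(θ) (a(ξ) and c(ξ) are passed in as plain numbers, since their exact forms are not given in the notes):

```python
import numpy as np

def precompute_stats(X, y):
    """One-time statistics: S = sum_n x_n x_n^T and s = sum_n y_n x_n, for X (N, d) and y in {-1, +1}^N."""
    return X.T @ X, X.T @ y

def log_prod_B(theta, S, s, N, a_xi, c_xi):
    """log prod_n B_n(theta) = a(xi) * theta^T S theta + 0.5 * theta^T s + N * c(xi)."""
    return a_xi * theta @ S @ theta + 0.5 * theta @ s + N * c_xi
```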

• Problem 2: Sampling z_n. Solution: Two general methods.

– Simplest: Sample new z_n for a random subset each iteration.
– Alternative: Introduce an MH step to propose switching light→dark and dark→light.

• Focus on z_n:

– Propose z_n → z′_n by sampling z′_n ∼ q(z′_n|z_n).
– Accept with probability [p(z′_n|x_n, θ) q(z_n|z′_n)] / [p(z_n|x_n, θ) q(z′_n|z_n)].

• Immediate observation: Always accept if z′_n = z_n.


• Case z′_n = 0, z_n = 1: Sample the total number from Bin(Σ_n z_n, q_{1→0}), then pick which x_n uniformly without replacement.
Observation: Can re-use L_n(θ) and B_n(θ) if rejected.

• Case z′_n = 1, z_n = 0: Sample from Bin(N − Σ_n z_n, q_{0→1}), then pick which x_n uniformly without replacement.
Observation: Want q_{0→1} small to reduce evaluations of L, B.

• Final consideration: We want B_n(θ) tight, which is done by setting its auxiliary parameters.

• The authors discuss this in their paper for the logistic regression problem.

• It's an important consideration since it controls how many z_n = 1 in an iteration. A tighter B_n(θ) means fewer z_n = 1, which means a faster algorithm.

• This procedure may require running a fast algorithm in advance (e.g., stochastic MAP).

• (Showed figures and experiments from the paper.)