Bayesian Nonparametrics: Models Based on the Dirichlet Process

Alessandro Panella

Department of Computer Science, University of Illinois at Chicago

Machine Learning Seminar Series, February 18, 2013

Sources and Inspirations

Tutorials (slides):
P. Orbanz and Y.W. Teh, Modern Bayesian Nonparametrics. NIPS 2011.
M. Jordan, Dirichlet Process, Chinese Restaurant Process, and All That. NIPS 2005.

Articles etc.:
E.B. Sudderth, Chapter in PhD thesis, 2006.
E. Fox, Chapter in PhD thesis, 2008.
Y.W. Teh, Dirichlet Processes. Encyclopedia of Machine Learning, 2010. Springer.
...

Outline

1 Introduction and background
    Bayesian learning
    Nonparametric models

2 Finite mixture models
    Bayesian models
    Clustering with FMMs
    Inference

3 Dirichlet process mixture models
    Going nonparametric!
    The Dirichlet process
    DP mixture models
    Inference

4 A little more theory...
    De Finetti's REDUX
    Dirichlet process REDUX

5 The hierarchical Dirichlet process

Introduction and background Bayesian learning

The meaning of it all

BAYESIAN NONPARAMETRICS

Introduction and background Bayesian learning

Bayesian statistics

Estimate a parameter θ ∈ Θ after observing data x.

Frequentist
Maximum Likelihood (ML): θ_MLE = argmax_θ p(x|θ) = argmax_θ L(θ; x)

Bayesian
Bayes Rule: p(θ|x) = p(x|θ) p(θ) / p(x)

Bayesian prediction (using the whole posterior, not just one estimator):

p(x_new|x) = ∫_Θ p(x_new|θ) p(θ|x) dθ

Maximum A Posteriori (MAP): θ_MAP = argmax_θ p(x|θ) p(θ)

Introduction and background Bayesian learning

De Finetti's theorem

A premise:

Definition
An infinite sequence of random variables (x_1, x_2, ...) is said to be (infinitely) exchangeable if, for every N and every possible permutation π on (1, ..., N),

p(x_1, x_2, ..., x_N) = p(x_π(1), x_π(2), ..., x_π(N))

Note: exchangeability is not the same as i.i.d.!

Example (Polya Urn)
An urn contains some red balls and some black balls; an infinite sequence of colors is drawn recursively as follows: draw a ball, mark down its color, then put the ball back in the urn along with an additional ball of the same color.
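A minimal simulation of the Polya urn scheme, as a sketch (the function name and initial counts are illustrative):

    import random

    def polya_urn(n_draws, n_red=1, n_black=1, seed=0):
        """Draw a sequence of colors from a Polya urn: each drawn ball is
        returned together with an extra ball of the same color."""
        rng = random.Random(seed)
        counts = {"red": n_red, "black": n_black}
        sequence = []
        for _ in range(n_draws):
            total = counts["red"] + counts["black"]
            color = "red" if rng.random() < counts["red"] / total else "black"
            counts[color] += 1   # put the ball back plus one more of its color
            sequence.append(color)
        return sequence

    print(polya_urn(10))

The resulting sequence is exchangeable but not i.i.d.: the probability of a sequence depends only on how many balls of each color appear in it, not on their order.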

Introduction and background Bayesian learning

De Finetti's theorem (cont'd)

Theorem (De Finetti, 1935. Aka Representation Theorem)
A sequence of random variables (x_1, x_2, ...) is infinitely exchangeable if and only if, for all N, there exists a random variable θ and a probability measure p on it such that

p(x_1, x_2, ..., x_N) = ∫_Θ p(θ) ∏_{i=1}^N p(x_i|θ) dθ

i.e., there exists a parameter space and a measure on it that make the variables iid!

The representation theorem motivates (and encourages!) the use of Bayesian statistics.

Introduction and background Bayesian learning

Bayesian learning

Hypothesis space H. Given data D, compute

p(h|D) = p(D|h) p(h) / p(D)

Then, we probably want to predict some future data D′, by either:
Averaging over H, i.e. p(D′|D) = ∫_H p(D′|h) p(h|D) dh
Choosing the MAP h (or computing it directly), i.e. p(D′|D) = p(D′|h_MAP)
Sampling from the posterior...

H can be anything! Bayesian learning is a general learning framework.
We will consider the case in which h is a probabilistic model itself, i.e. a parameter vector θ.

Introduction and background Bayesian learning

A simple example

Infer the bias θ ∈ [0, 1] of a coin after observing N tosses.

H = 1, T = 0, p(H) = θ
h = θ, hence H = [0, 1]

Sequence of Bernoulli trials:

p(x_1, ..., x_N|θ) = θ^{n_H} (1 − θ)^{N − n_H}

where n_H = # heads. Unknown θ:

p(x_1, ..., x_N) = ∫_0^1 θ^{n_H} (1 − θ)^{N − n_H} p(θ) dθ

Need to find a "good" prior p(θ)... Beta distribution!

[Graphical model: θ generates x_1, ..., x_N; equivalently, in plate notation, θ generates x_i, i = 1 ... N.]

Introduction and background Bayesian learning

A simple example (cont'd)

Beta distribution: θ ∼ Beta(a, b)

p(θ|a, b) = (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1}

Bayesian learning: p(h|D) ∝ p(D|h) p(h); for us:

p(θ|x_1, ..., x_N) ∝ p(x_1, ..., x_N|θ) p(θ)
                   = θ^{n_H} (1 − θ)^{n_T} (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1}
                   ∝ θ^{n_H + a − 1} (1 − θ)^{n_T + b − 1}

i.e. θ|x_1, ..., x_N ∼ Beta(a + n_H, b + n_T)

We're lucky! The Beta distribution is a conjugate prior to the binomial distribution.

[Plots: example Beta densities: Beta(0.1, 0.1), Beta(1, 1), Beta(2, 3), Beta(10, 10).]
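A minimal sketch of this conjugate update and its posterior predictive, assuming numpy and scipy (names are illustrative):

    import numpy as np
    from scipy import stats

    def beta_bernoulli_posterior(tosses, a=1.0, b=1.0):
        """Beta(a, b) prior + Bernoulli tosses (1 = heads)
        -> Beta(a + n_H, b + n_T) posterior."""
        tosses = np.asarray(tosses)
        n_heads = int(tosses.sum())
        n_tails = tosses.size - n_heads
        return stats.beta(a + n_heads, b + n_tails)

    posterior = beta_bernoulli_posterior([1, 0, 1, 1], a=2, b=2)
    # Posterior predictive p(x_new = H | x) is the posterior mean of theta:
    print(posterior.mean())   # (2 + 3) / (2 + 2 + 4) = 0.625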

Introduction and background Bayesian learning

A simple example (cont’d)

Three sequences of four tosses:

H T H H

H H H T

H H H H

Introduction and background Nonparametric models

Nonparametric models

"Nonparametric" doesn't mean "no parameters"! Rather:
The number of parameters grows as more data are observed.
∞-dimensional parameter space.
Finite data ⇒ bounded number of parameters.

Definition
A nonparametric model is a Bayesian model on an ∞-dimensional parameter space.

Terminology
Parametric model: the number of parameters is fixed (or constantly bounded) w.r.t. sample size.
Nonparametric model: the number of parameters grows with sample size; ∞-dimensional parameter space.

Example: density estimation. [Figure: a parametric fit (samples from a single Gaussian centered on a mean µ) vs. nonparametric Parzen-window density estimates of the same data for several window widths h. From Orbanz and Teh, NIPS 2011.]

Finite mixture models Bayesian models

Models in Bayesian data analysis

Model
Generative process: expresses how we think the data are generated.
Contains hidden variables (the subject of learning).
Specifies relations between variables, e.g. graphical models.

Posterior inference
Knowing p(D|M, θ) ("how the data are generated"), compute p(θ|M, D).
Akin to "reversing" the generative process.

[Graphical model: parameters θ with prior p(θ) and model M generate data D via p(D|M, θ); inference recovers p(θ|D, M).]

Finite mixture models Clustering with FMMs

Finite mixture models (FMMs)

Bayesian approach to clustering. Each data point is assumed to belong to one of K clusters.

General form
A sequence of data points x = (x_1, ..., x_N), each with probability

p(x_i|π, θ_1, ..., θ_K) = ∑_{k=1}^K π_k f(x_i|θ_k),   π ∈ Π_{K−1} (the (K−1)-simplex)

Generative process (a sampling sketch follows)
For each i:
Draw a cluster assignment z_i ∼ π.
Draw a data point x_i ∼ F(θ_{z_i}).
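A minimal sketch of this generative process for a univariate Gaussian mixture (the parameter values match the example on the next slide):

    import numpy as np

    def sample_fmm(n, weights, means, stds, seed=0):
        """Sample n points from a finite mixture of 1-D Gaussians."""
        rng = np.random.default_rng(seed)
        z = rng.choice(len(weights), size=n, p=weights)   # z_i ~ pi
        x = rng.normal(means[z], stds[z])                 # x_i ~ N(mu_{z_i}, sigma_{z_i})
        return z, x

    z, x = sample_fmm(500,
                      weights=np.array([0.15, 0.25, 0.6]),
                      means=np.array([1.0, 4.0, 6.0]),
                      stds=np.array([1.0, 0.5, 0.7]))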

Finite mixture models Clustering with FMMs

FMMs (example)

Mixture of univariate Gaussians: θ_k = (µ_k, σ_k), x_i ∼ N(µ_k, σ_k)

p(x_i|π, µ, σ) = ∑_{k=1}^K π_k f_N(x_i; µ_k, σ_k)

[Plot: mixture density with components N(1, 1), N(4, .5), N(6, .7) and weights π = (0.15, 0.25, 0.6).]

Finite mixture models Clustering with FMMs

FMMs (cont'd)

Clustering with FMMs

Need priors for π, θ:
Usually, π is given a (symmetric) Dirichlet distribution prior.
The θ_k's are given a suitable prior H, depending on the data.

π ∼ Dir(α/K, ..., α/K)
θ_k|H ∼ H                 k = 1 ... K
z_i|π ∼ π
x_i|θ, z_i ∼ F(θ_{z_i})   i = 1 ... N

[Graphical model: α → π → z_i → x_i ← θ_k ← H, with plates over i = 1 ... N and k = 1 ... K.]

Finite mixture models Clustering with FMMs

Dirichlet distribution

Multivariate generalization of the Beta distribution.

[Plots: Dirichlet densities on the simplex for Dir(1, 1, 1), Dir(2, 2, 2), Dir(5, 5, 5), Dir(5, 5, 2), Dir(5, 2, 2), and Dir(0.7, 0.7, 0.7). From Teh, MLSC 2008.]

Finite mixture models Clustering with FMMs

Dirichlet distribution (cont'd)

π ∼ Dir(α/K, ..., α/K) iff p(π_1, ..., π_K) = (Γ(α) / ∏_k Γ(α/K)) ∏_{k=1}^K π_k^{α/K − 1}

Conjugate prior to the categorical/multinomial, i.e.

π ∼ Dir(α/K, ..., α/K)
z_i ∼ π   i = 1 ... N

implies

π|z_1, ..., z_N ∼ Dir(α/K + n_1, α/K + n_2, ..., α/K + n_K)

Moreover,

p(z_1, ..., z_N|α) = (Γ(α) / Γ(α + N)) ∏_{k=1}^K Γ(n_k + α/K) / Γ(α/K)

and

p(z_i = k|z_{−i}, α) = (n_k^{(−i)} + α/K) / (α + N − 1)
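A small numerical check of these two formulas, as a sketch (numpy assumed; names are illustrative):

    import numpy as np

    def dirichlet_posterior(z, K, alpha):
        """Dir(alpha/K, ..., alpha/K) prior + assignments z in {0, ..., K-1}
        -> Dir(alpha/K + n_1, ..., alpha/K + n_K); returns the parameters."""
        return alpha / K + np.bincount(z, minlength=K)

    def predictive(z, i, K, alpha):
        """p(z_i = k | z_{-i}, alpha) for all k, holding out observation i."""
        counts = np.bincount(np.delete(z, i), minlength=K)
        return (counts + alpha / K) / (alpha + len(z) - 1)

    z = np.array([0, 0, 1, 2, 2, 2])
    print(dirichlet_posterior(z, K=3, alpha=1.0))   # [2.33, 1.33, 3.33]
    print(predictive(z, i=0, K=3, alpha=1.0))       # sums to 1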

Finite mixture models Inference

Inference in FMMs

Clustering: infer z (marginalize over π, θ)

p(z|x, α, H) = p(x|z, H) p(z|α) / ∑_z p(x|z, H) p(z|α),   where

p(z|α) = (Γ(α) / Γ(α + N)) ∏_{k=1}^K Γ(n_k + α/K) / Γ(α/K)

p(x|z, H) = ∫_Θ [∏_{i=1}^N p(x_i|θ_{z_i}) ∏_{k=1}^K H(θ_k)] dθ

Parameter estimation: infer π, θ

p(π, θ|x, α, H) = ∑_z [p(π|z, α) ∏_{k=1}^K p(θ_k|x, z, H)] p(z|x, α, H)

⇒ No analytic procedure: the sums over all partitions z are intractable.

Finite mixture models Inference

Approximate inference for FMMs

No exact inference because of the unknown cluster identifiers z.

Expectation-Maximization (EM)
Widely used, but we will focus on MCMC because of the connection with the Dirichlet process.

Gibbs sampling
Markov chain Monte Carlo (MCMC) integration method.
Set of random variables v = {v_1, v_2, ..., v_M}; we want to compute p(v).
Randomly initialize their values.
At each iteration, sample one variable v_i and hold the rest constant:

v_i^{(t)} ∼ p(v_i | v_j^{(t−1)}, j ≠ i)   ← usually tractable
v_j^{(t)} = v_j^{(t−1)}   for j ≠ i

This creates a Markov chain with p(v) as its equilibrium distribution. (A generic skeleton follows.)
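A generic Gibbs-sampling skeleton, as a sketch (the conditional-sampler interface is illustrative):

    import numpy as np

    def gibbs(v_init, conditional_samplers, n_iters, seed=0):
        """Generic Gibbs sampler.

        conditional_samplers[i](v, rng) draws v_i from p(v_i | v_{-i});
        all other coordinates of v are held fixed during that draw."""
        rng = np.random.default_rng(seed)
        v = list(v_init)
        samples = []
        for _ in range(n_iters):
            for i, sample_vi in enumerate(conditional_samplers):
                v[i] = sample_vi(v, rng)   # v_i^(t) ~ p(v_i | rest)
            samples.append(list(v))
        return samples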

Finite mixture models Inference

Gibbs sampling for FMMs

State variables: z_1, ..., z_N, θ_1, ..., θ_K, π.

Conditional distributions:

p(π|z, θ) = Dir(α/K + n_1, ..., α/K + n_K)

p(θ_k|x, z) ∝ p(θ_k) ∏_{i:z_i=k} p(x_i|θ_k) = H(θ_k) ∏_{i:z_i=k} F_{θ_k}(x_i)

p(z_i = k|π, θ, x) ∝ p(z_i = k|π) p(x_i|z_i = k, θ_k) = π_k F_{θ_k}(x_i)

We can avoid sampling π by collapsing it out:

p(z_i = k|z_{−i}, θ, x) ∝ p(x_i|θ_k) p(z_i = k|z_{−i}) ∝ F_{θ_k}(x_i) (n_k^{(−i)} + α/K)
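A compact sketch of these updates for a 1-D Gaussian mixture with known variance σ² and a N(0, τ²) prior on each mean (hyperparameters are illustrative; the mean update uses the standard normal-normal conjugate posterior):

    import numpy as np

    def gibbs_fmm(x, K, alpha=1.0, sigma=1.0, tau=3.0, n_iters=100, seed=0):
        """Gibbs sampler for a K-component 1-D Gaussian mixture with known
        component variance sigma^2 and N(0, tau^2) prior on the means; the
        mixture weights pi are collapsed out."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        N = len(x)
        z = rng.integers(K, size=N)
        mu = rng.normal(0.0, tau, size=K)
        for _ in range(n_iters):
            # 1) z_i | z_{-i}, theta, x  (pi integrated out)
            for i in range(N):
                counts = np.bincount(np.delete(z, i), minlength=K)
                lik = np.exp(-0.5 * ((x[i] - mu) / sigma) ** 2)
                p = (counts + alpha / K) * lik
                z[i] = rng.choice(K, p=p / p.sum())
            # 2) theta_k | x, z  (normal-normal conjugacy)
            for k in range(K):
                xk = x[z == k]
                prec = 1.0 / tau**2 + len(xk) / sigma**2
                mu[k] = rng.normal((xk.sum() / sigma**2) / prec, 1.0 / np.sqrt(prec))
        return z, mu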

Finite mixture models Inference

Gibbs sampling for FMMs (example)

Mixture of 4 bivariate Gaussians.
Normal-inverse-Wishart prior on θ_k = (µ_k, Σ_k), conjugate to the normal distribution:

Σ_k ∼ W(ν, ∆)   µ_k ∼ N(ϑ, Σ_k/κ)

[Figure: sampler state after T=2, T=10, T=40 iterations from two random initializations; each plot is labeled with the current data log-likelihood. From Sudderth, 2008.]

Finite mixture models Inference

FMMs: alternative representation

Indicator variable representation:

π ∼ Dir(α/K, ..., α/K)
θ_k ∼ H
z_i ∼ π
x_i ∼ F(θ_{z_i})

Mixture models can equivalently be expressed in terms of a discrete distribution G taking K distinct values:

G(θ) = ∑_{k=1}^K π_k δ(θ, θ_k),   θ_k ∼ H,   π ∼ Dir(α/K, ..., α/K)

θ_i ∼ G
x_i ∼ F(θ_i)

[Figure: the two directed graphical models for a K-component mixture: the indicator form, where z_i ∼ π selects the cluster that generates x_i, and the distributional form, where θ_i ∼ G are the parameters of the cluster that generates x_i. From Sudderth, 2008.]

Dirichlet process mixture models Going nonparametric!

Going nonparametric!

The problem with finite FMMs
What if K is unknown? How many parameters?

Idea
Let's use ∞ parameters! We want something of the kind:

p(x_i|π, θ_1, θ_2, ...) = ∑_{k=1}^∞ π_k p(x_i|θ_k)

How to define such a measure?
We'd like the nice conjugacy properties of the Dirichlet to carry over...
Is there such a thing as the ∞ limit of a Dirichlet?

Dirichlet process mixture models The Dirichlet process

The (practical) Dirichlet process

The Dirichlet process is a distribution over probability measures over Θ:

DP(α, H)

H(θ) is the base (mean) measure.
Think µ for a Gaussian... but in the space of probability measures.

α is the concentration parameter.
Controls the dispersion around the mean H.

Dirichlet process mixture models The Dirichlet process

The Dirichlet process (cont'd)

A draw G ∼ DP(α, H) is an infinite discrete probability measure:

G(θ) = ∑_{k=1}^∞ π_k δ(θ, θ_k),   where

θ_k ∼ H, and π is sampled from a "stick-breaking prior":

β_k ∼ Beta(1, α)
π_k = β_k ∏_{l=1}^{k−1} (1 − β_l)

The masses decrease on average (the GEM distribution); sorted into strictly decreasing order they follow the Poisson-Dirichlet distribution. (from Orbanz & Teh, 2008)

Break a stick
Imagine a stick of length one. For k = 1 ... ∞, do the following:
Break off a fraction β_k ∼ Beta(1, α) of the remaining stick.
Let π_k be the length of the piece broken off, and keep the remainder.

Following standard convention, we write π ∼ GEM(α). (Details in the second part of the talk; a sampling sketch follows.)
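A truncated stick-breaking sampler, as a sketch (the truncation level is an approximation choice, not part of the definition):

    import numpy as np

    def stick_breaking(alpha, n_atoms, base_sampler, seed=0):
        """Draw a truncated sample G ~ DP(alpha, H): atom locations
        theta_k ~ H and weights pi ~ GEM(alpha), truncated to n_atoms."""
        rng = np.random.default_rng(seed)
        beta = rng.beta(1.0, alpha, size=n_atoms)                # break fractions
        remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
        pi = beta * remaining        # pi_k = beta_k * prod_{l<k} (1 - beta_l)
        theta = base_sampler(rng, n_atoms)                       # theta_k ~ H
        return theta, pi

    # Example with a standard normal base measure H:
    theta, pi = stick_breaking(alpha=2.0, n_atoms=50,
                               base_sampler=lambda rng, n: rng.normal(0.0, 1.0, n))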

Dirichlet process mixture models The Dirichlet process

Stick-breaking, intuitively

[Figure: sequential stick-breaking construction of the mixture weights π ∼ GEM(α). The first weight π_1 ∼ Beta(1, α); each subsequent weight π_k is a random Beta(1, α) proportion β_k of the remaining, unbroken stick of probability mass. Shown alongside: the first 20 weights from four random constructions, two with α = 1 and two with α = 5; note that the weights π_k do not monotonically decrease. From Sudderth, 2008.]

Small α ⇒ lots of weight assigned to few θ_k's ⇒ G will be very different from the base measure H.

Large α ⇒ weight distributed more equally across the θ_k's ⇒ G will resemble the base measure H.

Dirichlet process mixture models The Dirichlet process

[Figure: a base measure H and a discrete draw G ∼ DP(α, H). From Navarro et al., 2005.]

Dirichlet process mixture models DP mixture models

The DP mixture model (DPMM)

Let's use G ∼ DP(α, H) to build an infinite mixture model:

G ∼ DP(α, H)
θ_i ∼ G
x_i ∼ F(θ_i)

[Figure: directed graphical models of the DP mixture, in indicator form (π ∼ GEM(α), z_i ∼ π, x_i ∼ F(θ_{z_i})) and in the distributional form above, where G is an infinite discrete distribution on Θ. From Sudderth, 2008.]

Dirichlet process mixture models DP mixture models

DPM (cont'd)

Using explicit cluster indicators z = (z_1, z_2, ..., z_N):

π ∼ GEM(α)
θ_k ∼ H             k = 1, ..., ∞
z_i ∼ π
x_i ∼ F(θ_{z_i})    i = 1, ..., N

Dirichlet process mixture models DP mixture models

Chinese restaurant process

So far, we only have a generative model.
Is there a "nice" conjugacy property to use during inference?
It turns out (details in part 2) that, if

π ∼ GEM(α)
z_i ∼ π

then the distribution p(z|α) = ∫ p(z|π) p(π) dπ is easily tractable, and is known as the Chinese restaurant process (CRP).

Dirichlet process mixture models DP mixture models

Chinese restaurant process (cont'd)

Restaurant with ∞ tables, each with ∞ capacity.

z_i = table at which customer i sits upon entering.
Customer 1 sits at table 1.
Customer 2 sits:
  at table 1 w. prob. ∝ 1
  at table 2 w. prob. ∝ α
Customer i sits:
  at table k w. prob. ∝ n_k (# people at table k)
  at a new table w. prob. ∝ α

p(z_i = k) = n_k / (α + i − 1)
p(z_i = k_new) = α / (α + i − 1)

(A simulation sketch follows.)
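A minimal CRP simulator, as a sketch:

    import numpy as np

    def crp(n_customers, alpha, seed=0):
        """Simulate table assignments from a Chinese restaurant process."""
        rng = np.random.default_rng(seed)
        z = [0]                          # customer 1 sits at the first table
        counts = [1]                     # number of people at each table
        for i in range(1, n_customers):
            # existing table k w.p. n_k / (alpha + i), new table w.p. alpha / (alpha + i)
            probs = np.array(counts + [alpha]) / (alpha + i)
            table = rng.choice(len(probs), p=probs)
            if table == len(counts):
                counts.append(1)         # open a new table
            else:
                counts[table] += 1
            z.append(table)
        return z

    print(crp(20, alpha=2.0))

The number of occupied tables grows roughly as α log n, matching the growth of K noted on the next slide.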

Dirichlet process mixture models Inference

Gibbs sampling for DPMMs

Via the CRP, we can find the conditional distributions for Gibbs sampling.
State: θ_1, ..., θ_K, z.

p(θ_k|x, z) ∝ p(θ_k) ∏_{i:z_i=k} p(x_i|θ_k) = h(θ_k) ∏_{i:z_i=k} f(x_i|θ_k)

p(z_i = k|z_{−i}, θ, x) ∝ p(x_i|θ_k) p(z_i = k|z_{−i})
  ∝ n_k^{(−i)} f(x_i|θ_k)   for an existing k
  ∝ α f(x_i|θ_new)          for a new k, with θ_new ∼ H (or marginalized under H)

K grows as more data are observed, asymptotically as α log n. (A sketch follows.)
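A compact sketch of this sampler for a 1-D Gaussian DPMM with known variance σ² and base measure H = N(0, τ²); here a new cluster's likelihood uses the marginal under H, i.e. x_i ∼ N(0, σ² + τ²), and hyperparameters are illustrative:

    import numpy as np

    def normal_pdf(x, mean, var):
        return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

    def gibbs_dpmm(x, alpha=1.0, sigma=1.0, tau=3.0, n_iters=100, seed=0):
        """CRP Gibbs sampler for a DP mixture of 1-D Gaussians with known
        variance sigma^2 and base measure H = N(0, tau^2)."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        N = len(x)
        z = np.zeros(N, dtype=int)               # all points start in one cluster
        mus = np.array([x.mean()])
        for _ in range(n_iters):
            # 1) resample each z_i given z_{-i} and the cluster means
            for i in range(N):
                counts = np.bincount(np.delete(z, i), minlength=len(mus))
                lik = normal_pdf(x[i], mus, sigma**2)
                new_lik = normal_pdf(x[i], 0.0, sigma**2 + tau**2)
                p = np.append(counts * lik, alpha * new_lik)
                k = rng.choice(len(p), p=p / p.sum())
                if k == len(mus):                # open a new cluster
                    mus = np.append(mus, rng.normal(0.0, tau))
                z[i] = k
            # 2) drop clusters left empty during the sweep, and relabel
            keep = np.unique(z)
            z = np.searchsorted(keep, z)
            mus = mus[keep]
            # 3) resample cluster means (normal-normal conjugacy)
            for k in range(len(mus)):
                xk = x[z == k]
                prec = 1.0 / tau**2 + len(xk) / sigma**2
                mus[k] = rng.normal((xk.sum() / sigma**2) / prec, 1.0 / np.sqrt(prec))
        return z, mus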

Dirichlet process mixture models Inference

Gibbs sampling for DPMMs (example)

Mixture of bivariate Gaussians.

[Figure: DP mixture Gibbs sampler state after T=2, T=10, T=40 iterations from two initializations; each plot shows the clusters currently assigned to observations and the corresponding data log-likelihood. From Sudderth, 2008.]

Dirichlet process mixture models Inference

END OF FIRST PART.

A little more theory... De Finetti's REDUX

De Finetti's REDUX

Theorem (De Finetti, 1935. Aka Representation Theorem)
A sequence of random variables (x_1, x_2, ...) is infinitely exchangeable if and only if, for all N, there exists a random variable θ and a probability measure p on it such that

p(x_1, x_2, ..., x_N) = ∫_Θ p(θ) ∏_{i=1}^N p(x_i|θ) dθ

The theorem wouldn't be true if θ's range were limited to Euclidean vector spaces.
We need to allow θ to range over measures.
⇒ p(θ) is a distribution on measures, like the DP.

A little more theory. . . Dirichlet process REDUX

Dirichlet Process REDUX

Definition
Let Θ be a measurable space (of parameters), H be a probability distribution on Θ, and α a positive scalar. A Dirichlet process is the distribution of a random probability measure G over Θ, such that for any finite partition (T1, . . . , TK) of Θ, we have

(G(T1), . . . , G(TK)) ∼ Dir(αH(T1), . . . , αH(TK)).

[Figure 2.21 (from Sudderth, 2006): Dirichlet processes induce Dirichlet distributions on every finite, measurable partition. Left: an example base measure H on a bounded, two-dimensional space Θ (darker regions have higher probability). Center: a partition with K = 3 cells; the weight that a random measure G ∼ DP(α, H) assigns to these cells follows a Dirichlet distribution, with each cell Tk shaded according to its mean E[G(Tk)] = H(Tk). Right: another partition with K = 5 cells; the consistency of G implies, for example, that G(T1) + G(T2) and G(T̃1) follow identical beta distributions.]

Proposition 2.5.1. Let G ∼ DP(α, H) be a random measure distributed according to a Dirichlet process. Given N independent observations θi ∼ G, the posterior measure also follows a Dirichlet process:

p(G | θ1, . . . , θN, α, H) = DP(α + N, (1 / (α + N)) (αH + ∑_{i=1}^{N} δθi))   (2.169)

Proof. As shown by Ferguson [83], this result follows directly from the conjugate form of finite Dirichlet posterior distributions (see eq. (2.45)). See Sethuraman [254] for an alternative proof.

There are interesting similarities between eq. (2.169) and the general form of conjugate priors for exponential families (see Prop. 2.1.4). The Dirichlet process effectively defines a conjugate prior for distributions on arbitrary measurable spaces. In some contexts, the concentration parameter α can then be seen as expressing confidence in the base measure H via the size of a pseudo-dataset (see [113] for further discussion).

(from Sudderth, 2006)
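A small sanity check of this definition (my sketch, not from the slides): draw many random measures G via a truncated stick-breaking construction and compare the empirical moments of (G(T1), . . . , G(TK)) on a fixed partition with the Dirichlet values E[G(Tk)] = H(Tk) and Var[G(Tk)] = H(Tk)(1 − H(Tk))/(α + 1). Assumes numpy and scipy; the truncation level is a pragmatic approximation.

```python
# Check that (G(T1),...,G(TK)) behaves like Dir(αH(T1),...,αH(TK)) when
# G ~ DP(α, H) is simulated by truncated stick-breaking, with H = N(0, 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, trunc, n_draws = 2.0, 500, 2000
edges = np.array([-np.inf, -1.0, 0.5, np.inf])    # K = 3 cells of the real line

masses = np.zeros((n_draws, len(edges) - 1))
for d in range(n_draws):
    b = rng.beta(1.0, alpha, size=trunc)                       # stick proportions
    pi = b * np.concatenate(([1.0], np.cumprod(1 - b)[:-1]))   # weights π_k
    theta = rng.standard_normal(trunc)                         # atoms θ_k ~ H
    for k in range(len(edges) - 1):
        masses[d, k] = pi[(theta >= edges[k]) & (theta < edges[k + 1])].sum()

H = np.diff(norm.cdf(edges))                          # H(Tk) for each cell
print(masses.mean(axis=0), H)                         # empirical means ≈ H(Tk)
print(masses.var(axis=0), H * (1 - H) / (alpha + 1))  # ≈ Dirichlet variances
```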


A little more theory. . . Dirichlet process REDUX

Posterior conjugacy

Via the conjugacy of the Dirichlet distribution, we know that:

p(G(T1), . . . ,G(TK)|θ ∈ Tk) = Dir(αH(T1), . . . , αH(Tk) + 1, . . . , αH(TK))

Formalizing this analysis, we obtain that if

G ∼ DP(α, H),   θi ∼ G,   i = 1, . . . , N,

the posterior measure also follows a Dirichlet process:

p(G | θ1, . . . , θN, α, H) = DP(α + N, (1 / (α + N)) (αH + ∑_{i=1}^{N} δθi))

The DP defines a conjugate prior for distributions on arbitrary measure spaces.
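A minimal sketch (my own, assuming numpy) of what this conjugacy buys computationally: a draw from the updated base measure (αH + ∑i δθi)/(α + N) is an old observation with probability N/(α + N), uniformly among the θi, or a fresh draw from H with probability α/(α + N).

```python
# One draw from the posterior base measure (αH + Σ_i δ_{θ_i}) / (α + N):
# reuse a previous θ_i (∝ its multiplicity) or draw fresh from H.
import numpy as np

rng = np.random.default_rng(1)

def draw_posterior_base(obs, alpha, H=rng.standard_normal):
    N = len(obs)
    if rng.random() < alpha / (alpha + N):
        return H()                         # fresh atom from H = N(0, 1) here
    return obs[rng.integers(N)]            # uniform over past θ_i, ties favored

obs = np.array([0.3, 0.3, -1.2])           # ties allowed: G is discrete
print([round(draw_posterior_base(obs, alpha=2.0), 3) for _ in range(5)])
```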


A little more theory. . . Dirichlet process REDUX

Generating samples: stick breaking

Sethuraman (1994): equivalent definition of the Dirichlet process, through the stick-breaking construction.

G ∼ DP(α, H)   iff   G(θ) = ∑_{k=1}^{∞} πk δ(θ, θk),

where θk ∼ H, and

πk = βk ∏_{l=1}^{k−1} (1 − βl),   βl ∼ Beta(1, α).

[Figure 2.22 (from Sudderth, 2006): sequential stick-breaking construction of the infinite set of mixture weights π ∼ GEM(α) corresponding to a measure G ∼ DP(α, H). Left: the first weight π1 ∼ Beta(1, α); each subsequent weight πk is some random proportion βk of the remaining, unbroken “stick” of probability mass. Right: the first K = 20 weights generated by four random stick-breaking constructions (two with α = 1, two with α = 5). Note that the weights πk do not monotonically decrease.]

For a given α and dataset size N, there are strong bounds on the accuracy of particular finite truncations of this stick-breaking process, which are often used in approximate computational methods. Other stick-breaking processes sample the proportions βk from different distributions; for example, the two-parameter Poisson–Dirichlet, or Pitman–Yor, process can produce heavier-tailed weight distributions which better match power laws arising in natural language processing. (from Sudderth, 2006)
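A compact sampler for this construction (my sketch, assuming numpy; the truncation level K is an approximation, with leftover mass ∏l (1 − βl) negligible for K ≫ α):

```python
# Truncated stick-breaking draw of G = Σ_k π_k δ_{θ_k} ~ DP(α, H), H = N(0, 1).
import numpy as np

def stick_breaking(alpha, K, rng):
    beta = rng.beta(1.0, alpha, size=K)                        # β_l ~ Beta(1, α)
    stick = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
    pi = beta * stick                     # π_k = β_k Π_{l<k} (1 - β_l)
    theta = rng.standard_normal(K)        # θ_k ~ H
    return pi, theta

rng = np.random.default_rng(0)
pi, theta = stick_breaking(alpha=5.0, K=200, rng=rng)
print(pi[:5], pi.sum())                   # weights decay on average; sum ≈ 1
```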


A little more theory. . . Dirichlet process REDUX

Stick-breaking (derivation) [Teh 2007]

We know that (posterior):

G ∼ DP(α, H), θ | G ∼ G   ⇔   θ ∼ H, G | θ ∼ DP(α + 1, (αH + δθ) / (α + 1))

Consider the partition ({θ}, Θ \ {θ}) of Θ. We have:

(G({θ}), G(Θ \ {θ})) ∼ Dir((α + 1) (αH + δθ)/(α + 1) ({θ}), (α + 1) (αH + δθ)/(α + 1) (Θ \ {θ}))
= Dir(1, α) = Beta(1, α)

(since H is continuous, (αH + δθ)({θ}) = 1 and (αH + δθ)(Θ \ {θ}) = α).

G has a point mass located at θ:

G = βδθ + (1 − β)G′,   β ∼ Beta(1, α),

and G′ is the renormalized probability measure with the point mass removed. What is G′?


A little more theory. . . Dirichlet process REDUX

Stick-breaking (derivation) [Teh 2007]

We have:

G ∼ DP(α, H), θ | G ∼ G   ⇒   θ ∼ H, G | θ ∼ DP(α + 1, (αH + δθ) / (α + 1))

G = βδθ + (1 − β)G′,   β ∼ Beta(1, α)

Consider a further partition ({θ}, T1, . . . , TK) of Θ:

(G({θ}), G(T1), . . . , G(TK)) = (β, (1 − β)G′(T1), . . . , (1 − β)G′(TK))
∼ Dir(1, αH(T1), . . . , αH(TK))

Using the agglomerative/decimative property of the Dirichlet distribution, we get

(G′(T1), . . . , G′(TK)) ∼ Dir(αH(T1), . . . , αH(TK)),

hence G′ ∼ DP(α, H).


A little more theory. . . Dirichlet process REDUX

Stick-breaking (derivation) [Teh 2007]

Therefore,

G ∼ DP(α, H)

G = β1 δθ1 + (1 − β1) G1
G = β1 δθ1 + (1 − β1)(β2 δθ2 + (1 − β2) G2)
...

G = ∑_{k=1}^{∞} πk δθk,

where

πk = βk ∏_{l=1}^{k−1} (1 − βl),   βl ∼ Beta(1, α),

which is the stick-breaking construction.
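A numerical check of the weight decay implied by this construction (my sketch, assuming numpy): with independent βl ∼ Beta(1, α), the weights satisfy E[πk] = (1/(1 + α)) (α/(1 + α))^{k−1}, geometric in expectation even though individual draws need not decrease monotonically.

```python
# Monte Carlo check of E[π_k] = (1/(1+α)) * (α/(1+α))**(k-1).
import numpy as np

rng = np.random.default_rng(0)
alpha, K, n = 3.0, 10, 100_000

beta = rng.beta(1.0, alpha, size=(n, K))
stick = np.concatenate([np.ones((n, 1)), np.cumprod(1 - beta, axis=1)[:, :-1]], axis=1)
pi = beta * stick

k = np.arange(1, K + 1)
expected = (1 / (1 + alpha)) * (alpha / (1 + alpha)) ** (k - 1)
print(np.abs(pi.mean(axis=0) - expected).max())   # ≈ 0 up to MC error
```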


A little more theory. . . Dirichlet process REDUX

Chinese restaurant (derivation)

Once again, we start from the posterior:

p(G | θ1, . . . , θN, α, H) = DP(α + N, (1 / (α + N)) (αH + ∑_{i=1}^{N} δθi))

The expected measure of any subset T ⊂ Θ is:

E[G(T) | θ1, . . . , θN, α, H] = (1 / (α + N)) (αH(T) + ∑_{i=1}^{N} δθi(T))

Since G is discrete, some of the {θi}, i = 1, . . . , N, take identical values. Assume K ≤ N unique values {θ̄k}, with Nk the number of observations equal to θ̄k:

E[G(T) | θ1, . . . , θN, α, H] = (1 / (α + N)) (αH(T) + ∑_{k=1}^{K} Nk δθ̄k(T))


A little more theory. . . Dirichlet process REDUX

Chinese restaurant (derivation)

A bit informally. . . let Tk contain θ̄k and shrink it arbitrarily. In the limit, we have that

p(θN+1 = θ | θ1, . . . , θN, α, H) = (1 / (α + N)) (αh(θ) + ∑_{k=1}^{K} Nk δθ̄k(θ))

This is the generalized Polya urn scheme: an urn contains one ball for each preceding observation, with a different color for each distinct θ̄k. For each ball drawn from the urn, we replace that ball and add one more ball of the same color. There is a special “weighted” ball which is drawn with probability proportional to α normal balls, and has a new, previously unseen color. [This description follows Sudderth, 2006]

This allows us to sample from a Dirichlet process without explicitly constructing the underlying G ∼ DP(α, H).
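A direct transcription of the urn (my sketch, assuming numpy), generating θ1, θ2, . . . without ever constructing G:

```python
# Generalized Polya urn: θ_{N+1} repeats a past value θ̄_k with prob.
# N_k/(α+N), or is a fresh draw from H with prob. α/(α+N).
import numpy as np

def polya_urn(n, alpha, rng, H=None):
    H = H or rng.standard_normal            # base measure H = N(0, 1) by default
    thetas = []
    for N in range(n):
        if rng.random() < alpha / (alpha + N):
            thetas.append(H())                       # new, unseen "color"
        else:
            thetas.append(thetas[rng.integers(N)])   # old color, ∝ N_k
    return np.array(thetas)

rng = np.random.default_rng(0)
draws = polya_urn(20, alpha=1.0, rng=rng)
print(np.unique(draws).size, "distinct values among", draws.size, "draws")
```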


A little more theory. . . Dirichlet process REDUX

Chinese restaurant (derivation)

The Dirichlet process implicitly partitions the data.
Let zi indicate the subset (cluster) associated with the i-th observation, i.e. θi = θ̄zi.
From the previous slide, we get:

p(zN+1 = z | z1, . . . , zN, α) = (1 / (α + N)) (α δ(z, k̄) + ∑_{k=1}^{K} Nk δ(z, k)),

where k̄ denotes a previously unused cluster label.

This is the Chinese restaurant process (CRP).

[Figure: customers 1–7 seated at tables 1, 2, 3, 4, . . . in the Chinese restaurant metaphor.]

It induces an exchangeable distribution on partitions: the joint distribution is invariant to the order in which observations are assigned to clusters.
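A quick simulation (my sketch, assuming numpy) also reproduces the well-known growth rate of the partition: the expected number of occupied tables after N customers is E[K] = ∑_{i=0}^{N−1} α/(α + i), roughly α log N.

```python
# CRP cluster growth: E[K] after N customers equals Σ_{i=0}^{N-1} α/(α+i).
import numpy as np

def crp_num_tables(n, alpha, rng):
    counts = []                                    # table sizes N_k
    for N in range(n):
        if rng.random() < alpha / (alpha + N):
            counts.append(1)                       # open a new table
        else:                                      # join table k w.p. N_k/N
            k = rng.choice(len(counts), p=np.array(counts) / N)
            counts[k] += 1
    return len(counts)

rng = np.random.default_rng(0)
alpha, n = 2.0, 500
sims = [crp_num_tables(n, alpha, rng) for _ in range(200)]
print(np.mean(sims), (alpha / (alpha + np.arange(n))).sum())  # ≈ equal
```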


A little more theory. . . Dirichlet process REDUX

Take away message: these representations are all equivalent!

Posterior DP:

G ∼ DP(α, H), θ | G ∼ G   ⇔   θ ∼ H, G | θ ∼ DP(α + 1, (αH + δθ) / (α + 1))

Stick-breaking construction:

G(θ) = ∑_{k=1}^{∞} πk δ(θ, θk),   θk ∼ H,   π ∼ GEM(α)

Generalized Polya urn:

p(θN+1 = θ | θ1, . . . , θN, α, H) = (1 / (α + N)) (αh(θ) + ∑_{k=1}^{K} Nk δθ̄k(θ))

Chinese restaurant process:

p(zN+1 = z | z1, . . . , zN, α) = (1 / (α + N)) (α δ(z, k̄) + ∑_{k=1}^{K} Nk δ(z, k))


The hierarchical Dirichlet process

Outline

1 Introduction and background: Bayesian learning; Nonparametric models

2 Finite mixture models: Bayesian models; Clustering with FMMs; Inference

3 Dirichlet process mixture models: Going nonparametric!; The Dirichlet process; DP mixture models; Inference

4 A little more theory. . . : De Finetti’s REDUX; Dirichlet process REDUX

5 The hierarchical Dirichlet process


The hierarchical Dirichlet process

The DP mixture model (DPMM)

Let’s use G ∼ DP(α,H) to build an infinite mixture model.

[Graphical model: H and α generate G ∼ DP(α, H); θi ∼ G are drawn for i = 1, . . . , N; each xi ∼ F(θi).]

[Figure 2.24 (from Sudderth, 2006): directed graphical representations of an infinite Dirichlet process mixture model. Mixture weights π ∼ GEM(α) follow a stick-breaking process, while cluster parameters are assigned independent priors θk ∼ H(λ). Left: indicator variable representation, in which zi ∼ π is the cluster that generates xi ∼ F(θzi). Right: alternative distributional form, in which G is an infinite discrete distribution on Θ; θi ∼ G are the parameters of the cluster that generates xi ∼ F(θi).]

Rather than choosing a finite model order K, Dirichlet process mixtures use the stick-breaking prior to control complexity; this relaxation leads to algorithms which automatically infer the number of clusters exhibited by a particular dataset. The predictive distribution implied by the Chinese restaurant process has a clustering bias and favors simpler models in which observations (customers) share parameters (dishes); additional clusters (tables) appear as more observations are generated. (from Sudderth, 2006)

G ∼ DP(α, H)

θi ∼ G

xi ∼ F(θi)
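A concrete generative instance (my sketch, assuming numpy): a DP mixture of 1-D Gaussians sampled through the CRP representation.

```python
# Generate N points from a DP mixture of Gaussians: assignments follow
# CRP(α); cluster means θ_k ~ H = N(0, 3²); data x_i ~ N(θ_{z_i}, 0.5²).
import numpy as np

def sample_dpmm(n, alpha, rng, tau=3.0, sigma=0.5):
    means, counts, x, z = [], [], [], []
    for N in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(means):                       # new cluster
            means.append(rng.normal(0.0, tau))    # θ_k ~ H
            counts.append(0)
        counts[k] += 1
        z.append(k)
        x.append(rng.normal(means[k], sigma))     # x_i ~ F(θ_{z_i})
    return np.array(x), np.array(z)

rng = np.random.default_rng(0)
x, z = sample_dpmm(200, alpha=1.0, rng=rng)
print("clusters:", z.max() + 1, "sizes:", np.bincount(z))
```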


The hierarchical Dirichlet process

Related subgroups of data

Dataset with J related groups x = (x1, . . . , xJ).
xj = (xj1, . . . , xjNj) contains Nj observations.
We want these groups to share clusters (to transfer knowledge).

[Figure: J related groups of observations xj1, . . . , xjNj (from Jordan, 2005).]


The hierarchical Dirichlet process

Hierarchical Dirichlet process (HDP)

Global probability measure G0 ∼ DP(γ, H) defines a set of shared clusters:

G0(θ) = ∑_{k=1}^{∞} βk δ(θ, θk),   θk ∼ H,   β ∼ GEM(γ)

Group-specific distributions Gj ∼ DP(α, G0):

Gj(θ) = ∑_{t=1}^{∞} π̃jt δ(θ, θ̃jt),   θ̃jt ∼ G0,   π̃j ∼ GEM(α)

Note G0 as base measure! Each local cluster has its parameter θ̃jt copied from some global cluster θk.

For each group, data points are generated according to:

θji ∼ Gj

xji ∼ F(θji)


The hierarchical Dirichlet process

The HDP mixture model (HDPMM)

[Graphical model: γ and H generate G0 ∼ DP(γ, H); for each group j = 1, . . . , J, Gj ∼ DP(α, G0), θji ∼ Gj, and xji ∼ F(θji) for i = 1, . . . , Nj.]

[Figure 2.28 (from Sudderth, 2006): directed graphical representations of a hierarchical Dirichlet process (HDP) mixture model. Global cluster weights β ∼ GEM(γ) follow a stick-breaking process, while cluster parameters are assigned independent priors θk ∼ H(λ). Left: explicit stick-breaking representation, in which each group reuses the global clusters with weights πj ∼ DP(α, β); zji ∼ πj indicates the cluster that generates xji ∼ F(θzji). Right: alternative distributional form, in which G0 ∼ DP(γ, H) is an infinite discrete distribution on Θ, and Gj ∼ DP(α, G0) a reweighted, group-specific distribution; θji ∼ Gj are then the parameters of the cluster that generates xji ∼ F(θji).]

Thus β determines the average weight of local clusters (E[πjk] = βk), while α controls the variability of cluster weights across groups. The discreteness of G0 plays a critical role in this construction: had we instead taken Gj ∼ DP(α, H) with H continuous, the stick-breaking construction shows that the groups would learn independent sets of disjoint clusters. (from Sudderth, 2006)

G0 ∼ DP(γ, H)

Gj ∼ DP(α, G0)

θji ∼ Gj

xji ∼ F(θji)
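A generative sketch (mine, assuming numpy) in the spirit of the Chinese restaurant franchise view of this model: each group runs a local CRP over tables, and each new table orders a dish from a global CRP, so dishes (clusters) are shared across groups.

```python
# Chinese-restaurant-franchise-style sampler for an HDP Gaussian mixture:
# customers sit at local tables (CRP with α); new tables pick a global dish
# (CRP with γ over dishes); dish parameters θ_k ~ H = N(0, 3²).
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 1.0, 1.0
dish_means, dish_tables = [], []          # global dishes; #tables serving each

def sample_group(n):
    tables, table_dish, z = [], [], []
    for N in range(n):
        probs = np.array(tables + [alpha], dtype=float)
        t = rng.choice(len(probs), p=probs / probs.sum())
        if t == len(tables):              # new table: order a dish globally
            dprobs = np.array(dish_tables + [gamma], dtype=float)
            k = rng.choice(len(dprobs), p=dprobs / dprobs.sum())
            if k == len(dish_means):      # brand-new global dish
                dish_means.append(rng.normal(0.0, 3.0))
                dish_tables.append(0)
            dish_tables[k] += 1
            tables.append(0)
            table_dish.append(k)
        tables[t] += 1
        z.append(table_dish[t])
    return np.array(z)

groups = [sample_group(100) for _ in range(3)]
x = [rng.normal(np.take(dish_means, z), 0.5) for z in groups]   # x_ji ~ F(θ)
print("global dishes:", len(dish_means))
print("dishes used per group:", [sorted(set(z.tolist())) for z in groups])
```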


The hierarchical Dirichlet process

The HDP mixture model (HDPMM)

Gj(θ) = ∑_{t=1}^{∞} π̃jt δ(θ, θ̃jt),   θ̃jt ∼ G0,   π̃j ∼ GEM(α)

G0 is discrete, so each group might create several copies of the same global cluster. Aggregating the probabilities:

Gj(θ) = ∑_{k=1}^{∞} πjk δ(θ, θk),   πjk = ∑_{t: kjt = k} π̃jt

It can be shown that πj ∼ DP(α, β) (viewing β as a measure on the positive integers).
β = (β1, β2, . . .): average weights of the global clusters.
πj = (πj1, πj2, . . .): group-specific weights.
α controls the variability of cluster weights across groups.
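A numeric illustration of the aggregation step (my sketch, assuming numpy): draw a truncated β ∼ GEM(γ), then for each group draw π̃j ∼ GEM(α) with atom labels kjt ∼ β, and aggregate πjk = ∑_{t: kjt = k} π̃jt; the group weights concentrate around β, i.e. E[πjk] ≈ βk.

```python
# Aggregation check for the HDP: π_jk = Σ_{t: k_jt = k} π̃_jt with
# k_jt ~ β and π̃_j ~ GEM(α); empirically E[π_jk] ≈ β_k.
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, K, T, n_groups = 1.0, 5.0, 50, 500, 2000

b = rng.beta(1.0, gamma, size=K)                  # truncated β ~ GEM(γ)
beta = b * np.concatenate(([1.0], np.cumprod(1 - b)[:-1]))
beta /= beta.sum()                                # renormalize the truncation

pis = np.zeros((n_groups, K))
for j in range(n_groups):
    v = rng.beta(1.0, alpha, size=T)              # truncated π̃_j ~ GEM(α)
    w = v * np.concatenate(([1.0], np.cumprod(1 - v)[:-1]))
    k = rng.choice(K, size=T, p=beta)             # atom labels k_jt ~ β
    np.add.at(pis[j], k, w)                       # aggregate duplicate atoms

print(np.abs(pis.mean(axis=0) - beta).max())      # ≈ 0: E[π_jk] = β_k
```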


The hierarchical Dirichlet process

THANK YOU. QUESTIONS?
