Bayesian Nonparametrics: Models Based on the Dirichlet Process

Alessandro Panella

Department of Computer Science, University of Illinois at Chicago

Machine Learning Seminar Series, February 18, 2013

Sources and Inspirations

Tutorials (slides):
P. Orbanz and Y.W. Teh, Modern Bayesian Nonparametrics. NIPS 2011.
M. Jordan, Dirichlet Process, Chinese Restaurant Process, and All That. NIPS 2005.

Articles etc.:
E.B. Sudderth, Chapter in PhD thesis, 2006.
E. Fox, Chapter in PhD thesis, 2008.
Y.W. Teh, Dirichlet Processes. Encyclopedia of Machine Learning, 2010. Springer.
...

Outline

1 Introduction and background
    Bayesian learning
    Nonparametric models

2 Finite mixture models
    Bayesian models
    Clustering with FMMs
    Inference

3 Dirichlet process mixture models
    Going nonparametric!
    The Dirichlet process
    DP mixture models
    Inference

4 A little more theory...
    De Finetti's REDUX
    Dirichlet process REDUX

5 The hierarchical Dirichlet process

Introduction and background Bayesian learning

The meaning of it all

BAYESIAN NONPARAMETRICS

Introduction and background Bayesian learning

Bayesian statistics

Estimate a parameter θ ∈ Θ after observing data x.

Frequentist
Maximum Likelihood (ML): θ_MLE = argmax_θ p(x|θ) = argmax_θ L(θ; x)

Bayesian
Bayes Rule: p(θ|x) = p(x|θ) p(θ) / p(x)

Bayesian prediction (using the whole posterior, not just one estimator):

p(x_new|x) = ∫_Θ p(x_new|θ) p(θ|x) dθ

Maximum A Posteriori (MAP): θ_MAP = argmax_θ p(x|θ) p(θ)

Introduction and background Bayesian learning

De Finetti's theorem

A premise:

Definition
An infinite sequence of random variables (x_1, x_2, ...) is said to be (infinitely) exchangeable if, for every N and every possible permutation π on (1, ..., N),

p(x_1, x_2, ..., x_N) = p(x_π(1), x_π(2), ..., x_π(N))

Note: exchangeability is not the same as i.i.d.!

Example (Polya Urn)
An urn contains some red balls and some black balls; an infinite sequence of colors is drawn recursively as follows: draw a ball, mark down its color, then put the ball back in the urn along with an additional ball of the same color.
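A minimal simulation of the Polya urn scheme, as a sketch (the function name and initial counts are illustrative):

    import random

    def polya_urn(n_draws, n_red=1, n_black=1, seed=0):
        """Draw a sequence of colors from a Polya urn: each drawn ball is
        returned together with an extra ball of the same color."""
        rng = random.Random(seed)
        counts = {"red": n_red, "black": n_black}
        sequence = []
        for _ in range(n_draws):
            total = counts["red"] + counts["black"]
            color = "red" if rng.random() < counts["red"] / total else "black"
            counts[color] += 1   # put the ball back plus one more of its color
            sequence.append(color)
        return sequence

    print(polya_urn(10))

The resulting sequence is exchangeable but not i.i.d.: the probability of a sequence depends only on how many balls of each color appear in it, not on their order.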

Introduction and background Bayesian learning

De Finetti's theorem (cont'd)

Theorem (De Finetti, 1935. Aka Representation Theorem)
A sequence of random variables (x_1, x_2, ...) is infinitely exchangeable if and only if, for all N, there exists a random variable θ and a probability measure p on it such that

p(x_1, x_2, ..., x_N) = ∫_Θ p(θ) ∏_{i=1}^N p(x_i|θ) dθ

i.e., there exists a parameter space and a measure on it that make the variables iid!

The representation theorem motivates (and encourages!) the use of Bayesian statistics.

Introduction and background Bayesian learning

Bayesian learning

Hypothesis space H. Given data D, compute

p(h|D) = p(D|h) p(h) / p(D)

Then, we probably want to predict some future data D′, by either:
Averaging over H, i.e. p(D′|D) = ∫_H p(D′|h) p(h|D) dh
Choosing the MAP h (or computing it directly), i.e. p(D′|D) = p(D′|h_MAP)
Sampling from the posterior...

H can be anything! Bayesian learning is a general learning framework.
We will consider the case in which h is a probabilistic model itself, i.e. a parameter vector θ.

Introduction and background Bayesian learning

A simple example

Infer the bias θ ∈ [0, 1] of a coin after observing N tosses.

H = 1, T = 0, p(H) = θ
h = θ, hence H = [0, 1]

Sequence of Bernoulli trials:

p(x_1, ..., x_N|θ) = θ^{n_H} (1 − θ)^{N − n_H}

where n_H = # heads. Unknown θ:

p(x_1, ..., x_N) = ∫_0^1 θ^{n_H} (1 − θ)^{N − n_H} p(θ) dθ

Need to find a "good" prior p(θ)... Beta distribution!

[Graphical model: θ generates x_1, ..., x_N; equivalently, in plate notation, θ generates x_i, i = 1 ... N.]

Introduction and background Bayesian learning

A simple example (cont'd)

Beta distribution: θ ∼ Beta(a, b)

p(θ|a, b) = (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1}

Bayesian learning: p(h|D) ∝ p(D|h) p(h); for us:

p(θ|x_1, ..., x_N) ∝ p(x_1, ..., x_N|θ) p(θ)
                   = θ^{n_H} (1 − θ)^{n_T} (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1}
                   ∝ θ^{n_H + a − 1} (1 − θ)^{n_T + b − 1}

i.e. θ|x_1, ..., x_N ∼ Beta(a + n_H, b + n_T)

We're lucky! The Beta distribution is a conjugate prior to the binomial distribution.

[Plots: example Beta densities: Beta(0.1, 0.1), Beta(1, 1), Beta(2, 3), Beta(10, 10).]
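A minimal sketch of this conjugate update and its posterior predictive, assuming numpy and scipy (names are illustrative):

    import numpy as np
    from scipy import stats

    def beta_bernoulli_posterior(tosses, a=1.0, b=1.0):
        """Beta(a, b) prior + Bernoulli tosses (1 = heads)
        -> Beta(a + n_H, b + n_T) posterior."""
        tosses = np.asarray(tosses)
        n_heads = int(tosses.sum())
        n_tails = tosses.size - n_heads
        return stats.beta(a + n_heads, b + n_tails)

    posterior = beta_bernoulli_posterior([1, 0, 1, 1], a=2, b=2)
    # Posterior predictive p(x_new = H | x) is the posterior mean of theta:
    print(posterior.mean())   # (2 + 3) / (2 + 2 + 4) = 0.625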

Introduction and background Bayesian learning

A simple example (cont’d)

Three sequences of four tosses:

H T H H

H H H T

H H H H

Introduction and background Nonparametric models

Nonparametric models

"Nonparametric" doesn't mean "no parameters"! Rather:
The number of parameters grows as more data are observed.
∞-dimensional parameter space.
Finite data ⇒ bounded number of parameters.

Definition
A nonparametric model is a Bayesian model on an ∞-dimensional parameter space.

Terminology
Parametric model: the number of parameters is fixed (or constantly bounded) w.r.t. sample size.
Nonparametric model: the number of parameters grows with sample size; ∞-dimensional parameter space.

Example: density estimation. [Figure: a parametric fit (samples from a single Gaussian centered on a mean µ) vs. nonparametric Parzen-window density estimates of the same data for several window widths h. From Orbanz and Teh, NIPS 2011.]

Finite mixture models Bayesian models

Models in Bayesian data analysis

Model
Generative process: expresses how we think the data are generated.
Contains hidden variables (the subject of learning).
Specifies relations between variables, e.g. graphical models.

Posterior inference
Knowing p(D|M, θ) ("how the data are generated"), compute p(θ|M, D).
Akin to "reversing" the generative process.

[Graphical model: parameters θ with prior p(θ) and model M generate data D via p(D|M, θ); inference recovers p(θ|D, M).]

Finite mixture models Clustering with FMMs

Finite mixture models (FMMs)

Bayesian approach to clustering. Each data point is assumed to belong to one of K clusters.

General form
A sequence of data points x = (x_1, ..., x_N), each with probability

p(x_i|π, θ_1, ..., θ_K) = ∑_{k=1}^K π_k f(x_i|θ_k),   π ∈ Π_{K−1} (the (K−1)-simplex)

Generative process (a sampling sketch follows)
For each i:
Draw a cluster assignment z_i ∼ π.
Draw a data point x_i ∼ F(θ_{z_i}).
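A minimal sketch of this generative process for a univariate Gaussian mixture (the parameter values match the example on the next slide):

    import numpy as np

    def sample_fmm(n, weights, means, stds, seed=0):
        """Sample n points from a finite mixture of 1-D Gaussians."""
        rng = np.random.default_rng(seed)
        z = rng.choice(len(weights), size=n, p=weights)   # z_i ~ pi
        x = rng.normal(means[z], stds[z])                 # x_i ~ N(mu_{z_i}, sigma_{z_i})
        return z, x

    z, x = sample_fmm(500,
                      weights=np.array([0.15, 0.25, 0.6]),
                      means=np.array([1.0, 4.0, 6.0]),
                      stds=np.array([1.0, 0.5, 0.7]))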

Finite mixture models Clustering with FMMs

FMMs (example)

Mixture of univariate Gaussians: θ_k = (µ_k, σ_k), x_i ∼ N(µ_k, σ_k)

p(x_i|π, µ, σ) = ∑_{k=1}^K π_k f_N(x_i; µ_k, σ_k)

[Plot: mixture density with components N(1, 1), N(4, .5), N(6, .7) and weights π = (0.15, 0.25, 0.6).]

Finite mixture models Clustering with FMMs

FMMs (cont'd)

Clustering with FMMs

Need priors for π, θ:
Usually, π is given a (symmetric) Dirichlet distribution prior.
The θ_k's are given a suitable prior H, depending on the data.

π ∼ Dir(α/K, ..., α/K)
θ_k|H ∼ H                 k = 1 ... K
z_i|π ∼ π
x_i|θ, z_i ∼ F(θ_{z_i})   i = 1 ... N

[Graphical model: α → π → z_i → x_i ← θ_k ← H, with plates over i = 1 ... N and k = 1 ... K.]

Finite mixture models Clustering with FMMs

Dirichlet distribution

Multivariate generalization of the Beta distribution.

[Plots: Dirichlet densities on the simplex for Dir(1, 1, 1), Dir(2, 2, 2), Dir(5, 5, 5), Dir(5, 5, 2), Dir(5, 2, 2), and Dir(0.7, 0.7, 0.7). From Teh, MLSC 2008.]

Finite mixture models Clustering with FMMs

Dirichlet distribution (cont'd)

π ∼ Dir(α/K, ..., α/K) iff p(π_1, ..., π_K) = (Γ(α) / ∏_k Γ(α/K)) ∏_{k=1}^K π_k^{α/K − 1}

Conjugate prior to the categorical/multinomial, i.e.

π ∼ Dir(α/K, ..., α/K)
z_i ∼ π   i = 1 ... N

implies

π|z_1, ..., z_N ∼ Dir(α/K + n_1, α/K + n_2, ..., α/K + n_K)

Moreover,

p(z_1, ..., z_N|α) = (Γ(α) / Γ(α + N)) ∏_{k=1}^K Γ(n_k + α/K) / Γ(α/K)

and

p(z_i = k|z_{−i}, α) = (n_k^{(−i)} + α/K) / (α + N − 1)
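A small numerical check of these two formulas, as a sketch (numpy assumed; names are illustrative):

    import numpy as np

    def dirichlet_posterior(z, K, alpha):
        """Dir(alpha/K, ..., alpha/K) prior + assignments z in {0, ..., K-1}
        -> Dir(alpha/K + n_1, ..., alpha/K + n_K); returns the parameters."""
        return alpha / K + np.bincount(z, minlength=K)

    def predictive(z, i, K, alpha):
        """p(z_i = k | z_{-i}, alpha) for all k, holding out observation i."""
        counts = np.bincount(np.delete(z, i), minlength=K)
        return (counts + alpha / K) / (alpha + len(z) - 1)

    z = np.array([0, 0, 1, 2, 2, 2])
    print(dirichlet_posterior(z, K=3, alpha=1.0))   # [2.33, 1.33, 3.33]
    print(predictive(z, i=0, K=3, alpha=1.0))       # sums to 1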

Finite mixture models Inference

Inference in FMMs

Clustering: infer z (marginalize over π, θ)

p(z|x, α, H) = p(x|z, H) p(z|α) / ∑_z p(x|z, H) p(z|α),   where

p(z|α) = (Γ(α) / Γ(α + N)) ∏_{k=1}^K Γ(n_k + α/K) / Γ(α/K)

p(x|z, H) = ∫_Θ [∏_{i=1}^N p(x_i|θ_{z_i}) ∏_{k=1}^K H(θ_k)] dθ

Parameter estimation: infer π, θ

p(π, θ|x, α, H) = ∑_z [p(π|z, α) ∏_{k=1}^K p(θ_k|x, z, H)] p(z|x, α, H)

⇒ No analytic procedure: the sums over all partitions z are intractable.

Finite mixture models Inference

Approximate inference for FMMs

No exact inference because of the unknown cluster identifiers z.

Expectation-Maximization (EM)
Widely used, but we will focus on MCMC because of the connection with the Dirichlet process.

Gibbs sampling
Markov chain Monte Carlo (MCMC) integration method.
Set of random variables v = {v_1, v_2, ..., v_M}; we want to compute p(v).
Randomly initialize their values.
At each iteration, sample one variable v_i and hold the rest constant:

v_i^{(t)} ∼ p(v_i | v_j^{(t−1)}, j ≠ i)   ← usually tractable
v_j^{(t)} = v_j^{(t−1)}   for j ≠ i

This creates a Markov chain with p(v) as its equilibrium distribution. (A generic skeleton follows.)
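A generic Gibbs-sampling skeleton, as a sketch (the conditional-sampler interface is illustrative):

    import numpy as np

    def gibbs(v_init, conditional_samplers, n_iters, seed=0):
        """Generic Gibbs sampler.

        conditional_samplers[i](v, rng) draws v_i from p(v_i | v_{-i});
        all other coordinates of v are held fixed during that draw."""
        rng = np.random.default_rng(seed)
        v = list(v_init)
        samples = []
        for _ in range(n_iters):
            for i, sample_vi in enumerate(conditional_samplers):
                v[i] = sample_vi(v, rng)   # v_i^(t) ~ p(v_i | rest)
            samples.append(list(v))
        return samples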

Finite mixture models Inference

Gibbs sampling for FMMs

State variables: z_1, ..., z_N, θ_1, ..., θ_K, π.

Conditional distributions:

p(π|z, θ) = Dir(α/K + n_1, ..., α/K + n_K)

p(θ_k|x, z) ∝ p(θ_k) ∏_{i:z_i=k} p(x_i|θ_k) = H(θ_k) ∏_{i:z_i=k} F_{θ_k}(x_i)

p(z_i = k|π, θ, x) ∝ p(z_i = k|π) p(x_i|z_i = k, θ_k) = π_k F_{θ_k}(x_i)

We can avoid sampling π by collapsing it out:

p(z_i = k|z_{−i}, θ, x) ∝ p(x_i|θ_k) p(z_i = k|z_{−i}) ∝ F_{θ_k}(x_i) (n_k^{(−i)} + α/K)
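A compact sketch of these updates for a 1-D Gaussian mixture with known variance σ² and a N(0, τ²) prior on each mean (hyperparameters are illustrative; the mean update uses the standard normal-normal conjugate posterior):

    import numpy as np

    def gibbs_fmm(x, K, alpha=1.0, sigma=1.0, tau=3.0, n_iters=100, seed=0):
        """Gibbs sampler for a K-component 1-D Gaussian mixture with known
        component variance sigma^2 and N(0, tau^2) prior on the means; the
        mixture weights pi are collapsed out."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        N = len(x)
        z = rng.integers(K, size=N)
        mu = rng.normal(0.0, tau, size=K)
        for _ in range(n_iters):
            # 1) z_i | z_{-i}, theta, x  (pi integrated out)
            for i in range(N):
                counts = np.bincount(np.delete(z, i), minlength=K)
                lik = np.exp(-0.5 * ((x[i] - mu) / sigma) ** 2)
                p = (counts + alpha / K) * lik
                z[i] = rng.choice(K, p=p / p.sum())
            # 2) theta_k | x, z  (normal-normal conjugacy)
            for k in range(K):
                xk = x[z == k]
                prec = 1.0 / tau**2 + len(xk) / sigma**2
                mu[k] = rng.normal((xk.sum() / sigma**2) / prec, 1.0 / np.sqrt(prec))
        return z, mu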

Finite mixture models Inference

Gibbs sampling for FMMs (example)

Mixture of 4 bivariate Gaussians.
Normal-inverse-Wishart prior on θ_k = (µ_k, Σ_k), conjugate to the normal distribution:

Σ_k ∼ W(ν, ∆)   µ_k ∼ N(ϑ, Σ_k/κ)

[Figure: sampler state after T=2, T=10, T=40 iterations from two random initializations; each plot is labeled with the current data log-likelihood. From Sudderth, 2008.]

Finite mixture models Inference

FMMs: alternative representation

Indicator variable representation:

π ∼ Dir(α/K, ..., α/K)
θ_k ∼ H
z_i ∼ π
x_i ∼ F(θ_{z_i})

Mixture models can equivalently be expressed in terms of a discrete distribution G taking K distinct values:

G(θ) = ∑_{k=1}^K π_k δ(θ, θ_k),   θ_k ∼ H,   π ∼ Dir(α/K, ..., α/K)

θ_i ∼ G
x_i ∼ F(θ_i)

[Figure: the two directed graphical models for a K-component mixture: the indicator form, where z_i ∼ π selects the cluster that generates x_i, and the distributional form, where θ_i ∼ G are the parameters of the cluster that generates x_i. From Sudderth, 2008.]

Dirichlet process mixture models Going nonparametric!

Going nonparametric!

The problem with finite FMMs
What if K is unknown? How many parameters?

Idea
Let's use ∞ parameters! We want something of the kind:

p(x_i|π, θ_1, θ_2, ...) = ∑_{k=1}^∞ π_k p(x_i|θ_k)

How to define such a measure?
We'd like the nice conjugacy properties of the Dirichlet to carry over...
Is there such a thing as the ∞ limit of a Dirichlet?

Dirichlet process mixture models The Dirichlet process

The (practical) Dirichlet process

The Dirichlet process is a distribution over probability measures over Θ:

DP(α, H)

H(θ) is the base (mean) measure.
Think µ for a Gaussian... but in the space of probability measures.

α is the concentration parameter.
Controls the dispersion around the mean H.

Dirichlet process mixture models The Dirichlet process

The Dirichlet process (cont'd)

A draw G ∼ DP(α, H) is an infinite discrete probability measure:

G(θ) = ∑_{k=1}^∞ π_k δ(θ, θ_k),   where

θ_k ∼ H, and π is sampled from a "stick-breaking prior":

β_k ∼ Beta(1, α)
π_k = β_k ∏_{l=1}^{k−1} (1 − β_l)

The masses decrease on average (the GEM distribution); sorted into strictly decreasing order they follow the Poisson-Dirichlet distribution. (from Orbanz & Teh, 2008)

Break a stick
Imagine a stick of length one. For k = 1 ... ∞, do the following:
Break off a fraction β_k ∼ Beta(1, α) of the remaining stick.
Let π_k be the length of the piece broken off, and keep the remainder.

Following standard convention, we write π ∼ GEM(α). (Details in the second part of the talk; a sampling sketch follows.)
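A truncated stick-breaking sampler, as a sketch (the truncation level is an approximation choice, not part of the definition):

    import numpy as np

    def stick_breaking(alpha, n_atoms, base_sampler, seed=0):
        """Draw a truncated sample G ~ DP(alpha, H): atom locations
        theta_k ~ H and weights pi ~ GEM(alpha), truncated to n_atoms."""
        rng = np.random.default_rng(seed)
        beta = rng.beta(1.0, alpha, size=n_atoms)                # break fractions
        remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
        pi = beta * remaining        # pi_k = beta_k * prod_{l<k} (1 - beta_l)
        theta = base_sampler(rng, n_atoms)                       # theta_k ~ H
        return theta, pi

    # Example with a standard normal base measure H:
    theta, pi = stick_breaking(alpha=2.0, n_atoms=50,
                               base_sampler=lambda rng, n: rng.normal(0.0, 1.0, n))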

Dirichlet process mixture models The Dirichlet process

Stick-breaking, intuitively

[Figure: sequential stick-breaking construction of the mixture weights π ∼ GEM(α). The first weight π_1 ∼ Beta(1, α); each subsequent weight π_k is a random Beta(1, α) proportion β_k of the remaining, unbroken stick of probability mass. Shown alongside: the first 20 weights from four random constructions, two with α = 1 and two with α = 5; note that the weights π_k do not monotonically decrease. From Sudderth, 2008.]

Small α ⇒ lots of weight assigned to few θ_k's ⇒ G will be very different from the base measure H.

Large α ⇒ weight distributed more equally across the θ_k's ⇒ G will resemble the base measure H.

Dirichlet process mixture models The Dirichlet process

[Figure: a base measure H and a discrete draw G ∼ DP(α, H). From Navarro et al., 2005.]

Dirichlet process mixture models DP mixture models

The DP mixture model (DPMM)

Let's use G ∼ DP(α, H) to build an infinite mixture model:

G ∼ DP(α, H)
θ_i ∼ G
x_i ∼ F(θ_i)

[Figure: directed graphical models of the DP mixture, in indicator form (π ∼ GEM(α), z_i ∼ π, x_i ∼ F(θ_{z_i})) and in the distributional form above, where G is an infinite discrete distribution on Θ. From Sudderth, 2008.]

Dirichlet process mixture models DP mixture models

DPM (cont'd)

Using explicit cluster indicators z = (z_1, z_2, ..., z_N):

π ∼ GEM(α)
θ_k ∼ H             k = 1, ..., ∞
z_i ∼ π
x_i ∼ F(θ_{z_i})    i = 1, ..., N

Dirichlet process mixture models DP mixture models

Chinese restaurant process

So far, we only have a generative model.
Is there a "nice" conjugacy property to use during inference?
It turns out (details in part 2) that, if

π ∼ GEM(α)
z_i ∼ π

then the distribution p(z|α) = ∫ p(z|π) p(π) dπ is easily tractable, and is known as the Chinese restaurant process (CRP).

Dirichlet process mixture models DP mixture models

Chinese restaurant process (cont'd)

Restaurant with ∞ tables, each with ∞ capacity.

z_i = table at which customer i sits upon entering.
Customer 1 sits at table 1.
Customer 2 sits:
  at table 1 w. prob. ∝ 1
  at table 2 w. prob. ∝ α
Customer i sits:
  at table k w. prob. ∝ n_k (# people at table k)
  at a new table w. prob. ∝ α

p(z_i = k) = n_k / (α + i − 1)
p(z_i = k_new) = α / (α + i − 1)

(A simulation sketch follows.)
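A minimal CRP simulator, as a sketch:

    import numpy as np

    def crp(n_customers, alpha, seed=0):
        """Simulate table assignments from a Chinese restaurant process."""
        rng = np.random.default_rng(seed)
        z = [0]                          # customer 1 sits at the first table
        counts = [1]                     # number of people at each table
        for i in range(1, n_customers):
            # existing table k w.p. n_k / (alpha + i), new table w.p. alpha / (alpha + i)
            probs = np.array(counts + [alpha]) / (alpha + i)
            table = rng.choice(len(probs), p=probs)
            if table == len(counts):
                counts.append(1)         # open a new table
            else:
                counts[table] += 1
            z.append(table)
        return z

    print(crp(20, alpha=2.0))

The number of occupied tables grows roughly as α log n, matching the growth of K noted on the next slide.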

Dirichlet process mixture models Inference

Gibbs sampling for DPMMs

Via the CRP, we can find the conditional distributions for Gibbs sampling.
State: θ_1, ..., θ_K, z.

p(θ_k|x, z) ∝ p(θ_k) ∏_{i:z_i=k} p(x_i|θ_k) = h(θ_k) ∏_{i:z_i=k} f(x_i|θ_k)

p(z_i = k|z_{−i}, θ, x) ∝ p(x_i|θ_k) p(z_i = k|z_{−i})
  ∝ n_k^{(−i)} f(x_i|θ_k)   for an existing k
  ∝ α f(x_i|θ_new)          for a new k, with θ_new ∼ H (or marginalized under H)

K grows as more data are observed, asymptotically as α log n. (A sketch follows.)
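A compact sketch of this sampler for a 1-D Gaussian DPMM with known variance σ² and base measure H = N(0, τ²); here a new cluster's likelihood uses the marginal under H, i.e. x_i ∼ N(0, σ² + τ²), and hyperparameters are illustrative:

    import numpy as np

    def normal_pdf(x, mean, var):
        return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

    def gibbs_dpmm(x, alpha=1.0, sigma=1.0, tau=3.0, n_iters=100, seed=0):
        """CRP Gibbs sampler for a DP mixture of 1-D Gaussians with known
        variance sigma^2 and base measure H = N(0, tau^2)."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        N = len(x)
        z = np.zeros(N, dtype=int)               # all points start in one cluster
        mus = np.array([x.mean()])
        for _ in range(n_iters):
            # 1) resample each z_i given z_{-i} and the cluster means
            for i in range(N):
                counts = np.bincount(np.delete(z, i), minlength=len(mus))
                lik = normal_pdf(x[i], mus, sigma**2)
                new_lik = normal_pdf(x[i], 0.0, sigma**2 + tau**2)
                p = np.append(counts * lik, alpha * new_lik)
                k = rng.choice(len(p), p=p / p.sum())
                if k == len(mus):                # open a new cluster
                    mus = np.append(mus, rng.normal(0.0, tau))
                z[i] = k
            # 2) drop clusters left empty during the sweep, and relabel
            keep = np.unique(z)
            z = np.searchsorted(keep, z)
            mus = mus[keep]
            # 3) resample cluster means (normal-normal conjugacy)
            for k in range(len(mus)):
                xk = x[z == k]
                prec = 1.0 / tau**2 + len(xk) / sigma**2
                mus[k] = rng.normal((xk.sum() / sigma**2) / prec, 1.0 / np.sqrt(prec))
        return z, mus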

Dirichlet process mixture models Inference

Gibbs sampling for DPMMs (example)

Mixture of bivariate Gaussians.

[Figure: DP mixture Gibbs sampler state after T=2, T=10, T=40 iterations from two initializations; each plot shows the clusters currently assigned to observations and the corresponding data log-likelihood. From Sudderth, 2008.]

Dirichlet process mixture models Inference

END OF FIRST PART.

A little more theory... De Finetti's REDUX

De Finetti's REDUX

Theorem (De Finetti, 1935. Aka Representation Theorem)
A sequence of random variables (x_1, x_2, ...) is infinitely exchangeable if and only if, for all N, there exists a random variable θ and a probability measure p on it such that

p(x_1, x_2, ..., x_N) = ∫_Θ p(θ) ∏_{i=1}^N p(x_i|θ) dθ

The theorem wouldn't be true if θ's range were limited to Euclidean vector spaces.
We need to allow θ to range over measures.
⇒ p(θ) is a distribution on measures, like the DP.

A little more theory. . . Dirichlet process REDUX

Dirichlet Process REDUX

Definition
Let Θ be a measurable space (of parameters), H be a probability distribution on Θ, and α a positive scalar. A Dirichlet process is the distribution of a random probability measure G over Θ, such that for any finite partition (T1, . . . , TK) of Θ, we have

(G(T1), . . . , G(TK)) ∼ Dir(αH(T1), . . . , αH(TK)).

[Figure 2.21 (from Sudderth, 2006): Dirichlet processes induce Dirichlet distributions on every finite, measurable partition. Left: an example base measure H on a bounded, two-dimensional space Θ (darker regions have higher probability). Center: a partition with K = 3 cells; the weight that a random measure G ∼ DP(α, H) assigns to these cells follows a Dirichlet distribution, with each cell Tk shaded according to its mean E[G(Tk)] = H(Tk). Right: another partition with K = 5 cells; the consistency of G implies, for example, that G(T1) + G(T2) and G(T̃1) follow identical beta distributions.]

Proposition 2.5.1. Let G ∼ DP(α, H) be a random measure distributed according to a Dirichlet process. Given N independent observations θi ∼ G, the posterior measure also follows a Dirichlet process:

p(G | θ1, . . . , θN, α, H) = DP(α + N, (1 / (α + N)) (αH + ∑_{i=1}^{N} δθi))   (2.169)

Proof. As shown by Ferguson [83], this result follows directly from the conjugate form of finite Dirichlet posterior distributions (see eq. (2.45)). See Sethuraman [254] for an alternative proof.

There are interesting similarities between eq. (2.169) and the general form of conjugate priors for exponential families (see Prop. 2.1.4). The Dirichlet process effectively defines a conjugate prior for distributions on arbitrary measurable spaces. In some contexts, the concentration parameter α can then be seen as expressing confidence in the base measure H via the size of a pseudo-dataset (see [113] for further discussion).

(from Sudderth, 2006)
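A small sanity check of this definition (my sketch, not from the slides): draw many random measures G via a truncated stick-breaking construction and compare the empirical moments of (G(T1), . . . , G(TK)) on a fixed partition with the Dirichlet values E[G(Tk)] = H(Tk) and Var[G(Tk)] = H(Tk)(1 − H(Tk))/(α + 1). Assumes numpy and scipy; the truncation level is a pragmatic approximation.

```python
# Check that (G(T1),...,G(TK)) behaves like Dir(αH(T1),...,αH(TK)) when
# G ~ DP(α, H) is simulated by truncated stick-breaking, with H = N(0, 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, trunc, n_draws = 2.0, 500, 2000
edges = np.array([-np.inf, -1.0, 0.5, np.inf])    # K = 3 cells of the real line

masses = np.zeros((n_draws, len(edges) - 1))
for d in range(n_draws):
    b = rng.beta(1.0, alpha, size=trunc)                       # stick proportions
    pi = b * np.concatenate(([1.0], np.cumprod(1 - b)[:-1]))   # weights π_k
    theta = rng.standard_normal(trunc)                         # atoms θ_k ~ H
    for k in range(len(edges) - 1):
        masses[d, k] = pi[(theta >= edges[k]) & (theta < edges[k + 1])].sum()

H = np.diff(norm.cdf(edges))                          # H(Tk) for each cell
print(masses.mean(axis=0), H)                         # empirical means ≈ H(Tk)
print(masses.var(axis=0), H * (1 - H) / (alpha + 1))  # ≈ Dirichlet variances
```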


A little more theory. . . Dirichlet process REDUX

Posterior conjugacy

Via the conjugacy of the Dirichlet distribution, we know that:

p(G(T1), . . . ,G(TK)|θ ∈ Tk) = Dir(αH(T1), . . . , αH(Tk) + 1, . . . , αH(TK))

Formalizing this analysis, we obtain that if

G ∼ DP(α, H),   θi ∼ G,   i = 1, . . . , N,

the posterior measure also follows a Dirichlet process:

p(G | θ1, . . . , θN, α, H) = DP(α + N, (1 / (α + N)) (αH + ∑_{i=1}^{N} δθi))

The DP defines a conjugate prior for distributions on arbitrary measure spaces.
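A minimal sketch (my own, assuming numpy) of what this conjugacy buys computationally: a draw from the updated base measure (αH + ∑i δθi)/(α + N) is an old observation with probability N/(α + N), uniformly among the θi, or a fresh draw from H with probability α/(α + N).

```python
# One draw from the posterior base measure (αH + Σ_i δ_{θ_i}) / (α + N):
# reuse a previous θ_i (∝ its multiplicity) or draw fresh from H.
import numpy as np

rng = np.random.default_rng(1)

def draw_posterior_base(obs, alpha, H=rng.standard_normal):
    N = len(obs)
    if rng.random() < alpha / (alpha + N):
        return H()                         # fresh atom from H = N(0, 1) here
    return obs[rng.integers(N)]            # uniform over past θ_i, ties favored

obs = np.array([0.3, 0.3, -1.2])           # ties allowed: G is discrete
print([round(draw_posterior_base(obs, alpha=2.0), 3) for _ in range(5)])
```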


A little more theory. . . Dirichlet process REDUX

Generating samples: stick breaking

Sethuraman (1994): equivalent definition of the Dirichlet process, through the stick-breaking construction.

G ∼ DP(α, H)   iff   G(θ) = ∑_{k=1}^{∞} πk δ(θ, θk),

where θk ∼ H, and

πk = βk ∏_{l=1}^{k−1} (1 − βl),   βl ∼ Beta(1, α).

[Figure 2.22 (from Sudderth, 2006): sequential stick-breaking construction of the infinite set of mixture weights π ∼ GEM(α) corresponding to a measure G ∼ DP(α, H). Left: the first weight π1 ∼ Beta(1, α); each subsequent weight πk is some random proportion βk of the remaining, unbroken “stick” of probability mass. Right: the first K = 20 weights generated by four random stick-breaking constructions (two with α = 1, two with α = 5). Note that the weights πk do not monotonically decrease.]

For a given α and dataset size N, there are strong bounds on the accuracy of particular finite truncations of this stick-breaking process, which are often used in approximate computational methods. Other stick-breaking processes sample the proportions βk from different distributions; for example, the two-parameter Poisson–Dirichlet, or Pitman–Yor, process can produce heavier-tailed weight distributions which better match power laws arising in natural language processing. (from Sudderth, 2006)
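A compact sampler for this construction (my sketch, assuming numpy; the truncation level K is an approximation, with leftover mass ∏l (1 − βl) negligible for K ≫ α):

```python
# Truncated stick-breaking draw of G = Σ_k π_k δ_{θ_k} ~ DP(α, H), H = N(0, 1).
import numpy as np

def stick_breaking(alpha, K, rng):
    beta = rng.beta(1.0, alpha, size=K)                        # β_l ~ Beta(1, α)
    stick = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
    pi = beta * stick                     # π_k = β_k Π_{l<k} (1 - β_l)
    theta = rng.standard_normal(K)        # θ_k ~ H
    return pi, theta

rng = np.random.default_rng(0)
pi, theta = stick_breaking(alpha=5.0, K=200, rng=rng)
print(pi[:5], pi.sum())                   # weights decay on average; sum ≈ 1
```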


A little more theory. . . Dirichlet process REDUX

Stick-breaking (derivation) [Teh 2007]

We know that (posterior):

G ∼ DP(α, H), θ | G ∼ G   ⇔   θ ∼ H, G | θ ∼ DP(α + 1, (αH + δθ) / (α + 1))

Consider the partition ({θ}, Θ \ {θ}) of Θ. We have:

(G({θ}), G(Θ \ {θ})) ∼ Dir((α + 1) (αH + δθ)/(α + 1) ({θ}), (α + 1) (αH + δθ)/(α + 1) (Θ \ {θ}))
= Dir(1, α) = Beta(1, α)

(since H is continuous, (αH + δθ)({θ}) = 1 and (αH + δθ)(Θ \ {θ}) = α).

G has a point mass located at θ:

G = βδθ + (1 − β)G′,   β ∼ Beta(1, α),

and G′ is the renormalized probability measure with the point mass removed. What is G′?


A little more theory. . . Dirichlet process REDUX

Stick-breaking (derivation) [Teh 2007]

We have:

G ∼ DP(α, H), θ | G ∼ G   ⇒   θ ∼ H, G | θ ∼ DP(α + 1, (αH + δθ) / (α + 1))

G = βδθ + (1 − β)G′,   β ∼ Beta(1, α)

Consider a further partition ({θ}, T1, . . . , TK) of Θ:

(G({θ}), G(T1), . . . , G(TK)) = (β, (1 − β)G′(T1), . . . , (1 − β)G′(TK))
∼ Dir(1, αH(T1), . . . , αH(TK))

Using the agglomerative/decimative property of the Dirichlet distribution, we get

(G′(T1), . . . , G′(TK)) ∼ Dir(αH(T1), . . . , αH(TK)),

hence G′ ∼ DP(α, H).


A little more theory. . . Dirichlet process REDUX

Stick-breaking (derivation) [Teh 2007]

Therefore,

G ∼ DP(α, H)

G = β1 δθ1 + (1 − β1) G1
G = β1 δθ1 + (1 − β1)(β2 δθ2 + (1 − β2) G2)
...

G = ∑_{k=1}^{∞} πk δθk,

where

πk = βk ∏_{l=1}^{k−1} (1 − βl),   βl ∼ Beta(1, α),

which is the stick-breaking construction.
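A numerical check of the weight decay implied by this construction (my sketch, assuming numpy): with independent βl ∼ Beta(1, α), the weights satisfy E[πk] = (1/(1 + α)) (α/(1 + α))^{k−1}, geometric in expectation even though individual draws need not decrease monotonically.

```python
# Monte Carlo check of E[π_k] = (1/(1+α)) * (α/(1+α))**(k-1).
import numpy as np

rng = np.random.default_rng(0)
alpha, K, n = 3.0, 10, 100_000

beta = rng.beta(1.0, alpha, size=(n, K))
stick = np.concatenate([np.ones((n, 1)), np.cumprod(1 - beta, axis=1)[:, :-1]], axis=1)
pi = beta * stick

k = np.arange(1, K + 1)
expected = (1 / (1 + alpha)) * (alpha / (1 + alpha)) ** (k - 1)
print(np.abs(pi.mean(axis=0) - expected).max())   # ≈ 0 up to MC error
```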


A little more theory. . . Dirichlet process REDUX

Chinese restaurant (derivation)

Once again, we start from the posterior:

p(G | θ1, . . . , θN, α, H) = DP(α + N, (1 / (α + N)) (αH + ∑_{i=1}^{N} δθi))

The expected measure of any subset T ⊂ Θ is:

E[G(T) | θ1, . . . , θN, α, H] = (1 / (α + N)) (αH(T) + ∑_{i=1}^{N} δθi(T))

Since G is discrete, some of the {θi}, i = 1, . . . , N, take identical values. Assume K ≤ N unique values {θ̄k}, with Nk the number of observations equal to θ̄k:

E[G(T) | θ1, . . . , θN, α, H] = (1 / (α + N)) (αH(T) + ∑_{k=1}^{K} Nk δθ̄k(T))


A little more theory. . . Dirichlet process REDUX

Chinese restaurant (derivation)

A bit informally. . . let Tk contain θ̄k and shrink it arbitrarily. In the limit, we have that

p(θN+1 = θ | θ1, . . . , θN, α, H) = (1 / (α + N)) (αh(θ) + ∑_{k=1}^{K} Nk δθ̄k(θ))

This is the generalized Polya urn scheme: an urn contains one ball for each preceding observation, with a different color for each distinct θ̄k. For each ball drawn from the urn, we replace that ball and add one more ball of the same color. There is a special “weighted” ball which is drawn with probability proportional to α normal balls, and has a new, previously unseen color. [This description follows Sudderth, 2006]

This allows us to sample from a Dirichlet process without explicitly constructing the underlying G ∼ DP(α, H).
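A direct transcription of the urn (my sketch, assuming numpy), generating θ1, θ2, . . . without ever constructing G:

```python
# Generalized Polya urn: θ_{N+1} repeats a past value θ̄_k with prob.
# N_k/(α+N), or is a fresh draw from H with prob. α/(α+N).
import numpy as np

def polya_urn(n, alpha, rng, H=None):
    H = H or rng.standard_normal            # base measure H = N(0, 1) by default
    thetas = []
    for N in range(n):
        if rng.random() < alpha / (alpha + N):
            thetas.append(H())                       # new, unseen "color"
        else:
            thetas.append(thetas[rng.integers(N)])   # old color, ∝ N_k
    return np.array(thetas)

rng = np.random.default_rng(0)
draws = polya_urn(20, alpha=1.0, rng=rng)
print(np.unique(draws).size, "distinct values among", draws.size, "draws")
```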


A little more theory. . . Dirichlet process REDUX

Chinese restaurant (derivation)

The Dirichlet process implicitly partitions the data.
Let zi indicate the subset (cluster) associated with the i-th observation, i.e. θi = θ̄zi.
From the previous slide, we get:

p(zN+1 = z | z1, . . . , zN, α) = (1 / (α + N)) (α δ(z, k̄) + ∑_{k=1}^{K} Nk δ(z, k)),

where k̄ denotes a previously unused cluster label.

This is the Chinese restaurant process (CRP).

[Figure: customers 1–7 seated at tables 1, 2, 3, 4, . . . in the Chinese restaurant metaphor.]

It induces an exchangeable distribution on partitions: the joint distribution is invariant to the order in which observations are assigned to clusters.
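A quick simulation (my sketch, assuming numpy) also reproduces the well-known growth rate of the partition: the expected number of occupied tables after N customers is E[K] = ∑_{i=0}^{N−1} α/(α + i), roughly α log N.

```python
# CRP cluster growth: E[K] after N customers equals Σ_{i=0}^{N-1} α/(α+i).
import numpy as np

def crp_num_tables(n, alpha, rng):
    counts = []                                    # table sizes N_k
    for N in range(n):
        if rng.random() < alpha / (alpha + N):
            counts.append(1)                       # open a new table
        else:                                      # join table k w.p. N_k/N
            k = rng.choice(len(counts), p=np.array(counts) / N)
            counts[k] += 1
    return len(counts)

rng = np.random.default_rng(0)
alpha, n = 2.0, 500
sims = [crp_num_tables(n, alpha, rng) for _ in range(200)]
print(np.mean(sims), (alpha / (alpha + np.arange(n))).sum())  # ≈ equal
```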


A little more theory. . . Dirichlet process REDUX

Take away message: these representations are all equivalent!

Posterior DP:

G ∼ DP(α, H), θ | G ∼ G   ⇔   θ ∼ H, G | θ ∼ DP(α + 1, (αH + δθ) / (α + 1))

Stick-breaking construction:

G(θ) = ∑_{k=1}^{∞} πk δ(θ, θk),   θk ∼ H,   π ∼ GEM(α)

Generalized Polya urn:

p(θN+1 = θ | θ1, . . . , θN, α, H) = (1 / (α + N)) (αh(θ) + ∑_{k=1}^{K} Nk δθ̄k(θ))

Chinese restaurant process:

p(zN+1 = z | z1, . . . , zN, α) = (1 / (α + N)) (α δ(z, k̄) + ∑_{k=1}^{K} Nk δ(z, k))


The hierarchical Dirichlet process

Outline

1 Introduction and background: Bayesian learning; Nonparametric models

2 Finite mixture models: Bayesian models; Clustering with FMMs; Inference

3 Dirichlet process mixture models: Going nonparametric!; The Dirichlet process; DP mixture models; Inference

4 A little more theory. . . : De Finetti’s REDUX; Dirichlet process REDUX

5 The hierarchical Dirichlet process


The hierarchical Dirichlet process

The DP mixture model (DPMM)

Let’s use G ∼ DP(α,H) to build an infinite mixture model.

[Graphical model: H and α generate G ∼ DP(α, H); θi ∼ G are drawn for i = 1, . . . , N; each xi ∼ F(θi).]

[Figure 2.24 (from Sudderth, 2006): directed graphical representations of an infinite Dirichlet process mixture model. Mixture weights π ∼ GEM(α) follow a stick-breaking process, while cluster parameters are assigned independent priors θk ∼ H(λ). Left: indicator variable representation, in which zi ∼ π is the cluster that generates xi ∼ F(θzi). Right: alternative distributional form, in which G is an infinite discrete distribution on Θ; θi ∼ G are the parameters of the cluster that generates xi ∼ F(θi).]

Rather than choosing a finite model order K, Dirichlet process mixtures use the stick-breaking prior to control complexity; this relaxation leads to algorithms which automatically infer the number of clusters exhibited by a particular dataset. The predictive distribution implied by the Chinese restaurant process has a clustering bias and favors simpler models in which observations (customers) share parameters (dishes); additional clusters (tables) appear as more observations are generated. (from Sudderth, 2006)

G ∼ DP(α, H)

θi ∼ G

xi ∼ F(θi)
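A concrete generative instance (my sketch, assuming numpy): a DP mixture of 1-D Gaussians sampled through the CRP representation.

```python
# Generate N points from a DP mixture of Gaussians: assignments follow
# CRP(α); cluster means θ_k ~ H = N(0, 3²); data x_i ~ N(θ_{z_i}, 0.5²).
import numpy as np

def sample_dpmm(n, alpha, rng, tau=3.0, sigma=0.5):
    means, counts, x, z = [], [], [], []
    for N in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(means):                       # new cluster
            means.append(rng.normal(0.0, tau))    # θ_k ~ H
            counts.append(0)
        counts[k] += 1
        z.append(k)
        x.append(rng.normal(means[k], sigma))     # x_i ~ F(θ_{z_i})
    return np.array(x), np.array(z)

rng = np.random.default_rng(0)
x, z = sample_dpmm(200, alpha=1.0, rng=rng)
print("clusters:", z.max() + 1, "sizes:", np.bincount(z))
```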


The hierarchical Dirichlet process

Related subgroups of data

Dataset with J related groups x = (x1, . . . , xJ).
xj = (xj1, . . . , xjNj) contains Nj observations.
We want these groups to share clusters (to transfer knowledge).

[Figure: J related groups of observations xj1, . . . , xjNj (from Jordan, 2005).]


The hierarchical Dirichlet process

Hierarchical Dirichlet process (HDP)

Global probability measure G0 ∼ DP(γ, H) defines a set of shared clusters:

G0(θ) = ∑_{k=1}^{∞} βk δ(θ, θk),   θk ∼ H,   β ∼ GEM(γ)

Group-specific distributions Gj ∼ DP(α, G0):

Gj(θ) = ∑_{t=1}^{∞} π̃jt δ(θ, θ̃jt),   θ̃jt ∼ G0,   π̃j ∼ GEM(α)

Note G0 as base measure! Each local cluster has its parameter θ̃jt copied from some global cluster θk.

For each group, data points are generated according to:

θji ∼ Gj

xji ∼ F(θji)


The hierarchical Dirichlet process

The HDP mixture model (HDPMM)

[Graphical model: γ and H generate G0 ∼ DP(γ, H); for each group j = 1, . . . , J, Gj ∼ DP(α, G0), θji ∼ Gj, and xji ∼ F(θji) for i = 1, . . . , Nj.]

[Figure 2.28 (from Sudderth, 2006): directed graphical representations of a hierarchical Dirichlet process (HDP) mixture model. Global cluster weights β ∼ GEM(γ) follow a stick-breaking process, while cluster parameters are assigned independent priors θk ∼ H(λ). Left: explicit stick-breaking representation, in which each group reuses the global clusters with weights πj ∼ DP(α, β); zji ∼ πj indicates the cluster that generates xji ∼ F(θzji). Right: alternative distributional form, in which G0 ∼ DP(γ, H) is an infinite discrete distribution on Θ, and Gj ∼ DP(α, G0) a reweighted, group-specific distribution; θji ∼ Gj are then the parameters of the cluster that generates xji ∼ F(θji).]

Thus β determines the average weight of local clusters (E[πjk] = βk), while α controls the variability of cluster weights across groups. The discreteness of G0 plays a critical role in this construction: had we instead taken Gj ∼ DP(α, H) with H continuous, the stick-breaking construction shows that the groups would learn independent sets of disjoint clusters. (from Sudderth, 2006)

G0 ∼ DP(γ, H)

Gj ∼ DP(α, G0)

θji ∼ Gj

xji ∼ F(θji)
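A generative sketch (mine, assuming numpy) in the spirit of the Chinese restaurant franchise view of this model: each group runs a local CRP over tables, and each new table orders a dish from a global CRP, so dishes (clusters) are shared across groups.

```python
# Chinese-restaurant-franchise-style sampler for an HDP Gaussian mixture:
# customers sit at local tables (CRP with α); new tables pick a global dish
# (CRP with γ over dishes); dish parameters θ_k ~ H = N(0, 3²).
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 1.0, 1.0
dish_means, dish_tables = [], []          # global dishes; #tables serving each

def sample_group(n):
    tables, table_dish, z = [], [], []
    for N in range(n):
        probs = np.array(tables + [alpha], dtype=float)
        t = rng.choice(len(probs), p=probs / probs.sum())
        if t == len(tables):              # new table: order a dish globally
            dprobs = np.array(dish_tables + [gamma], dtype=float)
            k = rng.choice(len(dprobs), p=dprobs / dprobs.sum())
            if k == len(dish_means):      # brand-new global dish
                dish_means.append(rng.normal(0.0, 3.0))
                dish_tables.append(0)
            dish_tables[k] += 1
            tables.append(0)
            table_dish.append(k)
        tables[t] += 1
        z.append(table_dish[t])
    return np.array(z)

groups = [sample_group(100) for _ in range(3)]
x = [rng.normal(np.take(dish_means, z), 0.5) for z in groups]   # x_ji ~ F(θ)
print("global dishes:", len(dish_means))
print("dishes used per group:", [sorted(set(z.tolist())) for z in groups])
```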


The hierarchical Dirichlet process

The HDP mixture model (HDPMM)

Gj(θ) = ∑_{t=1}^{∞} π̃jt δ(θ, θ̃jt),   θ̃jt ∼ G0,   π̃j ∼ GEM(α)

G0 is discrete, so each group might create several copies of the same global cluster. Aggregating the probabilities:

Gj(θ) = ∑_{k=1}^{∞} πjk δ(θ, θk),   πjk = ∑_{t: kjt = k} π̃jt

It can be shown that πj ∼ DP(α, β) (viewing β as a measure on the positive integers).
β = (β1, β2, . . .): average weights of the global clusters.
πj = (πj1, πj2, . . .): group-specific weights.
α controls the variability of cluster weights across groups.
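A numeric illustration of the aggregation step (my sketch, assuming numpy): draw a truncated β ∼ GEM(γ), then for each group draw π̃j ∼ GEM(α) with atom labels kjt ∼ β, and aggregate πjk = ∑_{t: kjt = k} π̃jt; the group weights concentrate around β, i.e. E[πjk] ≈ βk.

```python
# Aggregation check for the HDP: π_jk = Σ_{t: k_jt = k} π̃_jt with
# k_jt ~ β and π̃_j ~ GEM(α); empirically E[π_jk] ≈ β_k.
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, K, T, n_groups = 1.0, 5.0, 50, 500, 2000

b = rng.beta(1.0, gamma, size=K)                  # truncated β ~ GEM(γ)
beta = b * np.concatenate(([1.0], np.cumprod(1 - b)[:-1]))
beta /= beta.sum()                                # renormalize the truncation

pis = np.zeros((n_groups, K))
for j in range(n_groups):
    v = rng.beta(1.0, alpha, size=T)              # truncated π̃_j ~ GEM(α)
    w = v * np.concatenate(([1.0], np.cumprod(1 - v)[:-1]))
    k = rng.choice(K, size=T, p=beta)             # atom labels k_jt ~ β
    np.add.at(pis[j], k, w)                       # aggregate duplicate atoms

print(np.abs(pis.mean(axis=0) - beta).max())      # ≈ 0: E[π_jk] = β_k
```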


The hierarchical Dirichlet process

THANK YOU. QUESTIONS?
