
Page 1: Hierarchical Dirichlet Processes

Sharing Clusters Among Related Groups:
Hierarchical Dirichlet Processes

Y. W. Teh, M. I. Jordan, M. J. Beal & D. M. Blei, NIPS 2004

Presented by Yuting Qi

ECE Dept., Duke Univ.

08/26/05

Page 2: Hierarchical Dirichlet Processes

Overview

Motivation
Dirichlet Processes
Hierarchical Dirichlet Processes
Inference
Experimental results
Conclusions

Page 3: Hierarchical Dirichlet Processes

Motivation

Multi-task learning: clustering.

Goal: share clusters among multiple related clustering problems (model-based).

Approach: hierarchical, nonparametric Bayesian; a DP mixture model learns a generative model over the data, treating the classes as hidden variables.

Page 4: Hierarchical Dirichlet Processes

Dirichlet Processes

Let (Θ, B) be a measurable space, G0 a probability measure on that space, and α0 a positive real number.

A Dirichlet process is the distribution of a random probability measure G over (Θ, B) such that, for all finite measurable partitions (A1, …, Ar) of Θ,

(G(A1), …, G(Ar)) ~ Dir(α0G0(A1), …, α0G0(Ar)).

We write G ~ DP(α0, G0) if G is a random probability measure with distribution given by the Dirichlet process.

Properties: a draw G from a DP is discrete with probability one,

G = Σk βk δӨk,  Өk ~ G0 i.i.d.,

where the weights βk are random and depend on α0. Because G is discrete, samples drawn from G are generally not distinct.
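The weights βk above come from the stick-breaking construction. A minimal illustrative sketch (my own addition, not code from the talk; G0 is taken to be N(0, 1) and the truncation level K is arbitrary):

```python
import numpy as np

def stick_breaking(alpha0, K, rng):
    """Truncated stick-breaking draw of DP weights beta_k."""
    v = rng.beta(1.0, alpha0, size=K)               # stick fractions v_k ~ Beta(1, alpha0)
    # beta_k = v_k * prod_{l<k} (1 - v_l)
    return v * np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])

rng = np.random.default_rng(0)
beta = stick_breaking(alpha0=2.0, K=100, rng=rng)   # random weights, depend on alpha0
theta = rng.normal(0.0, 1.0, size=100)              # atom locations theta_k ~ G0 = N(0, 1)
print(beta[:5], beta.sum())                         # weights decay; truncated sum is near 1
```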

Page 5: Hierarchical Dirichlet Processes

Chinese Restaurant Processes

CRP (the Pólya urn scheme): Φ1, …, Φi-1 are i.i.d. r.v. distributed according to G; let Ө1, …, ӨK be the distinct values taken on by Φ1, …, Φi-1, and nk the number of Φi′ = Өk, 0 < i′ < i. Then

Φi | Φ1, …, Φi-1 ~ Σk nk / (i − 1 + α0) δӨk + α0 / (i − 1 + α0) G0.

This slide is from “Chinese Restaurants and Stick-Breaking: An Introduction to the Dirichlet Process”, NLP Group, Stanford, Feb. 2005
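A small CRP simulator as an illustration (an assumption of mine, not from the presentation): a customer joins an existing table with probability proportional to its occupancy nk, or opens a new table with probability proportional to α0.

```python
import numpy as np

def crp(n_customers, alpha0, rng):
    """Simulate table assignments under the Chinese restaurant process."""
    counts = []                               # n_k: customers per table
    seats = []
    for _ in range(n_customers):
        probs = np.array(counts + [alpha0], dtype=float)
        probs /= probs.sum()                  # P(old table k) ~ n_k, P(new table) ~ alpha0
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                  # open a new table
        else:
            counts[k] += 1
        seats.append(k)
    return seats, counts

seats, counts = crp(100, alpha0=2.0, rng=np.random.default_rng(1))
print(len(counts), counts)                    # number of tables grows roughly as alpha0*log(n)
```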

Page 6: Hierarchical Dirichlet Processes

DP Mixture Model

One of the most important applications of the DP: a nonparametric prior distribution on the components of a mixture model.

Why not apply the DP directly to density estimation? Because a draw G is discrete with probability one; convolving it with the kernel F yields a smooth density.

G ~ DP(α0, G0)

Φi | G ~ G

xi | Φi ~ F(Φi)
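A hedged sketch of sampling from this generative process with a truncated stick-breaking draw of G (the 1-D Gaussian F and all hyperparameter values are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 100                                       # truncation level

# G ~ DP(alpha0, G0): weights via stick-breaking, atoms theta_k ~ G0 = N(0, 3^2)
v = rng.beta(1.0, 2.0, size=K)
beta = v * np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
theta = rng.normal(0.0, 3.0, size=K)

# phi_i | G ~ G (pick an atom), then x_i | phi_i ~ F(phi_i) = N(phi_i, 0.5^2)
z = rng.choice(K, size=500, p=beta / beta.sum())
x = rng.normal(theta[z], 0.5)
print(f"{len(np.unique(z))} distinct components used among 500 draws")
```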

Page 7: Hierarchical Dirichlet Processes

HDP – Problem statement

We have J groups of data, {Xj}, j=1,…, J. For each group, Xj={xji}, i=1, …, nj.

In each group, Xj={xji} are modeled with a mixture model. The mixing proportions are specific to the group.

Different groups share the same set of mixture components (the underlying clusters Өk), but each group combines them with its own mixing proportions.

Goal: discover the cluster structure within each group; discover how the clusters are shared across groups.

Page 8: Hierarchical Dirichlet Processes

HDP - General representation

G0: the global probability measure, G0 ~ DP(r, H); r is a concentration parameter and H is the base measure.

Gj: the probability measure for group j, Gj | G0 ~ DP(α0, G0).

Φji: the hidden parameter of the distribution F(Φji) corresponding to xji.

The overall model is:

G0 | r, H ~ DP(r, H)
Gj | α0, G0 ~ DP(α0, G0),  j = 1, …, J
Φji | Gj ~ Gj
xji | Φji ~ F(Φji)

Two-level DPs.
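A rough truncated sketch of the two-level sharing (my own construction, not code from the talk): the groups re-weight the same global atoms, and under truncation πj ~ Dirichlet(α0·β) stands in for Gj ~ DP(α0, G0).

```python
import numpy as np

def gem(alpha, K, rng):
    """Truncated stick-breaking (GEM) weights of a DP draw."""
    v = rng.beta(1.0, alpha, size=K)
    return v * np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])

rng = np.random.default_rng(3)
K, J = 20, 3                                  # truncation level, number of groups
r, alpha0 = 2.0, 5.0                          # top- and group-level concentrations

beta = gem(r, K, rng)                         # global weights of G0 ~ DP(r, H)
beta /= beta.sum()                            # renormalize after truncation
theta = rng.normal(0.0, 3.0, size=K)          # shared atoms theta_k ~ H = N(0, 3^2)

# Each group re-weights the SAME atoms: pi_j ~ Dirichlet(alpha0 * beta) is a
# finite-truncation stand-in for G_j ~ DP(alpha0, G_0).
pi = rng.dirichlet(alpha0 * beta, size=J)
for j in range(J):
    top = np.argsort(pi[j])[::-1][:4]
    print(f"group {j}: dominant shared components {top}")
```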

Page 9: Hierarchical Dirichlet Processes

HDP - General representation

G0 places non-zero mass only on the atoms {Өk}, thus

G0 = Σk βk δӨk,  Өk ~ H i.i.d.;

each Gj is drawn from DP(α0, G0), so it places its mass on the same atoms:

Gj = Σk πjk δӨk.

Page 10: Hierarchical Dirichlet Processes

HDP-CR franchise

First level: within each group, a DP mixture.

Φj1, …, Φj,i-1 are i.i.d. r.v. distributed according to Gj; let Ѱj1, …, ѰjTj be the distinct values taken on by Φj1, …, Φj,i-1, and njt the number of Φji′ = Ѱjt, 0 < i′ < i.

Second level: across groups, sharing components. The base measure of each group is itself a draw from a DP:

Ѱjt | G0 ~ G0,  G0 ~ DP(r, H).

Let Ө1, …, ӨK be the distinct values taken on by Ѱj1, …, ѰjTj, and mk the number of Ѱjt = Өk over all j, t.

Gj ~ DP(α0, G0),  Φji | Gj ~ Gj,  xji | Φji ~ F(Φji)

Page 11: Hierarchical Dirichlet Processes

HDP-CR franchise

Values of Φji are shared among groups.

Integrating out G0 couples the groups through a second-level Pólya urn over the dishes Өk.

Page 12: Hierarchical Dirichlet Processes

Inference – MCMC

Gibbs sampling of the posterior in the CR franchise:

Instead of dealing directly with Φji and Ѱjt to obtain p(Φ, Ѱ | X), we sample p(t, k, Ө | X) through the index variables t, k, Ө, where:

t = {tji}: tji is the index of the table with which Φji is associated, Φji = Ѱj,tji. k = {kjt}: kjt is the index of the dish on which Ѱjt takes its value, Ѱjt = Өkjt.

Given the prior structure of the CR franchise, the posterior is sampled iteratively:

Sampling t:

Sampling k:

Sampling Ө:
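The slide gave these three conditionals as equations. As an illustrative sketch of the first one only (entirely my own construction, with placeholder likelihood functions loglik and loglik_new): an existing table t is chosen with probability proportional to its occupancy njt times the likelihood of xji under that table's dish, and a new table with probability proportional to α0 times the dish-level urn's marginal likelihood.

```python
import numpy as np

def table_log_posterior(x_ji, n_jt, dish_of_table, m_k, alpha0, r,
                        loglik, loglik_new):
    """Unnormalized log-probabilities for resampling t_ji in the CR franchise.

    n_jt          : occupancy counts of tables in group j (with x_ji removed)
    dish_of_table : dish index k_jt for each table in group j
    m_k           : numpy array of table counts per dish over all groups (all > 0)
    loglik(x, k)  : log f(x | data at dish k)  -- placeholder, e.g. conjugate posterior
    loglik_new(x) : log f(x | base measure H)  -- placeholder marginal under H
    """
    # Existing tables: p(t_ji = t) ~ n_jt * f(x_ji | dish k_jt)
    logp = [np.log(n) + loglik(x_ji, dish_of_table[t])
            for t, n in enumerate(n_jt)]
    # New table: p ~ alpha0 * [ sum_k m_k/(m.+r) f(x|k) + r/(m.+r) f(x|H) ]
    m_tot = m_k.sum()
    mix = np.logaddexp.reduce(
        [np.log(m) - np.log(m_tot + r) + loglik(x_ji, k)
         for k, m in enumerate(m_k)] +
        [np.log(r) - np.log(m_tot + r) + loglik_new(x_ji)])
    logp.append(np.log(alpha0) + mix)
    return np.array(logp)   # normalize and sample t_ji from exp(logp)
```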

Page 13: Hierarchical Dirichlet Processes

Experiments on the synthetic data

Data description: there are three groups of data; each group is a Gaussian mixture; different groups can share the same clusters; each cluster has 50 2-D data points, with independent features.

[Figure: original data for Group 1, Group 2, and Group 3, plotted in the (x(1), x(2)) plane with seven clusters labeled 1–7. Group 1: [1, 2, 3, 7]; Group 2: [3, 4, 5, 7]; Group 3: [5, 6, 1, 7].]
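A hedged sketch of generating comparable synthetic data (the cluster means and spread are my own guesses; only the group-to-cluster memberships come from the slide):

```python
import numpy as np

rng = np.random.default_rng(4)

# Seven shared 2-D Gaussian clusters; mean locations are illustrative assumptions.
means = np.array([[1, 1], [2, 5], [4, 2], [5, 5],
                  [7, 3], [8, 5], [4, 4]], dtype=float)
groups = {1: [1, 2, 3, 7], 2: [3, 4, 5, 7], 3: [5, 6, 1, 7]}  # from the slide

data = {}
for g, clusters in groups.items():
    pts = [rng.normal(means[c - 1], 0.3, size=(50, 2)) for c in clusters]
    data[g] = np.vstack(pts)            # 50 points per cluster, independent features
    print(f"group {g}: {data[g].shape[0]} points from clusters {clusters}")
```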

Page 14: Hierarchical Dirichlet Processes

Experiments on the synthetic data

HDP definition:

Here F(xji | φji) is a Gaussian distribution with φji = {μji, σji}; each φji takes one of the values θk = {μk, σk}, k = 1, ….

μ ~ N(m, σ/β), σ⁻¹ ~ Gamma(a, b), i.e., H is a joint Normal-Gamma distribution; m, β, a, b are given hyperparameters.
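A small sketch of drawing cluster parameters from this base measure H (the hyperparameter values are placeholders; I read σ/β as a variance and treat b as a Gamma rate, both assumptions):

```python
import numpy as np

def sample_H(m, beta, a, b, rng):
    """Draw (mu, sigma) from the Normal-Gamma base measure H."""
    prec = rng.gamma(shape=a, scale=1.0 / b)    # sigma^{-1} ~ Gamma(a, b), b as a rate
    sigma = 1.0 / prec
    mu = rng.normal(m, np.sqrt(sigma / beta))   # mu ~ N(m, sigma / beta)
    return mu, sigma

rng = np.random.default_rng(5)
print([sample_H(m=3.0, beta=1.0, a=2.0, b=1.0, rng=rng) for _ in range(3)])
```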

Goal: model each group as a Gaussian mixture; model the cluster distribution over groups.

Page 15: Hierarchical Dirichlet Processes

Experiments on the synthetic data

Results on synthetic data. Global distribution:

[Figure: estimated underlying distribution for Group 1, Group 2, and Group 3 in the (x(1), x(2)) plane, with the components estimated over all groups and their corresponding mixing proportions; bar plot of the global mixing proportion (over groups) against component index.]

The number of components is open-ended; only part of it is shown here.

Page 16: Hierarchical Dirichlet Processes

Experiments on the synthetic data

Mixture within each group:

[Figure: bar plots of the mixture proportion (over data) within groups 1, 2, and 3, each plotted against component index.]

The number of components in each group is also open-ended; only part of it is shown here.

Page 17: Hierarchical Dirichlet Processes

Conclusions & discussions

This hierarchical Bayesian method can automatically determine the appropriate number of mixture components.

A set of DPs is coupled via a shared base measure to achieve component sharing among groups.

The DPs serve as nonparametric priors, not as nonparametric density estimators.