Hierarchical Dirichlet Processes
Y. W. Teh, M. I. Jordan, M. J. Beal & D. M. Blei, NIPS 2004
Presented by Yuting Qi
ECE Dept., Duke Univ.
08/26/05
Sharing Clusters Among Related Groups:
Overview
Motivation; Dirichlet Processes; Hierarchical Dirichlet Processes; Inference; Experimental results; Conclusions
Motivation
Multi-task learning: clustering.
Goal: share clusters among multiple related clustering problems (model-based).
Approach: hierarchical, nonparametric Bayesian; a DP mixture model: learn a generative
model over the data, treating the classes as hidden variables.
Dirichlet Processes
Let (Ө, B) be a measurable space, G0 be a probability measure on the space, and α0 be a positive real number.
A Dirichlet process is the distribution of a random probability measure G over (Ө, B) such that, for all finite measurable partitions (A1,…,Ar) of Ө,
(G(A1),…,G(Ar)) ~ Dir(α0 G0(A1),…,α0 G0(Ar)).
We write G ~ DP(α0, G0) if G is a random probability measure with distribution given by the Dirichlet process.
Draws G from a DP are discrete with probability one: G = Σk βk δ(Өk), where the atoms Өk ~ G0 i.i.d. and the weights βk are random and depend on α0; hence repeated draws from G are generally not distinct.
Properties: E[G(A)] = G0(A); Var[G(A)] = G0(A)(1 - G0(A)) / (α0 + 1).
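The discreteness of G can be made concrete with the stick-breaking construction (a standard equivalent representation, not shown on the slide). A minimal Python sketch, with G0 = N(0, 1) as an illustrative base measure:

```python
import random

def stick_breaking(alpha, num_atoms, rng):
    """Truncated stick-breaking construction of G ~ DP(alpha, G0).

    Weight beta_k = v_k * prod_{l<k} (1 - v_l) with v_k ~ Beta(1, alpha);
    atoms theta_k are i.i.d. draws from the base measure G0 (here N(0, 1)).
    """
    weights, atoms = [], []
    remaining = 1.0
    for _ in range(num_atoms):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= (1.0 - v)
        atoms.append(rng.gauss(0.0, 1.0))  # theta_k ~ G0
    return weights, atoms

rng = random.Random(0)
w, atoms = stick_breaking(alpha=1.0, num_atoms=100, rng=rng)
# G = sum_k w[k] * delta(atoms[k]): a discrete random measure.
```

The weights are positive and sum to (nearly) one, so G concentrates all its mass on countably many atoms.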
Chinese Restaurant Processes
CRP (the Pólya urn scheme): Φ1,…,Φi-1 are i.i.d. r.v. distributed according to G; let Ө1,
…, ӨK be the distinct values taken on by Φ1,…,Φi-1, and nk the # of Φi' = Өk, 0 < i' < i. Then
Φi | Φ1,…,Φi-1 ~ Σk nk/(i-1+α0) δ(Өk) + α0/(i-1+α0) G0.
This slide is from “Chinese Restaurants and Stick-Breaking: An Introduction to the Dirichlet Process”, NLP Group, Stanford, Feb. 2005
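The CRP predictive above can be simulated directly: each customer joins an existing table in proportion to its occupancy, or opens a new one in proportion to α0. A short sketch (n and alpha are illustrative choices):

```python
import random

def crp(n, alpha, rng):
    """Seat n customers by the Chinese restaurant process.

    Customer i joins existing table k with probability n_k / (i - 1 + alpha)
    and opens a new table with probability alpha / (i - 1 + alpha).
    Returns the list of table sizes n_k.
    """
    tables = []  # tables[k] = number of customers at table k
    for i in range(1, n + 1):
        r = rng.uniform(0.0, i - 1 + alpha)
        acc = 0.0
        for k, nk in enumerate(tables):
            acc += nk
            if r < acc:
                tables[k] += 1
                break
        else:
            tables.append(1)  # new table, probability alpha / (i - 1 + alpha)
    return tables

rng = random.Random(1)
sizes = crp(200, alpha=2.0, rng=rng)
```

The number of occupied tables grows only logarithmically in n, which is why the DP yields a small, data-driven number of clusters.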
DP Mixture Model
One of the most important applications of the DP: a nonparametric prior distribution on the components of a
mixture model.
Why no direct application to density estimation? Because G is discrete with probability one; instead, G serves as a prior on the mixture component parameters.
G ~ DP(α0, G0)
Φi | G ~ G
xi | Φi ~ F(Φi)
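The three lines above become a generative sampler once G is integrated out via the Pólya urn. A hedged sketch with illustrative choices G0 = N(0, 3) and F(Φ) = N(Φ, sigma_x), neither of which is specified on the slide:

```python
import random

def dp_mixture_sample(n, alpha, sigma_x, rng):
    """Generate data from a DP Gaussian mixture with G integrated out.

    phi_i follows the CRP predictive: reuse an earlier phi (size-biased)
    or draw a new component mean from G0 = N(0, 3); then x_i ~ N(phi_i, sigma_x).
    """
    phis, xs = [], []
    for i in range(n):
        if rng.random() < alpha / (i + alpha):
            phi = rng.gauss(0.0, 3.0)       # new component mean from G0
        else:
            phi = rng.choice(phis)          # uniform over past draws = size-biased reuse
        phis.append(phi)
        xs.append(rng.gauss(phi, sigma_x))  # observation from F(phi_i)
    return phis, xs

rng = random.Random(2)
phis, xs = dp_mixture_sample(100, alpha=1.0, sigma_x=0.5, rng=rng)
```

Because the Φi repeat, the xi cluster around a random, finite set of means even though no number of components was fixed in advance.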
HDP – Problem statement
We have J groups of data, {Xj}, j=1,…, J. For each group, Xj={xji}, i=1, …, nj.
In each group, Xj={xji} are modeled with a mixture model. The mixing proportions are specific to the group.
Different groups share the same set of mixture components (the underlying clusters {Өk}), but each group mixes them in its own proportions.
Goal: discover the distribution of components within each group; discover the distribution of components across groups.
HDP - General representation
G0: the global probability measure, G0 ~ DP(r, H); r: concentration parameter, H: the base measure.
Gj: the probability distribution for group j, Gj ~ DP(α, G0).
Φji : the hidden parameters of distribution F(Φji) corresponding to xji.
The overall model is:
Two-level DPs.
HDP - General representation
G0 places non-zero mass only on the atoms {Өk}; thus G0 = Σk βk δ(Өk) and each Gj = Σk πjk δ(Өk), with
Өk i.i.d. r.v. distributed according to H.
HDP-CR franchise
First level: within each group, a DP mixture.
Φj1,…,Φj,i-1 are i.i.d. r.v. distributed according to Gj; let Ѱj1,…,ѰjTj be the distinct values taken on by Φj1,…,Φj,i-1, and njt the # of Φji' = Ѱjt, 0 < i' < i.
Second level: across group, sharing components Base measure of each group is a draw from DP:
Ѱjt | G0 ~ G0, G0 ~ DP(r, H),
Let Ө1,…,ӨK be the distinct values taken on by the Ѱjt, and mk the # of Ѱjt = Өk over all j, t.
Gj ~ DP(α0, G0), Φji | Gj ~ Gj, xji | Φji ~ F(Φji)
HDP-CR franchise
Values of Φji are shared among groups.
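The two-level sharing can be sketched as a generative Chinese restaurant franchise. The parameter values below are illustrative, and only the dish (component) indices are tracked, not the atoms Ө drawn from H:

```python
import random

def cr_franchise(group_sizes, alpha, gamma, rng):
    """Generative Chinese-restaurant-franchise sketch.

    Within group j, customer i sits at table t with probability prop. to n_jt,
    or at a new table with probability prop. to alpha.  Each new table orders a
    dish: an existing global dish k with probability prop. to m_k (tables
    serving it, across all groups), or a brand-new dish with prob. prop. to gamma.
    Returns, per group, the dish index of each customer.
    """
    dish_counts = []   # m_k: number of tables (all groups) serving dish k
    group_dishes = []
    for n_j in group_sizes:
        table_counts, table_dish, dishes = [], [], []
        for i in range(n_j):
            r = rng.uniform(0.0, i + alpha)
            acc, t = 0.0, None
            for idx, c in enumerate(table_counts):
                acc += c
                if r < acc:
                    t = idx
                    break
            if t is None:                     # open a new table
                r2 = rng.uniform(0.0, sum(dish_counts) + gamma)
                acc2, k = 0.0, None
                for idx, m in enumerate(dish_counts):
                    acc2 += m
                    if r2 < acc2:
                        k = idx
                        break
                if k is None:                 # brand-new global dish
                    dish_counts.append(0)
                    k = len(dish_counts) - 1
                dish_counts[k] += 1
                table_counts.append(0)
                table_dish.append(k)
                t = len(table_counts) - 1
            table_counts[t] += 1
            dishes.append(table_dish[t])
        group_dishes.append(dishes)
    return group_dishes

rng = random.Random(3)
groups = cr_franchise([50, 50, 50], alpha=1.0, gamma=1.0, rng=rng)
```

Because every table's dish is drawn from the same global menu, the same component index can (and typically does) appear in several groups, which is exactly the sharing mechanism of the HDP.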
Integrating out G0
Inference - MCMC
Gibbs sampling the posterior in the CR franchise:
Instead of dealing directly with Φji and Ѱjt to obtain p(Φ, Ѱ|X), we sample p(t, k, Ө|X), where
t = {tji}: tji is the index of the table that Φji is associated with, Φji = Ѱj,tji; k = {kjt}: kjt is the index of the dish that Ѱjt takes its value on, Ѱjt = Ө(kjt).
Given the prior distribution defined by the CRP franchise, the posterior is sampled iteratively:
Sampling t:
Sampling K:
Sampling Ө:
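As a simpler, verifiable analogue of the table-assignment (t) step, here is a collapsed Gibbs sampler for a single-group, 1-D DP Gaussian mixture with known observation variance; the full HDP sampler layers the k and Ө updates on top of this. All distributions and parameter values are illustrative, not those of the paper:

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gibbs_dpmm(xs, alpha, sigma2, tau2, iters, rng):
    """Collapsed Gibbs for a 1-D DP Gaussian mixture.

    Likelihood N(theta_c, sigma2) with theta integrated out under the base
    measure H = N(0, tau2).  Each sweep reassigns every point to an existing
    cluster (weight n_c * predictive density) or a new one (weight alpha *
    marginal density), mirroring the CRP-franchise table updates.
    """
    n = len(xs)
    labels = [0] * n
    clusters = {0: {"n": n, "sum": sum(xs)}}   # sufficient statistics per cluster
    next_id = 1
    for _ in range(iters):
        for i, x in enumerate(xs):
            c = labels[i]                       # remove point i from its cluster
            clusters[c]["n"] -= 1
            clusters[c]["sum"] -= x
            if clusters[c]["n"] == 0:
                del clusters[c]
            weights, ids = [], []
            for cid, st in clusters.items():    # predictive for existing clusters
                v_post = 1.0 / (1.0 / tau2 + st["n"] / sigma2)
                m_post = v_post * st["sum"] / sigma2
                weights.append(st["n"] * normal_pdf(x, m_post, v_post + sigma2))
                ids.append(cid)
            weights.append(alpha * normal_pdf(x, 0.0, tau2 + sigma2))  # new cluster
            ids.append(None)
            r = rng.uniform(0.0, sum(weights))
            acc, choice = 0.0, ids[-1]
            for w, cid in zip(weights, ids):
                acc += w
                if r < acc:
                    choice = cid
                    break
            if choice is None:                  # open a new cluster
                choice = next_id
                next_id += 1
                clusters[choice] = {"n": 0, "sum": 0.0}
            clusters[choice]["n"] += 1
            clusters[choice]["sum"] += x
            labels[i] = choice
    return labels

rng = random.Random(4)
data = [rng.gauss(-3, 0.5) for _ in range(30)] + [rng.gauss(3, 0.5) for _ in range(30)]
labels = gibbs_dpmm(data, alpha=1.0, sigma2=0.25, tau2=9.0, iters=20, rng=rng)
```

The number of occupied clusters changes from sweep to sweep, which is how the sampler explores the open-ended component count.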
Experiments on the synthetic data
Data description: three groups of data; each group is a Gaussian mixture; different groups can share the same clusters; each cluster has 50 2-D data points with independent features.
[Figure: original synthetic data in the x(1)-x(2) plane, three groups drawn from seven clusters labeled 1-7. Group 1: [1, 2, 3, 7]; Group 2: [3, 4, 5, 7]; Group 3: [5, 6, 1, 7]]
Experiments on the synthetic data
HDP definition:
here, F(xji|Φji) is a Gaussian distribution, Φji = {μji, σji}; Φji takes values on one of Өk = {μk, σk}, k = 1, 2, ….
μ ~ N(m, σ/β), σ^-1 ~ Gamma(a, b), i.e., H is a Normal-Gamma joint distribution. m, β, a, b are given hyperparameters.
Goal: model each group as a Gaussian mixture;
model the cluster distribution over groups.
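Sampling from H as literally stated (σ^-1 Gamma-distributed, then μ given σ) can be sketched as follows; note that the more common convention places the Gamma prior on the precision σ^-2, and the hyperparameter values below are illustrative, not those of the experiment:

```python
import math
import random

def sample_normal_gamma(m, beta, a, b, rng):
    """Draw (mu, sigma) from the base measure H as stated on the slide:
    sigma^{-1} ~ Gamma(a, b) (shape a, rate b), then mu ~ N(m, sigma/beta),
    where sigma/beta is the variance of mu.
    """
    inv_sigma = rng.gammavariate(a, 1.0 / b)   # gammavariate takes (shape, scale)
    sigma = 1.0 / inv_sigma
    mu = rng.gauss(m, math.sqrt(sigma / beta))
    return mu, sigma

rng = random.Random(5)
mu, sigma = sample_normal_gamma(m=5.0, beta=1.0, a=2.0, b=2.0, rng=rng)
```

Each atom Өk = {μk, σk} of the global measure G0 is one such draw from H.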
Experiments on the synthetic data
Results on Synthetic Data
Global distribution:
[Figure: estimated underlying distribution in the x(1)-x(2) plane for Groups 1-3, estimated over all groups, with the corresponding global mixing proportions (over groups) plotted against component index]
The number of components is open-ended; only part of the range is shown.
Experiments on the synthetic data
Mixture within each group:
[Figure: per-group mixing proportions (over data) plotted against component index for the 1st, 2nd, and 3rd groups]
The number of components in each group is also open-ended; only part of the range is shown.
Conclusions & discussions
This hierarchical Bayesian method can automatically determine the appropriate number of mixture components needed.
A set of DPs is coupled via a shared base measure to achieve component sharing among groups.
Nonparametric priors, not nonparametric density estimation.