
Bayesian Distance Clustering

David Dunson (with Leo Duan)
Departments of Statistical Science & Mathematics, Duke University
[email protected]


1 Motivation

2 Bayesian Distance Clustering

3 Theory: A Deeper Dive

4 Applications

5 Discussion


Motivation


Survey on Clustering Methods: Clustering

One of the main tools for unsupervised statistical analysis.
Often the first step in simplifying complex data: dividing them into small & homogeneous groups.
Great uncertainty exists in clustering; clusters tend to overlap.
Generative models via a mixture likelihood are extremely useful & popular. Consider $y_i \in \mathcal{Y} \subseteq \mathbb{R}^p$:

$$y_i \overset{iid}{\sim} f, \qquad f(y) = \sum_{h=1}^{k} \pi_h K(y; \theta_h).$$

Equivalently, there is a latent clustering label $c_i \in \{1, \dots, k\}$:

$$y_i \sim K(\theta_{c_i}), \qquad \mathrm{pr}(c_i = h) = \pi_h.$$
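To make the latent-label formulation concrete, here is a minimal Python sketch that draws from a one-dimensional Gaussian mixture through the labels $c_i$; the function name and the particular weights and component parameters are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, pi, mus, sigmas):
    """Draw n points from f(y) = sum_h pi_h * No(y; mu_h, sigma_h^2) via latent labels."""
    c = rng.choice(len(pi), size=n, p=pi)   # pr(c_i = h) = pi_h
    y = rng.normal(mus[c], sigmas[c])       # y_i ~ K(theta_{c_i})
    return y, c

y, c = sample_mixture(500, pi=[0.6, 0.4],
                      mus=np.array([0.0, 3.0]), sigmas=np.array([1.0, 0.5]))
```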


Survey on Clustering Methods: How to choose K?

Dilemma: balancing complexity vs flexibility.
Simple K: fewer parameters, easy computation, e.g. the Gaussian mixture model (GMM) and its approximation (k-means).
But strong assumptions lack robustness: sub-Gaussian tails, symmetry, elliptical contours.
Many practical problems arise; e.g., as n increases, a slight violation of the assumptions yields an unbounded number of clusters (Miller and Dunson 2018).

[Figure: number of clusters found when fitting a GMM to 2 mildly skewed Gaussians.]
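This failure mode is easy to reproduce numerically. A minimal sketch, assuming scikit-learn and SciPy are available; the skewness level, sample size, and BIC-based selection of k are illustrative choices, not from the slides.

```python
import numpy as np
from scipy.stats import skewnorm
from sklearn.mixture import GaussianMixture

# Two mildly right-skewed components, true k = 2.
y = np.concatenate([
    skewnorm.rvs(a=4, loc=0.0, scale=1.0, size=2000, random_state=1),
    skewnorm.rvs(a=4, loc=4.0, scale=1.0, size=2000, random_state=2),
]).reshape(-1, 1)

# The BIC-selected number of Gaussian components tends to exceed 2 as n grows.
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(y).bic(y)
       for k in range(1, 9)}
print(min(bic, key=bic.get))
```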


More flexible K:
1. Skewed Gaussian (Lin, Lee and Yen 2007)
2. t-kernel (Peel and McLachlan 2000)
3. Another layer of uniform mixture (Rodriguez and Walker 2014)
4. Copula (Kosmidis and Karlis 2015)

The number of parameters becomes unwieldy: it is easy to overfit at finite n, and computation becomes intractable for large p.
Imagine: how many parameters do you need to handle skewness in $y_i \in \mathbb{R}^{100}$? How do you run MCMC on those?


Some new solutions via 'assumption weakening':
1. A Gaussian mixture with one extra improper-uniform component (Coretto and Hennig 2016)
2. A KL-divergence ball around the exact Gaussian mixture likelihood (Miller and Dunson 2018)

It is not clear how to extend these to more complicated objects, e.g. clustering multiple time series in brain EEG data.


Apparently, we want some new tool that is:
Probabilistic: critical for uncertainty quantification (UQ).
Simple: a small number of parameters, even for complicated data.
Theoretically well justified and (almost) tuning free.


Bayesian Distance Clustering


Why use distances? First, consider pairwise differences $d_{i,i'} = y_i - y_{i'}$.

1 If $y_i$ and $y_{i'}$ are in the same cluster, then because the observations are iid, $E\, d_{i,i'}^{\,r} = \vec{0}$ for $r = 1, 3, 5, \dots$.
2 $d_{i,i'}$ is unimodal at $\vec{0}$ as long as $y_i$ is unimodal (Hodges and Lehmann 1954).
3 Intuitively, the within-cluster distances $\|d_{i,i'}\|$ will concentrate near 0 (for most norms $\|\cdot\|$).
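A quick numerical illustration of point 3, as a minimal sketch with two spherical Gaussian clusters; the dimension, means, and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
y1 = rng.normal(0.0, 1.0, size=(500, p))  # cluster 1
y2 = rng.normal(3.0, 1.0, size=(500, p))  # cluster 2

within = np.linalg.norm(y1[:250] - y1[250:], axis=1)   # same-cluster pairs
between = np.linalg.norm(y1[:250] - y2[:250], axis=1)  # cross-cluster pairs
print(within.mean(), between.mean())  # within-cluster distances sit much closer to 0
```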


$d_{i,i'}$ are much easier to model than $y_i$!
How do we obtain a 'coherent' likelihood for $d_{i,i'}$?
Imagine an unknown oracle likelihood, conditioned on the latent labels $c^{(n)}$:

$$\mathcal{L}^*(y^{(n)}; c^{(n)}) = \prod_{h=1}^{k} \prod_{i: c_i = h} K_h(y_i) = \prod_{h=1}^{k} \mathcal{L}_h(y^{[h]}).$$

Transform $\mathcal{L}_h(y^{[h]})$ using $y^{[h]}_1$ and $(d^{[h]}_{2,1}, \dots, d^{[h]}_{n_h,1})$:

$$\mathcal{L}^*_h(y^{[h]}) = K_h(y^{[h]}_1) \prod_{i=2}^{n_h} G_h(y^{[h]}_i - y^{[h]}_1 \mid y^{[h]}_1) = K_h(y^{[h]}_1 \mid d^{[h]}_{2,1}, \dots, d^{[h]}_{n_h,1})\, G_h(d^{[h]}_{2,1}, \dots, d^{[h]}_{n_h,1}).$$

Discard $y^{[h]}_1$ (which would require heavy assumptions to model) by integrating it out.


We get a Distance Likelihood:

$$\mathcal{L}(y^{(n)}; c^{(n)}) = \prod_{h=1}^{k} G_h(d^{[h]}_{2,1}, \dots, d^{[h]}_{n_h,1}).$$

Since we don't know the oracle, we need to specify $G_h$.
This is much easier to do, since we only need to worry about covariance and tails!
Noting that $d^{[h]}_{i,i'} = d^{[h]}_{i,1} - d^{[h]}_{i',1}$, we propose an over-complete form as a simple product:

$$G_h(d^{[h]}_{2,1}, \dots, d^{[h]}_{n_h,1}) = \prod_{i=2}^{n_h} g_h^{\alpha_h}(d^{[h]}_{i,1}) \Big\{ \prod_{i=3}^{n_h} \prod_{1 < i' < i} g_h^{\alpha_h}(d^{[h]}_{i,i'}) \Big\} = \prod_{i=2}^{n_h} \prod_{i' < i} g_h^{\alpha_h}(d^{[h]}_{i,i'}),$$

$g_h$: marginal density; $\alpha_h$: calibration parameter (to be set later).
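In code, the log of this product-form likelihood is just a sum over within-cluster pairs. A minimal sketch, where `D` is an n-by-n matrix of pairwise distances and `log_g` maps each cluster label to a vectorized log marginal density; all names here are hypothetical.

```python
import numpy as np

def log_distance_likelihood(D, c, log_g, alpha):
    """log L = sum_h alpha_h * sum_{i' < i : c_i = c_{i'} = h} log g_h(d_{i,i'})."""
    total = 0.0
    for h in np.unique(c):
        idx = np.flatnonzero(c == h)
        sub = D[np.ix_(idx, idx)]            # within-cluster distance block
        iu = np.triu_indices(len(idx), k=1)  # each pair i' < i counted once
        total += alpha[h] * log_g[h](sub[iu]).sum()
    return total

# Example with an unnormalized Laplace log-density, log g(d) = -d up to a constant.
D = np.abs(np.subtract.outer([0.1, 0.2, 3.0], [0.1, 0.2, 3.0]))
c = np.array([0, 0, 1])
print(log_distance_likelihood(D, c, {0: lambda d: -d, 1: lambda d: -d},
                              {0: 1.0, 1: 1.0}))
```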


Due to the linear constraint $d^{[h]}_{i,i'} = d^{[h]}_{i,1} - d^{[h]}_{i',1}$, covariances among the $d^{[h]}_{i,i'}$'s are automatically induced:

Lemma. $\mathrm{var}(d^{[h]}_{i,i'}) = 2\Sigma$ and $\mathrm{cov}(d^{[h]}_{i,i'}, d^{[h]}_{i,i''}) = \Sigma$ for $i' \neq i''$, $i \neq i'$, $i \neq i''$.

[Figure: likelihood for 3 distances, $G_h(D^{[h]}) = \exp(-|d_{21}|)\exp(-|d_{31}|)\exp(-|d_{32}|)$, shown as a density over $(d_{21}, d_{31})$.]
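The induced covariance structure in the Lemma can be checked by simulation; a minimal sketch for one cluster with $\Sigma = 1$ (the sample size is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=(100_000, 3))  # iid y_1, y_2, y_3 within one cluster, Sigma = 1
d21, d31 = y[:, 1] - y[:, 0], y[:, 2] - y[:, 0]
print(np.var(d21))             # ~2 = 2 * Sigma
print(np.cov(d21, d31)[0, 1])  # ~1 = Sigma (the shared index i = 1 induces covariance)
```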


The product form makes it easy to satisfy coherence conditions:

Lemma (Exchangeability).
$$\mathcal{L}(y^{(n)}; c^{(n)}) = \mathcal{L}(y^{(n^*)}; c^{(n^*)}),$$
with $(n^*) = \{1^*, \dots, n^*\}$ denoting a set of permuted indices.

Lemma (Marginalization).
$$\mathcal{L}(y^{(n)}) = \int_{\mathcal{Y}} \mathcal{L}(y^{(n+1)})\, dy_{n+1}.$$


Choosing the marginal $g_h(d^{[h]}_{i,i'})$ is straightforward, based mostly on the tail assumption and on how to handle the covariance of $y_i$.
How $d^{[h]}_{i,i'}$ enters $g_h$ also determines what type of distance is used. For example, a multivariate Gaussian density

$$g_h(d^{[h]}_{i,i'}) \propto \exp\Big( -\frac{1}{2} \sigma_h^{-1}\, d^{[h]\,T}_{i,i'} S^{-1} d^{[h]}_{i,i'} \Big)$$

corresponds to the Mahalanobis distance $\Delta(y^{[h]}_i, y^{[h]}_{i'}) = (y^{[h]}_i - y^{[h]}_{i'})^T S^{-1} (y^{[h]}_i - y^{[h]}_{i'})$.

For simplicity, we will assume a diagonal covariance for $y_i$ and focus on guaranteeing tail robustness.
Computation is simple via Gibbs sampling.
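To illustrate the Gibbs step for the labels: under the product-form likelihood, the full conditional of $c_i$ is proportional to $\pi_h \prod_{i': c_{i'}=h} g_h(d_{i,i'})^{\alpha_h}$. Below is a minimal sketch of one sweep under assumed fixed weights and fixed $\alpha_h$; this is only a simplification for illustration, not the paper's exact sampler, and all names are hypothetical.

```python
import numpy as np

def gibbs_label_sweep(D, c, log_pi, log_g, alpha, rng):
    """One Gibbs sweep: resample each c_i from its full conditional."""
    n, k = len(c), len(log_pi)
    for i in range(n):
        others = np.arange(n) != i
        logp = np.array([
            log_pi[h] + alpha[h] * log_g[h](D[i, others & (c == h)]).sum()
            for h in range(k)
        ])
        w = np.exp(logp - logp.max())        # stabilize before normalizing
        c[i] = rng.choice(k, p=w / w.sum())
    return c
```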


Theory: A Deeper Dive


Tail bound on pairwise distances

Chernoff bound on distances:

Lemma (Concentration inequality). If all $y^{[h]}_i$'s are sub-exponential with bound parameters $(\nu_h, b_h)$, then $\|d^{[h]}_{i,i'}\|_\infty = \max_{j=1}^{p} |d^{[h]}_{i,i',j}|$ satisfies

$$\mathrm{pr}(\|d^{[h]}_{i,i'}\|_\infty > t) \le 2p \exp\{-t/(2b_h)\} \quad \text{for } t > 2\nu_h^2/b_h.$$

For $p$ not too large, $d^{[h]}_{i,i'}$ can be well approximated by a Laplace.
A $p$-dimensional Laplace with diagonal covariance is equivalent to using weighted $\ell_1$ distances:

$$g_h(d^{[h]}_{i,i'}) = \Big( 2^{-p} \prod_{j=1}^{p} \sigma_{h,j}^{-1} \Big) \exp\Big( -\sum_{j=1}^{p} |d^{[h]}_{i,i',j}| / \sigma_{h,j} \Big).$$
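A minimal sketch of this log density; `d` holds difference vectors in its rows and `sigma` the per-coordinate scales (the names are hypothetical).

```python
import numpy as np

def log_laplace_marginal(d, sigma):
    """log g_h(d) = -p*log(2) - sum_j log(sigma_j) - sum_j |d_j| / sigma_j."""
    d = np.atleast_2d(d)
    const = -d.shape[1] * np.log(2.0) - np.log(sigma).sum()
    return const - (np.abs(d) / sigma).sum(axis=1)

print(log_laplace_marginal(np.array([0.5, -1.0]), sigma=np.array([1.0, 2.0])))
```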


Information Theory on Clustering

One natural question: what do we lose when discarding the information in $K(y^{[h]}_1 \mid \cdot)$?
A relevant information concept: the Bregman divergence (Bregman 1967) between two random variables $x$ and $y$,

$$B_\phi(x, y) = \phi(x) - \phi(y) - (x - y)' \nabla\phi(y),$$

$\phi: \mathrm{dom}\,\phi \to \mathbb{R}$ a strictly convex and differentiable function; $\nabla\phi(y)$ the gradient of $\phi$ at $y$.

Generality: for any regular exponential-family $K_h$, there always exists a Bregman re-parameterization (Banerjee et al. 2005):

$$K_h(y_i; \theta_h) = \exp\{T(y_i)'\theta_h - \psi(\theta_h)\}\,\kappa(y_i) \;\Leftrightarrow\; \exp[-B_\phi\{T(y_i), \mu_h\}]\, b_\phi\{T(y_i)\},$$

$T$: minimal sufficient statistic for $\theta_h$; $\mu_h = E_{y_i \sim K_h} T(y_i)$.
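For a concrete instance: with $\phi(x) = \|x\|^2$, the Bregman divergence reduces to squared Euclidean distance. A minimal sketch checking this numerically.

```python
import numpy as np

def bregman(x, y, phi, grad_phi):
    """B_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>."""
    return phi(x) - phi(y) - (x - y) @ grad_phi(y)

phi = lambda x: x @ x     # phi(x) = ||x||^2, strictly convex
grad = lambda y: 2.0 * y  # its gradient

x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(bregman(x, y, phi, grad), np.sum((x - y) ** 2))  # both equal 2.0
```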


For model-based clustering, maximizing the mixture likelihood $\Leftrightarrow$ minimizing the total Bregman divergence with respect to the mean of $T(y_i)$:

$$H_y = \sum_{h=1}^{k} H^{[h]}_y, \qquad H^{[h]}_y = \sum_{i=1}^{n_h} B_\phi\{T(y_i), \mu_h\}.$$

For distance clustering, now view each distance as some form of pairwise Bregman divergence:

$$H_d = \sum_{h=1}^{k} H^{[h]}_d, \qquad H^{[h]}_d = \alpha_h \lambda_h^{-1} \sum_{i=1}^{n_h} \sum_{i'=1}^{n_h} B_\phi\{T(y^{[h]}_i), T(y^{[h]}_{i'})\}.$$

How do they compare?


Lemma (Expected Bregman Divergences). The model-based and distance-based Bregman divergences have this relationship:

$$E_{y^{[h]}} H^{[h]}_d = (2 n_h \alpha_h \lambda_h^{-1})\; E_{y^{[h]}} \Big[ \frac{ H^{[h]}_y + \sum_{i=1}^{n_h} B_\phi\{\mu_h, T(y^{[h]}_i)\} }{2} \Big],$$

where the expectation over $y^{[h]}$ is taken with respect to $K_h$.

The right-hand side, after taking the expectation, is the symmetrized Bregman divergence between $T(y^{[h]}_i)$ and $\mu_h$.
We can make the two expected divergences equal by setting $\alpha_h = 1/n_h$ and learning $\lambda_h$ adaptively from its posterior.
NO Bregman information is lost for clustering!
Although we implicitly assume $T$ and $\phi$ are chosen in the same way in both $H^{[h]}_d$ and $H^{[h]}_y$, this is easy to achieve or approximate via a sensible choice of the distance density.


A toy example: for $h = 1, \dots, k$,

$$y^{[h]}_i \sim \mathrm{No}(\mu_h, \sigma_h^2), \quad y^{[h]}_i \in \mathbb{R}.$$

Model-based: $H^{[h]}_y = \sum_i (y^{[h]}_i - \mu_h)^2 / \sigma_h^2$.
Distance-based: $H^{[h]}_d = (2 n_h \alpha_h \lambda_h^{-1}) \sum_{i=1}^{n_h} \sum_{i'=1}^{n_h} (d^{[h]}_{i,i'})^2$.

When $\alpha_h = 1/n_h$ and $\lambda_h = 2\sigma_h^2$,

$$E_y H^{[h]}_y = E_y H^{[h]}_d = n - 1.$$

$\lambda_h = 2\sigma_h^2$ can be learned from the posterior because $\mathrm{var}(d^{[h]}_{i,i'}) = 2\,\mathrm{var}(y^{[h]}_i)$.
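The variance identity used to learn $\lambda_h$ is easy to verify by simulation; a minimal sketch (the mean, variance, and sample size are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(2.0, 1.5, size=(200_000, 2))  # iid pairs from one cluster
d = y[:, 0] - y[:, 1]
print(np.var(d), 2 * np.var(y))  # var(d_{i,i'}) = 2 var(y_i) = 2 sigma_h^2 = lambda_h
```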


Data Application


Sim1: Robust to skewness

[Figure, left: true density (red) of a mixture of two right-skewed Gaussians. Right: posterior assignment pr(c_i = 1); dashed: oracle probability; red: Bayesian distance clustering (BDC); green: Gaussian mixture (GMM).]

Interestingly, at small n = 200 and increasing p, Bayesian distance clustering performs even better (in adjusted Rand index) than a mixture of skewed Gaussians (the 'true' kernel), likely due to having fewer parameters.

p    Bayes Dist. Clustering   Mix. of Gaussians    Mix. of Skewed Gaussians
5    0.76 (0.71, 0.81)        0.55 (0.40, 0.61)    0.76 (0.72, 0.80)
10   0.72 (0.68, 0.76)        0.33 (0.25, 0.46)    0.62 (0.53, 0.71)
30   0.71 (0.67, 0.76)        0.25 (0.20, 0.30)    0.43 (0.37, 0.50)


Sim2: Easy to handle constrained data

Figure: clustering data from a two-component mixture of von Mises-Fisher distributions with $\mu_1 = (1, 0)$ and $\mu_2 = (1/\sqrt{2}, 1/\sqrt{2})$. (a) Data on the unit circle colored by true cluster labels. (b) Point clustering estimates from BDC. (c) Point clustering estimates from a mixture of Gaussians model.
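For reference, data like panel (a) can be generated as follows; a minimal sketch for the 2-d case, where the von Mises-Fisher distribution on the circle reduces to a von Mises distribution on the angle (the concentration kappa = 10 and the sample sizes are illustrative assumptions).

```python
import numpy as np
from scipy.stats import vonmises

# Component means (1, 0) and (1/sqrt(2), 1/sqrt(2)) correspond to angles 0 and pi/4.
theta = np.concatenate([
    vonmises.rvs(kappa=10, loc=0.0, size=100, random_state=6),
    vonmises.rvs(kappa=10, loc=np.pi / 4, size=100, random_state=7),
])
y = np.column_stack([np.cos(theta), np.sin(theta)])  # points y_i on the unit circle
```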


Neuroscience application: clustering brain voxels using gene expression

Goal: estimate functional partitions by segmenting the mouse brain according to gene expression levels, and compare them with the structural partitions defined by brain anatomy.
Data: gene transcriptome over a mid-coronal section of 41 x 58 voxels; excluding the empty ones, the sample size is n = 1781. Each voxel has expression levels for 3241 genes. Data are obtained from the Allen Mouse Brain Atlas (Lein et al. 2007).
To avoid the curse of dimensionality on distances, we follow a previous visualization application on the same dataset (Mahfouz et al. 2014) and extract the first p = 30 principal components as the source data y_i; see the sketch below.
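The preprocessing step amounts to one PCA call; a minimal sketch with scikit-learn, using a random placeholder in place of the real voxel-by-gene expression matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
expr = rng.normal(size=(1781, 3241))          # placeholder for the n x 3241 expression matrix
y = PCA(n_components=30).fit_transform(expr)  # first p = 30 principal components
print(y.shape)                                # (1781, 30)
```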


We fit an overfitted model with k = 20 and a Dirichlet(1/20) prior on the component weights. Comparing point estimates:

[Figure: (a) structural partitions (known anatomical labels); (b) functional partitions (clustered by BDC); (c) functional partitions (clustered by a Gaussian mixture model).]


[Figure: (a) functional partitions: point estimates of the labels by BDC; (b) uncertainty: $\mathrm{pr}(c_i \neq \hat{c}_i)$, with $\hat{c}_i$ the point estimate.]

Most of the uncertainty is in the inner layers of the cortical plate (the upper parts of the brain).


Comparing point estimates against others:

Table: comparison of label point estimates using Bayesian distance clustering (BDC), the Gaussian mixture model (GMM), spectral clustering (SC), DBSCAN, and the mixture of factor analyzers (MFA). The similarity measure is computed with respect to the anatomical structure labels.

                                BDC    GMM    SC     DBSCAN  MFA
Adjusted Rand Index             0.49   0.31   0.45   0.43    0.43
Normalized Mutual Information   0.51   0.42   0.46   0.44    0.47
Adjusted Mutual Information     0.51   0.42   0.47   0.45    0.47
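The three similarity measures in the table are standard and available in scikit-learn; a minimal sketch on toy labels (the label vectors shown are made up for illustration).

```python
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             normalized_mutual_info_score)

labels_true = [0, 0, 1, 1, 2, 2]  # e.g. anatomical structure labels
labels_pred = [0, 0, 1, 2, 2, 2]  # e.g. clustering point estimates
print(adjusted_rand_score(labels_true, labels_pred),
      normalized_mutual_info_score(labels_true, labels_pred),
      adjusted_mutual_info_score(labels_true, labels_pred))
```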


Discussion


More general distances / divergences can be included: e.g. geodesic distances, transport distances, etc.
So far we have focused on moderate n, as the number of distances grows as O(n^2). It would be interesting to develop new scalable solutions.
For high-dimensional data, "dimension reduction + clustering" can cause a loss of information. It would be useful to develop new distances that preserve discriminability.


Primary References

Duan, L. L. & Dunson, D. B. (2018). Bayesian distance clustering. arXiv preprint arXiv:1810.08537.

Miller, J. W. & Dunson, D. B. (2018). Robust Bayesian inference via coarsening. Journal of the American Statistical Association, online.

Lein, E. S., Hawrylycz, M. J., Ao, N., Ayres, M., Bensinger, A., Bernard, A., Boe, A. F., Boguski, M. S., Brockway, K. S. & Byrnes, E. J. (2007). Genome-wide atlas of gene expression in the adult mouse brain. Nature 445, 168.

Hodges, J. & Lehmann, E. (1954). Matching in paired comparisons. The Annals of Mathematical Statistics 25, 787-791.

Coretto, P. & Hennig, C. (2016). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association 111, 1648-1659.

Banerjee, A., Merugu, S., Dhillon, I. S. & Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research 6, 1705-1749.

Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics 7, 200-217.