Hierarchical clustering
David M. Blei
COS 424, Princeton University
February 28, 2008
D. Blei Clustering 02 1 / 21
Hierarchical clustering
• Hierarchical clustering is a widely used data analysis tool.
• The idea is to build a binary tree of the data that successively merges similar groups of points
• Visualizing this tree provides a useful summary of the data
D. Blei Clustering 02 2 / 21
Hierarchical clustering vs. k-means
• Recall that k-means or k-medoids requires
• A number of clusters k
• An initial assignment of data to clusters
• A distance measure between data d(x_n, x_m)
• Hierarchical clustering only requires a measure of similarity between groups of data points.
D. Blei Clustering 02 3 / 21
Agglomerative clustering
• We will talk about agglomerative clustering.
• Algorithm:
1 Place each data point into its own singleton group
2 Repeat: iteratively merge the two closest groups
3 Until: all the data are merged into a single cluster
D. Blei Clustering 02 4 / 21
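The three steps above can be sketched in plain Python. This is a toy implementation (using single linkage to define "closest"; all names and the example data are illustrative, not from the slides):

```python
from itertools import combinations

def euclidean(p, q):
    # Straight-line distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def agglomerate(points):
    """Start with singleton groups, repeatedly merge the two closest
    groups, and return the merges as (group, group, distance) triples."""
    groups = [[i] for i in range(len(points))]
    merges = []
    while len(groups) > 1:
        def link(g, h):
            # Single-linkage distance between groups g and h
            return min(euclidean(points[i], points[j])
                       for i in groups[g] for j in groups[h])
        g, h = min(combinations(range(len(groups)), 2),
                   key=lambda gh: link(*gh))
        merges.append((groups[g], groups[h], link(g, h)))
        merged = groups[g] + groups[h]
        groups = [grp for k, grp in enumerate(groups) if k not in (g, h)]
        groups.append(merged)
    return merges

# Two tight pairs: the cheap merges happen first, the expensive one last
merges = agglomerate([(0, 0), (0, 1), (10, 10), (10, 11)])
print([m[2] for m in merges])
```

Note the O(n^3)-ish cost of this naive version; production implementations maintain a distance matrix and update it after each merge.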
Example

[Figure: scatter plot of the example data (two features, V1 and V2, roughly spanning 0 to 80 and −20 to 80), followed by snapshots of the agglomerative merging at iterations 001 through 024.]

D. Blei Clustering 02 5 / 21
Agglomerative clustering
• Each level of the resulting tree is a segmentation of the data
• The algorithm results in a sequence of groupings
• It is up to the user to choose a "natural" clustering from this sequence
D. Blei Clustering 02 6 / 21
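One common heuristic for choosing that "natural" level (my own illustration, not a method from the slides) is to cut the tree just before the largest jump in merge heights:

```python
def largest_gap_level(heights):
    """Given the sequence of merge heights (one per merge, in order),
    return the index of the merge just before the biggest jump."""
    gaps = [heights[i + 1] - heights[i] for i in range(len(heights) - 1)]
    return max(range(len(gaps)), key=gaps.__getitem__)

# Two tight pairs merged cheaply, then one expensive final merge:
print(largest_gap_level([1.0, 1.2, 14.0]))  # → 1: cut before the last merge
```

Cutting there leaves the groups that existed after merge 1, i.e. the two tight pairs.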
Dendrogram
• Agglomerative clustering is monotonic
• The similarity between merged clusters is monotone decreasing with the level of the merge.
• Dendrogram: Plot each merge at the (negative) similarity between the two merged groups
• Provides an interpretable visualization of the algorithm and data
• Useful summarization tool, part of why hierarchical clustering is popular
D. Blei Clustering 02 7 / 21
Dendrogram of example data

[Figure: cluster dendrogram of the example data, produced by hclust(*, "complete") on dist(x); the vertical axis is merge height.]
Groups that merge at high values relative to the merger values of their subgroups are candidates for natural clusters. (Tibshirani et al., 2001)
D. Blei Clustering 02 8 / 21
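The slide's dendrogram comes from R's hclust with complete linkage on dist(x). A rough Python equivalent, assuming scipy is available (the data below is a made-up stand-in, since the slide's values aren't given):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy two-cluster data standing in for the slide's example
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 5, (10, 2)), rng.normal(50, 5, (10, 2))])

Z = linkage(x, method="complete")  # (n-1) x 4 merge table, like hclust
# Column 2 holds the merge heights; complete linkage keeps them nondecreasing,
# which is the monotonicity that makes the dendrogram well defined
assert all(Z[i, 2] <= Z[i + 1, 2] for i in range(len(Z) - 1))
# dendrogram(Z) would draw the tree (rendering requires matplotlib)
```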
Group similarity
• Given a distance measure between points, the user has many choices for how to define intergroup similarity.
• Three most popular choices:
• Single linkage: the similarity of the closest pair

  d_SL(G, H) = min_{i ∈ G, j ∈ H} d_{i,j}

• Complete linkage: the similarity of the furthest pair

  d_CL(G, H) = max_{i ∈ G, j ∈ H} d_{i,j}

• Group average: the average similarity between groups

  d_GA(G, H) = (1 / (N_G N_H)) Σ_{i ∈ G} Σ_{j ∈ H} d_{i,j}
D. Blei Clustering 02 9 / 21
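The three rules differ only in how they reduce the set of cross-group pairwise distances. A minimal sketch (function names are mine):

```python
def pairwise(G, H, d):
    # All distances between a point in G and a point in H
    return [d(x, y) for x in G for y in H]

def d_single(G, H, d):    # distance of the closest pair
    return min(pairwise(G, H, d))

def d_complete(G, H, d):  # distance of the furthest pair
    return max(pairwise(G, H, d))

def d_average(G, H, d):   # mean over all N_G * N_H cross pairs
    ds = pairwise(G, H, d)
    return sum(ds) / len(ds)

# 1-D example: pairs are |0-4|=4, |0-6|=6, |1-4|=3, |1-6|=5
dist = lambda a, b: abs(a - b)
G, H = [0.0, 1.0], [4.0, 6.0]
print(d_single(G, H, dist), d_complete(G, H, dist), d_average(G, H, dist))
# → 3.0 6.0 4.5
```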
Properties of intergroup similarity
• Single linkage can produce "chaining," where a sequence of close observations in different groups causes early merges of those groups
• Complete linkage has the opposite problem: it might not merge close groups because of outlier members that are far apart.
• Group average represents a natural compromise, but it depends on the scale of the similarities; applying a monotone transformation to the similarities can change the results.
D. Blei Clustering 02 10 / 21
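The chaining effect is easy to see on toy data (my own example): along a chain of evenly spaced points, every split looks "close" to single linkage, while complete linkage sees the full span.

```python
def d(a, b):
    return abs(a - b)

chain = [0.0, 1.0, 2.0, 3.0, 4.0]   # each neighbor is distance 1 apart
left, right = chain[:3], chain[3:]  # one candidate split of the chain

d_sl = min(d(x, y) for x in left for y in right)  # single linkage: 1.0
d_cl = max(d(x, y) for x in left for y in right)  # complete linkage: 4.0
print(d_sl, d_cl)
```

Single linkage would happily keep merging along the chain (every step costs 1.0); complete linkage charges the 4.0 span and so resists treating the chain as one tight cluster.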
Caveats
• Hierarchical clustering should be treated with caution.
• Different decisions about group similarities can lead to vastly different dendrograms.
• The algorithm imposes a hierarchical structure on the data, even data for which such structure is not appropriate.
D. Blei Clustering 02 11 / 21
Examples
• "Repeated Observation of Breast Tumor Subtypes in Independent Gene Expression Data Sets" (Sorlie et al., 2003)
• Hierarchical clustering of gene expression data led to new theories
• Later, the theories were tested in the lab.
D. Blei Clustering 02 12 / 21
D. Blei Clustering 02 13 / 21
Examples
• "The Balance of Roger de Piles" (Studdert-Kennedy and Davenport, 1974)
• Roger de Piles rated 57 paintings along different dimensions.
• These authors cluster them using different methods, including hierarchical clustering
• They discuss the different clusters. (They are art critics.)
D. Blei Clustering 02 14 / 21
Good: They are cautious. "The value of this analysis...will depend on any interesting speculation it may provoke."
D. Blei Clustering 02 15 / 21
Examples
• "Similarity Grouping of Australian Universities" (Stanley and Reynolds, 1994)
• Use hierarchical clustering on Australian universities
• Use features such as
• # of staff in different departments
• entry scores
• funding
• evaluations
D. Blei Clustering 02 16 / 21
One dendrogram
D. Blei Clustering 02 17 / 21
Another dendrogram—notice that it’s different
D. Blei Clustering 02 18 / 21
• Split values: They notice that there is no kink and conclude that there is no cluster structure in Australian universities.
• Good: Cautious interpretation of the clustering; analysis based on multiple subsets of the features.
• Bad: Their conclusion that we can't cluster Australian universities ignores all the algorithmic choices that were made.
D. Blei Clustering 02 19 / 21
Examples
• "Comovement of International Equity Markets: A Taxonomic Approach" (Panton et al., 1976)
• Data: weekly rates of return for stocks in twelve countries
• Run agglomerative clustering year by year
• Interpret the structure and examine stability over different time periods
D. Blei Clustering 02 20 / 21
Examples
Good: Cautious. "This study is only descriptive...A logical subsequent research area is to explain observed structural properties and the causes of structural change."
D. Blei Clustering 02 21 / 21