Statistics 202: Data Mining. Hierarchical clustering. Based in part on slides from the textbook and slides of Susan Holmes. © Jonathan Taylor, December 2, 2012.


Source: statweb.stanford.edu/~jtaylo/courses/stats202/restricted/notes/hierarchical.pdf



Hierarchical clustering

Description

Produces a set of nested clusters organized as a hierarchical tree.

Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.


Hierarchical clustering

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004


A clustering and its dendrogram.


Hierarchical clustering

Strengths

There is no need to assume any particular number of clusters: each horizontal cut of the tree yields a clustering.

The tree may correspond to a meaningful taxonomy (e.g., the animal kingdom, phylogeny reconstruction, ...).

Only a similarity or distance matrix is needed for implementation.


Hierarchical clustering

Agglomerative

Start with the points as individual clusters.

At each step, merge the closest pair of clusters until only one cluster (or some fixed number k of clusters) remains.


Hierarchical clustering

Divisive

Start with one, all-inclusive cluster.

At each step, split a cluster until each cluster contains a single point (or there are k clusters).


Hierarchical clustering

Agglomerative Clustering Algorithm

1 Compute the proximity matrix.
2 Let each data point be a cluster.
3 While there is more than one cluster:
    1 Merge the two closest clusters.
    2 Update the proximity matrix.

The major difference between methods is the computation of the proximity of two clusters.
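The loop above can be sketched in plain Python. This is an illustrative toy only (single linkage is used as the cluster proximity, and distances are recomputed each pass rather than via an explicit proximity-matrix update); names are my own, not from the slides.

```python
# Toy agglomerative clustering (single linkage), following the steps above.
def agglomerative(points, dist, k=1):
    clusters = [[p] for p in points]          # each point starts as a cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):        # find the two closest clusters
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)        # merge the closest pair
    return clusters

print(agglomerative([0.0, 0.1, 0.2, 5.0, 5.1], lambda a, b: abs(a - b), k=2))
```

Stopping at k = 2 on this toy data separates the two tight groups of points.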


Hierarchical clustering

Starting point for agglomerative clustering.


Hierarchical clustering

After some merging steps, we have some clusters (C1, C2, C3, C4, C5) and the proximity matrix between them.

Intermediate point, with 5 clusters.


Hierarchical clustering

We will merge C2 and C5.


Hierarchical clustering

After merging C2 and C5, the question is: how do we update the proximity matrix? The entries between the new cluster C2 ∪ C5 and each remaining cluster (C1, C3, C4) are unknown and must be recomputed.

How do we update the proximity matrix?


Hierarchical clustering

Candidate definitions of the similarity between two clusters:

- MIN
- MAX
- Group average
- Distance between centroids
- Other methods driven by an objective function (Ward's method uses squared error)

We need a notion of similarity between clusters.


Hierarchical clustering


Single linkage uses the minimum distance.


Hierarchical clustering


Complete linkage uses the maximum distance.


Hierarchical clustering


Group average linkage uses the average distance between groups.


Hierarchical clustering


Centroid linkage uses the distance between the centroids of the clusters (presumes one can compute centroids...).
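As a concrete illustration of the four rules above, here is a minimal pure-Python sketch for clusters of 1-D points (the function names are my own, not from the slides):

```python
# The four inter-cluster proximity rules, for clusters of 1-D points.
def d(a, b):
    return abs(a - b)

def single(A, B):         # MIN: distance of the closest pair
    return min(d(a, b) for a in A for b in B)

def complete(A, B):       # MAX: distance of the farthest pair
    return max(d(a, b) for a in A for b in B)

def group_average(A, B):  # mean distance over all cross-cluster pairs
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

def centroid(A, B):       # distance between the cluster centroids
    return d(sum(A) / len(A), sum(B) / len(B))

A, B = [0.0, 1.0], [3.0, 5.0]
print(single(A, B), complete(A, B), group_average(A, B), centroid(A, B))
# → 2.0 5.0 3.5 3.5
```

Note that group average and centroid need not coincide in general; they happen to agree on this example.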


Hierarchical clustering


Cluster Similarity: MIN or Single Link

The similarity of two clusters is based on the two most similar (closest) points in the different clusters. It is determined by one pair of points, i.e., by one link in the proximity graph.

Proximity matrix and dendrogram of single linkage.


Hierarchical clustering

Distance matrix for nested clusterings

        2      3      4      5      6
1    0.24   0.22   0.37   0.34   0.23
2           0.15   0.20   0.14   0.25
3                  0.15   0.28   0.11
4                         0.29   0.22
5                                0.39
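Assuming SciPy is available, a single-linkage hierarchy can be computed directly from a distance matrix like this one by passing its upper triangle in condensed form (a sketch, not part of the original slides):

```python
# Single-linkage clustering from the 6-point distance matrix above.
# `y` is the condensed form: the upper triangle read row by row.
from scipy.cluster.hierarchy import linkage

y = [0.24, 0.22, 0.37, 0.34, 0.23,   # d(1,2) ... d(1,6)
     0.15, 0.20, 0.14, 0.25,         # d(2,3) ... d(2,6)
     0.15, 0.28, 0.11,               # d(3,4) ... d(3,6)
     0.29, 0.22,                     # d(4,5), d(4,6)
     0.39]                           # d(5,6)

Z = linkage(y, method="single")
# The first merge is the smallest entry, 0.11 = d(3,6):
# points 3 and 6 (0-indexed as 2 and 5) join first.
print(Z[0])
```

Each row of `Z` records one merge (the two cluster indices, the merge height, and the new cluster's size), which is exactly the information a dendrogram draws.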


Hierarchical clustering


Nested cluster representation and dendrogram of single linkage.


Hierarchical clustering

Single linkage

Can handle irregularly shaped regions fairly naturally.

Sensitive to noise and outliers in the form of “chaining”.
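The chaining effect can be seen with a toy example (my own illustration): any split of a chain of evenly spaced points leaves a close pair across the split, so single linkage keeps merging along the chain even when the groups' full extents are far apart.

```python
# Chaining: split a chain of evenly spaced points into two groups.
chain = [0.0, 1.0, 2.0, 3.0, 4.0]
left, right = chain[:2], chain[2:]

single = min(abs(a - b) for a in left for b in right)    # closest pair
complete = max(abs(a - b) for a in left for b in right)  # farthest pair

print(single)    # 1.0: single linkage sees the groups as close and merges
print(complete)  # 4.0: complete linkage sees them as far apart
```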


The Iris data (single linkage)


The Iris data (single linkage)


Hierarchical clustering


Cluster Similarity: MAX or Complete Linkage

The similarity of two clusters is based on the two least similar (most distant) points in the different clusters. It is determined by all pairs of points in the two clusters.

Proximity matrix and dendrogram of complete linkage.


Hierarchical clustering


Nested cluster representation and dendrogram of complete linkage.


Hierarchical clustering

Complete linkage

Less sensitive to noise and outliers than single linkage.

Regions are generally compact, but may violate "closeness": a point may be much closer to some points in a neighbouring cluster than to points in its own cluster.

This manifests itself as breaking large clusters.

Clusters are biased to be globular.


The Iris data (complete linkage)


The Iris data (complete linkage)


The Iris data (complete linkage)


Hierarchical clustering


Nested cluster representation and dendrogram of group average linkage.


Hierarchical clustering

Average linkage

Given two elements of the partition C_r, C_s, we might consider

\[ d_{GA}(C_r, C_s) = \frac{1}{|C_r|\,|C_s|} \sum_{x \in C_r,\; y \in C_s} d(x, y). \]

A compromise between single and complete linkage.

Shares the globular-cluster bias of complete linkage, but is less sensitive to noise than single linkage.


The Iris data (average linkage)


The Iris data (average linkage)


Hierarchical clustering

Ward’s linkage

Similarity of two clusters is based on the increase in squared error when the two clusters are merged.

Similar to average linkage if the dissimilarity between points is squared distance; hence it shares many properties of average linkage.

A hierarchical analogue of K-means.

Sometimes used to initialize K-means.
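Ward's merge criterion can be sketched for 1-D points (illustrative, not from the slides): the cost of a merge is the increase in total within-cluster squared error.

```python
# Ward's criterion: cost of merging A and B = increase in squared error.
def sse(cluster):
    mu = sum(cluster) / len(cluster)
    return sum((x - mu) ** 2 for x in cluster)

def ward_cost(A, B):
    return sse(A + B) - sse(A) - sse(B)

A, B, C = [0.0, 1.0], [2.0, 3.0], [10.0, 11.0]
print(ward_cost(A, B))  # 4.0: merging the nearby clusters is cheap
print(ward_cost(A, C))  # 100.0: merging distant clusters is expensive
```

At each step, Ward's linkage merges the pair with the smallest such cost, which is why it behaves like a greedy, hierarchical version of the K-means objective.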


The Iris data (Ward’s linkage)


The Iris data (Ward’s linkage)


NCI data (complete linkage)


NCI data (single linkage)


NCI data (average linkage)


NCI data (Ward’s linkage)


Hierarchical clustering

Computational issues

O(n^2) space, since it stores the proximity matrix.

O(n^3) time in many cases: there are n steps, and at each step a matrix of size O(n^2) must be updated and/or searched.


Hierarchical clustering

Statistical issues

Once a decision is made to combine two clusters, it cannot be undone.

No objective function is directly minimized.

Different schemes have problems with one or more of the following:

- Sensitivity to noise and outliers.
- Difficulty handling different-sized clusters and convex shapes.
- Breaking large clusters.
