
Hierarchical clustering

Rebecca C. Steorts, Duke University

STA 325, Chapter 10 ISL


Agenda

- K-means versus hierarchical clustering
- Agglomerative versus divisive clustering
- Dendrograms (trees)
- The hierarchical clustering algorithm
- Single, complete, and average linkage
- Application to genomic data (PCA versus hierarchical clustering)


From K-means to Hierarchical clustering

Recall two properties of K-means clustering:

1. It fits exactly K clusters (as specified).
2. The final clustering assignment depends on the chosen initial cluster centers.

- Assume pairwise dissimilarities $d_{ij}$ between data points.
- Hierarchical clustering produces a consistent result, without the need to choose initial starting positions (or the number of clusters).

Catch: we must choose a way to measure the dissimilarity between groups, called the linkage.

- Given the linkage, hierarchical clustering produces a sequence of clustering assignments.
- At one end, all points are in their own cluster; at the other end, all points are in one cluster.


Agglomerative vs divisive clustering

Agglomerative (i.e., bottom-up):

- Start with all points in their own group.
- Until there is only one cluster, repeatedly merge the two groups that have the smallest dissimilarity.

Divisive (i.e., top-down):

- Start with all points in one cluster.
- Until all points are in their own cluster, repeatedly split a group into two, choosing the split that results in the biggest dissimilarity.

Agglomerative strategies are simpler, so we will focus on them. Divisive methods are still important, but you can read about them on your own if you want to learn more.


Simple example

Given these data points, an agglomerative algorithm might decide on a clustering sequence as follows:

(Figure: scatterplot of seven points labeled 1-7, Dimension 2 against Dimension 1.)

Step 1: {1}, {2}, {3}, {4}, {5}, {6}, {7};
Step 2: {1}, {2, 3}, {4}, {5}, {6}, {7};
Step 3: {1, 7}, {2, 3}, {4}, {5}, {6};
Step 4: {1, 7}, {2, 3}, {4, 5}, {6};
Step 5: {1, 7}, {2, 3, 6}, {4, 5};
Step 6: {1, 7}, {2, 3, 4, 5, 6};
Step 7: {1, 2, 3, 4, 5, 6, 7}.


Algorithm

1. Place each data point into its own singleton group.
2. Repeat: iteratively merge the two closest groups.
3. Until: all the data are merged into a single cluster.
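For concreteness, here is a minimal sketch of these steps in R using the built-in hclust(); the seven 2-D coordinates are made up to mimic the example above.

x <- matrix(c(0.25, 0.40, 0.45, 0.60, 0.65, 0.50, 0.20,  # made-up x-coordinates
              0.75, 0.30, 0.35, 0.55, 0.60, 0.40, 0.80), # made-up y-coordinates
            ncol = 2)
d <- dist(x)                        # pairwise dissimilarities d_ij
hc <- hclust(d, method = "single")  # step 2: repeatedly merge the two closest groups
hc$merge                            # one row per merge, from singletons to one cluster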

(Slides titled "Example" and "Iteration 2" through "Iteration 24" step through the merges of the agglomerative algorithm on an example data set; the figures are not preserved in this transcript.)

Clustering

Suppose you are using the above algorithm to cluster the data points into groups.

- How do you know when to stop?
- How should we compare the data points?

Let’s investigate this further!


Agglomerative clustering

- Each level of the resulting tree is a segmentation of the data.
- The algorithm results in a sequence of groupings.
- It is up to the user to choose a “natural” clustering from this sequence.


Dendrogram

We can also represent the sequence of clustering assignments as a dendrogram:

(Figure: left, the seven-point scatterplot; right, the corresponding dendrogram with leaf order 1, 7, 4, 5, 6, 2, 3 and Height on the vertical axis.)

Note that cutting the dendrogram horizontally partitions the data points into clusters.


Dendrogram

- Agglomerative clustering is monotonic.
- The similarity between merged clusters is monotone decreasing with the level of the merge.
- Dendrogram: plot each merge at the (negative) similarity between the two merged groups.
- Provides an interpretable visualization of the algorithm and data.
- A useful summarization tool, part of why hierarchical clustering is popular.


Group similarity

Given a distance similarity measure (say, Euclidean) between points, the user has many choices for how to define intergroup similarity.

1. Single linkage: the similarity of the closest pair

   $$d_{\mathrm{SL}}(G, H) = \min_{i \in G,\, j \in H} d_{ij}$$

2. Complete linkage: the similarity of the furthest pair

   $$d_{\mathrm{CL}}(G, H) = \max_{i \in G,\, j \in H} d_{ij}$$

3. Group average: the average similarity between groups

   $$d_{\mathrm{GA}}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{j \in H} d_{ij}$$
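As a concrete check, all three scores can be computed directly from a pairwise distance matrix; a minimal sketch in R, where the points and the groups G and H are made up for illustration.

set.seed(1)
x <- matrix(rnorm(20), ncol = 2)  # 10 made-up points in R^2
D <- as.matrix(dist(x))           # full matrix of pairwise distances d_ij
G <- 1:4; H <- 5:10               # two hypothetical groups
min(D[G, H])                      # single linkage: closest pair
max(D[G, H])                      # complete linkage: furthest pair
mean(D[G, H])                     # group average: mean over all N_G * N_H pairs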


Single Linkage

In single linkage (i.e., nearest-neighbor linkage), the dissimilarity between $G$ and $H$ is the smallest dissimilarity between two points in opposite groups:

$$d_{\mathrm{single}}(G, H) = \min_{i \in G,\, j \in H} d_{ij}$$

Example (dissimilarities $d_{ij}$ are distances, groups are marked by colors): the single linkage score $d_{\mathrm{single}}(G, H)$ is the distance of the closest pair.

(Figure: two colored groups of points; the closest cross-group pair gives the single linkage score.)


Single Linkage Example

Here $n = 60$, $X_i \in \mathbb{R}^2$, and $d_{ij} = \|X_i - X_j\|_2$. Cutting the tree at $h = 0.9$ gives the clustering assignments marked by colors.

(Figure: left, the 60 points colored by cluster; right, the single-linkage dendrogram with Height on the vertical axis, cut at 0.9.)

Cut interpretation: for each point $X_i$, there is another point $X_j$ in its cluster with $d_{ij} \le 0.9$.
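In R, this horizontal cut is a cutree() call with a height argument; a sketch, where hc is a hypothetical single-linkage hclust object fit to these 60 points.

clusters <- cutree(hc, h = 0.9)  # undo all merges above height 0.9
table(clusters)                  # sizes of the resulting clusters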


Complete Linkage

In complete linkage (i.e., furthest-neighbor linkage), the dissimilarity between $G$ and $H$ is the largest dissimilarity between two points in opposite groups:

$$d_{\mathrm{complete}}(G, H) = \max_{i \in G,\, j \in H} d_{ij}$$

Example (dissimilarities $d_{ij}$ are distances, groups are marked by colors): the complete linkage score $d_{\mathrm{complete}}(G, H)$ is the distance of the furthest pair.

(Figure: two colored groups of points; the furthest cross-group pair gives the complete linkage score.)


Complete Linkage Example

Same data as before. Cutting the tree at $h = 5$ gives the clustering assignments marked by colors.

(Figure: left, the 60 points colored by cluster; right, the complete-linkage dendrogram with Height on the vertical axis, cut at 5.)

Cut interpretation: for each point $X_i$, every other point $X_j$ in its cluster satisfies $d_{ij} \le 5$.


Average Linkage

In average linkage, the dissimilarity between $G$ and $H$ is the average dissimilarity over all points in opposite groups:

$$d_{\mathrm{average}}(G, H) = \frac{1}{n_G \cdot n_H} \sum_{i \in G,\, j \in H} d_{ij}$$

Example (dissimilarities $d_{ij}$ are distances, groups are marked by colors): the average linkage score $d_{\mathrm{average}}(G, H)$ is the average distance across all pairs.

(The plot here only shows distances between the blue points and one red point.)

(Figure: two colored groups of points, with segments drawn from one red point to each blue point.)


Average linkage example

Same data as before. Cutting the tree at $h = 2.5$ gives the clustering assignments marked by colors.

(Figure: left, the 60 points colored by cluster; right, the average-linkage dendrogram with Height on the vertical axis, cut at 2.5.)

Cut interpretation: there really isn’t a good one!


Properties of intergroup similarity

- Single linkage can produce “chaining,” where a sequence of close observations in different groups causes early merges of those groups.
- Complete linkage has the opposite problem: it might not merge close groups because of outlier members that are far apart.
- Group average represents a natural compromise, but it depends on the scale of the similarities; applying a monotone transformation to the similarities can change the results.


Things to consider

- Hierarchical clustering should be treated with caution.
- Different decisions about group similarities can lead to vastly different dendrograms.
- The algorithm imposes a hierarchical structure on the data, even on data for which such structure is not appropriate.


Application on genomic data

- Unsupervised methods are often used in the analysis of genomic data.
- PCA and hierarchical clustering are very common tools. We will explore both on a genomic data set.
- We illustrate these methods on the NCI60 cancer cell line microarray data, which consists of 6,830 gene expression measurements on 64 cancer cell lines.


Application on genomic data

library(ISLR)
nci.labs <- NCI60$labs
nci.data <- NCI60$data

- Each cell line is labeled with a cancer type.
- We do not make use of the cancer types in performing PCA and clustering, as these are unsupervised techniques.
- After performing PCA and clustering, we will check to see the extent to which these cancer types agree with the results of these unsupervised techniques.


Exploring the data

dim(nci.data)

## [1]   64 6830

# cancer types for the cell lines
nci.labs[1:4]

## [1] "CNS"   "CNS"   "CNS"   "RENAL"

table(nci.labs)

## nci.labs
##      BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
##           7           5           7           1           1           6
## MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
##           1           1           8           9           6           2
##       RENAL     UNKNOWN
##           9           1


PCA

pr.out <- prcomp(nci.data, scale=TRUE)


PCA

We now plot the first few principal component score vectors, in order to visualize the data.

First, we create a simple function that assigns a distinct color to each element of a numeric vector. The function will be used to assign a color to each of the 64 cell lines, based on the cancer type to which it corresponds.


Simple color function

# Input: a vector
# Output: a vector of colors, with one distinct color
# per unique value of the input
Cols <- function(vec){
  cols <- rainbow(length(unique(vec)))
  return(cols[as.numeric(as.factor(vec))])
}
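The plotting call itself is not shown in this transcript; a sketch consistent with the ISL lab, using Cols() to color the score vectors by cancer type:

par(mfrow = c(1, 2))
plot(pr.out$x[, 1:2],     col = Cols(nci.labs), pch = 19, xlab = "Z1", ylab = "Z2")
plot(pr.out$x[, c(1, 3)], col = Cols(nci.labs), pch = 19, xlab = "Z1", ylab = "Z3")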


PCA

(Figure: scatterplots of the score vectors Z1 versus Z2 and Z1 versus Z3, colored by cancer type.)

On the whole, cell lines corresponding to a single cancer type do tend to have similar values on the first few PC score vectors. This indicates that cell lines from the same cancer type tend to have pretty similar gene expression levels.


Proportion of Variance Explained

summary(pr.out)

## Importance of components%s:
##                            PC1      PC2      PC3      PC4      PC5
## Standard deviation     27.8535 21.48136 19.82046 17.03256 15.97181
## Proportion of Variance  0.1136  0.06756  0.05752  0.04248  0.03735
## Cumulative Proportion   0.1136  0.18115  0.23867  0.28115  0.31850
##                            PC6      PC7      PC8      PC9     PC10
## Standard deviation    15.72108 14.47145 13.54427 13.14400 12.73860
## Proportion of Variance 0.03619  0.03066  0.02686  0.02529  0.02376
## Cumulative Proportion  0.35468  0.38534  0.41220  0.43750  0.46126
##                           PC11     PC12     PC13     PC14     PC15
## Standard deviation    12.68672 12.15769 11.83019 11.62554 11.43779
## Proportion of Variance 0.02357  0.02164  0.02049  0.01979  0.01915
## Cumulative Proportion  0.48482  0.50646  0.52695  0.54674  0.56590
##                           PC16     PC17     PC18     PC19    PC20
## Standard deviation    11.00051 10.65666 10.48880 10.43518 10.3219
## Proportion of Variance 0.01772  0.01663  0.01611  0.01594  0.0156
## Cumulative Proportion  0.58361  0.60024  0.61635  0.63229  0.6479
##                           PC21    PC22    PC23    PC24    PC25    PC26
## Standard deviation    10.14608 10.0544 9.90265 9.64766 9.50764 9.33253
## Proportion of Variance 0.01507  0.0148 0.01436 0.01363 0.01324 0.01275
## Cumulative Proportion  0.66296  0.6778 0.69212 0.70575 0.71899 0.73174
##                           PC27   PC28    PC29    PC30    PC31    PC32
## Standard deviation     9.27320 9.0900 8.98117 8.75003 8.59962 8.44738
## Proportion of Variance 0.01259 0.0121 0.01181 0.01121 0.01083 0.01045
## Cumulative Proportion  0.74433 0.7564 0.76824 0.77945 0.79027 0.80072
##                           PC33    PC34    PC35    PC36    PC37    PC38
## Standard deviation     8.37305 8.21579 8.15731 7.97465 7.90446 7.82127
## Proportion of Variance 0.01026 0.00988 0.00974 0.00931 0.00915 0.00896
## Cumulative Proportion  0.81099 0.82087 0.83061 0.83992 0.84907 0.85803
##                           PC39    PC40    PC41   PC42    PC43   PC44
## Standard deviation     7.72156 7.58603 7.45619 7.3444 7.10449 7.0131
## Proportion of Variance 0.00873 0.00843 0.00814 0.0079 0.00739 0.0072
## Cumulative Proportion  0.86676 0.87518 0.88332 0.8912 0.89861 0.9058
##                           PC45   PC46    PC47    PC48    PC49    PC50
## Standard deviation     6.95839 6.8663 6.80744 6.64763 6.61607 6.40793
## Proportion of Variance 0.00709 0.0069 0.00678 0.00647 0.00641 0.00601
## Cumulative Proportion  0.91290 0.9198 0.92659 0.93306 0.93947 0.94548
##                           PC51    PC52    PC53    PC54    PC55    PC56
## Standard deviation     6.21984 6.20326 6.06706 5.91805 5.91233 5.73539
## Proportion of Variance 0.00566 0.00563 0.00539 0.00513 0.00512 0.00482
## Cumulative Proportion  0.95114 0.95678 0.96216 0.96729 0.97241 0.97723
##                           PC57   PC58    PC59    PC60    PC61    PC62
## Standard deviation     5.47261 5.2921 5.02117 4.68398 4.17567 4.08212
## Proportion of Variance 0.00438 0.0041 0.00369 0.00321 0.00255 0.00244
## Cumulative Proportion  0.98161 0.9857 0.98940 0.99262 0.99517 0.99761
##                           PC63      PC64
## Standard deviation     4.04124 2.148e-14
## Proportion of Variance 0.00239 0.000e+00
## Cumulative Proportion  1.00000 1.000e+00


Proportion of Variance Explained

plot(pr.out)

(Figure: barplot of the variances of the principal components, as produced by plot(pr.out).)


PCA

(Figure: left, the PVE of each principal component; right, the cumulative PVE, both against principal component number.)

The proportion of variance explained (PVE) of the principal components of the NCI60 cancer cell line microarray data set. Left: the PVE of each principal component is shown. Right: the cumulative PVE of the principal components is shown. Together, all principal components explain 100% of the variance.
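A sketch of how these panels can be produced (consistent with the ISL lab; the PVE is computed from the standard deviations stored in pr.out):

pve <- 100 * pr.out$sdev^2 / sum(pr.out$sdev^2)  # percent of variance per component
par(mfrow = c(1, 2))
plot(pve, type = "o", ylab = "PVE",
     xlab = "Principal Component", col = "blue")
plot(cumsum(pve), type = "o", ylab = "Cumulative PVE",
     xlab = "Principal Component", col = "brown3")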


Conclusions from the Scree Plot

- We see that, together, the first seven principal components explain around 40% of the variance in the data.
- This is not a huge amount of the variance.
- Looking at the scree plot, we see that while each of the first seven principal components explains a substantial amount of variance, there is a marked decrease in the variance explained by further principal components.
- That is, there is an elbow in the plot after approximately the seventh principal component. This suggests that there may be little benefit to examining more than seven or so principal components (though even examining seven principal components may be difficult).


Hierarchical Clustering of the NCI60 Data

- We now proceed to hierarchically cluster the cell lines in the NCI60 data, with the goal of finding out whether or not the observations cluster into distinct types of cancer.
- To begin, we standardize the variables to have mean zero and standard deviation one.
- As mentioned earlier, this step is optional and should be performed only if we want each gene to be on the same scale.

sd.data=scale(nci.data)
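The dendrograms on the next three slides compare the linkages; a sketch of the calls that produce them (consistent with the ISL lab):

data.dist <- dist(sd.data)  # Euclidean distances between standardized cell lines
par(mfrow = c(1, 3))
plot(hclust(data.dist), labels = nci.labs,
     main = "Complete Linkage", xlab = "", sub = "", ylab = "")
plot(hclust(data.dist, method = "average"), labels = nci.labs,
     main = "Average Linkage", xlab = "", sub = "", ylab = "")
plot(hclust(data.dist, method = "single"), labels = nci.labs,
     main = "Single Linkage", xlab = "", sub = "", ylab = "")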


Hierarchical Clustering of the NCI60 Data

(Figure: dendrogram of the 64 cell lines under complete linkage, leaves labeled by cancer type.)


Hierarchical Clustering of the NCI60 Data

(Figure: dendrogram of the 64 cell lines under average linkage, leaves labeled by cancer type.)


Hierarchical Clustering of the NCI60 Data

(Figure: dendrogram of the 64 cell lines under single linkage, leaves labeled by cancer type.)

We see that the choice of linkage certainly does affect the results obtained.


Hierarchical Clustering of the NCI60 Data

- Typically, single linkage will tend to yield trailing clusters: very large clusters onto which individual observations attach one-by-one.
- On the other hand, complete and average linkage tend to yield more balanced, attractive clusters.
- For this reason, complete and average linkage are generally preferred to single linkage.


Complete linkage

- We will use complete linkage hierarchical clustering for the analysis that follows.
- We can cut the dendrogram at the height that will yield a particular number of clusters, say four.

hc.out=hclust(dist(sd.data))
hc.clusters=cutree(hc.out,4)
table(hc.clusters,nci.labs)

##            nci.labs
## hc.clusters BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro
##           1      2   3     2           0           0        0           0
##           2      3   2     0           0           0        0           0
##           3      0   0     0           1           1        6           0
##           4      2   0     5           0           0        0           1
##            nci.labs
## hc.clusters MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE RENAL UNKNOWN
##           1           0        8     8       6        2     8       1
##           2           0        0     1       0        0     1       0
##           3           0        0     0       0        0     0       0
##           4           1        0     0       0        0     0       0


Complete linkage

All the leukemia cell lines fall in cluster 3, while the breast cancer cell lines are spread out over three different clusters. We can plot the cut on the dendrogram that produces these four clusters, as sketched below. This is the height that results in four clusters. (It is easy to verify that the resulting clusters are the same as the ones we obtained using cutree(hc.out,4).)
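A sketch of the plotting call; the cut height of 139 comes from the ISL lab (it is the approximate height at which this tree splits into four clusters), not from these slides.

par(mfrow = c(1, 1))
plot(hc.out, labels = nci.labs)
abline(h = 139, col = "red")  # horizontal cut yielding the four clusters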

(Figure: the complete-linkage dendrogram of the 64 cell lines (hclust on dist(sd.data)), leaves labeled by cancer type, Height on the vertical axis, with the four-cluster cut marked.)


K-means versus complete linkage?

How do these NCI60 hierarchical clustering results compare to what we get if we perform K-means clustering with K = 4?

set.seed(2)
km.out <- kmeans(sd.data, 4, nstart=20)
km.clusters <- km.out$cluster
table(km.clusters, hc.clusters)

##            hc.clusters
## km.clusters  1  2  3  4
##           1 11  0  0  9
##           2  0  0  8  0
##           3  9  0  0  0
##           4 20  7  0  0


K-means versus complete linkage?

- We see that the four clusters obtained using hierarchical clustering and K-means clustering are somewhat different.
- Cluster 2 in K-means clustering is identical to cluster 3 in hierarchical clustering.
- However, the other clusters differ: for instance, cluster 4 in K-means clustering contains a portion of the observations assigned to cluster 1 by hierarchical clustering, as well as all of the observations assigned to cluster 2 by hierarchical clustering.
