TRANSCRIPT
Advanced Studies in Applied Statistics (WBL), ETHZ Applied Multivariate Statistics Spring 2018, Week 5
Lecturer: Beate Sick
1
Remark: Much of the material has been developed together with Oliver Dürr for different lectures at ZHAW.
Topics of today
2
• Clustering
• How to assess the quality of a clustering?
• K-means clustering cont'd.
• Visualization of k-means clusters
• Quality measures for the clustering result
• Pros and Cons
• Hierarchical clustering
• Principles of hierarchical clustering
• Linkage methods
• Visualization and quality measures
• Density-based and network-analysis-based clustering are skipped
• Summary on Clustering
• Introduction to Classification
Partitioning clustering: a division of the data objects into non-overlapping subsets (clusters)
3
The goal in k-means is to partition the observations into K homogeneous
clusters such that the total within-cluster variation (WCV), summed over
all K clusters Ck, is as small as possible.
What is optimized in K-means Clustering ?
WCV is most often based on squared Euclidean distances:

$$W(C_k) = \frac{1}{|C_k|} \sum_{i,\, i' \in C_k} \; \sum_{j=1}^{p} \left(x_{ij} - x_{i'j}\right)^2$$

where |C_k| denotes the number of observations in the k-th cluster, p is the number of features (dimensions),
and the inner sum is the squared Euclidean distance between observations i and i'.
4
• Run k-Means for several k
• Determine minimized sum of WCV
• Plot minimized sum of WCV vs. k
• Choose k after the last big drop of the WCV curve (the "elbow")
How to choose the “best” number of clusters in K-means ?
5
## find suitable number of centers
n = nrow(pots)   # number of observations
wss = rep(0, 6)  # initialize
wss[1] = (n-1) * sum(apply(pots, 2, var)) # wss if all data are in 1 cluster
for (i in 2:6)
  wss[i] <- sum(kmeans(pots, centers = i)$withinss)
plot(1:6, wss, type = "b", xlab = "Number of groups",
     ylab = "Within groups sum of squares")
## 3 groups is a good choice
## Result varies because of the random starting configurations in kmeans
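The run-to-run variability mentioned in the last comment can be reduced by letting kmeans try several random initializations and keep the best solution; a minimal variant of the sketch above (same pots data assumed):

## run kmeans with 20 random starts per k; kmeans() returns the best run
for (i in 2:6)
  wss[i] <- kmeans(pots, centers = i, nstart = 20)$tot.withinss  # = sum(withinss)
plot(1:6, wss, type = "b", xlab = "Number of groups",
     ylab = "Within groups sum of squares")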
Rousseeuw (1987) suggested a graphical display, the
silhouette plot, which can be used to:
(i) select the number of clusters and
(ii) assess how well individual observations are clustered.
The silhouette width of observation i is defined as

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where
a_i = average distance between data point i and all other points in the same cluster to which i belongs,
b_i = average distance between i and the points of its "neighbor" cluster, i.e., the nearest cluster to which i does not belong.
The silhouette width of an observation i is a number between -1 and 1.
A silhouette width close to 1 means the point is well clustered.
Warning: some clustering methods optimize the average Euclidean distance.
6
QC of clustering: Check out the silhouette widths
• silhouette plot
• 2D: PCA or MDS
How to visualize the result of K-means ?
7
## run k-means with 3 centers and keep the cluster assignment
ckm <- kmeans(pots, centers=3)
grpsKM = ckm$cluster
## visualize in PC 1 & 2
p1 = prcomp(pots, retx=TRUE)
# check explained variance
summary(p1)
plot(p1$x[,1], p1$x[,2],
     xlab="PC1", ylab="PC2",
     pch=grpsKM, col=grpsKM)
# Silhouette plot (needs the distance matrix and the cluster package)
library(cluster)
dist.pots = dist(pots)
plot(silhouette(grpsKM, dist.pots))
For each object, the silhouette plot shows a relative membership measure (range -1 < sil < 1), comparing the group to which the object has been assigned with the best-supported alternative group.
The plot presents all objects as vertical 'bands' whose 'width' is their silhouette value.
The 'average width' is the cluster coefficient:
0.70 < sil < 1.00: Good structure has been found
0.50 < sil < 0.70: Reasonable structure found
0.25 < sil < 0.50: Weak structure, requiring confirmation
-1 < sil < 0.25: Forget it!
What does the average silhouette width tell us?
8
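A small sketch (not from the slides) of how the average silhouette width can be used to choose k, assuming the same pots data and the distance matrix dist.pots = dist(pots) from above:

library(cluster)
avg.sil = rep(NA, 6)
for (k in 2:6) {
  km = kmeans(pots, centers = k, nstart = 20)
  sil = silhouette(km$cluster, dist.pots)
  avg.sil[k] = mean(sil[, "sil_width"])     # average silhouette width for k clusters
}
plot(2:6, avg.sil[2:6], type = "b",
     xlab = "Number of clusters k", ylab = "Average silhouette width")

The k with the largest average silhouette width is a reasonable candidate for the number of clusters.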
What does a negative silhouette width tell us?
A negative value indicates that a point is on average closer to the points of another cluster than to the points within its own cluster.
9
taken from Hastie & Tibshirani
The k-means result can strongly depend on the starting configuration
10
Comments on the K-Means Method
Strength
• Fast
Weakness
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable to discover clusters with non-convex shapes
• Can terminate at a local optimum.
• Is based on Euclidean distances and cannot handle other
dissimilarities
11
The K-Medoids Clustering Method
Find representative objects, called Medoids, in clusters
PAM (Partitioning Around Medoids)
- starts from an initial set of medoids
- iteratively replaces one of the medoids by one of the non-medoids if
it improves the total distance of the resulting clustering
Pros:
- more robust against outliers
- can deal with any dissimilarity measure
- easy to find representative objects per cluster
(e.g. for easy interpretation)
Cons:
- PAM is slow for large data sets
12
Partitioning Methods in R
Function “kmeans” in package “stats”
Function “pam” in package “cluster”
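As an illustration (not from the slides), a minimal PAM sketch, assuming the same pots data used in the k-means example:

library(cluster)
cpam = pam(pots, k = 3)      # or: pam(dist.pots, k = 3, diss = TRUE) for a precomputed dissimilarity
cpam$medoids                 # the representative objects (medoids) of the 3 clusters
cpam$clustering              # cluster assignment of each observation
plot(silhouette(cpam))       # silhouette plot of the PAM result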
13
Hierarchical clustering: a set of nested clusters organized as a hierarchical tree
14
Since we cannot test all possible
dendrograms, we have to perform a
heuristic search over the space of possible
dendrograms. We could do this..
Bottom-Up (agglomerative):
Starting with each item in its
own cluster, find the best pair to
merge into a new cluster.
Repeat until all clusters are
fused together.
Top-Down (divisive):
Starting with all the data in a
single cluster, consider every
possible way to divide the
cluster into two. Choose the best
division and recursively operate
on both sides.
Without proof: the number of
dendrograms with n leaves is
(2n − 3)! / [ 2^(n−2) · (n − 2)! ]

Number of leaves   Number of possible dendrograms
2                  1
3                  3
4                  15
5                  105
…                  …
10                 34,459,425
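This formula can be checked with a small R sketch (not from the slides):

## number of possible dendrograms with n leaves: (2n-3)! / (2^(n-2) * (n-2)!)
n.dendrograms = function(n) factorial(2*n - 3) / (2^(n - 2) * factorial(n - 2))
n.dendrograms(2:5)   # 1 3 15 105
n.dendrograms(10)    # 34459425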
How to do hierarchical Clustering?
15
Agglomerative Hierarchical Clustering
[Figure: data points in the plane of Feature 1 and Feature 2, together with the resulting dendrogram]
Problem: We need a generalization of the distance between single objects to a distance between groups (compounds) of objects.
What is the distance between those groups of objects? -> Linkage
Any dissimilarity we have seen before can be used
- euclidean
- manhattan
- simple matching coefficient
- Jaccard dissimilarity
- Gower’s dissimilarity
- etc.
Dissimilarity between samples or observations
17
Dissimilarity between clusters: Linkages
single link (min)
complete link (max)
average
Single link: smallest distance between point-pairs linking both clusters
Complete link: largest distance between point-pairs linking both clusters
Average: average distance between all point-pairs linking both clusters
Ward's: in this method, we try to minimize the variance of the merged clusters
18
How to read a dendrogram
The position of the join node on the distance-scale indicates the distance
between clusters (this distance depends on the linkage method). For
example, if you see two clusters merged at a height 22, it means that the
distance between those clusters was 22 .
When you read a dendrogram, you want to determine at what stage the
distance between clusters that are combined is large.
You look for large distances between sequential join nodes (here vertical lines).
19
Distance (R: clust$height)
Agglomerative hierarchical clustering is often visualized by a dendrogram
https://www.researchgate.net/figure/Example-of-hierarchical-clustering-clusters-are-consecutively-merged-with-the-most_fig3_273456906
20
Simple example
Draw a dendrogram visualizing the grouping of the following 1D data:
Use Euclidean distances and single linkage
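The original 1D data of this exercise are not reproduced here; a small sketch with hypothetical values shows how the same exercise can be checked in R:

x = c(1, 2, 4, 7, 11)                     # hypothetical 1D data
hc = hclust(dist(x), method = "single")   # Euclidean distances, single linkage
plot(hc, labels = x)                      # dendrogram
hc$height                                 # merge heights: 1 2 3 4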
21
Linkage Methods based on
“distances” between 2 clusters:
• Single link: smallest distance
between point-pairs linking both
clusters
• Complete link: largest distance
between
point-pairs linking both clusters
• Average: avg distance between
point-pairs linking both clusters
• Wards: In this method, we try to
minimize the variance of the merged
clusters
Distances between
2 data points / observations
• Euclidean
• Manhattan
• Simple Matching Coefficient
• Jaccard dissimilarity
• Gower’s dissimilarity
• etc.
Cluster results depend on the distances and linkage methods used
22
Cluster results depend on the data structure, the distances, and the linkage methods
[Figure: the same simulated data clustered with single linkage, complete linkage, and average linkage]
Ward tends to produce clusters of equal sizes.
Data: we simulated two 2D-Gaussian clusters with very different sizes
23
Functions “hclust”, “cutree” in package “stats”
Alternative: Function “agnes” in package “cluster”
Agglomerative Clustering in R
24
## determine euclidean distances
dist.pots = dist(pots)
## Apply agglomerative clustering using Ward linkage
hc = hclust(dist.pots, method="ward.D2")
# plot the dendrogram
plot(hc)
## Split into 3 groups and show the group membership
grps = cutree(hc, k=3)
grps
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
# 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
#
# 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
# 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
> str(dat)
'data.frame': 38 obs. of 8 variables:
$ Country : Factor w/ 6 levels "France","Germany",..: 6 6 6 ...
$ Car : Factor w/ 38 levels "AMC Concord D/L",..: 6 21...
$ MPG : num 16.9 15.5 19.2 18.5 30 27.5 27.2 30.9 20.3 17 ...
$ Weight : num 4.36 4.05 3.6 3.94 2.15 ...
$ Drive_Ratio : num 2.73 2.26 2.56 2.45 3.7 3.05 3.54 3.37 3.9 3.5 ...
$ Horsepower : int 155 142 125 150 68 95 97 75 103 125 ...
$ Displacement: int 350 351 267 360 98 134 119 105 131 163 ...
$ Cylinders : int 8 8 8 8 4 4 4 4 5 6 ...
Use a data-set about cars to do a heatmap representation
25
# prepare numeric feature matrix
my.select=dat[,3:8]
x=t(as.matrix(my.select))
# defaults: dist=euclidean,
# linkage=complete,
# rows are scaled
# (mean=0, sd=1)
library(pheatmap)
pheatmap(x, scale="row")
Pretty heatmap does incorporate hierarchical clustering
26
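A possible follow-up (a sketch, not from the slides): pheatmap returns the dendrograms it computed, so the column clustering of the cars can be cut into groups just like an hclust result:

ph = pheatmap(x, scale = "row")
cutree(ph$tree_col, k = 3)   # group membership of each car (column)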
Heatmaps allow us to
"look into the clusters".
Here: the left cluster
holds heavy, thirsty cars.
# give matrix x colnames to be used in heatmap plot
colnames(x) = dat$Car
# let's prepare for a color side bar
annot_col = data.frame(Country=dat$Country)
# give association between cols (=Car) and annot_col
rownames(annot_col) = colnames(x)
# plot the heatmap
pheatmap(x, scale="row", annotation_col=annot_col)
Add some meta information to heatmap
27
Side color bars allow us to check a hypothesis about an association between the clusters and an external categorical variable that was not used during clustering.
Reminder (same slide as before): cluster results depend on the data structure, the distances, and the linkage methods; Ward tends to produce clusters of equal sizes. Data: two simulated 2D-Gaussian clusters with very different sizes.
28
[Figure: dendrograms of the same simulated data under average linkage, Ward's linkage, and single linkage; the leaves are the observation indices]
• Single linkage produces long and skinny clusters.
• Ward's linkage often produces very separated clusters.
• Average linkage yields more rounded clusters.
Generally, clustering is an exploratory tool.
Use the linkage which produces the "best" results.
Compare linkage methods
29
(a) What is a cluster?
(b) What features should be used?
(c) Should the data be normalized?
(d) Does the data contain any outliers?
(e) How do we define the pair-wise similarity?
(f) How many clusters are present in the data?
(g) Which clustering method should be used?
(h) Does the data have any clustering tendency?
(i) Are the discovered clusters and partition valid?
All these questions usually have no clear answer!
The user's dilemma when doing clustering
Data clustering: 50 years beyond K-means AK Jain - Pattern recognition letters, 2010
https://www.sciencedirect.com/science/article/pii/S0167865509002323
How good is the clustering?
31
How to assess the quality of a clustering?
• Internal QC criteria (without ground-truth knowledge or labels)
Visualization based: dendrogram, heatmap, coloring in 2D plot (PCA, MDS, t-SNE)
Silhouette plot
WCV to quantify within-cluster variation
…
• External QC measures (requires labels or ground truth knowledge)
• Confusion matrix
• Mutual information
• ….
See also: Data clustering: 50 years beyond K-means AK Jain - Pattern recognition letters, 2010
https://www.sciencedirect.com/science/article/pii/S0167865509002323
Confusion matrix (rows: ACTUAL CLASS, columns: Assigned cluster):

ACTUAL CLASS    black   yellow   red
black             7       0       0
brown             2       6       2
blue              0       0      16

Accuracy = (7 + 6 + 16) / (7 + 6 + 16 + 2 + 2) = 29/33 ≈ 0.88
For an ideal classifier the off-diagonal entries (here the two entries of 2) should be zero, i.e. Accuracy = 1.
Evaluate prediction accuracy on data with a confusion matrix: simply count # correct / # all.
Confusion Matrix / Accuracy
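As a small sketch (not from the slides), the accuracy above can be reproduced in R by matching each true class with its best-fitting cluster:

cm = matrix(c(7, 0, 0,
              2, 6, 2,
              0, 0, 16), nrow = 3, byrow = TRUE,
            dimnames = list(actual = c("black", "brown", "blue"),
                            cluster = c("black", "yellow", "red")))
sum(apply(cm, 1, max)) / sum(cm)   # (7 + 6 + 16) / 33 = 0.88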
34
Mutual information measures the amount of information that one random variable carries about the other.
The pointwise mutual information (PMI) of a pair of outcomes x and y belonging to discrete random variables X and Y is defined as:

$$\mathrm{pmi}(x;y) = \log_2 \frac{p(x,y)}{p(x)\, p(y)}$$

PMI is zero if X and Y are independent, since in the case of independence it holds that p(x,y) = p(x)·p(y).
PMI is maximal when X and Y are perfectly associated.

Pointwise mutual information (PMI): A simple example with two binary variables

2x2 example, joint distribution p(x,y) with marginal distributions p(x) and p(y):

         y=0     y=1    | p(x)
x=0      0.10    0.70   | 0.80
x=1      0.15    0.05   | 0.20
p(y)     0.25    0.75   |

pmi(x=0; y=0) = log2( 0.10 / (0.80 · 0.25) ) = -1
pmi(x=0; y=1) = log2( 0.70 / (0.80 · 0.75) ) ≈ 0.22
pmi(x=1; y=0) = log2( 0.15 / (0.20 · 0.25) ) ≈ 1.58
pmi(x=1; y=1) = log2( 0.05 / (0.20 · 0.75) ) ≈ -1.58
35
The mutual information (MI) of the random
variables X and Y is the expected value of the
pointwise mutual information (pmi) over all
possible outcomes.
MI=0 if X carries no information about Y.
$$MI(X;Y) = E_{X,Y}\big[\mathrm{pmi}(x,y)\big] = \sum_{x,y} p(x,y)\, \log_2 \frac{p(x,y)}{p(x)\, p(y)}$$

Mutual Information (MI): A simple example with two binary variables

2x2 example, same joint and marginal distributions as before:

         y=0     y=1    | p(x)
x=0      0.10    0.70   | 0.80
x=1      0.15    0.05   | 0.20
p(y)     0.25    0.75   |

MI(X;Y) = 0.10·log2(0.10/(0.80·0.25)) + 0.70·log2(0.70/(0.80·0.75))
          + 0.15·log2(0.15/(0.20·0.25)) + 0.05·log2(0.05/(0.20·0.75))
        = 0.10·(-1) + 0.70·0.22 + 0.15·1.58 + 0.05·(-1.58)
        ≈ 0.21
Using external information to measure cluster quality: Normalized mutual information criterion
https://www.coursera.org/learn/cluster-analysis/lecture/baJNC/6-5-external-measure-2-entropy-based-measures
The mutual information MI(C,T) between the categorical variable T, giving the ground-truth group (1, ..., k), and the categorical variable C, giving the cluster assignment (1, ..., r), quantifies the amount of information that the clustering carries about the ground-truth group; MI(C,T) therefore quantifies the clustering quality.

$$MI(C;T) = \sum_{i=1}^{r} \sum_{j=1}^{k} p(c_i, t_j)\, \log_2 \frac{p(c_i, t_j)}{p(c_i)\, p(t_j)} \;\in\; [0, \infty)$$

MI(C,T) is zero if C and T are independent; however, there is no upper bound for MI. Therefore, the normalized mutual information NMI is often used instead, with NMI = 0 for the worst possible clustering and NMI = 1 for a perfect clustering:

$$NMI(C;T) = \frac{MI(C;T)}{\sqrt{H(C)\, H(T)}} \;\in\; [0, 1]$$

with the entropy of the clustering C and of the partitioning T:

$$H(C) = -\sum_{i=1}^{r} p(c_i) \log_2 p(c_i), \qquad H(T) = -\sum_{j=1}^{k} p(t_j) \log_2 p(t_j)$$
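A minimal base-R sketch of these formulas (not from the slides; the function name nmi and its arguments are illustrative):

nmi = function(clus, truth) {
  p  = table(clus, truth) / length(clus)     # joint distribution p(c, t)
  pc = rowSums(p); pt = colSums(p)           # marginal distributions p(c), p(t)
  ok = p > 0                                 # skip zero cells (0 * log 0 := 0)
  mi = sum(p[ok] * log2(p[ok] / outer(pc, pt)[ok]))   # mutual information
  hC = -sum(pc[pc > 0] * log2(pc[pc > 0]))   # entropy H(C)
  hT = -sum(pt[pt > 0] * log2(pt[pt > 0]))   # entropy H(T)
  mi / sqrt(hC * hT)                         # normalized mutual information
}
## Example: NMI between a k-means clustering of iris and the true species
nmi(kmeans(iris[, 1:4], centers = 3)$cluster, iris$Species)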
Summary: Typical Steps in Cluster Analysis
Hierarchical clustering, dendrogram, cutting a dendrogram
Partitioning methods: k-Means, PAM
Choosing number of clusters (w/o external knowledge): drop in dendrogram (hierarchical clustering)
drop in sum of within-cluster variation WCV (partitioning clustering)
QC (w/o external knowledge): Visualization in 2D: PCA or MDS, use different colors for different clusters
Silhouette plot
WCV to quantify within-cluster variation
Giving meaning to the clusters:
- generally hard in high dimensions
- look at centers or representatives (easy in PAM)
- look at heatmap and interpret colors (for numeric data)
- perform Classification with cluster-ID as class-label
and look at variable importance (we will do that later in this course)
37
Let’s switch from unsupervised to supervised learning
Supervised learning vs. unsupervised learning
38
Citation (Yann LeCun, 2018)
“THE REVOLUTION WILL NOT BE SUPERVISED”
http://engineering.nyu.edu/news/2018/03/06/revolution-will-not-be-supervised-promises-facebooks-yann-lecun-kickoff-ai-seminar
Classification
Classification is a
prediction method
Idea:
Train a classifier based
on training data and
use the classifier to
classify new test
observations with
unknown class label.
Which feature should we
use to describe an
observation (animal)?
Camels
Dromedaries
???
What is a classification task?
40
Feature extraction
Defining appropriate features is essential for the success of the
classification task!
It is not always as simple as it is in this example:
Features can be combined into new features or selected.
ID of animal   Class label   Number of legs   Number of bumps   Length of legs [cm]
1              Dromedary     4                1                 98
2              Camel         3                2                 87
…              …             …                …                 …
150            Camel         4                2                 103
41
Data

Y: Class label     X: Number of legs   Number of bumps   Color of pelt
Dromedary          4                   1                 brown
Camel              3                   2                 beige
…                  …                   …                 …
Camel              4                   2                 brown

Y: the labels are categorical; X: a data matrix with several features.
In classification we try to predict the class labels using the features.
Principal Idea Classification
id    Type        Sepal.Length   Sepal.Width   Petal.Length   Petal.Width
1     setosa      5.1            3.5           1.4            0.2
2     setosa      4.9            3             1.4            0.2
3     virginica   3.3            3.2           1.6            0.5
4     setosa      5.1            3.5           1.4            0.2
…     …           …              …             …              …
150   virginica   4.9            3             1.4            0.2
Training Data
Classifier
Learn a classifier
id    Type   Sepal.Length   Sepal.Width   Petal.Length   Petal.Width
1     ?      3.1            3.5           1.4            0.2
2     ?      4.9            3             1.4            0.2
3     ?      3.3            3.2           1.6            0.5
4     ?      5.1            3.5           31.4           0.2
Unknown data / Test data
Classifier
Predict Type
Classifiers:
• Neural networks
• Decision trees
• …
Note:
To evaluate the performance, a part of the labelled data is not used to train the classifier
but is left aside to check the performance of the classifier on new data.
Examples of Classification Tasks
Sentiment analysis: Is a given text, e.g. a tweet about a product, positive, negative, or neutral?
"The movie XXX is actually neither that funny, nor super witty" -> Negative
Churn in marketing: Predict which customers want to quit and offer them a discount.
Face detection: Image (array of pixels) -> John
…
44
K-Nearest-Neighbors in a nutshell
Idea of knn classification:
- Start with an observation x0 with
unknown class label
- Find the k training observations that
have the smallest distance to x0
- Use the majority class among the k
neighbors as class label for x0
R functions to know
- From package “class”: “knn”
45
knn classifier with 3 class labels, after training with k=1
[Figure: left: the data with true class labels; right: the trained classifier (knn with k=1); axes x1 and x2]
46
The effect of K
Which k to use? Let’s quantify the error / accuracy.
Types of Errors
Training error or naïve error or in-sample-error:
Error on data that have been used to train the model
Generalization error or test error or out-of-sample-error:
Error on previously unseen records (out of sample)
Over-fitting phenomenon:
Model fits the training data well (small training error) but shows high
generalization error
49
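A minimal sketch (not from the slides, using the iris data) of how the training and test errors of knn typically behave as a function of k:

library(class)
set.seed(1)
idx = sample(nrow(iris), 100)                 # random train/test split
train = iris[idx, 1:4];  test = iris[-idx, 1:4]
cl.tr = iris$Species[idx]; cl.te = iris$Species[-idx]
ks = 1:25
train.err = sapply(ks, function(k)
  mean(knn(train, train, cl = cl.tr, k = k) != cl.tr))  # in-sample error
test.err = sapply(ks, function(k)
  mean(knn(train, test, cl = cl.tr, k = k) != cl.te))   # out-of-sample error
matplot(ks, cbind(train.err, test.err), type = "b", pch = 1:2,
        xlab = "k (number of neighbors)", ylab = "misclassification error")
legend("topright", c("training error", "test error"), pch = 1:2, col = 1:2)

For k = 1 the training error is 0 (over-fitting), while the test error typically has its minimum at an intermediate k.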
"Perfect" vs. "simple" classifier
[Figure: two-class data in the (x1, x2) plane, with the decision boundary of a "perfect" classifier and of a "simple" classifier]
Which is better?
Check on a test-set (cross validation).
50
Cross validation of the “simple” classifier
[Figure: the "simple" classifier evaluated on the training data (Train) and on the test data (Test); axes x1 and x2]
Training set: 6/29 ≈ 20% misclassification
Test set: 2/25 = 8% misclassification
51
Cross validation of the “Perfect” classifier
[Figure: the "perfect" classifier evaluated on the training data (Train) and on the test data (Test); axes x1 and x2]
Training set: 0% misclassification
Test set: 8/25 = 24% misclassification
52
What is the right complexity of a model?
53
library(class) # we need this package to use the function knn
data(iris)
# reorder rows randomly
rand=sample(nrow(iris))
iris.rand=iris[rand,]
# generate train and test data
my.train = iris.rand[1:25, 1:4]
my.test = iris.rand[26:50, 1:4]
# generate class labels for train and test data
class.train = iris.rand[1:25, 5]
class.test = iris.rand[26:50, 5]
# train knn with known train labels
# prediction will be done only for test data
iris.knn.pred = knn(train=my.train, test=my.test,
cl=class.train, k=3, prob=TRUE)
# confusion matrix:
table(iris.knn.pred, class.test)
# class.test
# iris.knn.pred setosa versicolor virginica
# setosa 11 0 0
# versicolor 0 3 1
# virginica 0 0 10
knn classification in R with iris and its 3 classes of flowers (here with k = 3 neighbors)
54
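A short follow-up to the code above (a sketch): the overall test accuracy can be read off the confusion matrix or computed directly:

mean(iris.knn.pred == class.test)    # fraction of correctly classified test observations
# here: (11 + 3 + 10) / 25 = 0.96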