TRANSCRIPT
Advanced Studies in Applied Statistics (WBL), ETHZ Applied Multivariate Statistics Spring 2018, Week 5
Lecturer: Beate Sick
1
Remark: Much of the material has been developed together with Oliver Dürr for different lectures at ZHAW.
Topics of today
2
• Clustering
• How to assess the quality of a clustering?
• K-means clustering cont'd.
• Visualization of k-means clusters
• Quality measures for the clustering result
• Pros and Cons
• Hierarchical clustering
• Principles of hierarchical clustering
• Linkage methods
• Visualization and quality measures
• Density-based and network-analysis-based clustering are skipped
• Summary on Clustering
• Introduction to Classification
Partitioning clustering: a division of the data objects into non-overlapping subsets (clusters)
3
The goal in k-means is to partition the observations into K homogeneous
clusters such that the total within-cluster variation (WCV), summed over
all K clusters Ck, is as small as possible.
What is optimized in K-means Clustering ?
WCV is most often based on squared Euclidean distances:

$$W(C_k) = \frac{1}{|C_k|} \sum_{i,\, i' \in C_k} \; \sum_{j=1}^{p} \left(x_{ij} - x_{i'j}\right)^2$$

where |C_k| denotes the number of observations in the k-th cluster, p is the number of features (dimensions),
and the inner sum is the squared Euclidean distance between observations i and i'.
4
• Run k-Means for several k
• Determine minimized sum of WCV
• Plot minimized sum of WCV vs. k
• Choose k after the last big drop of the WCV curve (the "elbow")
How to choose the “best” number of clusters in K-means ?
5
## find suitable number of centers
n = nrow(pots)   # number of observations
wss = rep(0, 6)  # initialize
wss[1] = (n-1) * sum(apply(pots, 2, var)) # wss if all data are in 1 cluster
for (i in 2:6)
  wss[i] <- sum(kmeans(pots, centers = i)$withinss)
plot(1:6, wss, type = "b", xlab = "Number of groups",
     ylab = "Within groups sum of squares")
## 3 groups is a good choice
## Result varies because of the random starting configurations in kmeans
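The run-to-run variability mentioned in the last comment can be reduced by letting kmeans try several random initializations and keep the best solution; a minimal variant of the sketch above (same pots data assumed):

## run kmeans with 20 random starts per k; kmeans() returns the best run
for (i in 2:6)
  wss[i] <- kmeans(pots, centers = i, nstart = 20)$tot.withinss  # = sum(withinss)
plot(1:6, wss, type = "b", xlab = "Number of groups",
     ylab = "Within groups sum of squares")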
Rousseeuw (1987) suggested a graphical display, the
silhouette plot, which can be used to:
(i) select the number of clusters and
(ii) assess how well individual observations are clustered.
The silhouette width of observation i is defined as

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where
a_i = average distance between data point i and all other points in the same cluster to which i belongs,
b_i = average distance between i and the points of its "neighbor" cluster, i.e., the nearest cluster to which i does not belong.
The silhouette width of an observation i is a number between -1 and 1.
A silhouette width close to 1 means the point is well clustered.
Warning: some clustering methods optimize the average Euclidean distance.
6
QC of clustering: Check out the silhouette widths
• silhouette plot
• 2D: PCA or MDS
How to visualize the result of K-means ?
7
## run k-means with 3 centers and keep the cluster assignment
ckm <- kmeans(pots, centers=3)
grpsKM = ckm$cluster
## visualize in PC 1 & 2
p1 = prcomp(pots, retx=TRUE)
# check explained variance
summary(p1)
plot(p1$x[,1], p1$x[,2],
     xlab="PC1", ylab="PC2",
     pch=grpsKM, col=grpsKM)
# Silhouette plot (needs the distance matrix and the cluster package)
library(cluster)
dist.pots = dist(pots)
plot(silhouette(grpsKM, dist.pots))
For each object, the silhouette plot shows a relative membership measure (range -1 < sil < 1), comparing the group to which the object has been assigned with the best-supported alternative group.
The plot presents all objects as vertical 'bands' whose 'width' is their silhouette value.
The 'average width' is the cluster coefficient:
0.70 < sil < 1.00: Good structure has been found
0.50 < sil < 0.70: Reasonable structure found
0.25 < sil < 0.50: Weak structure, requiring confirmation
-1 < sil < 0.25: Forget it!
What does the average silhouette width tell us?
8
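A small sketch (not from the slides) of how the average silhouette width can be used to choose k, assuming the same pots data and the distance matrix dist.pots = dist(pots) from above:

library(cluster)
avg.sil = rep(NA, 6)
for (k in 2:6) {
  km = kmeans(pots, centers = k, nstart = 20)
  sil = silhouette(km$cluster, dist.pots)
  avg.sil[k] = mean(sil[, "sil_width"])     # average silhouette width for k clusters
}
plot(2:6, avg.sil[2:6], type = "b",
     xlab = "Number of clusters k", ylab = "Average silhouette width")

The k with the largest average silhouette width is a reasonable candidate for the number of clusters.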
What does a negative silhouette width tell us?
A negative value indicates that a point is on average closer to the points of another cluster than to the points within its own cluster.
9
taken from Hastie & Tibshirani
The k-means result can strongly depend on the starting configuration
10
Comments on the K-Means Method
Strength
• Fast
Weakness
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable to discover clusters with non-convex shapes
• Can terminate at a local optimum.
• Is based on Euclidean distances and cannot handle other
dissimilarities
11
The K-Medoids Clustering Method
Find representative objects, called Medoids, in clusters
PAM (Partitioning Around Medoids)
- starts from an initial set of medoids
- iteratively replaces one of the medoids by one of the non-medoids if
it improves the total distance of the resulting clustering
Pros:
- more robust against outliers
- can deal with any dissimilarity measure
- easy to find representative objects per cluster
(e.g. for easy interpretation)
Cons:
- PAM is slow for large data sets
12
Partitioning Methods in R
Function “kmeans” in package “stats”
Function “pam” in package “cluster”
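As an illustration (not from the slides), a minimal PAM sketch, assuming the same pots data used in the k-means example:

library(cluster)
cpam = pam(pots, k = 3)      # or: pam(dist.pots, k = 3, diss = TRUE) for a precomputed dissimilarity
cpam$medoids                 # the representative objects (medoids) of the 3 clusters
cpam$clustering              # cluster assignment of each observation
plot(silhouette(cpam))       # silhouette plot of the PAM result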
13
Hierarchical clustering: a set of nested clusters organized as a hierarchical tree
14
Since we cannot test all possible
dendrograms, we have to perform a
heuristic search over the space of possible
dendrograms. We could do this..
Bottom-Up (agglomerative):
Starting with each item in its
own cluster, find the best pair to
merge into a new cluster.
Repeat until all clusters are
fused together.
Top-Down (divisive):
Starting with all the data in a
single cluster, consider every
possible way to divide the
cluster into two. Choose the best
division and recursively operate
on both sides.
Without proof: the number of
dendrograms with n leaves is
(2n − 3)! / [ 2^(n−2) · (n − 2)! ]

Number of leaves   Number of possible dendrograms
2                  1
3                  3
4                  15
5                  105
…                  …
10                 34,459,425
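This formula can be checked with a small R sketch (not from the slides):

## number of possible dendrograms with n leaves: (2n-3)! / (2^(n-2) * (n-2)!)
n.dendrograms = function(n) factorial(2*n - 3) / (2^(n - 2) * factorial(n - 2))
n.dendrograms(2:5)   # 1 3 15 105
n.dendrograms(10)    # 34459425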
How to do hierarchical Clustering?
15
Agglomerative Hierarchical Clustering
[Figure: data points in the plane of Feature 1 and Feature 2, together with the resulting dendrogram]
Problem: We need a generalization of the distance between single objects to a distance between groups (compounds) of objects.
What is the distance between those groups of objects? -> Linkage
Any dissimilarity we have seen before can be used
- euclidean
- manhattan
- simple matching coefficient
- Jaccard dissimilarity
- Gower’s dissimilarity
- etc.
Dissimilarity between samples or observations
17
Dissimilarity between clusters: Linkages
single link (min)
complete link (max)
average
Single link: smallest distance between point-pairs linking both clusters
Complete link: largest distance between point-pairs linking both clusters
Average: average distance between all point-pairs linking both clusters
Ward's: in this method, we try to minimize the variance of the merged clusters
18
How to read a dendrogram
The position of the join node on the distance-scale indicates the distance
between clusters (this distance depends on the linkage method). For
example, if you see two clusters merged at a height 22, it means that the
distance between those clusters was 22 .
When you read a dendrogram, you want to determine at what stage the
distance between clusters that are combined is large.
You look for large distances between sequential join nodes (here vertical lines).
19
Distance (R: clust$height)
Agglomerative hierarchical clustering is often visualized by a dendrogram
https://www.researchgate.net/figure/Example-of-hierarchical-clustering-clusters-are-consecutively-merged-with-the-most_fig3_273456906
20
Simple example
Draw a dendrogram visualizing the grouping of the following 1D data:
Use Euclidean distances and single linkage
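The original 1D data of this exercise are not reproduced here; a small sketch with hypothetical values shows how the same exercise can be checked in R:

x = c(1, 2, 4, 7, 11)                     # hypothetical 1D data
hc = hclust(dist(x), method = "single")   # Euclidean distances, single linkage
plot(hc, labels = x)                      # dendrogram
hc$height                                 # merge heights: 1 2 3 4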
21
Linkage Methods based on
“distances” between 2 clusters:
• Single link: smallest distance
between point-pairs linking both
clusters
• Complete link: largest distance
between
point-pairs linking both clusters
• Average: avg distance between
point-pairs linking both clusters
• Wards: In this method, we try to
minimize the variance of the merged
clusters
Distances between
2 data points / observations
• Euclidean
• Manhattan
• Simple Matching Coefficient
• Jaccard dissimilarity
• Gower’s dissimilarity
• etc.
Cluster results depend on the distances and linkage methods used
22
Cluster results depend on the data structure, the distances, and the linkage methods
[Figure: the same simulated data clustered with single linkage, complete linkage, and average linkage]
Ward tends to produce clusters of equal sizes.
Data: we simulated two 2D-Gaussian clusters with very different sizes
23
Functions “hclust”, “cutree” in package “stats”
Alternative: Function “agnes” in package “cluster”
Agglomerative Clustering in R
24
## determine euclidean distances
dist.pots = dist(pots)
## Apply agglomerative clustering using Ward linkage
hc = hclust(dist.pots, method="ward.D2")
# plot the dendrogram
plot(hc)
## Split into 3 groups and show the group membership
grps = cutree(hc, k=3)
grps
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
# 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
#
# 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
# 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
> str(dat)
'data.frame': 38 obs. of 8 variables:
$ Country : Factor w/ 6 levels "France","Germany",..: 6 6 6 ...
$ Car : Factor w/ 38 levels "AMC Concord D/L",..: 6 21...
$ MPG : num 16.9 15.5 19.2 18.5 30 27.5 27.2 30.9 20.3 17 ...
$ Weight : num 4.36 4.05 3.6 3.94 2.15 ...
$ Drive_Ratio : num 2.73 2.26 2.56 2.45 3.7 3.05 3.54 3.37 3.9 3.5 ...
$ Horsepower : int 155 142 125 150 68 95 97 75 103 125 ...
$ Displacement: int 350 351 267 360 98 134 119 105 131 163 ...
$ Cylinders : int 8 8 8 8 4 4 4 4 5 6 ...
Use a data-set about cars to do a heatmap representation
25
# prepare numeric feature matrix
my.select=dat[,3:8]
x=t(as.matrix(my.select))
# defaults: dist=euclidean,
# linkage=complete,
# rows are scaled
# (mean=0, sd=1)
library(pheatmap)
pheatmap(x, scale="row")
Pretty heatmap does incorporate hierarchical clustering
26
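A possible follow-up (a sketch, not from the slides): pheatmap returns the dendrograms it computed, so the column clustering of the cars can be cut into groups just like an hclust result:

ph = pheatmap(x, scale = "row")
cutree(ph$tree_col, k = 3)   # group membership of each car (column)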
Heatmaps allow us to
"look into the clusters".
Here: the left cluster
holds heavy, thirsty cars.
# give matrix x colnames to be used in heatmap plot
colnames(x) = dat$Car
# let's prepare for a color side bar
annot_col = data.frame(Country=dat$Country)
# give association between cols (=Car) and annot_col
rownames(annot_col) = colnames(x)
# plot the heatmap
pheatmap(x, scale="row", annotation_col=annot_col)
Add some meta information to heatmap
27
Side color bars allow us to check a hypothesis about an association between the clusters and an external categorical variable that was not used during clustering.
Reminder (same slide as before): cluster results depend on the data structure, the distances, and the linkage methods; Ward tends to produce clusters of equal sizes. Data: two simulated 2D-Gaussian clusters with very different sizes.
28
[Figure: dendrograms of the same simulated data under average linkage, Ward's linkage, and single linkage; the leaves are the observation indices]
• Single linkage produces long and skinny clusters.
• Ward's linkage often produces very separated clusters.
• Average linkage yields more rounded clusters.
Generally, clustering is an exploratory tool.
Use the linkage which produces the "best" results.
Compare linkage methods
29
(a) What is a cluster?
(b) What features should be used?
(c) Should the data be normalized?
(d) Does the data contain any outliers?
(e) How do we define the pair-wise similarity?
(f) How many clusters are present in the data?
(g) Which clustering method should be used?
(h) Does the data have any clustering tendency?
(i) Are the discovered clusters and partition valid?
All these questions usually have no clear answer!
The user's dilemma when doing clustering
Data clustering: 50 years beyond K-means AK Jain - Pattern recognition letters, 2010
https://www.sciencedirect.com/science/article/pii/S0167865509002323
How good is the clustering?
31
How to assess the quality of a clustering?
• Internal QC criteria (without ground-truth knowledge or labels)
Visualization based: dendrogram, heatmap, coloring in 2D plot (PCA, MDS, t-SNE)
Silhouette plot
WCV to quantify within-cluster variation
…
• External QC measures (requires labels or ground truth knowledge)
• Confusion matrix
• Mutual information
• ….
See also: Data clustering: 50 years beyond K-means AK Jain - Pattern recognition letters, 2010
https://www.sciencedirect.com/science/article/pii/S0167865509002323
Confusion matrix (rows: ACTUAL CLASS, columns: Assigned cluster):

ACTUAL CLASS    black   yellow   red
black             7       0       0
brown             2       6       2
blue              0       0      16

Accuracy = (7 + 6 + 16) / (7 + 6 + 16 + 2 + 2) = 29/33 ≈ 0.88
For an ideal classifier the off-diagonal entries (here the two entries of 2) should be zero, i.e. Accuracy = 1.
Evaluate prediction accuracy on data with a confusion matrix: simply count # correct / # all.
Confusion Matrix / Accuracy
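As a small sketch (not from the slides), the accuracy above can be reproduced in R by matching each true class with its best-fitting cluster:

cm = matrix(c(7, 0, 0,
              2, 6, 2,
              0, 0, 16), nrow = 3, byrow = TRUE,
            dimnames = list(actual = c("black", "brown", "blue"),
                            cluster = c("black", "yellow", "red")))
sum(apply(cm, 1, max)) / sum(cm)   # (7 + 6 + 16) / 33 = 0.88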
34
Mutual information measures the amount of information that one random variable carries about the other.
The pointwise mutual information (PMI) of a pair of outcomes x and y belonging to discrete random variables X and Y is defined as:

$$\mathrm{pmi}(x;y) = \log_2 \frac{p(x,y)}{p(x)\, p(y)}$$

PMI is zero if X and Y are independent, since in the case of independence it holds that p(x,y) = p(x)·p(y).
PMI is maximal when X and Y are perfectly associated.

Pointwise mutual information (PMI): A simple example with two binary variables

2x2 example, joint distribution p(x,y) with marginal distributions p(x) and p(y):

         y=0     y=1    | p(x)
x=0      0.10    0.70   | 0.80
x=1      0.15    0.05   | 0.20
p(y)     0.25    0.75   |

pmi(x=0; y=0) = log2( 0.10 / (0.80 · 0.25) ) = -1
pmi(x=0; y=1) = log2( 0.70 / (0.80 · 0.75) ) ≈ 0.22
pmi(x=1; y=0) = log2( 0.15 / (0.20 · 0.25) ) ≈ 1.58
pmi(x=1; y=1) = log2( 0.05 / (0.20 · 0.75) ) ≈ -1.58
35
The mutual information (MI) of the random
variables X and Y is the expected value of the
pointwise mutual information (pmi) over all
possible outcomes.
MI=0 if X carries no information about Y.
$$MI(X;Y) = E_{X,Y}\big[\mathrm{pmi}(x,y)\big] = \sum_{x,y} p(x,y)\, \log_2 \frac{p(x,y)}{p(x)\, p(y)}$$

Mutual Information (MI): A simple example with two binary variables

2x2 example, same joint and marginal distributions as before:

         y=0     y=1    | p(x)
x=0      0.10    0.70   | 0.80
x=1      0.15    0.05   | 0.20
p(y)     0.25    0.75   |

MI(X;Y) = 0.10·log2(0.10/(0.80·0.25)) + 0.70·log2(0.70/(0.80·0.75))
          + 0.15·log2(0.15/(0.20·0.25)) + 0.05·log2(0.05/(0.20·0.75))
        = 0.10·(-1) + 0.70·0.22 + 0.15·1.58 + 0.05·(-1.58)
        ≈ 0.21
Using external information to measure cluster quality: Normalized mutual information criterion
https://www.coursera.org/learn/cluster-analysis/lecture/baJNC/6-5-external-measure-2-entropy-based-measures
The mutual information MI(C,T) between the categorical variable T, giving the ground-truth group (1, ..., k), and the categorical variable C, giving the cluster assignment (1, ..., r), quantifies the amount of information that the clustering carries about the ground-truth group; MI(C,T) therefore quantifies the clustering quality.

$$MI(C;T) = \sum_{i=1}^{r} \sum_{j=1}^{k} p(c_i, t_j)\, \log_2 \frac{p(c_i, t_j)}{p(c_i)\, p(t_j)} \;\in\; [0, \infty)$$

MI(C,T) is zero if C and T are independent; however, there is no upper bound for MI. Therefore, the normalized mutual information NMI is often used instead, with NMI = 0 for the worst possible clustering and NMI = 1 for a perfect clustering:

$$NMI(C;T) = \frac{MI(C;T)}{\sqrt{H(C)\, H(T)}} \;\in\; [0, 1]$$

with the entropy of the clustering C and of the partitioning T:

$$H(C) = -\sum_{i=1}^{r} p(c_i) \log_2 p(c_i), \qquad H(T) = -\sum_{j=1}^{k} p(t_j) \log_2 p(t_j)$$
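A minimal base-R sketch of these formulas (not from the slides; the function name nmi and its arguments are illustrative):

nmi = function(clus, truth) {
  p  = table(clus, truth) / length(clus)     # joint distribution p(c, t)
  pc = rowSums(p); pt = colSums(p)           # marginal distributions p(c), p(t)
  ok = p > 0                                 # skip zero cells (0 * log 0 := 0)
  mi = sum(p[ok] * log2(p[ok] / outer(pc, pt)[ok]))   # mutual information
  hC = -sum(pc[pc > 0] * log2(pc[pc > 0]))   # entropy H(C)
  hT = -sum(pt[pt > 0] * log2(pt[pt > 0]))   # entropy H(T)
  mi / sqrt(hC * hT)                         # normalized mutual information
}
## Example: NMI between a k-means clustering of iris and the true species
nmi(kmeans(iris[, 1:4], centers = 3)$cluster, iris$Species)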
Summary: Typical Steps in Cluster Analysis
Hierarchical clustering, dendrogram, cutting a dendrogram
Partitioning methods: k-Means, PAM
Choosing number of clusters (w/o external knowledge): drop in dendrogram (hierarchical clustering)
drop in sum of within-cluster variation WCV (partitioning clustering)
QC (w/o external knowledge): Visualization in 2D: PCA or MDS, use different colors for different clusters
Silhouette plot
WCV to quantify within-cluster variation
Giving meaning to the clusters:
- generally hard in high dimensions
- look at centers or representatives (easy in PAM)
- look at heatmap and interpret colors (for numeric data)
- perform Classification with cluster-ID as class-label
and look at variable importance (we will do that later in this course)
37
Let’s switch from unsupervised to supervised learning
Supervised learning vs. unsupervised learning
38
Citation (Yann LeCun, 2018)
“THE REVOLUTION WILL NOT BE SUPERVISED”
http://engineering.nyu.edu/news/2018/03/06/revolution-will-not-be-supervised-promises-facebooks-yann-lecun-kickoff-ai-seminar
Classification
Classification is a
prediction method
Idea:
Train a classifier based
on training data and
use the classifier to
classify new test
observations with
unknown class label.
Which feature should we
use to describe an
observation (animal)?
Camels
Dromedaries
???
What is a classification task?
40
Feature extraction
Defining appropriate features is essential for the success of the
classification task!
It is not always as simple as it is in this example:
Features can be combined into new features or selected.
ID of animal   Class label   Number of legs   Number of bumps   Length of legs [cm]
1              Dromedary     4                1                 98
2              Camel         3                2                 87
…              …             …                …                 …
150            Camel         4                2                 103
41
Data

Y: Class label     X: Number of legs   Number of bumps   Color of pelt
Dromedary          4                   1                 brown
Camel              3                   2                 beige
…                  …                   …                 …
Camel              4                   2                 brown

Y: the labels are categorical; X: a data matrix with several features.
In classification we try to predict the class labels using the features.
Principal Idea Classification
id    Type        Sepal.Length   Sepal.Width   Petal.Length   Petal.Width
1     setosa      5.1            3.5           1.4            0.2
2     setosa      4.9            3             1.4            0.2
3     virginica   3.3            3.2           1.6            0.5
4     setosa      5.1            3.5           1.4            0.2
…     …           …              …             …              …
150   virginica   4.9            3             1.4            0.2
Training Data
Classifier
Learn a classifier
id    Type   Sepal.Length   Sepal.Width   Petal.Length   Petal.Width
1     ?      3.1            3.5           1.4            0.2
2     ?      4.9            3             1.4            0.2
3     ?      3.3            3.2           1.6            0.5
4     ?      5.1            3.5           31.4           0.2
Unknown data / Test data
Classifier
Predict Type
Classifiers:
• Neural networks
• Decision trees
• …
Note:
To evaluate the performance, a part of the labelled data is not used to train the classifier
but is left aside to check the performance of the classifier on new data.
Examples of Classification Tasks
Sentiment analysis: Is a given text, e.g. a tweet about a product, positive, negative, or neutral?
"The movie XXX is actually neither that funny, nor super witty" -> Negative
Churn in marketing: Predict which customers want to quit and offer them a discount.
Face detection: Image (array of pixels) -> John
…
44
K-Nearest-Neighbors in a nutshell
Idea of knn classification:
- Start with an observation x0 with
unknown class label
- Find the k training observations that
have the smallest distance to x0
- Use the majority class among the k
neighbors as class label for x0
R functions to know
- From package “class”: “knn”
45
knn classifier with 3 class labels, after training with k=1
[Figure: left: the data with true class labels; right: the trained classifier (knn with k=1); axes x1 and x2]
46
The effect of K
Which k to use? Let’s quantify the error / accuracy.
Types of Errors
Training error or naïve error or in-sample-error:
Error on data that have been used to train the model
Generalization error or test error or out-of-sample-error:
Error on previously unseen records (out of sample)
Over-fitting phenomenon:
Model fits the training data well (small training error) but shows high
generalization error
49
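A minimal sketch (not from the slides, using the iris data) of how the training and test errors of knn typically behave as a function of k:

library(class)
set.seed(1)
idx = sample(nrow(iris), 100)                 # random train/test split
train = iris[idx, 1:4];  test = iris[-idx, 1:4]
cl.tr = iris$Species[idx]; cl.te = iris$Species[-idx]
ks = 1:25
train.err = sapply(ks, function(k)
  mean(knn(train, train, cl = cl.tr, k = k) != cl.tr))  # in-sample error
test.err = sapply(ks, function(k)
  mean(knn(train, test, cl = cl.tr, k = k) != cl.te))   # out-of-sample error
matplot(ks, cbind(train.err, test.err), type = "b", pch = 1:2,
        xlab = "k (number of neighbors)", ylab = "misclassification error")
legend("topright", c("training error", "test error"), pch = 1:2, col = 1:2)

For k = 1 the training error is 0 (over-fitting), while the test error typically has its minimum at an intermediate k.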
"Perfect" vs. "simple" classifier
[Figure: two-class data in the (x1, x2) plane, with the decision boundary of a "perfect" classifier and of a "simple" classifier]
Which is better?
Check on a test-set (cross validation).
50
Cross validation of the “simple” classifier
[Figure: the "simple" classifier evaluated on the training data (Train) and on the test data (Test); axes x1 and x2]
Training set: 6/29 ≈ 20% misclassification
Test set: 2/25 = 8% misclassification
51
Cross validation of the “Perfect” classifier
[Figure: the "perfect" classifier evaluated on the training data (Train) and on the test data (Test); axes x1 and x2]
Training set: 0% misclassification
Test set: 8/25 = 24% misclassification
52
What is the right complexity of a model?
53
library(class) # we need this package to use the function knn
data(iris)
# reorder rows randomly
rand=sample(nrow(iris))
iris.rand=iris[rand,]
# generate train and test data
my.train = iris.rand[1:25, 1:4]
my.test = iris.rand[26:50, 1:4]
# generate class labels for train and test data
class.train = iris.rand[1:25, 5]
class.test = iris.rand[26:50, 5]
# train knn with known train labels
# prediction will be done only for test data
iris.knn.pred = knn(train=my.train, test=my.test,
cl=class.train, k=3, prob=TRUE)
# confusion matrix:
table(iris.knn.pred, class.test)
# class.test
# iris.knn.pred setosa versicolor virginica
# setosa 11 0 0
# versicolor 0 3 1
# virginica 0 0 10
knn classification in R with iris and its 3 classes of flowers (here with k = 3 neighbors)
54
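A short follow-up to the code above (a sketch): the overall test accuracy can be read off the confusion matrix or computed directly:

mean(iris.knn.pred == class.test)    # fraction of correctly classified test observations
# here: (11 + 3 + 10) / 25 = 0.96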