DM534: Introduction to Computer Science, Fall 2018
Clustering and Feature Spaces
About Me
▪ Richard Roettger:
  ▪ Computer Science (Technical University of Munich; thesis at the ICSI at the University of California, Berkeley)
  ▪ PhD at the Max Planck Institute for Computer Science in Saarbrücken
  ▪ Since 2014: Assistant Professor at SDU
▪ Research interests:
  ▪ Bioinformatics
  ▪ Machine learning
  ▪ Clustering
  ▪ Biological networks
▪ Part of the slides are taken from Arthur Zimek.
Learning Objectives
▪ Understand the problem of clustering in general
▪ Learn about k-means
▪ Understand the importance of feature spaces and object representation
▪ Understand the influence of distance functions
Clustering & Feature Spaces
Lecture Content: Clustering & Feature Spaces
▪ Clustering
▪ Clustering in General
▪ Partitional Clustering
▪ Visualization: Algorithmic Differences
▪ Summary
▪ Feature Spaces
▪ Distances
▪ Features for Images
▪ Summary
Purpose of Clustering
▪ Identify a finite number of categories (classes, groups: clusters) in a given dataset
▪ Similar objects shall be grouped in the same cluster, dissimilar objects in different clusters
▪ “similarity” is highly subjective, depending on the application scenario
How Many Clusters?
➢ Image taken from: http://fromdatawithlove.thegovans.us/
How Many Clusters?
➢ Everitt, Brian S., et al. “Cluster Analysis”, 5th Edition (2011).
A Seemingly Simple Problem
➢ Figure from Tan et al. [2006].
▪ Each dataset can be clustered in many meaningful ways:
  ▪ Highly problem-dependent
  ▪ Not known by the algorithm a priori
About Clustering
Clustering is the unsupervised machine-learning task of “grouping or segmenting a collection of objects into subsets or ‘clusters’ such that those within each cluster are more closely related to one another than objects assigned to different clusters.”
▪ What is related? For example, customers:
  ▪ Age?
▪ Behavior?
▪ Kinship?
▪ Treatment of Outliers?
▪ Ill-posed problem:
  ▪ That means there exist multiple solutions
  ▪ Which one is the best?
Overview of a Cluster Analysis
(figures omitted)

Lecture Content: Clustering & Feature Spaces
▪ Clustering
▪ Clustering in General
▪ Partitional Clustering
▪ Visualization: Algorithmic Differences
▪ Summary
▪ Feature Spaces
▪ Distances
▪ Features for Images
▪ Summary
Steps to Automation: Cluster Criteria
▪ Cohesion: how strong are the cluster objects connected (how similar, pairwise, to each other)?
▪ Separation: how well is a cluster separated from other clusters?
small within-cluster distance
large between-cluster distance
▪ There exist many other criteria, e.g., areas with the same density.
▪ It is important to choose a criterion which fits the data!
Optimization
▪ Partitional clustering algorithms partition a dataset into k clusters, typically minimizing some cost function (compactness criterion), i.e., optimizing cohesion:
  ▪ no overlaps
  ▪ all points must be part of a cluster
Assumptions for Partitioning Clustering
▪ Central assumptions for approaches in this family are typically:
  ▪ the number k of clusters is known (i.e., given as input)
  ▪ clusters are characterized by their compactness
  ▪ compactness is measured by some distance function (e.g., the distance of all objects in a cluster to some cluster representative is minimal)
  ▪ the criterion of compactness typically leads to convex or even spherically shaped clusters
Construction of Central Points: Basics
▪ objects are points $x = (x_1, \dots, x_d)$ in the Euclidean vector space $\mathbb{R}^d$
▪ dist = Euclidean distance ($L_2$)
▪ centroid $\mu_C$: mean vector of all points in cluster $C$
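These basics can be written out in a few lines of Python (an illustrative sketch, not part of the slides; the function names are my own):

```python
import math

def euclidean(x, y):
    # dist: Euclidean (L2) distance between two points in R^d
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def centroid(cluster):
    # mu_C: mean vector of all points in cluster C
    n, d = len(cluster), len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / n for i in range(d))

# Example: three points in R^2
print(euclidean((0, 0), (3, 4)))           # 5.0
print(centroid([(0, 0), (2, 0), (1, 3)]))  # (1.0, 1.0)
```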
Cluster Criteria
▪ Measure of compactness for a cluster:

  $TD^2(C) = \sum_{p \in C} \mathrm{dist}(p, \mu_C)^2$   (sum of squares)

▪ Measure of compactness for a clustering:

  $TD^2(C_1, \dots, C_k) = \sum_{i=1}^{k} TD^2(C_i)$
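Both measures are straightforward to compute; a minimal Python sketch (the helper names `td2` and `td2_total` are my own):

```python
def td2(cluster, mu):
    # TD2(C): sum of squared Euclidean distances of the points in C to mu
    return sum(sum((pi - mi) ** 2 for pi, mi in zip(p, mu)) for p in cluster)

def td2_total(clusters, centroids):
    # TD2(C1, ..., Ck): compactness of the whole clustering
    return sum(td2(c, mu) for c, mu in zip(clusters, centroids))

c = [(0, 0), (2, 0), (1, 3)]   # centroid is (1, 1)
print(td2(c, (1.0, 1.0)))      # 2 + 2 + 4 = 8.0
```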
Basic Algorithm: Clustering by Minimization of Variance [Forgy, 1965, Lloyd, 1982]
▪ start with k (e.g., randomly selected) points as cluster representatives (or with a random partition into k “clusters”)
▪ repeat:
1. assign each point to the closest representative
2. compute new representatives based on the given partitions (centroid of the assigned points)
▪ until there is no change in assignment
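The loop above can be sketched in Python (an illustrative sketch; the random seeding, tie-breaking, and empty-cluster handling are simplifying assumptions, not part of the slides):

```python
import random

def lloyd_kmeans(points, k, seed=0):
    # Basic variance-minimization clustering: repeat (1) assign each point
    # to its closest representative, (2) recompute each representative as
    # the centroid of its points, until the assignment no longer changes.
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # k random start points
    assignment = None
    while True:
        new_assignment = [
            min(range(k),
                key=lambda j: sum((pi - ci) ** 2 for pi, ci in zip(p, centers[j])))
            for p in points
        ]
        if new_assignment == assignment:   # no change in assignment: done
            return [tuple(c) for c in centers], assignment
        assignment = new_assignment
        for j in range(k):                 # step 2: recompute the centroids
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                    # keep the old center for empty clusters
                d = len(members[0])
                centers[j] = [sum(m[i] for m in members) / len(members)
                              for i in range(d)]
```

On two well-separated groups, e.g. `[(0, 0), (0, 1), (10, 10), (10, 11)]` with k = 2, the loop converges after a few iterations to the two obvious clusters.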
k-means
𝑘-means [MacQueen, 1967] is a variant of the basic algorithm:
▪ A centroid is immediately updated when some point changes its assignment
▪ 𝑘-means has very similar properties, but the result now depends on the order of data points in the input file
Note:
▪ The name “k-means” is often used interchangeably for any variant of the basic algorithm, in particular also for the algorithm shown before [Forgy, 1965, Lloyd, 1982].
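The immediate-update idea can be sketched as follows (a simplified, single-pass Python illustration, not MacQueen's exact procedure; seeding with the first k points is an assumption made here):

```python
def macqueen_kmeans(points, k):
    # MacQueen-style k-means: a center is updated immediately when a point
    # is assigned to it, so the result depends on the input order.
    centers = [list(p) for p in points[:k]]   # first k points seed the centers
    counts = [1] * k
    labels = list(range(k))
    for p in points[k:]:
        j = min(range(k),
                key=lambda c: sum((pi - ci) ** 2 for pi, ci in zip(p, centers[c])))
        counts[j] += 1
        for i in range(len(p)):               # incremental mean: mu += (p - mu)/n
            centers[j][i] += (p[i] - centers[j][i]) / counts[j]
        labels.append(j)
    return [tuple(c) for c in centers], labels
```

Reordering the input points can change both the intermediate centers and the final result, which is exactly the order dependence noted above.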
Lecture Content: Clustering & Feature Spaces
▪ Clustering
▪ Clustering in General
▪ Partitional Clustering
▪ Visualization: Algorithmic Differences
▪ Summary
▪ Feature Spaces
▪ Distances
▪ Features for Images
▪ Summary
k-means Clustering – Lloyd/Forgy Algorithm
(step-by-step visualization of the iterations; figures omitted)
k-means Clustering – MacQueen Algorithm
(step-by-step visualization of the iterations; figures omitted)
k-means Clustering – MacQueen Algorithm: Alternative Ordering
(the same data processed in a different input order; figures omitted)
k-means Clustering – Quality
(figures omitted)
Lecture Content: Clustering & Feature Spaces
▪ Clustering
▪ Clustering in General
▪ Partitional Clustering
▪ Visualization: Algorithmic Differences
▪ Summary
▪ Feature Spaces
▪ Distances
▪ Features for Images
▪ Summary
k-means: Pros
▪ Efficient: $O(k \cdot n)$ per iteration; the number of iterations is usually on the order of 10.
▪ Easy to implement, thus very popular
▪ Only one parameter, easy to understand
▪ Well understood and researched
▪ Different variants exist:
  ▪ Fuzzy clustering
  ▪ Variants without n-dimensional embedding
k-means: Disadvantages
▪ k-means converges towards a local minimum
▪ k-means (MacQueen-variant) is order-dependent
▪ Deteriorates with noise and outliers (all points are used to compute centroids)
▪ Clusters need to be convex and of (more or less) equal extension
▪ Number k of clusters is hard to determine
▪ Strong dependency on initial partition (in result quality as well as runtime)
What to do?
How can we tackle the initialization problem?
▪ Repeated runs
▪ Furthest-first initialization
▪ Subset Furthest-first initialization
Insertion: Furthest First Initialization
▪ Select a random point as start
▪ For each point, the minimum of its distances to the selected centers is maintained.
▪ While fewer than k points are selected, repeat:
  ▪ Select the point p with the maximum distance to the existing centers
  ▪ Remove p from the not-yet-selected points and add it to the center points
  ▪ For each remaining not-yet-selected point q, replace the distance stored for q by the minimum of its old value and the distance from p to q.
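The steps above can be sketched as (illustrative Python; points are assumed to be numeric tuples, and the "random" start point is simply the first one here):

```python
import math

def furthest_first(points, k):
    # Furthest-first initialization: start from one point, then repeatedly
    # pick the point whose minimum distance to the chosen centers is largest.
    centers = [points[0]]                     # start point
    # minimum distance from each point to the selected centers so far
    dmin = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=lambda i: dmin[i])
        centers.append(points[i])             # furthest point becomes a center
        for q in range(len(points)):          # update the stored minima
            dmin[q] = min(dmin[q], math.dist(points[q], points[i]))
    return centers

print(furthest_first([(0, 0), (1, 0), (10, 0), (11, 0)], 2))  # [(0, 0), (11, 0)]
```

Maintaining the per-point minimum distances is what keeps each selection round linear in the number of points.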
Furthest First Initialization: Visualization
(centers are selected one by one; after each selection the minimum distances are updated; figures omitted)
Learnings of this Section
▪ What is Clustering?
▪ Basic idea for identifying “good” partitions into k clusters
▪ Selection of representative points
▪ Iterative refinement
▪ Local optimum
▪ k-means variants [Forgy, 1965, Lloyd, 1982, MacQueen, 1967]
▪ Different initialization methods
Lecture Content: Clustering & Feature Spaces
▪ Clustering
▪ Clustering in General
▪ Partitional Clustering
▪ Visualization: Algorithmic Differences
▪ Summary
▪ Feature Spaces
▪ Distances
▪ Features for Images
▪ Summary
Recall: Clustering as a Workflow
Similarities
▪ Similarity (as given by some distance measure) is a central concept in data mining, e.g.:
  ▪ Clustering: group similar objects in the same cluster, separate dissimilar objects into different clusters
▪ Outlier detection: identify objects that are dissimilar (by some characteristic) from most other objects
▪ The definition of a suitable distance measure is often crucial for deriving a meaningful solution in the data mining task:
  ▪ Images
▪ CAD objects
▪ Proteins
▪ Texts
▪ . . .
Spaces and Distance Functions
(figures omitted)

Lecture Content: Clustering & Feature Spaces
▪ Clustering
▪ Clustering in General
▪ Partitional Clustering
▪ Visualization: Algorithmic Differences
▪ Summary
▪ Feature Spaces
▪ Distances
▪ Features for Images
▪ Summary
Categories of Feature Descriptors for Images
▪ Distribution of colors
▪ Texture
▪ Shapes (contours)
▪ Many more …
Color Histogram
▪ A histogram represents the distribution of colors over the pixels of an image
▪ Definition of a color histogram:
  ▪ Choose a color space (RGB, HSV, HLS, …)
  ▪ Choose the number of representants (sample points) in the color space
  ▪ Possibly normalize (to account for different image sizes)
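For the RGB case, the definition can be sketched directly (illustrative Python; `pixels` is assumed to be a flat list of (r, g, b) tuples with channels in 0..255):

```python
def color_histogram(pixels, bins):
    # RGB color histogram: quantize each 0..255 channel into `bins`
    # representants, count the pixels per (r, g, b) cell, and normalize
    # by the pixel count so images of different sizes are comparable.
    counts = [0] * (bins ** 3)
    for r, g, b in pixels:
        cell = ((r * bins // 256) * bins + (g * bins // 256)) * bins + (b * bins // 256)
        counts[cell] += 1
    n = len(pixels)
    return [c / n for c in counts]

# A tiny 2x2 "image": two black and two white pixels, 2 bins per channel
h = color_histogram([(0, 0, 0), (255, 255, 255), (0, 0, 0), (255, 255, 255)], 2)
print(h[0], h[7])  # 0.5 0.5  (half the mass in the darkest and brightest cells)
```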
Impact of Number of Representants
(figures omitted)
Distances for Color Histograms
(figures omitted)
Lecture Content: Clustering & Feature Spaces
▪ Clustering
▪ Clustering in General
▪ Partitional Clustering
▪ Visualization: Algorithmic Differences
▪ Summary
▪ Feature Spaces
▪ Distances
▪ Features for Images
▪ Summary
Your Choice of a Distance Measure
▪ There are hundreds of distance functions [Deza and Deza, 2009]:
  ▪ For time series: DTW, EDR, ERP, LCSS, …
  ▪ For texts: cosine and normalizations
  ▪ For sets: based on intersection, union, … (Jaccard)
  ▪ For clusters: single-link, average-link, etc.
  ▪ For histograms: histogram intersection, “earth mover’s distance”, quadratic forms with color similarity
  ▪ For proteins: edit distance, structure, …
  ▪ With normalization: Canberra, …
  ▪ Quadratic forms / bilinear forms: $d(x, y) := x^T M y$ for some positive definite (usually symmetric) matrix $M$.
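For instance, the quadratic-form distance commonly used for histograms takes the form $d_M(x, y) = \sqrt{(x - y)^T M (x - y)}$; a plain-Python sketch (the function name is my own):

```python
import math

def quadratic_form_dist(x, y, M):
    # d_M(x, y) = sqrt((x - y)^T M (x - y)) for a positive definite M.
    # Off-diagonal entries of M let similar (e.g. neighbouring color) bins
    # partially match instead of being treated as unrelated dimensions.
    d = [xi - yi for xi, yi in zip(x, y)]
    return math.sqrt(sum(d[i] * M[i][j] * d[j]
                         for i in range(len(d)) for j in range(len(d))))

I2 = [[1, 0], [0, 1]]   # with M = identity this reduces to Euclidean distance
print(quadratic_form_dist((0, 0), (3, 4), I2))  # 5.0
```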
Learnings of this Section
▪ Distances (𝐿𝑝-norms, weighted, quadratic form)
▪ Color histograms as feature (vector) descriptors for images
▪ Impact of the granularity of color histograms on similarity measures