introduction to computer sciencerolf/edu/dm534/e18/dm534... · 2018-10-04 · fall 2018...

101
Introduction to Computer Science Fall 2018 Fall 2018 DM534 Introduction to Computer Science Clustering and Feature Spaces

Upload: others

Post on 05-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Fall 2018

DM534

Introduction toComputer Science

Clustering and Feature Spaces

Page 2: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

About Me

2

▪ Richard Roettger:▪ Computer Science (Technical University of Munich and thesis at the ICSI at

the University of California at Berkeley)

▪ PhD at the Max Planck Institute for Computer Science in Saarbrücken

▪ Since 2014: Assistant Professor at SDU

▪ Research Interests:▪ Bioinformatics

▪ Machine Learning

▪ Clustering

▪ Biological Networks

▪ Part of the Slides are taken from Arthur Zimek

Page 3: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Learning Objectives

3

▪ Understand the problem of clustering in general

▪ Learn about k-means

▪ Understand the importance of feature spaces and object representation

▪ Understand the influence of distance functions

Clustering & Feature Spaces

Page 4: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Lecture ContentClustering & Feature Spaces

4

▪ Clustering

▪ Clustering in General

▪ Partitional Clustering

▪ Visualization: Algorithmic Differences

▪ Summary

▪ Feature Spaces

▪ Distances

▪ Features for Images

▪ Summary

Page 5: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Purpose of Clustering

5

▪ Identify a finite number of categories (classes, groups: clusters) in a given dataset

▪ Similar objects shall be grouped in the same cluster, dissimilar objects in different clusters

▪ “similarity” is highly subjective, depending on the application scenario

Page 6: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

How Many Clusters?

6

➢ Image taken from: http://fromdatawithlove.thegovans.us/

Page 7: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

How Many Clusters?

7

➢ Everitt, Brian S., et al. “Cluster Analysis”, 5th Edition (2011).

Page 8: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

How Many Clusters?

8

➢ Everitt, Brian S., et al. “Cluster Analysis”, 5th Edition (2011).

Page 9: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

A Seemingly Simple Problem

9

➢ Figure from Tan et al. [2006].

▪ Each dataset can be clustered in many meaningful ways▪ Highly problem depended

▪ Not known by the algorithm a priori

Page 10: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

About Clustering

10

“Clustering is the unsupervised machine-learning task of “grouping or segmenting a collection of objects into subsets or ’clusters’ such that

those within each cluster are more closely related to one another than objects assigned to different clusters.“

▪ What is related? For example Customers?▪ Age?

▪ Behavior?

▪ Kinship?

▪ Treatment of Outliers?

▪ Ill-posed Problem ...▪ That means there exist multiple solutions

▪ What is the best?

Page 11: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Overview of a Cluster Analysis

11

Page 12: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Overview of a Cluster Analysis

12

Page 13: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Lecture ContentClustering & Feature Spaces

13

▪ Clustering

▪ Clustering in General

▪ Partitional Clustering

▪ Visualization: Algorithmic Differences

▪ Summary

▪ Feature Spaces

▪ Distances

▪ Features for Images

▪ Summary

Page 14: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Steps to Automatization: Cluster Criteria

14

▪ Cohesion: how strong are the cluster objects connected (how similar, pairwise, to each other)?

▪ Separation: how well is a cluster separated from other clusters?

small within clusterdistance

large between clusterdistance

Page 15: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Steps to Automatization: Cluster Criteria

15

▪ Cohesion: how strong are the cluster objects connected (how similar, pairwise, to each other)?

▪ Separation: how well is a cluster separated from other clusters?

small within clusterdistance

large between clusterdistance

• There exist many other criteria, e.g., areas with the same density.

• It is important to choose a criterion which fits the data!

Page 16: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Optimization

16

▪ Partitional clustering algorithms partition a dataset into k clusters, typically minimizing some cost function▪ no overlaps

▪ all points must be part of a cluster

▪ (compactness criterion), i.e., optimizing cohesion.

Page 17: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Assumptions for Partitioning Clustering

17

▪ Central assumptions for approaches in this family are typically:▪ number k of clusters known (i.e., given as input)

▪ clusters are characterized by their compactness

▪ compactness measured by some distance function (e.g., distance of all objects in a cluster from some cluster representative is minimal)

▪ criterion of compactness typically leads to convex or even spherically shaped clusters

Page 18: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Construction of Central Points: Basics

18

▪ objects are points 𝑥 = 𝑥1, … , 𝑥𝑑 in Euclidean vector space ℝ𝑑

▪ dist = Euclidean distance (𝐿2)

▪ I centroid 𝜇𝐶: mean vector of all points in cluster 𝐶

Page 19: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Construction of Central Points: Basics

19

▪ objects are points 𝑥 = 𝑥1, … , 𝑥𝑑 in Euclidean vector space ℝ𝑑

▪ dist = Euclidean distance (𝐿2)

▪ I centroid 𝜇𝐶: mean vector of all points in cluster 𝐶

Page 20: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Cluster Criteria

20

▪ Measure for compactness:

𝑇𝐷2 𝐶 =

𝑝∈𝐶

𝑑𝑖𝑠𝑡 𝑝, 𝜇𝐶2

(sum of squares)

▪ Measure of compactnessfor a clustering:

𝑇𝐷2 𝐶1, … , 𝐶𝑘 =

𝑖=1

𝑘

𝑇𝐷2(𝐶𝑖)

Page 21: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Cluster Criteria

21

▪ Measure for compactness:

𝑇𝐷2 𝐶 =

𝑝∈𝐶

𝑑𝑖𝑠𝑡 𝑝, 𝜇𝐶2

(sum of squares)

▪ Measure of compactnessfor a clustering:

𝑇𝐷2 𝐶1, … , 𝐶𝑘 =

𝑖=1

𝑘

𝑇𝐷2(𝐶𝑖)

Page 22: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Basic Algorithm: Clustering by Minimization of Variance [Forgy, 1965, Lloyd, 1982]

22

▪ start with 𝑘 (e.g., randomly selected) points as cluster representatives (or with a random partition into 𝑘 “clusters”)

▪ repeat:1. assign each point to the closest representative

2. compute new representatives based on the given partitions (centroid of the assigned points)

▪ until there is no change in assignment

Page 23: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

𝑘-means

23

𝑘-means [MacQueen, 1967] is a variant of the basic algorithm:

▪ A centroid is immediately updated when some point changes its assignment

▪ 𝑘-means has very similar properties, but the result now depends on the order of data points in the input file

Note:

▪ The name “𝑘-means” is often used indifferently for any variant of the basic algorithm, in particular also for the Algorithm shown before [Forgy, 1965, Lloyd, 1982].

Page 24: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Lecture ContentClustering & Feature Spaces

24

▪ Clustering

▪ Clustering in General

▪ Partitional Clustering

▪ Visualization: Algorithmic Differences

▪ Summary

▪ Feature Spaces

▪ Distances

▪ Features for Images

▪ Summary

Page 25: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 201825

k-means ClusteringLloyd/Forgy Algorithm

Page 26: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

26

Page 27: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

27

Page 28: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

28

Page 29: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

29

Page 30: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

30

Page 31: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

31

Page 32: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

32

Page 33: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

33

Page 34: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 201834

k-means ClusteringMacQueen Algorithm

Page 35: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

35

Page 36: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

36

Page 37: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

37

Page 38: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

38

Page 39: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

39

Page 40: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

40

Page 41: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

41

Page 42: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

42

Page 43: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

43

Page 44: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

44

Page 45: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

45

Page 46: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

46

Page 47: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

47

Page 48: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 201848

k-means ClusteringMacQueen Algorithm

Alternative Ordering

Page 49: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

49

Page 50: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

50

Page 51: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

51

Page 52: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

52

Page 53: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

53

Page 54: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

54

Page 55: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

55

Page 56: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

56

Page 57: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

57

Page 58: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

58

Page 59: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

59

Page 60: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

60

Page 61: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – Quality

61

Page 62: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – Quality

62

Page 63: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means Clustering – Quality

63

Page 64: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Lecture ContentClustering & Feature Spaces

64

▪ Clustering

▪ Clustering in General

▪ Partitional Clustering

▪ Visualization: Algorithmic Differences

▪ Summary

▪ Feature Spaces

▪ Distances

▪ Features for Images

▪ Summary

Page 65: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means: Pros

65

▪ Efficient: 𝑂(𝑘 ⋅ 𝑛) per iteration, number of iterations is usually in the order of 10.

▪ Easy to implement, thus very popular

▪ Only one parameter, easy to understand

▪ Well understood and researched

▪ Different variants exists▪ Fuzzy clustering

▪ Variants without n-dimensional embedding

Page 66: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

k-means: Disadvantages

66

▪ k-means converges towards a local minimum

▪ k-means (MacQueen-variant) is order-dependent

▪ Deteriorates with noise and outliers (all points are used to compute centroids)

▪ Clusters need to be convex and of (more or less) equal extension

▪ Number k of clusters is hard to determine

▪ Strong dependency on initial partition (in result quality as well as runtime)

Page 67: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

What to do?

67

How can we tackle the initialization problem?

Page 68: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

What to do?

68

How can we tackle the initialization problem?

▪ Repeated runs

▪ Furthest-first initialization

▪ Subset Furthest-first initialization

Page 69: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Insertion: Furthest First Initialization

69

▪ Select a random point as start

▪ For each point, the minimum of its distances to the selected centers is maintained.

▪ While less than 𝑘 points selected, repeat:▪ Selected point 𝑝 with the maximum distance to the existing centers

▪ Remove p from the not-yet-selected points and add it to the center points

▪ For each remaining not-yet-selected point q, replace the distance stored for q by the minimum of its old value and the distance from p to q.

Page 70: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

70

Page 71: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

71

Page 72: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

72

Page 73: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

73

Page 74: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

74

Update the minimum distances

Page 75: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

75

Page 76: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

76

Page 77: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

77

Page 78: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

78

Page 79: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Learnings of this Section

79

▪ What is Clustering?

▪ Basic idea for identifying “good” partitions into k clusters

▪ Selection of representative points

▪ Iterative refinement

▪ Local optimum

▪ k-means variants [Forgy, 1965, Lloyd, 1982, MacQueen, 1967]

▪ Different initialization methods

Page 80: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Lecture ContentClustering & Feature Spaces

80

▪ Clustering

▪ Clustering in General

▪ Partitional Clustering

▪ Visualization: Algorithmic Differences

▪ Summary

▪ Feature Spaces

▪ Distances

▪ Features for Images

▪ Summary

Page 81: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Recall: Clustering as a Workflow

81

Page 82: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Similarities

82

▪ Similarity (as given by some distance measure) is a central concept in data mining, e.g.:▪ Clustering: group similar objects in the same cluster, separate dissimilar

objects to different clusters

▪ Outlier detection: identify objects that are dissimilar (by some characteristic) from most other objects

▪ Definition of a suitable distance measure is often crucial for deriving a meaningful solution in the data mining task▪ Images

▪ CAD objects

▪ Proteins

▪ Texts

▪ . . .

Page 83: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Spaces and Distance Functions

83

Page 84: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Spaces and Distance Functions

84

Page 85: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Lecture ContentClustering & Feature Spaces

85

▪ Clustering

▪ Clustering in General

▪ Partitional Clustering

▪ Visualization: Algorithmic Differences

▪ Summary

▪ Feature Spaces

▪ Distances

▪ Features for Images

▪ Summary

Page 86: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Categories of Feature Descriptors for Images

86

▪ Distribution of colors

▪ Texture

▪ Shapes (contours)

▪ Many more …

Page 87: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Color Histogram

87

▪ A histogram represents the distribution of colors over the pixels of an image

▪ Definition of an color histogram:▪ Choose a color space (RGB, HSV, HLS, . . . )

▪ Choose number of representants (sample points) in the color space

▪ Possibly normalization (to account for different image sizes)

Page 88: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

88

Page 89: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

89

Page 90: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

90

Page 91: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

91

Page 92: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

92

Page 93: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

93

Page 94: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

94

Page 95: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

95

Page 96: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Distances for Color Histograms

96

Page 97: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Distances for Color Histograms

97

Page 98: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Distances for Color Histograms

98

Page 99: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Lecture ContentClustering & Feature Spaces

99

▪ Clustering

▪ Clustering in General

▪ Partitional Clustering

▪ Visualization: Algorithmic Differences

▪ Summary

▪ Feature Spaces

▪ Distances

▪ Features for Images

▪ Summary

Page 100: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Your Choice of a Distance Measure

100

▪ There are hundreds of distance functions [Deza and Deza, 2009].▪ For time series: DTW, EDR, ERP, LCSS, . . .

▪ For texts: Cosine and normalizations

▪ For sets – based on intersection, union, . . . (Jaccard)

▪ For clusters (single-link, average-link, etc.)

▪ For histograms: histogram intersection, “Earth movers distance”, quadratic forms with color similarity

▪ For proteins: Edit distance, structure, …

▪ With normalization: Canberra, …

▪ Quadratic forms / bilinear forms: 𝑑(𝑥, 𝑦) ∶= 𝑥𝑇𝑀𝑦 for some positive (usually symmetric) definite matrix 𝑀.

Page 101: Introduction to Computer Sciencerolf/Edu/DM534/E18/dm534... · 2018-10-04 · Fall 2018 Introduction to Computer Science Assumptions for Partitioning Clustering 17 Central assumptions

Introduction to Computer ScienceFall 2018

Learnings of this Section

101

▪ Distances (𝐿𝑝-norms, weighted, quadratic form)

▪ Color histograms as feature (vector) descriptors for images

▪ Impact of the granularity of color histograms on similarity measures