introduction to computer science - imada.sdu.dkrolf/edu/dm534/e18/dm534_clustering.pdf · fall 2018...

101
Introduction to Computer Science Fall 2018 Fall 2018 DM534 Introduction to Computer Science Clustering and Feature Spaces

Upload: others

Post on 21-Sep-2019

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Fall 2018

DM534

Introduction toComputer Science

Clustering and Feature Spaces

Page 2: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

About Me

2

▪ Richard Roettger:▪ Computer Science (Technical University of Munich and thesis at the ICSI at

the University of California at Berkeley)

▪ PhD at the Max Planck Institute for Computer Science in Saarbrücken

▪ Since 2014: Assistant Professor at SDU

▪ Research Interests:▪ Bioinformatics

▪ Machine Learning

▪ Clustering

▪ Biological Networks

▪ Part of the Slides are taken from Arthur Zimek

Page 3: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Learning Objectives

3

▪ Understand the problem of clustering in general

▪ Learn about k-means

▪ Understand the importance of feature spaces and object representation

▪ Understand the influence of distance functions

Clustering & Feature Spaces

Page 4: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Lecture ContentClustering & Feature Spaces

4

▪ Clustering

▪ Clustering in General

▪ Partitional Clustering

▪ Visualization: Algorithmic Differences

▪ Summary

▪ Feature Spaces

▪ Distances

▪ Features for Images

▪ Summary

Page 5: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Purpose of Clustering

5

▪ Identify a finite number of categories (classes, groups: clusters) in a given dataset

▪ Similar objects shall be grouped in the same cluster, dissimilar objects in different clusters

▪ “similarity” is highly subjective, depending on the application scenario

Page 6: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

How Many Clusters?

6

➢ Image taken from: http://fromdatawithlove.thegovans.us/

Page 7: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

How Many Clusters?

7

➢ Everitt, Brian S., et al. “Cluster Analysis”, 5th Edition (2011).

Page 8: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

How Many Clusters?

8

➢ Everitt, Brian S., et al. “Cluster Analysis”, 5th Edition (2011).

Page 9: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

A Seemingly Simple Problem

9

➢ Figure from Tan et al. [2006].

▪ Each dataset can be clustered in many meaningful ways▪ Highly problem depended

▪ Not known by the algorithm a priori

Page 10: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

About Clustering

10

“Clustering is the unsupervised machine-learning task of “grouping or segmenting a collection of objects into subsets or ’clusters’ such that

those within each cluster are more closely related to one another than objects assigned to different clusters.“

▪ What is related? For example Customers?▪ Age?

▪ Behavior?

▪ Kinship?

▪ Treatment of Outliers?

▪ Ill-posed Problem ...▪ That means there exist multiple solutions

▪ What is the best?

Page 11: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Overview of a Cluster Analysis

11

Page 12: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Overview of a Cluster Analysis

12

Page 13: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Lecture ContentClustering & Feature Spaces

13

▪ Clustering

▪ Clustering in General

▪ Partitional Clustering

▪ Visualization: Algorithmic Differences

▪ Summary

▪ Feature Spaces

▪ Distances

▪ Features for Images

▪ Summary

Page 14: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Steps to Automatization: Cluster Criteria

14

▪ Cohesion: how strong are the cluster objects connected (how similar, pairwise, to each other)?

▪ Separation: how well is a cluster separated from other clusters?

small within clusterdistance

large between clusterdistance

Page 15: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Steps to Automatization: Cluster Criteria

15

▪ Cohesion: how strong are the cluster objects connected (how similar, pairwise, to each other)?

▪ Separation: how well is a cluster separated from other clusters?

small within clusterdistance

large between clusterdistance

• There exist many other criteria, e.g., areas with the same density.

• It is important to choose a criterion which fits the data!

Page 16: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Optimization

16

▪ Partitional clustering algorithms partition a dataset into k clusters, typically minimizing some cost function▪ no overlaps

▪ all points must be part of a cluster

▪ (compactness criterion), i.e., optimizing cohesion.

Page 17: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Assumptions for Partitioning Clustering

17

▪ Central assumptions for approaches in this family are typically:▪ number k of clusters known (i.e., given as input)

▪ clusters are characterized by their compactness

▪ compactness measured by some distance function (e.g., distance of all objects in a cluster from some cluster representative is minimal)

▪ criterion of compactness typically leads to convex or even spherically shaped clusters

Page 18: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Construction of Central Points: Basics

18

▪ objects are points 𝑥 = 𝑥1, … , 𝑥𝑑 in Euclidean vector space ℝ𝑑

▪ dist = Euclidean distance (𝐿2)

▪ I centroid 𝜇𝐶: mean vector of all points in cluster 𝐶

Page 19: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Construction of Central Points: Basics

19

▪ objects are points 𝑥 = 𝑥1, … , 𝑥𝑑 in Euclidean vector space ℝ𝑑

▪ dist = Euclidean distance (𝐿2)

▪ I centroid 𝜇𝐶: mean vector of all points in cluster 𝐶

Page 20: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Cluster Criteria

20

▪ Measure for compactness:

𝑇𝐷2 𝐶 =

𝑝∈𝐶

𝑑𝑖𝑠𝑡 𝑝, 𝜇𝐶2

(sum of squares)

▪ Measure of compactnessfor a clustering:

𝑇𝐷2 𝐶1, … , 𝐶𝑘 =

𝑖=1

𝑘

𝑇𝐷2(𝐶𝑖)

Page 21: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Cluster Criteria

21

▪ Measure for compactness:

𝑇𝐷2 𝐶 =

𝑝∈𝐶

𝑑𝑖𝑠𝑡 𝑝, 𝜇𝐶2

(sum of squares)

▪ Measure of compactnessfor a clustering:

𝑇𝐷2 𝐶1, … , 𝐶𝑘 =

𝑖=1

𝑘

𝑇𝐷2(𝐶𝑖)

Page 22: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Basic Algorithm: Clustering by Minimization of Variance [Forgy, 1965, Lloyd, 1982]

22

▪ start with 𝑘 (e.g., randomly selected) points as cluster representatives (or with a random partition into 𝑘 “clusters”)

▪ repeat:1. assign each point to the closest representative

2. compute new representatives based on the given partitions (centroid of the assigned points)

▪ until there is no change in assignment

Page 23: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

𝑘-means

23

𝑘-means [MacQueen, 1967] is a variant of the basic algorithm:

▪ A centroid is immediately updated when some point changes its assignment

▪ 𝑘-means has very similar properties, but the result now depends on the order of data points in the input file

Note:

▪ The name “𝑘-means” is often used indifferently for any variant of the basic algorithm, in particular also for the Algorithm shown before [Forgy, 1965, Lloyd, 1982].

Page 24: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Lecture ContentClustering & Feature Spaces

24

▪ Clustering

▪ Clustering in General

▪ Partitional Clustering

▪ Visualization: Algorithmic Differences

▪ Summary

▪ Feature Spaces

▪ Distances

▪ Features for Images

▪ Summary

Page 25: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 201825

k-means ClusteringLloyd/Forgy Algorithm

Page 26: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

26

Page 27: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

27

Page 28: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

28

Page 29: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

29

Page 30: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

30

Page 31: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

31

Page 32: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

32

Page 33: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – Lloyd/Forgy Algorithm

33

Page 34: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 201834

k-means ClusteringMacQueen Algorithm

Page 35: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

35

Page 36: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

36

Page 37: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

37

Page 38: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

38

Page 39: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

39

Page 40: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

40

Page 41: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

41

Page 42: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

42

Page 43: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

43

Page 44: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

44

Page 45: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

45

Page 46: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

46

Page 47: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

47

Page 48: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 201848

k-means ClusteringMacQueen Algorithm

Alternative Ordering

Page 49: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

49

Page 50: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

50

Page 51: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

51

Page 52: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

52

Page 53: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

53

Page 54: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

54

Page 55: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

55

Page 56: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

56

Page 57: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

57

Page 58: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

58

Page 59: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

59

Page 60: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – MacQueen Algorithm

60

Page 61: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – Quality

61

Page 62: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – Quality

62

Page 63: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means Clustering – Quality

63

Page 64: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Lecture ContentClustering & Feature Spaces

64

▪ Clustering

▪ Clustering in General

▪ Partitional Clustering

▪ Visualization: Algorithmic Differences

▪ Summary

▪ Feature Spaces

▪ Distances

▪ Features for Images

▪ Summary

Page 65: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means: Pros

65

▪ Efficient: 𝑂(𝑘 ⋅ 𝑛) per iteration, number of iterations is usually in the order of 10.

▪ Easy to implement, thus very popular

▪ Only one parameter, easy to understand

▪ Well understood and researched

▪ Different variants exists▪ Fuzzy clustering

▪ Variants without n-dimensional embedding

Page 66: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

k-means: Disadvantages

66

▪ k-means converges towards a local minimum

▪ k-means (MacQueen-variant) is order-dependent

▪ Deteriorates with noise and outliers (all points are used to compute centroids)

▪ Clusters need to be convex and of (more or less) equal extension

▪ Number k of clusters is hard to determine

▪ Strong dependency on initial partition (in result quality as well as runtime)

Page 67: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

What to do?

67

How can we tackle the initialization problem?

Page 68: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

What to do?

68

How can we tackle the initialization problem?

▪ Repeated runs

▪ Furthest-first initialization

▪ Subset Furthest-first initialization

Page 69: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Insertion: Furthest First Initialization

69

▪ Select a random point as start

▪ For each point, the minimum of its distances to the selected centers is maintained.

▪ While less than 𝑘 points selected, repeat:▪ Selected point 𝑝 with the maximum distance to the existing centers

▪ Remove p from the not-yet-selected points and add it to the center points

▪ For each remaining not-yet-selected point q, replace the distance stored for q by the minimum of its old value and the distance from p to q.

Page 70: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

70

Page 71: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

71

Page 72: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

72

Page 73: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

73

Page 74: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

74

Update the minimum distances

Page 75: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

75

Page 76: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

76

Page 77: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

77

Page 78: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Furthest First Initialization: Visualization

78

Page 79: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Learnings of this Section

79

▪ What is Clustering?

▪ Basic idea for identifying “good” partitions into k clusters

▪ Selection of representative points

▪ Iterative refinement

▪ Local optimum

▪ k-means variants [Forgy, 1965, Lloyd, 1982, MacQueen, 1967]

▪ Different initialization methods

Page 80: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Lecture ContentClustering & Feature Spaces

80

▪ Clustering

▪ Clustering in General

▪ Partitional Clustering

▪ Visualization: Algorithmic Differences

▪ Summary

▪ Feature Spaces

▪ Distances

▪ Features for Images

▪ Summary

Page 81: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Recall: Clustering as a Workflow

81

Page 82: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Similarities

82

▪ Similarity (as given by some distance measure) is a central concept in data mining, e.g.:▪ Clustering: group similar objects in the same cluster, separate dissimilar

objects to different clusters

▪ Outlier detection: identify objects that are dissimilar (by some characteristic) from most other objects

▪ Definition of a suitable distance measure is often crucial for deriving a meaningful solution in the data mining task▪ Images

▪ CAD objects

▪ Proteins

▪ Texts

▪ . . .

Page 83: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Spaces and Distance Functions

83

Page 84: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Spaces and Distance Functions

84

Page 85: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Lecture ContentClustering & Feature Spaces

85

▪ Clustering

▪ Clustering in General

▪ Partitional Clustering

▪ Visualization: Algorithmic Differences

▪ Summary

▪ Feature Spaces

▪ Distances

▪ Features for Images

▪ Summary

Page 86: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Categories of Feature Descriptors for Images

86

▪ Distribution of colors

▪ Texture

▪ Shapes (contours)

▪ Many more …

Page 87: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Color Histogram

87

▪ A histogram represents the distribution of colors over the pixels of an image

▪ Definition of an color histogram:▪ Choose a color space (RGB, HSV, HLS, . . . )

▪ Choose number of representants (sample points) in the color space

▪ Possibly normalization (to account for different image sizes)

Page 88: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

88

Page 89: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

89

Page 90: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

90

Page 91: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

91

Page 92: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

92

Page 93: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

93

Page 94: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

94

Page 95: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Impact of Number of Representants

95

Page 96: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Distances for Color Histograms

96

Page 97: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Distances for Color Histograms

97

Page 98: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Distances for Color Histograms

98

Page 99: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Lecture ContentClustering & Feature Spaces

99

▪ Clustering

▪ Clustering in General

▪ Partitional Clustering

▪ Visualization: Algorithmic Differences

▪ Summary

▪ Feature Spaces

▪ Distances

▪ Features for Images

▪ Summary

Page 100: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Your Choice of a Distance Measure

100

▪ There are hundreds of distance functions [Deza and Deza, 2009].▪ For time series: DTW, EDR, ERP, LCSS, . . .

▪ For texts: Cosine and normalizations

▪ For sets – based on intersection, union, . . . (Jaccard)

▪ For clusters (single-link, average-link, etc.)

▪ For histograms: histogram intersection, “Earth movers distance”, quadratic forms with color similarity

▪ For proteins: Edit distance, structure, …

▪ With normalization: Canberra, …

▪ Quadratic forms / bilinear forms: 𝑑(𝑥, 𝑦) ∶= 𝑥𝑇𝑀𝑦 for some positive (usually symmetric) definite matrix 𝑀.

Page 101: Introduction to Computer Science - imada.sdu.dkrolf/Edu/DM534/E18/dm534_clustering.pdf · Fall 2018 Introduction to Computer Science Learning Objectives 3 Understand the problem of

Introduction to Computer ScienceFall 2018

Learnings of this Section

101

▪ Distances (𝐿𝑝-norms, weighted, quadratic form)

▪ Color histograms as feature (vector) descriptors for images

▪ Impact of the granularity of color histograms on similarity measures