Clustering k-Means Agglomerative Clustering Use Case Summary

Machine Learning: Clustering

Steffen Rendle

Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim

Winter semester 2007/2008

Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim

- Clustering: Overview, Examples, Clustering Tasks

- k-Means: Overview, Algorithm

- Agglomerative Clustering: Overview, Algorithm

- Use Case: Task, Method

- Summary

Overview

The objective of clustering is to group similar items of a data set D.

- the groups are called clusters

- clustering is unsupervised, i.e. neither labeled training data nor classes are given in advance

- the resulting grouping depends on the clustering algorithm used

Example

[Figures on five slides; images not preserved in the transcript]

Example: Clustering of Search Results

[Figures on three slides; images not preserved in the transcript]

Example: Wafer Analysis

[Figure labels: manufacturing process 1, manufacturing process 2, ...; failure cause 1, failure cause 2; test 1..n]

Example: Object Identification

DB

Shop | Product name | Price

T-Online | Fuji FinePix S5600 | 279,00

Amazon | FujiFilm FinePix S5600 Digitalkamera (5 Megapixel, 10fach Zoom) | 254,90

Cyberport | Fuji FinePix S5600 | 259,90

Mediamarkt | Fine Pix S 5600 | 245,00

Mediamarkt | Fine Pix S 9500 | 515,00

Amazon | Fuji FinePix S5500 Digitalkamera (4 Megapixel, 10x opt. Zoom) | 349,99

Clustering Tasks

- Hard clustering: find a partition of the data

- Soft clustering / fuzzy clustering: find probabilities of group membership for each item

- Hierarchical clustering: find a dendrogram (tree) of the data
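
The three tasks differ mainly in the shape of their output. The following is a minimal sketch of how each result could be represented; the item names, the probability values and the helper `harden` are illustrative assumptions, not part of the slides.

```python
# Hard clustering: each item is mapped to exactly one cluster id (a partition).
hard = {"A": 0, "B": 0, "C": 1}

# Soft clustering: each item gets one membership probability per cluster.
# The numbers are made up for illustration; each row sums to 1.
soft = {"A": [0.9, 0.1], "B": [0.6, 0.4], "C": [0.2, 0.8]}

# Hierarchical clustering: a dendrogram, here written as nested tuples.
dendrogram = (("A", "B"), "C")


def harden(soft_clustering):
    """Derive a hard clustering by picking the most probable cluster per item."""
    return {
        item: max(range(len(probs)), key=probs.__getitem__)
        for item, probs in soft_clustering.items()
    }
```

Hardening the soft clustering above yields the same partition as `hard`, which shows that a soft result carries strictly more information than a hard one.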

Hard-Clustering

[Figures on two slides; images not preserved in the transcript]

Soft-Clustering

[Figures on two slides; images not preserved in the transcript]

Hierarchical-Clustering

[Figure, repeated on five slides: hierarchy over the items A to K with the groups AB, C D, E, FG, H, I, JK; image not preserved in the transcript]

k-Means

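- partitional clustering

- given
  - data $D = \{d_1, \ldots, d_n\} \in \mathcal{P}(\mathbb{R}^m)$ with $d_i = (x_{i,1}, \ldots, x_{i,m}) \in \mathbb{R}^m$
  - number of clusters $k$
  - similarity $\mathrm{sim} : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}^+$

- to find
  - partition of the data $f : D \to \{1, \ldots, k\}$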

k-Means Algorithm

function k-Means(D, k, sim)
    for all j ∈ {1, ..., k} do
        y_j ← random(D)
    end for
    repeat
        f′ ← f
        for all d ∈ D do
            f(d) ← argmax_{j ∈ {1,...,k}} sim(y_j, d)
        end for
        for all j ∈ {1, ..., k} do
            y_j ← avg_{d ∈ {d | f(d) = j}} d
        end for
    until f′ = f
    return f
end function
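
The pseudocode above can be sketched in plain Python as follows. Two choices here are assumptions and not taken from the slides: the similarity `euclid_sim` (a positive value that grows as the Euclidean distance shrinks) and the rule that an empty cluster keeps its previous centroid. The `seed` parameter is added only to make runs repeatable.

```python
import random


def euclid_sim(a, b):
    """Assumed similarity in (0, 1]: larger means closer in Euclidean distance."""
    return 1.0 / (1.0 + sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(data, k, sim=euclid_sim, seed=0):
    """Return a list f where f[i] is the cluster index of data[i]."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)           # y_j <- random item of D
    f = None
    while True:
        previous = f                          # f' <- f
        # f(d) <- argmax_j sim(y_j, d)
        f = [max(range(k), key=lambda j: sim(centroids[j], d)) for d in data]
        for j in range(k):
            members = [d for d, label in zip(data, f) if label == j]
            if members:                       # assumption: empty cluster keeps old centroid
                centroids[j] = centroid(members)
        if f == previous:                     # until f' = f
            return f


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = k_means(points, k=2)
```

On this toy data the two well-separated groups end up in different clusters; which group receives index 0 depends on the random initial centroids.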

Example

[Figures on 20 slides; images not preserved in the transcript]

Problems of k-Means I

[Figures on five slides; images not preserved in the transcript]

Problems of k-Means II

[Figures on five slides; images not preserved in the transcript]

Problems of k-Means

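- k-Means is run several times with different random initial centroids, and the "best" result is returned.

- To determine the "best" partition, heuristic measures such as the intra-cluster variance can be used:

$$\mathrm{ICV}(f, D) = \sum_{j=1}^{k} \; \sum_{d \in \{d \mid f(d) = j\}} \left\| d - \operatorname*{avg}_{d' \in \{d' \mid f(d') = j\}} d' \right\|^2$$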
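
As a small sketch of how this measure could be used to pick among several runs, the function below computes the intra-cluster variance of a partition given as a list of cluster indices. The data points and the two candidate partitions are made up for illustration; in practice the candidates would come from repeated k-Means runs.

```python
def icv(f, data):
    """Intra-cluster variance: squared distance of each item to its cluster mean, summed."""
    clusters = {}
    for d, j in zip(data, f):
        clusters.setdefault(j, []).append(d)
    total = 0.0
    for members in clusters.values():
        mean = [sum(coords) / len(members) for coords in zip(*members)]
        total += sum(sum((x - m) ** 2 for x, m in zip(d, mean)) for d in members)
    return total


# Hypothetical data and two hypothetical candidate partitions.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
candidates = [[0, 0, 1, 1], [0, 1, 0, 1]]

# The partition with the smallest intra-cluster variance is considered "best".
best = min(candidates, key=lambda f: icv(f, data))
```

Here the first candidate groups nearby points and has an ICV of 4, while the second pairs distant points and has an ICV of 100, so the first is selected.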

Properties of k-Means

- easy to implement

- often fast in practice

- data must lie in a vector space with a norm (e.g. Euclidean space $\mathbb{R}^n$ with $\|\cdot\|$) so that centroids can be calculated
  - counterexample: strings, for which no average is defined

Agglomerative Clustering

Agglomerative clustering can solve several tasks:

- partitional clustering with a given number of clusters $k$ or a similarity threshold $\theta$

- hierarchical clustering

Greedy Agglomerative Clustering

- partitional clustering

- given
  - data $D = \{d_1, \ldots, d_n\}$
  - similarity $\mathrm{sim} : D \times D \to \mathbb{R}^+$
  - number of clusters $k$ or threshold $\theta$ on similarities

- to find
  - partition of the data $f : D \to \mathbb{N}$

Hierarchical Agglomerative Clustering

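- hierarchical clustering

- given
  - data $D = \{d_1, \ldots, d_n\}$
  - similarity $\mathrm{sim} : D \times D \to \mathbb{R}^+$

- to find
  - series $f_i$ of partitions of the data, $f_i : D \to \mathbb{N}$, with $\operatorname{img} f_{i+1} \subset \operatorname{img} f_i$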

Agglomerative Clustering Algorithm

function AgglomerativeClustering(D, sim)
    m ← 0
    for all i ∈ {1, ..., n} do
        f_m(d_i) ← i
    end for
    repeat
        (i, j) ← argmax_{i,j ∈ img(f_m), i ≠ j} sim_*(f_m, i, j)
        f_{m+1} ← f_m
        for all d ∈ {d′ | f_m(d′) = j} do
            f_{m+1}(d) ← i
        end for
        m ← m + 1
    until convergence(f_m)
    return f_m
end function
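
A minimal Python sketch of this loop is given below. Average linkage is assumed as the default cluster similarity, the stopping rule is passed in as a function, and the toy similarities are invented for illustration. The early return when only one cluster is left is an added safeguard that the pseudocode does not spell out.

```python
from itertools import combinations


def average_linkage(sim, members_i, members_j):
    """Average of sim(d, d') over all pairs with d in cluster i and d' in cluster j."""
    values = [sim(d, e) for d in members_i for e in members_j]
    return sum(values) / len(values)


def agglomerative_clustering(data, sim, converged, linkage=average_linkage):
    """Merge the two most similar clusters until converged(f) holds; f maps item -> cluster id."""
    f = {d: i for i, d in enumerate(data)}            # f_0(d_i) = i
    while True:
        clusters = {}
        for d, c in f.items():
            clusters.setdefault(c, []).append(d)
        if len(clusters) < 2:                          # safeguard: nothing left to merge
            return f
        i, j = max(
            combinations(sorted(clusters), 2),
            key=lambda pair: linkage(sim, clusters[pair[0]], clusters[pair[1]]),
        )
        for d in clusters[j]:                          # move all items of cluster j into i
            f[d] = i
        if converged(f):
            return f


# Hypothetical similarities; every pair not listed gets 0.1.
S = {frozenset("ab"): 0.9, frozenset("cd"): 0.8, frozenset("de"): 0.7, frozenset("ce"): 0.6}


def toy_sim(x, y):
    return S.get(frozenset((x, y)), 0.1)


result = agglomerative_clustering("abcde", toy_sim, lambda f: len(set(f.values())) <= 2)
```

With the stopping rule "at most two clusters", the toy data ends up as the two groups {a, b} and {c, d, e}.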

Convergence

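convergence(f) depends on the task:

- if $k$ is given: $\mathrm{convergence}(f) \Leftrightarrow |\operatorname{img}(f)| \le k$

- if $\theta$ is given: $\mathrm{convergence}(f) \Leftrightarrow \max_{i,j \in \operatorname{img}(f),\, i \ne j} \mathrm{sim}_{*}(f, i, j) \le \theta$

- in case of hierarchical clustering: $\mathrm{convergence}(f) \Leftrightarrow |\operatorname{img}(f)| = 1$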
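
The three stopping rules could be written as small predicates, sketched here under the assumption that a partition `f` is a dict from item to cluster id and that `cluster_sim(f, i, j)` is whichever cluster similarity is in use. All function names are illustrative.

```python
def num_clusters(f):
    """|img(f)|: the number of distinct cluster ids in use."""
    return len(set(f.values()))


def converged_k(f, k):
    """Stop once at most k clusters remain."""
    return num_clusters(f) <= k


def converged_theta(f, theta, cluster_sim):
    """Stop once no pair of distinct clusters is more similar than theta."""
    ids = sorted(set(f.values()))
    best = max(
        (cluster_sim(f, i, j) for pos, i in enumerate(ids) for j in ids[pos + 1:]),
        default=float("-inf"),  # a single cluster has no pairs left to compare
    )
    return best <= theta


def converged_hierarchical(f):
    """Stop only when everything has been merged into one cluster."""
    return num_clusters(f) == 1
```

Each predicate can be wrapped in a one-argument function (for example `lambda f: converged_k(f, 3)`) and handed to a merging loop as its stopping rule.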

Similarity between Clusters

[Figure: two clusters over the items A, B, C, D, E with pairwise similarities 0.9, 0.82, 0.6, 0.7, 0.2 and 0.63 between them; the similarity between the two clusters is marked "?"; image not preserved in the transcript]

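There are several ways to define the similarity $\mathrm{sim}_{*}(f, i, j)$ between two clusters $i$ and $j$:

- single linkage: $\mathrm{sim}_{SL}(f, i, j) = \max_{(d, d') \in f^{-1}(i) \times f^{-1}(j)} \mathrm{sim}(d, d')$

- complete linkage: $\mathrm{sim}_{CL}(f, i, j) = \min_{(d, d') \in f^{-1}(i) \times f^{-1}(j)} \mathrm{sim}(d, d')$

- average linkage: $\mathrm{sim}_{AL}(f, i, j) = \operatorname*{avg}_{(d, d') \in f^{-1}(i) \times f^{-1}(j)} \mathrm{sim}(d, d')$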
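
The three definitions can be sketched in Python as below. The six similarity values are the ones shown in the slides' figure, but the transcript does not say which pair of items each value belongs to or how the items are split into clusters, so the split {A, B} versus {C, D, E} and the pairing in `S` are assumptions made only for this example.

```python
def cross_pairs(f, i, j):
    """All pairs (d, d') with d in cluster i and d' in cluster j."""
    return [(d, e) for d in f if f[d] == i for e in f if f[e] == j]


def sim_sl(sim, f, i, j):
    """Single linkage: similarity of the most similar cross-cluster pair."""
    return max(sim(d, e) for d, e in cross_pairs(f, i, j))


def sim_cl(sim, f, i, j):
    """Complete linkage: similarity of the least similar cross-cluster pair."""
    return min(sim(d, e) for d, e in cross_pairs(f, i, j))


def sim_al(sim, f, i, j):
    """Average linkage: mean similarity over all cross-cluster pairs."""
    values = [sim(d, e) for d, e in cross_pairs(f, i, j)]
    return sum(values) / len(values)


# Assumed partition and assumed assignment of the six values to item pairs.
f = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 2}
S = {
    ("A", "C"): 0.9, ("A", "D"): 0.82, ("A", "E"): 0.6,
    ("B", "C"): 0.7, ("B", "D"): 0.2, ("B", "E"): 0.63,
}


def sim(d, e):
    return S[(d, e)]


single = sim_sl(sim, f, 1, 2)      # 0.9
complete = sim_cl(sim, f, 1, 2)    # 0.2
average = sim_al(sim, f, 1, 2)     # about 0.64
```

Whatever the actual pairing in the figure, the three results depend only on the set of six values, so they match the 0.9, 0.2 and 0.64 given on the linkage slides.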

Single Linkage

[Figure: for the two clusters over A, B, C, D, E with pairwise similarities 0.9, 0.82, 0.6, 0.7, 0.2 and 0.63, single linkage gives a cluster similarity of 0.9; image not preserved in the transcript]

Complete Linkage

[Figure: for the two clusters over A, B, C, D, E with pairwise similarities 0.9, 0.82, 0.6, 0.7, 0.2 and 0.63, complete linkage gives a cluster similarity of 0.2; image not preserved in the transcript]

Average Linkage

[Figure: for the two clusters over A, B, C, D, E with pairwise similarities 0.9, 0.82, 0.6, 0.7, 0.2 and 0.63, average linkage gives a cluster similarity of 0.64; image not preserved in the transcript]

Example: Agglomerative Clustering with Average Linkage

[Figure: the items A, B, ..., K; image not preserved in the transcript]

Pairwise similarities between the items (lower triangle):

         A     B     C     D     E     F     G     H     I     J     K
A
B       .8
C       .8    .9
D       .5    .9    .8
E       .2    .2    .3    .2
F       .1    .1    .2    .1    .9
G       .1    .2    .3    .2    .9    .8
H       .2    .2    .2    .3    .1    .0    .2
I       .2    .2    .2    .3    .2    .1    .3    .9
J       .0    .1    .1    .2    .2    .1    .3    .8    .9
K       .0    .1    .1    .2    .1    .0    .3    .8    .9    .9

Similarities after merging B and C:

         A    BC     D     E     F     G     H     I     J     K
A
BC      .8
D       .5   .85
E       .2   .25    .2
F       .1   .15    .1    .9
G       .1   .25    .2    .9    .8
H       .2   .20    .3    .1    .0    .2
I       .2   .20    .3    .2    .1    .3    .9
J       .0   .10    .2    .2    .1    .3    .8    .9
K       .0   .10    .2    .1    .0    .3    .8    .9    .9

Similarities after merging J and K:

         A    BC     D     E     F     G     H     I    JK
A
BC      .8
D       .5   .85
E       .2   .25    .2
F       .1   .15    .1    .9
G       .1   .25    .2    .9    .8
H       .2   .20    .3    .1    .0    .2
I       .2   .20    .3    .2    .1    .3    .9
JK      .0   .10    .2   .15   .05    .3    .8    .9

Example: Agglomerative Clustering with Average Linkage

[Figure: the objects A to K and the dendrogram over the leaves A B C D E F G H I J K]

Similarity matrix (lower triangle):

     A    BC   D    E    F    G    HI   JK
A
BC   .8
D    .5   .85
E    .2   .25  .2
F    .1   .15  .1   .9
G    .1   .25  .2   .9   .8
HI   .2   .20  .3   .15  .05  .25
JK   .0   .10  .2   .15  .05  .3   .85

Clusters: A BC D E F G HI JK

Example: Agglomerative Clustering with Average Linkage

[Figure: the objects A to K and the dendrogram over the leaves A B C D E F G H I J K]

Similarity matrix (lower triangle):

     A    BC   D    EF   G    HI   JK
A
BC   .8
D    .5   .85
EF   .15  .20  .15
G    .1   .25  .2   .85
HI   .2   .20  .3   .10  .25
JK   .0   .10  .2   .10  .3   .85

Clusters: A BC D EF G HI JK

Example: Agglomerative Clustering with Average Linkage

[Figure: the objects A to K and the dendrogram over the leaves A B C D E F G H I J K]

Similarity matrix (lower triangle):

     A    BCD  EF   G    HI   JK
A
BCD  .7
EF   .15  .18
G    .1   .23  .85
HI   .2   .23  .10  .25
JK   .0   .13  .10  .3   .85

Clusters: A BCD EF G HI JK
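When a merged cluster grows beyond two objects, the averages must be weighted by cluster size so that every cross-cluster pair counts once. The update rule below is a sketch of that reasoning (the symbols C_1, C_2, C_3 are introduced here); the worked values use the BC and D rows of the previous matrix and agree with the BCD row shown.

```latex
\begin{align*}
\mathrm{sim}(C_1 \cup C_2, C_3)
  &= \frac{|C_1|\,\mathrm{sim}(C_1, C_3) + |C_2|\,\mathrm{sim}(C_2, C_3)}{|C_1| + |C_2|} \\
\mathrm{sim}(BCD, A) &= \frac{2 \cdot 0.8 + 1 \cdot 0.5}{3} = 0.7 \\
\mathrm{sim}(BCD, G) &= \frac{2 \cdot 0.25 + 1 \cdot 0.2}{3} \approx 0.23
\end{align*}
```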

Example: Agglomerative Clustering with Average Linkage

[Figure: the objects A to K and the dendrogram over the leaves A B C D E F G H I J K]

Similarity matrix (lower triangle):

     A    BCD  EFG  HI   JK
A
BCD  .7
EFG  .13  .20
HI   .2   .23  .15
JK   .0   .13  .16  .85

Clusters: A BCD EFG HI JK

Example: Agglomerative Clustering with Average Linkage

[Figure: the objects A to K and the dendrogram over the leaves A B C D E F G H I J K]

Similarity matrix (lower triangle):

      A    BCD  EFG  HIJK
A
BCD   .7
EFG   .13  .20
HIJK  .1   .18  .16

Clusters: A BCD EFG HIJK

Example: Agglomerative Clustering with Average Linkage

[Figure: the objects A to K and the dendrogram over the leaves A B C D E F G H I J K]

Similarity matrix (lower triangle):

      ABCD  EFG   HIJK
ABCD
EFG   .18
HIJK  .16   .16

Clusters: ABCD EFG HIJK

Example: Agglomerative Clustering with Average Linkage

[Figure: the objects A to K and the dendrogram over the leaves A B C D E F G H I J K]

Similarity matrix (lower triangle):

         ABCDEFG  HIJK
ABCDEFG
HIJK     .16

Clusters: ABCDEFG HIJK

Example: Agglomerative Clustering with Average Linkage

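[Figure: the objects A to K and the complete dendrogram over the leaves A B C D E F G H I J K]

Similarity matrix (lower triangle):

             ABCDEFGHIJK
ABCDEFGHIJK

Clusters: ABCDEFGHIJK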

Properties of Agglomerative Clustering

- Several tasks can be solved: partitional clustering (with a given number of clusters or a similarity threshold) and hierarchical clustering.

- No metric space is necessary.

- Runtime complexity: $O(n^2 \log n)$.
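The three tasks can be sketched with one small routine run on the similarity matrix of the preceding example. This is a minimal illustration, not the lecture's implementation: the function names, the `k` and `threshold` parameters, and the exact all-pairs definition of average linkage are assumptions, and the naive search over all cluster pairs is slower than the $O(n^2 \log n)$ bound, which would need a priority queue.

```python
from itertools import combinations

LABELS = "ABCDEFGHIJK"
# Lower triangle of the example's similarity matrix, one row per object.
ROWS = [
    [],
    [.8],
    [.8, .9],
    [.5, .9, .8],
    [.2, .2, .3, .2],
    [.1, .1, .2, .1, .9],
    [.1, .2, .3, .2, .9, .8],
    [.2, .2, .2, .3, .1, .0, .2],
    [.2, .2, .2, .3, .2, .1, .3, .9],
    [.0, .1, .1, .2, .2, .1, .3, .8, .9],
    [.0, .1, .1, .2, .1, .0, .3, .8, .9, .9],
]
SIM = {
    frozenset((LABELS[i], LABELS[j])): s
    for i, row in enumerate(ROWS)
    for j, s in enumerate(row)
}


def average_linkage(c1, c2):
    """Mean pairwise similarity between two disjoint clusters."""
    total = sum(SIM[frozenset((x, y))] for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def agglomerate(k=1, threshold=0.0):
    """Merge the most similar pair of clusters until k clusters remain or
    the best similarity falls below threshold.

    Returns the final clusters and the merge history (the hierarchy)."""
    clusters = [frozenset(x) for x in LABELS]
    history = []
    while len(clusters) > k:
        a, b = max(combinations(clusters, 2), key=lambda p: average_linkage(*p))
        best = average_linkage(a, b)
        if best < threshold:
            break
        history.append(("".join(sorted(a)), "".join(sorted(b)), round(best, 2)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters, history


def names(clusters):
    return sorted("".join(sorted(c)) for c in clusters)


by_count, _ = agglomerate(k=3)                   # partitional, fixed number of clusters
by_threshold, _ = agglomerate(threshold=0.75)    # partitional, similarity threshold
_, hierarchy = agglomerate()                     # hierarchical: all merges in order

print(names(by_count))      # ['ABCD', 'EFG', 'HIJK']
print(names(by_threshold))  # ['A', 'BCD', 'EFG', 'HIJK']
```

The order in which tied pairs (similarity .9) are merged depends on the iteration order, so intermediate steps may differ from the slides while the partitions above stay the same.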

Use Case: Object Identification

- Object Identification (OI) finds identical items for information integration.

- OI tasks are semi-supervised.

- OI models use both clustering and classification techniques.

DB

Shop        Product name                                                      Price
T-Online    Fuji FinePix S5600                                                279,00
Amazon      FujiFilm FinePix S5600 Digitalkamera (5 Megapixel, 10fach Zoom)   254,90
Cyberport   Fuji FinePix S5600                                                259,90
Mediamarkt  Fine Pix S 5600                                                   245,00

Mediamarkt  Fine Pix S 9500                                                   515,00

Amazon      Fuji FinePix S5500 Digitalkamera (4 Megapixel, 10x opt. Zoom)     349,99

Object Identification Problem

[Figure: two panels, "Problem" and "Solution", each showing the objects A to I]

Adaptive Setting

[Figure: three panels, "Training Set", "Problem" and "Solution", showing the objects A to I and J to R; legend: labels L1, L2, L3]

Types of Labels

Often some parts of the data provide information about identities:

- Some offers are labeled by a unique identifier, e.g. an EAN, UPC, or ISBN.

- New offers should be merged into an already integrated database, e.g. new products or new shops should be integrated.

- Some offers are known to be identical or different, e.g. as provided by a supervisor.

- N databases should be merged, and each database contains no duplicates.

Iterative Problem $C_{\mathrm{iter}}$

[Figure: two panels, "Iterative Problem" and "A Consistent Solution", each showing the objects A to I; legend: labels L1, L2, L3 and "Unknown class label"]

Constrained Problem $C_{\mathrm{constr}}$

[Figure: two panels, "Constrained Problem" and "A Consistent Solution", each showing the objects A to I; legend: "Must-Link Constraint" and "Cannot-Link Constraint"]

Problem Classes

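Problem classes are defined by their preconditions, which restrict the space $\mathcal{E}$ of consistent solutions $E \subseteq X^2$:

- Iterative problems $C_{\mathrm{iter}}$
  given: $E_Y$ with $Y \subseteq X$
  $\mathcal{E} = \{E \mid E_Y = E \cap Y^2\}$

- Constrained problems $C_{\mathrm{constr}}$
  given: $R_{\mathrm{ml}} \subseteq X^2$, $R_{\mathrm{cl}} \subseteq X^2$
  $\mathcal{E} = \{E \mid E \supseteq R_{\mathrm{ml}} \wedge E \cap R_{\mathrm{cl}} = \emptyset\}$

- Matching problems $C_{\mathrm{match}}$
  given: $X = \bigcup_i A_i$ with $A = (A_1, \ldots, A_n)$
  $\mathcal{E} = \{E \mid E \cap (X^2 \setminus (\bigcup_i A_i^2 \setminus \{(x, x) \mid x \in A_i\})) = \emptyset\}$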
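The condition for constrained problems can be checked mechanically: a candidate solution is consistent if it contains every must-link pair and no cannot-link pair. The sketch below assumes that a solution is stored as a mapping from objects to cluster ids, which induces the equivalence relation; the function name and the toy objects are illustrative and not taken from the lecture.

```python
def is_consistent(assignment, must_link, cannot_link):
    """Check a candidate solution against must-link and cannot-link pairs.

    assignment maps each object to a cluster id; two objects are treated
    as identical exactly when they share a cluster id."""
    def together(x, y):
        return assignment[x] == assignment[y]

    contains_all_must_links = all(together(x, y) for x, y in must_link)
    violates_cannot_link = any(together(x, y) for x, y in cannot_link)
    return contains_all_must_links and not violates_cannot_link


# Toy solution: A and B form one cluster, C is on its own.
solution = {"A": 1, "B": 1, "C": 2}
print(is_consistent(solution, {("A", "B")}, {("A", "C")}))  # True
print(is_consistent(solution, {("A", "C")}, set()))         # False
```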

Hierarchy of Problem Classes

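One can show:

$C_{\mathrm{classic}} \subset C_{\mathrm{iter}} \subset C_{\mathrm{constr}}$

$C_{\mathrm{classic}} \subset C_{\mathrm{match}} \subset C_{\mathrm{constr}}$

$C_{\mathrm{iter}} \not\subseteq C_{\mathrm{match}}$

$C_{\mathrm{match}} \not\subseteq C_{\mathrm{iter}}$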
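One direction of the first inclusion can be made concrete: known labels on a subset of the objects translate into pairwise constraints, since two labeled objects must be linked if their labels agree and must not be linked otherwise. The following is a sketch of that translation only; the function name, the object names and the label values are hypothetical.

```python
from itertools import combinations


def constraints_from_labels(labels):
    """Turn known class labels on a subset of objects into pairwise constraints.

    labels maps each labeled object to its class; unlabeled objects are
    simply absent and stay unconstrained."""
    must_link, cannot_link = set(), set()
    for x, y in combinations(sorted(labels), 2):
        if labels[x] == labels[y]:
            must_link.add((x, y))
        else:
            cannot_link.add((x, y))
    return must_link, cannot_link


ml, cl = constraints_from_labels({"A": "L1", "B": "L1", "C": "L2"})
print(sorted(ml))  # [('A', 'B')]
print(sorted(cl))  # [('A', 'C'), ('B', 'C')]
```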

There are constrained problems that cannot be expressed as an iterative problem:

[Figure: the objects A, B, G, H in three panels, two titled "Iterative Problem" and one titled "Constrained Problem"; legend: "Must-Link Constraint" and labels L1, L2]

Generic Object Identification Model

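- Feature extraction: $f_{\mathrm{feature}} : X^2 \to \mathbb{R}^n$

- Probabilistic pairwise decision model: $f_{\mathrm{pairwise}} : X^2 \to [0, 1]$

- Collective decision model: $f_{\mathrm{global}} : \mathcal{P}(X) \times \mathcal{P}(X^2) \times \mathcal{P}(X^2) \to \mathcal{E}$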

Data

Object  Brand            Product Name                       Price
x1      Hewlett Packard  Photosmart 435 Digital Camera      118.99
x2      HP               HP Photosmart 435 16MB memory      110.00
x3      Canon            Canon EOS 300D black 18-55 Camera  786.00

Feature Extraction

Object Pair  TFIDF-Cosine Similarity  FirstNumberEqual  Rel. Difference
             (Product Name)           (Product Name)    (Price)
(x1, x2)     0.6                      1                 0.076
(x1, x3)     0.1                      0                 0.849
(x2, x3)     0.0                      0                 0.860

Probabilistic Pairwise Decision Model

Object Pair  P[x_i ≡ x_j]
(x1, x2)     0.8
(x1, x3)     0.2
(x2, x3)     0.1
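Two of the three features in the table can be recomputed from the data. The sketch below is an assumption about how they might be defined: FirstNumberEqual compares the first run of digits in each product name, and the relative price difference is taken as |a - b| / max(a, b), a formula inferred here because it reproduces the table's values. The TFIDF cosine similarity is left out since it depends on corpus statistics not given on the slide.

```python
import re

# Product names and prices copied from the data table.
OFFERS = {
    "x1": ("Photosmart 435 Digital Camera", 118.99),
    "x2": ("HP Photosmart 435 16MB memory", 110.00),
    "x3": ("Canon EOS 300D black 18-55 Camera", 786.00),
}


def first_number(name):
    """Return the first run of digits in name, or None if there is none."""
    match = re.search(r"\d+", name)
    return match.group() if match else None


def first_number_equal(name_a, name_b):
    a, b = first_number(name_a), first_number(name_b)
    return int(a is not None and a == b)


def relative_difference(price_a, price_b):
    # Assumed definition: |a - b| / max(a, b).
    return abs(price_a - price_b) / max(price_a, price_b)


def features(id_a, id_b):
    (name_a, price_a), (name_b, price_b) = OFFERS[id_a], OFFERS[id_b]
    return (
        first_number_equal(name_a, name_b),
        round(relative_difference(price_a, price_b), 3),
    )


for pair in [("x1", "x2"), ("x1", "x3"), ("x2", "x3")]:
    print(pair, features(*pair))
```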

Learning and Constraints

Information provided by constraints can be used for training an identification model:

- Probabilistic pairwise decision model: a trained classifier (e.g. an SVM).

- Collective decision model: a constrained clustering algorithm (e.g. constrained HAC) that uses the pairwise decision model as a learned similarity measure.
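One plausible way to obtain training data for the pairwise classifier is to read must-link pairs as positive examples and cannot-link pairs as negative ones. The sketch below only builds such a labeled set; the function names, the single price feature and the example constraints are hypothetical, and the choice of classifier is left open.

```python
def training_pairs(must_link, cannot_link, feature_fn):
    """Build (feature vector, label) examples from pairwise constraints.

    Must-link pairs get label 1 (identical), cannot-link pairs label 0."""
    positives = [(feature_fn(x, y), 1) for x, y in must_link]
    negatives = [(feature_fn(x, y), 0) for x, y in cannot_link]
    return positives + negatives


# Prices of three example offers; the constraints below are made up.
PRICES = {"x1": 118.99, "x2": 110.00, "x3": 786.00}


def price_feature(x, y):
    a, b = PRICES[x], PRICES[y]
    return [round(abs(a - b) / max(a, b), 3)]


examples = training_pairs([("x1", "x2")], [("x1", "x3")], price_feature)
print(examples)  # [([0.076], 1), ([0.849], 0)]
```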

Constrained Agglomerative Clustering Algorithm

function ConstrainedAgglClustering(X, R_ml, R_cl, sim)
    m ← 0
    for all i ∈ {1, ..., n} do
        f_m(x_i) ← i
    end for
    f_m ← ApplyMustLink(f_m, R_ml)
    repeat
        (i, j) ← argmax of sim*(f_m, i, j) over all i, j ∈ img(f_m) with i ≠ j and not HasCannotLink(f_m, i, j, R_cl)
        f_{m+1} ← f_m
        for all x ∈ {y | f_m(y) = j} do
            f_{m+1}(x) ← i
        end for
        m ← m + 1
    until convergence(f_m)
    return f_m
end function

Constrained Agglomerative Clustering Algorithm

function ApplyMustLink(f, R_ml)
    for all (x, y) ∈ R_ml do
        for all x′ : f(x′) = f(x) do
            f(x′) ← f(y)
        end for
    end for
    return f
end function

function HasCannotLink(f, i, j, R_cl)
    return ∃ x ∈ f⁻¹(i), y ∈ f⁻¹(j) : (x, y) ∈ R_cl
end function
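The pseudocode can be turned into a short runnable sketch. Several details are assumptions not fixed by the slides: average linkage stands in for the cluster-level similarity, convergence means that no admissible pair reaches a `threshold` parameter, and the cluster assignment is a dictionary updated in place. The pairwise probabilities in the usage example are those of the three example offers x1, x2, x3.

```python
from itertools import combinations


def apply_must_link(f, must_link):
    """Merge the clusters of every must-link pair (x, y)."""
    for x, y in must_link:
        old, new = f[x], f[y]
        for z in f:
            if f[z] == old:
                f[z] = new
    return f


def has_cannot_link(f, i, j, cannot_link):
    """True if a cannot-link pair has one object in cluster i and one in j."""
    return any({f[x], f[y]} == {i, j} for x, y in cannot_link)


def constrained_hac(objects, must_link, cannot_link, sim, threshold=0.5):
    f = {x: idx for idx, x in enumerate(objects)}
    f = apply_must_link(f, must_link)

    def cluster_sim(i, j):
        # Assumption: average linkage as the cluster-level similarity.
        a = [x for x in objects if f[x] == i]
        b = [x for x in objects if f[x] == j]
        return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

    while True:
        candidates = [
            (cluster_sim(i, j), i, j)
            for i, j in combinations(sorted(set(f.values())), 2)
            if not has_cannot_link(f, i, j, cannot_link)
        ]
        if not candidates:
            break
        best, i, j = max(candidates)
        if best < threshold:  # assumed convergence criterion
            break
        for x in objects:     # relabel cluster j as cluster i
            if f[x] == j:
                f[x] = i
    return f


P = {
    frozenset(("x1", "x2")): 0.8,
    frozenset(("x1", "x3")): 0.2,
    frozenset(("x2", "x3")): 0.1,
}


def pairwise(x, y):
    return P[frozenset((x, y))]


objects = ["x1", "x2", "x3"]
free = constrained_hac(objects, [], [], pairwise)
blocked = constrained_hac(objects, [], [("x1", "x2")], pairwise)
forced = constrained_hac(objects, [("x2", "x3")], [], pairwise, threshold=0.6)
```

Without constraints x1 and x2 end up in one cluster; the cannot-link constraint keeps all three objects apart, and the must-link constraint joins x2 and x3 regardless of their low similarity.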

Summary

- Clustering groups data.

- The groups depend on the similarity measure and the clustering method.

- Clustering is an unsupervised task.

- Semi-supervised clustering can use labels (e.g. on relations) to learn the similarity measure and to enhance the clustering.

Outlook

- Fuzzy / soft clustering, e.g. Fuzzy C-Means
    - cluster membership is a probability distribution

- Spectral clustering
    - similarity matrix $S_{ij} := \mathrm{sim}(d_i, d_j)$
    - use spectral methods on $S_{ij}$, e.g. eigenvectors, to compute clusters

- Constrained / semi-supervised clustering
    - constraints on objects, pairs, etc. are present
    - example: object identification

