ICS 278: Data Mining, Lecture 9,10: Clustering Algorithms
Padhraic Smyth, Department of Information and Computer Science, University of California, Irvine


Page 1:

ICS 278: Data Mining
Lecture 9,10: Clustering Algorithms

Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine

Page 2:

Project Progress Report

• Written Progress Report:
– Due Tuesday May 18th in class
– Expect at least 3 pages (should be typed, not handwritten)
– Hand in the written document in class on Tuesday May 18th

• 1 PowerPoint slide:
– 1 slide that describes your project
– Should contain:
• Your name (top right corner)
• Clear description of the main task
• Some visual graphic of data relevant to your task
• 1 or 2 bullets on what methods you plan to use
• Preliminary results or results of exploratory data analysis
• Make it graphical (use text sparingly)
– Submit by 12 noon Monday May 17th

Page 3:

List of Sections for your Progress Report

• Clear description of task (reuse original proposal if needed)
– Basic task + extended "bonus" tasks (if time allows)

• Discussion of relevant literature
– Discuss prior published/related work (if it exists)

• Preliminary data evaluation
– Exploratory data analysis relevant to your task
– Include as many plots/graphs as you think are useful/relevant

• Preliminary algorithm work
– Summary of your progress on algorithm implementation so far
• If you are not at this point yet, say so
– Relevant information about other code/algorithms you have downloaded, any preliminary testing you have done on them, etc.
– Difficulties encountered so far

• Plans for the remainder of the quarter
– Algorithm implementation
– Experimental methods
– Evaluation, validation

• Approximately ½ page to 1 page of text per section (graphs/plots don't count; include as many of these as you like).

Page 4:

Clustering

• "Automated detection of group structure in data"
– Typically: partition N data points into K groups (clusters) such that the points in each group are more similar to each other than to points in other groups
– A descriptive technique (contrast with predictive techniques)
– For real-valued vectors, clusters can be thought of as clouds of points in p-dimensional space

Page 5:

Clustering

Sometimes easy

Sometimes impossible

and sometimes in between

Page 6:

Why is Clustering useful?

• "Discovery" of new knowledge from data
– Contrast with supervised classification (where labels are known)
– Long history in the sciences of categories, taxonomies, etc.
– Can be very useful for summarizing large data sets
• For large n and/or high dimensionality

• Applications of clustering
– Discovery of new types of galaxies in astronomical data
– Clustering of genes with similar expression profiles
– Clustering pixels in an image into regions of similar intensity
– Segmentation of customers for an e-commerce store
– Clustering of documents produced by a search engine
– … many more

Page 7:

General Issues in Clustering

• Representation:
– What types of clusters are we looking for?

• Score:
– The criterion used to compare one clustering to another

• Optimization:
– Generally, finding the optimal clustering is NP-hard
• Greedy algorithms to optimize the score are widely used

• Other issues
– The distance function D(x(i), x(j)) is a critical aspect of clustering (see the sketch after this list), both for
• distances between pairs of objects
• distances of objects from clusters
– How is K selected?
– Different types of data
• Real-valued versus categorical
• Attribute-valued vectors vs. an n × n distance matrix
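To make the distance-function point concrete, here is a minimal sketch (not from the lecture, using synthetic data and SciPy) of building an n × n pairwise distance matrix for real-valued feature vectors:

```python
# Minimal sketch (assumed synthetic data): build an n x n pairwise distance
# matrix D[i, j] = d(x(i), x(j)) for real-valued feature vectors.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # n = 100 points in p = 5 dimensions
D = squareform(pdist(X, metric="euclidean"))  # condensed distances -> n x n matrix
print(D.shape)                                # (100, 100)
```

A matrix of this form is exactly the "distance data" that the hierarchical and graph-based methods later in the lecture take as input.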

Page 8:

General Families of Clustering Algorithms

• partition-based clustering
– e.g., K-means

• probabilistic model-based clustering
– e.g., mixture models

[both of the above work with measurement data, e.g., feature vectors]

• hierarchical clustering
– e.g., hierarchical agglomerative clustering

• graph-based clustering
– e.g., min-cut algorithms

[both of the above work with distance data, e.g., a distance matrix]

Page 9:

Partition-Based Clustering

• given: n data points X = {x(1) … x(n)}

• output: k partitions C = {C1 … Ck} such that
– each x(i) is assigned to a unique Cj (hard assignment)
– C implicitly represents a mapping from X to C

• Optimization algorithm
– require that Score[C, X] is optimized
• e.g., sum-of-squares of within-cluster distances
– exhaustive search is intractable
– combinatorial optimization: assign n objects to k classes
– large search space: possible assignment choices ~ k^n
• so, use a greedy iterative method
• will be subject to local optima

Page 10:

Score Function for Partition-Based Clustering

• want compact clusters
– minimize within-cluster distances wc(C)

• want different clusters far apart
– maximize between-cluster distances bc(C)

• given a cluster partitioning C, find the centers c1 … ck
– e.g., for vectors, use the centroid of the points in cluster Ck:
• ck = (1/nk) Σ_{x ∈ Ck} x
– wc(C) = sum-of-squares within-cluster distance:
• wc(C) = Σ_{k=1…K} wc(Ck) = Σ_{k=1…K} Σ_{x ∈ Ck} d(x, ck)²
– bc(C) = distance between clusters:
• bc(C) = Σ_{i,j=1…K} d(ci, cj)²

• Score[C, X] = f[wc(C), bc(C)] (a small computational sketch of wc and bc follows)
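As an illustration (not part of the slides), a small NumPy sketch of the wc(C) and bc(C) scores defined above, assuming real-valued vectors and hard cluster labels:

```python
import numpy as np

def cluster_scores(X, labels):
    """Within-cluster (wc) and between-cluster (bc) sum-of-squares scores."""
    ks = np.unique(labels)
    centers = np.array([X[labels == k].mean(axis=0) for k in ks])  # centroids c_k
    # wc(C): sum over clusters of squared distances of points to their centroid
    wc = sum(((X[labels == k] - centers[i]) ** 2).sum() for i, k in enumerate(ks))
    # bc(C): sum of squared distances between all pairs of cluster centers
    diffs = centers[:, None, :] - centers[None, :, :]
    bc = (diffs ** 2).sum() / 2.0        # divide by 2 so each pair counts once
    return wc, bc
```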

Page 11:

K-means Clustering

• basic idea:
– Score = wc(C) = sum-of-squares within-cluster distance
– start with randomly chosen cluster centers c1 … ck
– repeat until no cluster memberships change:
• assign each point x to the cluster with the nearest center
– i.e., find the smallest d(x, ci) over all c1 … ck
• recompute each cluster center over the data assigned to it
– ci = (1/ni) Σ_{x ∈ Ci} x

• the algorithm terminates (in a finite number of steps)
– it decreases Score(X, C) on each iteration in which memberships change

• converges to a local minimum of Score(X, C)
– not necessarily the global minimum …
– different initial centers (seeds) can lead to different local minima (see the sketch below)
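The loop above fits in a few lines of NumPy. This is a hedged sketch (random initial seeds, Euclidean distance), not the exact code behind the lecture figures:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each center becomes the mean of the points it owns
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # no change => converged
            break
        centers = new_centers
    return labels, centers
```

Running it with different values of `seed` illustrates the sensitivity to initialization noted above: different seeds can end in different local minima of the score.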

Page 12:

K-means Complexity

• time complexity = O(I e n k), far less than the ~k^n cost of exhaustive search
– I = number of iterations (steps)
– e = cost of a distance computation (e = p for Euclidean distance)

• speed-up tricks (especially useful in early iterations)
– use the nearest x(i)'s as cluster centers instead of the mean
• reuse cached distances from the n × n distance matrix D (lowers the effective "e")
• k-medoids: use one of the x(i)'s as the center when a mean is not defined
– recompute centers as points are reassigned
• useful for large n (like online neural nets) and more cache-efficient
– PCA: reduce the effective "e" and/or fit more of X in RAM
– "condense": reduce "n" by replacing a group of points with a prototype
– even more clever data structures (see work by Andrew Moore, CMU)

Page 13:

K-means example (courtesy of Andrew Moore, CMU)

Page 14:

K-means
1. Ask user how many clusters they'd like (e.g., K = 5).

Page 15:

K-means
1. Ask user how many clusters they'd like (e.g., K = 5).
2. Randomly guess K cluster Center locations.

Page 16:

K-means
1. Ask user how many clusters they'd like (e.g., K = 5).
2. Randomly guess K cluster Center locations.
3. Each datapoint finds out which Center it's closest to. (Thus each Center "owns" a set of datapoints.)

Page 17:

K-means
1. Ask user how many clusters they'd like (e.g., K = 5).
2. Randomly guess K cluster Center locations.
3. Each datapoint finds out which Center it's closest to.
4. Each Center finds the centroid of the points it owns.

Page 18:

K-means
1. Ask user how many clusters they'd like (e.g., K = 5).
2. Randomly guess K cluster Center locations.
3. Each datapoint finds out which Center it's closest to.
4. Each Center finds the centroid of the points it owns.
5. New Centers => new boundaries.
6. Repeat until no change!

Page 19:

K-means
1. Ask user how many clusters they'd like (e.g., K = 5).
2. Randomly guess K cluster Center locations.
3. Each datapoint finds out which Center it's closest to.
4. Each Center finds the centroid of the points it owns…
5. …and jumps there.
6. …Repeat until terminated!

Page 20:

Accelerated Computation

Example generated by Pelleg and Moore's accelerated k-means.

Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases, 1999 (KDD-99). (Available at www.autonlab.org/pap.html)

Pages 21–29:

K-means continues, iteration by iteration (figures), until K-means terminates.

Page 30:

K-means clustering of RGB (3-value) pixel color intensities, K = 11 segments (courtesy of David Forsyth, UC Berkeley)

[Figure panels: Image; Clusters on color]

Page 31:

Issues in K-means clustering

• Simple, but useful
– tends to select compact, "isotropic" cluster shapes
– can be useful for initializing more complex methods
– many algorithmic variations on the basic theme

• Choice of distance measure
– Euclidean distance
– Weighted Euclidean distance
– Many others possible

• Selection of K
– "scree diagram": plot SSE versus K and look for a knee (see the sketch below)
• Limitation: there may not be any clear K value
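A hedged sketch of the scree/knee heuristic using scikit-learn (synthetic placeholder data; `inertia_` is the within-cluster sum of squared errors):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder synthetic data
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)            # within-cluster sum of squared errors
# plot range(1, 11) against sse and look for the "knee" where the curve flattens
```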

Page 32:

Probabilistic Clustering: Mixture Models

• assume a probabilistic model for each component cluster

• mixture model: f(x) = Σ_{k=1…K} wk fk(x; θk)

• where the wk are K mixing weights: 0 ≤ wk ≤ 1 and Σ_{k=1…K} wk = 1

• where the K components fk(x; θk) can be:
– Gaussian
– Poisson
– exponential
– ...

• Note:
– Assumes a model for the data (with advantages and disadvantages)
– Results in probabilistic memberships: p(cluster k | x)

Page 33:

Gaussian Mixture Models (GMM)

• model for the k-th component is normal N(μk, Σk)
– often assume a diagonal covariance: Σjj = σj², Σij = 0
– or sometimes even simpler: Σjj = σ², Σij = 0

• f(x) = Σ_{k=1…K} wk fk(x; θk), with θk = <μk, Σk> or <μk, σk>

• generative model (see the sampling sketch below):
– randomly choose a component
• component k is selected with probability wk
– generate x ~ N(μk, Σk)
– note: μk and σk are both d-dimensional vectors
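To illustrate the generative view, a small sketch that samples from an assumed two-dimensional GMM; the weights, means, and standard deviations below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, 0.3, 0.2])                         # mixing weights, sum to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])  # component means (d = 2)
sigma = np.array([0.5, 1.0, 0.3])                     # isotropic std dev per component

def sample(n):
    comp = rng.choice(len(w), size=n, p=w)            # pick component k with prob w_k
    return mu[comp] + rng.normal(size=(n, 2)) * sigma[comp, None]  # x ~ N(mu_k, sigma_k^2 I)

X = sample(1000)
```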

Page 34:

Learning Mixture Models from Data

• Score function = log-likelihood L(θ)
– L(θ) = log p(X | θ) = log Σ_H p(X, H | θ)
– H = hidden variables (the cluster memberships of each x)
– L(θ) cannot be optimized directly

• EM Procedure
– General technique for maximizing the log-likelihood with missing data
– For mixtures (see the sketch below):
• E-step: compute "memberships" p(k | x) = wk fk(x; θk) / f(x)
• M-step: pick a new θ to maximize the expected data log-likelihood
• Iterate: guaranteed to climb to a (local) maximum of L(θ)
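In practice the E/M iterations above are available in standard libraries. A hedged sketch with scikit-learn's GaussianMixture (diagonal covariances, synthetic placeholder data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))   # placeholder data
gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0)
gmm.fit(X)                          # runs EM until the log-likelihood converges
memberships = gmm.predict_proba(X)  # E-step quantities p(cluster k | x) at the fit
hard_labels = memberships.argmax(axis=1)
```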

Page 35:

The E (Expectation) Step

Current K clusters and parameters

n datapoints

E-step: compute p(data point i is in group k)

Page 36:

The M (Maximization) Step

New parameters for the K clusters

n datapoints

M-step: compute θ, given the n data points and their memberships

Page 37:

Complexity of EM for Mixtures

K models, n datapoints

Complexity per iteration scales as O(n K f(p))

Page 38:

Comments on Mixtures and EM Learning

• Complexity of each EM iteration
– Depends on the probabilistic model being used
• e.g., for Gaussians, the E-step is O(nK), the M-step is O(Knp²)
– Sometimes the E-step or M-step is not closed form
• => can require numerical methods at each iteration

• K-means interpretation
– Gaussian mixtures with isotropic (diagonal, equal-variance) Σk's
– Approximate the E-step by choosing the most likely cluster (instead of using membership probabilities)

• Generalizations…
– Mixtures of multinomials for text data
– Mixtures of Markov chains for Web sequences
– etc.

Page 39:

[Figure: ANEMIA PATIENTS AND CONTROLS; scatter plot of Red Blood Cell Volume (x-axis) vs. Red Blood Cell Hemoglobin Concentration (y-axis)]

Page 40:

[Figure: EM ITERATION 1; Red Blood Cell Volume vs. Red Blood Cell Hemoglobin Concentration]

Page 41:

[Figure: EM ITERATION 3; Red Blood Cell Volume vs. Red Blood Cell Hemoglobin Concentration]

Page 42:

[Figure: EM ITERATION 5; Red Blood Cell Volume vs. Red Blood Cell Hemoglobin Concentration]

Page 43:

[Figure: EM ITERATION 10; Red Blood Cell Volume vs. Red Blood Cell Hemoglobin Concentration]

Page 44:

[Figure: EM ITERATION 15; Red Blood Cell Volume vs. Red Blood Cell Hemoglobin Concentration]

Page 45:

[Figure: EM ITERATION 25; Red Blood Cell Volume vs. Red Blood Cell Hemoglobin Concentration]

Page 46:

[Figure: LOG-LIKELIHOOD AS A FUNCTION OF EM ITERATIONS; Log-Likelihood (y-axis) vs. EM Iteration (x-axis)]

Page 47:

Selecting K in Mixture Models

• cannot just choose the K that maximizes the likelihood
– the likelihood L(θ) is ALWAYS larger for larger K

• Model selection alternatives:
– 1) penalize complexity (see the sketch below)
• e.g., BIC = L(θ) – (d/2) log n (Bayesian Information Criterion)
– 2) Bayesian: compute posteriors p(K | data)
• Can be tricky to compute for mixture models
– 3) (cross-)validation: popular and practical
• Score different models by log p(Xtest | θ)
• split the data into train and validation sets
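A hedged sketch of option (1) using scikit-learn's GaussianMixture on placeholder data. Note that scikit-learn's bic() follows the convention -2 L(θ) + d log n, so lower is better; the slide's L(θ) – (d/2) log n form, where higher is better, differs only by a factor of -2:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))   # placeholder synthetic data
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 9)}
best_k = min(bic, key=bic.get)   # smallest BIC under sklearn's sign convention
```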

Page 48:

Example of BIC Score for Red-Blood Cell Data

Page 49:

[Figure: ANEMIA DATA WITH LABELS; Red Blood Cell Volume vs. Red Blood Cell Hemoglobin Concentration]

Page 50:

Hierarchical Clustering

• Representation: a tree of nested clusters
• Works from a distance matrix
– advantage: the x's can be any type of object
– disadvantage: computation

• two basic approaches:
– merge points (agglomerative)
– divide superclusters (divisive)

• visualize both via "dendrograms"
– shows the nesting structure
– merges or splits = tree nodes

• Applications
– e.g., clustering of gene expression data
– Useful for seeing hierarchical structure, for relatively small data sets

Page 51:

Page 52:

Agglomerative Methods: Bottom-Up

• algorithm, based on distances between clusters:
– for i = 1 to n, let Ci = { x(i) } -- i.e., start with n singletons
– while more than one cluster is left:
• let Ci and Cj be the cluster pair with minimum distance dist[Ci, Cj]
• merge them, via Ci = Ci ∪ Cj, and remove Cj

• time complexity = O(n²) to O(n³)
– n iterations (start: n clusters; end: 1 cluster)
– 1st iteration takes O(n log n) to O(n²) to find the nearest singleton pair

• space complexity = O(n log n) to O(n²)
– accesses all/most distances between the x(i)'s during the build
– interpreting a large-n dendrogram is difficult anyway (like DTs)
• large-n idea: use partition-based clusters at the leaves

Page 53:

Distances Between Clusters

• single link / nearest neighbor measure:
– D(Ci, Cj) = min { d(x, y) | x ∈ Ci, y ∈ Cj }
– can be sensitive to outliers/noise

• complete link / furthest neighbor measure:
– D(Ci, Cj) = max { d(x, y) | x ∈ Ci, y ∈ Cj }

• intermediates between those extremes:
– average link: D(Ci, Cj) = avg { d(x, y) | x ∈ Ci, y ∈ Cj }
– centroid: D(Ci, Cj) = d(ci, cj), where ci, cj are the cluster centroids
– Ward's SSE measure (for vector data):
• within-cluster sum-of-squared distances for Ci + for Cj – for the merged cluster

• DM theme: try several, see which is most interesting (see the sketch below)
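A hedged SciPy sketch of trying several of these linkage measures on the same Euclidean distance data, in the spirit of the "try several" theme (placeholder data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(50, 2))    # placeholder vector data
d = pdist(X)                                         # condensed Euclidean distances
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(d, method=method)                    # agglomerative merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
# dendrogram(Z) draws the nesting structure for the last linkage tried
```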

Page 54:

Dendrogram Using the Single-Link Method

Old Faithful Eruption Duration vs. Wait Data

Note that the y scale ≠ the x scale!

Notice how single-link tends to "chain".

The dendrogram's y-axis = each crossbar's distance score.

Page 55:

Dendrogram Using Ward's SSE Distance

Old Faithful Eruption Duration vs. Wait Data. More balanced than single-link.

Page 56:

Divisive Methods: Top-Down

• algorithm:
– begin with a single cluster containing all the data
– split into components; repeat until the clusters are single points

• two major types:
– monothetic:
• split on one variable at a time -- restricts the choice search space
• analogous to DTs
– polythetic:
• splits on all variables at once -- the many choices make this difficult

• less commonly used than agglomerative methods
– generally more computationally intensive
• more choices in the search space

Page 57:

Spectral/Graph-based Clustering

Page 58:

Clustering Non-Vector Objects

• E.g., sequences, images, documents, etc.
– Can be of varying lengths, sizes

• Distance-matrix approach (see the sketch below)
– E.g., compute edit distances/transformations for pairs of sequences
– Apply clustering (e.g., hierarchical) based on the distance matrix
– However… does not scale well

• "Vectorization"
– Represent each object as a vector
– Cluster the resulting vectors using a vector-space algorithm
– However… can lose (e.g., sequence) information by going to a vector space

• Probabilistic model-based clustering
– Treat as a mixture of (e.g.) stochastic finite-state machines
– Can naturally handle variable lengths
– We will discuss an application to Web session clustering later in the quarter
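A hedged sketch of the distance-matrix approach for variable-length sequences: a plain dynamic-programming edit distance, then hierarchical clustering on the resulting matrix. The sequences here are made up for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def edit_distance(a, b):
    # classic Levenshtein dynamic program
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i, j] = min(dp[i - 1, j] + 1, dp[i, j - 1] + 1,
                           dp[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return dp[len(a), len(b)]

seqs = ["abab", "abba", "aabb", "xyzxy", "xyzzy", "xxyzy"]          # toy sequences
D = np.array([[edit_distance(s, t) for t in seqs] for s in seqs])   # n x n distances
Z = linkage(squareform(D), method="average")   # agglomerative clustering on D
labels = fcluster(Z, t=2, criterion="maxclust")
```

As the slide notes, this does not scale well: the distance matrix alone costs O(n²) pairwise edit-distance computations.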

Page 59:

K-Means Clustering

Task: Clustering
Representation: Partition based on K centers
Score Function: Within-cluster sum of squared errors
Search/Optimization: Iterative greedy search
Data Management: None specified
Models, Parameters: K centers

Page 60:

Probabilistic Model-Based Clustering

Task: Clustering
Representation: Mixture of probability components
Score Function: Log-likelihood
Search/Optimization: EM (iterative)
Data Management: None specified
Models, Parameters: Probability model

Page 61:

Single-Link Hierarchical Clustering

Task: Clustering
Representation: Tree of nested groupings
Score Function: No global score
Search/Optimization: Iterative merging of nearest neighbors
Data Management: None specified
Models, Parameters: Dendrogram

Page 62:

Summary

• General comments:
– Many different approaches and algorithms
– What type of cluster structure are you looking for?
– Computational complexity may be an issue for large n
– Dimensionality is also an issue
– Validation is difficult, but the payoff can be large

• Chapter 9
– Covers all of the clustering methods discussed here (except graph/spectral clustering)