
Page 1:

Machine Learning, STOR 565

Clustering: Overview and Basic Methods

Andrew Nobel

February, 2021

Page 2:

Overview

Task: Divide a set of objects (e.g. data points) into a small number of disjoint groups such that objects in the same group are close together, and objects in different groups are far apart.

Page 3:

Example (http://rosettacode.org)

Page 4:

Example (https://apandre.files.wordpress.com)

Page 5:

General Setting

Given: Objects x1, . . . , xn in feature space X

▶ Dissimilarity or distance d(xi, xj) between pairs of objects

Goal: Divide x1, . . . , xn into disjoint groups C1, . . . , Ck, called clusters, s.t.

▶ Objects in the same cluster are close together

▶ Objects in different clusters are far apart

▶ The number of clusters k is small

Distinction: Clustering is complete if it partitions X, and incomplete if it partitions only x1, . . . , xn.

Page 6:

Clustering: Areas of Application

Genomics, Biology

Data Compression

Psychology

Computer Science

Social and Political Science

Page 7:

Feature Vectors

Objects x ∈ X typically represented by a feature vector

$$x = (x_1, \ldots, x_p)^t$$

where xi is a numerical/categorical measurement of interest:

▶ xi ∈ R: numerical feature

▶ xi ∈ {a, b, . . .}: categorical feature

Page 8:

Examples

Medicine

▶ Object = patient

▶ Feature xi = outcome of a diagnostic test on patient

Microarrays (Genomics)

▶ Object = tissue sample

▶ Feature xi = measured expression level of gene i in that sample

Data Mining

▶ Object = consumer

▶ Features xi = type, location, or amount of recent purchases

Page 9:

Dissimilarities/Distances Between Feature Vectors

Euclidean: $d(u, v) = \sqrt{\sum_i (u_i - v_i)^2}$

Manhattan: $d(u, v) = \sum_i |u_i - v_i|$

Correlation: $d(u, v) = 1 - \mathrm{corr}(u, v)$

Hamming: $d(u, v) = \sum_i I\{u_i \neq v_i\}$

Mixtures of these
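A quick illustration of these dissimilarities in base R (the two toy vectors are made up for the example):

    # two toy feature vectors
    u <- c(1, 0, 2, 3, 1)
    v <- c(2, 1, 2, 0, 1)

    sqrt(sum((u - v)^2))   # Euclidean
    sum(abs(u - v))        # Manhattan
    1 - cor(u, v)          # correlation dissimilarity
    sum(u != v)            # Hamming: number of disagreeing coordinates

    # for a data matrix, dist() computes all pairwise distances at once
    X <- matrix(rnorm(20), nrow = 4)   # 4 objects with 5 features each
    dist(X, method = "euclidean")      # or method = "manhattan"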

Page 10:

Basic Steps in Clustering

Objects x1, . . . , xn

↓ Selection and Extraction of Features

Dissimilarity matrix D = {d(xi, xj) : 1 ≤ i, j ≤ n}

↓ Clustering Algorithm

Partition π = {C1, . . . , Ck} of x1, . . . , xn
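For concreteness, the whole pipeline in a few lines of base R. The choices here (standardizing mtcars, Euclidean distance, hierarchical clustering, k = 3) are arbitrary, just to show each step:

    X <- as.matrix(scale(mtcars))    # feature selection/extraction (standardized features)
    D <- dist(X)                     # dissimilarity matrix D
    part <- cutree(hclust(D), k = 3) # clustering algorithm -> partition
    split(rownames(mtcars), part)    # the clusters C1, . . . , Ck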

Page 11:

Some Clustering Methods

Hierarchical: Candidate divisions of the data described by a binary tree

▶ Agglomerative (bottom-up)

▶ Divisive (top-down)

Iterative: Search for a local minimum of a simple cost function

▶ k-means and variants

▶ Partitioning around medoids, self-organizing maps

Model-based: Fit feature vectors with a finite mixture model

Spectral: Threshold eigenvectors of the Laplacian of the dissimilarity matrix

Page 12:

Features of Clusters

Features of clusters can affect the performance of different procedures, e.g., whether clusters are

▶ Spherical or elliptical in shape

▶ Similar in overall variance/spread

▶ Similar in size (number of points)

Page 13:

The k-Means Algorithm

Page 14:

The k-Means Algorithm

Setting: Objects x1, . . . , xn ∈ Rp are vectors. Seek k clusters

Approach: Focus on cluster centers

▶ Find good cluster centers c1, . . . , ck ∈ Rp

▶ Let cluster Cj = vectors xi closer to cj than to any other center

Optimization: Select centers to minimize the sum of squares (SoS) cost function

$$\mathrm{Cost}(c_1, \ldots, c_k) = \sum_{i=1}^{n} \min_{1 \le j \le k} \|x_i - c_j\|^2$$

Problem: Exact solution of the optimization problem is not feasible. Resort to iterative methods that find a local minimum

Page 15:

Ingredient 1: Centroids

Definition: The centroid of vectors v1, . . . , vm ∈ Rp is their (vector) average

$$c = \frac{1}{m} \sum_{i=1}^{m} v_i$$

▶ Centroid c is the center of mass of the point configuration v1, . . . , vm

▶ Centroid c is an optimal representative for v1, . . . , vm, in the sense that

$$\sum_{i=1}^{m} \|v_i - c\|^2 \;\le\; \sum_{i=1}^{m} \|v_i - v\|^2$$

for every vector v ∈ Rp
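The optimality claim follows by expanding the square; a short derivation:

    \begin{aligned}
    \sum_{i=1}^{m} \|v_i - v\|^2
      &= \sum_{i=1}^{m} \|(v_i - c) + (c - v)\|^2 \\
      &= \sum_{i=1}^{m} \|v_i - c\|^2
         + 2\,(c - v)^t \sum_{i=1}^{m} (v_i - c)
         + m\,\|c - v\|^2 \\
      &= \sum_{i=1}^{m} \|v_i - c\|^2 + m\,\|c - v\|^2
      \;\ge\; \sum_{i=1}^{m} \|v_i - c\|^2
    \end{aligned}

The cross term vanishes because $\sum_{i=1}^{m} (v_i - c) = 0$ by the definition of the centroid, and equality holds iff v = c.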

Page 16:

Ingredient 2: Nearest Neighbor Partitions

Idea: Given centers c1, . . . , ck ∈ Rp one can partition Rp into corresponding cells A1, . . . , Ak, where

$$A_j = \{x : \|x - c_j\| \le \|x - c_s\| \ \text{for all } s \neq j\}$$

contains the vectors that are closer to center cj than to any other center cs (where we break ties by index)

Definition: The cells {A1, . . . , Ak} are called the nearest neighbor or Voronoi partition of Rp generated by the centers c1, . . . , ck

Note: $A_j = \bigcap_{s \neq j} \{x : \|x - c_j\| \le \|x - c_s\|\}$ is an intersection of half-spaces
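In code, the nearest neighbor partition is just an assignment of each object to the index of its closest center. A small R sketch (the helper name nearest_cell is ours, for illustration):

    # assign each row of X to the nearest row of C
    # (ties go to the smallest index, matching the convention above)
    nearest_cell <- function(X, C) {
      D2 <- outer(rowSums(X^2), rep(1, nrow(C))) -   # ||x_i||^2
            2 * X %*% t(C) +                         # - 2 <x_i, c_j>
            outer(rep(1, nrow(X)), rowSums(C^2))     # + ||c_j||^2
      apply(D2, 1, which.min)   # which.min returns the first (smallest-index) minimizer
    }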

Page 17:

The k-Means Algorithm

Initialize: Centers C0 = {a1, . . . , ak}

Iterate: For m = 1, 2, . . . do:

▶ Let πm be the nearest neighbor partition of the centers Cm−1

▶ Let Cm be the centroids of the vectors in each cell of πm

Stop: When Cost(Cm) is close to Cost(Cm+1)

Key Fact: The cost function decreases at each iteration of the algorithm. Recall

$$\mathrm{Cost}(c_1, \ldots, c_k) = \sum_{i=1}^{n} \min_{1 \le j \le k} \|x_i - c_j\|^2$$
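Putting the two ingredients together gives a compact implementation. A minimal sketch in R, reusing the nearest_cell helper from the Voronoi slide (for practical work use the built-in kmeans() instead):

    my_kmeans <- function(X, k, iters = 100, tol = 1e-8) {
      C <- X[sample(nrow(X), k), , drop = FALSE]   # initialize centers at k random data points
      cost <- Inf
      for (m in seq_len(iters)) {
        cell <- nearest_cell(X, C)                 # step 1: nearest neighbor partition
        for (j in seq_len(k)) {                    # step 2: move each center to its cell's centroid
          pts <- X[cell == j, , drop = FALSE]
          if (nrow(pts) > 0) C[j, ] <- colMeans(pts)   # keep the old center if the cell is empty
        }
        new_cost <- sum((X - C[cell, , drop = FALSE])^2)   # SoS cost at the current centers
        if (cost - new_cost < tol) break           # stop when the cost stops decreasing
        cost <- new_cost
      }
      list(centers = C, cluster = cell, cost = new_cost)
    }

Both steps can only decrease the cost, the partition step by the definition of the cells and the recentering step by the centroid inequality, which is exactly the Key Fact above.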

Page 18:

k-means (Yu-Zhong Chen, ResearchGate)

Page 19:

The k-Means Algorithm

In practice

▶ Choose multiple initial sets of representative vectors C0 = {c1, . . . , ck}

▶ Run the iterative k-means procedure from each

▶ Choose the partition associated with the smallest final cost

Example: http://onmyphd.com/?p=k-means.clustering.

k-means tends to perform best when clusters are spherical and similar in variance and size
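In base R this multi-start recipe is a single argument to kmeans(): nstart reruns the algorithm from several random initializations and keeps the lowest-cost result. A short example on a built-in data set (the choice of iris and of 3 clusters is ours, just for illustration):

    X <- scale(iris[, 1:4])   # standardize the four numeric features
    fit <- kmeans(X, centers = 3, nstart = 25, algorithm = "Lloyd")
    fit$tot.withinss          # final sum-of-squares cost
    table(fit$cluster)        # cluster sizes

    # algorithm = "Lloyd" matches the iteration described above; the default
    # (Hartigan-Wong) is a refinement that usually does at least as well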

Page 20:

3-means (onmyphd.com)

Page 21:

2-means (onmyphd.com)

Page 22:

Agglomerative Clustering

Page 23:

Binary Trees

1. A distinguished node called the root, with zero or two children but no parent

2. Every other node has one parent and zero or two children

▶ Nodes with no children are called leaves

▶ Nodes with two children are called internal

Note: Tree usually drawn upside-down, with root node at the top

Page 24:

Agglomerative Clustering

Stage 0: Assign each object xi to its own cluster

Stage k:

▶ Find the two closest clusters at stage k − 1

▶ Combine them into a single cluster

Stop: When all objects xi belong to a single cluster

Output: Binary tree T called a dendrogram

Note: Closeness of clusters C, C′ can be measured in different ways

Page 25:

Distances Between Clusters

Single Linkage

$$d_s(C, C') = \min_{x_i \in C,\, x_j \in C'} d(x_i, x_j)$$

Average Linkage

$$d_a(C, C') = \frac{1}{|C|\,|C'|} \sum_{x_i \in C,\, x_j \in C'} d(x_i, x_j)$$

Total Linkage

$$d_t(C, C') = \max_{x_i \in C,\, x_j \in C'} d(x_i, x_j)$$
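These linkages correspond directly to the method argument of R's hclust(); note that what this deck calls total linkage is called "complete" linkage in R and much of the literature:

    D <- dist(mtcars)               # Euclidean dissimilarity matrix (used again later)
    hclust(D, method = "single")    # single linkage
    hclust(D, method = "average")   # average linkage
    hclust(D, method = "complete")  # total (complete) linkage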

Page 26:

Dendrogram

Binary tree associated with the agglomerative clustering procedure: it is a graphical record of the clustering process

Initialize: Each singleton cluster {xi} corresponds to a node at height 0

Update: If two clusters C, C′ are combined, their respective nodes are joined to a parent node at height d(C, C′)

Each node of the dendrogram corresponds to a set of objects. Objects associated with two nodes are merged when forming their parent

▶ Leaves correspond to individual objects

▶ The root corresponds to all objects
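R's hclust output records exactly this construction; a quick look, using the cars data that appears later in the deck:

    ha <- hclust(dist(mtcars), method = "average")
    head(ha$merge)    # which two clusters were combined at each stage
    head(ha$height)   # the height d(C, C') at which each pair was joined
    plot(ha)          # draws the dendrogram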

Page 27:

Cities by Distance (blogs.sas.com)

Page 28:

Salmon by Genetic Similarity

Page 29:

Dendrogram, cont.

Note: Dendrogram T represents many possible clusterings, one for each (rooted) subtree.

Methods for selecting a clustering/subtree

▶ Ad hoc selection (by eye)

▶ "Cutting" the dendrogram at a fixed level

▶ Penalized pruning

Visualization of clustering structure

▶ Order objects in the same way as the leaves of the dendrogram

▶ Caveat: many orderings possible
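Cutting at a fixed level is built into R: cutree() extracts a clustering from an hclust tree, specified either by the number of clusters k or by the height h of the cut:

    ha <- hclust(dist(mtcars), method = "average")
    cutree(ha, k = 3)     # the 3-cluster partition implied by the dendrogram
    cutree(ha, h = 100)   # all clusters formed below height 100
    plot(ha); rect.hclust(ha, k = 3)   # draw the 3-cluster cut on the dendrogram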

Page 30:

Cars Data

▶ Samples: 32 unique cars

▶ Variables: 11 descriptive variables, including gas mileage, horsepower, number of cylinders, etc.

▶ Freely available in R: data(mtcars)
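The two dendrograms on the following pages come from hclust on the Euclidean distances between cars, as the plot annotations indicate; a sketch that reproduces them:

    data(mtcars)
    d <- dist(mtcars)   # Euclidean distances between the 32 cars
    plot(hclust(d, method = "single"),
         main = "Single Linkage Clustering on Cars data", xlab = "", sub = "")
    plot(hclust(d, method = "average"),
         main = "Average Linkage Clustering on Cars data", xlab = "", sub = "")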

Page 31:

[Dendrogram: Single Linkage Clustering on Cars data. hclust(*, "single") on dist(mtcars); vertical axis: Distance; leaves labeled by car model]

Page 32:

[Dendrogram: Average Linkage Clustering on Cars data. hclust(*, "average") on dist(mtcars); vertical axis: Distance; leaves labeled by car model]

Page 33:

TCGA Data

Gene expression data from The Cancer Genome Atlas (TCGA)

▶ Samples

  ▶ 95 Luminal A breast tumors

  ▶ 122 Basal breast tumors

▶ Variables: 2000 randomly selected genes

Page 34:

TCGA Data

▶ Clustered samples (breast tumor subtype)

▶ Colors: Luminal A and Basal

Page 35:

Important Questions

▶ What is the right number of clusters?

▶ What is the right measure of distance?

▶ What is the best clustering method for the data?

▶ How robust is an observed clustering to small perturbations of the data?

▶ What significance can be assigned to the clusters?