
INTRODUCTION TO MACHINE LEARNING

ETHEM ALPAYDIN © The MIT Press, 2004

alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml

Lecture Slides for CHAPTER 7: Clustering


Semiparametric Density Estimation

Parametric: assume a single model for p(x | C_i) (Chapters 4 and 5)

Semiparametric: p(x | C_i) is a mixture of densities

Multiple possible explanations/prototypes: different handwriting styles, accents in speech

Nonparametric: no model; the data speaks for itself (Chapter 8)


Mixture Densities

p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)

where the G_i are the components/groups/clusters, P(G_i) are the mixture proportions (priors), and p(x | G_i) are the component densities.

Gaussian mixture: p(x | G_i) ~ N(\mu_i, \Sigma_i), with parameters \Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}

estimated from an unlabeled sample X = \{x^t\}_t (unsupervised learning).
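As an illustration of the mixture formula, the short NumPy sketch below (my own, not from the slides) evaluates p(x) = \sum_i p(x | G_i) P(G_i) for a two-component Gaussian mixture; the component parameters are made-up values chosen only for the example.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(mean, cov) evaluated at x."""
    d = mean.shape[0]
    diff = x - mean
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    expo = -0.5 * diff @ np.linalg.solve(cov, diff)
    return np.exp(expo) / norm

def mixture_density(x, priors, means, covs):
    """p(x) = sum_i p(x | G_i) P(G_i) for a Gaussian mixture."""
    return sum(P * gaussian_pdf(x, m, S)
               for P, m, S in zip(priors, means, covs))

# Hypothetical two-component mixture in 2-D (illustrative parameters).
priors = [0.6, 0.4]                                    # P(G_i)
means  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # mu_i
covs   = [np.eye(2), 0.5 * np.eye(2)]                  # Sigma_i

x = np.array([1.0, 1.0])
print("p(x) =", mixture_density(x, priors, means, covs))
```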


Classes vs. Clusters

Supervised: X = \{x^t, r^t\}_t

Classes C_i, i = 1, ..., K, where p(x | C_i) ~ N(\mu_i, \Sigma_i)

\Phi = \{P(C_i), \mu_i, \Sigma_i\}_{i=1}^{K}

p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)

\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}

m_i = \frac{\sum_t r_i^t x^t}{\sum_t r_i^t}

S_i = \frac{\sum_t r_i^t (x^t - m_i)(x^t - m_i)^T}{\sum_t r_i^t}

Unsupervised: X = \{x^t\}_t

Clusters G_i, i = 1, ..., k, where p(x | G_i) ~ N(\mu_i, \Sigma_i)

\Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}

p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)

Labels r_i^t are unknown.
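The supervised estimates translate directly into a few lines of NumPy. The sketch below is my own illustration (not part of the slides); it computes \hat{P}(C_i), m_i, and S_i from a labeled sample, where `r` is a 0/1 one-of-K label matrix.

```python
import numpy as np

def supervised_estimates(X, r):
    """ML estimates of class priors, means, and covariances.

    X: (N, d) data matrix, rows are instances x^t.
    r: (N, K) 0/1 label matrix, r[t, i] = 1 if x^t belongs to C_i.
    """
    N, d = X.shape
    K = r.shape[1]
    counts = r.sum(axis=0)                  # sum_t r_i^t
    priors = counts / N                     # P_hat(C_i)
    means = (r.T @ X) / counts[:, None]     # m_i
    covs = np.empty((K, d, d))
    for i in range(K):
        diff = X - means[i]                 # x^t - m_i
        covs[i] = (r[:, i, None] * diff).T @ diff / counts[i]  # S_i
    return priors, means, covs

# Tiny made-up example: two classes in 2-D.
X = np.array([[0.0, 0.1], [0.2, -0.1], [3.0, 3.1], [2.9, 2.8]])
r = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
priors, means, covs = supervised_estimates(X, r)
print(priors, means, sep="\n")
```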


k-Means Clustering

Find k reference vectors (prototypes/codebook vectors/codewords) which best represent the data.

Reference vectors m_j, j = 1, ..., k

Use the nearest (most similar) reference:

\|x^t - m_i\| = \min_j \|x^t - m_j\|

Reconstruction error:

E(\{m_i\}_{i=1}^{k} \mid X) = \sum_t \sum_i b_i^t \|x^t - m_i\|^2

b_i^t = 1 if \|x^t - m_i\| = \min_j \|x^t - m_j\|, and 0 otherwise
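A minimal sketch of this procedure in NumPy, assuming Euclidean distance and initialization from randomly chosen instances (my own illustration, not the slides' pseudocode):

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Alternate between hard assignments b_i^t and mean updates m_i."""
    rng = np.random.default_rng(seed)
    # Initialize the k reference vectors with randomly chosen instances.
    m = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: b_i^t = 1 for the nearest reference vector.
        dists = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each m_i becomes the mean of its assigned instances.
        new_m = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                          else m[i] for i in range(k)])
        if np.allclose(new_m, m):
            break
        m = new_m
    error = np.sum((X - m[labels]) ** 2)   # reconstruction error E
    return m, labels, error

# Made-up data: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
m, labels, error = k_means(X, k=2)
print("codebook vectors:\n", m, "\nreconstruction error:", error)
```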


Encoding/Decoding

b_i^t = 1 if \|x^t - m_i\| = \min_j \|x^t - m_j\|, and 0 otherwise

Encoding maps each instance x^t to the index i of its nearest reference vector; decoding replaces that index with the codeword m_i.
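A short standalone sketch of this encode/decode step (my own illustration; the codebook values here are arbitrary):

```python
import numpy as np

def encode(X, m):
    """Map each x^t to the index of its nearest codeword (hard b_i^t)."""
    dists = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
    return dists.argmin(axis=1)

def decode(indices, m):
    """Replace each index with its codeword m_i (the reconstruction)."""
    return m[indices]

# Hypothetical 1-D signal quantized with a 3-word codebook.
m = np.array([[0.0], [1.0], [2.0]])
X = np.array([[0.1], [0.9], [1.7], [2.4]])
idx = encode(X, m)
print(idx, decode(idx, m).ravel())
```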


k-means Clustering

[Figure slides: k-means clustering examples.]

Expectation-Maximization (EM)

Log likelihood with a mixture model

The incomplete likelihood, L(\Phi | X), is in terms of x:

L(\Phi \mid X) = \log \prod_t p(x^t \mid \Phi) = \sum_t \log \sum_{i=1}^{k} p(x^t \mid G_i)\, P(G_i)

Assume hidden variables z which, when known, make the optimization much simpler.

The complete likelihood, L_c(\Phi | X, Z), is in terms of both x and z.


E- and M-steps

Iterate the two steps:

1. E-step: estimate z given X and the current \Phi
2. M-step: find the new \Phi' given z, X, and the old \Phi

E-step: Q(\Phi \mid \Phi^l) = E[\, L_c(\Phi \mid X, Z) \mid X, \Phi^l \,]

M-step: \Phi^{l+1} = \arg\max_{\Phi} Q(\Phi \mid \Phi^l)

An increase in Q increases the incomplete likelihood:

L(\Phi^{l+1} \mid X) \geq L(\Phi^l \mid X)


EM in Gaussian Mixtures

z_i^t = 1 if x^t belongs to G_i, and 0 otherwise (the analog of the labels r_i^t in supervised learning); assume p(x | G_i) ~ N(\mu_i, \Sigma_i)

E-step:

E[z_i^t \mid X, \Phi^l] = P(G_i \mid x^t, \Phi^l) = \frac{p(x^t \mid G_i, \Phi^l)\, P(G_i)}{\sum_j p(x^t \mid G_j, \Phi^l)\, P(G_j)} \equiv h_i^t

M-step:

P(G_i) = \frac{\sum_t h_i^t}{N}

m_i^{l+1} = \frac{\sum_t h_i^t x^t}{\sum_t h_i^t}

S_i^{l+1} = \frac{\sum_t h_i^t (x^t - m_i^{l+1})(x^t - m_i^{l+1})^T}{\sum_t h_i^t}

Use the estimated soft labels h_i^t in place of the unknown labels r_i^t.
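Put together, the E- and M-steps above give the following NumPy sketch of EM for a Gaussian mixture (my own illustration, assuming full covariances, initialization from randomly chosen instances, and a small ridge term to keep covariances invertible):

```python
import numpy as np

def em_gmm(X, k, n_iters=100, seed=0, ridge=1e-6):
    """EM for a Gaussian mixture: soft labels h_i^t, then re-estimation."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(N, size=k, replace=False)].copy()   # m_i
    covs = np.array([np.cov(X.T) + ridge * np.eye(d)] * k)   # S_i
    priors = np.full(k, 1.0 / k)                             # P(G_i)
    for _ in range(n_iters):
        # E-step: h_i^t = p(x^t|G_i)P(G_i) / sum_j p(x^t|G_j)P(G_j)
        h = np.empty((N, k))
        for i in range(k):
            diff = X - means[i]
            inv = np.linalg.inv(covs[i])
            expo = -0.5 * np.einsum('nd,dD,nD->n', diff, inv, diff)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[i]))
            h[:, i] = priors[i] * np.exp(expo) / norm
        h /= h.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimation of priors, means, covariances.
        Ni = h.sum(axis=0)
        priors = Ni / N
        means = (h.T @ X) / Ni[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (h[:, i, None] * diff).T @ diff / Ni[i] \
                      + ridge * np.eye(d)
    return priors, means, covs, h

# Made-up data: two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1.0, (100, 2)), rng.normal(5, 1.0, (100, 2))])
priors, means, covs, h = em_gmm(X, k=2)
print("mixture proportions:", priors, "\nmeans:\n", means)
```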

[Figure: EM-fitted mixture; the contour P(G_1 | x) = h_1 = 0.5 marks where the posterior responsibilities of the two clusters are equal.]

Mixtures of Latent Variable Models

Regularize clusters

1. Assume shared/diagonal covariance matrices

2. Use PCA/FA to decrease dimensionality: Mixtures of PCA/FA

Can use EM to learn Vi (Ghahramani and Hinton, 1997; Tipping and Bishop, 1999)

p(x^t \mid G_i) = N(m_i, V_i V_i^T + \Psi_i)


After Clustering

Dimensionality reduction methods find correlations between features and group features; clustering methods find similarities between instances and group instances.

Allows knowledge extraction through the number of clusters, prior probabilities, and cluster parameters, i.e., center and range of features.

Example: CRM, customer segmentation


Clustering as Preprocessing

Estimated group labels h_j (soft) or b_j (hard) may be seen as the dimensions of a new k-dimensional space, where we can then learn our discriminant or regressor (see the sketch after the comparison below).

Local representation (only one bj is 1, all others are 0; only few hj are nonzero) vs

Distributed representation (After PCA; all zj are nonzero)
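A sketch of this idea (my own, not from the slides): given fitted cluster centers, each instance is replaced by its k-dimensional vector of hard memberships b_j or soft memberships h_j, which then feeds a later classifier or regressor. The distance-based soft weighting and the `beta` parameter below are illustrative assumptions, not the slides' exact recipe.

```python
import numpy as np

def cluster_features(X, means, soft=True, beta=1.0):
    """Map X to a new k-dimensional space of cluster memberships.

    Hard: b_j = 1 only for the nearest cluster (local representation).
    Soft: h_j from distance-based weights (a simple illustrative choice).
    """
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    if not soft:
        b = np.zeros_like(dists)
        b[np.arange(len(X)), dists.argmin(axis=1)] = 1.0
        return b
    w = np.exp(-beta * dists ** 2)          # closer clusters get more weight
    return w / w.sum(axis=1, keepdims=True)

# Two hypothetical cluster centers in 2-D; three instances to transform.
means = np.array([[0.0, 0.0], [4.0, 4.0]])
X = np.array([[0.5, 0.2], [2.0, 2.0], [3.8, 4.1]])
print(cluster_features(X, means, soft=False))   # hard b_j, one 1 per row
print(cluster_features(X, means, soft=True))    # soft h_j, rows sum to 1
```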


Mixture of Mixtures

In classification, the input comes from a mixture of classes (supervised).

If each class is also a mixture, e.g., of Gaussians, (unsupervised), we have a mixture of mixtures:

p(x \mid C_i) = \sum_{j=1}^{k_i} p(x \mid G_{ij})\, P(G_{ij})

p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)


Hierarchical Clustering

Cluster based on similarities/distances

Distance measure between instances xr and xs

Minkowski (Lp) (Euclidean for p = 2)

City-block distance

d_m(x^r, x^s) = \left[ \sum_{j=1}^{d} |x_j^r - x_j^s|^p \right]^{1/p}

d_{cb}(x^r, x^s) = \sum_{j=1}^{d} |x_j^r - x_j^s|
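The two distance measures in a few lines of Python (a direct transcription, with the absolute value written explicitly so the Minkowski form is valid for any p):

```python
import numpy as np

def minkowski(xr, xs, p=2):
    """Minkowski (L_p) distance; Euclidean for p = 2."""
    return np.sum(np.abs(xr - xs) ** p) ** (1.0 / p)

def city_block(xr, xs):
    """City-block (L_1) distance."""
    return np.sum(np.abs(xr - xs))

xr, xs = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(xr, xs, p=2))   # 5.0
print(city_block(xr, xs))       # 7.0
```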


Agglomerative Clustering

Start with N groups, each with one instance, and merge the two closest groups at each iteration.

Distance between two groups G_i and G_j:

Single-link:

d(G_i, G_j) = \min_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)

Complete-link:

d(G_i, G_j) = \max_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)

Average-link, centroid
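A minimal agglomerative clustering sketch with a choice of single-link or complete-link group distance (my own illustration using Euclidean instance distances; practical implementations update the distance matrix incrementally instead of recomputing it):

```python
import numpy as np

def agglomerative(X, n_clusters, link="single"):
    """Start with one group per instance; repeatedly merge the closest pair."""
    groups = [[t] for t in range(len(X))]          # indices of group members
    agg = min if link == "single" else max         # single- vs complete-link
    while len(groups) > n_clusters:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # Group distance = min/max over all cross-group instance pairs.
                d = agg(np.linalg.norm(X[r] - X[s])
                        for r in groups[a] for s in groups[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[a].extend(groups[b])                # merge the two closest groups
        del groups[b]
    return groups

# Made-up 1-D example with three natural groups.
X = np.array([[0.0], [0.2], [5.0], [5.1], [10.0], [10.3]])
print(agglomerative(X, n_clusters=3, link="single"))
```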


Dendrogram

[Figure: example of single-link clustering and the corresponding dendrogram.]


Choosing k

Defined by the application, e.g., image quantization

Plot data (after PCA) and check for clusters

Incremental (leader-cluster) algorithm: add clusters one at a time until an "elbow" appears in the reconstruction error, log likelihood, or intergroup distances (see the sketch below)

Manually check the clusters for meaning
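One way to look for the elbow is to run k-means for increasing k and track the reconstruction error. The sketch below is my own illustration (it inlines a compact k-means rather than reusing a library routine); the printed errors can be plotted against k and inspected for the point where the curve flattens.

```python
import numpy as np

def k_means_error(X, k, n_iters=50, seed=0):
    """Run a basic k-means and return the final reconstruction error."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        labels = np.linalg.norm(X[:, None] - m[None], axis=2).argmin(axis=1)
        m = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                      else m[i] for i in range(k)])
    return np.sum((X - m[labels]) ** 2)

# Made-up data with three underlying groups; the error should flatten around k = 3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0.0, 3.0, 6.0)])
for k in range(1, 7):
    print(k, round(k_means_error(X, k), 2))
```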
