
Lecture Slides for

INTRODUCTION TO Machine Learning

ETHEM ALPAYDIN © The MIT Press, 2004
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml


CHAPTER 7: Clustering


Semiparametric Density Estimation

Parametric: assume a single model for $p(x \mid C_i)$ (Chapters 4 and 5)

Semiparametric: $p(x \mid C_i)$ is a mixture of densities

Multiple possible explanations/prototypes:

Different handwriting styles, accents in speech

Nonparametric: No model; data speaks for itself (Chapter 8)


Mixture Densities

$p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)$

where $G_i$ are the components/groups/clusters, $P(G_i)$ the mixture proportions (priors), and $p(x \mid G_i)$ the component densities.

Gaussian mixture, where $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$, with parameters $\Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}$ estimated from an unlabeled sample $X = \{x^t\}_t$ (unsupervised learning).
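As an illustration of evaluating such a mixture density, here is a minimal NumPy sketch (my example, not from the lecture notes), assuming k Gaussian components with given priors, means, and covariances:

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density N(mean, cov) evaluated at a point x."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def mixture_density(x, priors, means, covs):
    """p(x) = sum_i p(x | G_i) P(G_i) for a Gaussian mixture."""
    return sum(P * gaussian_pdf(x, m, S) for P, m, S in zip(priors, means, covs))

# Two illustrative 2-D components (numbers are made up for the example)
priors = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
print(mixture_density(np.array([1.0, 1.0]), priors, means, covs))
```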


Classes vs. Clusters

Supervised: $X = \{x^t, r^t\}_t$

Classes $C_i$, $i = 1, \ldots, K$

$p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)$

where $p(x \mid C_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$

$\Phi = \{P(C_i), \mu_i, \Sigma_i\}_{i=1}^{K}$, estimated from the labeled sample:

$\hat{P}(C_i) = \dfrac{\sum_t r_i^t}{N}$

$m_i = \dfrac{\sum_t r_i^t x^t}{\sum_t r_i^t}$

$S_i = \dfrac{\sum_t r_i^t (x^t - m_i)(x^t - m_i)^T}{\sum_t r_i^t}$

Unsupervised: $X = \{x^t\}_t$

Clusters $G_i$, $i = 1, \ldots, k$

$p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)$

where $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$

$\Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}$

Labels $r_i^t$?
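In the supervised case these estimates follow directly from the labeled sample; a minimal NumPy sketch (mine, not the book's code), assuming `X` is an N×d data matrix and `R` an N×K one-hot label matrix:

```python
import numpy as np

def estimate_class_params(X, R):
    """Priors P(C_i), means m_i, covariances S_i from a labeled sample (X, R)."""
    N, d = X.shape
    K = R.shape[1]
    priors = R.sum(axis=0) / N                      # P(C_i) = sum_t r_i^t / N
    means = (R.T @ X) / R.sum(axis=0)[:, None]      # m_i = sum_t r_i^t x^t / sum_t r_i^t
    covs = np.empty((K, d, d))
    for i in range(K):
        diff = X - means[i]                          # (x^t - m_i)
        covs[i] = (R[:, i, None] * diff).T @ diff / R[:, i].sum()
    return priors, means, covs
```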


k-Means Clustering

Find k reference vectors (prototypes/codebook vectors/codewords) which best represent the data.

Reference vectors $m_j$, $j = 1, \ldots, k$

Use the nearest (most similar) reference:

$\|x^t - m_i\| = \min_j \|x^t - m_j\|$

Reconstruction error:

$E(\{m_i\}_{i=1}^{k} \mid X) = \sum_t \sum_i b_i^t \|x^t - m_i\|^2$

where

$b_i^t = \begin{cases} 1 & \text{if } \|x^t - m_i\| = \min_j \|x^t - m_j\| \\ 0 & \text{otherwise} \end{cases}$
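A compact NumPy sketch of this alternating procedure (my illustration; the function and variable names are mine), assuming `X` is an N×d data matrix:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Alternate assignment (b_i^t) and re-estimation of the reference vectors m_i."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=k, replace=False)]  # initial prototypes drawn from the data
    for _ in range(n_iters):
        # Assignment step: b_i^t = 1 for the nearest prototype
        labels = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2).argmin(axis=1)
        # Update step: each m_i becomes the mean of the instances assigned to it
        new_m = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else m[i]
                          for i in range(k)])
        if np.allclose(new_m, m):
            break
        m = new_m
    labels = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2).argmin(axis=1)
    error = np.sum((X - m[labels]) ** 2)              # reconstruction error E
    return m, labels, error
```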


Encoding/Decoding

$b_i^t = \begin{cases} 1 & \text{if } \|x^t - m_i\| = \min_j \|x^t - m_j\| \\ 0 & \text{otherwise} \end{cases}$


k-means Clustering



Expectation-Maximization (EM)

Log likelihood with a mixture model

Assume hidden variables z which, when known, make optimization much simpler.

Complete likelihood, $\mathcal{L}_C(\Phi \mid X, Z)$, in terms of x and z

Incomplete likelihood, $\mathcal{L}(\Phi \mid X)$, in terms of x:

$\mathcal{L}(\Phi \mid X) = \log \prod_t p(x^t \mid \Phi) = \sum_t \log \sum_{i=1}^{k} p(x^t \mid G_i)\, P(G_i)$


E- and M-steps

Iterate the two steps:

1. E-step: Estimate z given X and current Φ
2. M-step: Find new Φ′ given z, X, and old Φ

E-step: $Q(\Phi \mid \Phi^l) = E\left[\mathcal{L}_C(\Phi \mid X, Z) \mid X, \Phi^l\right]$

M-step: $\Phi^{l+1} = \arg\max_{\Phi} Q(\Phi \mid \Phi^l)$

An increase in Q increases the incomplete likelihood:

$\mathcal{L}(\Phi^{l+1} \mid X) \geq \mathcal{L}(\Phi^{l} \mid X)$


EM in Gaussian Mixtures

$z_i^t = 1$ if $x^t$ belongs to $G_i$, 0 otherwise (the labels $r_i^t$ of supervised learning); assume $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$

E-step:

$E[z_i^t \mid X, \Phi^l] = \dfrac{p(x^t \mid G_i, \Phi^l)\, P(G_i)}{\sum_j p(x^t \mid G_j, \Phi^l)\, P(G_j)} \equiv h_i^t$

M-step:

$P(G_i) = \dfrac{\sum_t h_i^t}{N}$

$m_i^{l+1} = \dfrac{\sum_t h_i^t x^t}{\sum_t h_i^t}$

$S_i^{l+1} = \dfrac{\sum_t h_i^t (x^t - m_i^{l+1})(x^t - m_i^{l+1})^T}{\sum_t h_i^t}$

Use the estimated labels $h_i^t$ in place of the unknown labels $r_i^t$.
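These updates translate almost line by line into NumPy; the following is a sketch of one possibility (mine, not the book's code) for a Gaussian mixture with full covariance matrices, with a small ridge added to each covariance for numerical stability:

```python
import numpy as np

def em_gmm(X, k, n_iters=50, seed=0):
    """EM for a Gaussian mixture: the E-step computes h_i^t, the M-step re-estimates P(G_i), m_i, S_i."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    priors = np.full(k, 1.0 / k)                     # P(G_i), initially uniform
    means = X[rng.choice(N, size=k, replace=False)]  # m_i, initialized from the data
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        # E-step: h_i^t = P(G_i) p(x^t | G_i) / sum_j P(G_j) p(x^t | G_j)
        h = np.empty((N, k))
        for i in range(k):
            diff = X - means[i]
            inv = np.linalg.inv(covs[i])
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[i]))
            h[:, i] = priors[i] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
        h /= h.sum(axis=1, keepdims=True)
        # M-step: soft labels h_i^t take the place of the unknown r_i^t
        Nk = h.sum(axis=0)
        priors = Nk / N
        means = (h.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (h[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return priors, means, covs, h
```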


[Figure: contour where $P(G_1 \mid x) = h_1 = 0.5$]


Mixtures of Latent Variable Models

Regularize clusters

1. Assume shared/diagonal covariance matrices

2. Use PCA/FA to decrease dimensionality: Mixtures of PCA/FA

Can use EM to learn Vi (Ghahramani and Hinton, 1997; Tipping and Bishop, 1999)

$p(x^t \mid G_i) = \mathcal{N}\!\left(m_i,\; V_i V_i^T + \Psi_i\right)$


After Clustering

Dimensionality reduction methods find correlations between features and group features; clustering methods find similarities between instances and group instances.

Allows knowledge extraction through the number of clusters, prior probabilities, and cluster parameters, i.e., center and range of features.

Example: CRM, customer segmentation


Clustering as Preprocessing

Estimated group labels $h_j$ (soft) or $b_j$ (hard) may be seen as the dimensions of a new k-dimensional space, where we can then learn our discriminant or regressor.

Local representation (only one $b_j$ is 1, all others are 0; only a few $h_j$ are nonzero) vs. distributed representation (after PCA; all $z_j$ are nonzero)
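A small sketch of this idea (mine; the helper names are assumptions), computing hard one-hot features b and soft membership features h from a set of prototypes m obtained, for example, by k-means:

```python
import numpy as np

def hard_features(X, m):
    """b^t: one-hot encoding of the nearest prototype (local representation)."""
    labels = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2).argmin(axis=1)
    B = np.zeros((len(X), len(m)))
    B[np.arange(len(X)), labels] = 1.0
    return B

def soft_features(X, m, beta=1.0):
    """h^t: soft memberships derived from squared distances to the prototypes."""
    d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-beta * d2)
    return H / H.sum(axis=1, keepdims=True)

# The resulting k-dimensional features can be fed to any discriminant or regressor.
```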


Mixture of Mixtures

In classification, the input comes from a mixture of classes (supervised).

If each class is also a mixture, e.g., of Gaussians, (unsupervised), we have a mixture of mixtures:

$p(x \mid C_i) = \sum_{j=1}^{k_i} p(x \mid G_{ij})\, P(G_{ij})$

$p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)$


Hierarchical Clustering

Cluster based on similarities/distances

Distance measure between instances $x^r$ and $x^s$:

Minkowski ($L_p$) (Euclidean for $p = 2$):

$d_m(x^r, x^s) = \left[ \sum_{j=1}^{d} (x_j^r - x_j^s)^p \right]^{1/p}$

City-block distance:

$d_{cb}(x^r, x^s) = \sum_{j=1}^{d} \left| x_j^r - x_j^s \right|$
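Both distances are one-liners in NumPy; a quick sketch (mine) for two vectors of equal dimension:

```python
import numpy as np

def minkowski(xr, xs, p=2):
    """Minkowski (L_p) distance; p = 2 gives the Euclidean distance."""
    return np.sum(np.abs(xr - xs) ** p) ** (1.0 / p)

def city_block(xr, xs):
    """City-block (L_1) distance."""
    return np.sum(np.abs(xr - xs))

xr, xs = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(xr, xs))   # 5.0 (Euclidean)
print(city_block(xr, xs))  # 7.0
```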


Agglomerative Clustering

Start with N groups, each with one instance, and merge the two closest groups at each iteration.

Distance between two groups $G_i$ and $G_j$:

Single-link:

$d(G_i, G_j) = \min_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)$

Complete-link:

$d(G_i, G_j) = \max_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)$

Average-link, centroid
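In practice agglomerative clustering is rarely coded by hand; a short sketch (mine) using SciPy's hierarchical clustering routines, which implement the single, complete, and average linkage criteria above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])  # two synthetic groups

Z = linkage(X, method='single')                    # also 'complete' or 'average'
labels = fcluster(Z, t=2, criterion='maxclust')    # cut the dendrogram into two groups
print(labels)
```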


Example: Single-Link Clustering

[Figure: dendrogram]


Choosing k

Defined by the application, e.g., image quantization

Plot data (after PCA) and check for clusters

Incremental (leader-cluster) algorithm: Add one at a time until “elbow” (reconstruction error/log likelihood/intergroup distances)

Manual check for meaning
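A common way to look for the “elbow” is to plot the k-means reconstruction error against k; a minimal sketch (mine) on synthetic data, using scikit-learn's KMeans, whose `inertia_` attribute is the reconstruction error:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])  # three synthetic blobs

for k in range(1, 8):
    err = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(err, 1))   # the error typically flattens ("elbow") near the true k
```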