chapter 1 introduction to clustering. section 1.1 introduction

30
Chapter 1 Introduction to Clustering

Post on 18-Dec-2015

223 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

Chapter 1

Introduction to Clustering

Page 2: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

Section 1.1

Introduction

Page 3: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

3

Objectives Introduce clustering and unsupervised learning. Explain the various forms of cluster analysis. Outline several key distance metrics used as

estimates of experimental unit similarity.

Page 4: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

4

Course OverviewVariable Selection

VARCLUS

Plot DataPRINCOMP,MDS,CANDISC

PreprocessingACECLUS

‘Fuzzy’ ClusteringFACTOR

Discrete Clustering

Hierarchical ClusteringCLUSTER

Optimization Clustering

Parametric ClusteringFASTCLUS

Non-Parametric ClusteringMODECLUS

Page 5: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

5

“Cluster analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual.”

B. S. Everitt (1998), “The Cambridge Dictionary of Statistics”

Definition

Cluster Solution

Sensible Interpretable

Un-interpretable

Given Class Derived Class

Page 6: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

6

Learning without a priori knowledge about the classification of samples; learning without a teacher.

Kohonen (1995), “Self-Organizing Maps”

Unsupervised Learning

Page 7: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

Section 1.2

Types of Clustering

Page 8: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

8

Distinguish between the two major classes of clustering methods:

– hierarchical clustering– optimization (partitive) clustering.

Objectives

Page 9: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

9

Hierarchical Clustering Agglomerative DivisiveIteration

1

2

3

4

Page 10: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

10

Propagation of ErrorsIteration

1

2

3

4

(error)

(error)

(error)

Page 11: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

11

Optimization (Partitive) Clustering

“Seeds” Observations

XX

X

X

Initial State Final State

Old location

X

XX X

X

XX

X

New location

Page 12: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

12

Heuristic Search

1. Find an initial partition of the n objects into g groups.

2. Calculate the change in the error function produced by moving each observation from its own cluster to another group.

3. Make the change resulting in the greatest improvement in the error function.

4. Repeat steps 2 and 3 until no move results in improvement.

Page 13: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

Section 1.3

Similarity Metrics

Page 14: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

14

Define similarity and what comprises a good measure of similarity.

Describe a variety of similarity metrics.

Objectives

Page 15: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

15

Although the concept of similarity is fundamental to our thinking, it is also often difficult to precisely quantify.

Which is more similar to a duck: a crow or a penguin?

The metric that you choose to operationalize similarity (for example, Euclidean distance or Pearson correlation) often impacts the clusters you recover.

What Is Similarity?

Page 16: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

16

The following principles have been identified as a foundation of any good similarity metric:

1. symmetry: d(x,y) = d(y,x)

2. non-identical distinguishability: if d(x,y) 0 then x y

3. identical non-distinguishability: if d(x,y) = 0 then x = y

Some popular similarity metrics (for example, correlation) fail to meet one or more of these criteria.

What Makes a Good Similarity Metric?

Page 17: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

17

Euclidean Distance Similarity Metric

Pythagorean Theorem: The square of the hypotenuse is equal to the sum of the squares of the other two sides.

d

iiiE wxD

1

2

x1

x2

(x1, x2)

(0, 0)

2

1

22

iixh

Page 18: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

18

City block (Manhattan) distance is the distance between two points measured along axes at right angles.

d

iiiM wxD

1

1

City Block Distance Similarity Metric

(w1,w2)

(x1,x2)

Page 19: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

19

Similar

...

..

.. .

.

..

. .

Tom

Mar

ieCorrelation Similarity Metrics

Dissimilar

..

....

. ..

..

. .Jerry

Mar

ie

Tom

.

.

.... ...

..

.

.

Jerry

No Similarity

Page 20: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

20

The Problem with CorrelationVariable Observation 1 Observation 2

x1 5 51

x2 4 42

x3 3 33

x4 2 24

x5 1 15

Mean 3 33

Std. Dev. 1.5811 14.2302

The correlation between observations 1 and 2 is a perfect 1.0, but are the observations really similar?

Page 21: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

21

i

ii nv

nf ˆ

Density Estimate Based Similarity Metrics

Clusters can be seen as areas of increased observation density. Similarity is a function of the distance between the identified density bubbles (hyper-spheres).

similarity

Density Estimate 1(Cluster 1)

Density Estimate 2(Cluster 2)

Page 22: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

22

1 2 3 4 5 … 17

Gene A 0 1 1 0 0 1 0 0 1 0 0 1 1 1 0 0 1

Gene B 0 1 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1

DH = 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 1 0 = 5

Gene expression levels under 17 conditions

(low=0, high=1)

d

iiiH wx D

1

Hamming Distance Similarity Metric

Page 23: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

23

The DISTANCE ProcedureGeneral form of the DISTANCE procedure:

Both the PROC DISTANCE statement and the VAR statement are required.

PROC DISTANCE METHOD=method <options> ;COPY variables;VAR level (variables < / option-list >) ;

RUN;

PROC DISTANCE METHOD=method <options> ;COPY variables;VAR level (variables < / option-list >) ;

RUN;

Page 24: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

24

This demonstration illustrates the impact on cluster formation of two distance metrics generated by the DISTANCE procedure.

Generating Distances ch1s3d1

Page 25: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

Section 1.4

Classification Performance

Page 26: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

26

Use classification matrices to determine the quality of a proposed cluster solution.

Use the chi-square and Cramer’s V statistic to assess the relative strength of the derived association.

Objectives

Page 27: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

27

Perfect Solution

Quality of the Cluster Solution

Typical Solution

No Solution

Page 28: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

28

Probability of Cluster Assignment

Frequency

The probability that a cluster number represents a given class is given by the cluster’s proportion of the row total.

Probability

Page 29: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

29

The Chi-Square Statistic

i j ij

ijij

expected

) expected observed( 22

The chi-square statistic (and associated probability)• determine whether an association exists• depend on sample size• do not measure the strength of the association.

Page 30: Chapter 1 Introduction to Clustering. Section 1.1 Introduction

30

Measuring Strength of an Association

WEAK STRONG

0 1

CRAMER'S V STATISTIC

)1,1min(

/V sCramer'

2

cr

n

Cramer’s V ranges from -1 to 1 for 2X2 tables.