unsupervised analysis goal a: find groups of genes that have correlated expression profiles. these...


Upload: peregrine-curtis

Post on 18-Dec-2015


Page 1: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL

UNSUPERVISED ANALYSIS

• GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL PROCESS.

• GOAL B: DIVIDE TISSUES INTO GROUPS WITH SIMILAR GENE EXPRESSION PROFILES. THESE TISSUES ARE EXPECTED TO BE IN THE SAME BIOLOGICAL (CLINICAL) STATE.

CLUSTERING
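Goal A rests on a measure of how strongly two expression profiles are correlated. The slides do not name a specific measure; the sketch below assumes the Pearson correlation and turns it into a distance (1 minus correlation). The gene profiles are made-up numbers used only for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def correlation_distance(x, y):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated ones."""
    return 1.0 - pearson(x, y)

# Made-up expression levels of three genes measured in five tissues.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.1, 3.9, 6.2, 8.0, 9.8]  # rises together with gene_a
gene_c = [5.0, 1.0, 4.0, 2.0, 3.0]  # does not follow gene_a

print(correlation_distance(gene_a, gene_b))  # close to 0
print(correlation_distance(gene_a, gene_c))  # noticeably larger
```

With a distance like this, genes a and b would be candidates for the same group, while gene c would not.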


Page 2

Giraffe

DEFINITION OF THE CLUSTERING PROBLEM

Page 3

CLUSTER ANALYSIS YIELDS DENDROGRAM

T (RESOLUTION)

Page 4

Giraffe + Okapi

BUT WHAT ABOUT THE OKAPI?

Page 5

STATEMENT OF THE PROBLEM

GIVEN DATA POINTS Xi, i = 1, 2, ..., N, EMBEDDED IN D-DIMENSIONAL SPACE, IDENTIFY THE UNDERLYING STRUCTURE OF THE DATA.

AIMS: PARTITION THE DATA INTO M CLUSTERS, POINTS OF SAME CLUSTER - "MORE SIMILAR"

M ALSO TO BE DETERMINED!

GENERATE DENDROGRAM, IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS

"ILL POSED": WHAT IS "MORE SIMILAR"?

RESOLUTION

Page 6

CLUSTER ANALYSIS YIELDS DENDROGRAM

[Figure: dendrogram; axis label: T; caption: LINEAR ORDERING OF DATA; end labels: YOUNG, OLD]

Page 7

Agglomerative Hierarchical Clustering

[Figure: data points labeled 1 to 5 and their dendrogram (leaf labels 5 2 4 1 3); axis label: Distance between joined clusters]

Need to define the distance between the new cluster and the other clusters.

Single Linkage: distance between closest pair.

Complete Linkage: distance between farthest pair.

Average Linkage: average distance between all pairs, or distance between cluster centers.

The dendrogram induces a linear ordering of the data points.
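The linkage rules above can be sketched in a few lines. This is a naive illustration, not an efficient or authoritative implementation: the function names, the Euclidean metric, and the four 1-D example points are all assumptions made for the example. Each step greedily joins the two closest clusters and records the distance at which they were joined, which is the height of that merge in the dendrogram.

```python
from itertools import combinations
from math import dist

def linkage_distance(a, b, points, method):
    """Distance between clusters a and b, given as lists of point indices."""
    pairwise = [dist(points[i], points[j]) for i in a for j in b]
    if method == "single":      # closest pair
        return min(pairwise)
    if method == "complete":    # farthest pair
        return max(pairwise)
    if method == "average":     # average over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")

def agglomerate(points, method="single"):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Greedy step: join the two clusters that are currently closest.
        d, a, b = min(
            ((linkage_distance(a, b, points, method), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
        merges.append((a, b, d))
    return merges

# Illustrative 1-D points: the second merge differs between linkage rules.
points = [(0.0,), (1.0,), (2.2,), (3.5,)]
single = agglomerate(points, "single")      # joins point 2 to {0, 1}
complete = agglomerate(points, "complete")  # joins points 2 and 3 first
```

On these points, single linkage grows one cluster point by point, while complete linkage pairs up the two right-hand points before the final merge, so the two dendrograms differ.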

Page 8

Hierarchical Clustering - Summary

• Results depend on distance update method

• Greedy iterative process

• NOT robust against noise

• No inherent measure to identify stable clusters

Page 9

2 good clouds

COMPACT WELL SEPARATED CLOUDS – EVERYTHING WORKS

Page 10

2 flat clouds

2 FLAT CLOUDS - SINGLE LINKAGE WORKS

Page 11

filament

SINGLE LINKAGE SENSITIVE TO NOISE
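One way to see this sensitivity: cutting a single-linkage dendrogram at height t gives the connected components of the graph that links every pair of points at distance at most t. A thin filament of noise points between two clouds is then enough to fuse them. The sketch below uses that equivalence with a small union-find; the coordinates and the threshold are made up for illustration.

```python
from math import dist

def single_linkage_clusters(points, threshold):
    """Clusters from cutting a single-linkage dendrogram at `threshold`.

    Computed as connected components of the graph joining every pair of
    points whose distance is at most `threshold`.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Two compact clouds, and a filament of noise points bridging the gap.
left = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
right = [(6.0, 0.0), (6.0, 1.0), (7.0, 0.0), (7.0, 1.0)]
filament = [(2.0, 0.5), (3.0, 0.5), (4.0, 0.5), (5.0, 0.5)]

clean = single_linkage_clusters(left + right, threshold=1.2)
noisy = single_linkage_clusters(left + right + filament, threshold=1.2)
print(len(clean), len(noisy))
```

Without the filament the two clouds stay separate; with it, every point is chained into a single cluster at the same threshold.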

Page 12

Average linkage

[Figure: data points labeled 1 to 5 and their dendrogram (leaf labels 5 2 4 1 3); axis label: Distance between joined clusters]

Need to define the distance between the new cluster and the other clusters.

Average Linkage: average distance between all pairs

Mean Linkage: distance between centroids
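The two definitions can give quite different numbers. A small sketch, assuming Euclidean distance and two made-up clusters that share the same centroid:

```python
from math import dist

def average_linkage(a, b):
    """Average of all pairwise distances between clusters a and b."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def mean_linkage(a, b):
    """Distance between the centroids of clusters a and b."""
    return dist(centroid(a), centroid(b))

# Illustrative clusters: a horizontal pair and a vertical pair around the origin.
cluster_a = [(-1.0, 0.0), (1.0, 0.0)]
cluster_b = [(0.0, -1.0), (0.0, 1.0)]

print(average_linkage(cluster_a, cluster_b))  # every pair is sqrt(2) apart
print(mean_linkage(cluster_a, cluster_b))     # both centroids sit at the origin
```

Here mean linkage reports zero distance even though no point of one cluster coincides with a point of the other, while average linkage still sees them as separated.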

Page 13

nature 2002 breast cancer

Page 14

STATEMENT OF THE PROBLEM

GIVEN DATA POINTS Xi, i = 1, 2, ..., N, EMBEDDED IN D-DIMENSIONAL SPACE, IDENTIFY THE UNDERLYING STRUCTURE OF THE DATA.

AIMS: PARTITION THE DATA INTO M CLUSTERS, POINTS OF SAME CLUSTER - "MORE SIMILAR"

M ALSO TO BE DETERMINED!

GENERATE DENDROGRAM, IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS

"ILL POSED": WHAT IS "MORE SIMILAR"?

RESOLUTION

Page 15

how many clusters?

3 LARGE

MANY small (SPC)

toy problem SPC

Page 16

other methods

Page 17

K-means

Iteration = 0

• Start with random positions of centroids.

Page 18

K-means

Iteration = 1

• Start with random positions of centroids.

• Assign data points to centroids

Page 19

K-means

Iteration = 1

• Start with random positions of centroids.

• Assign data points to centroids

• Move centroids to center of assigned points

Page 20

K-means

Iteration = 3

• Start with random positions of centroids.

• Assign data points to centroids

• Move centroids to center of assigned points

• Iterate till minimal cost
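The four steps can be sketched as follows. Some choices here are assumptions rather than anything stated on the slides: the initial centroids are drawn from the data points themselves, the cost is the sum of squared Euclidean distances to the assigned centroid, and iteration stops when the centroids no longer move (which is a local minimum of the cost, not necessarily the global one).

```python
import random
from math import dist

def kmeans(points, k, seed=0, max_iter=100):
    """Minimal k-means sketch. `points` is a list of coordinate tuples."""
    rng = random.Random(seed)
    # Step 1: random initial centroids (here: k distinct data points).
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster stays put
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    cost = sum(dist(p, centroids[label]) ** 2
               for p, label in zip(points, labels))
    return centroids, labels, cost

# Two compact, well-separated clouds of made-up points.
cloud_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
                (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids, labels, cost = kmeans(cloud_points, k=2)
print(labels, cost)
```

Changing `seed` changes the starting centroids; on harder data than this toy example, different seeds can end in different partitions with different costs.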

Page 21

K-means - Summary

• Result depends on initial centroids’ position

• Fast algorithm: compute distances from data points to centroids

• Must preset K

• Fails for non-spherical distributions

Page 22

TSS vs K
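A curve of this kind can be illustrated on toy data, assuming TSS here means the total within-cluster sum of squared distances to the cluster mean (the slide does not spell this out). To keep the example deterministic, the sketch does not run k-means: for 1-D data the best clusters are contiguous runs of the sorted values, so it simply tries every way of cutting the sorted list into K runs. The data values are made up.

```python
from itertools import combinations

def sum_sq(values):
    """Sum of squared deviations of `values` from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_tss(sorted_values, k):
    """Smallest total within-cluster sum of squares over all splits of the
    sorted 1-D values into k contiguous groups (brute force, toy sizes only)."""
    n = len(sorted_values)
    best = float("inf")
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        tss = sum(sum_sq(sorted_values[a:b])
                  for a, b in zip(bounds, bounds[1:]))
        best = min(best, tss)
    return best

# Made-up 1-D data with three obvious groups near 1, 5 and 9.
values = sorted([1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.0, 9.1, 8.9])
curve = {k: best_tss(values, k) for k in range(1, 6)}
for k, tss in curve.items():
    print(k, round(tss, 3))
```

The cost drops sharply up to K = 3 and only marginally afterwards. TSS always decreases as K grows, so the curve suggests K through such a bend rather than through its minimum.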

Page 23

Iris setosa

Iris versicolor

Iris virginica

50 specimens from each group

4 numbers for each flower

150 data points in 4-dimensional space

Page 24

150 points in d=4

3 large clusters

Page 25

Output of SPC

Stable clusters “live” for large T

Page 26

Choosing a value for T

Page 27

Same data - Average Linkage

No analog for

Page 28

Same data - Average Linkage

Examining this cluster

Page 29

[Fig. 2A: Coupled Two-Way Clustering (CTWC) of 358 Genes and 36 Samples. Axis labels: GENES, T. Sample labels: A(II), ScGBM, PrGBM, CL. Cluster labels: S1(G1), S2, S3, G12, G5.]

GLIOBLASTOMA: M. HEGI et al., CHUV, CLONTECH ARRAYS

Page 30

AB004904  STAT-induced STAT inhibitor 3
M32977    VEGF
M35410    IGFBP2
X51602    VEGFR1
M96322    gravin
AB004903  STAT-induced STAT inhibitor 2
X52946    PTN
J04111    c-jun
X79067    TIS11B

[Fig. 2B: Super-Paramagnetic Clustering of All Samples Using Stable Gene Cluster G5. Cluster labels: S1(G5), S10, S11, S12, S13, S14.]