Cluster Analysis for Gene Expression Data
CSE 5615 Class Project, Group 2
Stephen Connetti, Sally Ellingson, Luis Quiles


Gene Expression Analysis

• Virus, bacteria, and cellular function research

• Map unknown genes to cellular functions

• Map illness to malfunctioning cellular processes

• Measuring gene expression – comparing expression of proteins under different situations against a base situation

Gene Expression ‐ Data

• To measure gene expression, measure the mRNA being produced vs. base production

• Data: mRNA production per microarray test

Gene     E. coli                            EHEC
         1 hr   6 hr   12 hr  24 hr        1 hr   2 hr    6 hr   12 hr  24 hr
GCSF     0.083  2.615  2.007  1.96         0.001  0.714   3.642  3.138  2.229
GMCSF    0.722  2.002  0.940  1.21         1.034  1.430   2.961  2.920  2.352
IL12B    0.845  4.77   4.369  3.454        0.426  -0.426  4.316  4.816  3.671
IL1RN    0.548  1.732  1.938  1.389        0.781  -1.27   1.344  1.419  1.301
IL6      3.006  5.244  3.897  3.957        3.889  4.106   4.396  4.137  3.990

Gene Expression ‐ Clustering

• Groups together data with similar properties

• Data sets are partitioned – clusters contain points more similar to each other than to points in other clusters

• Clustering helps researchers infer relationships between genes, especially when cellular functions are known

• Also helps identify relationships among co-expressed genes

Human Macrophage Activation Programs induced by Pathogens

• Originating source of our data

• 6,800 genes, 43 microarray tests

• Significance tests reduce this to 977 genes

• Results were clustered to find relevance

• 198 genes expressed with the same pattern

Gerard J. Nau, Joan F. L. Richmond, Ann Schlesinger, Ezra G. Jennings, Eric S. Lander, and Richard A. Young. Human macrophage activation programs induced by bacterial pathogens. Proceedings of the National Academy of Sciences of the United States of America, Vol. 99, No. 3, Feb. 5, 2002.

Project Goals

• Implement 3 clustering algorithms – Bayesian clustering, self-organizing maps, CLICK

• Find most similar clusters – relevant similarity

• Compare the paper's best cluster to our results – how many of the 198 genes did we find?

K‐Means

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.

Data

Initial Clusters

K‐Means


Contributes to Closest Cluster

K‐Means

[31] Jiang, D., Tang, C., and Zhang, A. Cluster Analysis for Gene Expression Data: A Survey.

Clusters adjust
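The three K-Means panels above (initial clusters, contribution to the closest cluster, cluster adjustment) follow Lloyd's iteration, which can be sketched as below. The function name and data are illustrative, not part of the project code.

```python
import numpy as np

def kmeans(data, k, iterations=100, seed=0):
    """Lloyd's k-means: assign each point to its closest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iterations):
        # distance of every point to every centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On well-separated expression profiles, the assignment and adjustment steps converge in a handful of iterations.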

Bayesian Clustering


Data

Initial Clusters

Bayesian Clustering


High Contribution to Closest Cluster

Low Contribution to Other Cluster

Bayesian Clustering


Clusters adjust

Bayesian Clustering


Weak contributors removed

Bayesian Clustering

• Initialize clusters

• Until convergence:
  – Expectation: compute new probabilities
  – Maximization: compute new clusters

• Use BIC to prune clusters

• Compute global BIC

• Repeat with different initial clusters – keep the results with the best BIC

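The expectation/maximization loop above can be sketched as follows, under the simplifying assumption of spherical clusters with a fixed, shared variance; BIC pruning and restarts are omitted, and all names are illustrative.

```python
import numpy as np

def em_cluster(data, k, iterations=50, sigma=1.0, seed=0):
    """EM with spherical, fixed-variance Gaussian clusters.
    E-step: membership probabilities via the normal density and Bayes' law.
    M-step: each mean becomes the probability-weighted average of the points."""
    rng = np.random.default_rng(seed)
    means = data[rng.choice(len(data), size=k, replace=False)]
    weights = np.full(k, 1.0 / k)  # cluster priors
    for _ in range(iterations):
        # E-step: unnormalized densities, then normalize across clusters
        sq = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        p = weights * np.exp(-sq / (2 * sigma ** 2))
        p /= p.sum(axis=1, keepdims=True)
        # M-step: probability-weighted means and updated priors
        means = (p[:, :, None] * data[:, None, :]).sum(axis=0) / p.sum(axis=0)[:, None]
        weights = p.mean(axis=0)
    return p, means
```

Unlike k-means, every point contributes to every cluster, weighted by its membership probability.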

Normal Distribution

Univariate, with mean $x_0$ and variance $\sigma^2$:

$$N(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - x_0)^2 / 2\sigma^2}$$

Multivariate, with mean $\mu$ and covariance $\Sigma$:

$$N(x) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}}\, e^{-(x - \mu)^T \Sigma^{-1} (x - \mu) / 2}$$

Membership probability of point $d_i$ in cluster $k$:

$$P(d_i \in \mu_k^t) = \frac{1}{(2\pi)^{n/2}\, |\Sigma_k|^{1/2}}\, e^{-(d_i - \mu_k^t)^T \Sigma_k^{-1} (d_i - \mu_k^t) / 2}$$

Expectation: Non-spherical Clusters

$$\Sigma_k = D_k \Lambda_k D_k^T, \qquad \Lambda_k = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$$

• D_k = eigenvectors of Σ, the orientation of cluster k

• λ_1, …, λ_n = eigenvalues of Σ, the radii of cluster k

• Spherical clusters are usually sufficient (identity matrix)
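The split of a cluster covariance into eigenvectors (orientation) and eigenvalues (radii) described above can be read off numerically; the covariance below is a hypothetical elongated, tilted 2-D cluster.

```python
import numpy as np

# Hypothetical covariance of an elongated, tilted cluster
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

# Sigma = D @ diag(eigenvalues) @ D.T
eigenvalues, D = np.linalg.eigh(Sigma)

# Columns of D give the cluster's orientation; the eigenvalues give
# its spread along each principal axis.
reconstructed = D @ np.diag(eigenvalues) @ D.T
print(np.allclose(reconstructed, Sigma))  # True
```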

Compute initial probabilities

• Given the initial clusters, initialize the probabilities with the normal distribution

• Since the probabilities will have to be normalized anyway, we can skip the constant factor:

$$P(d_i \in \mu_k^0) = e^{-(d_i - \mu_k^0) \cdot (d_i - \mu_k^0) / 2}$$

Expectation: Bayes' Law

$$P(\mathrm{Cause} \mid \mathrm{Effect}) = \frac{P(\mathrm{Effect} \mid \mathrm{Cause}) \cdot P(\mathrm{Cause})}{P(\mathrm{Effect})}$$

(the left-hand side is the posterior; P(Cause) is the prior)

Expanding the denominator over all causes:

$$P(\mathrm{Cause} \mid \mathrm{Effect}) = \frac{P(\mathrm{Effect} \mid \mathrm{Cause}) \cdot P(\mathrm{Cause})}{\sum P(\mathrm{Effect} \mid \mathrm{Cause}) \cdot P(\mathrm{Cause})}$$

Expectation: Bayes' Law

Recompute the membership probabilities using the normal distribution; the new probability (left) is computed from the old probabilities (right):

$$P(d_i \in \mu_k^{t+1}) = \frac{P(d_i \in \mu_k^t) \cdot \sum_{i'=1}^{n} P(d_{i'} \in \mu_k^t)}{\sum_{j=1}^{C} \left[ P(d_i \in \mu_j^t) \cdot \sum_{i'=1}^{n} P(d_{i'} \in \mu_j^t) \right]}$$

Maximization: New Clusters

• Clusters are computed as the weighted average of each point and the probability it belongs to the cluster:

$$\mu_k^{t+1} = \frac{\sum_{i=1}^{n} P(d_i \in \mu_k^{t+1}) \cdot d_i}{\sum_{i=1}^{n} P(d_i \in \mu_k^{t+1})}$$

Bayesian Information Criterion

• BIC measures the efficiency of the parameterized model in terms of predicting the data

• It is independent of prior knowledge and the model used

$$BIC_k^t = -2 \sum_{i=1}^{n} \ln P(d_i \in \mu_k^t) + k \ln(n)$$

(first term: fit quality; second term: complexity penalty)

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
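The fit-versus-complexity trade-off can be evaluated directly. This sketch uses the conventional form BIC = −2 ln L + k ln n, where lower is better; the log-likelihoods and cluster counts are hypothetical.

```python
import math

def bic(log_likelihood, num_params, n):
    """Bayesian Information Criterion: fit quality plus a complexity
    penalty that grows with the number of free parameters. Lower is better."""
    return -2.0 * log_likelihood + num_params * math.log(n)

# Hypothetical: a 9-cluster model fits slightly better than a 3-cluster one,
# but the penalty decides between them (n = 198 points).
print(bic(-250.0, 3, 198) < bic(-240.0, 9, 198))  # True
```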

Bayesian Information Criterion

• After convergence, compute the global BIC:

$$BIC^t = -2 \sum_{i=1}^{n} \sum_{j=1}^{k} \ln P(d_i \in \mu_j^t) + k \ln(n)$$

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion

[Scatter plot: E. coli initial clusters (test data), hour 1 vs. hour 12 expression; legend lists the test genes GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR]

[Scatter plot: E. coli final clusters (test data), hour 1 vs. hour 12 expression; clusters 1-9]

[Scatter plots: rational vs. discrete clusters, hour 1 vs. hour 6 expression, for EHEC, S. aureus, E. coli, and S. typhimurium]

Self-Organizing Maps

• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values, so similar inputs are mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← random element of V
        BMU ← node of SOM minimizing d(v, node)
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
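A runnable version of the pseudocode above, assuming a 1-D grid of nodes and simple exponential decays for the learning rate and neighborhood radius; all names and constants are illustrative.

```python
import numpy as np

def train_som(vectors, grid_size=9, iterations=500, seed=0):
    """Train a 1-D SOM: pick a random training vector, find its best
    matching unit (BMU), and pull the BMU's neighborhood toward it."""
    rng = np.random.default_rng(seed)
    som = rng.uniform(vectors.min(), vectors.max(),
                      size=(grid_size, vectors.shape[1]))
    for t in range(iterations):
        v = vectors[rng.integers(len(vectors))]
        bmu = np.linalg.norm(som - v, axis=1).argmin()
        # decaying learning rate and neighborhood radius
        lr = 0.5 * np.exp(-t / iterations)
        radius = max(1.0, grid_size / 2 * np.exp(-t / iterations))
        for node in range(grid_size):
            dist = abs(node - bmu)
            if dist <= radius:
                influence = np.exp(-dist ** 2 / (2 * radius ** 2))
                som[node] += lr * influence * (v - som[node])
    return som

def map_input(som, v):
    """Mapping mode: return the index of the node closest to v."""
    return np.linalg.norm(som - v, axis=1).argmin()
```

After training, dissimilar inputs land on different grid nodes while similar inputs share a node or neighboring nodes.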

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79-96.
• Kohonen's Self-Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html

Best Matching Unit

• The BMU is the node at minimum distance from the training vector among all the nodes in the SOM. It is typically found using the Euclidean distance formula:

$$d(v, w) = \sqrt{\sum_{i=1}^{k} (v_i - w_i)^2}$$

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.


Neighborhood

• The area of the neighborhood shrinks over time. The exponential decay function can be used for this:

$$\sigma(t) = \sigma_0\, e^{-t/\lambda}$$

where n is the number of iterations that the algorithm will run, σ₀ is the initial size of the neighborhood, and λ is a time constant depending on n.
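The shrinking neighborhood can be implemented in a couple of lines. The time constant λ = n / ln(σ₀) is an assumption, chosen so the radius decays to one node by the final iteration; the values are illustrative.

```python
import math

def neighborhood_radius(t, n, sigma0):
    """Exponential decay of the neighborhood radius at iteration t,
    reaching one node at the final iteration t = n."""
    lam = n / math.log(sigma0)  # assumed time constant: sigma(n) = 1
    return sigma0 * math.exp(-t / lam)

print(round(neighborhood_radius(0, 1000, 5.0), 3))     # 5.0 at the start
print(round(neighborhood_radius(1000, 1000, 5.0), 3))  # 1.0 at the end
```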


Weight Adjustment

• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node:

$$W(t+1) = W(t) + \Theta(t)\, L(t)\, \big(V(t) - W(t)\big)$$

• One decay function, L(t), decreases the learning rate over time

• The other, Θ(t), decreases the rate of learning for neighbors further away from the BMU


Mapping Mode (SOM)

• After the learning algorithm has been run on the map, the input is mapped onto the SOM; similar n-dimensional inputs will be near each other on the two-dimensional map.

GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern

• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

• Use online or download

GenePattern
Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0.

Summary: The Self-Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. Resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, i.e., stocks, mutual funds, spectral peaks, etc.).

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged the input data into 9 clusters using the GenePattern software

• Why 9 clusters? – Used the average of Stephen's output (8) + 1 to use a square map (3×3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize graph G = (V, E)
  – Vertex v: a single gene's "fingerprint" vector

  – Edge e: pairwise similarity between genes

  – Cut C: subset of E that partitions the graph

  – Cluster c: subset of V

  – Intersection of clusters c_i, c_j (i ≠ j) is ∅

  – Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.

CLICK Preprocessing

• Input: n x p matrix M of values

• n genes, p tests

• Data must be normalized

• Similarity measure:
  – S_vu = v · u = |v||u| cos θ
  – Proportional only when the norm is fixed for all v ∈ V
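With each fingerprint normalized to unit length, the dot-product similarity above reduces to cos θ; a sketch with a hypothetical fingerprint matrix:

```python
import numpy as np

def similarity_matrix(M):
    """Pairwise similarity S_vu = v . u after normalizing each
    fingerprint (row) to unit norm, so that S_vu = cos(theta)."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    V = M / norms
    return V @ V.T

# Rows 0 and 1 point the same way; row 2 is orthogonal to both.
M = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
S = similarity_matrix(M)
```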

CLICK Similarity

• Key idea: S is a normalized mixed distribution

• For u, v ∈ V in the same cluster: S_uv ~ f(x | μ_T, σ_T²), the pdf (mean μ_T, variance σ_T²) for elements in the same cluster

• For u, v ∈ V in different clusters: S_uv ~ f(x | μ_F, σ_F²), the pdf (mean μ_F, variance σ_F²) for elements in different clusters

Basic CLICK

• M: n x p matrix (genes vs. test conditions)

• S_ij: dot product of v_i, v_j

• w_ij: probability that v_i, v_j are mates

$$f(S_{ij} \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(S_{ij} - \mu)^2 / 2\sigma^2}$$

$$w_{ij} = \ln \frac{p_{mates}\, f(S_{ij} \mid \mu_T, \sigma_T^2)}{(1 - p_{mates})\, f(S_{ij} \mid \mu_F, \sigma_F^2)} = \ln \frac{p_{mates}\, \sigma_F}{(1 - p_{mates})\, \sigma_T} + \frac{(S_{ij} - \mu_F)^2}{2\sigma_F^2} - \frac{(S_{ij} - \mu_T)^2}{2\sigma_T^2}$$
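The mate-probability edge weight w_ij on this slide has a closed form once the two Gaussians are substituted in; this sketch assumes the standard CLICK log-odds expression, with hypothetical parameter values.

```python
import math

def edge_weight(S_ij, p_mates, mu_T, sigma_T, mu_F, sigma_F):
    """Log-odds that an edge connects mates, from the two-Gaussian
    mixture model: w_ij = ln[(p * f_T) / ((1 - p) * f_F)],
    written in closed form."""
    return (math.log((p_mates * sigma_F) / ((1 - p_mates) * sigma_T))
            + (S_ij - mu_F) ** 2 / (2 * sigma_F ** 2)
            - (S_ij - mu_T) ** 2 / (2 * sigma_T ** 2))
```

Similarities near the mate mean get positive weights; similarities near the non-mate mean get negative weights.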

Basic CLICK

R: singleton set

Basic_CLICK(Graph G):
    If V(G) = {v} then R.add(v)            // v is a singleton
    Else if G is a kernel then Output(V(G))
    Else:
        (H, K) ← MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)

CLICK Kernel

• Decision problem: is V…
  – a singleton (|V| = 1)

  – a subset of 2+ clusters (need to partition more)

  – a subset of a single cluster (a kernel)

CLICK Kernel

• For all possible cuts C connecting V:
  – H_0^C: cut C disconnects two clusters

  – H_1^C: cut C partitions a kernel

  – If Pr(H_0^C) > Pr(H_1^C) for any C, then V is not a kernel

• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

$$W(C) = \ln \frac{\Pr(H_1^C \mid C)}{\Pr(H_0^C \mid C)} = \sum_{(i,j) \in C} w_{ij}$$

Minimal Weight Cut

• Choose one vertex as the source, mark it visited

• Repeatedly mark the next most highly connected vertex

• The last node t represents the cut
  – W(C_t) = Σ_i w_it

  – Merge t with the 2nd-to-last marked node s

  – Remove t from V

  – Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
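The Stoer-Wagner procedure above can be implemented directly on a weighted adjacency matrix; a compact sketch (variable names are illustrative):

```python
def min_weight_cut(weights):
    """Stoer-Wagner minimum cut on a symmetric adjacency matrix with a
    zero diagonal. Each phase grows a set from one vertex by always
    adding the most strongly connected vertex; the last vertex added
    defines a cut-of-the-phase and is then merged with the one before it."""
    n = len(weights)
    w = [row[:] for row in weights]
    vertices = list(range(n))
    merged = {i: [i] for i in range(n)}  # originals inside each supernode
    best_value, best_cut = float("inf"), []
    while len(vertices) > 1:
        a = [vertices[0]]
        rest = vertices[1:]
        while rest:
            # next vertex most connected to the growing set
            nxt = max(rest, key=lambda v: sum(w[v][u] for u in a))
            a.append(nxt)
            rest.remove(nxt)
        s, t = a[-2], a[-1]
        cut_value = sum(w[t][u] for u in a[:-1])
        if cut_value < best_value:
            best_value, best_cut = cut_value, list(merged[t])
        # merge t into s
        for v in vertices:
            if v not in (s, t):
                w[s][v] += w[t][v]
                w[v][s] += w[v][t]
        merged[s].extend(merged[t])
        vertices.remove(t)
    return best_value, best_cut
```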

MinWeightCut Example

[Figures illustrating successive phases of the min-weight cut]

CLICK Adoption & Merging

• Basic_CLICK produces kernels – not full clusters

• Expand kernels by adding the closest singletons

• Merge kernels with the closest similarity
  – In both situations, only merge/adopt over some threshold

Full CLICK

R: singleton set

Full_CLICK(Graph G = (V, E)):
    R ← V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results & Analysis

• Two metrics:
  – Similarity between result clusters

  – Similarity of clusters with the paper's results

• Map clusters to find regions of similar data

• Compare best clusters to find relationships

[Scatter plots: Bayesian, CLICK, and SOM cluster assignments, hour 1 vs. hour 6 expression, for each stimulus: BCG, E. coli, EHEC, L. monocytogenes, latex beads, M. tuberculosis, S. aureus, S. typhi, and S. typhimurium]

Error Metric

• Scores higher for clusters that contain either very many or very few members of the truth set

• Scales to the number of data points in each cluster

• Low error implies the truth clusters were highly correlated with ours

• High error implies the truth clusters were distributed across many of ours

Error: Bayesian 4.33, CLICK 2.44, SOM 18.22

$$E = \frac{\sum_{i=1}^{k} \left(1 - 2\left|\frac{1}{2} - \frac{|T \cap C_i|}{|C_i|}\right|\right) \cdot |C_i|}{\frac{1}{k}\sum_{i=1}^{k} |C_i|}$$

where T is the truth set and C_1, …, C_k are our clusters.
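An error of this kind, low when each cluster contains nearly all or nearly none of the truth set and high when clusters split it roughly in half, can be sketched as below. The size weighting by the mean cluster size is an assumption, and the data are hypothetical.

```python
def cluster_error(truth, clusters):
    """Error over clusters: a cluster contributes near 0 when it holds
    almost all or almost none of the truth set, and near its relative
    size when it splits the truth set evenly. `truth` is a set of gene
    ids, `clusters` a list of sets. Weighting by mean size is an assumption."""
    mean_size = sum(len(c) for c in clusters) / len(clusters)
    total = 0.0
    for c in clusters:
        frac = len(truth & c) / len(c)
        total += (1 - 2 * abs(0.5 - frac)) * len(c) / mean_size
    return total

# The truth set falls entirely inside one cluster: zero error.
print(cluster_error({1, 2, 3}, [{1, 2, 3}, {4, 5, 6}]))  # 0.0
```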

[Table: per-cluster comparison of the Bayesian, CLICK, and SOM results against the paper's cluster (percent of our clusters, percent of the paper's cluster, cluster quality, and minimum cluster composition); overall errors: Bayesian 4.33, CLICK 2.44, SOM 18.22]

Page 2: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Gene Expression Analysis

bull Virus Bacteria and Cellular Function Research

bull Map unknown genes to cellular functions

bull Map illness to malfunctioning cellular processes

bull Measuring gene expression ndash comparing expression of proteins under different situations against a base situation

Gene Expression ‐ Data

bull To measure gene expression measure the mRNA being produced vs base production

bull Data mRNA production vs microarray test

E Coli E coli E coli E coli EHEC EHEC EHEC EHEC EHEC

Gene 1 hr 6 hr 12 hr 24 hr 1 hr 2 hr 6 hr 12 hr 24 hr

GCSF 0083 2615 2007 196 0001 0714 3642 3138 2229

GMCSF 0722 2002 0940 121 1034 1430 2961 2920 2352

IL12B 0845 477 4369 3454 0426 ‐0426 4316 4816 3671

IL1RN 0548 1732 1938 1389 0781 ‐127 1344 1419 1301

IL6 3006 5244 3897 3957 3889 4106 4396 4137 3990

Gene Expression ‐ Clustering

bull Groups together data with similar properties

bull Data sets are partitioned ndash clusters contain points more similar to themselves than others

bull Clustering aids researchers to infer relationships between genes especially when cellular functions are known

bull Also helps identify relationships in co‐expressed genes

Human Macrophage Activation Programs induced by Pathogens

bull Originating source of our data

bull 6800 genes 43 microarray tests

bull Significance tests reduce this to 977 genes

bull Results were clustered to find relevance

bull 198 genes expressed with the same pattern

Gerard J Nau Joan F L Richmond Ann Schlesinger Ezra G Jennings Eric S Lander and Richard A Young Human macrophage activation programs induced by bacterial pathogens Proceedings of the National Academy of Sciences of the United States of America

Vol 99 No 3 Feb 5 2002

Project Goals

bull Implement 3 Clustering algorithmsndash Bayesian Clustering Self‐Organizing Maps CLICK

bull Find most similar clusters ndash relevant similarity

bull Compare paperrsquos best cluster to our resultsndash How many of the 198 genes did we find

K‐Means

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Data

Initial Clusters

K‐Means

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Contributes to Closest Cluster

K‐Means

[31] Jiang D Tang C and Zhang A Cluster Analysis for Gene Expression Data A Survey

Clusters adjust

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Data

Initial Clusters

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

High Contribution to Closest Cluster

Low Contribution to Other Cluster

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Clusters adjust

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Weak contributors removed

Bayesian Clustering

bull Initialize Clusters

bull Until Convergencendash Expectation Compute new probabilities

ndash Maximization Compute new clusters

bull Use BIC to prune clusters

bull Compute global BIC

bull Repeat with different initial clustersndash Keep results with best BIC

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Normal Distribution

πσ

σ

2)(

220 2)( xxexN

minusminus

=

n

xxexN)2(||

)(2)()( 1

π

μμ

Σ=

minusΣminusminus minus

nk

ddtk

tki

tkik

tkiedP

)2(||)|(

2)()( 1

πμμ

μμ

Σ=isin

minusΣminusminus minus

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Expectation Non‐spherical Clusters

Tk

n

kk DD

⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢

λ

λλ

00

0000

2

1

bull Dk=Eigenvectors of Σ are the orientation of cluster k

bull λ1 λn=Eigenvalues of Σ are the radii of cluster k

bull Spherical clusters are usually sufficient (Identity Matrix)

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Compute initial probabilities

bull Given the initial clusters initialize the probabilities with the Normal Distribution

bull Since the probabilities will have to be normalized anyway we can skip the constant

2)()(00 00

)|( kiki ddkki edP μμμμ minussdotminusminus=isin

Expectation Bayersquos law

)()()|()|(

EffectPCausePCauseEffectPEffectCauseP sdot

=

Posterior Prior

sum sdotsdot

=)()|(

)()|()|(CausePCauseEffectP

CausePCauseEffectPEffectCauseP

Expectation Bayersquos law

sum sum

sum

= =

minusminus

=

minusminus

isinsdotisin

isinsdotisin=isin C

j

n

i

tji

tj

tj

tji

n

i

tki

tk

tk

tki

tki

tk

dPdP

dPdPdP

1 1

11

1

11

)|()|(

)|()|()|(

μμμμ

μμμμμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Recompute membership probabilities using normal

distribution

New Probability Old Probabilities

Maximization New Clusters

bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster

sum

sum

=

minusminus

=

minusminus

isin

sdotisin= n

i

tki

tk

n

ii

tki

tk

tk

dP

ddP

1

11

1

11

)|(

)|(

μμ

μμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Bayesian Information Criterion

bull BIC measures the efficiency of the parameterized model in terms of predicting the data

bull It is independent to prior knowledge and the model used

)ln()|(

ln 1

2

nkn

dPnBIC

n

i

tk

tki

tk sdot+

⎟⎟⎟⎟

⎜⎜⎜⎜

⎛isin

sdot=sum=

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

Fit quality Complexity penalty

Bayesian Information Criterion

bull After convergence compute global BIC

)ln()|(

ln 1 1

2

nknk

dPnBIC

n

i

k

jjji

sdot+

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛isin

sdot=sumsum= =

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

GCSF

GMCSF

IL12B

IL1RN

IL6

IL6

PBEF

proIL1B

TNFA

IL8

IL8

IP10

MCP1

MGSA

MIP1A

MIP1B

MIP2A

MIP2B

RANTES

CD44

CD44

ICAM1

IFITM1

LAMB3

NINJ1

TNFAIP6

ADORA2A

CCR6

CCR7

CCRL2

DTR

1

2

37

11

6

4

10

5

8 12

9

13

14

15

17

16

18

19

20

22

21

23

24

25

26

27

28

29

30

Initial ClustersE coliTest Data

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR

1 Hour

12 H

ours

E-Coli Initial Clusters (Test Data)

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9

E-Coli Final Clusters (Test Data)

1 Hour

12 H

ours

EHEC

-6

-4

-2

0

2

4

6

8

-6 -4 -2 0 2 4 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters

S Aureus

-6

-4

-2

0

2

4

6

8

-5 -4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

oura

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters

Ecoli

-4

-3

-2

-1

0

1

2

3

4

5

6

-2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters

S Typhirium

-4

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters

Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a

random map to be like the input values Thus similar inputs will be mapped close to each other

nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)

learning_algorith(n V)SOM larr random valuesfor j larr 1 to n

v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment

end

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged the input data into 9 clusters using the GenePattern software

• Why 9 clusters?
– Used the average of Stephen's output (8) + 1, to allow a square map (3×3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize graph G = (V, E)
– Vertex v: a single gene "fingerprint" vector

– Edge e: pairwise similarity between genes

– Cut C: subset of E that partitions the graph

– Cluster c: subset of V

– Intersection of clusters c_i, c_j (i ≠ j) is ∅

– Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.

CLICK Preprocessing

• Input: n × p matrix M of values

• n: genes, p: tests

• Data must be normalized

• Similarity measure
– S_vu = v · u = |v||u| cos θ
– Proportional to cos θ only when the norm is fixed for all v ∈ V
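As a concrete illustration of this similarity measure, a short sketch (the function names are ours, not CLICK's): once every fingerprint is scaled to unit norm, the dot product equals cos θ, so similarities become directly comparable across gene pairs.

```python
import math

def normalize(v):
    """Scale a fingerprint vector to unit norm."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def fingerprint_similarity(v, u):
    """Dot-product similarity S_vu. With unit-norm fingerprints this
    equals cos(theta) between the two expression vectors."""
    return sum(a * b for a, b in zip(v, u))
```

For example, a gene compared against itself yields similarity 1, and two orthogonal expression patterns yield 0.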

CLICK Similarity

• Key idea: S follows a normalized mixture of two distributions

For u, v ∈ V in the same cluster: S_uv ~ f_T(x | μ_T, σ_T²), the pdf of similarities for elements in the same cluster.

For u, v ∈ V in different clusters: S_uv ~ f_F(x | μ_F, σ_F²), the pdf of similarities for elements in different clusters.

Basic CLICK

• M: n × p matrix (genes vs. test conditions)

• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates

f(S_ij | mates) = 1 / √(2π σ_T²) · exp( −(S_ij − μ_T)² / (2σ_T²) )

w_ij = ln [ p_mates · f_T(S_ij) / ( (1 − p_mates) · f_F(S_ij) ) ]

     = ln ( p_mates · σ_F / ( (1 − p_mates) · σ_T ) ) + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
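The mate weight w_ij described above can be computed directly from the two fitted normal densities. A minimal sketch; the function names and the numeric parameter values in the usage below are illustrative, not fitted values from the paper:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mate_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """Edge weight w_ij = ln[ p * f_T(S_ij) / ((1 - p) * f_F(S_ij)) ],
    where f_T / f_F are the similarity densities for gene pairs in the
    same cluster (mates) vs. different clusters."""
    return math.log(p_mates * normal_pdf(s_ij, mu_t, sigma_t)) - \
           math.log((1 - p_mates) * normal_pdf(s_ij, mu_f, sigma_f))
```

A similarity near μ_T gives a large positive weight (likely mates), while one near μ_F gives a negative weight, which is what lets the min-cut step separate non-mates.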

Basic CLICK

R: Singleton Set

Basic_CLICK(Graph G):
    If V(G) = {v} then R.add(v)        // v is a singleton
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
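The recursion above can be skeletonized as follows. The kernel test and the minimum-weight cut are passed in as callables, since the slides define them separately; all names here are illustrative:

```python
def basic_click(vertices, weights, is_kernel, min_weight_cut, singletons):
    """Recursive Basic-CLICK skeleton (sketch).
    vertices: set of vertex ids; weights: edge-weight lookup;
    is_kernel / min_weight_cut: callables supplied by the caller;
    singletons: set collecting single-vertex components (R).
    Returns a list of kernels (vertex sets)."""
    if len(vertices) == 1:
        singletons.add(next(iter(vertices)))   # R.add(v)
        return []
    if is_kernel(vertices, weights):
        return [set(vertices)]                 # Output(V(G))
    h, k = min_weight_cut(vertices, weights)   # split and recurse
    return (basic_click(h, weights, is_kernel, min_weight_cut, singletons) +
            basic_click(k, weights, is_kernel, min_weight_cut, singletons))
```

For instance, with a toy weight table that is positive within {0,1} and {2,3} and negative across, the recursion splits once and reports both pairs as kernels.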

CLICK Kernel

• Decision problem: is V…
– a singleton (|V| = 1)

– a subset of 2+ clusters (needs further partitioning)

– a subset of a single cluster (a kernel)

CLICK Kernel

• For all possible cuts C connecting V:
– H0_C: cut C disconnects two clusters

– H1_C: cut C partitions a kernel

– If Pr(H0_C) > Pr(H1_C) for any C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

W(C) = ln ( Pr(C | H1_C) / Pr(C | H0_C) ) = Σ_(i,j)∈C w_ij

Minimal Weight Cut

• Choose one vertex as the source; mark it visited

• Repeatedly mark the next most tightly connected vertex

• The last node t represents the cut
– W(C_t) = Σ_i w_it

– Merge t with the second-to-last marked node s

– Remove t from V

– Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
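A compact sketch of the Stoer–Wagner procedure described above, using an adjacency-matrix representation (the function name and interface are our choices):

```python
def stoer_wagner(weights):
    """Global minimum-weight cut via Stoer-Wagner (sketch).
    weights: symmetric n x n matrix of non-negative edge weights.
    Returns (cut_value, one side of the best cut as original vertex ids)."""
    n = len(weights)
    w = [row[:] for row in weights]          # working copy (mutated by merges)
    groups = [[i] for i in range(n)]         # original vertices inside each super-vertex
    active = list(range(n))
    best_value, best_side = float("inf"), []
    while len(active) > 1:
        # Minimum cut phase: grow a set by always adding the most
        # tightly connected remaining vertex.
        a = [active[0]]
        conn = {v: w[active[0]][v] for v in active[1:]}
        while conn:
            t = max(conn, key=conn.get)
            a.append(t)
            del conn[t]
            for v in conn:
                conn[v] += w[t][v]
        s, t = a[-2], a[-1]
        # Cut-of-the-phase: last-added vertex t vs. everything else.
        cut_of_phase = sum(w[t][v] for v in active if v != t)
        if cut_of_phase < best_value:
            best_value, best_side = cut_of_phase, list(groups[t])
        # Merge t into s, then drop t.
        groups[s].extend(groups[t])
        for v in active:
            w[s][v] += w[t][v]
            w[v][s] = w[s][v]
        active.remove(t)
    return best_value, best_side
```

On a graph made of two heavy triangles joined by one light bridge edge, the procedure recovers the bridge as the minimum cut.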

MinWeightCut Example

MinWeightCut

CLICK Adoption & Merging

• Basic_CLICK produces kernels – not full clusters

• Expand kernels by adopting the closest singletons

• Merge kernels with the closest similarity
– In both cases, only merge/adopt above some similarity threshold

Full CLICK

R: Singleton Set

Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results & Analysis

• Two metrics:
– Similarity between our result clusters

– Similarity of our clusters with the paper's results

• Map clusters to find regions of similar data

• Compare the best clusters to find relationships

[Scatter plots, 1 Hour vs. 6 Hours, comparing the Bayesian, CLICK, and SOM cluster assignments for each stimulus: BCG, E. coli, EHEC, L. monocytogenes, Latex, M. tuberculosis, S. aureus, S. Typhi, and S. Typhimurium.]

Error Metric

• Scores higher for clusters that contain either very many or very few of the clusters in the truth set

• Scales with the number of data points in each cluster

• Low error implies the truth clusters were highly correlated

• High error implies the truth clusters were well distributed

Error: Bayesian 4.33, CLICK 2.44, SOM 18.22

E = (1/k) · Σ_{i=1..k} [ ( 1 − |T_i ∩ C_i| / ( (|T_i| + |C_i|) / 2 ) ) · |C_i| ]
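A minimal sketch of this error metric. The pairing of truth clusters T_i with result clusters C_i is assumed to be given, and the size weighting follows our reading of the formula above; both are interpretations, not code from the project:

```python
def cluster_error(truth, found):
    """Error metric (sketch): for each truth cluster T_i paired with a
    result cluster C_i, penalize poor overlap relative to the mean
    cluster size, weighted by |C_i|, and average over the k pairs."""
    k = len(truth)
    total = 0.0
    for t_i, c_i in zip(truth, found):
        overlap = len(set(t_i) & set(c_i))
        mean_size = (len(t_i) + len(c_i)) / 2
        total += (1 - overlap / mean_size) * len(c_i)
    return total / k
```

Perfectly matching clusterings score 0, while fully disjoint ones score the average result-cluster size, matching the bullet points above (low error = tight correspondence with the truth set).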

[Flattened results table: for each cluster index (0–34), the percent of our clusters and the percent of the paper's cluster covered, plus cluster-quality and minimum-cluster-composition columns, each broken out for the Bayesian, CLICK, and SOM methods; the table repeats the overall error totals for the three methods.]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption & Merging
  • Full CLICK
  • Results & Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 4: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Gene Expression ‐ Clustering

bull Groups together data with similar properties

bull Data sets are partitioned ndash clusters contain points more similar to themselves than others

bull Clustering aids researchers to infer relationships between genes especially when cellular functions are known

bull Also helps identify relationships in co‐expressed genes

Human Macrophage Activation Programs induced by Pathogens

bull Originating source of our data

bull 6800 genes 43 microarray tests

bull Significance tests reduce this to 977 genes

bull Results were clustered to find relevance

bull 198 genes expressed with the same pattern

Gerard J Nau Joan F L Richmond Ann Schlesinger Ezra G Jennings Eric S Lander and Richard A Young Human macrophage activation programs induced by bacterial pathogens Proceedings of the National Academy of Sciences of the United States of America

Vol 99 No 3 Feb 5 2002

Project Goals

bull Implement 3 Clustering algorithmsndash Bayesian Clustering Self‐Organizing Maps CLICK

bull Find most similar clusters ndash relevant similarity

bull Compare paperrsquos best cluster to our resultsndash How many of the 198 genes did we find

K‐Means

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Data

Initial Clusters

K‐Means

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Contributes to Closest Cluster

K‐Means

[31] Jiang D Tang C and Zhang A Cluster Analysis for Gene Expression Data A Survey

Clusters adjust

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Data

Initial Clusters

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

High Contribution to Closest Cluster

Low Contribution to Other Cluster

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Clusters adjust

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Weak contributors removed

Bayesian Clustering

bull Initialize Clusters

bull Until Convergencendash Expectation Compute new probabilities

ndash Maximization Compute new clusters

bull Use BIC to prune clusters

bull Compute global BIC

bull Repeat with different initial clustersndash Keep results with best BIC

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Normal Distribution

πσ

σ

2)(

220 2)( xxexN

minusminus

=

n

xxexN)2(||

)(2)()( 1

π

μμ

Σ=

minusΣminusminus minus

nk

ddtk

tki

tkik

tkiedP

)2(||)|(

2)()( 1

πμμ

μμ

Σ=isin

minusΣminusminus minus

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Expectation Non‐spherical Clusters

Tk

n

kk DD

⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢

λ

λλ

00

0000

2

1

bull Dk=Eigenvectors of Σ are the orientation of cluster k

bull λ1 λn=Eigenvalues of Σ are the radii of cluster k

bull Spherical clusters are usually sufficient (Identity Matrix)

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Compute initial probabilities

bull Given the initial clusters initialize the probabilities with the Normal Distribution

bull Since the probabilities will have to be normalized anyway we can skip the constant

2)()(00 00

)|( kiki ddkki edP μμμμ minussdotminusminus=isin

Expectation Bayersquos law

)()()|()|(

EffectPCausePCauseEffectPEffectCauseP sdot

=

Posterior Prior

sum sdotsdot

=)()|(

)()|()|(CausePCauseEffectP

CausePCauseEffectPEffectCauseP

Expectation Bayersquos law

sum sum

sum

= =

minusminus

=

minusminus

isinsdotisin

isinsdotisin=isin C

j

n

i

tji

tj

tj

tji

n

i

tki

tk

tk

tki

tki

tk

dPdP

dPdPdP

1 1

11

1

11

)|()|(

)|()|()|(

μμμμ

μμμμμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Recompute membership probabilities using normal

distribution

New Probability Old Probabilities

Maximization: New Clusters

• Clusters are computed as the weighted average of each point and the probability it belongs to the cluster:

$$\mu_k^{t+1} = \frac{\sum_{i=1}^{n} P(d_i \in k \mid \mu_k^t) \cdot d_i}{\sum_{i=1}^{n} P(d_i \in k \mid \mu_k^t)}$$

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis.

Bayesian Information Criterion

• BIC measures the efficiency of the parameterized model in terms of predicting the data

• It is independent of prior knowledge and of the model used

• Per-cluster BIC at step t, a fit-quality term plus a complexity penalty:

$$BIC_k^t = -2 \ln\left( \sum_{i=1}^{n} P(d_i \in k \mid \mu_k^t) \right) + k \ln(n)$$

(the first term rewards fit quality; the k ln(n) term penalizes model complexity)

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
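A sketch of computing a BIC-style score of this shape: a fit-quality term from the cluster's membership probabilities plus a complexity penalty proportional to ln(n). The exact formula on the slide is hard to recover from the source, so treat `bic_score` and its fit-quality proxy as illustrative, not as the authors' code:

```python
import math

def bic_score(memberships, n_params, n_points):
    """BIC-style score: -2*ln(fit quality) + complexity penalty.
    Lower is better; each extra parameter costs ln(n_points)."""
    fit = sum(memberships)  # crude fit-quality proxy: summed memberships
    return -2.0 * math.log(fit) + n_params * math.log(n_points)
```

A tight cluster (memberships near 1) scores lower, i.e. better, than a loose one with the same parameter count.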

Bayesian Information Criterion

• After convergence, compute the global BIC over all k clusters:

$$BIC^t = -2 \ln\left( \sum_{i=1}^{n} \sum_{j=1}^{k} P(d_i \in j \mid \mu_j) \right) + nk \ln(n)$$

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion

[Figure: E. coli test data, hour 1 vs. hour 12, initial clusters (1-30) over the genes GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR.]

[Figure: E. coli test data, hour 1 vs. hour 12, final clusters (Clusters 1-9) over the same genes.]

[Figure: EHEC, hour 1 vs. hour 6: Clusters 0-7 and 9, rational vs. discrete clusters.]

[Figure: S. aureus, hour 1 vs. hour 6: Clusters 0-7, rational vs. discrete clusters.]

[Figure: E. coli, hour 1 vs. hour 6: Clusters 0-7, rational vs. discrete clusters.]

[Figure: S. Typhimurium, hour 1 vs. hour 6: Clusters 0-7, 10, and 12, rational vs. discrete clusters.]

Self-Organizing Maps

• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values, so similar inputs end up mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← min(d(v, SOM)) over every node in SOM
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79-96.
• Kohonen's Self-Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
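The training loop above can be sketched end to end in NumPy. This is a minimal illustration of the usual SOM recipe (Euclidean BMU search, exponentially shrinking Gaussian neighborhood, decaying learning rate); `train_som` and all of its default parameters are our own choices, not taken from the cited sources:

```python
import numpy as np

def train_som(data, grid=(3, 3), iters=200, sigma0=1.5, lr0=0.5, seed=0):
    """Train a 2-D SOM: pick a random input, find the best matching unit
    (BMU) by Euclidean distance, then pull the BMU's neighborhood toward
    the input with a time-decaying learning rate and neighborhood radius.
    Assumes sigma0 > 1 so the decay time constant is well defined."""
    rng = np.random.default_rng(seed)
    h, w = grid
    som = rng.random((h, w, data.shape[1]))            # random initial map
    coords = np.dstack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"))
    lam = iters / np.log(sigma0)                       # decay time constant
    for t in range(iters):
        v = data[rng.integers(len(data))]              # random training vector
        d2 = ((som - v) ** 2).sum(axis=2)
        bmu = np.unravel_index(np.argmin(d2), (h, w))  # best matching unit
        sigma = sigma0 * np.exp(-t / lam)              # shrinking neighborhood
        lr = lr0 * np.exp(-t / iters)                  # decaying learning rate
        g2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
        theta = np.exp(-g2 / (2 * sigma ** 2))         # neighborhood function
        som += (lr * theta)[..., None] * (v - som)     # weight adjustment
    return som
```

After training, mapping mode is just the BMU search again: each input lands on its closest node, and similar inputs land on nearby nodes.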

Best Matching Unit

• The BMU is the node with the minimum distance between the training vector and all the nodes in the SOM. It is typically found using the Euclidean distance formula:

$$d(v, w) = \sqrt{\sum_{i=1}^{k} (v_i - w_i)^2}$$

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.

Neighborhood

• The area of the neighborhood shrinks over time. You can use the exponential decay function for this:

$$\sigma(t) = \sigma_0 \, e^{-t/\lambda}$$

where σ_0 is the initial size of the neighborhood and the time constant λ is derived from n, the number of iterations that the algorithm will run.

Weight Adjustment

• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node: w(t+1) = w(t) + Θ(t) L(t) (v(t) − w(t))

• One decay function, L(t), decreases the learning rate over time

• The other, Θ(t), decreases the rate of learning for neighbors further away from the BMU

Mapping Mode (SOM)

• After the learning algorithm is run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs will be near each other on the two-dimensional map.

GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePattern: SOMClustering module

Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0.

Summary: The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g., stocks, mutual funds, spectral peaks, etc.).

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged the input data into 9 clusters using the GenePattern software

• Why 9 clusters?
– Used the average of Stephen's output (8) + 1 to use a square map (3x3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize Graph G = (V, E)
– Vertex v: a single gene's "fingerprint" vector
– Edge e: pairwise similarity between genes
– Cut C: subset of E that partitions the graph
– Cluster c: subset of V
– Intersection of clusters c_i, c_j (i ≠ j) is ∅
– Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.

CLICK: Preprocessing

• Input: n x p matrix M of values

• n genes, p tests

• Data must be normalized

• Similarity measure:
– S_vu = v·u = |v||u| cos θ
– Proportional to the angle only when the norm is fixed for all v ∈ V

CLICK: Similarity

• Key idea: S is a normalized mixed distribution:

For u, v ∈ V in the same cluster: similarities follow the pdf $f_T(x \mid \mu_T, \sigma_T^2)$ (mean $\mu_T$, variance $\sigma_T^2$)

For u, v ∈ V in different clusters: similarities follow the pdf $f_F(x \mid \mu_F, \sigma_F^2)$ (mean $\mu_F$, variance $\sigma_F^2$)

Basic CLICK

• M: n x p matrix (genes vs. test conditions)

• S_ij: dot product of v_i, v_j

• w_ij: probability that v_i, v_j are mates

$$f(S_{ij} \mid \mu_T, \sigma_T) = \frac{e^{-(S_{ij} - \mu_T)^2 / 2\sigma_T^2}}{\sqrt{2\pi}\,\sigma_T}$$

$$w_{ij} = \ln\frac{p_{mates}\, f(S_{ij} \mid \mu_T, \sigma_T)}{(1 - p_{mates})\, f(S_{ij} \mid \mu_F, \sigma_F)} = \ln\frac{p_{mates}\,\sigma_F}{(1 - p_{mates})\,\sigma_T} + \frac{(S_{ij} - \mu_F)^2}{2\sigma_F^2} - \frac{(S_{ij} - \mu_T)^2}{2\sigma_T^2}$$
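The mate weight described above, the log-ratio of the mate and non-mate similarity densities scaled by the prior mate probability, can be sketched directly. A hypothetical illustration; the parameter values in the test are made up:

```python
import math

def mate_weight(s, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """w_ij = ln[ p * f(S_ij | mates) / ((1 - p) * f(S_ij | non-mates)) ]
    with Gaussian pdfs for the mate and non-mate similarity distributions."""
    def normal_pdf(x, mu, sigma):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    return math.log(p_mates * normal_pdf(s, mu_t, sigma_t)
                    / ((1 - p_mates) * normal_pdf(s, mu_f, sigma_f)))
```

Positive weights mean the similarity looks more like a mate pair; negative weights mean a non-mate pair.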

Basic CLICK

R: singleton set
Basic_CLICK(Graph G):
    If V(G) = {v} then R.add(v)        // v is a singleton
    Else if G is a kernel then Output(V(G))
    Else:
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
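The recursion above translates almost line for line into Python. A sketch with the kernel test and the cut routine passed in as callbacks, since the slides define them separately; the toy callbacks in the test are purely illustrative:

```python
def basic_click(graph, is_kernel, min_weight_cut, singletons, kernels):
    """Recursive skeleton of Basic-CLICK: peel off singletons, report
    kernels, otherwise split on a minimum-weight cut and recurse.
    `graph` is a set of vertices; the callbacks carry the edge data."""
    if len(graph) == 1:
        singletons.update(graph)          # lone vertex -> singleton set R
    elif is_kernel(graph):
        kernels.append(set(graph))        # output a kernel
    else:
        h, k = min_weight_cut(graph)      # split into sub-graphs H, K
        basic_click(h, is_kernel, min_weight_cut, singletons, kernels)
        basic_click(k, is_kernel, min_weight_cut, singletons, kernels)
```

The adoption and merging steps of the later slides would then run over the collected `kernels` and `singletons`.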

CLICK: Kernel

• Decision problem: is V…
– a singleton (|V| = 1)?
– a subset of 2+ clusters (need to partition more)?
– a subset of a single cluster (a kernel)?

CLICK: Kernel

• For all possible cuts C connecting V:
– H0_C: cut C disconnects two clusters
– H1_C: cut C partitions a kernel
– If Pr(H0_C) > Pr(H1_C) for any C, then V is not a kernel

• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

$$W(C) = \sum_{(i,j) \in C} w_{ij} = \ln\frac{\Pr(C \mid H_1^C)}{\Pr(C \mid H_0^C)}$$

Minimal Weight Cut

• Choose one vertex as the source; mark it visited

• Mark each next most highly connected vertex

• The last node t represents the cut:
– W(C_t) = Σ_i w_it
– Merge t with the 2nd-last marked node s
– Remove t from V
– Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
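The Stoer-Wagner procedure in the bullets above (repeated maximum-adjacency phases, each ending in a cut-of-the-phase, then merging the last two vertices) fits in a short function. An illustrative sketch using a dense symmetric weight matrix:

```python
def stoer_wagner(weights):
    """Global minimum-weight cut of an undirected weighted graph.
    weights: symmetric adjacency matrix (list of lists)."""
    n = len(weights)
    w = [row[:] for row in weights]       # working copy; mutated by merges
    groups = [[i] for i in range(n)]      # original vertices in each super-vertex
    active = list(range(n))
    best_w, best_cut = float("inf"), []
    while len(active) > 1:
        # maximum-adjacency ordering starting from an arbitrary vertex
        a = [active[0]]
        conn = {v: w[active[0]][v] for v in active[1:]}
        while conn:
            t = max(conn, key=conn.get)   # most tightly connected next vertex
            a.append(t)
            del conn[t]
            for v in conn:
                conn[v] += w[t][v]
        s, t = a[-2], a[-1]
        cut_w = sum(w[t][v] for v in active if v != t)   # cut-of-the-phase
        if cut_w < best_w:
            best_w, best_cut = cut_w, list(groups[t])
        groups[s] += groups[t]            # merge t into s, then drop t
        for v in active:
            w[s][v] += w[t][v]
            w[v][s] = w[s][v]
        active.remove(t)
    return best_w, best_cut
```

For CLICK, the returned vertex set and its complement are the sub-graphs H and K that Basic_CLICK recurses on.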

MinWeightCut Example

MinWeightCut

CLICK: Adoption & Merging

• Basic_CLICK outputs kernels, not full clusters

• Expand kernels by adopting the closest singletons

• Merge kernels with the closest similarity
– In both situations, only merge/adopt over some threshold

Full CLICK

R: singleton set
Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results & Analysis

• Two metrics:
– Similarity between result clusters
– Similarity of clusters with the paper's results

• Map clusters to find regions of similar data

• Compare best clusters to find relationships

[Figures: nine scatter plots (1-hour vs. 6-hour expression) comparing the Bayesian, CLICK, and SOM clusterings for BCG, E. coli, EHEC, L. monocytogenes, latex beads, M. tuberculosis, S. aureus, S. Typhi, and S. Typhimurium.]

Error Metric

• Scores higher for clusters that contain either very many or very few of the clusters in the truth set

• Scales with the number of data points in each cluster

• Low error implies the truth clusters were highly correlated

• High error implies the truth clusters were well distributed

Error: Bayesian 4.33, CLICK 2.44, SOM 18.22

$$E = \frac{\sum_{i=1}^{k} \left[ \left( 1 - \frac{2\,|T_i \cap C_i|}{|T_i| + |C_i|} \right) \cdot |C_i| \right]}{\sum_{i=1}^{k} |C_i|}$$

where T_i is the truth cluster matched to result cluster C_i.
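One plausible reading of this metric, a cluster-size-weighted overlap error, can be sketched as follows. This is our interpretation of the garbled source formula (a |C_i|-weighted complement of the Dice overlap between each result cluster C_i and its matched truth cluster T_i), so treat `cluster_error` as illustrative:

```python
def cluster_error(truth, ours):
    """Size-weighted overlap error between matched truth/result clusters:
    (1 - Dice coefficient) per pair, weighted by result-cluster size.
    An illustrative reading of the slide's metric, not the exact formula."""
    num = sum((1 - 2 * len(t & c) / (len(t) + len(c))) * len(c)
              for t, c in zip(truth, ours))
    return num / sum(len(c) for c in ours)
```

Perfectly matching clusterings score 0; completely disjoint ones score 1.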

[Table: per-cluster comparison of the Bayesian, CLICK, and SOM results for clusters 0-34: percent of our clusters matched, percent of the paper's cluster matched, cluster quality, and minimum cluster composition at thresholds from 0% to 100%. Bottom rows as printed in the source: "Error 422929 238409 178025" and "Error 433 2440 1822" (Bayesian / CLICK / SOM).]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK: Adoption & Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64

Human Macrophage Activation Programs Induced by Pathogens

• Originating source of our data

• 6800 genes, 43 microarray tests

• Significance tests reduce this to 977 genes

• Results were clustered to find relevance

• 198 genes expressed with the same pattern

Gerard J. Nau, Joan F. L. Richmond, Ann Schlesinger, Ezra G. Jennings, Eric S. Lander, and Richard A. Young. Human macrophage activation programs induced by bacterial pathogens. Proceedings of the National Academy of Sciences of the United States of America, Vol. 99, No. 3, Feb. 5, 2002.

Project Goals

• Implement 3 clustering algorithms
– Bayesian Clustering, Self-Organizing Maps, CLICK

• Find the most similar clusters (relevant similarity)

• Compare the paper's best cluster to our results
– How many of the 198 genes did we find?

K-Means

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis.

Data

Initial Clusters

K-Means

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis.

Contributes to Closest Cluster

K-Means

[31] Jiang, D., Tang, C., and Zhang, A. Cluster Analysis for Gene Expression Data: A Survey.

Clusters adjust

Bayesian Clustering

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis.

Data

Initial Clusters

Bayesian Clustering

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis.

High Contribution to Closest Cluster

Low Contribution to Other Cluster

Bayesian Clustering

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis.

Clusters adjust

Bayesian Clustering

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis.

Weak contributors removed

Bayesian Clustering

• Initialize clusters

• Until convergence:
– Expectation: compute new probabilities
– Maximization: compute new clusters

• Use BIC to prune clusters

• Compute global BIC

• Repeat with different initial clusters
– Keep the results with the best BIC

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis.

Normal Distribution

πσ

σ

2)(

220 2)( xxexN

minusminus

=

n

xxexN)2(||

)(2)()( 1

π

μμ

Σ=

minusΣminusminus minus

nk

ddtk

tki

tkik

tkiedP

)2(||)|(

2)()( 1

πμμ

μμ

Σ=isin

minusΣminusminus minus

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Expectation Non‐spherical Clusters

Tk

n

kk DD

⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢

λ

λλ

00

0000

2

1

bull Dk=Eigenvectors of Σ are the orientation of cluster k

bull λ1 λn=Eigenvalues of Σ are the radii of cluster k

bull Spherical clusters are usually sufficient (Identity Matrix)

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Compute initial probabilities

bull Given the initial clusters initialize the probabilities with the Normal Distribution

bull Since the probabilities will have to be normalized anyway we can skip the constant

2)()(00 00

)|( kiki ddkki edP μμμμ minussdotminusminus=isin

Expectation Bayersquos law

)()()|()|(

EffectPCausePCauseEffectPEffectCauseP sdot

=

Posterior Prior

sum sdotsdot

=)()|(

)()|()|(CausePCauseEffectP

CausePCauseEffectPEffectCauseP

Expectation Bayersquos law

sum sum

sum

= =

minusminus

=

minusminus

isinsdotisin

isinsdotisin=isin C

j

n

i

tji

tj

tj

tji

n

i

tki

tk

tk

tki

tki

tk

dPdP

dPdPdP

1 1

11

1

11

)|()|(

)|()|()|(

μμμμ

μμμμμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Recompute membership probabilities using normal

distribution

New Probability Old Probabilities

Maximization New Clusters

bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster

sum

sum

=

minusminus

=

minusminus

isin

sdotisin= n

i

tki

tk

n

ii

tki

tk

tk

dP

ddP

1

11

1

11

)|(

)|(

μμ

μμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Bayesian Information Criterion

bull BIC measures the efficiency of the parameterized model in terms of predicting the data

bull It is independent to prior knowledge and the model used

)ln()|(

ln 1

2

nkn

dPnBIC

n

i

tk

tki

tk sdot+

⎟⎟⎟⎟

⎜⎜⎜⎜

⎛isin

sdot=sum=

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

Fit quality Complexity penalty

Bayesian Information Criterion

bull After convergence compute global BIC

)ln()|(

ln 1 1

2

nknk

dPnBIC

n

i

k

jjji

sdot+

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛isin

sdot=sumsum= =

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

GCSF

GMCSF

IL12B

IL1RN

IL6

IL6

PBEF

proIL1B

TNFA

IL8

IL8

IP10

MCP1

MGSA

MIP1A

MIP1B

MIP2A

MIP2B

RANTES

CD44

CD44

ICAM1

IFITM1

LAMB3

NINJ1

TNFAIP6

ADORA2A

CCR6

CCR7

CCRL2

DTR

1

2

37

11

6

4

10

5

8 12

9

13

14

15

17

16

18

19

20

22

21

23

24

25

26

27

28

29

30

Initial ClustersE coliTest Data

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR

1 Hour

12 H

ours

E-Coli Initial Clusters (Test Data)

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9

E-Coli Final Clusters (Test Data)

1 Hour

12 H

ours

EHEC

-6

-4

-2

0

2

4

6

8

-6 -4 -2 0 2 4 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters

S Aureus

-6

-4

-2

0

2

4

6

8

-5 -4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

oura

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters

Ecoli

-4

-3

-2

-1

0

1

2

3

4

5

6

-2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters

S Typhirium

-4

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters

Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a

random map to be like the input values Thus similar inputs will be mapped close to each other

nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)

learning_algorith(n V)SOM larr random valuesfor j larr 1 to n

v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment

end

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data: CSE 5615 Class Project, Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs Induced by Pathogens
  • Project Goals
  • K-Means
  • Bayesian Clustering
  • Normal Distribution
  • Expectation: Non-spherical Clusters
  • Compute Initial Probabilities
  • Expectation: Bayes' Law
  • Maximization: New Clusters
  • Bayesian Information Criterion
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Mapping Mode (SOM)
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption & Merging
  • Full CLICK
  • Results & Analysis
  • Error Metric

Project Goals

• Implement 3 clustering algorithms
  – Bayesian Clustering, Self-Organizing Maps, CLICK

• Find the most similar clusters – relevant similarity

• Compare the paper's best cluster to our results
  – How many of the 198 genes did we find?

K-Means

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis

Data

Initial Clusters

K-Means

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis

Contributes to Closest Cluster

K-Means

[31] Jiang, D., Tang, C., and Zhang, A. Cluster Analysis for Gene Expression Data: A Survey

Clusters adjust

Bayesian Clustering

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis

Data

Initial Clusters

Bayesian Clustering

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis

High Contribution to Closest Cluster

Low Contribution to Other Cluster

Bayesian Clustering

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis

Clusters adjust

Bayesian Clustering

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis

Weak contributors removed

Bayesian Clustering

• Initialize clusters

• Until convergence:
  – Expectation: compute new probabilities
  – Maximization: compute new clusters

• Use BIC to prune clusters

• Compute the global BIC

• Repeat with different initial clusters
  – Keep the results with the best BIC

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis
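The loop can be sketched in a few lines (our own illustration, not the project's code; it assumes spherical unit-variance clusters and equal priors, matching the simplified probabilities on the later slides):

```python
import numpy as np

def em_cluster(data, k, iters=50, init=None, seed=0):
    """Minimal EM clustering sketch: spherical, unit-variance Gaussians."""
    rng = np.random.default_rng(seed)
    idx = init if init is not None else rng.choice(len(data), k, replace=False)
    mu = data[np.asarray(idx)].astype(float)           # initial clusters
    p = None
    for _ in range(iters):
        # Expectation: membership probabilities from the (unnormalized)
        # normal density; the shared constant cancels in the normalization.
        d2 = ((data[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        p = np.exp(-d2 / 2)
        p /= p.sum(axis=1, keepdims=True)
        # Maximization: new cluster = probability-weighted average of points
        mu = (p.T @ data) / p.sum(axis=0)[:, None]
    return mu, p

# Two tight blobs; seeding one center in each blob recovers both means.
data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
mu, p = em_cluster(data, k=2, init=[0, 2])
```

BIC-based pruning and restarts from the outline above are omitted here to keep the E-step and M-step visible.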

Normal Distribution

$$N(x \mid x_0, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-x_0)^2/2\sigma^2}$$

$$N(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n\,|\Sigma|}}\, e^{-(x-\mu)^T \Sigma^{-1} (x-\mu)/2}$$

$$P(d_i \in \mu_k^t) = \frac{1}{\sqrt{(2\pi)^n\,|\Sigma_k|}}\, e^{-(d_i-\mu_k^t)^T \Sigma_k^{-1} (d_i-\mu_k^t)/2}$$

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis

Expectation: Non-spherical Clusters

$$\Sigma_k = D_k \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_n \end{bmatrix} D_k^T$$

• D_k = eigenvectors of Σ, giving the orientation of cluster k

• λ_1, …, λ_n = eigenvalues of Σ, giving the radii of cluster k

• Spherical clusters are usually sufficient (identity matrix)

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis

Compute initial probabilities

• Given the initial clusters, initialize the probabilities with the normal distribution

• Since the probabilities will have to be normalized anyway, we can skip the constant:

$$P(d_i \in \mu_k^0) = e^{-(d_i - \mu_k^0)\cdot(d_i - \mu_k^0)/2}$$

Expectation: Bayes' law

$$P(Cause \mid Effect) = \frac{P(Effect \mid Cause) \cdot P(Cause)}{P(Effect)}$$

(Posterior on the left, prior on the right.)

$$P(Cause \mid Effect) = \frac{P(Effect \mid Cause) \cdot P(Cause)}{\sum P(Effect \mid Cause) \cdot P(Cause)}$$

Expectation: Bayes' law

$$P(d_i \in \mu_k^t) = \frac{P(d_i \in \mu_k^{t-1}) \cdot \sum_{i=1}^{n} P(d_i \in \mu_k^{t-1})}{\sum_{j=1}^{C} \left[ P(d_i \in \mu_j^{t-1}) \cdot \sum_{i=1}^{n} P(d_i \in \mu_j^{t-1}) \right]}$$

Recompute the membership probabilities using the normal distribution: the new probability is built from the old probabilities.

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis

Maximization: New Clusters

• Clusters are computed as the weighted average of each point and the probability it belongs to the cluster:

$$\mu_k^t = \frac{\sum_{i=1}^{n} P(d_i \in \mu_k^{t-1}) \cdot d_i}{\sum_{i=1}^{n} P(d_i \in \mu_k^{t-1})}$$

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis

Bayesian Information Criterion

• BIC measures the efficiency of the parameterized model in terms of predicting the data

• It is independent of prior knowledge and the model used

$$BIC_k^t = \ln\!\left(\sum_{i=1}^{n} P(d_i \in \mu_k^t)\right) + \frac{k_n}{2}\,\ln(n)$$

(The first term measures fit quality; the second is the complexity penalty.)

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion

Bayesian Information Criterion

• After convergence, compute the global BIC:

$$BIC^t = \ln\!\left(\sum_{j=1}^{k}\sum_{i=1}^{n} P(d_i \in \mu_j^t)\right) + \frac{k \cdot k_n}{2}\,\ln(n)$$

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
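To make the two terms concrete, here is a small sketch of using a BIC-style score to choose the number of clusters (our own illustration, not the project's code; it uses the common convention of subtracting the complexity penalty from the log-likelihood, so the better model scores higher):

```python
import math

def log_likelihood(data, centers):
    # ln of the product of P(d_i | nearest center): spherical unit-variance
    # normals with a hard nearest-center assignment (a simplification).
    total = 0.0
    for x in data:
        d2 = min((x - m) ** 2 for m in centers)
        total += -0.5 * d2 - 0.5 * math.log(2 * math.pi)
    return total

def bic(data, centers):
    # fit quality minus complexity penalty (k_n free parameters, n points)
    n, k_n = len(data), len(centers)
    return log_likelihood(data, centers) - 0.5 * k_n * math.log(n)

data = [0.0, 0.1, -0.1, 6.0, 6.1, 5.9]
bic_one = bic(data, [sum(data) / len(data)])   # one cluster at the grand mean
bic_two = bic(data, [0.0, 6.0])                # one center per blob
```

For two well-separated blobs, the gain in fit from the second center outweighs the extra penalty term, so the two-cluster model wins.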

[Figure: Bayesian clustering test data, the 30 E. coli gene expression vectors (GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR, with IL6, IL8, and CD44 probed twice) assigned to 30 initial clusters.]

[Figure: E-Coli Initial Clusters (Test Data), expression at 1 hour vs. 12 hours for the 30 genes above.]

[Figure: E-Coli Final Clusters (Test Data), the same genes at 1 hour vs. 12 hours, grouped into clusters 1-9.]

[Figure: EHEC, 1 hour vs. 6 hours; clusters 0-7 and 9, with rational and discrete cluster centers marked.]

[Figure: S. aureus, 1 hour vs. 6 hours; clusters 0-7, with rational and discrete cluster centers marked.]

[Figure: E. coli, 1 hour vs. 6 hours; clusters 0-7, with rational and discrete cluster centers marked.]

[Figure: S. Typhimurium, 1 hour vs. 6 hours; clusters 0-7, 10, and 12, with rational and discrete cluster centers marked.]

Self-Organizing Maps

• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values; similar inputs are thus mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← min(d(v, SOM)) over every node in the SOM
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
    end

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79-96.
• Kohonen's Self Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
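A runnable sketch of this training loop (our own illustration; the grid size, decay constants, and the final neighborhood radius of roughly 0.25 are our choices, not from the slides):

```python
import numpy as np

def train_som(inputs, grid=(3, 3), n_iter=400, sigma0=1.5, lr0=0.5, seed=0):
    """SOM learning loop: BMU search, shrinking neighborhood, decaying rate."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    som = rng.random((rows, cols, inputs.shape[1]))      # random initial map
    coords = np.dstack(np.mgrid[0:rows, 0:cols]).astype(float)
    # decay constant chosen so the radius shrinks to ~0.25 by the last step
    lam = n_iter / np.log(sigma0 / 0.25)
    for t in range(n_iter):
        v = inputs[rng.integers(len(inputs))]            # random learning vector
        # BMU: the node whose weight vector is closest (Euclidean) to v
        d2 = ((som - v) ** 2).sum(axis=2)
        bmu = np.unravel_index(np.argmin(d2), d2.shape)
        sigma = sigma0 * np.exp(-t / lam)                # neighborhood shrinks
        lr = lr0 * np.exp(-t / n_iter)                   # learning rate decays
        # neighbors nearer the BMU on the grid receive a larger adjustment
        grid_d2 = ((coords - np.array(bmu, dtype=float)) ** 2).sum(axis=2)
        theta = np.exp(-grid_d2 / (2 * sigma ** 2))
        som += (lr * theta)[:, :, None] * (v - som)
    return som

# Two clusters of 2-D inputs; after training, map nodes settle near the data.
inputs = np.array([[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 0.9]])
som = train_som(inputs)
```

The two exponential decays here correspond to the neighborhood and learning-rate schedules described on the next slides.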

Best Matching Unit

• The BMU is the node at the minimum distance between the training vector and all the nodes in the SOM. It is typically found using the Euclidean distance formula:

$$d = \sqrt{\sum_{i=1}^{k} (v_i - w_i)^2}$$

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.

Neighborhood

• The area of the neighborhood shrinks over time. You can use the exponential decay function for this:

$$\sigma(t) = \sigma_0\, e^{-t/\lambda}$$

where σ₀ is the initial size of the neighborhood and λ is a decay constant derived from n, the number of iterations the algorithm will run (commonly λ = n / ln σ₀).

Weight Adjustment

• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node:

$$W(t+1) = W(t) + \Theta(t)\, L(t)\, (V(t) - W(t))$$

• One decay function, L(t), decreases the learning rate over time

• The other, Θ(t), decreases the rate of learning for neighbors further away from the BMU

Mapping Mode (SOM)

• After the learning algorithm has run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs will be near each other on the two-dimensional map.

GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern

• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

• Use online or download

GenePattern

Module name: SOMClustering
Description: Self-Organizing Maps algorithm
Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu
Date: 10/28/03
Release: 1.0

Summary: The Self-Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g. a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, i.e. stocks, mutual funds, spectral peaks, etc.)

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged the input data into 9 clusters using the GenePattern software

• Why 9 clusters?
  – Used the average of Stephen's output (8) + 1 to use a square map (3x3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize a graph G = (V, E)
  – Vertex v: a single gene's "fingerprint" vector
  – Edge e: pairwise similarity between genes
  – Cut C: subset of E that partitions the graph
  – Cluster c: subset of V
  – The intersection of clusters c_i, c_j (i ≠ j) is ∅
  – Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.

CLICK Preprocessing

• Input: an n x p matrix M of values

• n genes, p tests

• Data must be normalized

• Similarity measure:
  – S_{v,u} = v·u = |v||u| cos θ
  – Proportional only when the norm is fixed for all v ∈ V

CLICK: Similarity

• Key idea: S is a normalized mixed distribution

For u, v ∈ V in the same cluster: S_{u,v} ~ f_T(x | μ_T, σ_T²), the pdf for elements in the same cluster

For u, v ∈ V in different clusters: S_{u,v} ~ f_F(x | μ_F, σ_F²), the pdf for elements in different clusters

Basic CLICK

• M: n x p matrix (genes vs. test conditions)

• S_ij: dot product of v_i, v_j

• w_ij: probability that v_i, v_j are mates

$$f(S_{ij} \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(S_{ij}-\mu)^2/2\sigma^2}$$

$$w_{ij} = \ln \frac{p_{mates}\, f(S_{ij} \mid \mu_T, \sigma_T^2)}{(1-p_{mates})\, f(S_{ij} \mid \mu_F, \sigma_F^2)}$$

$$w_{ij} = \ln\!\left(\frac{p_{mates}\,\sigma_F}{(1-p_{mates})\,\sigma_T}\right) + \frac{(S_{ij}-\mu_F)^2}{2\sigma_F^2} - \frac{(S_{ij}-\mu_T)^2}{2\sigma_T^2}$$
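A sketch of this edge-weight computation (our own illustration; w_ij is the log-odds that i and j are mates, with p_mates the prior mate probability and (μ_T, σ_T), (μ_F, σ_F) the mate and non-mate similarity distributions; the example parameter values are ours):

```python
import math

def normal_pdf(x, mu, sigma):
    """f(S_ij | mu, sigma^2): normal density of the similarity value."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def click_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """w_ij: log-odds that vertices i and j are mates given similarity s_ij.
    Expanded form of ln(p * f_T / ((1 - p) * f_F))."""
    return (math.log(p_mates * sigma_f / ((1 - p_mates) * sigma_t))
            + (s_ij - mu_f) ** 2 / (2 * sigma_f ** 2)
            - (s_ij - mu_t) ** 2 / (2 * sigma_t ** 2))

# Illustrative parameters: mates have similarity near 0.8, non-mates near 0.0.
params = dict(p_mates=0.3, mu_t=0.8, sigma_t=0.2, mu_f=0.0, sigma_f=0.3)
w_high = click_weight(0.8, **params)   # similarity typical of mates
w_low = click_weight(0.0, **params)    # similarity typical of non-mates
```

The expanded form and the direct density ratio agree exactly, so the constant and quadratic terms can be precomputed once per dataset.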

Basic CLICK

R: singleton set

Basic_CLICK(Graph G):
    If V(G) = {v} then R.add(v)              // v is a singleton
    Else if G is a kernel then Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)

CLICK Kernel

• Decision problem: is V…
  – a singleton (|V| = 1)
  – a subset of 2+ clusters (need to partition further)
  – a subset of a single cluster (a kernel)

CLICK Kernel

• For all possible cuts C connecting V:
  – H_0^C: cut C disconnects two clusters
  – H_1^C: cut C partitions a kernel
  – If H_0^C > H_1^C for any C, then V is not a kernel

• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

$$W(C) = \ln \frac{\Pr(H_1^C \mid C)}{\Pr(H_0^C \mid C)} = \sum_{(i,j) \in C} w_{ij}$$

Minimal Weight Cut

• Choose one vertex as the source; mark it visited

• Mark each next most highly connected vertex

• The last node t represents the cut
  – W(C_t) = Σ_i w_it
  – Merge t with the 2nd-last marked node s
  – Remove t from V
  – Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
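These steps are the Stoer-Wagner minimum-cut phase; a compact sketch (our own illustration, using an adjacency-matrix representation):

```python
def min_weight_cut(w):
    """Stoer-Wagner global min cut. w: symmetric adjacency matrix."""
    n = len(w)
    w = [row[:] for row in w]                 # work on a copy
    vertices = [[i] for i in range(n)]        # original vertices per super-vertex
    best_weight, best_cut = float("inf"), None
    active = list(range(n))
    while len(active) > 1:
        # one minimum-cut phase: repeatedly add the most tightly connected vertex
        a = [active[0]]
        conn = {v: w[active[0]][v] for v in active[1:]}
        while conn:
            t = max(conn, key=conn.get)       # most highly connected vertex
            del conn[t]
            a.append(t)
            for v in conn:
                conn[v] += w[t][v]
        s, t = a[-2], a[-1]
        # cut-of-the-phase: t's super-vertex against everything else
        cut_weight = sum(w[t][v] for v in active if v != t)
        if cut_weight < best_weight:
            best_weight, best_cut = cut_weight, vertices[t][:]
        # merge t into the 2nd-last marked node s, then remove t
        vertices[s] += vertices[t]
        for v in active:
            w[s][v] += w[t][v]
            w[v][s] = w[s][v]
        w[s][s] = 0
        active.remove(t)
    return best_weight, best_cut

# Two heavy triangles joined by one light edge; the min cut is that edge.
w = [[0, 3, 3, 0, 0, 0],
     [3, 0, 3, 0, 0, 0],
     [3, 3, 0, 1, 0, 0],
     [0, 0, 1, 0, 3, 3],
     [0, 0, 0, 3, 0, 3],
     [0, 0, 0, 3, 3, 0]]
weight, cut = min_weight_cut(w)
```

Each phase merges the last two marked vertices, so the graph shrinks by one vertex per phase and the lightest cut-of-the-phase is a global minimum cut.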

MinWeightCut Example

MinWeightCut

CLICK: Adoption & Merging

• Basic_CLICK produces kernels – not full clusters

• Expand kernels by adding the closest singletons

• Merge kernels with the closest similarity
  – In both situations, only merge/adopt over some threshold

Full CLICK

R: singleton set

Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results & Analysis

• Two metrics:
  – Similarity between result clusters
  – Similarity of clusters with the paper's results

• Map clusters to find regions of similar data

• Compare the best clusters to find relationships

[Figure: BCG, expression at 1 hour vs. 6 hours; Bayesian, CLICK, and SOM clusters overlaid.]

[Figure: E. coli, 1 hour vs. 6 hours; Bayesian, CLICK, and SOM clusters.]

[Figure: EHEC, 1 hour vs. 6 hours; Bayesian, CLICK, and SOM clusters.]

[Figure: L. monocytogenes, 1 hour vs. 6 hours; Bayesian, CLICK, and SOM clusters.]

[Figure: Latex, 1 hour vs. 6 hours; Bayesian, CLICK, and SOM clusters.]

[Figure: M. tuberculosis, 1 hour vs. 6 hours; Bayesian, CLICK, and SOM clusters.]

[Figure: S. aureus, 1 hour vs. 6 hours; Bayesian, CLICK, and SOM clusters.]

[Figure: S. Typhi, 1 hour vs. 6 hours; Bayesian, CLICK, and SOM clusters.]

[Figure: S. Typhimurium, 1 hour vs. 6 hours; Bayesian, CLICK, and SOM clusters.]

Error Metric

• Scores higher for result clusters that contain either very many or very few of the clusters in the truth set

• Scales with the number of data points in each cluster

• Low error implies the truth clusters were highly correlated with ours

• High error implies the truth clusters were spread across many of our clusters

        Bayesian   Click   SOM
Error   4.33       2.44    1.82

$$E = \frac{\sum_{i=1}^{k} \left(1 - \frac{2\,|T_i \cap C_i|}{|T_i| + |C_i|}\right) \cdot |C_i|}{\sum_{i=1}^{k} |C_i|}$$
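A sketch of one way to compute such a size-weighted overlap error (our own illustration; we read the metric as one minus the Dice overlap of each truth cluster T_i with our cluster C_i, weighted by |C_i|, pairing clusters by index; the toy data is ours):

```python
def cluster_error(truth, found):
    """Size-weighted error: one minus the Dice overlap per truth/result pair."""
    num = sum((1 - 2 * len(t & c) / (len(t) + len(c))) * len(c)
              for t, c in zip(truth, found))
    return num / sum(len(c) for c in found)

truth = [{1, 2, 3}, {4, 5, 6}]
found = [{1, 2, 3}, {4, 5}]
err = cluster_error(truth, found)   # perfect first pair, partial second
```

A perfect clustering scores 0, and the error grows as result clusters split or miss their truth counterparts.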

Per-cluster comparison (decimals restored):

Cluster | Percent of Our Clusters       | Percent of Paper              | Cluster Quality
        | Bayesian  Click     SOM       | Bayesian  Click     SOM       | Bayesian  Click     SOM
0       | 0         0.02381   0         | 0.005051  0         0.212121  | 0         -         -
1       | 0.418605  0.368771  0         | 0.090909  0.560606  0         | 3.909091  1.322576  0
2       | 0         0.163333  0.013793  | 0         0.247475  0.010101  | 0         7.424242  1.464646
3       | 1         0         0.337278  | 0.005051  0         0.287879  | 0.005051  0         4.865152
4       | 0.8       0.341772  0.390805  | 0.020202  0.136364  0.171717  | 0.10101   1.077273  1.493939
5       | 0.8       0.121212  0         | 0.040404  0.040404  0         | 0.40404   2.666667  0
6       | 1         0.037037  0         | 0.005051  0.005051  0         | 0.005051  0.136364  0
7       | 1         0.041667  0.88172   | 0.025253  0.005051  0.414141  | 0.126263  0.121212  3.851515
8       | 1         0         0.255556  | 0.015152  0         0.116162  | 0.045455  0         1.045455

Clusters 9-34:

9       | 1         0         0         | 0.010101  0.020202  | 18  64
10      | 1         0.015152  0.045455
11      | 1         0.005051  0.005051
12      | 1         0.010101  0.020202
13      | 1         0.020202  0.080808
14      | 1         0.015152  0.045455
15      | 1         0.005051  0.005051
16      | 0.928571  0.065657  0.919192
17      | 1         0.035354  0.247475
18      | 0.444444  0.020202  0.181818
19      | 0         0         0
20      | 0.909091  0.050505  0.555556
21      | 0.444444  0.040404  0.727273
22      | 1         0.005051  0.005051
23      | 0         0         0
24      | 0         0         0
25      | 0.037037  0.005051  0.136364
26      | 0         0         0
27      | 0         0         0
28      | 0.225     0.136364  1.636364
29      | 0.485714  0.085859  3.005051
30      | 0.736842  0.070707  1.343434
31      | 0.004673  0.005051  1.080808
32      | 0.59375   0.191919  1.228283
33      | 0         0         0
34      | 0.008065  0.005051  0.626263

Cluster Quality vs. Minimum Cluster Composition:

Min Composition (%) | Bayesian | Click  | SOM
0                   | 0.00     | 0.00   | 0.00
5                   | 98.48    | 98.48  | 98.99
10                  | 98.48    | 98.48  | 98.99
15                  | 98.48    | 94.44  | 98.99
20                  | 98.48    | 69.70  | 98.99
25                  | 84.85    | 69.70  | 98.99
30                  | 84.85    | 69.70  | 87.37
40                  | 84.85    | 0.00   | 41.41
45                  | 69.70    | 0.00   | 41.41
50                  | 61.11    | 0.00   | 41.41
55                  | 61.11    | 0.00   | 41.41
60                  | 41.92    | 0.00   | 41.41
65                  | 41.92    | 0.00   | 41.41
70                  | 41.92    | 0.00   | 41.41
75                  | 34.85    | 0.00   | 41.41
80                  | 28.79    | 0.00   | 41.41
85                  | 28.79    | 0.00   | 41.41
90                  | 28.79    | 0.00   | 0.00
95                  | 17.17    | 0.00   | 0.00
100                 | 100.00   | 100.00 | 100.00

Error (Bayesian, Click, SOM): 4.22929, 2.38409, 1.78025
Error (Bayesian, Click, SOM): 4.33, 2.440, 1.822

Page 7: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

K‐Means

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Data

Initial Clusters

K‐Means

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Contributes to Closest Cluster

K‐Means

[31] Jiang D Tang C and Zhang A Cluster Analysis for Gene Expression Data A Survey

Clusters adjust

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Data

Initial Clusters

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

High Contribution to Closest Cluster

Low Contribution to Other Cluster

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Clusters adjust

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Weak contributors removed

Bayesian Clustering

bull Initialize Clusters

bull Until Convergencendash Expectation Compute new probabilities

ndash Maximization Compute new clusters

bull Use BIC to prune clusters

bull Compute global BIC

bull Repeat with different initial clustersndash Keep results with best BIC

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Normal Distribution

πσ

σ

2)(

220 2)( xxexN

minusminus

=

n

xxexN)2(||

)(2)()( 1

π

μμ

Σ=

minusΣminusminus minus

nk

ddtk

tki

tkik

tkiedP

)2(||)|(

2)()( 1

πμμ

μμ

Σ=isin

minusΣminusminus minus

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Expectation Non‐spherical Clusters

Tk

n

kk DD

⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢

λ

λλ

00

0000

2

1

bull Dk=Eigenvectors of Σ are the orientation of cluster k

bull λ1 λn=Eigenvalues of Σ are the radii of cluster k

bull Spherical clusters are usually sufficient (Identity Matrix)

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Compute initial probabilities

bull Given the initial clusters initialize the probabilities with the Normal Distribution

bull Since the probabilities will have to be normalized anyway we can skip the constant

2)()(00 00

)|( kiki ddkki edP μμμμ minussdotminusminus=isin

Expectation Bayersquos law

)()()|()|(

EffectPCausePCauseEffectPEffectCauseP sdot

=

Posterior Prior

sum sdotsdot

=)()|(

)()|()|(CausePCauseEffectP

CausePCauseEffectPEffectCauseP

Expectation Bayersquos law

sum sum

sum

= =

minusminus

=

minusminus

isinsdotisin

isinsdotisin=isin C

j

n

i

tji

tj

tj

tji

n

i

tki

tk

tk

tki

tki

tk

dPdP

dPdPdP

1 1

11

1

11

)|()|(

)|()|()|(

μμμμ

μμμμμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Recompute membership probabilities using normal

distribution

New Probability Old Probabilities

Maximization New Clusters

bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster

sum

sum

=

minusminus

=

minusminus

isin

sdotisin= n

i

tki

tk

n

ii

tki

tk

tk

dP

ddP

1

11

1

11

)|(

)|(

μμ

μμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Bayesian Information Criterion

bull BIC measures the efficiency of the parameterized model in terms of predicting the data

bull It is independent to prior knowledge and the model used

)ln()|(

ln 1

2

nkn

dPnBIC

n

i

tk

tki

tk sdot+

⎟⎟⎟⎟

⎜⎜⎜⎜

⎛isin

sdot=sum=

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

Fit quality Complexity penalty

Bayesian Information Criterion

bull After convergence compute global BIC

)ln()|(

ln 1 1

2

nknk

dPnBIC

n

i

k

jjji

sdot+

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛isin

sdot=sumsum= =

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

GCSF

GMCSF

IL12B

IL1RN

IL6

IL6

PBEF

proIL1B

TNFA

IL8

IL8

IP10

MCP1

MGSA

MIP1A

MIP1B

MIP2A

MIP2B

RANTES

CD44

CD44

ICAM1

IFITM1

LAMB3

NINJ1

TNFAIP6

ADORA2A

CCR6

CCR7

CCRL2

DTR

1

2

37

11

6

4

10

5

8 12

9

13

14

15

17

16

18

19

20

22

21

23

24

25

26

27

28

29

30

Initial ClustersE coliTest Data

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR

1 Hour

12 H

ours

E-Coli Initial Clusters (Test Data)

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9

E-Coli Final Clusters (Test Data)

1 Hour

12 H

ours

EHEC

-6

-4

-2

0

2

4

6

8

-6 -4 -2 0 2 4 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters

S Aureus

-6

-4

-2

0

2

4

6

8

-5 -4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

oura

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters

Ecoli

-4

-3

-2

-1

0

1

2

3

4

5

6

-2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters

S Typhirium

-4

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters

Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a

random map to be like the input values Thus similar inputs will be mapped close to each other

nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)

learning_algorith(n V)SOM larr random valuesfor j larr 1 to n

v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment

end

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption & Merging

• Basic_CLICK produces kernels – not full clusters

• Expand kernels by adopting the closest singletons

• Merge kernels with the closest similarity
  – In both situations, only merge/adopt above some threshold

Full CLICK

R: Singleton Set

Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of Kernels produced
        Let R be the set of Singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results & Analysis

• Two metrics:
  – Similarity between result clusters
  – Similarity of clusters with the paper's results

• Map clusters to find regions of similar data

• Compare best clusters to find relationships

[Scatter plots: clustering results for BCG, E. coli, EHEC, L. monocytogenes, Latex, M. tuberculosis, S. aureus, S. Typhi, and S. Typhimurium. Each plots gene expression at 1 hour vs 6 hours, comparing the Bayesian, CLICK, and SOM clusterings.]

Error Metric

• Scores higher for clusters that contain either very many or very few clusters of the truth set

• Scales to the number of datums in each cluster

• Low error implies the truth clusters were highly correlated

• High error implies the truth clusters were well distributed

         Bayesian   CLICK   SOM
Error    4.33       2.44    18.22

E = [ Σ_{i=1}^{k} (1 − 2·|T_i ∩ C_i| / (|T_i| + |C_i|)) · |C_i| ] / Σ_{i=1}^{k} |C_i|
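The error metric described above can be sketched as follows (the exact form is reconstructed from a damaged slide, so treat it as an assumption; `cluster_error` is a name chosen here):

```python
def cluster_error(truth, ours):
    """Sketch of the slides' error metric: for each matched truth/result
    cluster pair, penalize 1 minus the Dice overlap, weighted by the
    result cluster's size and normalized by the total number of datums."""
    total = sum(len(c) for c in ours)
    err = 0.0
    for t, c in zip(truth, ours):
        dice = 2 * len(t & c) / (len(t) + len(c))   # 2|T∩C| / (|T|+|C|), in [0, 1]
        err += (1 - dice) * len(c)                  # weight by cluster size
    return err / total

# two truth clusters vs two result clusters that misplace one gene
print(cluster_error([{1, 2, 3}, {4, 5}], [{1, 2}, {3, 4, 5}]))   # ~0.2
```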

[Table: per-cluster comparison for clusters 0–34 — percent of our clusters, percent of the paper's clusters, cluster quality, and minimum cluster composition at thresholds from 0% to 100% — for the Bayesian, CLICK, and SOM results. Overall error: Bayesian 4.33, CLICK 2.44, SOM 18.22.]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption & Merging
  • Full CLICK
  • Results & Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64

K‐Means

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.

Contributes to Closest Cluster

K‐Means

[31] Jiang, D., Tang, C., and Zhang, A. Cluster Analysis for Gene Expression Data: A Survey.

Clusters adjust

Bayesian Clustering

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.

Data

Initial Clusters

Bayesian Clustering

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.

High Contribution to Closest Cluster

Low Contribution to Other Cluster

Bayesian Clustering

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.

Clusters adjust

Bayesian Clustering

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.

Weak contributors removed

Bayesian Clustering

• Initialize Clusters

• Until Convergence
  – Expectation: Compute new probabilities
  – Maximization: Compute new clusters

• Use BIC to prune clusters

• Compute global BIC

• Repeat with different initial clusters
  – Keep results with best BIC

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.
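The expectation/maximization loop above can be sketched in a few lines (a minimal sketch assuming spherical, unit-variance clusters, fixed initialization, and no BIC pruning or restarts):

```python
import numpy as np

def em_spherical(data, k, iters=50):
    """Minimal EM sketch.
    E-step: membership probabilities from the normal density.
    M-step: means as probability-weighted averages of the points."""
    mu = data[:k].copy()   # initial clusters (a real run would restart randomly)
    for _ in range(iters):
        # Expectation: P(d_i in k) proportional to exp(-|d_i - mu_k|^2 / 2)
        d2 = ((data[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        p = np.exp(-0.5 * d2)
        p /= p.sum(axis=1, keepdims=True)         # normalize over clusters
        # Maximization: weighted average of each point by its membership
        mu = (p.T @ data) / p.sum(axis=0)[:, None]
    return mu, p

# two well-separated point groups (illustrative data)
data = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
mu, p = em_spherical(data, k=2)
print(np.round(sorted(mu.tolist()), 2))   # two means, near the two point groups
```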

Normal Distribution

N(x) = (1 / (σ√(2π))) · e^{−(x − x₀)² / (2σ²)}                       (univariate)

N(x) = (1 / √((2π)ⁿ |Σ|)) · e^{−(x − μ)ᵀ Σ⁻¹ (x − μ) / 2}            (multivariate)

P(dᵢ ∈ k | μₖᵗ) = (1 / √((2π)ⁿ |Σₖ|)) · e^{−(dᵢ − μₖᵗ)ᵀ Σₖ⁻¹ (dᵢ − μₖᵗ) / 2}

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.

Expectation: Non-spherical Clusters

Σₖ = Dₖ · diag(λ₁, λ₂, …, λₙ) · Dₖᵀ

• Dₖ = eigenvectors of Σ: the orientation of cluster k

• λ₁, …, λₙ = eigenvalues of Σ: the radii of cluster k

• Spherical clusters are usually sufficient (identity matrix)

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.

Compute initial probabilities

• Given the initial clusters, initialize the probabilities with the Normal Distribution

• Since the probabilities will have to be normalized anyway, we can skip the constant

P(dᵢ ∈ k | μₖ⁰) = e^{−(dᵢ − μₖ⁰) · (dᵢ − μₖ⁰) / 2}

Expectation: Bayes' law

P(Cause | Effect) = P(Effect | Cause) · P(Cause) / P(Effect)
     (Posterior)                          (Prior)

P(Cause | Effect) = P(Effect | Cause) · P(Cause) / Σ P(Effect | Cause) · P(Cause)

Expectation: Bayes' law

Recompute the membership probabilities using the normal distribution:

P(dᵢ ∈ kᵗ) = [ P(dᵢ ∈ k | μₖᵗ⁻¹) · Σ_{i=1}^{n} P(dᵢ ∈ k | μₖᵗ⁻¹) ] / Σ_{j=1}^{C} [ P(dᵢ ∈ j | μⱼᵗ⁻¹) · Σ_{i=1}^{n} P(dᵢ ∈ j | μⱼᵗ⁻¹) ]

(New probability on the left; old probabilities on the right.)

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.

Maximization: New Clusters

• Clusters are computed as the weighted average of each point and the probability it belongs to the cluster

μₖᵗ = Σ_{i=1}^{n} P(dᵢ ∈ k | μₖᵗ⁻¹) · dᵢ  /  Σ_{i=1}^{n} P(dᵢ ∈ k | μₖᵗ⁻¹)

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.

Bayesian Information Criterion

• BIC measures the efficiency of the parameterized model in terms of predicting the data

• It is independent of prior knowledge and the model used

BICₖ = n · ln( (1/n) Σ_{i=1}^{n} P(dᵢ ∈ k | μₖᵗ) ) + (k/2) · ln(n)

(first term: fit quality; second term: complexity penalty)

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion

Bayesian Information Criterion

• After convergence, compute the global BIC

BIC = n · ln( (1/(n·k)) Σ_{i=1}^{n} Σ_{j=1}^{k} P(dᵢ ∈ j | μⱼ) ) + (k/2) · ln(n)

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
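For reference, the standard BIC from the cited source balances the same two terms; a generic sketch (not the slides' per-cluster variant, and the log-likelihood values below are illustrative):

```python
import math

def bic(log_likelihood, num_params, n):
    """Standard BIC per the cited definition: fit quality (-2 ln L)
    plus a complexity penalty (k ln n); lower is better."""
    return -2 * log_likelihood + num_params * math.log(n)

# a model with slightly worse fit but far fewer parameters can still win
print(bic(-120.0, 4, 100) < bic(-118.0, 12, 100))   # True
```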

[Charts: E. coli test data, Hour 1 vs Hour 12 — initial clusters and final clusters (Clusters 1–9) for the genes GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, and DTR.]

[Scatter plots: clustering results (1 Hour vs 6 Hours) for EHEC, S. aureus, E. coli, and S. Typhimurium, showing rational clusters vs discrete clusters (Clusters 0–12).]

Self-Organizing Maps

• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values, so similar inputs are mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← min(d(v, SOM)) for every node in SOM
        neighbors ← neighborhood(BMU)
        neighbors.weight = neighbors.weight + adjustment
    end

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001) 79-96.
• Kohonen's Self Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
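The pseudocode above can be fleshed out into a small runnable sketch (grid size, decay constants, and data here are illustrative assumptions, not values from the project):

```python
import numpy as np

def train_som(data, grid=(3, 3), iters=200, lr0=0.5, seed=1):
    """SOM training sketch: pick a random training vector, find the BMU
    by squared Euclidean distance, then pull the BMU's neighborhood
    toward the vector with a decaying learning rate and radius."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    som = rng.random((rows, cols, data.shape[1]))     # random initial map
    coords = np.indices(grid).transpose(1, 2, 0)      # (r, c) position of each node
    sigma0 = max(grid) / 2.0                          # initial neighborhood size
    lam = iters / np.log(sigma0)                      # exponential-decay constant
    for t in range(iters):
        v = data[rng.integers(len(data))]             # random training vector
        d = ((som - v) ** 2).sum(axis=2)
        bmu = np.unravel_index(np.argmin(d), grid)    # best matching unit
        sigma = sigma0 * np.exp(-t / lam)             # shrinking neighborhood
        lr = lr0 * np.exp(-t / lam)                   # shrinking learning rate
        grid_d2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
        theta = np.exp(-grid_d2 / (2 * sigma ** 2))   # falloff away from the BMU
        som += (lr * theta)[:, :, None] * (v - som)   # weight adjustment
    return som

data = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
som = train_som(data)
print(som.shape)   # (3, 3, 2): a 3x3 map of 2-dimensional weight vectors
```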

Best Matching Unit

• The BMU is the node with the minimum distance between the training vector and all the nodes in the SOM. It is typically found using the Euclidean distance formula:

d(v, w) = √( Σ_{i=1}^{k} (vᵢ − wᵢ)² )

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001) 79-96.
• Kohonen's Self Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html

Neighborhood

• The area of the neighborhood shrinks over time. The exponential decay function can be used for this:

σ(t) = σ₀ · e^{−t/λ},  with λ = n / ln(σ₀)

where n is the number of iterations that the algorithm will run and σ₀ is the initial size of the neighborhood.

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001) 79-96.
• Kohonen's Self Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html

Weight Adjustment

• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node:

W(t+1) = W(t) + Θ(t) · L(t) · (V(t) − W(t))

• One decay function, L(t), decreases the learning rate over time

• The other, Θ(t), decreases the rate of learning for neighbors further away from the BMU

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001) 79-96.
• Kohonen's Self Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html

Mapping Mode (SOM)

• After the learning algorithm has been run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs will be near each other on the two-dimensional map

GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern

• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

• Use online or download

GenePattern
Module name: SOMClustering
Description: Self-Organizing Maps algorithm
Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu
Date: 10/28/03
Release: 1.0
Summary:

The Self-Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g., stocks, mutual funds, spectral peaks, etc.)

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged the input data into 9 clusters using the GenePattern software

• Why 9 clusters?
  – Used the average of Stephen's output (8) + 1 to get a square map (3×3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize Graph G = (V, E)
  – Vertex v: a single gene "fingerprint" vector
  – Edge e: pairwise similarity between genes
  – Cut C: subset of E that partitions the graph
  – Cluster c: subset of V
  – Intersection of clusters cᵢ, cⱼ, i ≠ j, is Ø
  – Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 9: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

K‐Means

[31] Jiang D Tang C and Zhang A Cluster Analysis for Gene Expression Data A Survey

Clusters adjust

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Data

Initial Clusters

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

High Contribution to Closest Cluster

Low Contribution to Other Cluster

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Clusters adjust

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Weak contributors removed

Bayesian Clustering

bull Initialize Clusters

bull Until Convergencendash Expectation Compute new probabilities

ndash Maximization Compute new clusters

bull Use BIC to prune clusters

bull Compute global BIC

bull Repeat with different initial clustersndash Keep results with best BIC

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Normal Distribution

πσ

σ

2)(

220 2)( xxexN

minusminus

=

n

xxexN)2(||

)(2)()( 1

π

μμ

Σ=

minusΣminusminus minus

nk

ddtk

tki

tkik

tkiedP

)2(||)|(

2)()( 1

πμμ

μμ

Σ=isin

minusΣminusminus minus

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Expectation Non‐spherical Clusters

Tk

n

kk DD

⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢

λ

λλ

00

0000

2

1

bull Dk=Eigenvectors of Σ are the orientation of cluster k

bull λ1 λn=Eigenvalues of Σ are the radii of cluster k

bull Spherical clusters are usually sufficient (Identity Matrix)

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Compute initial probabilities

bull Given the initial clusters initialize the probabilities with the Normal Distribution

bull Since the probabilities will have to be normalized anyway we can skip the constant

2)()(00 00

)|( kiki ddkki edP μμμμ minussdotminusminus=isin

Expectation Bayersquos law

)()()|()|(

EffectPCausePCauseEffectPEffectCauseP sdot

=

Posterior Prior

sum sdotsdot

=)()|(

)()|()|(CausePCauseEffectP

CausePCauseEffectPEffectCauseP

Expectation Bayersquos law

sum sum

sum

= =

minusminus

=

minusminus

isinsdotisin

isinsdotisin=isin C

j

n

i

tji

tj

tj

tji

n

i

tki

tk

tk

tki

tki

tk

dPdP

dPdPdP

1 1

11

1

11

)|()|(

)|()|()|(

μμμμ

μμμμμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Recompute membership probabilities using normal

distribution

New Probability Old Probabilities

Maximization New Clusters

bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster

sum

sum

=

minusminus

=

minusminus

isin

sdotisin= n

i

tki

tk

n

ii

tki

tk

tk

dP

ddP

1

11

1

11

)|(

)|(

μμ

μμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Bayesian Information Criterion

bull BIC measures the efficiency of the parameterized model in terms of predicting the data

bull It is independent to prior knowledge and the model used

)ln()|(

ln 1

2

nkn

dPnBIC

n

i

tk

tki

tk sdot+

⎟⎟⎟⎟

⎜⎜⎜⎜

⎛isin

sdot=sum=

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

Fit quality Complexity penalty

Bayesian Information Criterion

bull After convergence compute global BIC

)ln()|(

ln 1 1

2

nknk

dPnBIC

n

i

k

jjji

sdot+

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛isin

sdot=sumsum= =

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

GCSF

GMCSF

IL12B

IL1RN

IL6

IL6

PBEF

proIL1B

TNFA

IL8

IL8

IP10

MCP1

MGSA

MIP1A

MIP1B

MIP2A

MIP2B

RANTES

CD44

CD44

ICAM1

IFITM1

LAMB3

NINJ1

TNFAIP6

ADORA2A

CCR6

CCR7

CCRL2

DTR

1

2

37

11

6

4

10

5

8 12

9

13

14

15

17

16

18

19

20

22

21

23

24

25

26

27

28

29

30

Initial ClustersE coliTest Data

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR

1 Hour

12 H

ours

E-Coli Initial Clusters (Test Data)

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9

E-Coli Final Clusters (Test Data)

1 Hour

12 H

ours

EHEC

-6

-4

-2

0

2

4

6

8

-6 -4 -2 0 2 4 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters

S Aureus

-6

-4

-2

0

2

4

6

8

-5 -4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

oura

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters

Ecoli

-4

-3

-2

-1

0

1

2

3

4

5

6

-2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters

S Typhirium

-4

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters

Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a

random map to be like the input values Thus similar inputs will be mapped close to each other

nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)

learning_algorith(n V)SOM larr random valuesfor j larr 1 to n

v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment

end

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged the input data into 9 clusters using the GenePattern software.

• Why 9 clusters?
  – Used the average of Stephen's output (8) + 1 to get a square map (3×3 SOM).

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize graph G = (V, E)
  – Vertex v: a single gene's "fingerprint" vector
  – Edge e: pairwise similarity between genes
  – Cut C: subset of E that partitions the graph
  – Cluster c: subset of V
  – Clusters are disjoint: ci ∩ cj = Ø for i ≠ j
  – Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.

CLICK Preprocessing

• Input: n × p matrix M of values

• n genes, p tests

• Data must be normalized

• Similarity measure:
  – S(v,u) = v·u = |v||u| cos θ
  – Proportional to cos θ only when the norm is fixed for all v ∈ V
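A minimal sketch of this preprocessing (mean-centering each fingerprint is an assumption on our part — the slides only require that the fingerprints be normalized to a fixed norm so that the dot product is proportional to cos θ):

```python
import numpy as np

def normalize_rows(M):
    # Center each gene fingerprint and scale it to unit norm, so the
    # dot product of two fingerprints equals cos(theta) between them.
    M = M - M.mean(axis=1, keepdims=True)
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def similarity_matrix(M):
    # S[v, u] = v . u for every pair of normalized fingerprints.
    N = normalize_rows(M)
    return N @ N.T
```

With unit-norm rows, S ranges from −1 (opposite expression profiles) to 1 (identical profiles).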

CLICK Similarity

• Key idea: S follows a normalized mixture of two distributions:
  – For u, v ∈ V in the same cluster: S(u,v) ~ f_T(x) = f(x | μ_T, σ_T²), the pdf for elements in the same cluster
  – For u, v ∈ V in different clusters: S(u,v) ~ f_F(x) = f(x | μ_F, σ_F²), the pdf for elements in different clusters

Basic CLICK

• M: n × p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates

f(S_ij | μ, σ) = exp(−(S_ij − μ)² / (2σ²)) / (√(2π) σ)

w_ij = ln [ p_mates · f(S_ij | μ_T, σ_T) / ((1 − p_mates) · f(S_ij | μ_F, σ_F)) ]

w_ij = ln ( p_mates σ_F / ((1 − p_mates) σ_T) ) + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
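The expanded weight formula can be sketched directly (a sketch assuming the mixture parameters p_mates, μ_T, σ_T, μ_F, σ_F have already been estimated in preprocessing):

```python
import math

def mate_weight(S_ij, p_mates, mu_T, sigma_T, mu_F, sigma_F):
    # w_ij = ln [ p_mates * f(S_ij | mu_T, sigma_T)
    #             / ((1 - p_mates) * f(S_ij | mu_F, sigma_F)) ]
    # written in the expanded form, which avoids evaluating the Gaussian
    # pdfs explicitly.
    return (math.log(p_mates * sigma_F / ((1 - p_mates) * sigma_T))
            + (S_ij - mu_F) ** 2 / (2 * sigma_F ** 2)
            - (S_ij - mu_T) ** 2 / (2 * sigma_T ** 2))
```

Positive weights indicate the similarity value is better explained by the "mates" distribution than by the "non-mates" one.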

Basic CLICK

R: singleton set

Basic_CLICK(Graph G):
    If V(G) = {v} then          // v is a singleton
        R.add(v)
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)

CLICK Kernel

• Decision problem: is V…
  – a singleton (|V| = 1)?
  – a subset of 2+ clusters (needs further partitioning)?
  – a subset of a single cluster (a kernel)?

CLICK Kernel

• For every possible cut C connecting V:
  – H0_C: cut C disconnects two clusters
  – H1_C: cut C partitions a kernel
  – If Pr(H0_C) > Pr(H1_C) for any C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

W(C) = ln ( Pr(H1_C | C) / Pr(H0_C | C) ) = Σ_{(i,j) ∈ C} w_ij

Minimal Weight Cut

• Choose one vertex as the source; mark it visited.

• Repeatedly mark the most highly connected unvisited vertex.

• The last node t represents the cut:
  – W(C_t) = Σ_i w_it
  – Merge t with the second-to-last marked node s
  – Remove t from V
  – Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
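The Stoer–Wagner procedure described above can be sketched as a straightforward adjacency-matrix implementation (variable names are ours; each phase runs a maximum-adjacency search, records the cut-of-the-phase, then merges the last two vertices):

```python
def min_weight_cut(weights):
    # weights: symmetric adjacency matrix (list of lists) of a connected graph.
    # Returns (cut_weight, vertices_on_one_side) of a minimum-weight cut.
    n = len(weights)
    w = [row[:] for row in weights]
    groups = [[i] for i in range(n)]        # merged "super-vertices"
    active = list(range(n))
    best_weight, best_cut = float("inf"), []
    while len(active) > 1:
        # Maximum-adjacency search: repeatedly pick the most connected vertex.
        a = [active[0]]
        conn = {v: w[active[0]][v] for v in active[1:]}
        while conn:
            t = max(conn, key=conn.get)
            del conn[t]
            a.append(t)
            for v in conn:
                conn[v] += w[t][v]
        s, t = a[-2], a[-1]
        # Cut of the phase: t (plus everything merged into it) vs. the rest.
        cut_of_phase = sum(w[t][v] for v in active if v != t)
        if cut_of_phase < best_weight:
            best_weight, best_cut = cut_of_phase, list(groups[t])
        # Merge t into s (second-to-last marked node) and remove t.
        for v in active:
            if v == s or v == t:
                continue
            w[s][v] += w[t][v]
            w[v][s] = w[s][v]
        groups[s].extend(groups[t])
        active.remove(t)
    return best_weight, best_cut
```

For a path graph 0–1 (weight 3), 1–2 (weight 1), 2–3 (weight 3), the minimum cut severs the middle edge, splitting {0, 1} from {2, 3}.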

MinWeightCut Example

MinWeightCut

CLICK: Adoption & Merging

• Basic_CLICK produces kernels – not full clusters

• Expand kernels by adopting the closest singletons

• Merge kernels with the closest similarity
  – In both cases, only merge/adopt above some similarity threshold

Full CLICK

R: singleton set

Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results & Analysis

• Two metrics:
  – Similarity between our result clusters
  – Similarity of our clusters to the paper's results

• Map clusters to find regions of similar data

• Compare the best clusters to find relationships

[Scatter plots comparing the Bayesian, CLICK, and SOM cluster assignments (1 Hour vs. 6 Hours expression) for each stimulus: BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. Typhi, and S. Typhimurium.]

Error Metric

• Scores higher for result clusters that contain either very many or very few of the truth-set clusters

• Scales with the number of data points in each cluster

• Low error implies the truth clusters were highly correlated with ours

• High error implies the truth clusters were distributed across ours

        Bayesian  CLICK  SOM
Error:  4.33      2.44   18.22

E = ( Σ_{i=1}^{k} [ (1 − |T_i ∩ C_i| / (|T_i|^(1/2) · |C_i|^(1/2))) · |C_i| ] ) / ( Σ_{i=1}^{k} |C_i| )

where T_i is the truth-set cluster matched to our cluster C_i.

[Flattened results table: for each cluster index (0–34), the percent of our clusters, the percent of the paper's clusters, and the cluster quality under Bayesian, CLICK, and SOM, plus the minimum cluster composition at thresholds from 5% to 100%. Overall error – Bayesian: 4.33, CLICK: 2.44, SOM: 18.22.]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK: Adoption & Merging
  • Full CLICK
  • Results & Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64

Bayesian Clustering

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.

Data

Initial Clusters

Bayesian Clustering


High Contribution to Closest Cluster

Low Contribution to Other Cluster

Bayesian Clustering


Clusters adjust

Bayesian Clustering


Weak contributors removed

Bayesian Clustering

• Initialize clusters

• Until convergence:
  – Expectation: compute new probabilities
  – Maximization: compute new clusters

• Use BIC to prune clusters

• Compute the global BIC

• Repeat with different initial clusters
  – Keep the results with the best BIC

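The expectation–maximization loop can be sketched for the spherical (identity-covariance) case; initialization, BIC pruning, and restarts are omitted, and the uniform-prior simplification is our assumption:

```python
import numpy as np

def em_cluster(data, mu, n_iters=100, tol=1e-6):
    # data: (n, d) points; mu: (k, d) initial cluster centers.
    # Spherical clusters (identity covariance), which the slides note
    # is usually sufficient.
    for _ in range(n_iters):
        # Expectation: membership probabilities from the (unnormalized)
        # normal density, then normalized across clusters.
        d2 = ((data[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        p = np.exp(-d2 / 2)
        p /= p.sum(axis=1, keepdims=True)
        # Maximization: clusters are the probability-weighted average
        # of the points.
        new_mu = (p.T @ data) / p.sum(axis=0)[:, None]
        if np.linalg.norm(new_mu - mu) < tol:
            mu = new_mu
            break
        mu = new_mu
    return mu, p
```

On two well-separated blobs, the centers converge to the blob means within a few iterations.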

Normal Distribution

• One-dimensional: N₀(x) = exp(−(x − x̄)² / (2σ²)) / √(2πσ²)

• n-dimensional: N(x) = exp(−(x − μ)ᵀ Σ⁻¹ (x − μ) / 2) / √((2π)ⁿ |Σ|)

• Cluster membership: P(d_i ∈ μ_k^t) = exp(−(d_i − μ_k^t)ᵀ Σ_k⁻¹ (d_i − μ_k^t) / 2) / √((2π)ⁿ |Σ_k|)
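The n-dimensional density can be evaluated directly; a small sketch (function name is ours):

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    # N(x) = exp(-(x - mu)^T Sigma^-1 (x - mu) / 2) / sqrt((2*pi)^n |Sigma|)
    n = len(x)
    diff = x - mu
    expo = -0.5 * diff @ np.linalg.solve(Sigma, diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
```

In one dimension this reduces to the familiar N₀(x) formula above.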

Expectation Non‐spherical Clusters

• Σ_k = D_k · diag(λ₁, λ₂, …, λₙ) · D_kᵀ

• D_k = eigenvectors of Σ: the orientation of cluster k

• λ₁, …, λₙ = eigenvalues of Σ: the radii of cluster k

• Spherical clusters are usually sufficient (identity matrix)
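The decomposition can be checked numerically with NumPy's symmetric eigendecomposition (the example covariance matrix is illustrative):

```python
import numpy as np

# Decompose a covariance matrix into orientation (eigenvectors) and
# radii (eigenvalues): Sigma_k = D_k diag(lambda_1..lambda_n) D_k^T.
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
lam, D = np.linalg.eigh(Sigma)            # lam: radii, D: orientation
reconstructed = D @ np.diag(lam) @ D.T
assert np.allclose(reconstructed, Sigma)
```

For the identity matrix, all eigenvalues are 1 and the cluster is spherical.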

Compute initial probabilities

• Given the initial clusters, initialize the probabilities with the normal distribution.

• Since the probabilities will have to be normalized anyway, we can skip the constant:

P(d_i ∈ μ_k⁰) = exp(−(d_i − μ_k⁰) · (d_i − μ_k⁰) / 2)

Expectation: Bayes' law

P(Cause | Effect) = P(Effect | Cause) · P(Cause) / P(Effect)

(P(Cause | Effect) is the posterior; P(Cause) is the prior.)

Expanding the denominator over all causes:

P(Cause | Effect) = P(Effect | Cause) · P(Cause) / Σ [ P(Effect | Cause) · P(Cause) ]
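A tiny numeric illustration of the expanded denominator, for a binary cause (the function and its arguments are ours, for illustration only):

```python
def bayes(p_effect_given_cause, p_cause, p_effect_given_other):
    # P(Cause | Effect) = P(E|C) P(C) / [ P(E|C) P(C) + P(E|~C) P(~C) ]
    num = p_effect_given_cause * p_cause
    return num / (num + p_effect_given_other * (1 - p_cause))
```

For example, with P(E|C) = 0.9, P(C) = 0.1, and P(E|~C) = 0.2, the posterior is 0.09 / (0.09 + 0.18) = 1/3.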

Expectation: Bayes' law

Recompute the membership probabilities using the normal distribution (the new probability on the left is computed from the old probabilities on the right):

P(d_i ∈ μ_k^t) = [ P(d_i ∈ μ_k^{t−1}) · Σ_{i′=1}^{n} P(d_{i′} ∈ μ_k^{t−1}) ] / Σ_{j=1}^{C} [ P(d_i ∈ μ_j^{t−1}) · Σ_{i′=1}^{n} P(d_{i′} ∈ μ_j^{t−1}) ]

Maximization: New Clusters

• Clusters are computed as the weighted average of each point and the probability that it belongs to the cluster:

μ_k^t = Σ_{i=1}^{n} [ P(d_i ∈ μ_k^{t−1}) · d_i ] / Σ_{i=1}^{n} P(d_i ∈ μ_k^{t−1})

Bayesian Information Criterion

• BIC measures the efficiency of the parameterized model in terms of predicting the data.

• It is independent of prior knowledge and of the model used.

BIC_k^t = n · ln( Σ_{i=1}^{n} P(d_i ∈ μ_k^t) / n ) + (n_k / 2) · ln(n)

(The first term measures fit quality; the second is the complexity penalty.)

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
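For reference, a sketch of the standard form of the criterion from the cited source (BIC = −2 ln L + k ln n); the slides apply a per-cluster variant of the same fit-vs.-complexity trade-off:

```python
import numpy as np

def bic(log_likelihood, n_params, n_points):
    # Standard form: BIC = -2 ln L + k ln n.  Lower is better; the first
    # term rewards fit, the second penalizes model complexity.
    return -2.0 * log_likelihood + n_params * np.log(n_points)
```

With equal fit, a model with more parameters always scores worse, which is what drives the cluster pruning step.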

Bayesian Information Criterion

• After convergence, compute the global BIC:

BIC = n · ln( Σ_{i=1}^{n} Σ_{j=1}^{k} P(d_i ∈ μ_j) / n ) + (n_k · k / 2) · ln(n)

[Scatter plots of the E. coli test data, Hour 1 vs. Hour 12 expression, for the gene probes GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, and DTR (some genes appear twice): one panel showing the initial clusters and one showing the final clusters (Clusters 1–9).]

[Bayesian clustering scatter plots, 1 Hour vs. 6 Hours expression, for EHEC, S. aureus, E. coli, and S. Typhimurium, showing the resulting clusters (Cluster 0–12) along with rational and discrete clusters.]

Self-Organizing Maps

• SOMs are based on neural networks. They use a learning algorithm to train a random map to be like the input values; thus, similar inputs will be mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← node in SOM minimizing d(v, node)
        neighbors ← neighborhood(BMU)
        neighbors.weight = neighbors.weight + adjustment
    end

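A runnable sketch of the learning algorithm above (parameter choices such as L0, the seed, and the decay time constants are illustrative, following the cited Kohonen/AI-Junkie formulation):

```python
import numpy as np

def train_som(data, grid=(3, 3), n_iters=1000, L0=0.1, seed=0):
    # A random 2-D map of weight vectors is repeatedly pulled toward
    # randomly drawn training vectors.
    rng = np.random.default_rng(seed)
    rows, cols = grid
    som = rng.random((rows, cols, data.shape[1]))
    coords = np.indices((rows, cols)).transpose(1, 2, 0)    # grid positions
    sigma0 = max(rows, cols) / 2                            # initial radius
    lam = n_iters / np.log(sigma0) if sigma0 > 1 else float(n_iters)
    for t in range(n_iters):
        v = data[rng.integers(len(data))]                   # v <- V(random)
        # BMU <- node minimizing the Euclidean distance to v
        d = np.linalg.norm(som - v, axis=2)
        bmu = np.unravel_index(np.argmin(d), d.shape)
        # Shrinking neighborhood and decaying learning rate
        sigma_t = sigma0 * np.exp(-t / lam)
        L_t = L0 * np.exp(-t / n_iters)
        grid_d2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
        theta = np.exp(-grid_d2 / (2 * sigma_t ** 2))       # neighborhood decay
        som += (L_t * theta)[:, :, None] * (v - som)        # weight adjustment
    return som
```

After training, an input is mapped by locating its BMU on the grid, so similar n-dimensional inputs land on nearby grid nodes.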

Best Matching Unit

• The BMU is the node with the minimum distance between the training vector and all the nodes in the SOM. It is typically found using the Euclidean distance formula:

d(v, w) = √( Σ_{i=0}^{k−1} (v_i − w_i)² )

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.


Neighborhood

• The area of the neighborhood shrinks over time. The exponential decay function can be used for this:

σ(t) = σ₀ · e^(−t/λ),  with λ = n / ln(σ₀)

where n is the number of iterations that the algorithm will run and σ₀ is the initial size of the neighborhood.


Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 11: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

High Contribution to Closest Cluster

Low Contribution to Other Cluster

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Clusters adjust

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Weak contributors removed

Bayesian Clustering

bull Initialize Clusters

bull Until Convergencendash Expectation Compute new probabilities

ndash Maximization Compute new clusters

bull Use BIC to prune clusters

bull Compute global BIC

bull Repeat with different initial clustersndash Keep results with best BIC

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Normal Distribution

πσ

σ

2)(

220 2)( xxexN

minusminus

=

n

xxexN)2(||

)(2)()( 1

π

μμ

Σ=

minusΣminusminus minus

nk

ddtk

tki

tkik

tkiedP

)2(||)|(

2)()( 1

πμμ

μμ

Σ=isin

minusΣminusminus minus

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Expectation Non‐spherical Clusters

Tk

n

kk DD

⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢

λ

λλ

00

0000

2

1

bull Dk=Eigenvectors of Σ are the orientation of cluster k

bull λ1 λn=Eigenvalues of Σ are the radii of cluster k

bull Spherical clusters are usually sufficient (Identity Matrix)

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Compute initial probabilities

bull Given the initial clusters initialize the probabilities with the Normal Distribution

bull Since the probabilities will have to be normalized anyway we can skip the constant

2)()(00 00

)|( kiki ddkki edP μμμμ minussdotminusminus=isin

Expectation Bayersquos law

)()()|()|(

EffectPCausePCauseEffectPEffectCauseP sdot

=

Posterior Prior

sum sdotsdot

=)()|(

)()|()|(CausePCauseEffectP

CausePCauseEffectPEffectCauseP

Expectation Bayersquos law

sum sum

sum

= =

minusminus

=

minusminus

isinsdotisin

isinsdotisin=isin C

j

n

i

tji

tj

tj

tji

n

i

tki

tk

tk

tki

tki

tk

dPdP

dPdPdP

1 1

11

1

11

)|()|(

)|()|()|(

μμμμ

μμμμμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Recompute membership probabilities using normal

distribution

New Probability Old Probabilities

Maximization New Clusters

bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster

sum

sum

=

minusminus

=

minusminus

isin

sdotisin= n

i

tki

tk

n

ii

tki

tk

tk

dP

ddP

1

11

1

11

)|(

)|(

μμ

μμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Bayesian Information Criterion

bull BIC measures the efficiency of the parameterized model in terms of predicting the data

bull It is independent to prior knowledge and the model used

)ln()|(

ln 1

2

nkn

dPnBIC

n

i

tk

tki

tk sdot+

⎟⎟⎟⎟

⎜⎜⎜⎜

⎛isin

sdot=sum=

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

Fit quality Complexity penalty

Bayesian Information Criterion

bull After convergence compute global BIC

)ln()|(

ln 1 1

2

nknk

dPnBIC

n

i

k

jjji

sdot+

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛isin

sdot=sumsum= =

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

GCSF

GMCSF

IL12B

IL1RN

IL6

IL6

PBEF

proIL1B

TNFA

IL8

IL8

IP10

MCP1

MGSA

MIP1A

MIP1B

MIP2A

MIP2B

RANTES

CD44

CD44

ICAM1

IFITM1

LAMB3

NINJ1

TNFAIP6

ADORA2A

CCR6

CCR7

CCRL2

DTR

1

2

37

11

6

4

10

5

8 12

9

13

14

15

17

16

18

19

20

22

21

23

24

25

26

27

28

29

30

Initial ClustersE coliTest Data

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR

1 Hour

12 H

ours

E-Coli Initial Clusters (Test Data)

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9

E-Coli Final Clusters (Test Data)

1 Hour

12 H

ours

EHEC

-6

-4

-2

0

2

4

6

8

-6 -4 -2 0 2 4 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters

S Aureus

-6

-4

-2

0

2

4

6

8

-5 -4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

oura

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters

Ecoli

-4

-3

-2

-1

0

1

2

3

4

5

6

-2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters

S Typhirium

-4

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters

Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a

random map to be like the input values Thus similar inputs will be mapped close to each other

nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)

learning_algorith(n V)SOM larr random valuesfor j larr 1 to n

v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment

end

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

• The area of the neighborhood shrinks over time; the exponential decay function can be used for this:

σ(t) = σ₀ · e^{−t/λ},  with λ = n / ln(σ₀)

where n is the number of iterations that the algorithm will run and σ₀ is the initial size of the neighborhood.


Weight Adjustment

• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node:

W(t+1) = W(t) + Θ(t) · L(t) · (v(t) − W(t))

• One decay function, L(t), decreases the learning rate over time

• The other, Θ(t), decreases the rate of learning for neighbors further away from the BMU


Mapping Mode (SOM)

• After the learning algorithm has been run on the map, the input is mapped onto the SOM, and similar n‐dimensional inputs will be near each other on the two‐dimensional map
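Mapping mode can be sketched as follows, assuming a trained weight grid like the one produced by a SOM training loop (the function name and test grid are illustrative):

```python
def map_inputs(weights, data):
    """Mapping mode: assign each input vector to the grid position of its BMU.

    `weights` is a 2-D grid of weight vectors; `data` is a list of input vectors.
    Returns a list of (row, col) grid positions, one per input.
    """
    positions = []
    for v in data:
        # BMU: grid node whose weight vector minimizes squared Euclidean distance
        best = min(((r, c) for r in range(len(weights)) for c in range(len(weights[0]))),
                   key=lambda rc: sum((a - b) ** 2
                                      for a, b in zip(weights[rc[0]][rc[1]], v)))
        positions.append(best)
    return positions
```

Inputs close to each other in the high-dimensional space land on the same or adjacent grid nodes.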

GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern

• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

• Use online or download

GenePattern
Module name: SOMClustering
Description: Self‐Organizing Maps algorithm
Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu
Date: 10/28/03
Release: 1.0
Summary:

The Self-Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k‐dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, i.e., stocks, mutual funds, spectral peaks, etc.)

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged input data into 9 clusters using the GenePattern software

• Why 9 clusters?
– Used the average of Stephen's output (8) + 1 to use a square map (3×3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize Graph G = (V, E)
– Vertex v: a single gene "fingerprint" vector

– Edge e: pairwise similarity between genes

– Cut C: subset of E that partitions the graph

– Cluster c: subset of V

– Intersection of clusters c_i, c_j (i ≠ j) is Ø

– Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.

CLICK Preprocessing

• Input: n x p matrix M of values

• n: Genes, p: Tests

• Data must be normalized

• Similarity measure
– S_vu = v · u = |v| |u| cos θ
– Proportional only when the norm is fixed for all v ∈ V

CLICK Similarity

• Key idea: S is a normalized mixed distribution

For u, v ∈ V in the same cluster: S_uv ~ f_T(x | μ_T, σ_T²)  (pdf for elements in the same cluster, with mean μ_T and variance σ_T²)

For u, v ∈ V in different clusters: S_uv ~ f_F(x | μ_F, σ_F²)  (pdf for elements in different clusters, with mean μ_F and variance σ_F²)

Basic CLICK

• M: n x p matrix (genes vs test conditions)

• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates

f(S_ij | mates) = 1/(√(2π) σ_T) · e^{−(S_ij − μ_T)² / (2σ_T²)}

w_ij = ln( p_mates · f(S_ij | mates) / ((1 − p_mates) · f(S_ij | non-mates)) )

     = ln( p_mates · σ_F / ((1 − p_mates) · σ_T) ) + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
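As a sketch, the mate-probability weight w_ij can be computed as the log-odds of two Gaussian similarity densities; the parameter names (means μ_T, μ_F, standard deviations σ_T, σ_F, and prior p_mates) are illustrative:

```python
import math

def edge_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """Log-odds that vertices i, j are mates, given their similarity s_ij.

    Ratio of two Gaussian densities times the prior odds:
    w_ij = ln( p * f_T(s) / ((1 - p) * f_F(s)) ).
    """
    f_t = (math.exp(-(s_ij - mu_t) ** 2 / (2 * sigma_t ** 2))
           / (sigma_t * math.sqrt(2 * math.pi)))
    f_f = (math.exp(-(s_ij - mu_f) ** 2 / (2 * sigma_f ** 2))
           / (sigma_f * math.sqrt(2 * math.pi)))
    return math.log(p_mates * f_t / ((1 - p_mates) * f_f))
```

A similarity near the mate mean yields a positive weight; one near the non-mate mean yields a negative weight.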

Basic CLICK

R: Singleton Set

Basic_CLICK(Graph G):
    If V(G) = {v} then        // v is a singleton
        R.add(v)
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)

CLICK Kernel

• Decision problem: is V…
– a singleton (|V| = 1)

– a subset of 2+ clusters (need to partition more)

– a subset of a single cluster (kernel)

CLICK Kernel

• For all possible cuts C connecting V:
– H₀^C: Cut C disconnects two clusters

– H₁^C: Cut C partitions a kernel

– If H₀^C > H₁^C for any C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into sub‐graphs H, K

W(C) = ln( Pr(H₁^C | C) / Pr(H₀^C | C) ) = Σ_{(i,j) ∈ C} w_ij

Minimal Weight Cut

• Choose one vertex as source, mark it visited

• Mark each next most highly connected vertex

• The last node t represents the cut
– W(C_t) = Σ_i w_it

– Merge t with the 2nd-last marked node s

– Remove t from V

– Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
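The procedure described above is the Stoer-Wagner algorithm; a compact Python sketch (the function name and adjacency-matrix representation are assumptions):

```python
def min_weight_cut(w):
    """Stoer-Wagner global minimum cut.

    `w` is a symmetric n x n weight matrix (w[i][j] = edge weight, 0 if absent).
    Returns (best_cut_weight, original_vertices_on_one_side).
    """
    n = len(w)
    w = [row[:] for row in w]                  # work on a copy; rows get merged
    groups = [[i] for i in range(n)]           # original vertices merged into each node
    active = list(range(n))
    best = (float("inf"), [])
    while len(active) > 1:
        # one minimum-cut phase: repeatedly add the most tightly connected vertex
        a = [active[0]]
        rest = active[1:]
        while rest:
            t = max(rest, key=lambda v: sum(w[v][u] for u in a))
            a.append(t)
            rest.remove(t)
        s, t = a[-2], a[-1]                    # last two added vertices
        cut_of_phase = sum(w[t][u] for u in a[:-1])   # W(C_t) = sum of w[i][t]
        if cut_of_phase < best[0]:
            best = (cut_of_phase, groups[t][:])
        for v in active:                       # merge t into s, then drop t
            w[s][v] += w[t][v]
            w[v][s] += w[v][t]
        groups[s].extend(groups[t])
        active.remove(t)
    return best
```

On two heavy triangles joined by one light edge, the algorithm finds the light edge as the minimum cut.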

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

• Basic_CLICK produces kernels – not full clusters

• Expand kernels by adding the closest singletons

• Merge kernels with the closest similarity
– In both situations, only merge/adopt over some threshold

Full CLICK

R: Singleton Set

Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results amp Analysis

• Two metrics:
– Similarity between result clusters

– Similarity of clusters with the paper's results

• Map clusters to find regions of similar data

• Compare best clusters to find relationships

[Scatter plots: cluster assignments at 1 Hour vs 6 Hours for each stimulus – BCG, E. coli, EHEC, L. monocytogenes, Latex, M. tuberculosis, S. aureus, S. Typhi, S. Typhimurium – comparing the Bayesian, CLICK, and SOM clusterings]

Error Metric

• Scores higher for clusters that either contain very many or very few clusters in the truth set

• Scales to the number of data points in each cluster

• Low error implies: truth clusters were highly correlated

• High error implies: truth clusters were well distributed

Error: Bayesian 4.33, CLICK 2.44, SOM 1.822

[Error metric E: a sum over clusters i = 1…k of a term in the overlap |T_i ∩ C_i| between truth clusters T_i and computed clusters C_i, weighted by cluster size |C_i|]

[Results table: per-cluster Percent of Our Clusters, Percent of Paper, and Cluster Quality for the Bayesian, CLICK, and SOM methods (clusters 0–34), plus Min Cluster Composition at thresholds from 0 to 100%. Overall error: Bayesian 4.33, CLICK 2.44, SOM 1.822]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption & Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64

Bayesian Clustering

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.

Clusters adjust

Bayesian Clustering


Weak contributors removed

Bayesian Clustering

• Initialize Clusters

• Until Convergence
– Expectation: Compute new probabilities

– Maximization: Compute new clusters

• Use BIC to prune clusters

• Compute global BIC

• Repeat with different initial clusters
– Keep results with best BIC

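The Expectation/Maximization loop above can be sketched for one-dimensional data with unit-variance clusters. This is a simplified illustration with equal cluster priors; the function name and data are assumptions:

```python
import math

def em_cluster(data, mus, n_iters=50):
    """Simplified EM: 1-D points, unit-variance clusters with means `mus`.

    E-step: membership probabilities from the (unnormalized) normal density;
    M-step: each mean becomes the probability-weighted average of the points.
    """
    for _ in range(n_iters):
        # Expectation: p[i][k] proportional to exp(-(d_i - mu_k)^2 / 2), normalized over k
        p = []
        for d in data:
            dens = [math.exp(-((d - mu) ** 2) / 2) for mu in mus]
            z = sum(dens)
            p.append([x / z for x in dens])
        # Maximization: mu_k = sum_i p[i][k] * d_i / sum_i p[i][k]
        mus = [sum(p[i][k] * data[i] for i in range(len(data))) /
               sum(p[i][k] for i in range(len(data)))
               for k in range(len(mus))]
    return mus
```

On two well-separated groups of points, the two means converge to the group centers.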

Normal Distribution

N(x) = 1/(σ√(2π)) · e^{−(x − x₀)² / (2σ²)}

N(x) = 1/√(|Σ| (2π)ⁿ) · e^{−(x − μ)ᵀ Σ⁻¹ (x − μ) / 2}

P(d_i ∈ μ_k^t) = 1/√(|Σ_k| (2π)ⁿ) · e^{−(d_i − μ_k^t)ᵀ Σ_k⁻¹ (d_i − μ_k^t) / 2}

Expectation Non‐spherical Clusters

Σ_k = D_k · diag(λ₁, λ₂, …, λ_n) · D_kᵀ

• D_k = eigenvectors of Σ: the orientation of cluster k

• λ₁, …, λ_n = eigenvalues of Σ: the radii of cluster k

• Spherical clusters are usually sufficient (identity matrix)


Compute initial probabilities

• Given the initial clusters, initialize the probabilities with the Normal Distribution

• Since the probabilities will have to be normalized anyway, we can skip the constant:

P(d_i ∈ μ_k⁰) = e^{−(d_i − μ_k⁰) · (d_i − μ_k⁰) / 2}

Expectation: Bayes' law

P(Cause | Effect) = P(Effect | Cause) · P(Cause) / P(Effect)

(left side: posterior; P(Cause): prior)

P(Cause | Effect) = P(Effect | Cause) · P(Cause) / Σ P(Effect | Cause) · P(Cause)
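Bayes' law over a discrete set of causes can be sketched as (an illustrative helper, not part of the project code):

```python
def posterior(likelihoods, priors):
    """Bayes' law: P(cause_k | effect) = P(effect | cause_k) * P(cause_k)
    divided by the sum of the same product over all causes."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    z = sum(joint)                       # the normalizing denominator P(effect)
    return [j / z for j in joint]
```

With equal priors, the posterior simply renormalizes the likelihoods.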

Expectation: Bayes' law

P(d_i ∈ μ_k^t) = [ P(d_i ∈ μ_k^{t−1}) · Σ_{i=1}^{n} P(d_i ∈ μ_k^{t−1}) ] / Σ_{j=1}^{C} [ P(d_i ∈ μ_j^{t−1}) · Σ_{i=1}^{n} P(d_i ∈ μ_j^{t−1}) ]

Recompute the membership probabilities using the normal distribution: the new probability (left-hand side) is computed from the old probabilities (right-hand side).

Maximization New Clusters

• Clusters are computed as the weighted average of each point and the probability it belongs to the cluster:

μ_k^t = Σ_{i=1}^{n} P(d_i ∈ μ_k^{t−1}) · d_i / Σ_{i=1}^{n} P(d_i ∈ μ_k^{t−1})

Bayesian Information Criterion

• BIC measures the efficiency of the parameterized model in terms of predicting the data

• It is independent of prior knowledge and the model used

BIC = 2 · n · ln( (1/n) · Σ_{i=1}^{n} P(d_i ∈ μ_k^t) ) + k · ln(n)

(first term: fit quality; second term: complexity penalty)

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion

Bayesian Information Criterion

• After convergence, compute the global BIC:

BIC = 2 · n · ln( (1/n) · Σ_{i=1}^{n} Σ_{j=1}^{k} P(d_i ∈ μ_j) ) + k · ln(n)
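As a sketch, the standard form of the criterion from [33] (BIC = k · ln n − 2 · ln L̂, where lower is better; the function name is an assumption):

```python
import math

def bic(log_likelihood, k, n):
    """Standard BIC: complexity penalty k*ln(n) minus 2*ln(maximized likelihood).

    k: number of free parameters, n: number of data points. Lower is better.
    """
    return k * math.log(n) - 2.0 * log_likelihood
```

With equal fit, the model with fewer parameters scores lower (better); with equal complexity, the better-fitting model scores lower.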

[Scatter plots: E. coli test data, Hour 1 vs Hour 12 – initial clusters and final clusters (Clusters 1–9); points labeled by gene: GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR]

[Scatter plots: cluster assignments at 1 Hour vs 6 Hours for EHEC, S. aureus, E. coli, and S. Typhimurium, comparing rational and discrete clusters]

Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a

random map to be like the input values Thus similar inputs will be mapped close to each other

nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)

learning_algorith(n V)SOM larr random valuesfor j larr 1 to n

v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment

end

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 13: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Bayesian Clustering

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Weak contributors removed

Bayesian Clustering

bull Initialize Clusters

bull Until Convergencendash Expectation Compute new probabilities

ndash Maximization Compute new clusters

bull Use BIC to prune clusters

bull Compute global BIC

bull Repeat with different initial clustersndash Keep results with best BIC

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Normal Distribution

πσ

σ

2)(

220 2)( xxexN

minusminus

=

n

xxexN)2(||

)(2)()( 1

π

μμ

Σ=

minusΣminusminus minus

nk

ddtk

tki

tkik

tkiedP

)2(||)|(

2)()( 1

πμμ

μμ

Σ=isin

minusΣminusminus minus

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Expectation Non‐spherical Clusters

Tk

n

kk DD

⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢

λ

λλ

00

0000

2

1

bull Dk=Eigenvectors of Σ are the orientation of cluster k

bull λ1 λn=Eigenvalues of Σ are the radii of cluster k

bull Spherical clusters are usually sufficient (Identity Matrix)

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Compute initial probabilities

bull Given the initial clusters initialize the probabilities with the Normal Distribution

bull Since the probabilities will have to be normalized anyway we can skip the constant

2)()(00 00

)|( kiki ddkki edP μμμμ minussdotminusminus=isin

Expectation Bayersquos law

)()()|()|(

EffectPCausePCauseEffectPEffectCauseP sdot

=

Posterior Prior

sum sdotsdot

=)()|(

)()|()|(CausePCauseEffectP

CausePCauseEffectPEffectCauseP

Expectation Bayersquos law

sum sum

sum

= =

minusminus

=

minusminus

isinsdotisin

isinsdotisin=isin C

j

n

i

tji

tj

tj

tji

n

i

tki

tk

tk

tki

tki

tk

dPdP

dPdPdP

1 1

11

1

11

)|()|(

)|()|()|(

μμμμ

μμμμμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Recompute membership probabilities using normal

distribution

New Probability Old Probabilities

Maximization New Clusters

bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster

sum

sum

=

minusminus

=

minusminus

isin

sdotisin= n

i

tki

tk

n

ii

tki

tk

tk

dP

ddP

1

11

1

11

)|(

)|(

μμ

μμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Bayesian Information Criterion

bull BIC measures the efficiency of the parameterized model in terms of predicting the data

bull It is independent to prior knowledge and the model used

)ln()|(

ln 1

2

nkn

dPnBIC

n

i

tk

tki

tk sdot+

⎟⎟⎟⎟

⎜⎜⎜⎜

⎛isin

sdot=sum=

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

Fit quality Complexity penalty

Bayesian Information Criterion

bull After convergence compute global BIC

)ln()|(

ln 1 1

2

nknk

dPnBIC

n

i

k

jjji

sdot+

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛isin

sdot=sumsum= =

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

GCSF

GMCSF

IL12B

IL1RN

IL6

IL6

PBEF

proIL1B

TNFA

IL8

IL8

IP10

MCP1

MGSA

MIP1A

MIP1B

MIP2A

MIP2B

RANTES

CD44

CD44

ICAM1

IFITM1

LAMB3

NINJ1

TNFAIP6

ADORA2A

CCR6

CCR7

CCRL2

DTR

1

2

37

11

6

4

10

5

8 12

9

13

14

15

17

16

18

19

20

22

21

23

24

25

26

27

28

29

30

Initial ClustersE coliTest Data

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR

1 Hour

12 H

ours

E-Coli Initial Clusters (Test Data)

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9

E-Coli Final Clusters (Test Data)

1 Hour

12 H

ours

[Scatter plots: Bayesian clustering results, hour-1 vs. hour-6 expression, for EHEC (Clusters 0–7 and 9), S. aureus (Clusters 0–7), E. coli (Clusters 0–7), and S. Typhimurium (Clusters 0–7, 10, and 12), each marking rational vs. discrete clusters.]

Self-Organizing Maps

• SOMs are based on neural networks: a learning algorithm trains a randomly initialized map to resemble the input values, so similar inputs are mapped close to each other

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← min(d(v, SOM)) over every node in SOM
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
    end

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79–96.
• Kohonen's Self-Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html

Best Matching Unit

• The BMU is the node of the SOM whose weight vector is closest to the training vector. It is typically found using the Euclidean distance:

d(v, w) = √( Σ_{i=1}^{k} (v_i − w_i)² )

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM


Neighborhood

• The area of the neighborhood shrinks over time; the exponential decay function can be used for this:

σ(t) = σ₀ · e^(−t/λ),  with λ = n / ln(σ₀)

where n is the number of iterations the algorithm will run and σ₀ is the initial size of the neighborhood


Weight Adjustment

• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node: W(t+1) = W(t) + Θ(t) · L(t) · (V(t) − W(t))

• One decay function, L(t), decreases the learning rate over time

• The other, Θ(t), decreases the rate of learning for neighbors further away from the BMU

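The four ingredients just described (the training loop, the BMU search, the shrinking neighborhood, and the decaying weight adjustment) fit together as in this minimal sketch. The grid size, iteration count, learning rate, and decay constants are illustrative assumptions, not values from the slides:

```python
import math
import random

def train_som(data, rows=3, cols=3, iters=200, lr0=0.5, seed=0):
    """Train a rows x cols SOM on `data` (a list of equal-length vectors)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Grid position -> weight vector, randomly initialized.
    nodes = {(r, c): [rng.random() for _ in range(dim)]
             for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0          # initial neighborhood radius
    tau = iters / math.log(sigma0)          # time constant for the decay
    for t in range(iters):
        v = rng.choice(data)
        # Best Matching Unit: the node minimizing Euclidean distance to v.
        bmu = min(nodes, key=lambda p: sum((w - x) ** 2
                                           for w, x in zip(nodes[p], v)))
        sigma = sigma0 * math.exp(-t / tau)  # neighborhood shrinks over time
        lr = lr0 * math.exp(-t / iters)      # learning rate decays over time
        for p, w in nodes.items():
            grid_d2 = (p[0] - bmu[0]) ** 2 + (p[1] - bmu[1]) ** 2
            # Farther neighbors of the BMU learn less.
            theta = math.exp(-grid_d2 / (2 * sigma * sigma))
            for i in range(dim):
                w[i] += theta * lr * (v[i] - w[i])
    return nodes

def map_input(nodes, v):
    """Mapping mode: the grid coordinate of the BMU for input v."""
    return min(nodes, key=lambda p: sum((w - x) ** 2
                                        for w, x in zip(nodes[p], v)))
```

After training, `map_input` places each n-dimensional input on the two-dimensional grid, with similar inputs landing near each other.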

Mapping Mode (SOM)

• After the learning algorithm has been run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs will be near each other on the two-dimensional map

GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern

• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

• Use online or download

GenePattern
Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0. Summary:

The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped into the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. Resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, i.e., stocks, mutual funds, spectral peaks, etc.)

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged the input data into 9 clusters using the GenePattern software

• Why 9 clusters?
– Used the average of Stephen's output (8) + 1, so that a square map (3x3 SOM) could be used

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize Graph G = (V, E)
– Vertex v: a single gene's "fingerprint" vector
– Edge e: pairwise similarity between genes
– Cut C: subset of E that partitions the graph
– Cluster c: subset of V
– Intersection of clusters c_i, c_j (i ≠ j) is ∅
– Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.

CLICK Preprocessing

• Input: n x p matrix M of values

• n genes, p tests

• Data must be normalized

• Similarity measure:
– S_vu = v · u = |v||u| cos θ
– Proportional to cos θ only when the norm is fixed for all v ∈ V
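The fingerprint similarity and the norm-fixing it depends on can be sketched in a few lines (the helper names are illustrative):

```python
import math

def similarity(u, v):
    """S_vu = v . u = |v||u| cos(theta)."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Fix every fingerprint to unit norm so dot products are comparable."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]
```

After normalization the dot product of two fingerprints is exactly cos θ, which is why the norm must be fixed for the similarity to be proportional to the angle.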

CLICK Similarity

• Key idea: S follows a normalized mixture distribution

For u, v ∈ V in the same cluster: S_uv has pdf f(x | μ_T, σ_T²)

For u, v ∈ V in different clusters: S_uv has pdf f(x | μ_F, σ_F²)

Basic CLICK

• M: n x p matrix (genes vs. test conditions)

• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates

f(S_ij | μ, σ²) = (1 / (√(2π) · σ)) · e^(−(S_ij − μ)² / (2σ²))

w_ij = ln( p_mates · f(S_ij | μ_T, σ_T) / ((1 − p_mates) · f(S_ij | μ_F, σ_F)) )

     = ln( p_mates · σ_F / ((1 − p_mates) · σ_T) ) + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
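The edge weight w_ij is the log-ratio of the "mates" and "non-mates" hypotheses. A sketch of the computation, with the mixture parameters passed in as illustrative assumptions:

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density f(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def click_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """w_ij = ln[ p * f(S|mu_T,sigma_T) / ((1 - p) * f(S|mu_F,sigma_F)) ]."""
    return math.log((p_mates * normal_pdf(s_ij, mu_t, sigma_t)) /
                    ((1 - p_mates) * normal_pdf(s_ij, mu_f, sigma_f)))
```

A positive weight means the similarity is better explained by the same-cluster distribution; a negative weight favors different clusters.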

Basic CLICK
R: Singleton Set
Basic_CLICK(Graph G):
    If V(G) = {v} then R.add(v)        // v is a singleton
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)

CLICK Kernel

• Decision problem: is V…
– a singleton (|V| = 1)?
– a subset of 2+ clusters (need to partition further)?
– a subset of a single cluster (a kernel)?

CLICK Kernel

• For all possible cuts C connecting V:
– H0^C: Cut C disconnects two clusters
– H1^C: Cut C partitions a kernel
– If H0^C > H1^C for any C, then V is not a kernel

• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

W(C) = ln( Pr(C | H_1^C) / Pr(C | H_0^C) ) = Σ_{(i,j) ∈ C} w_ij

)|Pr()|Pr(ln)(

Minimal Weight Cut

• Choose one vertex as the source and mark it visited

• Repeatedly mark the next most highly connected vertex

• The last node t represents the cut
– W(C_t) = Σ w_it
– Merge t with the 2nd-to-last marked node s
– Remove t from V
– Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
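The contraction procedure from the Stoer–Wagner paper cited above can be sketched as follows; the dict-of-dicts graph representation is an implementation assumption:

```python
def min_weight_cut(weights):
    """Stoer-Wagner global minimum cut on an undirected weighted graph.
    `weights` is a symmetric dict-of-dicts (weights[u][v] = edge weight).
    Returns (cut_weight, vertices_on_one_side)."""
    w = {u: dict(nbrs) for u, nbrs in weights.items()}
    members = {u: {u} for u in w}          # merged node -> original vertices
    best = (float("inf"), set())
    while len(w) > 1:
        # Minimum-cut phase: starting from an arbitrary vertex, repeatedly
        # add the most tightly connected remaining vertex.
        a = next(iter(w))
        order = [a]
        conn = dict(w[a])                  # connectivity of outside vertices
        while len(order) < len(w):
            t = max((v for v in w if v not in order),
                    key=lambda v: conn.get(v, 0))
            order.append(t)
            for v, wt in w[t].items():
                if v not in order:
                    conn[v] = conn.get(v, 0) + wt
        s, t = order[-2], order[-1]
        cut_of_phase = sum(w[t].values())  # the cut separating t from the rest
        if cut_of_phase < best[0]:
            best = (cut_of_phase, set(members[t]))
        # Merge t into the second-to-last node s, then delete t.
        for v, wt in w[t].items():
            if v != s:
                w[s][v] = w[s].get(v, 0) + wt
                w[v][s] = w[v].get(s, 0) + wt
        for v in w[t]:
            w[v].pop(t, None)
        members[s] |= members.pop(t)
        del w[t]
    return best
```

The best cut-of-the-phase over all contractions is the global minimum cut, which is exactly what Basic_CLICK uses to split a non-kernel graph into H and K.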

MinWeightCut Example

MinWeightCut

CLICK Adoption & Merging

• Basic_CLICK outputs kernels – not full clusters

• Expand kernels by adopting the closest singletons

• Merge kernels with the closest similarity
– In both situations, only merge/adopt above some threshold

Full CLICK
R: Singleton Set
Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results & Analysis

• Two metrics:
– Similarity between result clusters
– Similarity of clusters with the paper's results

• Map clusters to find regions of similar data

• Compare the best clusters to find relationships

[Scatter plots: cluster assignments from the Bayesian, CLICK, and SOM methods, hour-1 vs. hour-6 expression, for BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. Typhi, and S. Typhimurium.]

Error Metric

• Scores higher for clusters that contain either very many or very few of the clusters in the truth set

• Scales with the number of data points in each cluster

• Low error implies the truth clusters were highly correlated with ours

• High error implies the truth clusters were distributed widely across ours

Error: Bayesian 4.33, Click 2.44, SOM 18.22

E = ( Σ_{i=1}^{k} (1 − 2 · | |T_i ∩ C_i| / |T_i| − 1/2 |) · |C_i| ) / ( Σ_{i=1}^{k} |C_i| )
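A sketch of an overlap-based error of this kind; the exact normalization and the pairing of truth clusters T_i with result clusters C_i are assumptions:

```python
def cluster_error(truth, result):
    """Overlap-based error: near-empty or near-complete overlap with a
    truth cluster scores low; a 50/50 split scores high. Clusters are
    paired by index and weighted by result-cluster size."""
    num = 0.0
    den = 0.0
    for t, c in zip(truth, result):
        frac = len(t & c) / len(t)   # fraction of the truth cluster captured
        num += (1.0 - 2.0 * abs(frac - 0.5)) * len(c)
        den += len(c)
    return num / den
```

A perfect match gives error 0, while a result cluster capturing exactly half of its truth cluster contributes the maximum penalty of 1.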

[Table: per-method comparison over result clusters 0–34, with column groups "Percent of Our Clusters", "Percent of Paper", "Cluster Quality", and "Min Cluster Composition", each reported for Bayesian, Click, and SOM; summary rows give Error values of roughly 4.33 (Bayesian), 2.44 (Click), and 18.22 (SOM).]


Bayesian Clustering

• Initialize clusters

• Until convergence:
– Expectation: compute new probabilities
– Maximization: compute new clusters

• Use BIC to prune clusters

• Compute the global BIC

• Repeat with different initial clusters
– Keep the results with the best BIC

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis.

Normal Distribution

N(x) = (1 / (σ · √(2π))) · e^(−(x − x₀)² / (2σ²))

For n dimensions, with mean μ and covariance Σ:

N(x) = (1 / ((2π)^(n/2) · |Σ|^(1/2))) · e^(−(1/2) · (x − μ)ᵀ Σ⁻¹ (x − μ))

P(d_i ∈ μ_k^t | μ_k^t) = (1 / ((2π)^(n/2) · |Σ_k|^(1/2))) · e^(−(1/2) · (d_i − μ_k^t)ᵀ Σ_k⁻¹ (d_i − μ_k^t))

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis.
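The multivariate density can be transcribed directly; this sketch restricts the covariance to a diagonal matrix to keep it short (the function name is illustrative):

```python
import math

def mvnormal_pdf(x, mu, var):
    """Multivariate normal density with diagonal covariance.
    `var` lists the per-dimension variances (the diagonal of Sigma)."""
    n = len(x)
    det = 1.0    # determinant of the diagonal covariance
    quad = 0.0   # quadratic form (x - mu)^T Sigma^-1 (x - mu)
    for xi, mi, vi in zip(x, mu, var):
        det *= vi
        quad += (xi - mi) ** 2 / vi
    return math.exp(-quad / 2.0) / math.sqrt(((2 * math.pi) ** n) * det)
```

With all variances equal to 1 this reduces to the spherical-cluster case the slides describe as usually sufficient.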

Expectation: Non-spherical Clusters

Σ_k = D_k · diag(λ₁, λ₂, …, λ_n) · D_kᵀ

• D_k = eigenvectors of Σ_k give the orientation of cluster k

• λ₁, …, λ_n = eigenvalues of Σ_k give the radii of cluster k

• Spherical clusters are usually sufficient (Σ_k = identity matrix)

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis.

Compute initial probabilities

• Given the initial clusters, initialize the probabilities with the normal distribution

• Since the probabilities will have to be normalized anyway, we can skip the constant factor:

P(d_i ∈ μ_k^0 | μ_k^0) = e^(−(d_i − μ_k^0) · (d_i − μ_k^0) / 2)

Expectation: Bayes' law

P(Cause | Effect) = P(Effect | Cause) · P(Cause) / P(Effect)

(P(Cause | Effect) is the posterior; P(Cause) is the prior.)

Expanding P(Effect) over all causes:

P(Cause | Effect) = P(Effect | Cause) · P(Cause) / Σ P(Effect | Cause) · P(Cause)

Expectation: Bayes' law

P(d_i ∈ μ_k^t) = [ P(d_i ∈ μ_k^t | μ_k^t) · Σ_{i′=1}^{n} P(d_{i′} ∈ μ_k^{t−1} | μ_k^{t−1}) ] / Σ_{j=1}^{C} [ P(d_i ∈ μ_j^t | μ_j^t) · Σ_{i′=1}^{n} P(d_{i′} ∈ μ_j^{t−1} | μ_j^{t−1}) ]

Recompute the membership probabilities using the normal distribution: the new probability (left) is computed from the old probabilities of iteration t−1 (right).

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis.

Maximization: New Clusters

• Clusters are computed as the weighted average of each point and the probability that it belongs to the cluster:

μ_k^t = Σ_{i=1}^{n} P(d_i ∈ μ_k^{t−1} | μ_k^{t−1}) · d_i / Σ_{i=1}^{n} P(d_i ∈ μ_k^{t−1} | μ_k^{t−1})

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis.
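One Expectation/Maximization round from the two slides above can be sketched as follows, assuming spherical unit-variance clusters and explicit cluster priors (the function and variable names are illustrative):

```python
import math

def e_step(data, centers, priors):
    """Expectation: membership probabilities P(d_i in mu_k) via Bayes' law."""
    probs = []
    for d in data:
        lik = [p * math.exp(-sum((a - b) ** 2 for a, b in zip(d, mu)) / 2.0)
               for mu, p in zip(centers, priors)]
        z = sum(lik)
        probs.append([l / z for l in lik])   # normalize over clusters
    return probs

def m_step(data, probs, k):
    """Maximization: new centers as the probability-weighted average of points."""
    dim = len(data[0])
    centers = []
    for j in range(k):
        wsum = sum(p[j] for p in probs)
        centers.append([sum(p[j] * d[i] for p, d in zip(probs, data)) / wsum
                        for i in range(dim)])
    priors = [sum(p[j] for p in probs) / len(data) for j in range(k)]
    return centers, priors
```

Iterating `e_step` and `m_step` until the centers stop moving, then scoring the result with BIC, follows the Bayesian clustering outline on the earlier slide.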

Bayesian Information Criterion

bull BIC measures the efficiency of the parameterized model in terms of predicting the data

bull It is independent to prior knowledge and the model used

)ln()|(

ln 1

2

nkn

dPnBIC

n

i

tk

tki

tk sdot+

⎟⎟⎟⎟

⎜⎜⎜⎜

⎛isin

sdot=sum=

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

Fit quality Complexity penalty

Bayesian Information Criterion

bull After convergence compute global BIC

)ln()|(

ln 1 1

2

nknk

dPnBIC

n

i

k

jjji

sdot+

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛isin

sdot=sumsum= =

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

GCSF

GMCSF

IL12B

IL1RN

IL6

IL6

PBEF

proIL1B

TNFA

IL8

IL8

IP10

MCP1

MGSA

MIP1A

MIP1B

MIP2A

MIP2B

RANTES

CD44

CD44

ICAM1

IFITM1

LAMB3

NINJ1

TNFAIP6

ADORA2A

CCR6

CCR7

CCRL2

DTR

1

2

37

11

6

4

10

5

8 12

9

13

14

15

17

16

18

19

20

22

21

23

24

25

26

27

28

29

30

Initial ClustersE coliTest Data

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR

1 Hour

12 H

ours

E-Coli Initial Clusters (Test Data)

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9

E-Coli Final Clusters (Test Data)

1 Hour

12 H

ours

EHEC

-6

-4

-2

0

2

4

6

8

-6 -4 -2 0 2 4 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters

S Aureus

-6

-4

-2

0

2

4

6

8

-5 -4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

oura

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters

Ecoli

-4

-3

-2

-1

0

1

2

3

4

5

6

-2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters

S Typhirium

-4

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters

Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a

random map to be like the input values Thus similar inputs will be mapped close to each other

nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)

learning_algorith(n V)SOM larr random valuesfor j larr 1 to n

v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment

end

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 15: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Normal Distribution

πσ

σ

2)(

220 2)( xxexN

minusminus

=

n

xxexN)2(||

)(2)()( 1

π

μμ

Σ=

minusΣminusminus minus

nk

ddtk

tki

tkik

tkiedP

)2(||)|(

2)()( 1

πμμ

μμ

Σ=isin

minusΣminusminus minus

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Expectation Non‐spherical Clusters

Tk

n

kk DD

⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢

λ

λλ

00

0000

2

1

bull Dk=Eigenvectors of Σ are the orientation of cluster k

bull λ1 λn=Eigenvalues of Σ are the radii of cluster k

bull Spherical clusters are usually sufficient (Identity Matrix)

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Compute initial probabilities

bull Given the initial clusters initialize the probabilities with the Normal Distribution

bull Since the probabilities will have to be normalized anyway we can skip the constant

2)()(00 00

)|( kiki ddkki edP μμμμ minussdotminusminus=isin

Expectation Bayersquos law

)()()|()|(

EffectPCausePCauseEffectPEffectCauseP sdot

=

Posterior Prior

sum sdotsdot

=)()|(

)()|()|(CausePCauseEffectP

CausePCauseEffectPEffectCauseP

Expectation Bayersquos law

sum sum

sum

= =

minusminus

=

minusminus

isinsdotisin

isinsdotisin=isin C

j

n

i

tji

tj

tj

tji

n

i

tki

tk

tk

tki

tki

tk

dPdP

dPdPdP

1 1

11

1

11

)|()|(

)|()|()|(

μμμμ

μμμμμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Recompute membership probabilities using normal

distribution

New Probability Old Probabilities

Maximization New Clusters

bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster

sum

sum

=

minusminus

=

minusminus

isin

sdotisin= n

i

tki

tk

n

ii

tki

tk

tk

dP

ddP

1

11

1

11

)|(

)|(

μμ

μμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Bayesian Information Criterion

bull BIC measures the efficiency of the parameterized model in terms of predicting the data

bull It is independent to prior knowledge and the model used

)ln()|(

ln 1

2

nkn

dPnBIC

n

i

tk

tki

tk sdot+

⎟⎟⎟⎟

⎜⎜⎜⎜

⎛isin

sdot=sum=

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

Fit quality Complexity penalty

Bayesian Information Criterion

bull After convergence compute global BIC

)ln()|(

ln 1 1

2

nknk

dPnBIC

n

i

k

jjji

sdot+

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛isin

sdot=sumsum= =

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

[Figure: E. coli Initial Clusters (Test Data), Hour 1 vs. Hour 12. Each gene is labeled with its initial cluster number; genes plotted: GCSF, GMCSF, IL12B, IL1RN, IL6 (x2), PBEF, proIL1B, TNFA, IL8 (x2), IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44 (x2), ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR]

[Figure: E. coli Final Clusters (Test Data), Hour 1 vs. Hour 12. The same genes grouped into Clusters 1-9]

[Figure: EHEC, Hour 1 vs. Hour 6. Clusters 0-7 and 9; rational vs. discrete clusters]

[Figure: S. aureus, Hour 1 vs. Hour 6. Clusters 0-7; rational vs. discrete clusters]

[Figure: E. coli, Hour 1 vs. Hour 6. Clusters 0-7; rational vs. discrete clusters]

[Figure: S. Typhimurium, Hour 1 vs. Hour 6. Clusters 0-7, 10, and 12; rational vs. discrete clusters]

Self-Organizing Maps

• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values; thus, similar inputs will be mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← min(d(v, SOM)) for every node in SOM
        neighbors ← neighborhood(BMU)
        neighbors.weight = neighbors.weight + adjustment
    end

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001) 79-96.
• Kohonen's Self Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
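A runnable sketch of the pseudocode above (parameter choices and decay schedules are illustrative, loosely following the cited AI-Junkie tutorial, not the paper's settings):

```python
import math
import random

def train_som(vectors, grid_w, grid_h, n_iter, lr0=0.5):
    dim = len(vectors[0])
    rng = random.Random(0)
    # SOM <- random values: one weight vector per grid node
    som = {(x, y): [rng.random() for _ in range(dim)]
           for x in range(grid_w) for y in range(grid_h)}
    sigma0 = max(grid_w, grid_h) / 2.0
    lam = n_iter / math.log(sigma0) if sigma0 > 1 else n_iter
    for t in range(n_iter):
        v = rng.choice(vectors)
        # BMU: the node whose weights are closest (squared Euclidean) to v
        bmu = min(som, key=lambda p: sum((a - b) ** 2
                                         for a, b in zip(som[p], v)))
        sigma = sigma0 * math.exp(-t / lam)   # shrinking neighborhood
        lr = lr0 * math.exp(-t / n_iter)      # decaying learning rate
        for p, w in som.items():
            grid_d2 = (p[0] - bmu[0]) ** 2 + (p[1] - bmu[1]) ** 2
            if grid_d2 <= sigma ** 2:
                # nodes further from the BMU learn less
                theta = math.exp(-grid_d2 / (2 * sigma ** 2))
                for i in range(dim):
                    w[i] += theta * lr * (v[i] - w[i])
    return som

# Two 2-D inputs trained onto a 3x3 map
som = train_som([[0.0, 0.0], [1.0, 1.0]], 3, 3, 200)
```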

Best Matching Unit

• The BMU is the node in the SOM whose weight vector has the minimum distance to the training vector. It is typically found using the Euclidean distance formula:

$$d = \sqrt{\sum_{i=1}^{k} (v_i - w_i)^2}$$

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.


Neighborhood

• The area of the neighborhood shrinks over time. The exponential decay function can be used for this:

$$\sigma(t) = \sigma_0\, e^{-t/\lambda}, \qquad \lambda = n / \ln(\sigma_0)$$

where n is the number of iterations that the algorithm will run and σ₀ is the initial size of the neighborhood.


Weight Adjustment

• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node:

$$W(t+1) = W(t) + \Theta(t)\, L(t)\, \big(V(t) - W(t)\big)$$

• One decay function, L(t), decreases the learning rate over time.

• The other, Θ(t), decreases the rate of learning for neighbors further away from the BMU.


Mapping Mode (SOM)

• After the learning algorithm has been run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs will be near each other on the two-dimensional map.

GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern

• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools.

• Use online or download.

GenePattern

Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0.

Summary: The Self-Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, i.e., stocks, mutual funds, spectral peaks, etc.).

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged the input data into 9 clusters using the GenePattern software.

• Why 9 clusters? Used the average of Stephen's output (8) + 1 to use a square map (3x3 SOM).

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize Graph G = (V, E)
– Vertex v: a single gene "fingerprint" vector
– Edge e: pairwise similarity between genes
– Cut C: subset of E that partitions the graph
– Cluster c: subset of V
– Intersection of clusters c_i, c_j (i ≠ j) is ∅
– Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.

CLICK: Preprocessing

• Input: n x p matrix M of values

• n genes, p tests

• Data must be normalized

• Similarity measure:
– S_vu = v · u = |v||u| cos θ
– Proportional only when the norm is fixed for all v ∈ V

CLICK: Similarity

• Key idea: S is a normalized mixed distribution.

For u, v ∈ V in the same cluster: f_T(x) = pdf(x | μ_T, σ_T²), with mean μ_T and variance σ_T².

For u, v ∈ V in different clusters: f_F(x) = pdf(x | μ_F, σ_F²).

Basic CLICK

• M: n x p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates

$$f(S_{ij} \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(S_{ij}-\mu)^2 / 2\sigma^2}$$

$$w_{ij} = \ln\frac{p_{mates}\, f(S_{ij} \mid \mu_T, \sigma_T^2)}{(1-p_{mates})\, f(S_{ij} \mid \mu_F, \sigma_F^2)} = \ln\frac{p_{mates}\,\sigma_F}{(1-p_{mates})\,\sigma_T} + \frac{(S_{ij}-\mu_F)^2}{2\sigma_F^2} - \frac{(S_{ij}-\mu_T)^2}{2\sigma_T^2}$$
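The weight formula translates directly into code (the parameter values below are made up for illustration; in CLICK they are estimated from the data):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def click_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    # Edge weight: log-odds that genes i and j are mates given similarity s_ij
    return math.log((p_mates * normal_pdf(s_ij, mu_t, sigma_t)) /
                    ((1 - p_mates) * normal_pdf(s_ij, mu_f, sigma_f)))

# Similarities near the mates mean give positive weights;
# similarities near the non-mates mean give negative weights.
w_high = click_weight(0.8, p_mates=0.3, mu_t=0.8, sigma_t=0.2, mu_f=0.0, sigma_f=0.3)
w_low = click_weight(0.0, p_mates=0.3, mu_t=0.8, sigma_t=0.2, mu_f=0.0, sigma_f=0.3)
```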

Basic CLICK

R: Singleton Set

Basic_CLICK(Graph G):
    If V(G) = {v} then        // v is a singleton
        R.add(v)
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)

CLICK: Kernel

• Decision problem: is V…
– a singleton (|V| = 1)?
– a subset of 2+ clusters (need to partition more)?
– a subset of a single cluster (a kernel)?

CLICK: Kernel

• For all possible cuts C connecting V:
– H₀^C: Cut C disconnects two clusters
– H₁^C: Cut C partitions a kernel
– If H₀^C is more likely than H₁^C for any C, then V is not a kernel

• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K.

$$W(C) = \ln\frac{\Pr(C \mid H_1^C)}{\Pr(C \mid H_0^C)} = \sum_{(i,j)\in C} w_{ij}$$

Minimal Weight Cut

• Choose one vertex as the source; mark it visited

• Mark each next most highly connected vertex

• The last node t represents the cut:
– W(C_t) = Σᵢ w_it
– Merge t with the 2nd-last marked node s
– Remove t from V
– Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
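The steps above can be sketched as a compact plain-Python Stoer-Wagner implementation (the adjacency-dict input format and function name are my own):

```python
def min_weight_cut(graph):
    # Global minimum cut of an undirected weighted graph given as
    # {node: {neighbor: weight}}; returns (cut_weight, one side of the cut).
    nodes = list(graph)
    n = len(nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    w = [[0.0] * n for _ in range(n)]
    for u, nbrs in graph.items():
        for v, wt in nbrs.items():
            w[idx[u]][idx[v]] = wt
    merged_into = [[v] for v in nodes]   # originals held by each super-node
    active = list(range(n))
    best = (float("inf"), None)
    while len(active) > 1:
        # maximum-adjacency ordering: repeatedly pick the vertex most
        # strongly connected to the already-visited set
        a = [active[0]]
        conn = {v: w[active[0]][v] for v in active[1:]}
        while conn:
            t = max(conn, key=conn.get)
            a.append(t)
            del conn[t]
            for v in conn:
                conn[v] += w[t][v]
        s, t = a[-2], a[-1]
        # cut-of-the-phase: weight of t against everything else
        cut_w = sum(w[t][v] for v in active if v != t)
        if cut_w < best[0]:
            best = (cut_w, list(merged_into[t]))
        # merge t into the 2nd-last node s and drop t
        for v in active:
            if v not in (s, t):
                w[s][v] += w[t][v]
                w[v][s] = w[s][v]
        merged_into[s].extend(merged_into[t])
        active.remove(t)
    return best

# two weight-3 triangles joined by a single weight-1 edge:
graph = {
    "a": {"b": 3, "c": 3, "d": 1},
    "b": {"a": 3, "c": 3},
    "c": {"a": 3, "b": 3},
    "d": {"a": 1, "e": 3, "f": 3},
    "e": {"d": 3, "f": 3},
    "f": {"d": 3, "e": 3},
}
cut_weight, side = min_weight_cut(graph)  # cuts the weight-1 bridge
```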

MinWeightCut Example

[Figures: step-by-step MinWeightCut example]

CLICK: Adoption & Merging

• Basic_CLICK produces kernels, not full clusters

• Expand kernels by adopting the closest singletons

• Merge kernels with the closest similarity
– In both situations, only merge/adopt over some threshold

Full CLICK

R: Singleton Set

Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of Kernels produced
        Let R be the set of Singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results & Analysis

• Two metrics:
– Similarity between result clusters
– Similarity of clusters with the paper's results

• Map clusters to find regions of similar data

• Compare the best clusters to find relationships

[Figures: cluster maps, 1 Hour vs. 6 Hours, comparing the Bayesian, CLICK, and SOM results for BCG, E. coli, EHEC, L. monocytogenes, Latex, M. tuberculosis, S. aureus, S. Typhi, and S. Typhimurium]

Error Metric

• Scores higher for clusters that contain either very many or very few members of the truth set

• Scales with the number of data points in each cluster

• Low error implies the truth clusters were highly correlated

• High error implies the truth clusters were well distributed

Error: Bayesian 4.33, CLICK 2.44, SOM 18.22

$$E = \frac{1}{k} \sum_{i=1}^{k} \left[ \left( \frac{1}{2} - \left| \frac{|T_i \cap C_i|}{|C_i|} - \frac{1}{2} \right| \right) \cdot |C_i| \right]$$
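On one plausible reading of the (partially garbled) formula, each cluster is scored by how far its overlap with the matching truth cluster is from an even split; a sketch under that assumption (function and variable names are mine, not the project's):

```python
def cluster_error(truth, ours):
    # One reading of the metric: a cluster that is all (or none) of its truth
    # cluster contributes 0; a half-and-half cluster contributes the most,
    # scaled by the cluster's size.
    k = len(ours)
    total = 0.0
    for t, c in zip(truth, ours):
        frac = len(t & c) / len(c)
        total += (0.5 - abs(frac - 0.5)) * len(c)
    return total / k

pure = cluster_error([{1, 2}, {3, 4}], [{1, 2}, {3, 4}])    # perfect match
mixed = cluster_error([{1, 2}, {3, 4}], [{1, 3}, {2, 4}])   # 50/50 split
```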

[Table: per-cluster comparison of the three methods against the paper's clusters (percent of our clusters, percent of the paper's cluster, cluster quality, and minimum cluster composition for Bayesian, CLICK, and SOM); overall error: Bayesian 4.33, CLICK 2.44, SOM 18.22]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption & Merging
  • Full CLICK
  • Results & Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Expectation: Non-spherical Clusters

$$\Sigma_k = D_k \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} D_k^T$$

• D_k = eigenvectors of Σ, the orientation of cluster k

• λ₁, …, λ_n = eigenvalues of Σ, the radii of cluster k

• Spherical clusters are usually sufficient (Identity Matrix)
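A small 2-D check of the decomposition (helper names are mine): rotating a diagonal eigenvalue matrix through the eigenvector basis yields a full covariance, while θ = 0 or equal eigenvalues collapse back to the axis-aligned and spherical special cases.

```python
import math

def covariance_from_axes(theta, eigvals):
    # Sigma_k = D_k * diag(lambda_1, lambda_2) * D_k^T for a 2-D cluster:
    # D_k holds the eigenvectors (orientation), lambda the eigenvalues (radii).
    c, s = math.cos(theta), math.sin(theta)
    d = [[c, -s], [s, c]]
    lam = [[eigvals[0], 0.0], [0.0, eigvals[1]]]
    dt = [[d[0][0], d[1][0]], [d[0][1], d[1][1]]]
    mul = lambda a, b: [[sum(a[i][k] * b[k][j] for k in range(2))
                         for j in range(2)] for i in range(2)]
    return mul(mul(d, lam), dt)

aligned = covariance_from_axes(0.0, (4.0, 1.0))     # axis-aligned ellipse
spherical = covariance_from_axes(0.7, (2.0, 2.0))   # identity-like, any angle
```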


Compute initial probabilities

bull Given the initial clusters initialize the probabilities with the Normal Distribution

bull Since the probabilities will have to be normalized anyway we can skip the constant

2)()(00 00

)|( kiki ddkki edP μμμμ minussdotminusminus=isin

Expectation Bayersquos law

)()()|()|(

EffectPCausePCauseEffectPEffectCauseP sdot

=

Posterior Prior

sum sdotsdot

=)()|(

)()|()|(CausePCauseEffectP

CausePCauseEffectPEffectCauseP

Expectation Bayersquos law

sum sum

sum

= =

minusminus

=

minusminus

isinsdotisin

isinsdotisin=isin C

j

n

i

tji

tj

tj

tji

n

i

tki

tk

tk

tki

tki

tk

dPdP

dPdPdP

1 1

11

1

11

)|()|(

)|()|()|(

μμμμ

μμμμμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Recompute membership probabilities using normal

distribution

New Probability Old Probabilities

Maximization New Clusters

bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster

sum

sum

=

minusminus

=

minusminus

isin

sdotisin= n

i

tki

tk

n

ii

tki

tk

tk

dP

ddP

1

11

1

11

)|(

)|(

μμ

μμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Bayesian Information Criterion

bull BIC measures the efficiency of the parameterized model in terms of predicting the data

bull It is independent to prior knowledge and the model used

)ln()|(

ln 1

2

nkn

dPnBIC

n

i

tk

tki

tk sdot+

⎟⎟⎟⎟

⎜⎜⎜⎜

⎛isin

sdot=sum=

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

Fit quality Complexity penalty

Bayesian Information Criterion

bull After convergence compute global BIC

)ln()|(

ln 1 1

2

nknk

dPnBIC

n

i

k

jjji

sdot+

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛isin

sdot=sumsum= =

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

GCSF

GMCSF

IL12B

IL1RN

IL6

IL6

PBEF

proIL1B

TNFA

IL8

IL8

IP10

MCP1

MGSA

MIP1A

MIP1B

MIP2A

MIP2B

RANTES

CD44

CD44

ICAM1

IFITM1

LAMB3

NINJ1

TNFAIP6

ADORA2A

CCR6

CCR7

CCRL2

DTR

1

2

37

11

6

4

10

5

8 12

9

13

14

15

17

16

18

19

20

22

21

23

24

25

26

27

28

29

30

Initial ClustersE coliTest Data

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR

1 Hour

12 H

ours

E-Coli Initial Clusters (Test Data)

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9

E-Coli Final Clusters (Test Data)

1 Hour

12 H

ours

EHEC

-6

-4

-2

0

2

4

6

8

-6 -4 -2 0 2 4 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters

S Aureus

-6

-4

-2

0

2

4

6

8

-5 -4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

oura

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters

Ecoli

-4

-3

-2

-1

0

1

2

3

4

5

6

-2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters

S Typhirium

-4

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters

Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a

random map to be like the input values Thus similar inputs will be mapped close to each other

nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)

learning_algorith(n V)SOM larr random valuesfor j larr 1 to n

v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment

end

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 17: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Compute initial probabilities

bull Given the initial clusters initialize the probabilities with the Normal Distribution

bull Since the probabilities will have to be normalized anyway we can skip the constant

2)()(00 00

)|( kiki ddkki edP μμμμ minussdotminusminus=isin

Expectation Bayersquos law

)()()|()|(

EffectPCausePCauseEffectPEffectCauseP sdot

=

Posterior Prior

sum sdotsdot

=)()|(

)()|()|(CausePCauseEffectP

CausePCauseEffectPEffectCauseP

Expectation Bayersquos law

sum sum

sum

= =

minusminus

=

minusminus

isinsdotisin

isinsdotisin=isin C

j

n

i

tji

tj

tj

tji

n

i

tki

tk

tk

tki

tki

tk

dPdP

dPdPdP

1 1

11

1

11

)|()|(

)|()|()|(

μμμμ

μμμμμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Recompute membership probabilities using normal

distribution

New Probability Old Probabilities

Maximization New Clusters

bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster

sum

sum

=

minusminus

=

minusminus

isin

sdotisin= n

i

tki

tk

n

ii

tki

tk

tk

dP

ddP

1

11

1

11

)|(

)|(

μμ

μμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Bayesian Information Criterion

bull BIC measures the efficiency of the parameterized model in terms of predicting the data

bull It is independent to prior knowledge and the model used

)ln()|(

ln 1

2

nkn

dPnBIC

n

i

tk

tki

tk sdot+

⎟⎟⎟⎟

⎜⎜⎜⎜

⎛isin

sdot=sum=

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

Fit quality Complexity penalty

Bayesian Information Criterion

bull After convergence compute global BIC

)ln()|(

ln 1 1

2

nknk

dPnBIC

n

i

k

jjji

sdot+

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛isin

sdot=sumsum= =

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

[Figure: E. coli initial clusters (test data), hour 1 vs. hour 12. Scatter of the significant genes GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR.]

[Figure: E. coli final clusters (test data), hour 1 vs. hour 12. The same genes partitioned into clusters 1-9.]

[Figure: EHEC, 1 hour vs. 6 hours. Clusters 0-9; rational vs. discrete clusters.]

[Figure: S. aureus, 1 hour vs. 6 hours. Clusters 0-7; rational vs. discrete clusters.]

[Figure: E. coli, 1 hour vs. 6 hours. Clusters 0-7; rational vs. discrete clusters.]

[Figure: S. Typhimurium, 1 hour vs. 6 hours. Clusters 0-12; rational vs. discrete clusters.]

Self-Organizing Maps

• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values, so similar inputs end up mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← min(d(v, SOM)) over every node in SOM
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
    end

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79-96.
• Kohonen's Self Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
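A runnable sketch of this training loop on a 1-D grid of nodes (the function name, decay constants, and node count are illustrative assumptions, not from the slides):

```python
import math
import random

def train_som(data, n_nodes, n_iter, sigma0=2.0, lr0=0.5):
    """Sketch of the SOM training loop above: pick a random learning
    vector, find the best matching unit (BMU), then pull the BMU and its
    grid neighbors toward the vector with decaying strength."""
    random.seed(0)
    dim = len(data[0])
    som = [[random.random() for _ in range(dim)] for _ in range(n_nodes)]
    lam = n_iter / math.log(sigma0)          # decay time constant
    for t in range(n_iter):
        v = random.choice(data)
        # BMU: node whose weight vector is nearest to v (Euclidean).
        bmu = min(range(n_nodes),
                  key=lambda k: sum((a - b) ** 2 for a, b in zip(v, som[k])))
        sigma = sigma0 * math.exp(-t / lam)  # shrinking neighborhood
        lr = lr0 * math.exp(-t / lam)        # decaying learning rate
        for k in range(n_nodes):
            grid_dist2 = (k - bmu) ** 2      # distance on the 1-D grid
            theta = math.exp(-grid_dist2 / (2 * sigma ** 2))
            som[k] = [w + theta * lr * (a - w) for w, a in zip(som[k], v)]
    return som

som = train_som([[0.0, 0.0], [1.0, 1.0]], n_nodes=4, n_iter=200)
```

After training, the node weights have been pulled from their random start into the region occupied by the learning vectors.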

Best Matching Unit

• The BMU is the node in the SOM whose weight vector has the minimum distance to the training vector. It is typically found using the Euclidean distance formula:

d(v, w) = sqrt( Σ_{i=1}^{k} (v_i − w_i)² )

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.

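A direct sketch of the BMU search (squared distances suffice for the argmin, since the square root is monotone; names are illustrative):

```python
def best_matching_unit(v, som):
    """Return the index of the SOM node whose weight vector is nearest
    to training vector v under (squared) Euclidean distance."""
    def sq_dist(w):
        return sum((vi - wi) ** 2 for vi, wi in zip(v, w))
    return min(range(len(som)), key=lambda k: sq_dist(som[k]))

som = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(best_matching_unit([0.9, 1.1], som))  # → 2
```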

Neighborhood

• The area of the neighborhood shrinks over time. An exponential decay function can be used:

σ(t) = σ₀ · e^(−t/λ),  with λ = n / ln(σ₀)

where n is the number of iterations the algorithm will run and σ₀ is the initial size of the neighborhood.

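The decay can be sketched as follows (function name illustrative); with λ = n/ln(σ₀), the radius falls from σ₀ at t = 0 to exactly 1 at t = n:

```python
import math

def neighborhood_radius(t, n_iter, sigma0):
    """Exponentially shrinking neighborhood:
    sigma(t) = sigma0 * exp(-t / lambda), lambda = n_iter / ln(sigma0)."""
    lam = n_iter / math.log(sigma0)
    return sigma0 * math.exp(-t / lam)

radii = [neighborhood_radius(t, 100, 4.0) for t in (0, 50, 100)]
print(radii)  # ≈ [4.0, 2.0, 1.0]
```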

Weight Adjustment

• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node.

• One decay function decreases the learning rate over time.

• The other decreases the rate of learning for neighbors further away from the BMU.


Mapping Mode (SOM)

• After the learning algorithm is run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs will be near each other on the two-dimensional map.

GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern

• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools.

• Use it online or download it.

GenePattern

Module name: SOMClustering
Description: Self-Organizing Maps algorithm
Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu
Date: 10/28/03
Release: 1.0

Summary: The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped into the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, i.e., stocks, mutual funds, spectral peaks, etc.).

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged the input data into 9 clusters using the GenePattern software

• Why 9 clusters?
– Used the average of Stephen's output (8) + 1, to allow a square map (3x3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize: Graph G = (V, E)
– Vertex v: a single gene "fingerprint" vector
– Edge e: pairwise similarity between genes
– Cut C: a subset of E that partitions the graph
– Cluster c: a subset of V
– The intersection of clusters c_i, c_j (i ≠ j) is Ø
– Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.

CLICK: Preprocessing

• Input: n x p matrix M of values

• n genes, p tests

• Data must be normalized

• Similarity measure:
– S_vu = v · u = |v||u| cos θ
– Proportional to the angle only when the norm is fixed for all v ∈ V
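A small sketch of this similarity measure (helper names are illustrative), showing why fixing the norm makes the dot product a pure cosine:

```python
import math

def normalize(v):
    """Scale v to unit norm so the dot product reduces to cos(theta)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity(v, u):
    """S_vu = v·u; on normalized fingerprints this is the cosine of the
    angle between the two expression profiles."""
    return sum(a * b for a, b in zip(v, u))

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 4.0, 6.0])   # same direction, different magnitude
print(similarity(a, b))          # ≈ 1.0
```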

CLICK: Similarity

• Key idea: S is a normalized mixed distribution.

For u, v ∈ V in the same cluster:       S_uv ~ f_T(x | μ_T, σ_T²)  (pdf for elements of the same cluster)
For u, v ∈ V in different clusters:     S_uv ~ f_F(x | μ_F, σ_F²)  (pdf for elements of different clusters)

Basic CLICK

• M: n x p matrix (genes vs. test conditions)

• S_ij: dot product of v_i, v_j

• w_ij: probability that v_i, v_j are mates

f(S_ij | μ, σ²) = 1/√(2πσ²) · e^( −(S_ij − μ)² / 2σ² )

w_ij = ln[ p_mates · f(S_ij | μ_T, σ_T²) / ( (1 − p_mates) · f(S_ij | μ_F, σ_F²) ) ]

     = ln( p_mates · σ_F / ( (1 − p_mates) · σ_T ) ) + (S_ij − μ_F)² / 2σ_F² − (S_ij − μ_T)² / 2σ_T²
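The log-odds weight w_ij can be sketched directly from the two Gaussian densities (the parameter values below are illustrative, not from the data):

```python
import math

def mate_weight(s, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """Log-odds that an edge with similarity s joins two mates:
    w = ln[ p * f_T(s) / ((1 - p) * f_F(s)) ] for Gaussian pdfs f_T, f_F."""
    def normal_pdf(x, mu, sigma):
        return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                / math.sqrt(2 * math.pi * sigma ** 2))
    return math.log(p_mates * normal_pdf(s, mu_t, sigma_t)
                    / ((1 - p_mates) * normal_pdf(s, mu_f, sigma_f)))

# Similarities near the mates mean get positive weight; similarities
# near the non-mates mean get negative weight:
hi = mate_weight(0.8, p_mates=0.5, mu_t=0.8, sigma_t=0.2, mu_f=0.0, sigma_f=0.2)
lo = mate_weight(0.0, p_mates=0.5, mu_t=0.8, sigma_t=0.2, mu_f=0.0, sigma_f=0.2)
```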

Basic CLICK

R: singleton set

Basic_CLICK(Graph G):
    If V(G) = {v} then          // v is a singleton
        R.add(v)
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)

CLICK: Kernel

• Decision problem: is V…
– a singleton (|V| = 1)?
– a subset of 2+ clusters (need to partition further)?
– a subset of a single cluster (a kernel)?

CLICK: Kernel

• For all possible cuts C connecting V:
– H0^C: cut C disconnects two clusters
– H1^C: cut C partitions a kernel
– If H0^C is more likely than H1^C for any C, then V is not a kernel

• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

W(C) = ln( Pr(H1^C | C) / Pr(H0^C | C) ) = Σ_{(i,j) ∈ C} w_ij

Minimal Weight Cut

• Choose one vertex as the source; mark it visited

• Repeatedly mark the most highly connected unmarked vertex

• The last node t marked represents the cut
– W(C_t) = Σ_i w_it
– Merge t with the second-to-last marked node s
– Remove t from V
– Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
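A compact sketch of the Stoer-Wagner procedure described above, on a graph given as a symmetric weight matrix (names and representation are illustrative):

```python
def stoer_wagner(weights):
    """Minimum weight cut via repeated maximum-adjacency orderings.
    Each phase grows a 'most tightly connected' ordering, records the
    cut-of-the-phase at the last-added vertex t, then merges t into
    the second-to-last vertex s."""
    n = len(weights)
    w = [row[:] for row in weights]          # working copy (vertices merge)
    vertices = list(range(n))
    best = float("inf")
    while len(vertices) > 1:
        in_a = [vertices[0]]
        conn = {v: w[vertices[0]][v] for v in vertices[1:]}
        while conn:
            t = max(conn, key=conn.get)      # most connected vertex next
            cut_of_phase = conn.pop(t)
            in_a.append(t)
            for v in conn:                   # update connectivity to the set
                conn[v] += w[t][v]
        s, t = in_a[-2], in_a[-1]
        best = min(best, cut_of_phase)       # weight of the cut ({t}, rest)
        for v in vertices:                   # merge t into s
            w[s][v] += w[t][v]
            w[v][s] = w[s][v]
        vertices.remove(t)
    return best

# Two triangles joined by one light edge: the min cut crosses that edge.
g = [[0, 3, 3, 1, 0, 0],
     [3, 0, 3, 0, 0, 0],
     [3, 3, 0, 0, 0, 0],
     [1, 0, 0, 0, 3, 3],
     [0, 0, 0, 3, 0, 3],
     [0, 0, 0, 3, 3, 0]]
print(stoer_wagner(g))  # → 1
```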

MinWeightCut Example

MinWeightCut

CLICK: Adoption & Merging

• Basic_CLICK produces kernels, not full clusters

• Expand kernels by adding the closest singletons

• Merge kernels with the closest similarity
– In both situations, only merge/adopt above some threshold

Full CLICK

R: singleton set

Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results & Analysis

• Two metrics:
– Similarity between result clusters
– Similarity of clusters with the paper's results

• Map clusters to find regions of similar data

• Compare best clusters to find relationships

[Figures: comparison scatter plots (1 hour vs. 6 hours) of the Bayesian, Click, and SOM clusterings for each condition: BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. Typhi, and S. Typhimurium.]

Error Metric

• Scores higher for clusters that contain either very many or very few of the clusters in the truth set

• Scales with the number of data points in each cluster

• Low error implies the truth clusters were highly correlated with ours

• High error implies the truth clusters were well distributed across ours

Error: Bayesian 4.33, Click 2.44, SOM 18.22

E = Σ_{i=1}^{k} [ ( 1/2 − |T_i ∩ C_i| / |C_i| ) · |C_i| ] / Σ_{i=1}^{k} |C_i|

where T_i are the truth clusters and C_i are our clusters.
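The exact algebra of E is hard to pin down from the slide, so the sketch below only implements the behavior described in the bullets: each of our clusters is scored by how far its truth-cluster overlap fraction sits from 1/2, weighted by cluster size. All names, and the best-match pairing of truth clusters to ours, are assumptions:

```python
def cluster_error(truth, ours):
    """Hedged sketch of the error metric: for each of our clusters C_i,
    take the overlap fraction |T_i ∩ C_i| / |C_i| against its
    best-matching truth cluster T_i, score its distance from 1/2,
    and weight by cluster size."""
    total = sum(len(c) for c in ours)
    err = 0.0
    for c in ours:
        c_set = set(c)
        overlap = max(len(c_set & set(t)) for t in truth)
        err += abs(0.5 - overlap / len(c_set)) * len(c_set)
    return err / total

truth = [[1, 2, 3, 4], [5, 6, 7, 8]]
# Pure clusters sit at an extreme overlap fraction (high score);
# half-and-half mixtures sit exactly at 1/2 (score 0).
perfect = cluster_error(truth, [[1, 2, 3, 4], [5, 6, 7, 8]])
mixed = cluster_error(truth, [[1, 2, 5, 6], [3, 4, 7, 8]])
```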

[Table: per-cluster comparison of the Bayesian, Click, and SOM results against the paper's clusters (clusters 0-34), with columns "Percent of Our Clusters", "Percent of Paper Cluster", "Quality", and "Min Cluster Composition". Overall error: 4.33 (Bayesian), 2.44 (Click), 18.22 (SOM).]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption & Merging
  • Full CLICK
  • Results & Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 18: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Expectation Bayersquos law

)()()|()|(

EffectPCausePCauseEffectPEffectCauseP sdot

=

Posterior Prior

sum sdotsdot

=)()|(

)()|()|(CausePCauseEffectP

CausePCauseEffectPEffectCauseP

Expectation Bayersquos law

sum sum

sum

= =

minusminus

=

minusminus

isinsdotisin

isinsdotisin=isin C

j

n

i

tji

tj

tj

tji

n

i

tki

tk

tk

tki

tki

tk

dPdP

dPdPdP

1 1

11

1

11

)|()|(

)|()|()|(

μμμμ

μμμμμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Recompute membership probabilities using normal

distribution

New Probability Old Probabilities

Maximization New Clusters

bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster

sum

sum

=

minusminus

=

minusminus

isin

sdotisin= n

i

tki

tk

n

ii

tki

tk

tk

dP

ddP

1

11

1

11

)|(

)|(

μμ

μμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Bayesian Information Criterion

bull BIC measures the efficiency of the parameterized model in terms of predicting the data

bull It is independent to prior knowledge and the model used

)ln()|(

ln 1

2

nkn

dPnBIC

n

i

tk

tki

tk sdot+

⎟⎟⎟⎟

⎜⎜⎜⎜

⎛isin

sdot=sum=

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

Fit quality Complexity penalty

Bayesian Information Criterion

bull After convergence compute global BIC

)ln()|(

ln 1 1

2

nknk

dPnBIC

n

i

k

jjji

sdot+

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛isin

sdot=sumsum= =

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

GCSF

GMCSF

IL12B

IL1RN

IL6

IL6

PBEF

proIL1B

TNFA

IL8

IL8

IP10

MCP1

MGSA

MIP1A

MIP1B

MIP2A

MIP2B

RANTES

CD44

CD44

ICAM1

IFITM1

LAMB3

NINJ1

TNFAIP6

ADORA2A

CCR6

CCR7

CCRL2

DTR

1

2

37

11

6

4

10

5

8 12

9

13

14

15

17

16

18

19

20

22

21

23

24

25

26

27

28

29

30

Initial ClustersE coliTest Data

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR

1 Hour

12 H

ours

E-Coli Initial Clusters (Test Data)

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9

E-Coli Final Clusters (Test Data)

1 Hour

12 H

ours

EHEC

-6

-4

-2

0

2

4

6

8

-6 -4 -2 0 2 4 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters

S Aureus

-6

-4

-2

0

2

4

6

8

-5 -4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

oura

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters

Ecoli

-4

-3

-2

-1

0

1

2

3

4

5

6

-2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters

S Typhirium

-4

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters

Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a

random map to be like the input values Thus similar inputs will be mapped close to each other

nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)

learning_algorith(n V)SOM larr random valuesfor j larr 1 to n

v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment

end

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 19: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Expectation Bayersquos law

sum sum

sum

= =

minusminus

=

minusminus

isinsdotisin

isinsdotisin=isin C

j

n

i

tji

tj

tj

tji

n

i

tki

tk

tk

tki

tki

tk

dPdP

dPdPdP

1 1

11

1

11

)|()|(

)|()|()|(

μμμμ

μμμμμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Recompute membership probabilities using normal

distribution

New Probability Old Probabilities

Maximization New Clusters

bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster

sum

sum

=

minusminus

=

minusminus

isin

sdotisin= n

i

tki

tk

n

ii

tki

tk

tk

dP

ddP

1

11

1

11

)|(

)|(

μμ

μμμ

[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis

Bayesian Information Criterion

bull BIC measures the efficiency of the parameterized model in terms of predicting the data

bull It is independent to prior knowledge and the model used

)ln()|(

ln 1

2

nkn

dPnBIC

n

i

tk

tki

tk sdot+

⎟⎟⎟⎟

⎜⎜⎜⎜

⎛isin

sdot=sum=

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

Fit quality Complexity penalty

Bayesian Information Criterion

bull After convergence compute global BIC

)ln()|(

ln 1 1

2

nknk

dPnBIC

n

i

k

jjji

sdot+

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛isin

sdot=sumsum= =

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

GCSF

GMCSF

IL12B

IL1RN

IL6

IL6

PBEF

proIL1B

TNFA

IL8

IL8

IP10

MCP1

MGSA

MIP1A

MIP1B

MIP2A

MIP2B

RANTES

CD44

CD44

ICAM1

IFITM1

LAMB3

NINJ1

TNFAIP6

ADORA2A

CCR6

CCR7

CCRL2

DTR

1

2

37

11

6

4

10

5

8 12

9

13

14

15

17

16

18

19

20

22

21

23

24

25

26

27

28

29

30

Initial ClustersE coliTest Data

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR

1 Hour

12 H

ours

E-Coli Initial Clusters (Test Data)

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9

E-Coli Final Clusters (Test Data)

1 Hour

12 H

ours

[Figures: rational vs discrete cluster plots, 1 hour vs 6 hours, for EHEC (Clusters 0-7, 9), S. aureus (Clusters 0-7), E. coli (Clusters 0-7), and S. typhimurium (Clusters 0-7, 10, 12).]

Self-Organizing Maps

• SOMs are based on neural networks. They use a learning algorithm to train a randomly initialized map to resemble the input values; thus, similar inputs are mapped close to each other.

n ← number of iterations for training algorithm
V ← set of learning vectors (same dimension as input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← node of SOM minimizing d(v, node)
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
    end

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001) 79-96.
• Kohonen's Self Organizing Feature Maps. AI-Junkie. March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
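The pseudocode above can be sketched as a small runnable program. This is a minimal pure-Python illustration; the grid size, decay constants, fixed random seed, and 0.5 initial learning rate are illustrative choices, not values from the slides:

```python
import math
import random

def train_som(data, grid_w, grid_h, n_iters, lr0=0.5):
    """Train a grid of weight vectors so similar inputs map to nearby nodes."""
    random.seed(0)                         # deterministic for illustration
    dim = len(data[0])
    # SOM <- random values
    som = [[[random.random() for _ in range(dim)] for _ in range(grid_w)]
           for _ in range(grid_h)]
    sigma0 = max(grid_w, grid_h) / 2.0     # initial neighborhood radius
    lam = n_iters / max(math.log(sigma0), 1e-9)
    for t in range(n_iters):
        v = random.choice(data)            # v <- V(random)
        # BMU <- node minimizing Euclidean distance d(v, node)
        bmu = min(((r, c) for r in range(grid_h) for c in range(grid_w)),
                  key=lambda rc: sum((a - b) ** 2
                                     for a, b in zip(som[rc[0]][rc[1]], v)))
        sigma = sigma0 * math.exp(-t / lam)    # shrinking neighborhood radius
        lr = lr0 * math.exp(-t / n_iters)      # decaying learning rate
        for r in range(grid_h):
            for c in range(grid_w):
                d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                if d2 <= sigma ** 2:           # neighbors <- neighborhood(BMU)
                    theta = math.exp(-d2 / (2 * sigma ** 2))
                    node = som[r][c]
                    for i in range(dim):       # pull the node's weights toward v
                        node[i] += lr * theta * (v[i] - node[i])
    return som
```

After training on two well-separated point clouds, inputs from the two clouds should map to different best-matching nodes of the grid.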

Best Matching Unit

• The BMU is the node of the SOM at minimum distance from the training vector. It is typically found using the Euclidean distance formula:

  d(v, w) = sqrt( Σ_{i=1..k} (v_i − w_i)² )

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.


Neighborhood

• The area of the neighborhood shrinks over time; an exponential decay function can be used:

  σ(t) = σ₀ · e^(−t/λ),  λ = n / ln(σ₀)

where n is the number of iterations the algorithm will run and σ₀ is the initial size of the neighborhood.


Weight Adjustment

• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node:

  W(t+1) = W(t) + θ(t) · L(t) · (V(t) − W(t))

• One decay function, L(t), decreases the learning variable over time

• The other, θ(t), decreases the rate of learning for neighbors further away from the BMU

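The three decay components described above can be written out directly. A sketch; the constant λ = n / ln(σ₀) and the 0.5 initial learning rate are common choices, not values given on the slides:

```python
import math

def neighborhood_radius(t, n_iters, sigma0):
    """Exponential decay of the neighborhood size over time (requires sigma0 > 1)."""
    lam = n_iters / math.log(sigma0)
    return sigma0 * math.exp(-t / lam)

def learning_rate(t, n_iters, l0=0.5):
    """Time-based decay of the learning variable L(t)."""
    return l0 * math.exp(-t / n_iters)

def influence(dist2, sigma):
    """Decay of learning theta with squared grid distance from the BMU."""
    return math.exp(-dist2 / (2 * sigma ** 2))
```

With λ chosen this way the radius decays from σ₀ at t = 0 to exactly 1 at t = n, so the neighborhood eventually contains only the BMU itself.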

Mapping Mode (SOM)

• After the learning algorithm has run, the input is mapped onto the SOM; similar n-dimensional inputs land near each other on the two-dimensional map.

GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern

• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

• Use it online or download it

GenePattern Module: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0. Summary:

The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g., stocks, mutual funds, spectral peaks, etc.).

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged the input data into 9 clusters using the GenePattern software

• Why 9 clusters?
  – Average of Stephen's output (8) + 1, to use a square map (3x3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize graph G = (V, E)
  – Vertex v: a single gene's "fingerprint" vector
  – Edge e: pairwise similarity between genes
  – Cut C: a subset of E that partitions the graph
  – Cluster c: a subset of V
  – Clusters are disjoint: c_i ∩ c_j = Ø for i ≠ j
  – Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.

CLICK Preprocessing

• Input: n x p matrix M of values

• n: genes, p: tests

• Data must be normalized

• Similarity measure:
  – S_vu = v·u = |v||u| cos θ
  – Proportional to cos θ only when the norm is fixed for all v ∈ V

CLICK Similarity

• Key idea: S follows a normalized mixture distribution

  For u, v ∈ V in the same cluster:     S_uv ~ f(x | μ_T, σ_T²)  (pdf with mean μ_T, variance σ_T²)
  For u, v ∈ V in different clusters:   S_uv ~ f(x | μ_F, σ_F²)  (pdf with mean μ_F, variance σ_F²)

Basic CLICK

• M: n x p matrix (genes vs test conditions)

• S_ij: dot product of v_i, v_j

• w_ij: probability that v_i, v_j are mates

  f(S_ij | μ, σ²) = (1 / (√(2π)·σ)) · e^(−(S_ij − μ)² / (2σ²))

  w_ij = ln[ p_mates · f(S_ij | μ_T, σ_T) / ((1 − p_mates) · f(S_ij | μ_F, σ_F)) ]

       = ln( p_mates·σ_F / ((1 − p_mates)·σ_T) ) + (S_ij − μ_F)²/(2σ_F²) − (S_ij − μ_T)²/(2σ_T²)
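The two forms of w_ij above should agree exactly; a quick numeric check (the parameter values in the test are arbitrary illustrations, not fitted values from the paper):

```python
import math

def gauss(x, mu, sigma):
    """Gaussian pdf f(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def mate_weight(s, p_mates, mu_t, sig_t, mu_f, sig_f):
    """Log-odds that an observed similarity s comes from a 'mates' pair."""
    return math.log(p_mates * gauss(s, mu_t, sig_t) /
                    ((1 - p_mates) * gauss(s, mu_f, sig_f)))

def mate_weight_expanded(s, p_mates, mu_t, sig_t, mu_f, sig_f):
    """Closed form after substituting the Gaussian pdfs into the log-ratio."""
    return (math.log(p_mates * sig_f / ((1 - p_mates) * sig_t))
            + (s - mu_f) ** 2 / (2 * sig_f ** 2)
            - (s - mu_t) ** 2 / (2 * sig_t ** 2))
```

A similarity near the mates mean μ_T yields a large positive weight, and one near the non-mates mean μ_F a large negative weight, which is what drives the min-weight-cut partitioning.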

R: singleton set
Basic_CLICK(Graph G):
    If V(G) = {v} then        // v is a singleton
        R.add(v)
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)

CLICK Kernel

• Decision problem: is V…
  – a singleton (|V| = 1)?
  – a subset of 2+ clusters (needs further partitioning)?
  – a subset of a single cluster (a kernel)?

CLICK Kernel

• For all possible cuts C connecting V:
  – H0_C: cut C disconnects two clusters
  – H1_C: cut C partitions a kernel
  – If Pr(H0_C) > Pr(H1_C) for some C, then V is not a kernel

• If V is not a kernel, the graph is partitioned into sub-graphs H, K

  W(C) = ln( Pr(H1_C | C) / Pr(H0_C | C) ) = Σ_{(i,j) ∈ C} w_ij

Minimal Weight Cut

• Choose one vertex as the source and mark it visited

• Repeatedly mark the most highly connected unmarked vertex

• The last node t marked defines the cut:
  – W(C_t) = Σ_i w_it
  – Merge t with the second-to-last marked node s
  – Remove t from V
  – Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
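The phase described above is the Stoer-Wagner algorithm. A compact sketch over an adjacency-matrix graph (an illustration of the cited algorithm, not the CLICK implementation itself):

```python
def stoer_wagner(weights):
    """Global minimum cut of an undirected graph given as a symmetric
    adjacency matrix of nonnegative edge weights."""
    n = len(weights)
    w = [row[:] for row in weights]          # working copy; rows are merged in place
    vertices = list(range(n))
    groups = [[i] for i in range(n)]         # original vertices behind each merged node
    best_cut, best_side = float("inf"), []
    while len(vertices) > 1:
        # One minimum-cut phase: grow A from an arbitrary source,
        # always adding the most tightly connected remaining vertex.
        a = [vertices[0]]
        rest = vertices[1:]
        conn = {v: w[a[0]][v] for v in rest}
        while rest:
            t = max(rest, key=lambda v: conn[v])
            rest.remove(t)
            a.append(t)
            for v in rest:
                conn[v] += w[t][v]
        s, t = a[-2], a[-1]
        cut_of_phase = conn[t]               # W(C_t) = sum of weights from t to the rest
        if cut_of_phase < best_cut:
            best_cut, best_side = cut_of_phase, groups[t][:]
        # Merge the last two marked vertices s and t, then repeat.
        for v in vertices:
            if v not in (s, t):
                w[s][v] += w[t][v]
                w[v][s] = w[s][v]
        groups[s].extend(groups[t])
        vertices.remove(t)
    return best_cut, best_side
```

Each phase yields one cut-of-the-phase; the minimum over all phases is the global minimum cut, exactly the (H, K) split Basic_CLICK recurses on.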

MinWeightCut Example

[Figures: step-by-step MinWeightCut example.]

CLICK Adoption & Merging

• Basic_CLICK outputs kernels – not full clusters

• Expand kernels by adopting the closest singletons

• Merge kernels with the closest similarity
  – In both situations, only merge/adopt above some similarity threshold

R: singleton set
Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results & Analysis

• Two metrics:
  – Similarity between result clusters
  – Similarity of our clusters with the paper's results

• Map clusters to find regions of similar data

• Compare the best clusters to find relationships

[Figures: cluster-center comparison scatter plots, 1 hour vs 6 hours, for each condition – BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. typhi, and S. typhimurium – overlaying the Bayesian, CLICK, and SOM results.]

Error Metric

• Scores better for clusters that contain either very many or very few members of a truth-set cluster

• Scales with the number of data points in each cluster

• Low error implies the truth clusters were highly correlated with ours

• High error implies the truth clusters were spread across our clusters

        Bayesian   CLICK   SOM
Error   4.33       2.44    1.822

  E = ( Σ_{i=1..k} |C_i| · ( 1 − | 1 − 2·|T_i ∩ C_i| / |T_i| | ) ) / Σ_{i=1..k} |C_i|

  where T_i is a truth-set cluster and C_i is the corresponding result cluster.
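One plausible reading of the reconstructed formula, as a runnable sketch; the pairing of each T_i with a C_i and the |C_i| weighting are assumptions recovered from the garbled slide, not a confirmed transcription:

```python
def cluster_error(truth, ours):
    """Weighted error: near 0 when each truth cluster lies almost entirely
    inside, or almost entirely outside, its paired result cluster.

    truth, ours: equal-length lists of sets of gene ids.
    """
    num = sum(len(c) * (1 - abs(1 - 2 * len(t & c) / len(t)))
              for t, c in zip(truth, ours))
    return num / sum(len(c) for c in ours)
```

A perfect match scores 0, and a 50% overlap (the most ambiguous case) scores the maximum per-cluster penalty.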

[Table: per-cluster comparison over clusters 0-34. Column groups: Percent of Our Clusters, Percent of Paper Cluster, Cluster Quality, and Min Cluster Composition, each reported for Bayesian, CLICK, and SOM. Bottom rows: Error – Bayesian 4.22929, CLICK 2.38409, SOM 1.78025 (also reported as 4.33, 2.44, 1.822).]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption & Merging
  • Full CLICK
  • Results & Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64

Maximization: New Clusters

• New cluster centers are computed as the average of the points, weighted by the probability that each point belongs to the cluster:

  μ_k^t = Σ_{i=1..n} P(d_i ∈ μ_k^{t−1} | d_i) · d_i  /  Σ_{i=1..n} P(d_i ∈ μ_k^{t−1} | d_i)

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.
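The update above, as a runnable sketch; here `resp[i]` stands for the responsibility P(d_i ∈ μ_k^{t−1} | d_i) produced by the expectation step (the function name is illustrative):

```python
def update_mean(data, resp):
    """Responsibility-weighted average of the points: the new cluster center.

    data: list of d-dimensional points; resp: one weight per point.
    """
    total = sum(resp)
    dim = len(data[0])
    return [sum(r * p[i] for r, p in zip(resp, data)) / total
            for i in range(dim)]
```

With equal responsibilities this reduces to the ordinary k-means centroid; unequal responsibilities pull the center toward the points most likely to belong to the cluster.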

Bayesian Information Criterion

• BIC measures the efficiency of the parameterized model in terms of predicting the data

• It is independent of prior knowledge and of the model used

  BIC_k^t = ln( Σ_{i=1..n} [d_i ∈ μ_k^t] · P(d_i | μ_k^t) ) + (k/2)·ln(n)

  (the first term measures fit quality; the second is the complexity penalty)

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion

Bayesian Information Criterion

bull After convergence compute global BIC

)ln()|(

ln 1 1

2

nknk

dPnBIC

n

i

k

jjji

sdot+

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛isin

sdot=sumsum= =

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

GCSF

GMCSF

IL12B

IL1RN

IL6

IL6

PBEF

proIL1B

TNFA

IL8

IL8

IP10

MCP1

MGSA

MIP1A

MIP1B

MIP2A

MIP2B

RANTES

CD44

CD44

ICAM1

IFITM1

LAMB3

NINJ1

TNFAIP6

ADORA2A

CCR6

CCR7

CCRL2

DTR

1

2

37

11

6

4

10

5

8 12

9

13

14

15

17

16

18

19

20

22

21

23

24

25

26

27

28

29

30

Initial ClustersE coliTest Data

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR

1 Hour

12 H

ours

E-Coli Initial Clusters (Test Data)

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9

E-Coli Final Clusters (Test Data)

1 Hour

12 H

ours

EHEC

-6

-4

-2

0

2

4

6

8

-6 -4 -2 0 2 4 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters

S Aureus

-6

-4

-2

0

2

4

6

8

-5 -4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

oura

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters

Ecoli

-4

-3

-2

-1

0

1

2

3

4

5

6

-2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters

S Typhirium

-4

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters

Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a

random map to be like the input values Thus similar inputs will be mapped close to each other

nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)

learning_algorith(n V)SOM larr random valuesfor j larr 1 to n

v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment

end

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 21: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Bayesian Information Criterion

bull BIC measures the efficiency of the parameterized model in terms of predicting the data

bull It is independent to prior knowledge and the model used

)ln()|(

ln 1

2

nkn

dPnBIC

n

i

tk

tki

tk sdot+

⎟⎟⎟⎟

⎜⎜⎜⎜

⎛isin

sdot=sum=

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

Fit quality Complexity penalty

Bayesian Information Criterion

bull After convergence compute global BIC

)ln()|(

ln 1 1

2

nknk

dPnBIC

n

i

k

jjji

sdot+

⎟⎟⎟⎟⎟

⎜⎜⎜⎜⎜

⎛isin

sdot=sumsum= =

μμ

[33] httpenwikipediaorgwikiBayesian_information_criterion

GCSF

GMCSF

IL12B

IL1RN

IL6

IL6

PBEF

proIL1B

TNFA

IL8

IL8

IP10

MCP1

MGSA

MIP1A

MIP1B

MIP2A

MIP2B

RANTES

CD44

CD44

ICAM1

IFITM1

LAMB3

NINJ1

TNFAIP6

ADORA2A

CCR6

CCR7

CCRL2

DTR

1

2

37

11

6

4

10

5

8 12

9

13

14

15

17

16

18

19

20

22

21

23

24

25

26

27

28

29

30

Initial ClustersE coliTest Data

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR

1 Hour

12 H

ours

E-Coli Initial Clusters (Test Data)

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9

E-Coli Final Clusters (Test Data)

1 Hour

12 H

ours

EHEC

-6

-4

-2

0

2

4

6

8

-6 -4 -2 0 2 4 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters

S Aureus

-6

-4

-2

0

2

4

6

8

-5 -4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

oura

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters

Ecoli

-4

-3

-2

-1

0

1

2

3

4

5

6

-2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters

S Typhirium

-4

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters

Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a

random map to be like the input values Thus similar inputs will be mapped close to each other

nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)

learning_algorith(n V)SOM larr random valuesfor j larr 1 to n

v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment

end

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern

• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools.

• Use online or download.

GenePattern
Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0. Summary:

The Self-Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. Resulting clusters are organized in a 2D grid where similar clusters lie near each other and provide an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, i.e., stocks, mutual funds, spectral peaks, etc.).

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged input data into 9 clusters using the GenePattern software

• Why 9 clusters?
  – Used the average of Stephen's output (8) + 1 to use a square map (3×3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize Graph G = (V, E)
  – Vertex v: single gene "fingerprint" vector

  – Edge e: pairwise similarity between genes

  – Cut C: subset of E that partitions the graph

  – Cluster c: subset of V

  – Intersection of clusters cᵢ, cⱼ (i ≠ j) is ∅

  – Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.

CLICK Preprocessing

• Input: n × p matrix M of values

• n: Genes; p: Tests

• Data must be normalized

• Similarity measure:
  – S_vu = v·u = |v||u|·cos θ
  – Proportional only when the norm is fixed for all v ∈ V
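Since the dot product tracks cos θ only when every vector has the same norm, the preprocessing step can scale each fingerprint to unit norm first. A minimal sketch (illustrative helper names, not CLICK's actual preprocessing code):

```python
import math

def normalize(v):
    """Scale a fingerprint vector to unit norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity(v, u):
    """S_vu = v . u; equals cos(theta) once both vectors are normalized."""
    return sum(a * b for a, b in zip(v, u))

# Two fingerprints pointing the same direction score 1, orthogonal ones score 0.
s_same = similarity(normalize([3.0, 4.0]), normalize([6.0, 8.0]))
s_orth = similarity(normalize([1.0, 0.0]), normalize([0.0, 2.0]))
```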

CLICK Similarity

• Key idea: S is a normalized mixed distribution

For mean μ_T and variance σ²_T: for elements u, v ∈ V in the same cluster, S ~ pdf f_T(x | μ_T, σ²_T)

For mean μ_F and variance σ²_F: for elements u, v ∈ V in different clusters, S ~ pdf f_F(x | μ_F, σ²_F)

Basic CLICK

• M: n × p matrix (genes vs. test conditions)

• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates

f(S_ij | μ, σ²) = 1 / (√(2π)·σ) · exp( −(S_ij − μ)² / (2σ²) )

w_ij = ln( (p_mates · f_T(S_ij | μ_T, σ²_T)) / ((1 − p_mates) · f_F(S_ij | μ_F, σ²_F)) )

     = ln( (p_mates · σ_F) / ((1 − p_mates) · σ_T) ) + (S_ij − μ_F)² / (2σ²_F) − (S_ij − μ_T)² / (2σ²_T)
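The mate-weight formula above can be sketched directly. The parameter values here (p_mates, the means, and the variances) are illustrative assumptions, not the fitted values from the paper:

```python
import math

def normal_pdf(x, mu, sigma):
    """f(x | mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) * sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def mate_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """w_ij = ln( p_mates * f_T(S_ij) / ((1 - p_mates) * f_F(S_ij)) )."""
    return math.log(p_mates * normal_pdf(s_ij, mu_t, sigma_t)
                    / ((1 - p_mates) * normal_pdf(s_ij, mu_f, sigma_f)))

# A similarity near the "mates" mean gets a positive edge weight,
# one near the "non-mates" mean gets a negative edge weight:
w_hi = mate_weight(0.8, 0.3, mu_t=0.8, sigma_t=0.1, mu_f=0.0, sigma_f=0.2)
w_lo = mate_weight(0.0, 0.3, mu_t=0.8, sigma_t=0.1, mu_f=0.0, sigma_f=0.2)
```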

Basic CLICK
R: Singleton Set
Basic_CLICK(Graph G):
    If V(G) = {v} then R.add(v)        // v is a singleton
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)

CLICK Kernel

• Decision problem: is V…
  – a singleton (|V| = 1)

  – a subset of 2+ clusters (need to partition more)

  – a subset of a single cluster (a kernel)

CLICK Kernel

• For all possible cuts C connecting V:
  – H₀^C: Cut C disconnects two clusters

  – H₁^C: Cut C partitions a kernel

  – If H₀^C > H₁^C for any C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

W(C) = ln( Pr(H₁^C | C) / Pr(H₀^C | C) ) = Σ_{(i,j) ∈ C} w_ij

Minimal Weight Cut

• Choose one vertex as the source, mark it visited

• Mark each next most highly connected vertex

• The last node t represents the cut
  – W(C_t) = Σᵢ w_it

  – Merge t with the 2nd-to-last marked node s

  – Remove t from V

  – Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
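The steps above can be sketched as a small adjacency-matrix implementation of the cited Stoer-Wagner procedure (an illustrative version, not the project's code):

```python
def stoer_wagner(weights):
    """Global minimum-weight cut of an undirected weighted graph.
    weights: symmetric adjacency matrix; returns (cut_weight, vertex_set_on_one_side)."""
    n = len(weights)
    w = [row[:] for row in weights]
    groups = [[i] for i in range(n)]          # original vertices merged into each node
    best_weight, best_cut = float("inf"), []
    active = list(range(n))
    while len(active) > 1:
        # Maximum-adjacency ordering from an arbitrary source vertex.
        a = [active[0]]
        rest = active[1:]
        conn = {v: w[active[0]][v] for v in rest}
        while rest:
            t = max(rest, key=lambda v: conn[v])   # next most connected vertex
            rest.remove(t)
            a.append(t)
            for v in rest:
                conn[v] += w[t][v]
        t, s = a[-1], a[-2]
        # Cut of the phase: W(C_t) = sum_i w_it over all other active vertices.
        if conn[t] < best_weight:
            best_weight, best_cut = conn[t], groups[t][:]
        # Merge t into the second-to-last marked node s, then remove t.
        groups[s] += groups[t]
        for v in active:
            if v != t and v != s:
                w[s][v] += w[t][v]
                w[v][s] = w[s][v]
        active.remove(t)
    return best_weight, best_cut
```

Each phase produces one candidate cut (isolating the last-marked vertex t); the minimum over all phases is the global minimum cut.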

MinWeightCut Example

MinWeightCut

CLICK Adoption & Merging

• Basic_CLICK produces kernels, not full clusters

• Expand kernels by adopting the closest singletons

• Merge kernels with the closest similarity
  – In both cases, only merge/adopt above some threshold

Full CLICK
R: Singleton Set
Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
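The recursion can be sketched structurally. The kernel test and min-weight-cut routine are passed in as functions here, since in the real algorithm they come from the probabilistic weight model on the preceding slides; this skeleton is an illustration, not the paper's implementation:

```python
def basic_click(vertices, is_kernel, min_weight_cut, singletons, kernels):
    """Recursive Basic-CLICK skeleton: peel off singletons, output kernels,
    otherwise split on the minimum-weight cut and recurse on both sides."""
    if len(vertices) == 1:
        singletons.add(next(iter(vertices)))      # v is a singleton
    elif is_kernel(vertices):
        kernels.append(set(vertices))             # output the kernel
    else:
        h, k = min_weight_cut(vertices)           # partition into sub-graphs H, K
        basic_click(h, is_kernel, min_weight_cut, singletons, kernels)
        basic_click(k, is_kernel, min_weight_cut, singletons, kernels)
```

Full-CLICK then repeatedly re-runs this on the remaining singletons, adopting them into nearby kernels and merging similar kernels.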

Results & Analysis

• Two metrics:
  – Similarity between result clusters

  – Similarity of clusters with the paper's results

• Map clusters to find regions of similar data

• Compare best clusters to find relationships

[Scatter plots comparing the Bayesian, CLICK, and SOM clusterings (1 Hour vs. 6 Hours expression) for each stimulus: BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. Typhi, and S. Typhimurium. Only the axis labels and legends survive extraction.]

Error Metric

• Scores higher for clusters that contain either very many or very few clusters in the truth set

• Scales to the number of data points in each cluster

• Low error implies the truth clusters were highly correlated

• High error implies the truth clusters were well distributed

Error: Bayesian 4.33, CLICK 2.44, SOM 1.822

E = ( Σᵢ₌₁ᵏ (1 − 2·|Tᵢ ∩ Cᵢ| / (|Tᵢ| + |Cᵢ|)) · |Cᵢ| ) / Σᵢ₌₁ᵏ |Cᵢ|

where Tᵢ is the i-th truth cluster and Cᵢ is the i-th result cluster.
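The metric's equation did not survive extraction cleanly; reading it as a size-weighted Dice-overlap distance (an assumption about the exact form), it can be sketched as:

```python
def cluster_error(truth, ours):
    """Size-weighted (1 - Dice) distance between paired truth clusters T_i and
    result clusters C_i. NOTE: one reading of the slide's garbled formula,
    not a verified transcription."""
    num = sum((1 - 2 * len(t & c) / (len(t) + len(c))) * len(c)
              for t, c in zip(truth, ours))
    den = sum(len(c) for c in ours)
    return num / den
```

Perfectly matching clusterings score 0; fully disjoint ones score 1.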

[Flattened table: per-cluster "Percent of Our Clusters", "Percent of Paper Cluster", "Quality", and "Min Cluster Composition" values for the Bayesian, CLICK, and SOM runs over clusters 0-34; the column alignment did not survive extraction. Bottom rows: Error = 4.22929 / 2.38409 / 1.78025 and 4.33 / 2.440 / 1.822 for Bayesian / CLICK / SOM.]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption & Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64

Bayesian Information Criterion

• After convergence, compute the global BIC:

BIC = Σᵢ₌₁ⁿ ln( Σⱼ₌₁ᵏ P(dᵢ | μⱼ ∈ μ) ) + (k/2) · ln(n)

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
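A minimal sketch of the criterion, using the standard form from the cited reference [33] (BIC = −2·ln L + k·ln n). The equal-weight, equal-variance 1-D Gaussian mixture here is an illustrative assumption, not the project's fitted model:

```python
import math

def gaussian_mixture_bic(data, means, sigma=1.0):
    """BIC = -2 ln L + k ln(n) for an equal-weight, shared-sigma 1-D Gaussian
    mixture with the given component means (illustrative model)."""
    n, k = len(data), len(means)
    log_lik = 0.0
    for x in data:
        # P(d_i) = average of the component densities P(d_i | mu_j)
        p = sum(math.exp(-(x - m) ** 2 / (2 * sigma ** 2))
                / (math.sqrt(2 * math.pi) * sigma) for m in means) / k
        log_lik += math.log(p)
    return -2 * log_lik + k * math.log(n)
```

On well-separated data, the two-component model achieves a lower (better) BIC than a single component despite its larger penalty term.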

[Figure: "Initial Clusters, E. coli Test Data, Hour 1 vs. Hour 12", plotting 30 genes (GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR) in initial clusters 1-30. Only axis labels and the legend survive extraction.]

[Figure: "E-Coli Final Clusters (Test Data), Hour 1 vs. Hour 12", the same genes grouped into Clusters 1-9.]

[Figures: Bayesian clustering results (1 Hour vs. 6 Hours) for EHEC, S. aureus, E. coli, and S. Typhimurium, showing rational vs. discrete clusters (Clusters 0-12). Only axis labels and legends survive extraction.]

Self-Organizing Maps
• SOMs are based on neural networks. They use a learning algorithm to train a random map to be like the input values; thus similar inputs will be mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← min(d(v, SOM)) over every node in SOM
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
    end
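A runnable sketch of this training loop (the grid layout, decay constants, and learning rate are illustrative assumptions following the cited references, not the project's code):

```python
import math
import random

def train_som(vectors, grid_w, grid_h, n_iters, sigma0=None, lr0=0.1):
    """Minimal SOM training loop: pick a random training vector, find the BMU,
    then pull the BMU's neighborhood toward it with a time-decaying radius
    and learning rate."""
    dim = len(vectors[0])
    som = {(x, y): [random.random() for _ in range(dim)]
           for x in range(grid_w) for y in range(grid_h)}
    sigma0 = sigma0 or max(grid_w, grid_h) / 2
    lam = n_iters / math.log(sigma0) if sigma0 > 1 else n_iters
    for t in range(n_iters):
        v = random.choice(vectors)
        # BMU: the node minimizing Euclidean distance to v
        bmu = min(som, key=lambda p: sum((a - b) ** 2 for a, b in zip(som[p], v)))
        radius = sigma0 * math.exp(-t / lam)
        lr = lr0 * math.exp(-t / n_iters)
        for p, w in som.items():
            d2 = (p[0] - bmu[0]) ** 2 + (p[1] - bmu[1]) ** 2
            if d2 <= radius ** 2:                       # inside the shrinking neighborhood
                theta = math.exp(-d2 / (2 * radius ** 2))
                som[p] = [wi + theta * lr * (vi - wi) for wi, vi in zip(w, v)]
    return som
```

After training, each input is mapped to its BMU, so similar high-dimensional inputs land near each other on the 2-D grid.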



4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 24: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

[Scatter plot: E-Coli Initial Clusters (Test Data), expression at 1 Hour vs. 12 Hours, for genes GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR.]

[Scatter plot: E-Coli Final Clusters (Test Data), expression at 1 Hour vs. 12 Hours, with the same genes grouped into Clusters 1 through 9.]

[Scatter plots (1 Hour vs. 6 Hours expression) for EHEC, S. aureus, E. coli, and S. Typhimurium, showing Clusters 0 through 12 along with the rational and discrete cluster centers.]

Self-Organizing Maps

• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values, so similar inputs are mapped close to each other.

n <- number of iterations for the training algorithm
V <- set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM <- random values
    for j <- 1 to n:
        v <- V(random)
        BMU <- min(d(v, node)) over every node in SOM
        neighbors <- neighborhood(BMU)
        neighbors.weight <- neighbors.weight + adjustment
    end

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79-96.
• Kohonen's Self-Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
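The training loop above can be sketched in Python with NumPy. The grid size, learning rate, initial radius, and the exponential decay schedules below are illustrative assumptions, not values from the project:

```python
import numpy as np

def train_som(data, grid=(3, 3), n_iter=1000, lr0=0.5, sigma0=1.5, seed=0):
    """Train a 2-D SOM on the row vectors of `data` (shape: samples x features)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    som = rng.random((rows, cols, data.shape[1]))           # random initial map
    # grid coordinates of every node, used for neighborhood distances
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1).astype(float)
    for t in range(n_iter):
        v = data[rng.integers(len(data))]                   # random training vector
        # best matching unit: node with minimum Euclidean distance to v
        d = np.linalg.norm(som - v, axis=2)
        bmu = np.unravel_index(np.argmin(d), d.shape)
        # learning rate and neighborhood radius both decay over time
        frac = t / n_iter
        lr = lr0 * np.exp(-frac)
        sigma = sigma0 * np.exp(-frac)
        # Gaussian neighborhood around the BMU on the 2-D grid
        grid_d2 = np.sum((coords - np.array(bmu, dtype=float)) ** 2, axis=2)
        h = np.exp(-grid_d2 / (2 * sigma ** 2))
        som += lr * h[..., None] * (v - som)                # pull nodes toward v
    return som
```

Each iteration pulls the best matching unit and its grid neighborhood toward a randomly chosen training vector, so nearby nodes end up representing similar inputs.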

Best Matching Unit

• The BMU is the node of the SOM whose weight vector has the minimum distance to the training vector. It is typically found using the Euclidean distance formula

$d(v, w) = \sqrt{\sum_{i=1}^{k} (v_i - w_i)^2}$

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.
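Vectorized over the whole map, finding the BMU is a single distance computation and an argmin; a sketch (the function name is ours):

```python
import numpy as np

def best_matching_unit(v, som):
    """Grid index of the SOM node whose weight vector is closest to v (Euclidean)."""
    # d(v, w) = sqrt(sum_i (v_i - w_i)^2), computed for every node at once
    d = np.sqrt(np.sum((som - v) ** 2, axis=-1))
    return np.unravel_index(np.argmin(d), d.shape)
```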

Neighborhood

• The area of the neighborhood shrinks over time. The exponential decay function can be used for this:

$\sigma(t) = \sigma_0 \, e^{-t/\lambda}$

where n is the number of iterations that the algorithm will run, $\sigma_0$ is the initial size of the neighborhood, and $\lambda$ is a time constant derived from n (e.g., $\lambda = n / \ln \sigma_0$).

Weight Adjustment

• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node.

• One decay function decreases the learning rate over time.

• The other decay function decreases the rate of learning for neighbors farther from the BMU.
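A minimal sketch of the two decays and the resulting update, assuming the common exponential forms (the exact decay constants vary in practice, and the names below are ours):

```python
import numpy as np

def neighborhood_radius(t, n, sigma0):
    """Exponentially shrinking neighborhood radius at iteration t of n total."""
    lam = n / np.log(sigma0)            # time constant so the radius reaches ~1 at t = n
    return sigma0 * np.exp(-t / lam)

def adjust_weight(w, v, t, n, dist2_to_bmu, lr0=0.5, sigma0=2.0):
    """One weight update: two decay factors times (training vector - node weight)."""
    lr = lr0 * np.exp(-t / n)                               # decays with time
    sigma = neighborhood_radius(t, n, sigma0)
    theta = np.exp(-dist2_to_bmu / (2 * sigma ** 2))        # decays with grid distance
    return w + lr * theta * (v - w)
```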

Mapping Mode (SOM)

• After the learning algorithm has been run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs end up near each other on the two-dimensional map.

GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern

• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools.

• Use online or download.

GenePattern
Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0. Summary:

The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, i.e., stocks, mutual funds, spectral peaks, etc.).

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged the input data into 9 clusters using the GenePattern software.

• Why 9 clusters?
  – Used the average of Stephen's output (8) + 1 so as to use a square map (3 x 3 SOM).

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize Graph G = (V, E):
  – Vertex v: a single gene "fingerprint" vector
  – Edge e: pairwise similarity between genes
  – Cut C: subset of E that partitions the graph
  – Cluster c: subset of V
  – Intersection of clusters c_i, c_j (i ≠ j) is ∅
  – Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.

CLICK Preprocessing

• Input: n x p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure:
  – S_vu = v · u = |v||u| cos θ
  – Proportional only when the norm is fixed for all v ∈ V
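A sketch of this preprocessing under the stated assumption (standardize each gene's fingerprint so every vector has a fixed norm, then take dot-product similarities; the function names are ours):

```python
import numpy as np

def normalize_fingerprints(M):
    """Standardize each gene's row vector to mean 0 and unit variance."""
    mu = M.mean(axis=1, keepdims=True)
    sd = M.std(axis=1, keepdims=True)
    return (M - mu) / sd

def similarity(M):
    """Pairwise dot-product similarities: S[i, j] = v_i . v_j."""
    return M @ M.T
```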

CLICK Similarity

• Key idea: S follows a normalized mixed distribution:
  – For $u, v \in V$ in the same cluster: $S_{uv} \sim f(x \mid \mu_T, \sigma_T^2)$, a normal pdf with mean $\mu_T$ and variance $\sigma_T^2$.
  – For $u, v \in V$ in different clusters: $S_{uv} \sim f(x \mid \mu_F, \sigma_F^2)$.

Basic CLICK

• M: n x p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates

$$f(S_{ij} \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(S_{ij} - \mu)^2 / 2\sigma^2}$$

$$w_{ij} = \ln\frac{p_{\text{mates}}\, f(S_{ij} \mid \mu_T, \sigma_T^2)}{(1 - p_{\text{mates}})\, f(S_{ij} \mid \mu_F, \sigma_F^2)} = \ln\frac{p_{\text{mates}}\, \sigma_F}{(1 - p_{\text{mates}})\, \sigma_T} + \frac{(S_{ij} - \mu_F)^2}{2\sigma_F^2} - \frac{(S_{ij} - \mu_T)^2}{2\sigma_T^2}$$
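The mate weight can be computed directly from the two fitted normal densities. A sketch (in practice the parameter values would come from fitting the mixture to the observed similarity distribution; the function names are ours):

```python
import math

def normal_pdf(x, mu, sigma):
    """f(x | mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def edge_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """Log-odds that genes i and j are mates, given their similarity s_ij."""
    return math.log((p_mates * normal_pdf(s_ij, mu_t, sigma_t)) /
                    ((1 - p_mates) * normal_pdf(s_ij, mu_f, sigma_f)))
```

High similarities near the mates mean get positive weights; similarities near the non-mates mean get negative weights, which is what the min-cut step exploits.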

Basic CLICK

R: singleton set

Basic_CLICK(Graph G):
    If V(G) = {v} then          // v is a singleton
        R.add(v)
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)

CLICK Kernel

• Decision problem: is V…
  – a singleton (|V| = 1)?
  – a subset of 2+ clusters (needs further partitioning)?
  – a subset of a single cluster (a kernel)?

CLICK Kernel

• For all possible cuts C connecting V:
  – $H_0^C$: cut C disconnects two clusters
  – $H_1^C$: cut C partitions a kernel
  – If $H_0^C$ is more likely than $H_1^C$ for any C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

$$W(C) = \ln\frac{\Pr(H_1^C \mid C)}{\Pr(H_0^C \mid C)} = \sum_{(i,j) \in C} w_{ij}$$

Minimal Weight Cut

• Choose one vertex as the source; mark it visited
• Repeatedly mark the next most highly connected vertex
• The last node t represents the cut:
  – $W(C_t) = \sum_i w_{it}$
  – Merge t with the second-to-last marked node s
  – Remove t from V
  – Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
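The procedure above is the Stoer-Wagner algorithm. A compact adjacency-matrix sketch (function and variable names are ours):

```python
def min_weight_cut(w):
    """Stoer-Wagner global minimum cut on a symmetric weight matrix (list of lists).
    Returns (best_cut_weight, original vertices on one side of that cut)."""
    n = len(w)
    w = [row[:] for row in w]               # working copy; rows get merged
    groups = [[i] for i in range(n)]        # original vertices merged into each node
    active = list(range(n))
    best = (float("inf"), [])
    while len(active) > 1:
        # one minimum-cut phase: grow a set by the most tightly connected vertex
        a = [active[0]]
        rest = active[1:]
        conn = {v: w[a[0]][v] for v in rest}
        while rest:
            t = max(rest, key=lambda v: conn[v])    # most connected to the grown set
            rest.remove(t)
            a.append(t)
            for v in rest:
                conn[v] += w[t][v]
        s, t = a[-2], a[-1]
        # cut-of-the-phase separates the last-added node t from everything else
        cut_of_phase = sum(w[t][v] for v in active if v != t)
        if cut_of_phase < best[0]:
            best = (cut_of_phase, groups[t][:])
        # merge t into s and drop t
        for v in active:
            if v in (s, t):
                continue
            w[s][v] += w[t][v]
            w[v][s] = w[s][v]
        groups[s].extend(groups[t])
        active.remove(t)
    return best
```

The minimum over all cuts-of-the-phase is the global minimum cut, which CLICK uses to split a non-kernel component into the sub-graphs H and K.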

MinWeightCut Example

MinWeightCut

CLICK Adoption & Merging

• Basic_CLICK produces kernels, not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
  – In both situations, only merge/adopt above some similarity threshold

Full CLICK

R: singleton set

Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results & Analysis

• Two metrics:
  – Similarity between our result clusters
  – Similarity of our clusters with the paper's results
• Map clusters to find regions of similar data
• Compare the best clusters to find relationships

[Scatter plots comparing the Bayesian, CLICK, and SOM clusterings (1-Hour vs. 6-Hours expression) for each condition: BCG, E. coli, EHEC, L. monocytogenes, latex beads, M. tuberculosis, S. aureus, S. Typhi, and S. Typhimurium.]

Error Metric

• Scores higher for clusters that contain either very many or very few of the clusters in the truth set
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated with ours
• High error implies the truth clusters were spread across ours

Error: Bayesian 4.33, CLICK 2.44, SOM 18.22

Over the k matched pairs of truth cluster $T_i$ and computed cluster $C_i$, the error takes the form

$$E = \sum_{i=1}^{k}\left(1 - \frac{|T_i \cap C_i|}{|T_i|}\right)\frac{|C_i|}{\sum_{j=1}^{k}|C_j|}$$
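One plausible reading of the slide's metric (the exact exponents are an assumption): a per-cluster miss fraction against the truth cluster, weighted by each computed cluster's share of the data. A sketch:

```python
def cluster_error(truth, computed):
    """Size-weighted mismatch between matched truth/computed cluster pairs.
    `truth` and `computed` are parallel lists of sets of gene ids."""
    total = sum(len(c) for c in computed)
    err = 0.0
    for t, c in zip(truth, computed):
        miss = 1.0 - len(t & c) / len(t)      # fraction of the truth cluster missed
        err += miss * len(c) / total          # weighted by this cluster's share
    return err
```

A perfect clustering scores 0; a clustering whose overlap with the truth set is empty scores 1.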

[Comparison table over 35 clusters: for each method (Bayesian, CLICK, SOM), the fraction of our clusters matched, the fraction of the paper's clusters matched, and a cluster-quality score, alongside the minimum cluster composition at match thresholds from 0% to 100%. Overall error: Bayesian 42.29 (4.33), CLICK 23.84 (2.44), SOM 17.80 (18.22).]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption & Merging
  • Full CLICK
  • Results & Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 25: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Hour 1 vs Hour 12

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5 6

GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9

E-Coli Final Clusters (Test Data)

1 Hour

12 H

ours

EHEC

-6

-4

-2

0

2

4

6

8

-6 -4 -2 0 2 4 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters

S Aureus

-6

-4

-2

0

2

4

6

8

-5 -4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

oura

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters

Ecoli

-4

-3

-2

-1

0

1

2

3

4

5

6

-2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters

S Typhirium

-4

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters

Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a

random map to be like the input values Thus similar inputs will be mapped close to each other

nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)

learning_algorith(n V)SOM larr random valuesfor j larr 1 to n

v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment

end

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 26: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

[Scatter plot: EHEC – 1 Hour vs. 6 Hours expression; Clusters 0–7 and 9, rational vs. discrete clusters]

[Scatter plot: S. aureus – 1 Hour vs. 6 Hours expression; Clusters 0–7, rational vs. discrete clusters]

[Scatter plot: E. coli – 1 Hour vs. 6 Hours expression; Clusters 0–7, rational vs. discrete clusters]

[Scatter plot: S. Typhimurium – 1 Hour vs. 6 Hours expression; Clusters 0–7, 10, and 12, rational vs. discrete clusters]

Self-Organizing Maps

• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values, so similar inputs are mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← min(d(v, SOM)) over every node in SOM
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
    end

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79–96.
• Kohonen's Self-Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html

Best Matching Unit

• The BMU is the node at minimum distance from the training vector among all the nodes in the SOM. It is typically found using the Euclidean distance formula:

d(v, w) = √( Σ_{i=1..k} (v_i − w_i)² )

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.
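As a concrete illustration, the BMU search can be sketched in a few lines of NumPy (the grid layout and the names `som` and `v` are our own, not from the slides):

```python
import numpy as np

def best_matching_unit(som, v):
    """Index of the SOM node whose weight vector is nearest to the
    training vector v under Euclidean distance."""
    # som: (rows, cols, k) grid of weight vectors; v: length-k vector
    dists = np.linalg.norm(som - v, axis=-1)  # distance to every node at once
    return np.unravel_index(np.argmin(dists), dists.shape)

# A 2x2 grid of 3-dimensional weight vectors
som = np.array([[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
                [[2.0, 2.0, 2.0], [5.0, 5.0, 5.0]]])
best_matching_unit(som, np.array([0.9, 1.1, 1.0]))  # -> (0, 1)
```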


Neighborhood

• The area of the neighborhood shrinks over time; an exponential decay function can be used for this:

σ(t) = σ₀ · e^(−t/λ), with λ = n / ln(σ₀)

where n is the number of iterations the algorithm will run and σ₀ is the initial size of the neighborhood.
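A minimal sketch of this shrinking radius, assuming the λ = n / ln(σ₀) time constant used by the cited AI-Junkie tutorial:

```python
import math

def neighborhood_radius(t, n, sigma0):
    """Neighborhood radius at iteration t, decaying exponentially from
    sigma0 so that it reaches roughly one node by iteration n."""
    lam = n / math.log(sigma0)  # time constant of the decay
    return sigma0 * math.exp(-t / lam)

neighborhood_radius(0, 1000, 4.0)  # -> 4.0 (the full initial radius)
```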


Weight Adjustment

• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node:

W(t+1) = W(t) + Θ(t) · L(t) · (V(t) − W(t))

• One decay function, L(t), decreases the learning rate over time.
• The other, Θ(t), decreases the rate of learning for neighbors further away from the BMU.
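A sketch of one such update; the exponential forms and the `lr0` starting rate are our assumptions for illustration, not fixed by the slides:

```python
import math

def updated_weight(w, v, t, n, dist_to_bmu, sigma0, lr0=0.1):
    """One SOM weight update: (v - w) scaled by two decay factors."""
    lam = n / math.log(sigma0)
    sigma = sigma0 * math.exp(-t / lam)                  # shrinking radius
    theta = math.exp(-dist_to_bmu**2 / (2 * sigma**2))   # fades with distance from BMU
    lr = lr0 * math.exp(-t / lam)                        # learning rate fades with time
    return [wi + theta * lr * (vi - wi) for wi, vi in zip(w, v)]
```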


Mapping Mode (SOM)

• After the learning algorithm has been run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs will lie near each other on the two-dimensional map.
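Putting the pieces together, a minimal end-to-end sketch of training followed by mapping mode (grid size, iteration count, and decay parameters are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, rows=3, cols=3, n=500, sigma0=2.0, lr0=0.5):
    """Train a rows x cols SOM on data (m x k) with the loop from the
    earlier pseudocode: pick a random vector, find its BMU, and pull the
    BMU's neighborhood toward it with decaying strength."""
    k = data.shape[1]
    som = rng.random((rows, cols, k))                    # random initial map
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    lam = n / np.log(sigma0)
    for t in range(n):
        v = data[rng.integers(len(data))]                # random learning vector
        bmu = np.unravel_index(np.argmin(np.linalg.norm(som - v, axis=-1)),
                               (rows, cols))
        sigma = sigma0 * np.exp(-t / lam)
        theta = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=-1)
                       / (2 * sigma ** 2))               # neighborhood influence
        lr = lr0 * np.exp(-t / lam)
        som += (lr * theta)[..., None] * (v - som)       # pull nodes toward v
    return som

def map_inputs(som, data):
    """Mapping mode: place each input at its BMU cell on the 2-D grid."""
    return [tuple(np.unravel_index(
                np.argmin(np.linalg.norm(som - v, axis=-1)),
                som.shape[:2]))
            for v in data]
```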

GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePattern

Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0.

Summary: The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 × 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g., stocks, mutual funds, spectral peaks, etc.).

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged the input data into 9 clusters using the GenePattern software

• Why 9 clusters?
  – Used the average of Stephen's output (8) + 1 so a square map (3×3 SOM) could be used

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize graph G = (V, E)
  – Vertex v: a single gene "fingerprint" vector
  – Edge e: pairwise similarity between two genes
  – Cut C: a subset of E that partitions the graph
  – Cluster c: a subset of V
  – The intersection of clusters c_i, c_j (i ≠ j) is ∅
  – Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology (ISMB), 2000.

CLICK Preprocessing

• Input: an n × p matrix M of values

• n genes, p tests

• The data must be normalized

• Similarity measure:
  – S_vu = v · u = |v||u| cos θ
  – Proportional to cos θ only when the norm is fixed for all v ∈ V
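The normalization requirement can be made concrete: if each fingerprint is shifted to zero mean and scaled to unit norm, the pairwise dot product is exactly cos θ. A minimal sketch (the function names are ours, and mean-centering in addition to scaling is our assumption here):

```python
import numpy as np

def normalize_fingerprints(M):
    """Shift each gene's fingerprint to zero mean and unit norm, so the
    dot product of two rows equals cos(theta) between fingerprints."""
    M = M - M.mean(axis=1, keepdims=True)
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def similarity_matrix(M):
    N = normalize_fingerprints(M)
    return N @ N.T  # S[i, j] = cosine of the angle between genes i and j

S = similarity_matrix(np.array([[1.0, 2.0, 3.0],
                                [2.0, 4.0, 6.0],
                                [3.0, 2.0, 1.0]]))
# S[0, 1] ≈ 1.0: gene 1 is a scaled copy of gene 0
```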

CLICK Similarity

• Key idea: S is a normalized mixed distribution
  – For u, v ∈ V in the same cluster: S_uv ~ f(x | μ_T, σ_T²), a normal pdf
  – For u, v ∈ V in different clusters: S_uv ~ f(x | μ_F, σ_F²), a normal pdf
  (μ: mean, σ²: variance)

Basic CLICK

• M: n × p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates

f(S_ij | μ, σ) = (1 / (√(2π) σ)) · e^( −(S_ij − μ)² / (2σ²) )

w_ij = ln[ p_mates · f(S_ij | μ_T, σ_T) / ((1 − p_mates) · f(S_ij | μ_F, σ_F)) ]
     = ln( p_mates σ_F / ((1 − p_mates) σ_T) ) + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
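Under this mixture model, the edge weight reduces to the closed form above; a sketch with illustrative parameter values (in practice the distribution parameters would be estimated from the data):

```python
import math

def edge_weight(S_ij, p, mu_T, sig_T, mu_F, sig_F):
    """CLICK edge weight: log-ratio of the mate vs. non-mate likelihoods
    of observing similarity S_ij (parameter values here are illustrative)."""
    return (math.log(p * sig_F / ((1 - p) * sig_T))
            + (S_ij - mu_F) ** 2 / (2 * sig_F ** 2)
            - (S_ij - mu_T) ** 2 / (2 * sig_T ** 2))

# High similarity -> positive weight (likely mates); low -> negative
edge_weight(0.9, p=0.2, mu_T=0.8, sig_T=0.3, mu_F=0.0, sig_F=0.5)
```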

Basic CLICK

R: singleton set
Basic_CLICK(Graph G):
    if V(G) = {v} then      // v is a singleton
        R.add(v)
    else if G is a kernel then
        Output(V(G))
    else
        (H, K) ← MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
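The recursion above can be sketched directly; `is_kernel`, `min_weight_cut`, and `output` are caller-supplied hypothetical helpers, not part of the slides:

```python
def basic_click(G, R, is_kernel, min_weight_cut, output):
    """Recursive skeleton of Basic-CLICK: peel off singletons, emit
    kernels, otherwise split on the minimum-weight cut and recurse.
    G is a set of vertices; R collects singletons."""
    if len(G) == 1:                  # a lone vertex joins the singleton set R
        R.add(next(iter(G)))
    elif is_kernel(G):
        output(G)                    # G's vertices form a kernel
    else:
        H, K = min_weight_cut(G)     # partition and recurse on both sides
        basic_click(H, R, is_kernel, min_weight_cut, output)
        basic_click(K, R, is_kernel, min_weight_cut, output)
```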

CLICK Kernel

• Decision problem: is V…
  – a singleton (|V| = 1)?
  – a subset of 2+ clusters (need to partition further)?
  – a subset of a single cluster (a kernel)?

CLICK Kernel

• For all possible cuts C connecting V:
  – H₀^C: cut C disconnects two clusters
  – H₁^C: cut C partitions a kernel
  – If Pr(C | H₀^C) > Pr(C | H₁^C) for some C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

W(C) = ln( Pr(C | H₁^C) / Pr(C | H₀^C) ) = Σ_{(i,j) ∈ C} w_ij

Minimal Weight Cut

• Choose one vertex as the source; mark it visited

• Repeatedly mark the next most highly connected vertex

• The last marked node t defines the cut
  – W(C_t) = Σ_i w_it

  – Merge t with the second-to-last marked node s

  – Remove t from V

  – Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
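The "mark the most connected vertex, then merge the last two" procedure above is the Stoer–Wagner algorithm; a compact sketch on a dense adjacency matrix (the matrix layout and return shape are our own choices):

```python
def min_weight_cut(w):
    """Stoer-Wagner global minimum cut on a symmetric, non-negative
    adjacency matrix w. Returns (cut_weight, vertices_on_one_side)."""
    n = len(w)
    w = [row[:] for row in w]            # working copy; rows get merged
    groups = [[i] for i in range(n)]     # original vertices each node holds
    active = list(range(n))
    best = (float("inf"), None)
    while len(active) > 1:
        # maximum-adjacency ordering from an arbitrary start vertex
        a = [active[0]]
        conn = {v: w[active[0]][v] for v in active[1:]}
        while conn:
            t = max(conn, key=conn.get)  # most connected unmarked vertex
            del conn[t]
            a.append(t)
            for v in conn:
                conn[v] += w[t][v]
        s, t = a[-2], a[-1]
        cut_of_phase = sum(w[t][v] for v in active if v != t)
        if cut_of_phase < best[0]:
            best = (cut_of_phase, list(groups[t]))
        groups[s].extend(groups[t])      # merge t into s, then drop t
        for v in active:
            w[s][v] += w[t][v]
            w[v][s] = w[s][v]
        active.remove(t)
    return best

# Two triangles joined by a single light edge: the min cut severs it
W = [[0, 3, 3, 1, 0, 0],
     [3, 0, 3, 0, 0, 0],
     [3, 3, 0, 0, 0, 0],
     [1, 0, 0, 0, 3, 3],
     [0, 0, 0, 3, 0, 3],
     [0, 0, 0, 3, 3, 0]]
```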

MinWeightCut Example

MinWeightCut

CLICK Adoption & Merging

• Basic_CLICK produces kernels – not full clusters

• Expand kernels by adopting the closest singletons

• Merge kernels with the closest similarity
  – In both cases, only merge/adopt above some similarity threshold

Full CLICK

R: singleton set
Full_CLICK(Graph G = (V, E)):
    R ← V
    while |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        let L be the list of kernels produced
        let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
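The outer loop can be sketched as follows; all four callables are hypothetical stand-ins supplied by the caller, and the stopping test mirrors "while |R| is reduced":

```python
def full_click(G, basic_click, adopt, merge):
    """Skeleton of Full-CLICK: repeatedly cluster the remaining
    singletons, adopting close singletons into kernels, until the
    singleton set R stops shrinking; finally merge and adopt once more."""
    R = set(G)
    kernels = []
    while True:
        prev = len(R)
        new_kernels, R = basic_click(R)  # cluster the current singletons
        kernels.extend(new_kernels)
        adopt(kernels, R)                # attach close singletons to kernels
        if len(R) >= prev:               # |R| no longer reduced: stop
            break
    merge(kernels)
    adopt(kernels, R)
    return kernels, R
```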

Results & Analysis

• Two metrics:
  – Similarity between our result clusters
  – Similarity of our clusters to the paper's results

• Map clusters to find regions of similar data

• Compare the best clusters to find relationships

[Scatter plot: BCG – 1 Hour vs. 6 Hours expression; Bayesian, Click, and SOM clusters]

[Scatter plot: E. coli – 1 Hour vs. 6 Hours expression; Bayesian, Click, and SOM clusters]

[Scatter plot: EHEC – 1 Hour vs. 6 Hours expression; Bayesian, Click, and SOM clusters]

[Scatter plot: L. monocytogenes – 1 Hour vs. 6 Hours expression; Bayesian, Click, and SOM clusters]

[Scatter plot: Latex – 1 Hour vs. 6 Hours expression; Bayesian, Click, and SOM clusters]

[Scatter plot: M. tuberculosis – 1 Hour vs. 6 Hours expression; Bayesian, Click, and SOM clusters]

[Scatter plot: S. aureus – 1 Hour vs. 6 Hours expression; Bayesian, Click, and SOM clusters]

[Scatter plot: S. Typhi – 1 Hour vs. 6 Hours expression; Bayesian, Click, and SOM clusters]

[Scatter plot: S. Typhimurium – 1 Hour vs. 6 Hours expression; Bayesian, Click, and SOM clusters]

Error Metric

• Scores higher for clusters that contain either very many or very few clusters from the truth set

• Scales with the number of data points in each cluster

• Low error implies the truth clusters were highly correlated

• High error implies the truth clusters were widely distributed

Error: Bayesian 4.33, Click 2.44, SOM 1.82

E = Σ_{i=1..k} [ (1 − |1 − 2·|T_i ∩ C_i| / |C_i||) · |T_i ∩ C_i| ] / Σ_{i=1..k} |C_i|

(T_i: the truth cluster matched to our cluster C_i)

[Table: per-cluster comparison of the three methods – percent of our clusters, percent of the paper's clusters, and cluster quality (Bayesian, Click, SOM) for clusters 0–34, alongside minimum cluster composition at thresholds from 5% to 100%. Error rows: 4.23 / 2.38 / 1.78 and 4.33 / 2.44 / 1.82 (Bayesian / Click / SOM).]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption & Merging
  • Full CLICK
  • Results & Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 28: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Ecoli

-4

-3

-2

-1

0

1

2

3

4

5

6

-2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters

S Typhirium

-4

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters

Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a

random map to be like the input values Thus similar inputs will be mapped close to each other

nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)

learning_algorith(n V)SOM larr random valuesfor j larr 1 to n

v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment

end

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption & Merging

• Basic_CLICK produces kernels, not full clusters

• Expand kernels by adopting the closest singletons

• Merge kernels with the closest similarity
  – In both cases, only merge/adopt above some threshold

Full CLICK

R: singleton set
Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results & Analysis

• Two metrics:
  – similarity between our result clusters
  – similarity of our clusters to the paper's results

• Map clusters to find regions of similar data

• Compare the best clusters to find relationships

[Scatter plots, one panel per stimulus: BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. typhi, and S. typhimurium. Each plots expression at 1 hour (x-axis) against 6 hours (y-axis) for the Bayesian, CLICK, and SOM clusterings.]

Error Metric

• Scores higher for clusters that contain either very many or very few of the clusters in the truth set

• Scales with the number of data points in each cluster

• Low error implies the truth clusters were highly correlated

• High error implies the truth clusters were well distributed

Error: Bayesian 4.33, CLICK 2.44, SOM 18.22

E = (1 / Σ_{i=1}^{k} |C_i|) · Σ_{i=1}^{k} [ (1 − |1 − 2·|T_i ∩ C_i| / |C_i||) · |C_i| ]

where C_1, …, C_k are the result clusters and T_i is the truth cluster matched to C_i.
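As an illustration of a size-weighted overlap error in this spirit, assuming each result cluster C is scored by 1 − |1 − 2·|T∩C|/|C|| against its best-matching truth cluster (our reading of the slide, not necessarily the authors' exact metric):

```python
def overlap_error(truth, clusters):
    """Size-weighted overlap error, illustrative reconstruction.
    The per-cluster term is near 0 when a result cluster sits almost
    entirely inside (or outside) a truth cluster, and near 1 when it
    straddles one; terms are weighted by cluster size."""
    total = sum(len(c) for c in clusters)
    err = 0.0
    for c in clusters:
        best = max(len(c & t) for t in truth)  # best-matching truth cluster
        frac = best / len(c)
        err += (1 - abs(1 - 2 * frac)) * len(c)
    return err / total

truth = [{1, 2, 3, 4}, {5, 6, 7, 8}]
perfect = [{1, 2, 3, 4}, {5, 6, 7, 8}]
mixed = [{1, 2, 5, 6}, {3, 4, 7, 8}]
e0 = overlap_error(truth, perfect)   # clean match -> error 0
e1 = overlap_error(truth, mixed)     # every cluster straddles -> error 1
```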

[Flattened results table, not fully recoverable from the extraction. For each cluster (0–34) and each method (Bayesian, CLICK, SOM) it lists: percent of our clusters, percent of the paper's clusters, cluster quality, and minimum cluster composition at cutoffs from 0% to 100%. The final rows repeat the overall error scores for the three methods.]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption & Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64

S. Typhimurium (SOM)

[Scatter plot of expression at 1 hour (x-axis) vs. 6 hours (y-axis), with points grouped into SOM Clusters 0–7, 10, and 12; the legend distinguishes rational from discrete clusters.]

Self-Organizing Maps

• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values; thus, similar inputs are mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n
        v ← V(random)
        BMU ← min(d(v, SOM)) over every node in SOM
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
    end

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79-96.
• Kohonen's Self-Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html

Best Matching Unit

• The BMU is the node at minimum distance from the training vector among all the nodes in the SOM. It is typically found using the Euclidean distance formula:

d(v, w) = √( Σ_{i=1}^{k} (v_i − w_i)² )

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.

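The BMU search can be sketched as follows, assuming the SOM is represented as a flat list of weight vectors (a toy layout, not GenePattern's representation):

```python
import math

def euclidean(v, w):
    """d(v, w) = sqrt( sum_i (v_i - w_i)^2 ) over the vector components."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def best_matching_unit(v, som):
    """Index of the SOM node whose weight vector is closest to the
    training vector v."""
    return min(range(len(som)), key=lambda i: euclidean(v, som[i]))

som = [[0.0, 0.0], [1.0, 1.0], [0.9, 0.1]]
bmu = best_matching_unit([1.0, 0.0], som)  # node 2 is nearest
```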

Neighborhood

• The area of the neighborhood shrinks over time; the exponential decay function can be used for this:

σ(t) = σ₀ · exp(−t / λ)

where λ is a time constant derived from n, the number of iterations the algorithm will run, and σ₀ is the initial size of the neighborhood.


Weight Adjustment

• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node:

w(t+1) = w(t) + θ(t) · L(t) · (v(t) − w(t))

• One decay function, L(t), decreases the learning rate over time

• The other, θ(t), decreases the rate of learning for neighbors further away from the BMU

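Putting the pieces together (random initialization, BMU search, shrinking neighborhood, decaying learning rate), a minimal SOM training loop might look like this; all parameter names and defaults are illustrative, not GenePattern's:

```python
import math
import random

def train_som(data, grid, n_iters, sigma0=1.5, lr0=0.5, seed=0):
    """Train a 2-D SOM on `data`. `grid` maps node index -> (row, col)
    position on the map; node weights start random and are pulled toward
    sampled inputs. The neighborhood radius and the learning rate both
    decay exponentially, as described on the slides."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in grid]
    lam = n_iters / math.log(sigma0)          # time constant for the decay
    for t in range(n_iters):
        v = rng.choice(data)
        # Best matching unit: node at minimum Euclidean distance to v.
        bmu = min(range(len(grid)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(v, weights[i])))
        sigma = sigma0 * math.exp(-t / lam)   # shrinking neighborhood
        lr = lr0 * math.exp(-t / lam)         # decaying learning rate
        for i, pos in enumerate(grid):
            d2 = (pos[0] - grid[bmu][0]) ** 2 + (pos[1] - grid[bmu][1]) ** 2
            if d2 <= sigma ** 2:              # inside the neighborhood
                theta = math.exp(-d2 / (2 * sigma ** 2))
                weights[i] = [w + theta * lr * (a - w)
                              for w, a in zip(weights[i], v)]
    return weights

grid = [(r, c) for r in range(3) for c in range(3)]   # 3x3 map, as used below
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
weights = train_som(data, grid, n_iters=200)
```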

Mapping Mode (SOM)

• After the learning algorithm is run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs will lie near each other on the two-dimensional map.
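Mapping mode amounts to assigning every input vector its BMU on the trained map; a minimal sketch, assuming the trained map is a flat list of weight vectors:

```python
def map_inputs(data, weights):
    """Mapping mode: assign every input vector to its best matching
    unit on a trained SOM, so similar inputs land on nearby nodes."""
    def bmu(v):
        return min(range(len(weights)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(v, weights[i])))
    return [bmu(v) for v in data]

trained = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]   # toy trained 1x3 map
assignments = map_inputs([[0.1, 0.0], [0.9, 1.0]], trained)  # -> [0, 2]
```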

GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern

• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

• Use it online or download it

GenePattern

Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0.

Summary: The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 × 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, i.e., stocks, mutual funds, spectral peaks, etc.).

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged the input data into 9 clusters using the GenePattern software

• Why 9 clusters?
  – Used the average of Stephen's output (8) + 1 so as to use a square map (3×3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize Graph G = (V, E)
  – Vertex v: a single gene "fingerprint" vector
  – Edge e: pairwise similarity between genes
  – Cut C: subset of E that partitions the graph
  – Cluster c: subset of V
  – The intersection of clusters c_i, c_j (i ≠ j) is ∅
  – Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 30: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a

random map to be like the input values Thus similar inputs will be mapped close to each other

nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)

learning_algorith(n V)SOM larr random valuesfor j larr 1 to n

v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment

end

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 31: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Best Matching Unit

bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula

Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

• http://my.fit.edu/~sellings/finalClusters.zip

• Arranged input data into 9 clusters using the GenePattern software

• Why 9 clusters?
  – Used the average of Stephen's output (8) + 1 to use a square map (3×3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize Graph G = (V, E)
  – Vertex v: a single gene's "fingerprint" vector

  – Edge e: pairwise similarity between genes

  – Cut C: a subset of E that partitions the graph

  – Cluster c: a subset of V

  – The intersection of clusters c_i, c_j (i ≠ j) is ∅

  – Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of the Intelligent Systems for Molecular Biology, 2000.

CLICK Preprocessing

• Input: n × p matrix M of values

• n: Genes, p: Tests

• Data must be normalized

• Similarity measure:
  – S_vu = v·u = |v||u| cos θ
  – Proportional to cos θ only when the norm is fixed for all v ∈ V
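A minimal sketch of this preprocessing step (assuming NumPy; not the project's code): normalizing every fingerprint to unit norm fixes |v| for all v ∈ V, so the dot product becomes exactly the cosine similarity.

```python
import numpy as np

def similarity_matrix(M):
    """Pairwise similarities for CLICK-style preprocessing (sketch).

    Each row of M is one gene's fingerprint; rows are scaled to unit
    norm so that S[i, j] = v_i . v_j = cos(theta_ij)."""
    M = np.asarray(M, dtype=float)
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    V = M / norms              # fixed (unit) norm for every fingerprint
    return V @ V.T             # S[i, j] = cosine of the angle between genes i, j
```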

CLICK Similarity

• Key idea: S follows a normalized mixture distribution:

For u, v ∈ V in the same cluster: S_uv ~ f(x | μ_T, σ_T²)  (pdf f_T)

For u, v ∈ V in different clusters: S_uv ~ f(x | μ_F, σ_F²)  (pdf f_F)

(μ: mean, σ²: variance of each component)

Basic CLICK

• M: n × p matrix (genes vs. test conditions)

• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates

f(S_ij | μ, σ) = (1 / √(2πσ²)) · e^(−(S_ij − μ)² / (2σ²))

w_ij = ln [ p_mates · f(S_ij | μ_T, σ_T) / ((1 − p_mates) · f(S_ij | μ_F, σ_F)) ]

     = ln ( (p_mates · σ_F) / ((1 − p_mates) · σ_T) ) + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
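The mate weight w_ij, the log ratio of the two Gaussian likelihoods weighted by the prior mate probability, can be sketched directly. This is an illustrative sketch with made-up parameter values, not the project's code.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Normal density f(x | mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def mate_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """w_ij = ln[ p * f(S_ij | mu_T, sigma_T) / ((1 - p) * f(S_ij | mu_F, sigma_F)) ].

    Positive weights favor 'mates' (same cluster); negative weights favor
    'non-mates' (different clusters)."""
    return math.log(p_mates * gauss_pdf(s_ij, mu_t, sigma_t)
                    / ((1 - p_mates) * gauss_pdf(s_ij, mu_f, sigma_f)))
```

Expanding the logarithm gives the closed form on the slide: ln(p·σ_F / ((1−p)·σ_T)) plus the difference of the two quadratic terms.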

Basic CLICK

R: Singleton Set
Basic_CLICK(Graph G):
    If V(G) = {v} then        // v is a singleton
        R.add(v)
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)

CLICK Kernel

• Decision problem: is V…
  – a singleton (|V| = 1)?

  – a subset of 2+ clusters (need to partition more)?

  – a subset of a single cluster (a kernel)?

CLICK Kernel

• For all possible cuts C connecting V:
  – H_0^C: Cut C disconnects two clusters

  – H_1^C: Cut C partitions a kernel

  – If Pr(H_0^C) > Pr(H_1^C) for any C, then V is not a kernel

• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

W(C) = ln ( Pr(H_1^C | C) / Pr(H_0^C | C) ) = Σ_(i,j)∈C w_ij

Minimal Weight Cut

• Choose one vertex as the source, mark it visited

• Mark each next most highly connected vertex

• The last node t represents the cut
  – W(C_t) = Σ_i w_it

  – Merge t with the 2nd-to-last marked node s

  – Remove t from V

  – Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
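The marking-and-merging procedure above is Stoer and Wagner's minimum-cut phase. A compact sketch on an adjacency matrix (illustrative only, not the project's implementation):

```python
def stoer_wagner(w):
    """Global minimum cut of an undirected weighted graph (Stoer-Wagner sketch).

    w: symmetric adjacency matrix (list of lists of non-negative weights).
    Returns the total weight of the lightest cut."""
    w = [row[:] for row in w]          # work on a copy; vertices get merged
    vertices = list(range(len(w)))
    best = float("inf")
    while len(vertices) > 1:
        # Minimum-cut phase: start from one source vertex, repeatedly mark
        # the remaining vertex most tightly connected to the marked set.
        a = [vertices[0]]
        rest = vertices[1:]
        conn = {v: w[a[0]][v] for v in rest}
        while rest:
            t = max(rest, key=conn.get)
            rest.remove(t)
            a.append(t)
            for v in rest:
                conn[v] += w[t][v]
        s, t = a[-2], a[-1]
        # The "cut of the phase" separates the last-marked vertex t from the rest.
        best = min(best, sum(w[t][v] for v in vertices if v != t))
        # Merge t into the 2nd-to-last marked vertex s, then drop t from V.
        for v in vertices:
            if v not in (s, t):
                w[s][v] += w[t][v]
                w[v][s] = w[s][v]
        vertices.remove(t)
    return best
```

Each phase shrinks the graph by one vertex, so the loop terminates after |V| − 1 phases and the best cut-of-the-phase is the global minimum cut.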

MinWeightCut Example

MinWeightCut

CLICK Adoption & Merging

• Basic_CLICK produces kernels – not full clusters

• Expand kernels by adopting the closest singletons

• Merge kernels with the closest similarity
  – In both situations, only merge/adopt over some threshold

Full CLICK

R: Singleton Set
Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results & Analysis

• Two metrics:
  – Similarity between result clusters

  – Similarity of clusters with the paper's results

• Map clusters to find regions of similar data

• Compare best clusters to find relationships

[Nine scatter plots, one per stimulus – BCG, E. Coli, EHEC, L. Monocytogenes, Latex, M. Tuberculosis, S. Aureus, S. Typhi, S. Typhimirium – each plotting expression at 1 Hour (x-axis) vs. 6 Hours (y-axis), with the Bayesian, Click, and SOM cluster results overlaid.]

Error Metric

• Scores higher for clusters that contain either very many or very few members of the truth set

• Scales with the number of data points in each cluster

• Low error implies the truth clusters were highly correlated

• High error implies the truth clusters were well distributed

        Bayesian   Click   SOM
Error     4.33      2.44   18.22

E = ( Σ_{i=1..k} (1 − |2·|T_i ∩ C_i| / |C_i| − 1|) · |C_i| ) / Σ_{i=1..k} |C_i|
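The metric can be sketched as a size-weighted overlap score. This is a hypothetical reading of the slide (the pairing of each result cluster C_i with its best-overlapping truth cluster T_i is an assumption), not the authors' exact definition.

```python
def cluster_error(truth, ours):
    """Hypothetical sketch of the slide's error metric.

    truth, ours: lists of clusters, each a set of item ids.
    Each result cluster C_i scores 1 - |2*|T_i & C_i|/|C_i| - 1|, weighted
    by |C_i|: near zero when C_i captures almost all or almost none of its
    best truth cluster, and largest around a 50/50 split."""
    total = sum(len(c) for c in ours)
    err = 0.0
    for c in ours:
        t = max(truth, key=lambda t: len(t & c))   # best-overlapping truth cluster
        x = len(t & c) / len(c)
        err += (1 - abs(2 * x - 1)) * len(c) / total
    return err
```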

[Table: per-cluster comparison for clusters 0–34 of the Bayesian, Click, and SOM results – percent of our clusters, percent of the paper's clusters, cluster quality, and minimum cluster composition – with overall errors of 4.33 (Bayesian), 2.44 (Click), and 18.22 (SOM).]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption & Merging
  • Full CLICK
  • Results & Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 32: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Neighborhood

bull The area of the neighborhood shrinks over time You can use the exponential decay function for this

Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 33: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Weight Adjustment

bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node

bull One decay function decreases the learning variable by time

bull The other decay function decreases the rate of learning for neighbors further away from the BMU

bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 34: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Mapping Mode (SOM)

bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

• For every possible cut C connecting V:
– H0^C: cut C disconnects two clusters
– H1^C: cut C partitions a kernel
– If Pr(H0^C) > Pr(H1^C) for any C, then V is not a kernel
• If V is not a kernel, the graph should be partitioned into sub-graphs H, K

W(C) = ln( Pr(H1^C | C) / Pr(H0^C | C) ) = Σ_(i,j)∈C w_ij

Minimal Weight Cut

• Choose one vertex as the source; mark it visited

• Repeatedly mark the most highly connected unmarked vertex next

• The last node t marked defines the cut of the phase:
– W(C_t) = Σ_i w_it
– Merge t with the second-to-last marked node s
– Remove t from V
– Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
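The phase structure above (maximum-adjacency ordering, cut-of-the-phase, merge s and t) can be written out directly. A minimal Python sketch of the Stoer–Wagner procedure; the names and the adjacency-matrix representation are ours:

```python
def stoer_wagner(weights):
    """Global minimum-weight cut of an undirected graph (Stoer & Wagner, 1997).

    weights: symmetric adjacency matrix (list of lists). Returns
    (cut_weight, side) where side holds the original vertex indices on
    one side of the best cut found.
    """
    n = len(weights)
    w = [row[:] for row in weights]          # working copy, merged in place
    groups = [[i] for i in range(n)]         # original vertices per super-node
    active = list(range(n))
    best_weight, best_side = float("inf"), None

    while len(active) > 1:
        # Maximum-adjacency ordering: repeatedly mark the vertex most
        # strongly connected to the already-marked set.
        start = active[0]
        marked = [start]
        conn = {v: w[start][v] for v in active if v != start}
        while conn:
            t = max(conn, key=conn.get)      # next most-connected vertex
            marked.append(t)
            del conn[t]
            for v in conn:
                conn[v] += w[t][v]
        t, s = marked[-1], marked[-2]
        # Cut-of-the-phase: last-marked vertex t vs. everything else.
        phase_weight = sum(w[t][v] for v in active if v != t)
        if phase_weight < best_weight:
            best_weight, best_side = phase_weight, list(groups[t])
        # Merge t into the second-to-last marked node s, then drop t.
        for v in active:
            if v not in (s, t):
                w[s][v] += w[t][v]
                w[v][s] = w[s][v]
        groups[s].extend(groups[t])
        active.remove(t)
    return best_weight, best_side

# Two weight-3 triangles joined by a single weight-1 edge: the global
# min cut should sever that edge.
g = [[0] * 6 for _ in range(6)]
for a, b, wt in [(0, 1, 3), (0, 2, 3), (1, 2, 3),
                 (3, 4, 3), (3, 5, 3), (4, 5, 3), (2, 3, 1)]:
    g[a][b] = g[b][a] = wt
cut, side = stoer_wagner(g)
```

On this example the algorithm returns cut weight 1, separating one triangle from the other.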

MinWeightCut Example

MinWeightCut

CLICK: Adoption & Merging

• Basic_CLICK produces kernels, not full clusters

• Expand kernels by adopting the closest singletons

• Merge kernels with the closest similarity
– In both steps, only merge/adopt above some similarity threshold

Full CLICK

R: Singleton Set
Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
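The Adopt step can be sketched as follows, assuming the similarity between a singleton and a kernel is the dot product with the kernel's mean fingerprint; the threshold value and all names here are hypothetical, not from the paper:

```python
import numpy as np

def adopt(kernels, singletons, genes, threshold=0.5):
    """Adoption step sketch: move each singleton into the kernel whose mean
    fingerprint it matches best, if that similarity clears the threshold.

    kernels: list of lists of gene indices; singletons: set of gene indices;
    genes: row-normalized fingerprint matrix. threshold=0.5 is an
    illustrative value only.
    """
    changed = True
    while changed:                 # repeat, since adoptions move fingerprints
        changed = False
        for g in list(singletons):
            fps = [np.mean(genes[k], axis=0) for k in kernels]  # cluster fingerprints
            sims = [float(genes[g] @ fp) for fp in fps]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                kernels[best].append(g)
                singletons.remove(g)
                changed = True
    return kernels, singletons

genes = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.95, 0.05]])
kernels, singletons = adopt([[0, 1], [2, 3]], {4}, genes)
```

Here gene 4 points almost exactly along the first kernel's fingerprint, so it is adopted there and the singleton set empties.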

Results & Analysis

• Two metrics:
– Similarity between result clusters
– Similarity of clusters with the paper's results

• Map clusters to find regions of similar data

• Compare the best clusters to find relationships

[Scatter plots, one per condition — BCG, E. coli, EHEC, L. monocytogenes, Latex, M. tuberculosis, S. aureus, S. typhi, S. typhimurium — of gene expression at 1 Hour (x-axis) vs. 6 Hours (y-axis), comparing the Bayesian, CLICK, and SOM clusterings.]

Error Metric

• Scores higher for result clusters that contain either very many or very few of the truth-set clusters

• Scales with the number of data points in each cluster

• Low error implies the truth clusters were highly correlated with our clusters

• High error implies the truth clusters were spread across our clusters

Error (Bayesian / CLICK / SOM): 4.33 / 2.44 / 18.22

E = ( Σ_{i=1..k} (1 − 2|T_i ∩ C_i| / (|T_i| + |C_i|)) · |C_i| ) / Σ_{i=1..k} |C_i|

where C_i is a result cluster and T_i its corresponding truth cluster.
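One plausible reading of this metric — a Dice-style overlap error per cluster, weighted by cluster size — can be sketched in a few lines. The best-match pairing of each result cluster C with a truth cluster T is our assumption, as are all names:

```python
def cluster_error(truth, clusters):
    """Size-weighted overlap error between result clusters and truth clusters.

    For each result cluster C paired with its best-overlapping truth
    cluster T, the per-cluster error (1 - 2|T & C| / (|T| + |C|)) is 0
    for a perfect match and approaches 1 for no overlap; errors are
    weighted by |C|. A plausible reading of the slide's formula, not a
    verbatim reproduction.
    """
    total, weight = 0.0, 0
    for C in clusters:
        C = set(C)
        T = set(max(truth, key=lambda t: len(C & set(t))))  # best-matching truth cluster
        dice = 2 * len(C & T) / (len(T) + len(C))
        total += (1 - dice) * len(C)
        weight += len(C)
    return total / weight

truth = [[0, 1, 2], [3, 4, 5]]
clusters = [[0, 1], [2, 3, 4, 5]]
err = cluster_error(truth, clusters)
```

A perfect clustering scores 0; the toy example above, where each result cluster misses or gains one element, scores about 0.16.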

[Table: per-cluster comparison for the Bayesian, CLICK, and SOM results — percent of our clusters, percent of the paper's clusters, cluster quality, and minimum cluster composition at coverage thresholds from 0% to 100%. Overall error: Bayesian 4.33, CLICK 2.44, SOM 18.22.]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK: Adoption & Merging
  • Full CLICK
  • Results & Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 35: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

GenePattern

bull httpwwwbroadmiteducancersoftwaregenepattern

bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools

bull Use online or download

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 36: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary

The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 37: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Final Results

bull httpmyfitedu~sellingsfinalClusterszip

bull Arranged input data into 9 clusters using GenePattern software

bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)

CLuster Identification via Connectivity Kernels (CLICK)

bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector

ndash Edge e pairwise similarity between genes

ndash Cut C subset of E that partitions the graph

ndash Cluster c subset of V

ndash Intersection of clusters ci cj inej is Oslash

ndash Fingerprint of a cluster c mean_vector(c)

Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 38: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

CLuster Identification via Connectivity Kernels (CLICK)

• Initialize graph G = (V, E)
  – Vertex v: a single gene "fingerprint" vector
  – Edge e: pairwise similarity between the two genes
  – Cut C: subset of E that partitions the graph
  – Cluster c: subset of V
  – The intersection of clusters c_i, c_j (i ≠ j) is Ø
  – Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir, "CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis," Proceedings of Intelligent Systems for Molecular Biology (ISMB), 2000.

CLICK Preprocessing

• Input: n × p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure
  – S_vu = v · u = |v| |u| cos θ
  – Proportional to cos θ only when the norm is fixed for all v ∈ V
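The fingerprint similarity above can be sketched in a few lines; `similarity_matrix` is an illustrative name, and standardizing each row (zero mean, unit norm) is one assumption about how the normalization is done so that the dot product reduces to cos θ:

```python
import numpy as np

def similarity_matrix(M):
    """Pairwise dot-product similarity between gene fingerprints.

    M is an n x p matrix (n genes, p tests).  Rows are standardized
    so that S_vu = v . u equals cos(theta), i.e. the norm is fixed
    for all v in V as the slide requires.
    """
    # Center each fingerprint, then scale it to unit norm.
    Z = M - M.mean(axis=1, keepdims=True)
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    return Z @ Z.T  # S[i, j] = v_i . v_j = cos(theta_ij)

S = similarity_matrix(np.array([[0.1, 2.6, 2.0],
                                [0.7, 2.0, 0.9],
                                [0.8, 4.8, 4.4]]))
```

With this normalization every self-similarity S[i, i] is exactly 1 and the matrix is symmetric.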

CLICK Similarity

• Key idea: S follows a normalized mixture distribution
  – For u, v ∈ V in the same cluster: S_uv ~ f(x | μ_T, σ_T²), the pdf with mean μ_T and variance σ_T²
  – For u, v ∈ V in different clusters: S_uv ~ f(x | μ_F, σ_F²), the pdf with mean μ_F and variance σ_F²

Basic CLICK

• M: n × p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: log-likelihood ratio that v_i, v_j are mates

f(S_{ij} \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(S_{ij}-\mu)^2 / 2\sigma^2}

w_{ij} = \ln \frac{p_{\text{mates}}\, f(S_{ij} \mid \mu_T, \sigma_T)}{(1 - p_{\text{mates}})\, f(S_{ij} \mid \mu_F, \sigma_F)} = \ln \frac{p_{\text{mates}}\, \sigma_F}{(1 - p_{\text{mates}})\, \sigma_T} + \frac{(S_{ij} - \mu_F)^2}{2\sigma_F^2} - \frac{(S_{ij} - \mu_T)^2}{2\sigma_T^2}
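The edge weight follows directly from the two-Gaussian mixture above; a minimal sketch (`mate_weight` is an illustrative name) using the expanded form so that no density needs to be evaluated explicitly:

```python
import math

def mate_weight(S_ij, p_mates, mu_T, sigma_T, mu_F, sigma_F):
    """Log-odds that genes i and j are mates, given similarity S_ij.

    Implements the expanded form of
      w_ij = ln( p * f(S_ij | mu_T, sigma_T)
                 / ((1 - p) * f(S_ij | mu_F, sigma_F)) ),
    which avoids underflow from tiny Gaussian densities.
    """
    return (math.log(p_mates * sigma_F / ((1 - p_mates) * sigma_T))
            + (S_ij - mu_F) ** 2 / (2 * sigma_F ** 2)
            - (S_ij - mu_T) ** 2 / (2 * sigma_T ** 2))
```

Positive w_ij means the similarity is better explained by the "mates" component, negative by the "non-mates" component.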

Basic CLICK
R: Singleton Set

Basic_CLICK(Graph G):
    If V(G) = {v} then          // v is a singleton
        R.add(v)
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
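The recursion can be sketched in Python; `is_kernel` and `min_weight_cut` are left as caller-supplied callables since the slides define those tests separately:

```python
def basic_click(G, is_kernel, min_weight_cut, R, kernels):
    """Recursive skeleton of Basic-CLICK (illustrative helper names).

    G is a set of vertices.  Singletons accumulate in R,
    identified kernels in `kernels`.
    """
    if len(G) == 1:
        R.update(G)              # v is a singleton
    elif is_kernel(G):
        kernels.append(set(G))   # output V(G) as a kernel
    else:
        H, K = min_weight_cut(G)  # split along the minimum-weight cut
        basic_click(H, is_kernel, min_weight_cut, R, kernels)
        basic_click(K, is_kernel, min_weight_cut, R, kernels)
```

With a toy `is_kernel` that accepts any set of at most two vertices and a cut that halves the set, `basic_click({1,2,3,4}, ...)` produces two kernels and no singletons.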

CLICK Kernel

• Decision problem: is V…
  – a singleton (|V| = 1)?
  – a subset of 2+ clusters (needs further partitioning)?
  – a subset of a single cluster (a kernel)?

CLICK Kernel

• For every possible cut C of V:
  – H_0^C: cut C disconnects two clusters
  – H_1^C: cut C partitions a kernel
  – If Pr(H_0^C) > Pr(H_1^C) for any C, then V is not a kernel
• If V is not a kernel, the graph should be partitioned into sub-graphs H, K

W(C) = \ln \frac{\Pr(H_1^C \mid C)}{\Pr(H_0^C \mid C)} = \sum_{(i,j) \in C} w_{ij}

Minimal Weight Cut

• Choose one vertex as the source; mark it visited
• Repeatedly mark the most highly connected unvisited vertex
• The last node t marked represents the cut
  – W(C_t) = Σ_i w_it
  – Merge t with the second-to-last marked node s
  – Remove t from V
  – Repeat until |V| = 1

Mechthild Stoer and Frank Wagner, "A Simple Min-Cut Algorithm," Journal of the ACM, Vol. 44, No. 4, July 1997.
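The maximum-adjacency sweep described above can be sketched directly; this is a minimal O(n³) version over a symmetric weight matrix, an illustrative implementation rather than the authors' code:

```python
def stoer_wagner(w):
    """Global minimum-weight cut of an undirected weighted graph.

    w: symmetric n x n weight matrix (w[i][i] == 0).
    Returns (cut_weight, vertices on one side of the cut).
    """
    n = len(w)
    w = [row[:] for row in w]              # work on a copy
    groups = [[i] for i in range(n)]       # original vertices merged into i
    active = list(range(n))
    best_weight, best_side = float("inf"), []
    while len(active) > 1:
        # Maximum-adjacency search from an arbitrary source.
        a = [active[0]]
        conn = {v: w[active[0]][v] for v in active[1:]}
        cut_of_phase = 0
        while conn:
            t = max(conn, key=conn.get)    # most connected unvisited vertex
            cut_of_phase = conn.pop(t)     # W(C_t) when t is added last
            a.append(t)
            for v in conn:
                conn[v] += w[t][v]
        s, t = a[-2], a[-1]                # last two vertices marked
        if cut_of_phase < best_weight:
            best_weight, best_side = cut_of_phase, list(groups[t])
        for v in active:                   # merge t into s, remove t
            if v not in (s, t):
                w[s][v] += w[t][v]
                w[v][s] = w[s][v]
        groups[s] += groups[t]
        active.remove(t)
    return best_weight, best_side

# Two weight-3 triangles {0,1,2} and {3,4,5} joined by one weight-1 edge:
W = [[0, 3, 3, 1, 0, 0],
     [3, 0, 3, 0, 0, 0],
     [3, 3, 0, 0, 0, 0],
     [1, 0, 0, 0, 3, 3],
     [0, 0, 0, 3, 0, 3],
     [0, 0, 0, 3, 3, 0]]
weight, side = stoer_wagner(W)  # min cut has weight 1, severing the bridge
```

Each phase contracts the last vertex of the sweep into the second-to-last one, exactly as in the bullet list above, and the best cut-of-the-phase over all phases is the global minimum cut.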

MinWeightCut Example

MinWeightCut

CLICK Adoption & Merging

• Basic_CLICK outputs kernels, not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
  – In both cases, merge/adopt only above some similarity threshold

Full CLICK
R: Singleton Set

Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)

Results & Analysis

• Two metrics:
  – Similarity between our result clusters
  – Similarity of our clusters to the paper's results
• Map clusters to find regions of similar data
• Compare the best clusters to find relationships

[Scatter plots, one per condition — BCG, E. coli, EHEC, L. monocytogenes, Latex, M. tuberculosis, S. aureus, S. Typhi, S. Typhimurium — plotting 1-hour vs. 6-hour expression of the clusters found by the Bayesian, CLICK, and SOM methods.]

Error Metric

• Scores higher for result clusters that capture either far too many or far too few members of a truth-set cluster
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated with ours
• High error implies the truth clusters were distributed across many of ours

Error: Bayesian 4.33, CLICK 2.44, SOM 18.22

E = \frac{\sum_{i=1}^{k} \left( 1 - \frac{2\,|T_i \cap C_i|}{|T_i| + |C_i|} \right) |C_i|}{\sum_{i=1}^{k} |C_i|}
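Assuming each truth cluster T_i has been paired with its best-overlapping result cluster C_i, a size-weighted Dice-style error of this shape can be sketched as follows (`cluster_error` is an illustrative name):

```python
def cluster_error(truth, found):
    """Size-weighted mismatch between paired truth clusters T_i and
    result clusters C_i, given as parallel lists of sets.

    The Dice overlap 2|T_i & C_i| / (|T_i| + |C_i|) is 1 for identical
    clusters and 0 for disjoint ones; its complement is weighted by
    |C_i| so that large clusters dominate the score.
    """
    total = sum(len(c) for c in found)
    err = 0.0
    for t, c in zip(truth, found):
        dice = 2 * len(t & c) / (len(t) + len(c))
        err += (1 - dice) * len(c)
    return err / total

perfect = cluster_error([{1, 2}, {3, 4}], [{1, 2}, {3, 4}])  # 0.0
worst = cluster_error([{1, 2}, {3, 4}], [{5, 6}, {7, 8}])    # 1.0
```

An exact reproduction of the truth clustering scores 0, completely disjoint clusters score 1, matching the low-error/high-error reading in the bullets above.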

[Full comparison table, clusters 0–34: for each method (Bayesian, CLICK, SOM) the fraction of each of our clusters matched, the fraction of the paper's cluster matched, a cluster-quality score, and the minimum cluster composition at cutoffs from 0% to 100%. Overall error: Bayesian 4.33, CLICK 2.44, SOM 18.22.]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 39: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

CLICK Preprocessing

bull Input n x p matrix M of values

bull n Genes p Tests

bull Data must be normalized

bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 40: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

CLICK Similarity

bull Key idea S is normalized mixed distribution

clustersdifferent in elementsor f pdf)|( variance mean For

cluster samein elementsor f pdf)|( variance mean For

2

2

FF

FF

TT

TT

xfVvu

xfVvu

σμσμ

σμσμ

==isin

==isin

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 41: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Basic CLICK

bull M n x p matrix (genes vs test conditions)

bull Sij dot product of vi vjbull wij probability that vi vj are mates

22

2222

2

)(1

2)()(

)1()ln

)|()1()|(

ln

)2()|( 2

2

TF

TijFFijT

Tmates

Fmatesij

FFijmates

TTijmatesij

S

ij

SSp

pw

SfpSfp

w

eSfij

σσμσμσ

σσ

σμσμ

πσσμ σ

μ

minusminusminus+⎟⎟⎠

⎞⎜⎜⎝

⎛minus

=

⎟⎟⎠

⎞⎜⎜⎝

minus=

=minusminus

minus

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 42: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then

Output(V(G))Else

(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)

CLICK Kernel

bull Decision problem is Vhellipndash a singleton (|V| = 1)

ndash a subset of 2+ clusters (need to partition more)

ndash a subset of a single cluster (kernel)

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

E = Σ_{i=1}^{k} [ (1 − |T_i ∩ C_i| / |T_i|) · (1 − |T_i ∩ C_i| / |C_i|) · |C_i| ] / Σ_{i=1}^{k} |C_i|

[Table: per-cluster comparison of the Bayesian, CLICK, and SOM results (clusters 0–34): percent of our clusters, percent of the paper's clusters, and cluster quality, together with the minimum cluster composition at thresholds from 0% to 100%, and an overall error row for each method.]

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation: Non-spherical Clusters
  • Compute initial probabilities
  • Expectation: Bayes' law
  • Expectation: Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption & Merging
  • Full CLICK
  • Results & Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64

CLICK Kernel

• Decision problem: is V…
  – a singleton (|V| = 1)?
  – a subset of 2+ clusters (needs further partitioning)?
  – a subset of a single cluster (a kernel)?

CLICK Kernel

• For all possible cuts C of V:
  – H0^C: cut C disconnects two clusters
  – H1^C: cut C partitions a kernel
  – If Pr(H0^C | C) > Pr(H1^C | C) for any C, then V is not a kernel

• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

W(C) = ln( Pr(H1^C | C) / Pr(H0^C | C) ) = Σ_{(i,j) ∈ C} w_ij
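For tiny graphs, the kernel test above can be made concrete by brute force: treat each edge weight w_ij as a precomputed log-likelihood ratio and accept V as a kernel only when every cut has positive total weight. The specific weights below are invented for illustration; CLICK itself derives them from the similarity distributions:

```python
from itertools import combinations

def cut_weight(weights, side):
    """W(C): total weight of edges crossing the cut that isolates `side`.
    `weights` maps vertex pairs (i, j) to log-likelihood-ratio weights."""
    return sum(w for (i, j), w in weights.items()
               if (i in side) != (j in side))

def is_kernel(vertices, weights):
    """V is accepted as a kernel only if every nontrivial cut has
    W(C) > 0, i.e. no cut looks more like a split between two clusters.
    Brute force over all bipartitions, so only usable for tiny graphs."""
    vs = sorted(vertices)
    for r in range(1, len(vs)):
        for side in combinations(vs, r):
            if cut_weight(weights, set(side)) <= 0:
                return False
    return True
```

A triangle with all-positive weights passes the test, while attaching a vertex through negative-weight (dissimilar) edges creates a cut with W(C) ≤ 0 and the set is rejected, which is exactly the "partition further" branch of the decision problem.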

Minimal Weight Cut

• Choose one vertex as the source and mark it visited

• Repeatedly mark the vertex most highly connected to the visited set

• The last node t marked represents the cut:
  – W(C_t) = Σ_i w_it
  – Merge t with the second-to-last marked node s
  – Remove t from V
  – Repeat until |V| = 1

Mechthild Stoer and Frank Wagner, "A Simple Min-Cut Algorithm," Journal of the ACM, Vol. 44, No. 4, July 1997.
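The Stoer–Wagner procedure on the slide can be sketched directly. This is a minimal implementation for small graphs given as symmetric adjacency dictionaries (the graph encoding is our choice, not the paper's):

```python
def stoer_wagner(graph):
    """Return (min_cut_weight, one side of the cut) for a connected,
    undirected graph given as {u: {v: weight, ...}, ...} with symmetric
    entries. Follows the slide: grow a maximum-adjacency ordering, take
    the cut isolating the last vertex t, then merge t into s."""
    g = {u: dict(nbrs) for u, nbrs in graph.items()}  # mutable copy
    groups = {u: {u} for u in g}   # original vertices inside each super-vertex
    best_weight, best_side = float("inf"), set()
    while len(g) > 1:
        # One minimum-cut phase: repeatedly add the vertex most strongly
        # connected to the already-marked set.
        start = next(iter(g))
        order, in_a = [start], {start}
        conn = dict(g[start])      # connection strength of outside vertices
        while len(order) < len(g):
            v = max((u for u in g if u not in in_a),
                    key=lambda u: conn.get(u, 0.0))
            order.append(v)
            in_a.add(v)
            for u, w in g[v].items():
                if u not in in_a:
                    conn[u] = conn.get(u, 0.0) + w
        s, t = order[-2], order[-1]
        # Cut of the phase: t against everything else, W(C_t) = sum of w_it.
        cut_w = sum(g[t].values())
        if cut_w < best_weight:
            best_weight, best_side = cut_w, set(groups[t])
        # Merge t into the second-to-last marked node s, then remove t.
        for u, w in g[t].items():
            g[u].pop(t, None)
            if u != s:
                g[s][u] = g[s].get(u, 0.0) + w
                g[u][s] = g[u].get(s, 0.0) + w
        del g[t]
        groups[s] |= groups.pop(t)
    return best_weight, best_side
```

On two triangles of heavy edges joined by one light edge, the procedure finds the light edge as the minimum cut, which is the behavior CLICK relies on to split weakly connected cluster candidates.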

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 44: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

CLICK Kernel

bull For all possible cuts C connecting Vndash H0

C Cut C disconnects two clusters

ndash H1C Cut C partitions a kernel

ndash If H0C gt H1

C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K

sum=⎟⎟⎠

⎞⎜⎜⎝

⎛=

inCjijiC

C

wCHCHCW

)(

0

1

)|Pr()|Pr(ln)(

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 45: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

Minimal Weight Cut

bull Choose one vertex as source mark visited

bull Mark each next highly connected vertex

bull Last node t represents the cutndash W(Ct) = sumwit

ndash Merge t with 2nd last marked node s

ndash Remove t from V

ndash Repeat until |V| = 1

Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 46: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

MinWeightCut Example

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 47: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

MinWeightCut

CLICK Adoption amp Merging

bull Basic_CLICK kernels ndash not full clusters

bull Expand kernels by adding closest singletons

bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold

Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced

Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)

Merge(L)Adopt(L R)

Results amp Analysis

bull Two metrics ndash Similarity between result clusters

ndash Similarity of clusters with paperrsquos results

bull Map clusters to find regions of similar data

bull Compare best clusters to find relationships

BCG

-1

-05

0

05

1

15

2

25

3

-04 -02 0 02 04 06 08 1 12

1 Hour

6 H

ours

Bayesian Click SOM

E Coli

-2

-1

0

1

2

3

4

5

6

-1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

EHEC

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5 6

1 Hour

6 H

ours

Bayesian Click SOM

L Monocytogenes

-4

-3

-2

-1

0

1

2

3

4

5

6

-4 -3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

Latex

-2

-15

-1

-05

0

05

1

15

2

25

-15 -1 -05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

M Tuberculosis

-2

-1

0

1

2

3

4

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

S Aureus

-3

-2

-1

0

1

2

3

4

5

6

-3 -2 -1 0 1 2 3 4 5

1 Hour

6 H

ours

Bayesian Click SOM

S Typhi

-05

0

05

1

15

2

25

-02 0 02 04 06 08 1 12 14 16

1 Hour

6 H

ours

Bayesian Click SOM

S Typhimirium

-1

-05

0

05

1

15

2

25

3

-05 0 05 1 15 2

1 Hour

6 H

ours

Bayesian Click SOM

Error Metric

bull Scores higher for clusters that either contain very many or very few clusters in the truth set

bull Scales to the number of datums in each cluster

bull Low error implies Truth cluster were highly correlated

bull High error implies Truth Clusters were well distributed

Bayesian Click SOM

Error 433 244 1822

sum

sum

=

= ⎥⎥⎦

⎢⎢⎣

⎡sdot⎟⎟⎠

⎞⎜⎜⎝

⎛ capminusminus

= k

ii

k

ii

i

C

CT

CT

E

1

1 21

21

Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition

Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000

1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899

3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899

4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899

6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141

8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141

10 1 0015152 0045455 55 6111 000 4141

11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141

13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141

15 1 0005051 0005051 80 2879 000 4141

16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000

18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000

20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025

22 1 0005051 0005051 Error 433 2440 1822

23 0 0 024 0 0 0

25 0037037 0005051 013636426 0 0 0

27 0 0 0

28 0225 0136364 163636429 0485714 0085859 3005051

30 0736842 0070707 134343431 0004673 0005051 1080808

32 059375 0191919 122828333 0 0 0

34 0008065 0005051 0626263

  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayersquos law
  • Expectation Bayersquos law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption amp Merging
  • Full CLICK
  • Results amp Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 48: Stephen Connetti Sally Ellingson Luis Quilesdmitra/Compbio/09Spr.d/ClusteringPr...Luis Quiles Gene Expression Analysis? • Virus, Bacteria, and Cellular Function Research • Map

CLICK Adoption & Merging

• Basic_CLICK produces kernels – not full clusters

• Expand kernels by adopting the closest singletons

• Merge kernels with the closest similarity
  – In both situations, only merge/adopt above some similarity threshold

Full CLICK

R: Singleton Set

Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
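The Full_CLICK driver loop can be sketched in Python. This is a minimal sketch of the control flow only: `basic_click`, `adopt`, and `merge` are hypothetical callables standing in for the slides' Basic_CLICK, Adopt, and Merge steps, not implementations of them.

```python
def full_click(graph, basic_click, adopt, merge):
    """Sketch of the Full CLICK driver loop (control flow only).

    graph: dict mapping vertex -> set of (neighbor, weight) pairs.
    basic_click(subgraph) -> (kernels, singletons); adopt and merge are
    assumed helpers mirroring the slides' Adopt and Merge steps.
    """
    R = set(graph)                 # singleton set R starts as all vertices
    kernels = []                   # L: kernels accumulated so far
    prev = len(R) + 1
    while len(R) < prev:           # repeat while |R| keeps shrinking
        prev = len(R)
        # restrict the graph to the current singleton set R
        sub = {v: {(u, w) for (u, w) in graph[v] if u in R} for v in R}
        new_kernels, R = basic_click(sub)
        kernels.extend(new_kernels)
        kernels, R = adopt(kernels, R)   # attach closest singletons to kernels
    kernels = merge(kernels)             # merge highly similar kernels
    kernels, R = adopt(kernels, R)       # final adoption pass
    return kernels, R
```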

Results & Analysis

• Two metrics:
  – Similarity between result clusters
  – Similarity of clusters with the paper's results

• Map clusters to find regions of similar data

• Compare best clusters to find relationships
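One simple way to realize the similarity metrics above is pairwise Jaccard overlap between clusters; this is an illustrative sketch (the slides do not specify the exact similarity formula), matching each of our clusters to its most similar cluster from the paper.

```python
def jaccard(a, b):
    """Jaccard overlap |a ∩ b| / |a ∪ b| between two gene sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_matches(ours, paper):
    """For each of our clusters, the most similar paper cluster.

    Returns a list of (paper_cluster_index, jaccard_score) tuples,
    one per cluster in `ours`.
    """
    return [max(((j, jaccard(c, p)) for j, p in enumerate(paper)),
                key=lambda t: t[1])
            for c in ours]
```

The same routine compares any two clusterings, so it serves both metrics: our three methods against each other, and each method against the paper's published clusters.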

[Figures: per-pathogen scatter plots of gene expression at 1 Hour (x-axis) vs 6 Hours (y-axis), colored by cluster assignment from the Bayesian, CLICK, and SOM methods. Panels: BCG, E. coli, EHEC, L. monocytogenes, Latex, M. tuberculosis, S. aureus, S. Typhi, S. Typhimurium.]

Error Metric

• Scores higher for clusters that contain either very many or very few clusters from the truth set

• Scales to the number of data points in each cluster

• Low error implies the truth clusters were highly correlated

• High error implies the truth clusters were well distributed

Error (Bayesian / CLICK / SOM): 4.33 / 2.44 / 1.822

E = \frac{\sum_{i=1}^{k} \left[ \frac{|T_i| - |T_i \cap C_i|}{|T_i|} \cdot |C_i| \right]}{\sum_{i=1}^{k} |C_i|}
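The error formula is garbled in this extraction; under one plausible reading (an assumption, not verbatim from the slides), it weights each truth cluster's unmatched fraction by the size of the corresponding result cluster, normalized by total cluster size:

```python
def cluster_error(truth, result):
    """Reconstructed error metric (an assumed reading of the slide formula).

    truth, result: lists of sets, where truth[i] (T_i) is paired with
    result[i] (C_i). Each term is the fraction of T_i not captured by C_i,
    weighted by |C_i|; the sum is normalized by the total |C_i|.
    """
    num = sum((len(t) - len(t & c)) / len(t) * len(c)
              for t, c in zip(truth, result) if t)
    den = sum(len(c) for c in result)
    return num / den if den else 0.0
```

A perfectly matched clustering scores 0; the score grows as truth clusters are spread across result clusters, matching the bullet points above.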

[Table: per-cluster comparison of the three methods. Column groups: Percent of Our Clusters, Percent of Paper Cluster, Cluster Quality, and Min Cluster Composition, each reported for Bayesian, CLICK, and SOM; rows 0-34 index the clusters, and Min Cluster Composition is reported at thresholds 0-100%. Overall error: Bayesian 4.22929, CLICK 2.38409, SOM 1.78025; rounded: 4.33, 2.440, 1.822. Individual cell values are not reliably recoverable from this extraction.]







  • Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
  • Gene Expression Analysis
  • Gene Expression - Data
  • Gene Expression - Clustering
  • Human Macrophage Activation Programs induced by Pathogens
  • Project Goals
  • K-Means
  • K-Means
  • K-Means
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Bayesian Clustering
  • Normal Distribution
  • Expectation Non-spherical Clusters
  • Compute initial probabilities
  • Expectation Bayes' law
  • Expectation Bayes' law
  • Maximization New Clusters
  • Bayesian Information Criterion
  • Bayesian Information Criterion
  • Slide Number 23
  • Slide Number 24
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Self-Organizing Maps
  • Best Matching Unit
  • Neighborhood
  • Weight Adjustment
  • Slide Number 34
  • Mapping Mode (SOM)
  • GenePattern
  • GenePattern
  • Final Results
  • CLuster Identification via Connectivity Kernels (CLICK)
  • CLICK Preprocessing
  • CLICK Similarity
  • Basic CLICK
  • Basic CLICK
  • Slide Number 44
  • CLICK Kernel
  • CLICK Kernel
  • Minimal Weight Cut
  • MinWeightCut Example
  • MinWeightCut
  • CLICK Adoption & Merging
  • Full CLICK
  • Results & Analysis
  • Slide Number 54
  • Slide Number 55
  • Slide Number 56
  • Slide Number 57
  • Slide Number 58
  • Slide Number 59
  • Slide Number 60
  • Slide Number 61
  • Slide Number 62
  • Error Metric
  • Slide Number 64
Page 55: Stephen Connetti, Sally Ellingson, Luis Quiles

[Scatter plot — Latex: gene expression at 1 Hour (x) vs 6 Hours (y); points colored by cluster for Bayesian, Click, and SOM]

[Scatter plot — M. Tuberculosis: gene expression at 1 Hour (x) vs 6 Hours (y); points colored by cluster for Bayesian, Click, and SOM]

[Scatter plot — S. Aureus: gene expression at 1 Hour (x) vs 6 Hours (y); points colored by cluster for Bayesian, Click, and SOM]

[Scatter plot — S. Typhi: gene expression at 1 Hour (x) vs 6 Hours (y); points colored by cluster for Bayesian, Click, and SOM]

[Scatter plot — S. Typhimurium: gene expression at 1 Hour (x) vs 6 Hours (y); points colored by cluster for Bayesian, Click, and SOM]

Error Metric

• Scores higher for computed clusters that contain either very many or very few members of a truth-set cluster

• Scales with the number of data points in each cluster

• Low error implies the truth clusters were highly correlated with our clusters

• High error implies the truth clusters were distributed across many of our clusters

Error: Bayesian 4.33, Click 2.44, SOM 18.22
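As a rough illustration of the metric described above, the following sketch computes a size-weighted error between a truth clustering and a computed clustering. This is not the authors' code: the greedy matching of each truth cluster to its best-overlapping computed cluster, the squared miss fraction, and the size-weighted averaging are all assumptions about how the slide's formula could be realized.

```python
# Hypothetical sketch of a size-weighted clustering error (not the authors' code).
# truth and computed are lists of clusters; each cluster is a set of gene ids.
# Each truth cluster T_i is paired with the computed cluster that overlaps it
# most; the fraction of T_i that the match misses is squared, weighted by the
# matched cluster's size, and the weighted terms are averaged.

def clustering_error(truth, computed):
    num = 0.0
    den = 0.0
    for t in truth:  # t plays the role of T_i; assumed non-empty
        # match T_i to the computed cluster with the largest overlap (assumption)
        c = max(computed, key=lambda cand: len(t & cand))
        miss = (len(t) - len(t & c)) / len(t)  # fraction of T_i not recovered
        num += (miss ** 2) * len(c)            # scale by computed-cluster size
        den += len(c)
    return num / den if den else 0.0

# Toy example: cluster {1,2,3,4} is split across the two computed clusters.
truth = [{1, 2, 3, 4}, {5, 6}]
computed = [{1, 2, 3}, {4, 5, 6}]
err = clustering_error(truth, computed)  # -> 0.03125
```

A perfect recovery (every truth cluster fully contained in its match) gives an error of 0, which matches the slide's reading that low error means the truth clusters were highly correlated with the computed ones.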
