Cluster Analysis for Gene Expression Data
CSE 5615 Class Project, Group 2
Stephen Connetti, Sally Ellingson, Luis Quiles
Gene Expression Analysis
• Virus, bacteria, and cellular function research
• Map unknown genes to cellular functions
• Map illnesses to malfunctioning cellular processes
• Measuring gene expression – comparing expression of proteins under different situations against a base situation
Gene Expression ‐ Data
• To measure gene expression, measure the mRNA being produced vs. base production
• Data: mRNA production per microarray test
Gene  | E. coli 1 hr | 6 hr  | 12 hr | 24 hr | EHEC 1 hr | 2 hr   | 6 hr  | 12 hr | 24 hr
GCSF  | 0.083        | 2.615 | 2.007 | 1.96  | 0.001     |  0.714 | 3.642 | 3.138 | 2.229
GMCSF | 0.722        | 2.002 | 0.940 | 1.21  | 1.034     |  1.430 | 2.961 | 2.920 | 2.352
IL12B | 0.845        | 4.77  | 4.369 | 3.454 | 0.426     | -0.426 | 4.316 | 4.816 | 3.671
IL1RN | 0.548        | 1.732 | 1.938 | 1.389 | 0.781     | -1.27  | 1.344 | 1.419 | 1.301
IL6   | 3.006        | 5.244 | 3.897 | 3.957 | 3.889     |  4.106 | 4.396 | 4.137 | 3.990
Gene Expression ‐ Clustering
• Groups together data with similar properties
• Data sets are partitioned – clusters contain points more similar to each other than to points in other clusters
• Clustering helps researchers infer relationships between genes, especially when cellular functions are known
• Also helps identify relationships among co-expressed genes
Human Macrophage Activation Programs induced by Pathogens
• Originating source of our data
• 6800 genes, 43 microarray tests
• Significance tests reduce this to 977 genes
• Results were clustered to find relevance
• 198 genes expressed with the same pattern
Gerard J. Nau, Joan F. L. Richmond, Ann Schlesinger, Ezra G. Jennings, Eric S. Lander, and Richard A. Young. Human macrophage activation programs induced by bacterial pathogens. Proceedings of the National Academy of Sciences of the United States of America, Vol. 99, No. 3, Feb. 5, 2002.
Project Goals
• Implement 3 clustering algorithms
  – Bayesian Clustering, Self-Organizing Maps, CLICK
• Find most similar clusters – relevant similarity
• Compare the paper's best cluster to our results
  – How many of the 198 genes did we find?
K‐Means
[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.
Data
Initial Clusters
K‐Means
Contributes to Closest Cluster
K‐Means
[31] Jiang, D., Tang, C., and Zhang, A. Cluster Analysis for Gene Expression Data: A Survey.
Clusters adjust
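The K-Means picture sequence above (points contribute to their closest cluster, then clusters adjust) can be sketched in a few lines. This is purely illustrative Python; the slides name no implementation, and the `init` parameter is my addition for determinism:

```python
import random

def kmeans(points, k, iters=100, init=None):
    """Minimal k-means sketch: each point contributes to its closest
    cluster, then each centroid adjusts to the mean of its members."""
    centroids = list(init) if init else random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        new = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # converged: clusters stopped adjusting
            break
        centroids = new
    return centroids, clusters
```

With two well-separated blobs and one seed centroid per blob, the loop recovers the blobs in a couple of iterations.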
Bayesian Clustering
Data
Initial Clusters
Bayesian Clustering
High Contribution to Closest Cluster
Low Contribution to Other Cluster
Bayesian Clustering
Clusters adjust
Bayesian Clustering
Weak contributors removed
Bayesian Clustering
• Initialize clusters
• Until convergence
  – Expectation: compute new probabilities
  – Maximization: compute new clusters
• Use BIC to prune clusters
• Compute global BIC
• Repeat with different initial clusters
  – Keep results with best BIC
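The Expectation/Maximization loop above can be sketched as follows. This is illustrative only, and it assumes spherical, unit-variance Gaussians (an assumption the slides make later for the spherical case, not a statement of the authors' actual code):

```python
import math

def em_step(points, centers):
    """One EM iteration with spherical, unit-variance Gaussians.
    E-step: membership probabilities from the normal density (the
    constant factor cancels in the normalization).
    M-step: new centers as probability-weighted averages of the points."""
    k, dim = len(centers), len(points[0])
    # E-step: normalized responsibilities
    resp = []
    for p in points:
        w = [math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(p, c)))
             for c in centers]
        total = sum(w) or 1.0
        resp.append([x / total for x in w])
    # M-step: weighted mean per cluster
    new_centers = []
    for j in range(k):
        weight = sum(r[j] for r in resp) or 1.0
        new_centers.append(tuple(
            sum(r[j] * p[d] for r, p in zip(resp, points)) / weight
            for d in range(dim)))
    return new_centers, resp
```

Iterating `em_step` until the centers stop moving gives the "until convergence" loop; BIC-based pruning and restarts would wrap around it.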
Normal Distribution

Univariate:

N(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - x_0)^2 / 2\sigma^2}

Multivariate (n dimensions):

N(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}

Membership probability of point d_i in cluster k at step t:

P(d_i \in \mu_k^t) = \frac{1}{(2\pi)^{n/2}\,|\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(d_i - \mu_k^t)^T \Sigma_k^{-1} (d_i - \mu_k^t)}
Expectation: Non-spherical Clusters

\Sigma_k = D_k \,\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)\, D_k^T

• D_k = eigenvectors of Σ: the orientation of cluster k
• λ_1, …, λ_n = eigenvalues of Σ: the radii of cluster k
• Spherical clusters are usually sufficient (identity matrix)
Compute Initial Probabilities
• Given the initial clusters, initialize the probabilities with the normal distribution
• Since the probabilities will have to be normalized anyway, we can skip the constant factor:

P(d_i \in \mu_k^0) = e^{-\frac{1}{2}(d_i - \mu_k^0) \cdot (d_i - \mu_k^0)}
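The slide's simplified initial probability is a one-liner; this sketch (illustrative, not the authors' code) drops the Gaussian's constant factor exactly as described:

```python
import math

def initial_probability(d, mu):
    """Unnormalized initial membership, exp(-(d - mu).(d - mu) / 2);
    the constant in front of the Gaussian is skipped since the values
    are normalized across clusters later anyway."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(d, mu)))
```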
Expectation: Bayes' Law

P(\text{Cause} \mid \text{Effect}) = \frac{P(\text{Effect} \mid \text{Cause}) \cdot P(\text{Cause})}{P(\text{Effect})}

(posterior on the left, prior on the right)

= \frac{P(\text{Effect} \mid \text{Cause}) \cdot P(\text{Cause})}{\sum P(\text{Effect} \mid \text{Cause}) \cdot P(\text{Cause})}
Expectation: Bayes' Law

Recompute membership probabilities using the normal distribution – the new probability from the old probabilities:

P(d_i \in \mu_k^t) = \frac{P(d_i \in \mu_k^{t-1}) \cdot \sum_{i=1}^{n} P(d_i \in \mu_k^{t-1})}{\sum_{j=1}^{C} \left[ P(d_i \in \mu_j^{t-1}) \cdot \sum_{i=1}^{n} P(d_i \in \mu_j^{t-1}) \right]}
Maximization: New Clusters
• Clusters are computed as the weighted average of each point and the probability it belongs to the cluster:

\mu_k^t = \frac{\sum_{i=1}^{n} P(d_i \in \mu_k^{t-1}) \cdot d_i}{\sum_{i=1}^{n} P(d_i \in \mu_k^{t-1})}
Bayesian Information Criterion
• BIC measures the efficiency of the parameterized model in terms of predicting the data
• It is independent of prior knowledge and the model used

Per-cluster score (fit quality plus complexity penalty, where k_n is the number of free parameters and n the number of points):

BIC_k^t = 2 \ln\!\left( \prod_{i=1}^{n} P(d_i \in \mu_k^t) \right) + k_n \ln(n)

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
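In code, the per-cluster score reduces to a log-likelihood term plus a penalty. A sketch (note: sign conventions for BIC vary; this follows the slide's form rather than the textbook −2 ln L + k ln n):

```python
import math

def cluster_bic(memberships, n_params):
    """BIC for one cluster, following the slide: a fit-quality term
    2 * ln(product of membership probabilities) plus a complexity
    penalty n_params * ln(n)."""
    n = len(memberships)
    fit = 2 * sum(math.log(p) for p in memberships)  # ln of the product
    penalty = n_params * math.log(n)
    return fit + penalty
```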
Bayesian Information Criterion
• After convergence, compute the global BIC over all k clusters:

BIC = 2 \ln\!\left( \prod_{j=1}^{k} \prod_{i=1}^{n} P(d_i \in \mu_j) \right) + k\, k_n \ln(n)
[Figure: initial cluster assignments for the E. coli test data – 30 genes (GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR), Hour 1 vs. Hour 12 expression]
[Figure: E. coli initial clusters (test data), Hour 1 vs. Hour 12]
[Figure: E. coli final clusters (test data), Hour 1 vs. Hour 12, Clusters 1–9]
[Figure: EHEC clustering results, 1-hour vs. 6-hour expression; rational vs. discrete clusters]
[Figure: S. aureus clustering results, 1-hour vs. 6-hour expression; rational vs. discrete clusters]
[Figure: E. coli clustering results, 1-hour vs. 6-hour expression; rational vs. discrete clusters]
[Figure: S. typhimurium clustering results, 1-hour vs. 6-hour expression; rational vs. discrete clusters]
Self-Organizing Maps
• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values, so similar inputs are mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← min(d(v, SOM)) over every node in SOM
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
    end
• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001) 79–96.
• Kohonen's Self Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
Best Matching Unit
• The BMU is the node in the SOM at the minimum distance from the training vector. It is typically found using the Euclidean distance:

d(v, w) = \sqrt{\sum_{i=1}^{k} (v_i - w_i)^2}

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.
Neighborhood
• The area of the neighborhood shrinks over time; the exponential decay function can be used for this:

\sigma(t) = \sigma_0\, e^{-t/\lambda}

where n is the number of iterations the algorithm will run, σ_0 is the initial size of the neighborhood, and λ is a time constant derived from n (e.g., λ = n / ln σ_0).
Weight Adjustment
• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
• One decay function decreases the learning rate over time
• The other decreases the rate of learning for neighbors further from the BMU
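Pulling the last three slides together, here is a compact trainer. It is an illustrative sketch (not the GenePattern module): BMU by Euclidean distance, an exponentially decaying neighborhood radius, and a weight update scaled by both decay functions; the parameter names (`lr0`, `sigma0`) are mine:

```python
import math
import random

def train_som(data, rows, cols, iters=200, lr0=0.5):
    """SOM training sketch: find the BMU by Euclidean distance, then pull
    the BMU and its grid neighbors toward the sample; both the learning
    rate and the neighborhood radius decay exponentially over time."""
    dim = len(data[0])
    grid = [[[random.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0            # initial neighborhood size
    lam = iters / math.log(sigma0) if sigma0 > 1 else float(iters)
    for t in range(iters):
        v = random.choice(data)
        # BMU: grid node with minimum Euclidean distance to v
        br, bc = min(((r, c) for r in range(rows) for c in range(cols)),
                     key=lambda rc: sum((a - b) ** 2
                                        for a, b in zip(v, grid[rc[0]][rc[1]])))
        sigma = sigma0 * math.exp(-t / lam)   # neighborhood radius decay
        lr = lr0 * math.exp(-t / lam)         # learning-rate decay
        for r in range(rows):
            for c in range(cols):
                d2 = (r - br) ** 2 + (c - bc) ** 2
                if d2 <= sigma ** 2:
                    theta = math.exp(-d2 / (2 * sigma ** 2))
                    node = grid[r][c]
                    for i in range(dim):
                        node[i] += theta * lr * (v[i] - node[i])
    return grid
```

On two well-separated 1-D blobs with a 1×2 map, each node migrates to one blob's mean, which is the "similar inputs end up near each other" behavior the slides describe.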
Mapping Mode (SOM)
• After the learning algorithm is run on the map, the input is mapped onto the SOM; similar n-dimensional inputs will be near each other on the two-dimensional map
GenePattern
• http://www.broad.mit.edu/cancer/software/genepattern
• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
• Use online or download
GenePattern
Module name: SOMClustering
Description: Self-Organizing Maps algorithm
Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu
Date: 10/28/03  Release: 1.0
Summary:
The Self-Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. Resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g., stocks, mutual funds, spectral peaks, etc.)
Final Results
• http://my.fit.edu/~sellings/finalClusters.zip
• Arranged input data into 9 clusters using the GenePattern software
• Why 9 clusters?
  – Used the average of Stephen's output (8) + 1 to get a square map (3×3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
• Initialize graph G = (V, E)
  – Vertex v: a single gene "fingerprint" vector
  – Edge e: pairwise similarity between genes
  – Cut C: subset of E that partitions the graph
  – Cluster c: subset of V
  – Intersection of clusters c_i, c_j (i ≠ j) is ∅
  – Fingerprint of a cluster c: mean_vector(c)
Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.
CLICK: Preprocessing
• Input: n × p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure
  – S_vu = v · u = |v||u| cos θ
  – Proportional to the angle only when the norm is fixed for all v ∈ V
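The similarity measure and the normalization caveat above fit in a few lines. An illustrative sketch (the slides specify the math, not this code):

```python
import math

def similarity(v, u):
    """Dot-product similarity as in CLICK preprocessing: S_vu = v.u,
    which equals |v||u|cos(theta), so it tracks the angle only when
    every fingerprint vector has the same norm."""
    return sum(a * b for a, b in zip(v, u))

def normalize(v):
    """Scale a fingerprint to unit norm so S_vu reduces to cos(theta)."""
    n = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / n for a in v]
```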
CLICK: Similarity
• Key idea: S follows a normalized mixture distribution

For u, v ∈ V in the same cluster: mean μ_T, variance σ_T² – pdf f_T(x | μ_T, σ_T²)
For u, v ∈ V in different clusters: mean μ_F, variance σ_F² – pdf f_F(x | μ_F, σ_F²)
Basic CLICK
• M: n × p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: (log-odds) probability that v_i, v_j are mates

f(S_{ij} \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(S_{ij} - \mu)^2 / 2\sigma^2}

w_{ij} = \ln \frac{p_{mates}\, f_T(S_{ij} \mid \mu_T, \sigma_T^2)}{(1 - p_{mates})\, f_F(S_{ij} \mid \mu_F, \sigma_F^2)}
       = \ln\!\left( \frac{p_{mates}\, \sigma_F}{(1 - p_{mates})\, \sigma_T} \right) + \frac{(S_{ij} - \mu_F)^2}{2\sigma_F^2} - \frac{(S_{ij} - \mu_T)^2}{2\sigma_T^2}
Basic CLICK

R: singleton set
Basic_CLICK(Graph G):
    If V(G) = {v} then        # v is a singleton
        R.add(v)
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
CLICK: Kernel
• Decision problem: is V…
  – a singleton (|V| = 1)
  – a subset of 2+ clusters (need to partition more)
  – a subset of a single cluster (a kernel)
CLICK: Kernel
• For all possible cuts C connecting V
  – H_0^C: cut C disconnects two clusters
  – H_1^C: cut C partitions a kernel
  – If Pr(H_0^C) > Pr(H_1^C) for any C, then V is not a kernel
• If V is not a kernel, the graph should be partitioned into sub-graphs H, K

W(C) = \ln \frac{\Pr(H_1^C \mid C)}{\Pr(H_0^C \mid C)} = \sum_{(i,j) \in C} w_{ij}
Minimal Weight Cut
• Choose one vertex as the source; mark it visited
• Mark the next most highly connected vertex at each step
• The last node t represents the cut
  – W(C_t) = Σ_i w_it
  – Merge t with the second-to-last marked node s
  – Remove t from V
  – Repeat until |V| = 1
Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
[Figures: MinWeightCut worked example]
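The steps above are the Stoer–Wagner minimum-cut procedure; a sketch in adjacency-matrix form follows (illustrative code, variable names are mine). Each phase grows a "most tightly connected" ordering, records the cut that isolates the last-added vertex t, then merges t into the vertex s added just before it; the best cut-of-the-phase over all phases is the global minimum cut:

```python
def min_weight_cut(weights):
    """Stoer-Wagner sketch on a symmetric adjacency matrix.
    Returns (weight of the minimum cut, one side of that cut)."""
    n = len(weights)
    w = [row[:] for row in weights]           # working copy; merged in place
    groups = [[i] for i in range(n)]          # which originals each vertex holds
    active = list(range(n))
    best_weight, best_cut = float("inf"), None
    while len(active) > 1:
        start = active[0]
        order = [start]
        conn = {v: w[start][v] for v in active[1:]}
        while conn:
            v = max(conn, key=conn.get)       # most highly connected next
            order.append(v)
            del conn[v]
            for u in conn:
                conn[u] += w[v][u]
        s, t = order[-2], order[-1]
        # cut-of-the-phase: t alone vs. the rest, W(C_t) = sum of w_it
        cut_weight = sum(w[t][u] for u in active if u != t)
        if cut_weight < best_weight:
            best_weight, best_cut = cut_weight, list(groups[t])
        # merge t into the second-to-last marked node s
        for u in active:
            if u != t and u != s:
                w[s][u] += w[t][u]
                w[u][s] = w[s][u]
        groups[s] += groups[t]
        active.remove(t)
    return best_weight, best_cut
```

On a graph with two heavy pairs joined by one light edge, the returned cut is the light bridge, as expected.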
CLICK: Adoption & Merging
• Basic_CLICK produces kernels – not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
  – In both cases, only merge/adopt above some threshold
Full CLICK

R: singleton set
Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
Results & Analysis
• Two metrics
  – Similarity between result clusters
  – Similarity of clusters with the paper's results
• Map clusters to find regions of similar data
• Compare best clusters to find relationships
[Figures: Bayesian, CLICK, and SOM clustering results (1-hour vs. 6-hour expression) for BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. typhi, and S. typhimurium]
Error Metric
• Scores better for clusters that contain either very many or very few of the genes in the truth set
• Scales to the number of data points in each cluster
• Low error implies the truth cluster was highly correlated with our clusters
• High error implies the truth cluster's genes were distributed across our clusters

[Equation: error metric E over the truth set T and result clusters C_1, …, C_k]

Error: Bayesian 4.33 | CLICK 2.44 | SOM 18.22
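The slide's formula is only partly recoverable, so treat the following as one plausible reading, not the authors' exact metric: each cluster contributes a term 1 − |1 − 2·|T∩C_i|/|C_i||, weighted by cluster size and normalized, so the error is near 0 when each cluster holds almost all or almost none of the truth set and grows when the truth genes are spread evenly:

```python
def cluster_error(truth, clusters):
    """Hypothetical reconstruction of the slide's error metric.
    truth: a set of gene ids; clusters: a list of sets of gene ids.
    Per-cluster term peaks at 1 when exactly half a cluster is truth
    genes and falls to 0 at an all-or-nothing split."""
    total = sum(len(c) for c in clusters)
    err = 0.0
    for c in clusters:
        frac = len(truth & c) / len(c)
        err += (1 - abs(1 - 2 * frac)) * len(c)
    return err / total
```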
[Table: per-cluster comparison for Bayesian, CLICK, and SOM – percent of each of our clusters matching, percent of the paper's cluster recovered, cluster quality, and minimum cluster composition, with overall error values for each method]
Gene Expression Analysis
bull Virus Bacteria and Cellular Function Research
bull Map unknown genes to cellular functions
bull Map illness to malfunctioning cellular processes
bull Measuring gene expression ndash comparing expression of proteins under different situations against a base situation
Gene Expression ‐ Data
bull To measure gene expression measure the mRNA being produced vs base production
bull Data mRNA production vs microarray test
E Coli E coli E coli E coli EHEC EHEC EHEC EHEC EHEC
Gene 1 hr 6 hr 12 hr 24 hr 1 hr 2 hr 6 hr 12 hr 24 hr
GCSF 0083 2615 2007 196 0001 0714 3642 3138 2229
GMCSF 0722 2002 0940 121 1034 1430 2961 2920 2352
IL12B 0845 477 4369 3454 0426 ‐0426 4316 4816 3671
IL1RN 0548 1732 1938 1389 0781 ‐127 1344 1419 1301
IL6 3006 5244 3897 3957 3889 4106 4396 4137 3990
Gene Expression ‐ Clustering
bull Groups together data with similar properties
bull Data sets are partitioned ndash clusters contain points more similar to themselves than others
bull Clustering aids researchers to infer relationships between genes especially when cellular functions are known
bull Also helps identify relationships in co‐expressed genes
Human Macrophage Activation Programs induced by Pathogens
bull Originating source of our data
bull 6800 genes 43 microarray tests
bull Significance tests reduce this to 977 genes
bull Results were clustered to find relevance
bull 198 genes expressed with the same pattern
Gerard J Nau Joan F L Richmond Ann Schlesinger Ezra G Jennings Eric S Lander and Richard A Young Human macrophage activation programs induced by bacterial pathogens Proceedings of the National Academy of Sciences of the United States of America
Vol 99 No 3 Feb 5 2002
Project Goals
bull Implement 3 Clustering algorithmsndash Bayesian Clustering Self‐Organizing Maps CLICK
bull Find most similar clusters ndash relevant similarity
bull Compare paperrsquos best cluster to our resultsndash How many of the 198 genes did we find
K‐Means
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Data
Initial Clusters
K‐Means
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Contributes to Closest Cluster
K‐Means
[31] Jiang D Tang C and Zhang A Cluster Analysis for Gene Expression Data A Survey
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Data
Initial Clusters
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
High Contribution to Closest Cluster
Low Contribution to Other Cluster
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Weak contributors removed
Bayesian Clustering
bull Initialize Clusters
bull Until Convergencendash Expectation Compute new probabilities
ndash Maximization Compute new clusters
bull Use BIC to prune clusters
bull Compute global BIC
bull Repeat with different initial clustersndash Keep results with best BIC
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Normal Distribution
πσ
σ
2)(
220 2)( xxexN
minusminus
=
n
xxexN)2(||
)(2)()( 1
π
μμ
Σ=
minusΣminusminus minus
nk
ddtk
tki
tkik
tkiedP
)2(||)|(
2)()( 1
πμμ
μμ
Σ=isin
minusΣminusminus minus
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Expectation Non‐spherical Clusters
Tk
n
kk DD
⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢
⎣
⎡
=Σ
λ
λλ
00
0000
2
1
bull Dk=Eigenvectors of Σ are the orientation of cluster k
bull λ1 λn=Eigenvalues of Σ are the radii of cluster k
bull Spherical clusters are usually sufficient (Identity Matrix)
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Compute initial probabilities
bull Given the initial clusters initialize the probabilities with the Normal Distribution
bull Since the probabilities will have to be normalized anyway we can skip the constant
2)()(00 00
)|( kiki ddkki edP μμμμ minussdotminusminus=isin
Expectation Bayersquos law
)()()|()|(
EffectPCausePCauseEffectPEffectCauseP sdot
=
Posterior Prior
sum sdotsdot
=)()|(
)()|()|(CausePCauseEffectP
CausePCauseEffectPEffectCauseP
Expectation Bayersquos law
sum sum
sum
= =
minusminus
=
minusminus
isinsdotisin
isinsdotisin=isin C
j
n
i
tji
tj
tj
tji
n
i
tki
tk
tk
tki
tki
tk
dPdP
dPdPdP
1 1
11
1
11
)|()|(
)|()|()|(
μμμμ
μμμμμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Recompute membership probabilities using normal
distribution
New Probability Old Probabilities
Maximization New Clusters
bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster
sum
sum
=
minusminus
=
minusminus
isin
sdotisin= n
i
tki
tk
n
ii
tki
tk
tk
dP
ddP
1
11
1
11
)|(
)|(
μμ
μμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Bayesian Information Criterion
bull BIC measures the efficiency of the parameterized model in terms of predicting the data
bull It is independent to prior knowledge and the model used
)ln()|(
ln 1
2
nkn
dPnBIC
n
i
tk
tki
tk sdot+
⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜
⎝
⎛isin
sdot=sum=
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
Fit quality Complexity penalty
Bayesian Information Criterion
bull After convergence compute global BIC
)ln()|(
ln 1 1
2
nknk
dPnBIC
n
i
k
jjji
sdot+
⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜
⎝
⎛isin
sdot=sumsum= =
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
GCSF
GMCSF
IL12B
IL1RN
IL6
IL6
PBEF
proIL1B
TNFA
IL8
IL8
IP10
MCP1
MGSA
MIP1A
MIP1B
MIP2A
MIP2B
RANTES
CD44
CD44
ICAM1
IFITM1
LAMB3
NINJ1
TNFAIP6
ADORA2A
CCR6
CCR7
CCRL2
DTR
1
2
37
11
6
4
10
5
8 12
9
13
14
15
17
16
18
19
20
22
21
23
24
25
26
27
28
29
30
Initial ClustersE coliTest Data
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
• M: n × p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates

f(S_ij | μ, σ²) = (1 / √(2πσ²)) · e^(−(S_ij − μ)² / (2σ²))

w_ij = ln( (p_mates · f_T(S_ij)) / ((1 − p_mates) · f_F(S_ij)) )
     = ln( (p_mates · σ_F) / ((1 − p_mates) · σ_T) ) + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
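The mate-probability weight can be sketched numerically. The parameters μ_T, σ_T, μ_F, σ_F, and p_mates below are illustrative values, not estimates fit from real expression data.

```python
import math

def normal_pdf(x, mu, sigma):
    """Gaussian density f(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def edge_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """w_ij = ln( p * f_T(S_ij) / ((1 - p) * f_F(S_ij)) ): log-odds that i, j are mates."""
    ft = normal_pdf(s_ij, mu_t, sigma_t)
    ff = normal_pdf(s_ij, mu_f, sigma_f)
    return math.log(p_mates * ft / ((1 - p_mates) * ff))
```

With mates centered near μ_T = 0.8 and non-mates near μ_F = 0.0, a high similarity gives a positive weight and a low similarity a negative one.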
Basic CLICK
R: singleton set
Basic_CLICK(Graph G):
    If V(G) = {v} then R.add(v)            // v is a singleton
    Else if G is a kernel then Output(V(G))
    Else:
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
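A runnable sketch of this recursion, parameterized by an `is_kernel` test and a `min_weight_cut` routine supplied by the caller (the actual CLICK versions of those two routines are described on the following slides):

```python
def restrict(edges, vs):
    """Keep only edges with both endpoints inside vs."""
    return {(a, b): w for (a, b), w in edges.items() if a in vs and b in vs}

def basic_click(vertices, edges, is_kernel, min_weight_cut, singletons):
    """Recursively split G until each part is a kernel or a lone vertex."""
    kernels = []
    if len(vertices) == 1:
        singletons.add(next(iter(vertices)))   # lone vertex -> singleton set R
    elif is_kernel(vertices, edges):
        kernels.append(set(vertices))          # output V(G) as a kernel
    else:
        h, k = min_weight_cut(vertices, edges) # split on the minimum-weight cut
        kernels += basic_click(h, restrict(edges, h), is_kernel, min_weight_cut, singletons)
        kernels += basic_click(k, restrict(edges, k), is_kernel, min_weight_cut, singletons)
    return kernels
```

With stub callables (e.g. "kernel = all edge weights positive"), a four-vertex graph with two positive pairs joined by negative edges splits into two kernels.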
CLICK Kernel
• Decision problem: is V…
– a singleton (|V| = 1)?
– a subset of 2+ clusters (needs further partitioning)?
– a subset of a single cluster (a kernel)?
CLICK Kernel
• For all possible cuts C connecting V:
– H0^C: cut C disconnects two clusters
– H1^C: cut C partitions a kernel
– If Pr(H0^C) > Pr(H1^C) for any C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

W(C) = ln( Pr(C | H1^C) / Pr(C | H0^C) ) = Σ_{(i,j) ∈ C} w_ij
Minimal Weight Cut
• Choose one vertex as the source; mark it visited
• Repeatedly mark the most highly connected unmarked vertex
• The last node t marked represents the cut
– W(C_t) = Σ_i w_it
– Merge t with the second-to-last marked node s
– Remove t from V
– Repeat until |V| = 1
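The steps above are the Stoer–Wagner algorithm; a compact sketch over an adjacency matrix follows. It assumes a symmetric, nonnegative weight matrix (CLICK's log-odds weights can be negative, which needs extra handling not shown here).

```python
def min_weight_cut(weights):
    """Stoer-Wagner global minimum cut over a symmetric weight matrix.

    Returns (cut weight, one side of the best cut as a vertex list)."""
    n = len(weights)
    w = [row[:] for row in weights]        # working copy; rows merge in place
    groups = [[i] for i in range(n)]       # original vertices in each super-node
    active = list(range(n))
    best_weight, best_cut = float("inf"), []
    while len(active) > 1:
        # one minimum-cut phase: grow A by the most tightly connected vertex
        a = [active[0]]
        conn = {v: w[active[0]][v] for v in active[1:]}
        while conn:
            t = max(conn, key=conn.get)
            last_conn = conn.pop(t)        # W(C_t) when t is added last
            a.append(t)
            for v in conn:
                conn[v] += w[t][v]
        s, t = a[-2], a[-1]
        if last_conn < best_weight:        # cut-of-the-phase: t vs. the rest
            best_weight, best_cut = last_conn, groups[t][:]
        groups[s] += groups[t]             # merge t into the second-to-last s
        for v in active:
            w[s][v] += w[t][v]
            w[v][s] = w[s][v]
        active.remove(t)
    return best_weight, best_cut
```

On a graph with two heavy pairs joined by a single light edge, the light edge is the global minimum cut.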
Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
MinWeightCut Example
[Figures: worked example of successive minimum-weight-cut phases on a small graph]
CLICK Adoption & Merging
• Basic_CLICK produces kernels – not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
– In both cases, only merge/adopt above some similarity threshold
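The adoption step can be sketched as follows. The dot-product similarity against the kernel's mean fingerprint and the threshold handling are assumptions consistent with the slides, not the paper's exact procedure.

```python
import numpy as np

def mean_fingerprint(cluster, fp):
    """Cluster fingerprint = mean of the member fingerprints."""
    return np.mean([fp[g] for g in cluster], axis=0)

def adopt(kernels, singletons, fp, threshold):
    """Attach each singleton to its most similar kernel if the similarity
    clears the threshold; return the singletons left unassigned."""
    leftover = set()
    for g in singletons:
        sims = [float(np.dot(fp[g], mean_fingerprint(k, fp))) for k in kernels]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            kernels[best].add(g)
        else:
            leftover.add(g)
    return leftover
```

A singleton whose fingerprint points the same way as one kernel's mean is adopted by that kernel.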
Full CLICK
R: singleton set
Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
Results & Analysis
• Two metrics:
– Similarity between result clusters
– Similarity of clusters with the paper's results
• Map clusters to find regions of similar data
• Compare best clusters to find relationships
[Nine scatter plots, one per stimulus — BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. typhi, and S. typhimurium — each plotting expression at 1 hour (x-axis) against 6 hours (y-axis), with the Bayesian, CLICK, and SOM cluster assignments overlaid.]
Error Metric
• Scores higher for clusters that contain either very many or very few members of the truth set
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated
• High error implies the truth clusters were well distributed

Error – Bayesian: 4.33, CLICK: 2.44, SOM: 18.22
E = ( Σ_{i=1}^{k} (1 − |2·|T ∩ C_i| / |C_i| − 1|) · |C_i| ) / ( Σ_{i=1}^{k} |C_i| )

where T is the truth cluster and C_1, …, C_k are the result clusters.
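A runnable sketch of an error metric with these properties (illustrative: the exact per-cluster weighting is inferred from the slide's bullets, so treat the details as an assumption):

```python
def cluster_error(clusters, truth):
    """Error is low when each cluster is almost entirely inside or outside
    the truth set, and high when clusters mix the two evenly."""
    total = sum(len(c) for c in clusters)
    e = 0.0
    for c in clusters:
        frac = len(c & truth) / len(c)         # fraction of the cluster in truth
        e += (1 - abs(2 * frac - 1)) * len(c)  # 0 if all-in or all-out, max at 50/50
    return e / total
```

Clusters that lie entirely inside or outside the truth set give zero error; clusters split 50/50 give the maximum.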
[Table: per-cluster comparison of the Bayesian, CLICK, and SOM results for clusters 0–34 — columns give the percent of our clusters, the percent of the paper's cluster, cluster quality, and minimum cluster composition — followed by the overall error totals for the three methods.]
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Gene Expression ‐ Data
bull To measure gene expression measure the mRNA being produced vs base production
bull Data mRNA production vs microarray test
E Coli E coli E coli E coli EHEC EHEC EHEC EHEC EHEC
Gene 1 hr 6 hr 12 hr 24 hr 1 hr 2 hr 6 hr 12 hr 24 hr
GCSF 0083 2615 2007 196 0001 0714 3642 3138 2229
GMCSF 0722 2002 0940 121 1034 1430 2961 2920 2352
IL12B 0845 477 4369 3454 0426 ‐0426 4316 4816 3671
IL1RN 0548 1732 1938 1389 0781 ‐127 1344 1419 1301
IL6 3006 5244 3897 3957 3889 4106 4396 4137 3990
Gene Expression ‐ Clustering
bull Groups together data with similar properties
bull Data sets are partitioned ndash clusters contain points more similar to themselves than others
bull Clustering aids researchers to infer relationships between genes especially when cellular functions are known
bull Also helps identify relationships in co‐expressed genes
Human Macrophage Activation Programs induced by Pathogens
bull Originating source of our data
bull 6800 genes 43 microarray tests
bull Significance tests reduce this to 977 genes
bull Results were clustered to find relevance
bull 198 genes expressed with the same pattern
Gerard J Nau Joan F L Richmond Ann Schlesinger Ezra G Jennings Eric S Lander and Richard A Young Human macrophage activation programs induced by bacterial pathogens Proceedings of the National Academy of Sciences of the United States of America
Vol 99 No 3 Feb 5 2002
Project Goals
bull Implement 3 Clustering algorithmsndash Bayesian Clustering Self‐Organizing Maps CLICK
bull Find most similar clusters ndash relevant similarity
bull Compare paperrsquos best cluster to our resultsndash How many of the 198 genes did we find
K‐Means
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Data
Initial Clusters
K‐Means
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Contributes to Closest Cluster
K‐Means
[31] Jiang D Tang C and Zhang A Cluster Analysis for Gene Expression Data A Survey
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Data
Initial Clusters
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
High Contribution to Closest Cluster
Low Contribution to Other Cluster
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Weak contributors removed
Bayesian Clustering
bull Initialize Clusters
bull Until Convergencendash Expectation Compute new probabilities
ndash Maximization Compute new clusters
bull Use BIC to prune clusters
bull Compute global BIC
bull Repeat with different initial clustersndash Keep results with best BIC
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Normal Distribution
πσ
σ
2)(
220 2)( xxexN
minusminus
=
n
xxexN)2(||
)(2)()( 1
π
μμ
Σ=
minusΣminusminus minus
nk
ddtk
tki
tkik
tkiedP
)2(||)|(
2)()( 1
πμμ
μμ
Σ=isin
minusΣminusminus minus
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Expectation Non‐spherical Clusters
Tk
n
kk DD
⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢
⎣
⎡
=Σ
λ
λλ
00
0000
2
1
bull Dk=Eigenvectors of Σ are the orientation of cluster k
bull λ1 λn=Eigenvalues of Σ are the radii of cluster k
bull Spherical clusters are usually sufficient (Identity Matrix)
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Compute initial probabilities
bull Given the initial clusters initialize the probabilities with the Normal Distribution
bull Since the probabilities will have to be normalized anyway we can skip the constant
2)()(00 00
)|( kiki ddkki edP μμμμ minussdotminusminus=isin
Expectation Bayersquos law
)()()|()|(
EffectPCausePCauseEffectPEffectCauseP sdot
=
Posterior Prior
sum sdotsdot
=)()|(
)()|()|(CausePCauseEffectP
CausePCauseEffectPEffectCauseP
Expectation Bayersquos law
sum sum
sum
= =
minusminus
=
minusminus
isinsdotisin
isinsdotisin=isin C
j
n
i
tji
tj
tj
tji
n
i
tki
tk
tk
tki
tki
tk
dPdP
dPdPdP
1 1
11
1
11
)|()|(
)|()|()|(
μμμμ
μμμμμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Recompute membership probabilities using normal
distribution
New Probability Old Probabilities
Maximization New Clusters
bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster
sum
sum
=
minusminus
=
minusminus
isin
sdotisin= n
i
tki
tk
n
ii
tki
tk
tk
dP
ddP
1
11
1
11
)|(
)|(
μμ
μμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Bayesian Information Criterion
bull BIC measures the efficiency of the parameterized model in terms of predicting the data
bull It is independent to prior knowledge and the model used
)ln()|(
ln 1
2
nkn
dPnBIC
n
i
tk
tki
tk sdot+
⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜
⎝
⎛isin
sdot=sum=
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
Fit quality Complexity penalty
Bayesian Information Criterion
bull After convergence compute global BIC
)ln()|(
ln 1 1
2
nknk
dPnBIC
n
i
k
jjji
sdot+
⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜
⎝
⎛isin
sdot=sumsum= =
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
GCSF
GMCSF
IL12B
IL1RN
IL6
IL6
PBEF
proIL1B
TNFA
IL8
IL8
IP10
MCP1
MGSA
MIP1A
MIP1B
MIP2A
MIP2B
RANTES
CD44
CD44
ICAM1
IFITM1
LAMB3
NINJ1
TNFAIP6
ADORA2A
CCR6
CCR7
CCRL2
DTR
1
2
37
11
6
4
10
5
8 12
9
13
14
15
17
16
18
19
20
22
21
23
24
25
26
27
28
29
30
Initial ClustersE coliTest Data
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Gene Expression ‐ Clustering
bull Groups together data with similar properties
bull Data sets are partitioned ndash clusters contain points more similar to themselves than others
bull Clustering aids researchers to infer relationships between genes especially when cellular functions are known
bull Also helps identify relationships in co‐expressed genes
Human Macrophage Activation Programs induced by Pathogens
bull Originating source of our data
bull 6800 genes 43 microarray tests
bull Significance tests reduce this to 977 genes
bull Results were clustered to find relevance
bull 198 genes expressed with the same pattern
Gerard J Nau Joan F L Richmond Ann Schlesinger Ezra G Jennings Eric S Lander and Richard A Young Human macrophage activation programs induced by bacterial pathogens Proceedings of the National Academy of Sciences of the United States of America
Vol 99 No 3 Feb 5 2002
Project Goals
bull Implement 3 Clustering algorithmsndash Bayesian Clustering Self‐Organizing Maps CLICK
bull Find most similar clusters ndash relevant similarity
bull Compare paperrsquos best cluster to our resultsndash How many of the 198 genes did we find
K‐Means
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Data
Initial Clusters
K‐Means
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Contributes to Closest Cluster
K‐Means
[31] Jiang D Tang C and Zhang A Cluster Analysis for Gene Expression Data A Survey
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Data
Initial Clusters
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
High Contribution to Closest Cluster
Low Contribution to Other Cluster
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Weak contributors removed
Bayesian Clustering
bull Initialize Clusters
bull Until Convergencendash Expectation Compute new probabilities
ndash Maximization Compute new clusters
bull Use BIC to prune clusters
bull Compute global BIC
bull Repeat with different initial clustersndash Keep results with best BIC
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Normal Distribution
πσ
σ
2)(
220 2)( xxexN
minusminus
=
n
xxexN)2(||
)(2)()( 1
π
μμ
Σ=
minusΣminusminus minus
nk
ddtk
tki
tkik
tkiedP
)2(||)|(
2)()( 1
πμμ
μμ
Σ=isin
minusΣminusminus minus
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Expectation Non‐spherical Clusters
Tk
n
kk DD
⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢
⎣
⎡
=Σ
λ
λλ
00
0000
2
1
bull Dk=Eigenvectors of Σ are the orientation of cluster k
bull λ1 λn=Eigenvalues of Σ are the radii of cluster k
bull Spherical clusters are usually sufficient (Identity Matrix)
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Compute initial probabilities
bull Given the initial clusters initialize the probabilities with the Normal Distribution
bull Since the probabilities will have to be normalized anyway we can skip the constant
2)()(00 00
)|( kiki ddkki edP μμμμ minussdotminusminus=isin
Expectation Bayersquos law
)()()|()|(
EffectPCausePCauseEffectPEffectCauseP sdot
=
Posterior Prior
sum sdotsdot
=)()|(
)()|()|(CausePCauseEffectP
CausePCauseEffectPEffectCauseP
Expectation Bayersquos law
sum sum
sum
= =
minusminus
=
minusminus
isinsdotisin
isinsdotisin=isin C
j
n
i
tji
tj
tj
tji
n
i
tki
tk
tk
tki
tki
tk
dPdP
dPdPdP
1 1
11
1
11
)|()|(
)|()|()|(
μμμμ
μμμμμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Recompute membership probabilities using normal
distribution
New Probability Old Probabilities
Maximization New Clusters
bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster
sum
sum
=
minusminus
=
minusminus
isin
sdotisin= n
i
tki
tk
n
ii
tki
tk
tk
dP
ddP
1
11
1
11
)|(
)|(
μμ
μμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Bayesian Information Criterion
bull BIC measures the efficiency of the parameterized model in terms of predicting the data
bull It is independent to prior knowledge and the model used
)ln()|(
ln 1
2
nkn
dPnBIC
n
i
tk
tki
tk sdot+
⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜
⎝
⎛isin
sdot=sum=
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
Fit quality Complexity penalty
Bayesian Information Criterion
bull After convergence compute global BIC
)ln()|(
ln 1 1
2
nknk
dPnBIC
n
i
k
jjji
sdot+
⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜
⎝
⎛isin
sdot=sumsum= =
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
GCSF
GMCSF
IL12B
IL1RN
IL6
IL6
PBEF
proIL1B
TNFA
IL8
IL8
IP10
MCP1
MGSA
MIP1A
MIP1B
MIP2A
MIP2B
RANTES
CD44
CD44
ICAM1
IFITM1
LAMB3
NINJ1
TNFAIP6
ADORA2A
CCR6
CCR7
CCRL2
DTR
1
2
37
11
6
4
10
5
8 12
9
13
14
15
17
16
18
19
20
22
21
23
24
25
26
27
28
29
30
Initial ClustersE coliTest Data
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
• The BMU is the node in the SOM at the minimum distance from the training vector. It is typically found using the Euclidean distance formula:

d(v, w) = √( Σ_{i=1}^{k} (v_i − w_i)² )

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.
Neighborhood
• The area of the neighborhood shrinks over time. The exponential decay function can be used for this:

σ(t) = σ₀ · exp( −t / λ ), with λ = n / ln(σ₀)

where n is the number of iterations that the algorithm will run and σ₀ is the initial size of the neighborhood.
Weight Adjustment
• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node:

W(t+1) = W(t) + Θ(t) · L(t) · ( V(t) − W(t) )

• One decay function, L(t), decreases the learning rate over time
• The other, Θ(t), decreases the rate of learning for neighbors further away from the BMU
Mapping Mode (SOM)
• After the learning algorithm has been run on the map, the input is mapped onto the SOM, and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
• http://www.broad.mit.edu/cancer/software/genepattern
• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
• Use online or download
GenePattern
Module name: SOMClustering. Description: Self‐Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp‐help@broad.mit.edu. Date: 10/28/03. Release: 1.0. Summary:
The Self-Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 × 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g., stocks, mutual funds, spectral peaks, etc.)
Final Results
• http://my.fit.edu/~sellings/finalClusters.zip
• Arranged input data into 9 clusters using the GenePattern software
• Why 9 clusters?
– Used the average of Stephen's output (8) + 1 to get a square map (3×3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
• Initialize Graph G = (V, E)
– Vertex v: a single gene "fingerprint" vector
– Edge e: pairwise similarity between genes
– Cut C: subset of E that partitions the graph
– Cluster c: subset of V
– Intersection of clusters c_i, c_j (i ≠ j) is ∅
– Fingerprint of a cluster c: mean_vector(c)
Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.
CLICK Preprocessing
• Input: n x p matrix M of values
• n: genes, p: tests
• Data must be normalized
• Similarity measure
– S(v, u) = v · u = |v| |u| cos θ
– Proportional to cos θ only when the norm is fixed for all v ∈ V
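The normalization and dot-product similarity above are easy to compute directly. A small sketch (function names are illustrative) — after scaling every fingerprint to unit norm, the dot product equals cos θ:

```python
import math

def normalize(v):
    """Scale a fingerprint vector to unit norm, as the preprocessing step requires."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity(v, u):
    """Dot product of unit vectors = cos(theta): the pairwise gene similarity."""
    return sum(a * b for a, b in zip(v, u))

# two genes with the same expression pattern but different magnitudes
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 4.0, 4.0])
# similarity(a, b) is 1.0: identical patterns, regardless of magnitude
```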
CLICK Similarity
• Key idea: S follows a normalized mixture distribution

For mean μ_T, variance σ_T²: f_T(x | μ_T, σ_T²) is the pdf for elements u, v ∈ V in the same cluster
For mean μ_F, variance σ_F²: f_F(x | μ_F, σ_F²) is the pdf for elements u, v ∈ V in different clusters
Basic CLICK
• M: n x p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates

f(S_ij | mates) = exp( −(S_ij − μ_T)² / (2σ_T²) ) / ( √(2π) · σ_T )

w_ij = ln( p_mates · f(S_ij | μ_T, σ_T) / ( (1 − p_mates) · f(S_ij | μ_F, σ_F) ) )
     = ln( p_mates · σ_F / ( (1 − p_mates) · σ_T ) ) + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
Basic CLICK
R: Singleton Set
Basic_CLICK(Graph G)
    If V(G) = {v} then        // v is a singleton
        R.add(v)
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
CLICK Kernel
• Decision problem: is V…
– a singleton (|V| = 1)
– a subset of 2+ clusters (need to partition more)
– a subset of a single cluster (a kernel)
CLICK Kernel
• For all possible cuts C connecting V
– H0^C: cut C disconnects two clusters
– H1^C: cut C partitions a kernel
– If H0^C > H1^C for any C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into sub‐graphs H, K
W(C) = ln( Pr(H1^C | C) / Pr(H0^C | C) ) = Σ_{(i,j) ∈ C} w_ij
Minimal Weight Cut
• Choose one vertex as the source; mark it visited
• Repeatedly mark the next most highly connected vertex
• The last node t represents the cut
– W(C_t) = Σ_i w_it
– Merge t with the second-to-last marked node s
– Remove t from V
– Repeat until |V| = 1
Mechthild Stoer and Frank Wagner. A Simple Min‐Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
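The grow-merge-repeat procedure above is the Stoer–Wagner algorithm. A compact sketch of it in Python (illustrative, assuming a symmetric dict-of-dicts weighted graph; not the project's implementation):

```python
def min_weight_cut(weights):
    """Stoer-Wagner global minimum cut.
    weights: symmetric dict of dicts, weights[u][v] = positive edge weight.
    Returns (cut_weight, vertices on one side of the best cut)."""
    groups = {v: [v] for v in weights}              # original vertices per supernode
    w = {u: dict(nbrs) for u, nbrs in weights.items()}
    best = (float("inf"), None)
    while len(w) > 1:
        # one minimum-cut phase: grow A by the most tightly connected vertex
        start = next(iter(w))
        A = [start]
        conn = dict(w[start])                       # connectivity of outsiders to A
        while len(A) < len(w):
            z = max((v for v in w if v not in A), key=lambda v: conn.get(v, 0))
            A.append(z)
            for v, wt in w[z].items():
                if v not in A:
                    conn[v] = conn.get(v, 0) + wt
        s, t = A[-2], A[-1]
        cut_weight = sum(w[t].values())             # W(C_t) = sum of w_it
        if cut_weight < best[0]:
            best = (cut_weight, list(groups[t]))
        groups[s] += groups.pop(t)                  # merge t into s
        for v, wt in w.pop(t).items():
            w[v].pop(t, None)
            if v != s:
                w[v][s] = w[s][v] = w[s].get(v, 0) + wt
    return best
```

On a tightly connected triangle with one vertex hanging off a light edge, the minimum cut isolates the hanging vertex.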
MinWeightCut Example
MinWeightCut
CLICK Adoption & Merging
• Basic_CLICK produces kernels – not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
– In both cases, only merge/adopt above some threshold
Full CLICK
R: Singleton Set
Full_CLICK(Graph G = (V, E))
    R <- V
    While |R| is reduced
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
Results & Analysis
• Two metrics:
– Similarity between result clusters
– Similarity of clusters with the paper's results
• Map clusters to find regions of similar data
• Compare best clusters to find relationships
[Figure: BCG — 1-hour vs. 6-hour expression; Bayesian, Click, and SOM clusters]
[Figure: E. coli — 1-hour vs. 6-hour expression; Bayesian, Click, and SOM clusters]
[Figure: EHEC — 1-hour vs. 6-hour expression; Bayesian, Click, and SOM clusters]
[Figure: L. monocytogenes — 1-hour vs. 6-hour expression; Bayesian, Click, and SOM clusters]
[Figure: Latex — 1-hour vs. 6-hour expression; Bayesian, Click, and SOM clusters]
[Figure: M. tuberculosis — 1-hour vs. 6-hour expression; Bayesian, Click, and SOM clusters]
[Figure: S. aureus — 1-hour vs. 6-hour expression; Bayesian, Click, and SOM clusters]
[Figure: S. Typhi — 1-hour vs. 6-hour expression; Bayesian, Click, and SOM clusters]
[Figure: S. Typhimurium — 1-hour vs. 6-hour expression; Bayesian, Click, and SOM clusters]
Error Metric
• Scores higher for clusters that contain either very many or very few members of the truth set
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated with ours
• High error implies the truth clusters were distributed across ours
Error — Bayesian: 4.33, Click: 2.44, SOM: 1.822
E = Σ_{i=1}^{k} [ (1 − 2·|T_i ∩ C_i| / (|T_i| + |C_i|)) · |C_i| ]  /  Σ_{i=1}^{k} |C_i|
[Table: per-cluster comparison of the three methods — Percent of Our Clusters, Percent of Paper Cluster, and Cluster Quality for Bayesian, Click, and SOM (clusters 0–34), plus Min Cluster Composition at thresholds from 0% to 100%. Bottom-line error — Bayesian: 4.22929 (4.33), Click: 2.38409 (2.44), SOM: 1.78025 (1.822)]
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayes' law
- Expectation Bayes' law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption & Merging
- Full CLICK
- Results & Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
Human Macrophage Activation Programs induced by Pathogens
• Originating source of our data
• 6800 genes, 43 microarray tests
• Significance tests reduce this to 977 genes
• Results were clustered to find relevance
• 198 genes expressed with the same pattern
Gerard J. Nau, Joan F. L. Richmond, Ann Schlesinger, Ezra G. Jennings, Eric S. Lander, and Richard A. Young. Human macrophage activation programs induced by bacterial pathogens. Proceedings of the National Academy of Sciences of the United States of America, Vol. 99, No. 3, Feb 5, 2002.
Project Goals
• Implement 3 clustering algorithms
– Bayesian Clustering, Self‐Organizing Maps, CLICK
• Find the most similar clusters – relevant similarity
• Compare the paper's best cluster to our results
– How many of the 198 genes did we find?
K‐Means
[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis
Data
Initial Clusters
K‐Means
Contributes to Closest Cluster
K‐Means
[31] Jiang, D., Tang, C., and Zhang, A. Cluster Analysis for Gene Expression Data: A Survey
Clusters adjust
Bayesian Clustering
Data
Initial Clusters
Bayesian Clustering
High Contribution to Closest Cluster
Low Contribution to Other Cluster
Bayesian Clustering
Clusters adjust
Bayesian Clustering
Weak contributors removed
Bayesian Clustering
• Initialize clusters
• Until convergence
– Expectation: compute new probabilities
– Maximization: compute new clusters
• Use BIC to prune clusters
• Compute the global BIC
• Repeat with different initial clusters
– Keep the results with the best BIC
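The expectation/maximization loop above can be sketched for one-dimensional data with unit-variance spherical clusters. This is a minimal sketch under those assumptions — the deck's version works on gene-expression vectors and adds BIC-based pruning:

```python
import math

def expectation(data, means):
    """E-step: P(d_i in cluster k) via a unit-variance normal, normalized over clusters."""
    probs = []
    for d in data:
        dens = [math.exp(-(d - m) ** 2 / 2) for m in means]
        total = sum(dens)
        probs.append([p / total for p in dens])
    return probs

def maximization(data, probs, k):
    """M-step: each mean is the probability-weighted average of the points."""
    return [sum(p[j] * d for p, d in zip(probs, data)) / sum(p[j] for p in probs)
            for j in range(k)]

def em_cluster(data, means, iters=50):
    for _ in range(iters):
        probs = expectation(data, means)
        means = maximization(data, probs, len(means))
    return means

# two well-separated groups converge to their group averages
means = em_cluster([0.0, 0.2, -0.1, 5.0, 5.2, 4.9], [0.5, 4.0])
```

Unlike K-means, every point contributes to every cluster mean, weighted by its membership probability; with well-separated groups the weights become nearly hard assignments.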
Normal Distribution
N(x) = exp( −(x − x₀)² / (2σ²) ) / ( σ√(2π) )

N(x) = exp( −(x − μ)^T Σ⁻¹ (x − μ) / 2 ) / √( (2π)ⁿ |Σ| )

P(d_i ∈ μ_k^t) = exp( −(d_i − μ_k^t)^T Σ_k⁻¹ (d_i − μ_k^t) / 2 ) / √( (2π)ⁿ |Σ_k| )
Expectation Non‐spherical Clusters
Σ_k = D_k · diag(λ_1, λ_2, …, λ_n) · D_k^T
• D_k = eigenvectors of Σ, giving the orientation of cluster k
• λ_1, …, λ_n = eigenvalues of Σ, giving the radii of cluster k
• Spherical clusters are usually sufficient (identity matrix)
Compute initial probabilities
• Given the initial clusters, initialize the probabilities with the normal distribution
• Since the probabilities will have to be normalized anyway, we can skip the constant
P(d_i ∈ μ_k^0) = exp( −(d_i − μ_k^0) · (d_i − μ_k^0) / 2 )
Expectation: Bayes' law

P(Cause | Effect) = P(Effect | Cause) · P(Cause) / P(Effect)

P(Cause | Effect) is the posterior; P(Cause) is the prior

P(Cause | Effect) = P(Effect | Cause) · P(Cause) / Σ [ P(Effect | Cause) · P(Cause) ]
Expectation: Bayes' law

P(d_i ∈ μ_k^t) = P(d_i ∈ μ_k^{t−1}) · Σ_{i=1}^{n} P(d_i ∈ μ_k^{t−1})  /  Σ_{j=1}^{C} [ P(d_i ∈ μ_j^{t−1}) · Σ_{i=1}^{n} P(d_i ∈ μ_j^{t−1}) ]

Recompute the membership probabilities using the normal distribution: the new probability (left side) is computed from the old probabilities (right side).
Maximization New Clusters
• Clusters are computed as the weighted average of each point and the probability that it belongs to the cluster

μ_k^t = Σ_{i=1}^{n} P(d_i ∈ μ_k^{t−1}) · d_i  /  Σ_{i=1}^{n} P(d_i ∈ μ_k^{t−1})
Bayesian Information Criterion
• BIC measures the efficiency of the parameterized model in terms of predicting the data
• It is independent of prior knowledge and the model used
BIC_k^t = n · ln( (1/n) · Σ_{i=1}^{n} P(d_i ∈ μ_k^t) ) + k · ln(n)
[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
Fit quality: the log-likelihood term. Complexity penalty: the k · ln(n) term.
Bayesian Information Criterion
• After convergence, compute the global BIC
BIC = n · ln( (1/n) · Σ_{i=1}^{n} Σ_{j=1}^{k} P(d_i ∈ μ_j) ) + k · ln(n)
[Figure: Initial Clusters, E. coli test data — assignment of the 30 test genes (GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR) to numbered initial clusters]
Initial ClustersE coliTest Data
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Project Goals
bull Implement 3 Clustering algorithmsndash Bayesian Clustering Self‐Organizing Maps CLICK
bull Find most similar clusters ndash relevant similarity
bull Compare paperrsquos best cluster to our resultsndash How many of the 198 genes did we find
K‐Means
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Data
Initial Clusters
K‐Means
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Contributes to Closest Cluster
K‐Means
[31] Jiang D Tang C and Zhang A Cluster Analysis for Gene Expression Data A Survey
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Data
Initial Clusters
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
High Contribution to Closest Cluster
Low Contribution to Other Cluster
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Weak contributors removed
Bayesian Clustering
bull Initialize Clusters
bull Until Convergencendash Expectation Compute new probabilities
ndash Maximization Compute new clusters
bull Use BIC to prune clusters
bull Compute global BIC
bull Repeat with different initial clustersndash Keep results with best BIC
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Normal Distribution
πσ
σ
2)(
220 2)( xxexN
minusminus
=
n
xxexN)2(||
)(2)()( 1
π
μμ
Σ=
minusΣminusminus minus
nk
ddtk
tki
tkik
tkiedP
)2(||)|(
2)()( 1
πμμ
μμ
Σ=isin
minusΣminusminus minus
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Expectation Non‐spherical Clusters
Tk
n
kk DD
⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢
⎣
⎡
=Σ
λ
λλ
00
0000
2
1
bull Dk=Eigenvectors of Σ are the orientation of cluster k
bull λ1 λn=Eigenvalues of Σ are the radii of cluster k
bull Spherical clusters are usually sufficient (Identity Matrix)
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Compute initial probabilities
bull Given the initial clusters initialize the probabilities with the Normal Distribution
bull Since the probabilities will have to be normalized anyway we can skip the constant
2)()(00 00
)|( kiki ddkki edP μμμμ minussdotminusminus=isin
Expectation Bayersquos law
)()()|()|(
EffectPCausePCauseEffectPEffectCauseP sdot
=
Posterior Prior
sum sdotsdot
=)()|(
)()|()|(CausePCauseEffectP
CausePCauseEffectPEffectCauseP
Expectation Bayersquos law
sum sum
sum
= =
minusminus
=
minusminus
isinsdotisin
isinsdotisin=isin C
j
n
i
tji
tj
tj
tji
n
i
tki
tk
tk
tki
tki
tk
dPdP
dPdPdP
1 1
11
1
11
)|()|(
)|()|()|(
μμμμ
μμμμμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Recompute membership probabilities using normal
distribution
New Probability Old Probabilities
Maximization New Clusters
bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster
sum
sum
=
minusminus
=
minusminus
isin
sdotisin= n
i
tki
tk
n
ii
tki
tk
tk
dP
ddP
1
11
1
11
)|(
)|(
μμ
μμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Bayesian Information Criterion
bull BIC measures the efficiency of the parameterized model in terms of predicting the data
bull It is independent to prior knowledge and the model used
$$BIC_k = \underbrace{2\sum_{i=1}^{n} \ln P(d_i \in \mu_k^t)}_{\text{fit quality}} + \underbrace{k\cdot\ln(n)}_{\text{complexity penalty}}$$

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
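A direct transcription of a fit-plus-penalty BIC score. Note the slides' sign convention (positive log-likelihood term plus the penalty) differs from the more common definition, BIC = k·ln(n) − 2·ln(L); the function name and argument layout are ours:

```python
import numpy as np

def bic_score(log_likelihoods, n_params, n):
    """BIC as a fit-quality term plus a complexity penalty.

    log_likelihoods: per-point log-likelihoods ln P(d_i in mu_k);
    n_params: number of free parameters k; n: number of data points.
    """
    return 2.0 * np.sum(log_likelihoods) + n_params * np.log(n)
```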
Bayesian Information Criterion
bull After convergence compute global BIC
$$BIC = 2\sum_{i=1}^{n}\sum_{j=1}^{k} \ln P(d_i \in \mu_j) + k\cdot n\cdot\ln(n)$$
[Figure: E-Coli Initial Clusters (Test Data) – 1-hour vs. 12-hour expression for 30 genes (GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR), each starting in its own initial cluster.]

[Figure: E-Coli Final Clusters (Test Data) – the same genes after convergence, grouped into Clusters 1–9, 1-hour vs. 12-hour expression.]
[Figure: EHEC – 1-hour vs. 6-hour expression; Clusters 0–9, rational vs. discrete clusters.]

[Figure: S. aureus – 1-hour vs. 6-hour expression; Clusters 0–7, rational vs. discrete clusters.]

[Figure: E. coli – 1-hour vs. 6-hour expression; Clusters 0–7, rational vs. discrete clusters.]

[Figure: S. Typhimurium – 1-hour vs. 6-hour expression; Clusters 0–12, rational vs. discrete clusters.]
Self-Organizing Maps

• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values, so similar inputs end up mapped close to each other.

n ← number of iterations for training algorithm
V ← set of learning vectors (same dimension as input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n
        v ← V(random)
        BMU ← min(d(v, SOM)) for every node in SOM
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
    end
• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79–96.
• Kohonen's Self-Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
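The training pseudocode above can be sketched compactly with numpy. The grid size, decay constant, and Gaussian neighborhood falloff below are illustrative choices, not the slides' exact settings:

```python
import numpy as np

def train_som(data, grid=(3, 3), n_iter=300, sigma0=1.5, lr0=0.5, seed=0):
    """Minimal SOM training loop following the slide's pseudocode."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    som = rng.random((rows, cols, data.shape[1]))   # SOM <- random values
    # grid coordinates of every node, for neighborhood distances
    coords = np.indices((rows, cols)).transpose(1, 2, 0)
    lam = n_iter / np.log(sigma0)                   # decay constant
    for t in range(n_iter):
        v = data[rng.integers(len(data))]           # v <- V(random)
        dists = np.linalg.norm(som - v, axis=2)     # distance of v to every node
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        sigma = sigma0 * np.exp(-t / lam)           # shrinking neighborhood
        lr = lr0 * np.exp(-t / lam)                 # decaying learning rate
        grid_d2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
        theta = np.exp(-grid_d2 / (2 * sigma ** 2)) # falloff away from the BMU
        som += lr * theta[..., None] * (v - som)    # pull neighbors toward v
    return som
```

After training, each input is assigned to the grid position of its BMU, which is how genes get mapped to clusters.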
Best Matching Unit

• The BMU is the node in the SOM whose weight vector has the minimum distance to the training vector. It is typically found using the Euclidean distance:

$$dist = \sqrt{\sum_{i=1}^{k} (v_i - w_i)^2}$$

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.
Neighborhood

• The area of the neighborhood shrinks over time; the exponential decay function can be used for this:

$$\sigma(t) = \sigma_0\, e^{-t/\lambda}, \qquad \lambda = n / \ln(\sigma_0)$$

where n is the number of iterations the algorithm will run and σ₀ is the initial size of the neighborhood.
Weight Adjustment

• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node: W(t+1) = W(t) + Θ(t)·L(t)·(V(t) − W(t))
• One decay function, L(t), decreases the learning rate over time
• The other, Θ(t), decreases the rate of learning for neighbors further away from the BMU
Mapping Mode (SOM)

• After the learning algorithm has been run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs end up near each other on the two-dimensional map.
GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern
• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
• Use online or download

GenePattern

Module name: SOMClustering
Description: Self-Organizing Maps algorithm
Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu
Date: 10/28/03  Release: 1.0

Summary: The Self-Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g., stocks, mutual funds, spectral peaks, etc.).
Final Results

• http://my.fit.edu/~sellings/finalClusters.zip
• Arranged the input data into 9 clusters using the GenePattern software
• Why 9 clusters?
  – Used the average of Stephen's output (8) + 1 so a square map could be used (3x3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)

• Initialize Graph G = (V, E)
  – Vertex v: a single gene's "fingerprint" vector
  – Edge e: pairwise similarity between genes
  – Cut C: subset of E that partitions the graph
  – Cluster c: subset of V
  – Intersection of clusters c_i, c_j (i ≠ j) is ∅
  – Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.
CLICK Preprocessing

• Input: n x p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure:
  – S_vu = v·u = |v||u| cos θ
  – Proportional to cos θ only when the norm is fixed for all v ∈ V
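With every fingerprint normalized to unit length, the dot-product similarity matrix is exactly the matrix of cosines. A minimal numpy sketch (function name is ours):

```python
import numpy as np

def similarity_matrix(M):
    """S_vu = v . u for row-normalized fingerprints, so each entry
    equals cos(theta) between the two genes' fingerprint vectors."""
    V = M / np.linalg.norm(M, axis=1, keepdims=True)  # fix the norm for all v
    return V @ V.T
```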
CLICK Similarity
• Key idea: S is a normalized mixed distribution:

For $u, v \in V$ in the same cluster: $f_T(x)$ = pdf of $N(\mu_T, \sigma_T^2)$
For $u, v \in V$ in different clusters: $f_F(x)$ = pdf of $N(\mu_F, \sigma_F^2)$
Basic CLICK
• M: n x p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: weight reflecting the probability that v_i, v_j are mates

$$f(S_{ij} \mid \mu_T, \sigma_T) = \frac{1}{\sqrt{2\pi}\,\sigma_T}\, e^{-(S_{ij}-\mu_T)^2 / 2\sigma_T^2}$$

$$w_{ij} = \ln\frac{p_{mates} \cdot f(S_{ij} \mid \mu_T, \sigma_T)}{(1-p_{mates}) \cdot f(S_{ij} \mid \mu_F, \sigma_F)}$$

$$w_{ij} = \ln\frac{p_{mates}\,\sigma_F}{(1-p_{mates})\,\sigma_T} + \frac{(S_{ij}-\mu_F)^2}{2\sigma_F^2} - \frac{(S_{ij}-\mu_T)^2}{2\sigma_T^2}$$
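CLICK's edge weight is the log-likelihood ratio of the "mates" and "non-mates" normal densities; expanding the logs gives a closed form with no exponentials. A direct transcription of that expanded form (function name is ours):

```python
import math

def edge_weight(S_ij, p_mates, mu_T, sigma_T, mu_F, sigma_F):
    """CLICK edge weight: log-likelihood ratio that genes i and j are mates,
    in the expanded form (log of the normal-density ratio)."""
    return (math.log((p_mates * sigma_F) / ((1 - p_mates) * sigma_T))
            + (S_ij - mu_F) ** 2 / (2 * sigma_F ** 2)
            - (S_ij - mu_T) ** 2 / (2 * sigma_T ** 2))
```

The expansion agrees term by term with taking the log of p·f_T / ((1−p)·f_F) directly, since the 1/√(2π) factors cancel.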
Basic CLICK

R: Singleton Set
Basic_CLICK(Graph G):
    If V(G) = {v} then R.add(v)        // v is a singleton
    Else if G is a kernel then Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
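The recursion above can be sketched as a Python skeleton. The graph interface (a `vertices` attribute) and the `is_kernel` / `min_weight_cut` helpers are hypothetical placeholders supplied by the caller, not CLICK's actual implementation:

```python
def basic_click(graph, kernels, singletons, is_kernel, min_weight_cut):
    """Basic-CLICK recursion: emit singletons, output kernels,
    otherwise split on the minimum-weight cut and recurse."""
    if len(graph.vertices) == 1:
        singletons.add(next(iter(graph.vertices)))   # R.add(v)
    elif is_kernel(graph):
        kernels.append(set(graph.vertices))          # Output(V(G))
    else:
        h, k = min_weight_cut(graph)
        basic_click(h, kernels, singletons, is_kernel, min_weight_cut)
        basic_click(k, kernels, singletons, is_kernel, min_weight_cut)
```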
CLICK Kernel
• Decision problem: is V…
  – a singleton (|V| = 1)?
  – a subset of 2+ clusters (need to partition further)?
  – a subset of a single cluster (a kernel)?
CLICK Kernel

• For all possible cuts C connecting V:
  – H0^C: cut C disconnects two clusters
  – H1^C: cut C partitions a kernel
  – If Pr(H0^C) > Pr(H1^C) for any C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K
$$W(C) = \ln\frac{\Pr(H_1^C \mid C)}{\Pr(H_0^C \mid C)} = \sum_{(i,j) \in C} w_{ij}$$
Minimal Weight Cut
• Choose one vertex as the source; mark it visited
• Repeatedly mark the next most highly connected vertex
• The last node t represents the cut
  – W(C_t) = Σ_i w_it
  – Merge t with the 2nd-to-last marked node s
  – Remove t from V
  – Repeat until |V| = 1
Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
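The Stoer-Wagner procedure described above can be sketched on a symmetric weight matrix. This is a minimal illustration (names are ours), not CLICK's production min-cut code:

```python
def min_weight_cut(weights):
    """Stoer-Wagner minimum cut. weights: symmetric n x n matrix.
    Returns (cut weight, list of original vertices on one side)."""
    n = len(weights)
    w = [row[:] for row in weights]       # work on a copy
    merged = [[v] for v in range(n)]      # originals inside each super-vertex
    active = list(range(n))
    best_w, best_cut = float("inf"), None
    while len(active) > 1:
        # one minimum-cut phase: grow A from an arbitrary start vertex
        start = active[0]
        conn = {v: w[start][v] for v in active if v != start}
        order = [start]
        while conn:
            z = max(conn, key=conn.get)   # most tightly connected vertex
            order.append(z)
            del conn[z]
            for v in conn:
                conn[v] += w[z][v]
        t, s = order[-1], order[-2]
        cut_w = sum(w[t][v] for v in active if v != t)  # cut of the phase
        if cut_w < best_w:
            best_w, best_cut = cut_w, merged[t][:]
        merged[s].extend(merged[t])       # merge t into s
        for v in active:
            w[s][v] += w[t][v]
            w[v][s] = w[s][v]
        active.remove(t)                  # repeat until one vertex remains
    return best_w, best_cut
```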
MinWeightCut Example
MinWeightCut
CLICK Adoption & Merging

• Basic_CLICK produces kernels, not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
  – In both cases, only merge/adopt above some similarity threshold
Full CLICK

R: Singleton Set
Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
Results & Analysis

• Two metrics:
  – Similarity between result clusters
  – Similarity of clusters with the paper's results
• Map clusters to find regions of similar data
• Compare the best clusters to find relationships
[Figure: BCG – Bayesian, Click, and SOM clusters, 1-hour vs. 6-hour expression.]

[Figure: E. coli – Bayesian, Click, and SOM clusters, 1-hour vs. 6-hour expression.]

[Figure: EHEC – Bayesian, Click, and SOM clusters, 1-hour vs. 6-hour expression.]

[Figure: L. monocytogenes – Bayesian, Click, and SOM clusters, 1-hour vs. 6-hour expression.]

[Figure: Latex – Bayesian, Click, and SOM clusters, 1-hour vs. 6-hour expression.]

[Figure: M. tuberculosis – Bayesian, Click, and SOM clusters, 1-hour vs. 6-hour expression.]

[Figure: S. aureus – Bayesian, Click, and SOM clusters, 1-hour vs. 6-hour expression.]

[Figure: S. Typhi – Bayesian, Click, and SOM clusters, 1-hour vs. 6-hour expression.]

[Figure: S. Typhimurium – Bayesian, Click, and SOM clusters, 1-hour vs. 6-hour expression.]
Error Metric
• Scores higher for result clusters that contain either very many or very few members of the clusters in the truth set
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated with ours
• High error implies the truth clusters were spread across many of ours
Bayesian Click SOM
Error 433 244 1822
$$E = \frac{1}{k}\sum_{i=1}^{k}\left[\left(1 - \frac{2\,|T_i \cap C_i|}{|T_i| + |C_i|}\right) \cdot |C_i|\right]$$
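The slide's error formula is hard to recover exactly from the extracted text; one plausible reading is "one minus the Dice overlap between each truth cluster T_i and result cluster C_i, weighted by |C_i| and averaged over the k clusters". A sketch under that assumption (the formula choice itself is our reconstruction):

```python
def cluster_error(truth, result):
    """Hypothetical reading of the error metric: 1 minus the Dice overlap
    of each (truth, result) cluster pair, weighted by result-cluster size.

    truth, result: equal-length lists of sets of gene identifiers.
    """
    k = len(truth)
    total = 0.0
    for T, C in zip(truth, result):
        dice = 2 * len(T & C) / (len(T) + len(C))  # 1 when identical, 0 when disjoint
        total += (1 - dice) * len(C)
    return total / k
```

Identical clusterings score 0; fully disjoint clusterings score the average result-cluster size.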
[Table: per-cluster comparison of the Bayesian, Click, and SOM clusterings – Percent of Our Clusters, Percent of Paper Cluster, Cluster Quality, and Min Cluster Composition for each cluster.]
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation: Bayes' law
- Expectation: Bayes' law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results & Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
K‐Means
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Data
Initial Clusters
K‐Means
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Contributes to Closest Cluster
K‐Means
[31] Jiang D Tang C and Zhang A Cluster Analysis for Gene Expression Data A Survey
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Data
Initial Clusters
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
High Contribution to Closest Cluster
Low Contribution to Other Cluster
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Weak contributors removed
Bayesian Clustering
bull Initialize Clusters
bull Until Convergencendash Expectation Compute new probabilities
ndash Maximization Compute new clusters
bull Use BIC to prune clusters
bull Compute global BIC
bull Repeat with different initial clustersndash Keep results with best BIC
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Normal Distribution
πσ
σ
2)(
220 2)( xxexN
minusminus
=
n
xxexN)2(||
)(2)()( 1
π
μμ
Σ=
minusΣminusminus minus
nk
ddtk
tki
tkik
tkiedP
)2(||)|(
2)()( 1
πμμ
μμ
Σ=isin
minusΣminusminus minus
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Expectation Non‐spherical Clusters
Tk
n
kk DD
⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢
⎣
⎡
=Σ
λ
λλ
00
0000
2
1
bull Dk=Eigenvectors of Σ are the orientation of cluster k
bull λ1 λn=Eigenvalues of Σ are the radii of cluster k
bull Spherical clusters are usually sufficient (Identity Matrix)
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Compute initial probabilities
bull Given the initial clusters initialize the probabilities with the Normal Distribution
bull Since the probabilities will have to be normalized anyway we can skip the constant
2)()(00 00
)|( kiki ddkki edP μμμμ minussdotminusminus=isin
Expectation Bayersquos law
)()()|()|(
EffectPCausePCauseEffectPEffectCauseP sdot
=
Posterior Prior
sum sdotsdot
=)()|(
)()|()|(CausePCauseEffectP
CausePCauseEffectPEffectCauseP
Expectation Bayersquos law
sum sum
sum
= =
minusminus
=
minusminus
isinsdotisin
isinsdotisin=isin C
j
n
i
tji
tj
tj
tji
n
i
tki
tk
tk
tki
tki
tk
dPdP
dPdPdP
1 1
11
1
11
)|()|(
)|()|()|(
μμμμ
μμμμμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Recompute membership probabilities using normal
distribution
New Probability Old Probabilities
Maximization New Clusters
bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster
sum
sum
=
minusminus
=
minusminus
isin
sdotisin= n
i
tki
tk
n
ii
tki
tk
tk
dP
ddP
1
11
1
11
)|(
)|(
μμ
μμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Bayesian Information Criterion
bull BIC measures the efficiency of the parameterized model in terms of predicting the data
bull It is independent to prior knowledge and the model used
)ln()|(
ln 1
2
nkn
dPnBIC
n
i
tk
tki
tk sdot+
⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜
⎝
⎛isin
sdot=sum=
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
Fit quality Complexity penalty
Bayesian Information Criterion
bull After convergence compute global BIC
)ln()|(
ln 1 1
2
nknk
dPnBIC
n
i
k
jjji
sdot+
⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜
⎝
⎛isin
sdot=sumsum= =
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
GCSF
GMCSF
IL12B
IL1RN
IL6
IL6
PBEF
proIL1B
TNFA
IL8
IL8
IP10
MCP1
MGSA
MIP1A
MIP1B
MIP2A
MIP2B
RANTES
CD44
CD44
ICAM1
IFITM1
LAMB3
NINJ1
TNFAIP6
ADORA2A
CCR6
CCR7
CCRL2
DTR
1
2
37
11
6
4
10
5
8 12
9
13
14
15
17
16
18
19
20
22
21
23
24
25
26
27
28
29
30
Initial ClustersE coliTest Data
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
K‐Means
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Contributes to Closest Cluster
K‐Means
[31] Jiang D Tang C and Zhang A Cluster Analysis for Gene Expression Data A Survey
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Data
Initial Clusters
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
High Contribution to Closest Cluster
Low Contribution to Other Cluster
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Weak contributors removed
Bayesian Clustering
bull Initialize Clusters
bull Until Convergencendash Expectation Compute new probabilities
ndash Maximization Compute new clusters
bull Use BIC to prune clusters
bull Compute global BIC
bull Repeat with different initial clustersndash Keep results with best BIC
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Normal Distribution
πσ
σ
2)(
220 2)( xxexN
minusminus
=
n
xxexN)2(||
)(2)()( 1
π
μμ
Σ=
minusΣminusminus minus
nk
ddtk
tki
tkik
tkiedP
)2(||)|(
2)()( 1
πμμ
μμ
Σ=isin
minusΣminusminus minus
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Expectation: Non‐spherical Clusters

\Sigma_k = D_k \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix} D_k^T

• D_k = eigenvectors of Σ, the orientation of cluster k
• λ_1, …, λ_n = eigenvalues of Σ, the radii of cluster k
• Spherical clusters (Σ = identity matrix) are usually sufficient
Compute Initial Probabilities

• Given the initial clusters, initialize the probabilities with the normal distribution
• Since the probabilities will have to be normalized anyway, we can skip the constant factor:

P(d_i \in k \mid \mu_k^0) = e^{-(d_i - \mu_k^0) \cdot (d_i - \mu_k^0) / 2}
Expectation: Bayes' Law

P(\mathrm{Cause} \mid \mathrm{Effect}) = \frac{P(\mathrm{Effect} \mid \mathrm{Cause}) \cdot P(\mathrm{Cause})}{P(\mathrm{Effect})}    (posterior ∝ likelihood × prior)

P(\mathrm{Cause}_i \mid \mathrm{Effect}) = \frac{P(\mathrm{Effect} \mid \mathrm{Cause}_i) \cdot P(\mathrm{Cause}_i)}{\sum_j P(\mathrm{Effect} \mid \mathrm{Cause}_j) \cdot P(\mathrm{Cause}_j)}
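The normalization in Bayes' law above is simple enough to illustrate directly; the two-cause example values here are made up for the sake of the sketch.

```python
def posterior(likelihoods, priors):
    """Bayes' law: P(cause_i | effect) is proportional to
    P(effect | cause_i) * P(cause_i), normalized over all causes."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]
```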
Expectation: Bayes' Law

Recompute membership probabilities using the normal distribution; each new probability is computed from the old probabilities:

P(d_i \in \mu_k^t) = \frac{P(d_i \in \mu_k^{t-1}) \cdot \sum_{i=1}^{n} P(d_i \in \mu_k^{t-1})}{\sum_{j=1}^{C} \left[ P(d_i \in \mu_j^{t-1}) \cdot \sum_{i=1}^{n} P(d_i \in \mu_j^{t-1}) \right]}

(the inner sums estimate the cluster priors)

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis
Maximization: New Clusters

• Cluster means are computed as the average of each point weighted by the probability that it belongs to the cluster:

\mu_k^t = \frac{\sum_{i=1}^{n} P(d_i \in \mu_k^{t-1}) \cdot d_i}{\sum_{i=1}^{n} P(d_i \in \mu_k^{t-1})}
Bayesian Information Criterion

• BIC measures how well the parameterized model predicts the data
• It is independent of prior knowledge and of the particular model used

\mathrm{BIC}_k = -2 \sum_{i=1}^{n} \ln P(d_i \in \mu_k^t) + k \ln(n)

Fit quality (log-likelihood) + complexity penalty

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
Bayesian Information Criterion

• After convergence, compute the global BIC over all k clusters:

\mathrm{BIC} = -2 \sum_{j=1}^{k} \sum_{i=1}^{n} \ln P(d_i \in \mu_j^t) + k\,n \ln(n)
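The BIC trade-off described above is a one-liner; the sketch below uses the standard minimize-BIC convention (-2 ln L + k ln n), so the "best" model is the one with the lowest value.

```python
import math

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: -2 ln(L) + k ln(n).

    Lower values indicate a better trade-off between fit quality
    (the log-likelihood term) and model complexity (k parameters
    penalized by the number of data points n)."""
    return -2.0 * log_likelihood + k * math.log(n)
```

At equal fit, fewer parameters win; a sufficiently better fit can justify the extra complexity.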
[Figure: E. coli initial clusters (test data), 1 Hour vs. 12 Hours expression scatter plot. Genes shown: GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR.]

[Figure: E. coli final clusters (test data), 1 Hour vs. 12 Hours, with the same genes partitioned into Clusters 1–9.]
[Figure: EHEC clustering results, 1 Hour vs. 6 Hours; Clusters 0–7 and 9, with rational and discrete cluster centers marked.]

[Figure: S. aureus clustering results, 1 Hour vs. 6 Hours; Clusters 0–7, with rational and discrete cluster centers marked.]

[Figure: E. coli clustering results, 1 Hour vs. 6 Hours; Clusters 0–7, with rational and discrete cluster centers marked.]

[Figure: S. Typhimurium clustering results, 1 Hour vs. 6 Hours; Clusters 0–7, 10, and 12, with rational and discrete cluster centers marked.]
Self‐Organizing Maps

• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values, so similar inputs are mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← node of SOM minimizing d(v, node)
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment

• Kohonen, T. Self‐Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser‐Kranich, T. and Ma, C. Self‐organizing maps in mining gene expression data. Information Sciences 139 (2001) 79–96.
• Kohonen's Self Organizing Feature Maps. AI‐Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
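The pseudocode above can be fleshed out in Python. This is a sketch, not the project's implementation: the grid size, decay schedules, and hard neighborhood cutoff are illustrative assumptions.

```python
import math
import random

def train_som(data, rows, cols, iters=500, lr0=0.5, seed=0):
    """Train a rows x cols SOM on 'data' (a list of equal-length vectors).

    Each iteration: pick a random input, find the best matching unit (BMU)
    by squared Euclidean distance, then pull the BMU and its neighbors
    toward the input with a time-decaying learning rate and radius."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0          # initial neighborhood radius
    lam = iters / math.log(sigma0) if sigma0 > 1 else iters
    for t in range(iters):
        v = rng.choice(data)
        # BMU: node whose weight vector is closest to the input
        bmu = min(grid, key=lambda node: sum((a - b) ** 2
                                             for a, b in zip(grid[node], v)))
        sigma = sigma0 * math.exp(-t / lam)  # shrinking neighborhood
        lr = lr0 * math.exp(-t / iters)      # decaying learning rate
        for node, w in grid.items():
            d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
            if d2 <= sigma ** 2:
                theta = math.exp(-d2 / (2 * sigma ** 2))
                for i in range(dim):
                    w[i] += theta * lr * (v[i] - w[i])
    return grid
```

Because each update is a convex pull toward an input vector, the trained weights stay within the range spanned by the initial weights and the data.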
Best Matching Unit

• The BMU is the node in the SOM whose weight vector has the minimum distance to the training vector. It is typically found using the Euclidean distance formula:

d(v, w) = \sqrt{\sum_{i=1}^{k} (v_i - w_i)^2}

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.
Neighborhood

• The area of the neighborhood shrinks over time. The exponential decay function can be used for this:

\sigma(t) = \sigma_0 \, e^{-t/\lambda}, \qquad \lambda = n / \ln(\sigma_0)

where n is the number of iterations that the algorithm will run and σ_0 is the initial size of the neighborhood.
Weight Adjustment

• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node:

W(t+1) = W(t) + \Theta(t) \, L(t) \, (V(t) - W(t))

• One decay function, L(t), decreases the learning rate over time
• The other, Θ(t), decreases the rate of learning for neighbors further away from the BMU
Mapping Mode (SOM)

• After the learning algorithm has been run on the map, the input is mapped onto the SOM, and similar n‐dimensional inputs will be near each other on the two‐dimensional map.
GenePattern

• http://www.broad.mit.edu/cancer/software/genepattern
• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
• Use online or download

GenePattern module: SOMClustering
Description: Self‐Organizing Maps algorithm
Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu
Date: 10/28/03, Release 1.0

Summary: The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. The geometry of the grid is randomly chosen (e.g. a 3 × 2 grid) and mapped to the k‐dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. Resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g. stocks, mutual funds, spectral peaks, etc.).
Final Results

• http://my.fit.edu/~sellings/finalClusters.zip
• Arranged input data into 9 clusters using the GenePattern software
• Why 9 clusters? Used the average of Stephen's output (8) + 1 to get a square map (3×3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)

• Initialize graph G = (V, E)
  – Vertex v: a single gene's "fingerprint" vector
  – Edge e: pairwise similarity between genes
  – Cut C: subset of E that partitions the graph
  – Cluster c: subset of V
  – Intersection of clusters c_i, c_j (i ≠ j) is ∅
  – Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.
CLICK: Preprocessing

• Input: n × p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure:
  – S_{vu} = v · u = |v||u| cos θ
  – Proportional to cos θ only when the norm is fixed for all v ∈ V
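The normalization caveat above is why fingerprints are typically scaled to unit norm first, so the dot product equals cos θ directly. A small sketch (function names are illustrative):

```python
import math

def normalize(v):
    """Scale a fingerprint vector to unit norm so that dot products
    between normalized vectors equal cos(theta)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity(u, v):
    """Dot-product similarity of two fingerprint vectors after
    normalization (i.e., the cosine of the angle between them)."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))
```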
CLICK: Similarity

• Key idea: S is a normalized mixed distribution
  – For u, v ∈ V in the same cluster: pdf f(x | μ_T, σ_T) with mean μ_T, variance σ_T²
  – For u, v ∈ V in different clusters: pdf f(x | μ_F, σ_F) with mean μ_F, variance σ_F²
Basic CLICK

• M: n × p matrix (genes vs. test conditions)
• S_{ij}: dot product of v_i, v_j
• w_{ij}: log-odds that v_i, v_j are mates

f(S_{ij} \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(S_{ij} - \mu)^2 / 2\sigma^2}

w_{ij} = \ln \frac{p_{mates} \, f(S_{ij} \mid \mu_T, \sigma_T)}{(1 - p_{mates}) \, f(S_{ij} \mid \mu_F, \sigma_F)} = \ln \frac{p_{mates}\,\sigma_F}{(1 - p_{mates})\,\sigma_T} + \frac{(S_{ij} - \mu_F)^2}{2\sigma_F^2} - \frac{(S_{ij} - \mu_T)^2}{2\sigma_T^2}
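The closed-form edge weight above translates directly to code; the distribution parameters in the test are illustrative values, not fitted ones.

```python
import math

def edge_weight(s, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """CLICK edge weight: log-odds that similarity value s came from the
    'mates' (same-cluster) normal distribution rather than the
    'non-mates' one, given the prior probability p_mates."""
    return (math.log(p_mates * sigma_f / ((1.0 - p_mates) * sigma_t))
            + (s - mu_f) ** 2 / (2.0 * sigma_f ** 2)
            - (s - mu_t) ** 2 / (2.0 * sigma_t ** 2))
```

Similarities near the mates mean get positive weights; similarities near the non-mates mean get negative ones.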
Basic CLICK

R: singleton set

Basic_CLICK(Graph G):
    if V(G) = {v} then          // v is a singleton
        R.add(v)
    else if G is a kernel then
        Output(V(G))
    else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
CLICK: Kernel

• Decision problem: is V…
  – a singleton (|V| = 1)?
  – a subset of 2+ clusters (need to partition more)?
  – a subset of a single cluster (a kernel)?
CLICK: Kernel

• For all possible cuts C of V:
  – H0^C: cut C disconnects two clusters
  – H1^C: cut C partitions a kernel
  – If Pr(H0^C) > Pr(H1^C) for any C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into sub‐graphs H, K

W(C) = \ln \frac{\Pr(C \mid H_1^C)}{\Pr(C \mid H_0^C)} = \sum_{(i,j) \in C} w_{ij}
Minimal Weight Cut

• Choose one vertex as the source; mark it visited
• Repeatedly mark the next most highly connected vertex
• The last node t marked represents the cut
  – W(C_t) = Σ_i w_{it}
  – Merge t with the second-to-last marked node s
  – Remove t from V
  – Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min‐Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
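The Stoer–Wagner procedure described above can be sketched in Python; the adjacency-matrix input format is an illustrative choice.

```python
def min_cut(graph):
    """Stoer-Wagner global minimum cut. 'graph' is a symmetric matrix of
    non-negative edge weights. Returns (cut_weight, vertices_on_one_side)."""
    n = len(graph)
    w = [row[:] for row in graph]          # mutable copy of edge weights
    groups = [[i] for i in range(n)]       # original vertices in each super-node
    active = list(range(n))
    best_weight, best_side = float("inf"), []
    while len(active) > 1:
        # One minimum-cut phase: grow a set A, always adding the vertex
        # most tightly connected to A; the last vertex added defines a cut.
        a = [active[0]]
        rest = active[1:]
        conn = {v: w[active[0]][v] for v in rest}
        last_conn = 0.0
        while rest:
            v = max(rest, key=lambda x: conn[x])
            last_conn = conn.pop(v)        # W(C_t): weight from t to the rest
            rest.remove(v)
            a.append(v)
            for u in rest:
                conn[u] += w[v][u]
        s, t = a[-2], a[-1]
        if last_conn < best_weight:
            best_weight, best_side = last_conn, list(groups[t])
        # Merge t into the second-to-last vertex s, then drop t.
        groups[s].extend(groups[t])
        for u in active:
            if u != s and u != t:
                w[s][u] += w[t][u]
                w[u][s] = w[s][u]
        active.remove(t)
    return best_weight, best_side
```

On two triangles joined by a single light edge, the algorithm finds that light edge as the minimum cut.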
MinWeightCut Example
MinWeightCut
CLICK: Adoption & Merging

• Basic_CLICK produces kernels, not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
  – In both cases, only merge/adopt above some similarity threshold
Full CLICK

R: singleton set

Full_CLICK(Graph G = (V, E)):
    R <- V
    while |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        let L be the list of kernels produced
        let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
Results & Analysis

• Two metrics:
  – Similarity between result clusters
  – Similarity of clusters with the paper's results
• Map clusters to find regions of similar data
• Compare best clusters to find relationships
[Figure: BCG, 1 Hour vs. 6 Hours; Bayesian, Click, and SOM cluster results.]

[Figure: E. coli, 1 Hour vs. 6 Hours; Bayesian, Click, and SOM cluster results.]

[Figure: EHEC, 1 Hour vs. 6 Hours; Bayesian, Click, and SOM cluster results.]

[Figure: L. monocytogenes, 1 Hour vs. 6 Hours; Bayesian, Click, and SOM cluster results.]

[Figure: Latex, 1 Hour vs. 6 Hours; Bayesian, Click, and SOM cluster results.]

[Figure: M. tuberculosis, 1 Hour vs. 6 Hours; Bayesian, Click, and SOM cluster results.]

[Figure: S. aureus, 1 Hour vs. 6 Hours; Bayesian, Click, and SOM cluster results.]

[Figure: S. Typhi, 1 Hour vs. 6 Hours; Bayesian, Click, and SOM cluster results.]

[Figure: S. Typhimurium, 1 Hour vs. 6 Hours; Bayesian, Click, and SOM cluster results.]
Error Metric

• Rewards result clusters that contain either almost all or almost none of each cluster in the truth set
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated with ours
• High error implies the truth clusters were distributed across our clusters

Error: Bayesian 4.33, Click 2.44, SOM 1.822
E = \frac{\sum_{i=1}^{k} \left(1 - \left|1 - 2\,\frac{|T_i \cap C_i|}{|T_i|}\right|\right) \cdot |C_i|}{\sum_{i=1}^{k} |C_i|}

where T_i are the truth (paper) clusters and C_i are our result clusters.
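One reading of the metric above can be sketched as follows. This is an interpretation, not the project's code: the overlap term is evaluated for every truth/result pair, since the exact pairing in the original slide is ambiguous in the transcript.

```python
def cluster_error(truth, results):
    """Error metric sketch: a result cluster contributes little error when
    it captures either almost all or almost none of each truth cluster,
    and maximal error when it captures about half. Contributions are
    scaled by result-cluster size and normalized by total size."""
    total = sum(len(c) for c in results)
    err = 0.0
    for c in results:
        for t in truth:
            frac = len(set(t) & set(c)) / len(t)
            err += (1.0 - abs(1.0 - 2.0 * frac)) * len(c)
    return err / total
```

A perfect clustering scores zero; splitting every truth cluster in half scores maximally.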
[Table: per-cluster comparison of the Bayesian, Click, and SOM results against the paper's clusters — percent of our clusters, percent of the paper's clusters, cluster quality, and minimum cluster composition at thresholds from 0–100%. Overall error: Bayesian 4.23–4.33, Click 2.38–2.44, SOM 1.78–1.82.]
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
K‐Means
[31] Jiang D Tang C and Zhang A Cluster Analysis for Gene Expression Data A Survey
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Data
Initial Clusters
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
High Contribution to Closest Cluster
Low Contribution to Other Cluster
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Weak contributors removed
Bayesian Clustering
bull Initialize Clusters
bull Until Convergencendash Expectation Compute new probabilities
ndash Maximization Compute new clusters
bull Use BIC to prune clusters
bull Compute global BIC
bull Repeat with different initial clustersndash Keep results with best BIC
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Normal Distribution
πσ
σ
2)(
220 2)( xxexN
minusminus
=
n
xxexN)2(||
)(2)()( 1
π
μμ
Σ=
minusΣminusminus minus
nk
ddtk
tki
tkik
tkiedP
)2(||)|(
2)()( 1
πμμ
μμ
Σ=isin
minusΣminusminus minus
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Expectation Non‐spherical Clusters
Tk
n
kk DD
⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢
⎣
⎡
=Σ
λ
λλ
00
0000
2
1
bull Dk=Eigenvectors of Σ are the orientation of cluster k
bull λ1 λn=Eigenvalues of Σ are the radii of cluster k
bull Spherical clusters are usually sufficient (Identity Matrix)
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Compute initial probabilities
bull Given the initial clusters initialize the probabilities with the Normal Distribution
bull Since the probabilities will have to be normalized anyway we can skip the constant
2)()(00 00
)|( kiki ddkki edP μμμμ minussdotminusminus=isin
Expectation Bayersquos law
)()()|()|(
EffectPCausePCauseEffectPEffectCauseP sdot
=
Posterior Prior
sum sdotsdot
=)()|(
)()|()|(CausePCauseEffectP
CausePCauseEffectPEffectCauseP
Expectation Bayersquos law
sum sum
sum
= =
minusminus
=
minusminus
isinsdotisin
isinsdotisin=isin C
j
n
i
tji
tj
tj
tji
n
i
tki
tk
tk
tki
tki
tk
dPdP
dPdPdP
1 1
11
1
11
)|()|(
)|()|()|(
μμμμ
μμμμμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Recompute membership probabilities using normal
distribution
New Probability Old Probabilities
Maximization New Clusters
bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster
sum
sum
=
minusminus
=
minusminus
isin
sdotisin= n
i
tki
tk
n
ii
tki
tk
tk
dP
ddP
1
11
1
11
)|(
)|(
μμ
μμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Bayesian Information Criterion
bull BIC measures the efficiency of the parameterized model in terms of predicting the data
bull It is independent to prior knowledge and the model used
)ln()|(
ln 1
2
nkn
dPnBIC
n
i
tk
tki
tk sdot+
⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜
⎝
⎛isin
sdot=sum=
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
Fit quality Complexity penalty
Bayesian Information Criterion
bull After convergence compute global BIC
)ln()|(
ln 1 1
2
nknk
dPnBIC
n
i
k
jjji
sdot+
⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜
⎝
⎛isin
sdot=sumsum= =
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
GCSF
GMCSF
IL12B
IL1RN
IL6
IL6
PBEF
proIL1B
TNFA
IL8
IL8
IP10
MCP1
MGSA
MIP1A
MIP1B
MIP2A
MIP2B
RANTES
CD44
CD44
ICAM1
IFITM1
LAMB3
NINJ1
TNFAIP6
ADORA2A
CCR6
CCR7
CCRL2
DTR
1
2
37
11
6
4
10
5
8 12
9
13
14
15
17
16
18
19
20
22
21
23
24
25
26
27
28
29
30
Initial ClustersE coliTest Data
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Data
Initial Clusters
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
High Contribution to Closest Cluster
Low Contribution to Other Cluster
Bayesian Clustering
Clusters adjust
Bayesian Clustering
Weak contributors removed
Bayesian Clustering
• Initialize clusters
• Until convergence:
– Expectation: compute new membership probabilities
– Maximization: compute new clusters
• Use BIC to prune clusters
• Compute the global BIC
• Repeat with different initial clusters; keep the results with the best BIC
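The loop above is standard expectation–maximization for a Gaussian mixture. A minimal sketch in Python (the spherical unit-variance clusters, `n_iter`, and `tol` settings are illustrative assumptions, not the project's exact configuration):

```python
import numpy as np

def em_cluster(data, n_clusters, n_iter=100, tol=1e-6, seed=0):
    """Soft-cluster rows of `data` with spherical unit-variance Gaussians (EM)."""
    rng = np.random.default_rng(seed)
    # Initialize cluster centers from distinct random data points.
    centers = data[rng.choice(len(data), n_clusters, replace=False)]
    for _ in range(n_iter):
        # E-step: unnormalized Gaussian likelihoods, then normalize per point.
        sq_dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        resp = np.exp(-0.5 * sq_dist)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: centers are responsibility-weighted means of the data.
        new_centers = (resp.T @ data) / resp.sum(axis=0)[:, None]
        if np.abs(new_centers - centers).max() < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, resp
```

On two well-separated blobs the loop recovers a center near each blob mean within a handful of iterations.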
Normal Distribution
N(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-x_0)^2 / 2\sigma^2}

N(x) = \frac{e^{-(x-\mu)^T \Sigma^{-1} (x-\mu)/2}}{\sqrt{(2\pi)^n |\Sigma|}}

P(d_i \in k \mid \mu_k^t) = \frac{e^{-(d_i-\mu_k^t)^T \Sigma_k^{-1} (d_i-\mu_k^t)/2}}{\sqrt{(2\pi)^n |\Sigma_k|}}
Expectation Non‐spherical Clusters
\Sigma_k = D_k
\begin{bmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_n
\end{bmatrix}
D_k^T

• D_k: the eigenvectors of Σ, giving the orientation of cluster k
• λ1, …, λn: the eigenvalues of Σ, giving the radii of cluster k
• Spherical clusters (Σ = identity matrix) are usually sufficient
Compute initial probabilities
• Given the initial clusters, initialize the probabilities with the normal distribution
• Since the probabilities will have to be normalized anyway, the constant factor can be skipped:

P(d_i \in k \mid \mu_k^0) = e^{-(d_i - \mu_k^0) \cdot (d_i - \mu_k^0)/2}
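In code, the constant-free initialization might look like this (a sketch; `data` rows are expression vectors and `centers` the initial cluster means):

```python
import numpy as np

def initial_probabilities(data, centers):
    """Unnormalized Gaussian weights e^{-||d-mu||^2/2}, normalized per point."""
    sq = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    p = np.exp(-0.5 * sq)                    # constant 1/sqrt((2*pi)^n) omitted...
    return p / p.sum(axis=1, keepdims=True)  # ...because we normalize anyway
```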
Expectation: Bayes' law

P(\mathrm{Cause} \mid \mathrm{Effect}) = \frac{P(\mathrm{Effect} \mid \mathrm{Cause}) \cdot P(\mathrm{Cause})}{P(\mathrm{Effect})}

(the left-hand side is the posterior; P(Cause) is the prior)

P(\mathrm{Cause} \mid \mathrm{Effect}) = \frac{P(\mathrm{Effect} \mid \mathrm{Cause}) \cdot P(\mathrm{Cause})}{\sum P(\mathrm{Effect} \mid \mathrm{Cause}) \cdot P(\mathrm{Cause})}
Expectation: Bayes' law

P(d_i \in k \mid \mu^{t+1}) = \frac{P(d_i \in k \mid \mu_k^t) \sum_{i=1}^{n} P(d_i \in k \mid \mu_k^t)}{\sum_{j=1}^{C} \left[ P(d_i \in j \mid \mu_j^t) \sum_{i=1}^{n} P(d_i \in j \mid \mu_j^t) \right]}
Recompute the membership probabilities using the normal distribution: the new probability (left-hand side) is computed from the old probabilities (right-hand side).
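One way to read the update: weight each cluster's likelihood by its current share of the data, then renormalize per point. A sketch (`old_p` is the n×C matrix of current membership probabilities):

```python
import numpy as np

def e_step(old_p):
    """Bayes update: scale each column by the cluster's prior mass, renormalize rows."""
    prior = old_p.sum(axis=0)   # each cluster's "size" under the current probabilities
    new_p = old_p * prior       # likelihood x prior, per point and cluster
    return new_p / new_p.sum(axis=1, keepdims=True)
```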
Maximization New Clusters
• New clusters are computed as the average of the points, each weighted by the probability that it belongs to the cluster:

\mu_k^{t+1} = \frac{\sum_{i=1}^{n} P(d_i \in k \mid \mu_k^t)\, d_i}{\sum_{i=1}^{n} P(d_i \in k \mid \mu_k^t)}
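The maximization step above is a probability-weighted mean; in code (a sketch, with `p` the n×C membership matrix):

```python
import numpy as np

def m_step(data, p):
    """New centers: probability-weighted means of the data (one row per cluster)."""
    return (p.T @ data) / p.sum(axis=0)[:, None]
```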
Bayesian Information Criterion
• BIC measures how efficiently the parameterized model predicts the data
• It is independent of prior knowledge and of the model used

BIC_k^t = \underbrace{-2 \ln\!\left( \sum_{i=1}^{n} n_k\, P(d_i \in k \mid \mu_k^t) \right)}_{\text{fit quality}} + \underbrace{k \ln(n)}_{\text{complexity penalty}}

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
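Under the textbook definition (lower is better), BIC is a fit term plus a complexity penalty; a minimal sketch, using the standard form −2 ln L̂ + k ln n rather than the slide's exact notation:

```python
import math

def bic(log_likelihood, n_params, n_points):
    """Bayesian Information Criterion: fit term plus complexity penalty (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_points)
```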
Bayesian Information Criterion
• After convergence, compute the global BIC over all k clusters:

BIC^t = -2 \ln\!\left( \sum_{j=1}^{k} \sum_{i=1}^{n} n_j\, P(d_i \in j \mid \mu_j) \right) + k \ln(n)
[Figure: initial clusters for the E. coli test data, plotting 1-hour vs. 12-hour expression for 30 genes (GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR).]
[Figure: final clusters for the E. coli test data (clusters 1–9), 1-hour vs. 12-hour expression for the same 30 genes.]
[Figures: Bayesian clustering results, 1-hour vs. 6-hour expression, for EHEC (clusters 0–7 and 9), S. aureus (clusters 0–7), E. coli (clusters 0–7), and S. typhimurium (clusters 0–7, 10, and 12); each plot marks both rational clusters and discrete clusters.]
Self-Organizing Maps
• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values, so similar inputs are mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← min(d(v, SOM)) over every node in SOM
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79–96.
• Kohonen's Self Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
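A compact Python sketch of this training loop (the grid size, iteration count, and initial learning rate `lr0` are illustrative choices; the exponential decay schedules follow the AI-Junkie tutorial cited above):

```python
import numpy as np

def train_som(data, grid_w, grid_h, n_iter=500, lr0=0.5, seed=0):
    """Train a 2-D SOM: pick a random input, find its BMU, and pull the BMU's
    neighborhood toward the input with time-decaying radius and learning rate."""
    rng = np.random.default_rng(seed)
    som = rng.random((grid_h, grid_w, data.shape[1]))          # random initial map
    coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                                  indexing="ij"), axis=-1)     # node grid positions
    sigma0 = max(grid_w, grid_h) / 2.0
    lam = n_iter / np.log(sigma0)          # decay constant; assumes sigma0 > 1
    for t in range(n_iter):
        v = data[rng.integers(len(data))]
        # Best Matching Unit: the node whose weight vector is closest to v.
        dist = ((som - v) ** 2).sum(axis=2)
        bmu = np.unravel_index(dist.argmin(), dist.shape)
        # Exponentially shrinking neighborhood radius and learning rate.
        sigma = sigma0 * np.exp(-t / lam)
        lr = lr0 * np.exp(-t / n_iter)
        grid_d2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
        theta = np.exp(-grid_d2 / (2 * sigma ** 2))            # neighborhood falloff
        som += (lr * theta)[:, :, None] * (v - som)            # pull toward input
    return som
```

For gene-expression use, each row of `data` would be one gene's expression vector, and genes whose rows map to nearby grid nodes are similarly expressed.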
Best Matching Unit
• The BMU is the node in the SOM whose weight vector has the minimum distance to the training vector. It is typically found using the Euclidean distance:

d(v, w) = \sqrt{\sum_{i=1}^{k} (v_i - w_i)^2}

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.
Neighborhood
• The area of the neighborhood shrinks over time; the exponential decay function can be used for this:

\sigma(t) = \sigma_0\, e^{-t/\lambda}

where λ is a time constant derived from the number of iterations n that the algorithm will run, and σ0 is the initial size of the neighborhood.
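In code, a small sketch of the shrinking radius (the choice λ = n / ln σ0, which drives the radius down to 1 at the final iteration, follows the tutorial cited above):

```python
import math

def neighborhood_radius(t, n_iter, sigma0):
    """Exponentially shrinking neighborhood: sigma0 at t=0, decaying toward 1."""
    lam = n_iter / math.log(sigma0)   # time constant; assumes sigma0 > 1
    return sigma0 * math.exp(-t / lam)
```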
Weight Adjustment
• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node:

w(t+1) = w(t) + \theta(t)\, L(t)\, (v(t) - w(t))

• One decay function, L(t), decreases the learning rate over time
• The other, θ(t), decreases the rate of learning for neighbors further away from the BMU
Mapping Mode (SOM)
• After the learning algorithm has been run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs will lie near each other on the two-dimensional map
GenePattern
• http://www.broad.mit.edu/cancer/software/genepattern
• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
• Use it online or download it
GenePattern
Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0. Summary:
The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, i.e., stocks, mutual funds, spectral peaks, etc.).
Final Results
• http://my.fit.edu/~sellings/finalClusters.zip
• Arranged the input data into 9 clusters using the GenePattern software
• Why 9 clusters?
– Used the average of Stephen's output (8) + 1 so as to use a square map (3×3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
• Initialize a graph G = (V, E)
– Vertex v: a single gene "fingerprint" vector
– Edge e: pairwise similarity between genes
– Cut C: a subset of E that partitions the graph
– Cluster c: a subset of V
– The intersection of clusters c_i, c_j (i ≠ j) is ∅
– Fingerprint of a cluster c: mean_vector(c)
Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.
CLICK Preprocessing
• Input: an n x p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure:
– S_vu = v·u = |v||u| cos θ
– Proportional only when the norm is fixed for all v ∈ V
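Because the dot product tracks the angle only when all fingerprints have the same norm, a sketch of the similarity computation might row-normalize first (illustrative, not the project's exact preprocessing):

```python
import numpy as np

def similarity_matrix(m):
    """Pairwise dot-product similarities between row-normalized fingerprints,
    so S_vu equals cos(theta) between genes v and u."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / norms      # fix the norm for every vector...
    return unit @ unit.T  # ...so the dot product reflects only the angle
```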
CLICK Similarity
• Key idea: S follows a normalized mixture distribution

For u, v ∈ V in the same cluster: S_uv ~ f_T(x | μ_T, σ_T²), the pdf for elements in the same cluster
For u, v ∈ V in different clusters: S_uv ~ f_F(x | μ_F, σ_F²), the pdf for elements in different clusters
Basic CLICK
• M: n x p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates

f(S_{ij} \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(S_{ij}-\mu)^2 / 2\sigma^2}

w_{ij} = \ln \frac{p_{mates}\, f_T(S_{ij} \mid \mu_T, \sigma_T^2)}{(1 - p_{mates})\, f_F(S_{ij} \mid \mu_F, \sigma_F^2)} = \ln \frac{p_{mates}\, \sigma_F}{(1 - p_{mates})\, \sigma_T} + \frac{(S_{ij} - \mu_F)^2}{2\sigma_F^2} - \frac{(S_{ij} - \mu_T)^2}{2\sigma_T^2}
Basic CLICK
R: singleton set
Basic_CLICK(Graph G):
    if V(G) = {v} then          # v is a singleton
        R.add(v)
    else if G is a kernel then
        Output(V(G))
    else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
CLICK Kernel
• Decision problem: is V…
– a singleton (|V| = 1)?
– a subset of 2+ clusters (needs further partitioning)?
– a subset of a single cluster (a kernel)?
CLICK Kernel
• For every possible cut C disconnecting V:
– H0^C: cut C disconnects two clusters
– H1^C: cut C partitions a kernel
– If Pr(H0^C) > Pr(H1^C) for some C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into subgraphs H, K

W(C) = \ln \frac{\Pr(H_1^C \mid C)}{\Pr(H_0^C \mid C)} = \sum_{(i,j) \in C} w_{ij}
Minimal Weight Cut
• Choose one vertex as the source; mark it visited
• Repeatedly mark the next most highly connected vertex
• The last marked node t defines the cut:
– W(C_t) = Σ_i w_it
– Merge t with the second-to-last marked node s
– Remove t from V
– Repeat until |V| = 1
Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
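The procedure above is the Stoer–Wagner minimum-cut algorithm from the citation; a compact sketch over an adjacency matrix (illustrative, not the project's implementation):

```python
def stoer_wagner(weights):
    """Global minimum weight cut of an undirected graph given as an adjacency
    matrix (zero diagonal); returns (cut_weight, vertices_on_one_side)."""
    n = len(weights)
    w = [row[:] for row in weights]
    groups = [[i] for i in range(n)]        # merged super-vertices
    active = list(range(n))
    best_weight, best_cut = float("inf"), []
    while len(active) > 1:
        # One "minimum cut phase": starting from an arbitrary source, repeatedly
        # add the most tightly connected remaining vertex.
        order = [active[0]]
        conn = {v: w[active[0]][v] for v in active[1:]}
        while conn:
            v = max(conn, key=conn.get)     # most highly connected vertex
            order.append(v)
            del conn[v]
            for u in conn:
                conn[u] += w[v][u]
        s, t = order[-2], order[-1]
        cut_weight = sum(w[t][v] for v in active if v != t)   # cut of the phase
        if cut_weight < best_weight:
            best_weight, best_cut = cut_weight, list(groups[t])
        # Merge t into the second-to-last node s and remove t.
        for v in active:
            w[s][v] += w[t][v]
            w[v][s] = w[s][v]
        groups[s].extend(groups[t])
        active.remove(t)
    return best_weight, best_cut
```

On two dense triangles joined by one light edge, the algorithm finds that single edge as the minimum cut.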
MinWeightCut Example
MinWeightCut
CLICK Adoption & Merging
• Basic_CLICK produces kernels – not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
– In both steps, only merge/adopt above some threshold
Full CLICK
R: singleton set
Full_CLICK(Graph G = (V, E)):
    R <- V
    while |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        let L be the list of kernels produced
        let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
Results & Analysis
• Two metrics:
– Similarity between our result clusters
– Similarity of our clusters to the paper's results
• Map clusters to find regions of similar data
• Compare the best clusters to find relationships
[Figures: comparison of Bayesian, CLICK, and SOM cluster assignments, plotted as 1-hour vs. 6-hour expression, for each stimulus: BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. typhi, and S. typhimurium.]
Error Metric
• Scores higher for clusters that contain either very many or very few of the clusters in the truth set
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated
• High error implies the truth clusters were well distributed

Error: Bayesian 4.33, CLICK 2.44, SOM 18.22
E = \frac{\sum_{i=1}^{k} \left( 1 - 2\left| \tfrac{1}{2} - \frac{|T_i \cap C_i|}{|C_i|} \right| \right) \cdot |C_i|}{\sum_{i=1}^{k} |C_i|}
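Under one plausible reading of the slide's metric (the exact denominators are hard to recover from the extraction), each result cluster scores by how far its overlap fraction with the matching truth cluster sits from the extremes 0 or 1, weighted by cluster size:

```python
def cluster_error(truth, result):
    """Size-weighted error: 0 when each result cluster lies entirely inside or
    entirely outside its truth cluster, 1 when every overlap fraction is 1/2.
    This is one plausible reading of the slide's formula, not a verified copy."""
    total = sum(len(c) for c in result)
    err = 0.0
    for t, c in zip(truth, result):
        frac = len(set(t) & set(c)) / len(c)        # overlap fraction
        err += (1 - 2 * abs(0.5 - frac)) * len(c)   # peaks at frac = 1/2
    return err / total
```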
[Table: per-cluster comparison for clusters 0–34 — percent of our clusters, percent of the paper's clusters, cluster quality, and minimum cluster composition, each reported for Bayesian, CLICK, and SOM; the decimal points did not survive extraction. Bottom rows give the overall error (Bayesian / CLICK / SOM): 4.22929 / 2.38409 / 17.8025, rounded on the slide as 4.33 / 2.440 / 18.22.]
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
High Contribution to Closest Cluster
Low Contribution to Other Cluster
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Clusters adjust
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Weak contributors removed
Bayesian Clustering
bull Initialize Clusters
bull Until Convergencendash Expectation Compute new probabilities
ndash Maximization Compute new clusters
bull Use BIC to prune clusters
bull Compute global BIC
bull Repeat with different initial clustersndash Keep results with best BIC
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Normal Distribution
πσ
σ
2)(
220 2)( xxexN
minusminus
=
n
xxexN)2(||
)(2)()( 1
π
μμ
Σ=
minusΣminusminus minus
nk
ddtk
tki
tkik
tkiedP
)2(||)|(
2)()( 1
πμμ
μμ
Σ=isin
minusΣminusminus minus
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Expectation Non‐spherical Clusters
Tk
n
kk DD
⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢
⎣
⎡
=Σ
λ
λλ
00
0000
2
1
bull Dk=Eigenvectors of Σ are the orientation of cluster k
bull λ1 λn=Eigenvalues of Σ are the radii of cluster k
bull Spherical clusters are usually sufficient (Identity Matrix)
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Compute initial probabilities
bull Given the initial clusters initialize the probabilities with the Normal Distribution
bull Since the probabilities will have to be normalized anyway we can skip the constant
2)()(00 00
)|( kiki ddkki edP μμμμ minussdotminusminus=isin
Expectation Bayersquos law
)()()|()|(
EffectPCausePCauseEffectPEffectCauseP sdot
=
Posterior Prior
sum sdotsdot
=)()|(
)()|()|(CausePCauseEffectP
CausePCauseEffectPEffectCauseP
Expectation Bayersquos law
sum sum
sum
= =
minusminus
=
minusminus
isinsdotisin
isinsdotisin=isin C
j
n
i
tji
tj
tj
tji
n
i
tki
tk
tk
tki
tki
tk
dPdP
dPdPdP
1 1
11
1
11
)|()|(
)|()|()|(
μμμμ
μμμμμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Recompute membership probabilities using normal
distribution
New Probability Old Probabilities
Maximization New Clusters
bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster
sum
sum
=
minusminus
=
minusminus
isin
sdotisin= n
i
tki
tk
n
ii
tki
tk
tk
dP
ddP
1
11
1
11
)|(
)|(
μμ
μμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Bayesian Information Criterion
bull BIC measures the efficiency of the parameterized model in terms of predicting the data
bull It is independent to prior knowledge and the model used
)ln()|(
ln 1
2
nkn
dPnBIC
n
i
tk
tki
tk sdot+
⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜
⎝
⎛isin
sdot=sum=
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
Fit quality Complexity penalty
Bayesian Information Criterion
bull After convergence compute global BIC
)ln()|(
ln 1 1
2
nknk
dPnBIC
n
i
k
jjji
sdot+
⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜
⎝
⎛isin
sdot=sumsum= =
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
GCSF
GMCSF
IL12B
IL1RN
IL6
IL6
PBEF
proIL1B
TNFA
IL8
IL8
IP10
MCP1
MGSA
MIP1A
MIP1B
MIP2A
MIP2B
RANTES
CD44
CD44
ICAM1
IFITM1
LAMB3
NINJ1
TNFAIP6
ADORA2A
CCR6
CCR7
CCRL2
DTR
1
2
37
11
6
4
10
5
8 12
9
13
14
15
17
16
18
19
20
22
21
23
24
25
26
27
28
29
30
Initial ClustersE coliTest Data
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption & Merging
- Full CLICK
- Results & Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
Bayesian Clustering
[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.
Clusters adjust
Bayesian Clustering
Weak contributors removed
Bayesian Clustering
• Initialize clusters
• Until convergence:
  – Expectation: compute new probabilities
  – Maximization: compute new clusters
• Use BIC to prune clusters
• Compute the global BIC
• Repeat with different initial clusters:
  – Keep the results with the best BIC
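The loop above can be sketched in Python. This is a minimal sketch with spherical clusters and illustrative names, not the implementation used for the project:

```python
import math

def em_cluster(data, mus, n_iter=50):
    """EM sketch: E-step assigns membership probabilities via Bayes' law,
    M-step recomputes cluster means as probability-weighted averages."""
    dim = len(data[0])
    priors = [1.0 / len(mus)] * len(mus)
    for _ in range(n_iter):
        # Expectation: P(d in k) proportional to exp(-|d - mu_k|^2 / 2) * prior
        memberships = []
        for d in data:
            lik = [math.exp(-0.5 * sum((x - m) ** 2 for x, m in zip(d, mu))) * pr
                   for mu, pr in zip(mus, priors)]
            z = sum(lik) or 1.0
            memberships.append([l / z for l in lik])
        # Maximization: new means are probability-weighted averages of the points
        for k in range(len(mus)):
            wsum = sum(p[k] for p in memberships) or 1.0
            mus[k] = [sum(p[k] * d[i] for p, d in zip(memberships, data)) / wsum
                      for i in range(dim)]
            priors[k] = wsum / len(data)
    return mus, memberships
```

BIC-based pruning and restarts from different initial clusters would wrap this loop in an outer search.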
Normal Distribution
$$N(x \mid x_0, \sigma) = \frac{e^{-(x-x_0)^2 / 2\sigma^2}}{\sigma\sqrt{2\pi}}$$

$$N(x \mid \mu, \Sigma) = \frac{e^{-(x-\mu)^T \Sigma^{-1} (x-\mu)/2}}{\sqrt{(2\pi)^n \, |\Sigma|}}$$

$$P(d_i \in k \mid \mu_k^t) = \frac{e^{-(d_i-\mu_k^t)^T \Sigma_k^{-1} (d_i-\mu_k^t)/2}}{\sqrt{(2\pi)^{n_k} \, |\Sigma_k|}}$$
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Expectation: Non-spherical Clusters

$$\Sigma_k = D_k \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_n \end{bmatrix} D_k^T$$
• D_k: the eigenvectors of Σ give the orientation of cluster k
• λ_1, …, λ_n: the eigenvalues of Σ give the radii of cluster k
• Spherical clusters (Σ = identity matrix) are usually sufficient
Compute initial probabilities
• Given the initial clusters, initialize the probabilities with the normal distribution
• Since the probabilities will have to be normalized anyway, the constant can be skipped:
$$P(d_i \in k \mid \mu_k^0) = e^{-(d_i - \mu_k^0) \cdot (d_i - \mu_k^0)/2}$$
Expectation: Bayes' law

$$P(\mathrm{Cause} \mid \mathrm{Effect}) = \frac{P(\mathrm{Effect} \mid \mathrm{Cause}) \cdot P(\mathrm{Cause})}{P(\mathrm{Effect})}$$

(the left-hand side is the posterior; P(Cause) is the prior)

$$P(\mathrm{Cause} \mid \mathrm{Effect}) = \frac{P(\mathrm{Effect} \mid \mathrm{Cause}) \cdot P(\mathrm{Cause})}{\sum P(\mathrm{Effect} \mid \mathrm{Cause}) \cdot P(\mathrm{Cause})}$$
Expectation: Bayes' law

$$P(d_i \in k \mid \mu^t) = \frac{P(d_i \in k \mid \mu_k^{t-1}) \cdot \sum_{i=1}^{n} P(d_i \in k \mid \mu_k^{t-1})}{\sum_{j=1}^{C} \left[ P(d_i \in j \mid \mu_j^{t-1}) \cdot \sum_{i=1}^{n} P(d_i \in j \mid \mu_j^{t-1}) \right]}$$
Recompute the membership probabilities using the normal distribution: the new probability (left-hand side) is computed from the old probabilities (right-hand side).
Maximization: New Clusters
• New cluster means are computed as the average of the points, weighted by the probability that each point belongs to the cluster:
$$\mu_k^t = \frac{\sum_{i=1}^{n} P(d_i \in k \mid \mu_k^{t-1}) \cdot d_i}{\sum_{i=1}^{n} P(d_i \in k \mid \mu_k^{t-1})}$$
Bayesian Information Criterion
• BIC measures how efficiently the parameterized model predicts the data
• It is independent of prior knowledge and of the model used
$$BIC_k^t = 2 \sum_{i=1}^{n} \ln P(d_i \in k \mid \mu_k^t) + n_k \ln(n)$$
[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
The first term is the fit quality; the second is the complexity penalty.
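A direct transcription of the expression above, with `memberships` holding the values P(d_i ∈ k | μ_k^t); the names are illustrative:

```python
import math

def bic_score(memberships, n_params):
    """BIC sketch: fit quality (log-likelihood term) plus the
    complexity penalty n_params * ln(n)."""
    n = len(memberships)
    fit = 2.0 * sum(math.log(p) for p in memberships)  # fit quality
    penalty = n_params * math.log(n)                   # complexity penalty
    return fit + penalty
```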
Bayesian Information Criterion
• After convergence, compute the global BIC:
$$BIC^t = 2 \sum_{j=1}^{k} \sum_{i=1}^{n} \ln P(d_i \in j \mid \mu_j^t) + k \, n_k \ln(n)$$
[Figure: E. coli initial clusters (test data), Hour 1 vs. Hour 12 expression for the test genes GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR]

[Figure: E. coli final clusters (test data), Hour 1 vs. Hour 12, genes grouped into Clusters 1-9]
[Figure: EHEC (Hour 1 vs. Hour 6), rational vs. discrete clusters]
[Figure: S. aureus (Hour 1 vs. Hour 6), rational vs. discrete clusters]
[Figure: E. coli (Hour 1 vs. Hour 6), rational vs. discrete clusters]
[Figure: S. Typhimurium (Hour 1 vs. Hour 6), rational vs. discrete clusters]
Self-Organizing Maps
• SOMs are based on neural networks: a learning algorithm trains a random map to resemble the input values, so similar inputs are mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← random element of V
        BMU ← node of SOM minimizing d(v, node)
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79-96.
• Kohonen's Self Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
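The training loop above can be sketched as follows. This is a minimal sketch with illustrative names and a Gaussian neighborhood function, not the project's implementation:

```python
import math
import random

def train_som(vectors, grid_w, grid_h, n_iter=500, lr0=0.1):
    """SOM training sketch: random map, BMU search, shrinking
    neighborhood, and weights pulled toward each training vector."""
    dim = len(vectors[0])
    sigma0 = max(grid_w, grid_h) / 2.0                 # initial neighborhood size
    lam = n_iter / math.log(sigma0) if sigma0 > 1 else float(n_iter)
    som = [[[random.random() for _ in range(dim)]
            for _ in range(grid_w)] for _ in range(grid_h)]
    for t in range(n_iter):
        v = random.choice(vectors)
        # Best Matching Unit: grid node at minimum Euclidean distance from v
        bmu = min(((r, c) for r in range(grid_h) for c in range(grid_w)),
                  key=lambda rc: sum((a - b) ** 2
                                     for a, b in zip(som[rc[0]][rc[1]], v)))
        sigma = sigma0 * math.exp(-t / lam)            # neighborhood shrinks
        lr = lr0 * math.exp(-t / lam)                  # learning rate decays
        for r in range(grid_h):
            for c in range(grid_w):
                d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                if d2 <= sigma * sigma:
                    theta = math.exp(-d2 / (2 * sigma * sigma))
                    node = som[r][c]
                    for i in range(dim):
                        node[i] += theta * lr * (v[i] - node[i])
    return som
```

The BMU search, neighborhood decay, and weight adjustment in this sketch correspond to the three slides that follow.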
Best Matching Unit
• The BMU is the node of the SOM at the minimum distance from the training vector, typically found with the Euclidean distance:

$$d(v, w) = \sqrt{\sum_{i=1}^{k} (v_i - w_i)^2}$$

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.
Neighborhood
• The area of the neighborhood shrinks over time; the exponential decay function can be used:

$$\sigma(t) = \sigma_0 \, e^{-t/\lambda}$$

where λ depends on n, the number of iterations the algorithm will run, and σ_0 is the initial size of the neighborhood.
Weight Adjustment
• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node:

$$W(t+1) = W(t) + \Theta(t) \, L(t) \, (V(t) - W(t))$$

• One decay function, L(t), decreases the learning rate over time.
• The other, Θ(t), decreases the rate of learning for neighbors further from the BMU.
Mapping Mode (SOM)
• After the learning algorithm is run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs will be near each other on the two-dimensional map.
GenePattern
• http://www.broad.mit.edu/cancer/software/genepattern
• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
• Use it online or download it
GenePattern
Module name: SOMClustering
Description: Self-Organizing Maps algorithm
Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu
Date: 10/28/03
Release: 1.0
Summary: The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid in which similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g., stocks, mutual funds, spectral peaks, etc.).
Final Results
• http://my.fit.edu/~sellings/finalClusters.zip
• Arranged the input data into 9 clusters using the GenePattern software
• Why 9 clusters?
  – Used the average of Stephen's output (8) + 1 to get a square map (3x3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
• Initialize graph G = (V, E)
  – Vertex v: a single gene "fingerprint" vector
  – Edge e: pairwise similarity between genes
  – Cut C: a subset of E that partitions the graph
  – Cluster c: a subset of V
  – The intersection of clusters c_i, c_j (i ≠ j) is ∅
  – Fingerprint of a cluster c: mean_vector(c)
Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.
CLICK Preprocessing
• Input: n x p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure:
  – S_vu = v · u = |v| |u| cos θ
  – Proportional only when the norm is fixed for all v ∈ V
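A sketch of the preprocessing and similarity steps (illustrative names): fixing every fingerprint to unit norm makes the dot product equal cos θ.

```python
import math

def normalize(v):
    # Scale the fingerprint to unit norm so dot products equal cos(theta)
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity(u, v):
    # S_vu = v . u = |v||u| cos(theta); after normalization this is cos(theta)
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))
```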
CLICK Similarity
• Key idea: S is a normalized mixed distribution
  – For u, v ∈ V in the same cluster: pdf f_T(x | mean μ_T, variance σ_T²)
  – For u, v ∈ V in different clusters: pdf f_F(x | mean μ_F, variance σ_F²)
Basic CLICK
• M: n x p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates
$$f(S_{ij} \mid \text{mates}) = \frac{1}{\sqrt{2\pi}\,\sigma_T} \, e^{-(S_{ij}-\mu_T)^2 / 2\sigma_T^2}$$

$$w_{ij} = \ln \frac{p_{\text{mates}} \, f(S_{ij} \mid \mu_T, \sigma_T)}{(1 - p_{\text{mates}}) \, f(S_{ij} \mid \mu_F, \sigma_F)}$$

$$w_{ij} = \ln\left(\frac{p_{\text{mates}} \, \sigma_F}{(1 - p_{\text{mates}}) \, \sigma_T}\right) + \frac{(S_{ij}-\mu_F)^2}{2\sigma_F^2} - \frac{(S_{ij}-\mu_T)^2}{2\sigma_T^2}$$
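Plugging the expanded expression for w_ij into code (a sketch with illustrative names; the means, standard deviations, and mates probability are assumed to be estimated beforehand):

```python
import math

def edge_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    # w_ij = ln( p * sigma_F / ((1 - p) * sigma_T) )
    #        + (S_ij - mu_F)^2 / (2 sigma_F^2)
    #        - (S_ij - mu_T)^2 / (2 sigma_T^2)
    return (math.log(p_mates * sigma_f / ((1.0 - p_mates) * sigma_t))
            + (s_ij - mu_f) ** 2 / (2.0 * sigma_f ** 2)
            - (s_ij - mu_t) ** 2 / (2.0 * sigma_t ** 2))
```

A similarity equal to the mates mean and far from the non-mates mean yields a positive weight, as expected.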
Basic CLICK

R: singleton set

Basic_CLICK(Graph G):
    if V(G) = {v}:                  // v is a singleton
        R.add(v)
    else if G is a kernel:
        Output(V(G))
    else:
        (H, K) ← MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
CLICK Kernel
• Decision problem: is V…
  – a singleton (|V| = 1)?
  – a subset of 2+ clusters (needs further partitioning)?
  – a subset of a single cluster (a kernel)?
CLICK Kernel
• For all possible cuts C connecting V:
  – H_0^C: cut C disconnects two clusters
  – H_1^C: cut C partitions a kernel
  – If Pr(H_0^C) > Pr(H_1^C) for any C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

$$W(C) = \ln \frac{\Pr(C \mid H_1^C)}{\Pr(C \mid H_0^C)} = \sum_{(i,j) \in C} w_{ij}$$
Minimal Weight Cut
• Choose one vertex as the source; mark it visited
• Repeatedly mark the most highly connected unmarked vertex
• The last marked node t represents the cut:
  – W(C_t) = Σ_i w_it
  – Merge t with the second-to-last marked node s
  – Remove t from V
  – Repeat until |V| = 1
Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
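The steps above are the Stoer-Wagner algorithm; a minimal sketch (illustrative names, symmetric adjacency-matrix input):

```python
def min_weight_cut(weights):
    """Stoer-Wagner minimum-cut sketch following the steps above.
    `weights` is a symmetric adjacency matrix; returns the min-cut value."""
    w = [row[:] for row in weights]
    vertices = list(range(len(weights)))
    best = float("inf")
    while len(vertices) > 1:
        # Grow a maximum-adjacency ordering from an arbitrary source vertex
        a = [vertices[0]]
        rest = vertices[1:]
        while rest:
            nxt = max(rest, key=lambda v: sum(w[v][u] for u in a))
            a.append(nxt)
            rest.remove(nxt)
        t, s = a[-1], a[-2]
        # W(C_t) = sum_i w_it: weight of the cut separating t from the rest
        best = min(best, sum(w[t][u] for u in a[:-1]))
        # Merge t into the second-to-last marked vertex s, then drop t
        for v in vertices:
            w[s][v] += w[t][v]
            w[v][s] = w[s][v]
        w[s][s] = 0
        vertices.remove(t)
    return best
```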
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
• Basic_CLICK produces kernels, not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
  – In both cases, only merge/adopt above some threshold
Full CLICK

R: singleton set

Full_CLICK(Graph G = (V, E)):
    R ← V
    while |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        let L be the list of kernels produced
        let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
Results amp Analysis
• Two metrics:
  – Similarity between our result clusters
  – Similarity of our clusters to the paper's results
• Map clusters to find regions of similar data
• Compare the best clusters to find relationships
[Figure: BCG (Hour 1 vs. Hour 6), Bayesian / CLICK / SOM comparison]
[Figure: E. coli (Hour 1 vs. Hour 6), Bayesian / CLICK / SOM comparison]
[Figure: EHEC (Hour 1 vs. Hour 6), Bayesian / CLICK / SOM comparison]
[Figure: L. monocytogenes (Hour 1 vs. Hour 6), Bayesian / CLICK / SOM comparison]
[Figure: Latex (Hour 1 vs. Hour 6), Bayesian / CLICK / SOM comparison]
[Figure: M. tuberculosis (Hour 1 vs. Hour 6), Bayesian / CLICK / SOM comparison]
[Figure: S. aureus (Hour 1 vs. Hour 6), Bayesian / CLICK / SOM comparison]
[Figure: S. Typhi (Hour 1 vs. Hour 6), Bayesian / CLICK / SOM comparison]
[Figure: S. Typhimurium (Hour 1 vs. Hour 6), Bayesian / CLICK / SOM comparison]
Error Metric
• Scores higher for clusters that contain either very many or very few members of the truth clusters
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated with ours
• High error implies the truth clusters were distributed across our clusters

Error: 4.33 (Bayesian), 2.44 (CLICK), 18.22 (SOM)
$$E = \frac{1}{\sum_{i=1}^{k} |C_i|} \sum_{i=1}^{k} \left| 1 - 2\,\frac{|T_i \cap C_i|}{|T_i|} \right| \cdot |C_i|$$
[Table: per-cluster composition for Bayesian, CLICK, and SOM: percent of our clusters, percent of the paper's clusters, cluster quality, and minimum cluster composition; overall error 4.33 (Bayesian), 2.44 (CLICK), 18.22 (SOM)]
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Bayesian Clustering
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Weak contributors removed
Bayesian Clustering
bull Initialize Clusters
bull Until Convergencendash Expectation Compute new probabilities
ndash Maximization Compute new clusters
bull Use BIC to prune clusters
bull Compute global BIC
bull Repeat with different initial clustersndash Keep results with best BIC
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Normal Distribution
πσ
σ
2)(
220 2)( xxexN
minusminus
=
n
xxexN)2(||
)(2)()( 1
π
μμ
Σ=
minusΣminusminus minus
nk
ddtk
tki
tkik
tkiedP
)2(||)|(
2)()( 1
πμμ
μμ
Σ=isin
minusΣminusminus minus
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Expectation Non‐spherical Clusters
Tk
n
kk DD
⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢
⎣
⎡
=Σ
λ
λλ
00
0000
2
1
bull Dk=Eigenvectors of Σ are the orientation of cluster k
bull λ1 λn=Eigenvalues of Σ are the radii of cluster k
bull Spherical clusters are usually sufficient (Identity Matrix)
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Compute initial probabilities
bull Given the initial clusters initialize the probabilities with the Normal Distribution
bull Since the probabilities will have to be normalized anyway we can skip the constant
2)()(00 00
)|( kiki ddkki edP μμμμ minussdotminusminus=isin
Expectation Bayersquos law
)()()|()|(
EffectPCausePCauseEffectPEffectCauseP sdot
=
Posterior Prior
sum sdotsdot
=)()|(
)()|()|(CausePCauseEffectP
CausePCauseEffectPEffectCauseP
Expectation Bayersquos law
sum sum
sum
= =
minusminus
=
minusminus
isinsdotisin
isinsdotisin=isin C
j
n
i
tji
tj
tj
tji
n
i
tki
tk
tk
tki
tki
tk
dPdP
dPdPdP
1 1
11
1
11
)|()|(
)|()|()|(
μμμμ
μμμμμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Recompute membership probabilities using normal
distribution
New Probability Old Probabilities
Maximization New Clusters
bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster
sum
sum
=
minusminus
=
minusminus
isin
sdotisin= n
i
tki
tk
n
ii
tki
tk
tk
dP
ddP
1
11
1
11
)|(
)|(
μμ
μμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Bayesian Information Criterion
bull BIC measures the efficiency of the parameterized model in terms of predicting the data
bull It is independent to prior knowledge and the model used
)ln()|(
ln 1
2
nkn
dPnBIC
n
i
tk
tki
tk sdot+
⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜
⎝
⎛isin
sdot=sum=
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
Fit quality Complexity penalty
Bayesian Information Criterion
bull After convergence compute global BIC
)ln()|(
ln 1 1
2
nknk
dPnBIC
n
i
k
jjji
sdot+
⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜
⎝
⎛isin
sdot=sumsum= =
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
GCSF
GMCSF
IL12B
IL1RN
IL6
IL6
PBEF
proIL1B
TNFA
IL8
IL8
IP10
MCP1
MGSA
MIP1A
MIP1B
MIP2A
MIP2B
RANTES
CD44
CD44
ICAM1
IFITM1
LAMB3
NINJ1
TNFAIP6
ADORA2A
CCR6
CCR7
CCRL2
DTR
1
2
37
11
6
4
10
5
8 12
9
13
14
15
17
16
18
19
20
22
21
23
24
25
26
27
28
29
30
Initial ClustersE coliTest Data
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption & Merging
- Full CLICK
- Results & Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
Bayesian Clustering
• Initialize clusters
• Until convergence:
  – Expectation: compute new probabilities
  – Maximization: compute new clusters
• Use BIC to prune clusters
• Compute the global BIC
• Repeat with different initial clusters
  – Keep the results with the best BIC
[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.
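The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's implementation: it assumes spherical unit-variance Gaussians, equal cluster priors, and initialization from the first k points; `em_cluster` is a name chosen here.

```python
import numpy as np

def em_cluster(data, k, iters=50):
    """Minimal EM sketch of the slides' loop: the E-step recomputes
    membership probabilities via Bayes' law, the M-step recomputes each
    cluster mean as the probability-weighted average of the points."""
    mu = data[:k].copy()  # initial clusters: first k points (an assumption)
    for _ in range(iters):
        # E-step: unnormalized spherical-Gaussian density per (point, cluster)
        d2 = ((data[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        p = np.exp(-0.5 * d2)
        p /= p.sum(axis=1, keepdims=True)  # normalize over clusters
        # M-step: weighted average of the points
        mu = (p.T @ data) / p.sum(axis=0)[:, None]
    return mu, p

# Two tight, well-separated pairs of points recover two distinct means
pts = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.1, 5.0]])
means, probs = em_cluster(pts, k=2)
```

With well-separated data the memberships become nearly hard assignments, so the means converge to the pair centroids.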
Normal Distribution
• One dimension:
  N(x) = e^(−(x − x_0)² / (2σ²)) / √(2πσ²)
• n dimensions:
  N(x) = e^(−(x − μ)ᵀ Σ⁻¹ (x − μ) / 2) / √((2π)ⁿ |Σ|)
• Membership probability of point d_i in cluster k:
  P(d_i ∈ μ_k^t) = e^(−(d_i − μ_k^t)ᵀ Σ⁻¹ (d_i − μ_k^t) / 2) / √((2π)^(n_k) |Σ|)
Expectation: Non-spherical Clusters
  Σ_k = D_k · diag(λ_1, …, λ_n) · D_kᵀ
• D_k = eigenvectors of Σ, the orientation of cluster k
• λ_1, …, λ_n = eigenvalues of Σ, the radii of cluster k
• Spherical clusters are usually sufficient (identity matrix)
Compute initial probabilities
• Given the initial clusters, initialize the probabilities with the normal distribution
• Since the probabilities will have to be normalized anyway, we can skip the constant:
  P(d_i ∈ μ_k^0) = e^(−(d_i − μ_k^0)·(d_i − μ_k^0) / 2)
Expectation Bayes' Law
  P(Cause | Effect) = P(Effect | Cause) · P(Cause) / P(Effect)
  (posterior on the left, prior on the right)
  P(Cause | Effect) = P(Effect | Cause) · P(Cause) / Σ [ P(Effect | Cause) · P(Cause) ]
Expectation Bayes' Law
• Recompute membership probabilities using the normal distribution:
  P(d_i ∈ μ_k^t) = [ P(d_i ∈ μ_k^(t−1)) · Σ_{i=1}^n P(d_i ∈ μ_k^(t−1)) ] / Σ_{j=1}^C [ P(d_i ∈ μ_j^(t−1)) · Σ_{i=1}^n P(d_i ∈ μ_j^(t−1)) ]
  (new probability on the left, old probabilities on the right)
Maximization New Clusters
• Clusters are computed as the weighted average of each point and the probability it belongs to the cluster:
  μ_k^t = Σ_{i=1}^n P(d_i ∈ μ_k^(t−1)) · d_i / Σ_{i=1}^n P(d_i ∈ μ_k^(t−1))
Bayesian Information Criterion
• BIC measures the efficiency of the parameterized model in terms of predicting the data
• It is independent of prior knowledge and of the model used
  BIC_k^t = n · ln( Σ_{i=1}^n P(d_i ∈ μ_k^t) ) + (k/2) · ln(n)
  (fit quality + complexity penalty)
[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
Bayesian Information Criterion
• After convergence, compute the global BIC:
  BIC^t = n · ln( Σ_{j=1}^k Σ_{i=1}^n P(d_i ∈ μ_j) ) + (k/2) · ln(n)
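The complexity-penalty idea is easy to demonstrate numerically. The slides' exact expression is hard to recover from the extraction, so this sketch uses the standard definition from the cited Wikipedia page, BIC = p·ln(n) − 2·ln(L), for an equal-weight mixture of unit-variance spherical Gaussians; `spherical_loglik` and `bic` are names chosen here.

```python
import numpy as np

def spherical_loglik(data, mu):
    """Log-likelihood of data under an equal-weight mixture of
    unit-variance spherical Gaussians with means `mu` (k x d)."""
    k, d = mu.shape
    d2 = ((data[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    comp = np.exp(-0.5 * d2) / ((2 * np.pi) ** (d / 2))
    return np.log(comp.mean(axis=1)).sum()

def bic(data, mu):
    """Standard BIC = p*ln(n) - 2*ln(L); lower is better.  Here the
    parameter count p is just the k*d mean coordinates."""
    n = len(data)
    return mu.size * np.log(n) - 2 * spherical_loglik(data, mu)

pts = np.concatenate([np.zeros((20, 2)), np.full((20, 2), 6.0)])
bic_one = bic(pts, np.array([[3.0, 3.0]]))               # one cluster: poor fit
bic_two = bic(pts, np.array([[0.0, 0.0], [6.0, 6.0]]))   # two clusters: good fit
```

Despite the larger penalty term, the two-cluster model wins because its fit term improves far more.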
[Figure: E. coli initial clusters (test data) — scatter plot of 1-hour vs. 12-hour expression for 31 genes: GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR]
[Figure: E. coli final clusters (test data) — the same 1-hour vs. 12-hour scatter partitioned into 9 clusters]
[Figures: Bayesian clustering results — 1-hour vs. 6-hour scatter plots for EHEC, S. aureus, E. coli, and S. typhimurium, each showing the rational and discrete cluster assignments]
Self-Organizing Maps
• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values, so similar inputs are mapped close to each other.

  n ← number of iterations for the training algorithm
  V ← set of learning vectors (same dimension as the input)

  learning_algorithm(n, V):
      SOM ← random values
      for j ← 1 to n:
          v ← V(random)
          BMU ← node in SOM minimizing d(v, node)
          neighbors ← neighborhood(BMU)
          neighbors.weight ← neighbors.weight + adjustment

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79–96.
• Kohonen's Self Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
Best Matching Unit
• The BMU is the node whose weight vector has the minimum distance to the training vector among all the nodes in the SOM. It is typically found using the Euclidean distance formula:
  d(v, w) = √( Σ_{i=1}^k (v_i − w_i)² )
  where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.
Neighborhood
• The area of the neighborhood shrinks over time. You can use an exponential decay function for this:
  σ(t) = σ_0 · e^(−t/λ)
  where σ_0 is the initial size of the neighborhood and λ is a time constant derived from n, the number of iterations the algorithm will run.
Weight Adjustment
• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node:
  W(t+1) = W(t) + Θ(t) · L(t) · (V(t) − W(t))
• One decay function, L(t), decreases the learning variable over time
• The other, Θ(t), decreases the rate of learning for neighbors further away from the BMU
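The three pieces above (BMU search, shrinking neighborhood, weight update) fit into one short NumPy loop. This is an illustrative sketch, not the project's GenePattern run: the grid size, decay constants, and seed are choices made here, and `train_som` is a name chosen for the example.

```python
import numpy as np

def train_som(data, grid=(3, 3), iters=200, sigma0=1.5, lr0=0.5, seed=0):
    """Minimal SOM training loop: pick a random training vector, find the
    best-matching unit (BMU) by Euclidean distance, then pull the BMU and
    its neighbors toward the vector, with the neighborhood radius and the
    learning rate both decaying exponentially over time."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))
    # grid coordinates of every node, used for the neighborhood falloff
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    lam = iters / np.log(sigma0)  # time constant for the radius decay
    for t in range(iters):
        v = data[rng.integers(len(data))]
        d2 = ((weights - v) ** 2).sum(axis=2)           # distance to every node
        bmu = np.unravel_index(np.argmin(d2), d2.shape)  # best matching unit
        sigma = sigma0 * np.exp(-t / lam)                # shrinking neighborhood
        lr = lr0 * np.exp(-t / iters)                    # decaying learning rate
        g2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
        theta = np.exp(-g2 / (2 * sigma ** 2))           # falls off away from BMU
        weights += (lr * theta)[:, :, None] * (v - weights)
    return weights

data = np.array([[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 0.9]])
som = train_som(data)
```

After training, the two groups of inputs map to different nodes of the grid, which is the "similar inputs land near each other" property the slides describe.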
Mapping Mode (SOM)
• After the learning algorithm has been run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs will be near each other on the two-dimensional map
GenePattern
• http://www.broad.mit.edu/cancer/software/genepattern
• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
• Use online or download
GenePattern
Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0.
Summary: The Self-Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. Resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, i.e., stocks, mutual funds, spectral peaks, etc.).
Final Results
• http://my.fit.edu/~sellings/finalClusters.zip
• Arranged the input data into 9 clusters using the GenePattern software
• Why 9 clusters?
  – Used the average of Stephen's output (8) + 1 so as to use a square map (3×3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
• Initialize graph G = (V, E)
  – Vertex v: a single gene "fingerprint" vector
  – Edge e: pairwise similarity between genes
  – Cut C: subset of E that partitions the graph
  – Cluster c: subset of V
  – The intersection of clusters c_i, c_j (i ≠ j) is ∅
  – Fingerprint of a cluster c: mean_vector(c)
Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.
CLICK Preprocessing
• Input: n x p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure:
  – S_vu = v·u = |v| |u| cos θ
  – Proportional only when the norm is fixed for all v ∈ V
CLICK Similarity
• Key idea: S is a normalized mixed distribution
  For u, v ∈ V in the same cluster: pdf f_T(x | mean μ_T, variance σ_T²)
  For u, v ∈ V in different clusters: pdf f_F(x | mean μ_F, variance σ_F²)
Basic CLICK
• M: n x p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates
  f(S_ij | T) = e^(−(S_ij − μ_T)² / (2σ_T²)) / √(2πσ_T²)
  w_ij = ln[ p_mates · f(S_ij | T) / ((1 − p_mates) · f(S_ij | F)) ]
       = ln( p_mates · σ_F / ((1 − p_mates) · σ_T) ) + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
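The closed-form weight above is a one-liner in code. The parameter values here are illustrative only (in CLICK they are estimated from the data), and `click_weight` is a name chosen for the example.

```python
import math

def click_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """Edge weight from the slides' log-odds form: the log ratio of the
    'mates' vs. 'non-mates' Gaussian densities for similarity s_ij."""
    const = math.log((p_mates * sigma_f) / ((1 - p_mates) * sigma_t))
    return (const
            + (s_ij - mu_f) ** 2 / (2 * sigma_f ** 2)
            - (s_ij - mu_t) ** 2 / (2 * sigma_t ** 2))

# A similarity near the mates mean gets a positive weight,
# one near the non-mates mean gets a negative weight.
w_hi = click_weight(0.8, p_mates=0.3, mu_t=0.8, sigma_t=0.1, mu_f=0.0, sigma_f=0.2)
w_lo = click_weight(0.0, p_mates=0.3, mu_t=0.8, sigma_t=0.1, mu_f=0.0, sigma_f=0.2)
```

Positive weights therefore vote for keeping an edge inside a cluster; negative weights vote for cutting it.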
Basic CLICK
  R: singleton set
  Basic_CLICK(Graph G):
      if V(G) = {v} then        // v is a singleton
          R.add(v)
      else if G is a kernel then
          Output(V(G))
      else
          (H, K) <- MinWeightCut(G)
          Basic_CLICK(H)
          Basic_CLICK(K)
CLICK Kernel
• Decision problem: is V…
  – a singleton (|V| = 1)?
  – a subset of 2+ clusters (need to partition further)?
  – a subset of a single cluster (a kernel)?
CLICK Kernel
• For all possible cuts C connecting V:
  – H_0^C: cut C disconnects two clusters
  – H_1^C: cut C partitions a kernel
  – If H_0^C > H_1^C for any C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K
  W(C) = ln( Pr(H_1^C | C) / Pr(H_0^C | C) ) = Σ_{(i,j)∈C} w_ij
Minimal Weight Cut
• Choose one vertex as the source; mark it visited
• Mark each next most highly connected vertex
• The last node t represents the cut
  – W(C_t) = Σ_i w_it
  – Merge t with the second-to-last marked node s
  – Remove t from V
  – Repeat until |V| = 1
Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
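The steps above are the Stoer–Wagner algorithm, which can be sketched directly. The weight matrix below is an illustrative example (two triangles joined by one light edge), and `stoer_wagner` is a name chosen here.

```python
def stoer_wagner(n, w):
    """Global min-cut value via Stoer & Wagner's algorithm: repeatedly run
    a maximum-adjacency search, record the 'cut of the phase' at the last
    vertex t added, then merge t into the second-to-last vertex s.
    w is a symmetric n x n weight matrix (0 = no edge)."""
    w = [row[:] for row in w]          # work on a copy
    vertices = list(range(n))
    best = float("inf")
    while len(vertices) > 1:
        # maximum-adjacency order: always add the most connected vertex
        a = [vertices[0]]
        rest = vertices[1:]
        while rest:
            nxt = max(rest, key=lambda v: sum(w[v][u] for u in a))
            a.append(nxt)
            rest.remove(nxt)
        s, t = a[-2], a[-1]
        # cut of the phase: everything incident to the last-added vertex t
        best = min(best, sum(w[t][u] for u in vertices if u != t))
        for v in vertices:             # merge t into s
            w[s][v] += w[t][v]
            w[v][s] = w[s][v]
        vertices.remove(t)
    return best

# Two triangles (edges of weight 3) joined by a single weight-1 edge:
# the global minimum cut is that single edge
W = [[0, 3, 3, 1, 0, 0],
     [3, 0, 3, 0, 0, 0],
     [3, 3, 0, 0, 0, 0],
     [1, 0, 0, 0, 3, 3],
     [0, 0, 0, 3, 0, 3],
     [0, 0, 0, 3, 3, 0]]
min_cut = stoer_wagner(6, W)
```

In CLICK the same routine runs with the (possibly negative-shifted) log-odds weights, and the two sides of the returned cut become the sub-graphs H and K.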
MinWeightCut Example
MinWeightCut
CLICK Adoption & Merging
• Basic_CLICK yields kernels – not full clusters
• Expand kernels by adding the closest singletons
• Merge kernels with the closest similarity
  – In both situations, only merge/adopt over some threshold
Full CLICK
  R: singleton set
  Full_CLICK(Graph G = (V, E)):
      R <- V
      while |R| is reduced:
          Basic_CLICK(G = (R, E_R))
          let L be the list of kernels produced
          let R be the set of singletons produced
          Adopt(L, R)
      Merge(L)
      Adopt(L, R)
Results & Analysis
• Two metrics:
  – Similarity between the result clusters
  – Similarity of the clusters with the paper's results
• Map clusters to find regions of similar data
• Compare the best clusters to find relationships
[Figures: 1-hour vs. 6-hour scatter plots comparing the Bayesian, CLICK, and SOM clusterings for each condition: BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. typhi, and S. typhimurium]
Error Metric
• Scores higher for clusters that contain either very many or very few clusters in the truth set
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated
• High error implies the truth clusters were well distributed

  Error: Bayesian 4.33, CLICK 2.44, SOM 18.22
  E = Σ_{i=1}^k [ (1 − 2·| |T_i ∩ C_i| / |C_i| − 1/2 |) · |C_i| ] / Σ_{i=1}^k |C_i|
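The metric can be sketched in code. Note the formula itself is reconstructed from a badly garbled slide, so this is a best-effort reading: each result cluster C_i is scored against its paired truth cluster T_i, scoring near 0 when it lies almost entirely inside or almost entirely outside T_i and near 1 at half overlap, weighted by cluster size; `cluster_error` is a name chosen here.

```python
def cluster_error(truths, clusters):
    """Size-weighted clustering error under the reconstructed metric.
    `truths` and `clusters` are parallel lists of sets (paired clusters)."""
    num = 0.0
    den = 0
    for t, c in zip(truths, clusters):
        frac = len(t & c) / len(c)              # overlap fraction of cluster c
        num += (1 - 2 * abs(frac - 0.5)) * len(c)
        den += len(c)
    return num / den

full_overlap = cluster_error([{1, 2, 3}], [{1, 2, 3}])   # cleanly matched
half_overlap = cluster_error([{1, 2}], [{1, 2, 3, 4}])   # maximally ambiguous
```

Full overlap (or no overlap) scores 0, while a cluster split evenly across the truth set scores 1, matching the "low error implies highly correlated" reading of the slide.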
Per-cluster comparison (decimal points restored from the extraction; blank cells mean that method produced no such cluster; rows 10–34 are Bayesian only):

| Cluster | % of Our Clusters (Bayesian / Click / SOM) | % of Paper Cluster (Bayesian / Click / SOM) | Quality (Bayesian / Click / SOM) |
| 0  | 0 / 0.02381 / 0 | 0.005051 / 0 / 0.212121 | 0 / 0 / 0 |
| 1  | 0.418605 / 0.368771 / 0 | 0.090909 / 0.560606 / 0 | 3.909091 / 1.322576 / 0 |
| 2  | 0 / 0.163333 / 0.013793 | 0 / 0.247475 / 0.010101 | 0 / 7.424242 / 1.464646 |
| 3  | 1 / 0 / 0.337278 | 0.005051 / 0 / 0.287879 | 0.005051 / 0 / 4.865152 |
| 4  | 0.8 / 0.341772 / 0.390805 | 0.020202 / 0.136364 / 0.171717 | 0.10101 / 1.077273 / 1.493939 |
| 5  | 0.8 / 0.121212 / 0 | 0.040404 / 0.040404 / 0 | 0.40404 / 2.666667 / 0 |
| 6  | 1 / 0.037037 / 0 | 0.005051 / 0.005051 / 0 | 0.005051 / 0.136364 / 0 |
| 7  | 1 / 0.041667 / 0.88172 | 0.025253 / 0.005051 / 0.414141 | 0.126263 / 0.121212 / 3.851515 |
| 8  | 1 / 0 / 0.255556 | 0.015152 / 0 / 0.116162 | 0.045455 / 0 / 1.045455 |
| 9  | 1 / 0 / 0 | 0.010101 / 0.020202 / | 18 / 64 / |
| 10 | 1 / / | 0.015152 / / | 0.045455 / / |
| 11 | 1 / / | 0.005051 / / | 0.005051 / / |
| 12 | 1 / / | 0.010101 / / | 0.020202 / / |
| 13 | 1 / / | 0.020202 / / | 0.080808 / / |
| 14 | 1 / / | 0.015152 / / | 0.045455 / / |
| 15 | 1 / / | 0.005051 / / | 0.005051 / / |
| 16 | 0.928571 / / | 0.065657 / / | 0.919192 / / |
| 17 | 1 / / | 0.035354 / / | 0.247475 / / |
| 18 | 0.444444 / / | 0.020202 / / | 0.181818 / / |
| 19 | 0 / / | 0 / / | 0 / / |
| 20 | 0.909091 / / | 0.050505 / / | 0.555556 / / |
| 21 | 0.444444 / / | 0.040404 / / | 0.727273 / / |
| 22 | 1 / / | 0.005051 / / | 0.005051 / / |
| 23 | 0 / / | 0 / / | 0 / / |
| 24 | 0 / / | 0 / / | 0 / / |
| 25 | 0.037037 / / | 0.005051 / / | 0.136364 / / |
| 26 | 0 / / | 0 / / | 0 / / |
| 27 | 0 / / | 0 / / | 0 / / |
| 28 | 0.225 / / | 0.136364 / / | 1.636364 / / |
| 29 | 0.485714 / / | 0.085859 / / | 3.005051 / / |
| 30 | 0.736842 / / | 0.070707 / / | 1.343434 / / |
| 31 | 0.004673 / / | 0.005051 / / | 1.080808 / / |
| 32 | 0.59375 / / | 0.191919 / / | 1.228283 / / |
| 33 | 0 / / | 0 / / | 0 / / |
| 34 | 0.008065 / / | 0.005051 / / | 0.626263 / / |

| Error | 4.22929 / 2.38409 / 1.78025 |
| Error | 4.33 / 2.440 / 18.22 |

Min Cluster Composition (%):

| Min % | Bayesian | Click | SOM |
| 0   | 0.00   | 0.00   | 0.00   |
| 5   | 98.48  | 98.48  | 98.99  |
| 10  | 98.48  | 98.48  | 98.99  |
| 15  | 98.48  | 94.44  | 98.99  |
| 20  | 98.48  | 69.70  | 98.99  |
| 25  | 84.85  | 69.70  | 98.99  |
| 30  | 84.85  | 69.70  | 87.37  |
| 40  | 84.85  | 0.00   | 41.41  |
| 45  | 69.70  | 0.00   | 41.41  |
| 50  | 61.11  | 0.00   | 41.41  |
| 55  | 61.11  | 0.00   | 41.41  |
| 60  | 41.92  | 0.00   | 41.41  |
| 65  | 41.92  | 0.00   | 41.41  |
| 70  | 41.92  | 0.00   | 41.41  |
| 75  | 34.85  | 0.00   | 41.41  |
| 80  | 28.79  | 0.00   | 41.41  |
| 85  | 28.79  | 0.00   | 41.41  |
| 90  | 28.79  | 0.00   | 0.00   |
| 95  | 17.17  | 0.00   | 0.00   |
| 100 | 100.00 | 100.00 | 100.00 |
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Normal Distribution
πσ
σ
2)(
220 2)( xxexN
minusminus
=
n
xxexN)2(||
)(2)()( 1
π
μμ
Σ=
minusΣminusminus minus
nk
ddtk
tki
tkik
tkiedP
)2(||)|(
2)()( 1
πμμ
μμ
Σ=isin
minusΣminusminus minus
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Expectation Non‐spherical Clusters
Tk
n
kk DD
⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢
⎣
⎡
=Σ
λ
λλ
00
0000
2
1
bull Dk=Eigenvectors of Σ are the orientation of cluster k
bull λ1 λn=Eigenvalues of Σ are the radii of cluster k
bull Spherical clusters are usually sufficient (Identity Matrix)
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Compute initial probabilities
bull Given the initial clusters initialize the probabilities with the Normal Distribution
bull Since the probabilities will have to be normalized anyway we can skip the constant
2)()(00 00
)|( kiki ddkki edP μμμμ minussdotminusminus=isin
Expectation Bayersquos law
)()()|()|(
EffectPCausePCauseEffectPEffectCauseP sdot
=
Posterior Prior
sum sdotsdot
=)()|(
)()|()|(CausePCauseEffectP
CausePCauseEffectPEffectCauseP
Expectation Bayersquos law
sum sum
sum
= =
minusminus
=
minusminus
isinsdotisin
isinsdotisin=isin C
j
n
i
tji
tj
tj
tji
n
i
tki
tk
tk
tki
tki
tk
dPdP
dPdPdP
1 1
11
1
11
)|()|(
)|()|()|(
μμμμ
μμμμμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Recompute membership probabilities using normal
distribution
New Probability Old Probabilities
Maximization New Clusters
bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster
sum
sum
=
minusminus
=
minusminus
isin
sdotisin= n
i
tki
tk
n
ii
tki
tk
tk
dP
ddP
1
11
1
11
)|(
)|(
μμ
μμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Bayesian Information Criterion
bull BIC measures the efficiency of the parameterized model in terms of predicting the data
bull It is independent to prior knowledge and the model used
)ln()|(
ln 1
2
nkn
dPnBIC
n
i
tk
tki
tk sdot+
⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜
⎝
⎛isin
sdot=sum=
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
Fit quality Complexity penalty
Bayesian Information Criterion
bull After convergence compute global BIC
)ln()|(
ln 1 1
2
nknk
dPnBIC
n
i
k
jjji
sdot+
⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜
⎝
⎛isin
sdot=sumsum= =
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
GCSF
GMCSF
IL12B
IL1RN
IL6
IL6
PBEF
proIL1B
TNFA
IL8
IL8
IP10
MCP1
MGSA
MIP1A
MIP1B
MIP2A
MIP2B
RANTES
CD44
CD44
ICAM1
IFITM1
LAMB3
NINJ1
TNFAIP6
ADORA2A
CCR6
CCR7
CCRL2
DTR
1
2
37
11
6
4
10
5
8 12
9
13
14
15
17
16
18
19
20
22
21
23
24
25
26
27
28
29
30
Initial ClustersE coliTest Data
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Expectation: Non-spherical Clusters

$$\Sigma_k = D_k \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix} D_k^T$$

• D_k: the eigenvectors of Σ give the orientation of cluster k
• λ_1, …, λ_n: the eigenvalues of Σ give the radii of cluster k
• Spherical clusters are usually sufficient (identity matrix)

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.
Compute initial probabilities
• Given the initial clusters, initialize the probabilities with the normal distribution
• Since the probabilities will have to be normalized anyway, we can skip the constant:

$$P(d_i \in \mu_k^0) = e^{-(d_i - \mu_k^0)\cdot(d_i - \mu_k^0)/2}$$
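The unnormalized assignment above can be sketched directly in Python. The function and variable names here (`initial_membership`, `centers`) are ours, not from the slides:

```python
import math

def initial_membership(data, centers):
    """Unnormalized Gaussian membership for each point/cluster pair:
    P(d_i in mu_k^0) = exp(-(d_i - mu_k) . (d_i - mu_k) / 2).
    The normalizing constant is dropped, as on the slide."""
    probs = []
    for d in data:
        row = []
        for mu in centers:
            sq = sum((di - mi) ** 2 for di, mi in zip(d, mu))
            row.append(math.exp(-sq / 2.0))
        probs.append(row)
    return probs

# Two points that coincide with the two initial centers
P = initial_membership([[0.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [4.0, 0.0]])
```

A point sitting on its own center gets the maximum unnormalized value 1.0, while its value for the other center decays with squared distance.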
Expectation: Bayes' law

$$P(\text{Cause}\mid\text{Effect}) = \frac{P(\text{Effect}\mid\text{Cause})\cdot P(\text{Cause})}{P(\text{Effect})}$$

(Posterior on the left, prior on the right.)

$$P(\text{Cause}\mid\text{Effect}) = \frac{P(\text{Effect}\mid\text{Cause})\cdot P(\text{Cause})}{\sum P(\text{Effect}\mid\text{Cause})\cdot P(\text{Cause})}$$
Expectation: Bayes' law
Recompute the membership probabilities using the normal distribution (the new probability on the left comes from the old probabilities on the right):

$$P(d_i \in \mu_k^t) = \frac{P(d_i \in \mu_k^{t-1}) \cdot \sum_{m=1}^{n} P(d_m \in \mu_k^{t-1})}{\sum_{j=1}^{C} \left[ P(d_i \in \mu_j^{t-1}) \cdot \sum_{m=1}^{n} P(d_m \in \mu_j^{t-1}) \right]}$$

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.
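The E-step described above can be written as a short function. This is a sketch of our reading of the slide: the per-cluster likelihood is weighted by a prior taken from the previous memberships, then normalized over clusters via Bayes' law (all names are ours):

```python
def expectation_step(likelihood, old_membership):
    """One E-step: new membership = likelihood * cluster prior, normalized
    over clusters.  The prior for cluster k is the total old membership
    mass it holds.  likelihood, old_membership: n x K lists of lists."""
    n, K = len(likelihood), len(likelihood[0])
    prior = [sum(old_membership[i][k] for i in range(n)) for k in range(K)]
    new = []
    for i in range(n):
        weighted = [likelihood[i][k] * prior[k] for k in range(K)]
        total = sum(weighted)
        new.append([w / total for w in weighted])  # rows sum to 1
    return new

M = expectation_step([[0.9, 0.1], [0.2, 0.8]], [[1.0, 0.0], [0.0, 1.0]])
```

Each output row is a proper probability distribution over the clusters, which is what the maximization step consumes next.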
Maximization: New Clusters
• Clusters are computed as the weighted average of each point and the probability it belongs to the cluster:

$$\mu_k^t = \frac{\sum_{i=1}^{n} P(d_i \in \mu_k^{t-1}) \cdot d_i}{\sum_{i=1}^{n} P(d_i \in \mu_k^{t-1})}$$

[32] Fraley, C. and Raftery, A. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis.
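The weighted-average M-step above is a few lines of Python (a sketch; names are ours):

```python
def maximization_step(data, membership):
    """M-step: each new center is the membership-weighted average of all
    points.  data: n x p list of lists, membership: n x K list of lists."""
    n, p, K = len(data), len(data[0]), len(membership[0])
    centers = []
    for k in range(K):
        w = [membership[i][k] for i in range(n)]
        total = sum(w)  # normalizer from the denominator of the formula
        centers.append([sum(w[i] * data[i][j] for i in range(n)) / total
                        for j in range(p)])
    return centers

# Two 1-D points; cluster 0 weights them equally, cluster 1 half as much
C = maximization_step([[0.0], [2.0]], [[1.0, 0.5], [1.0, 0.5]])
```

Note that scaling all of a cluster's weights by a constant leaves its center unchanged, which is why the unnormalized probabilities from the E-step are safe to use here.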
Bayesian Information Criterion
• BIC measures the efficiency of the parameterized model in terms of predicting the data
• It is independent of prior knowledge and of the model used

$$BIC_k^t = n\cdot\ln\!\left(\sum_{i=1}^{n} P(d_i \in \mu_k^t)\right) + \frac{k}{2}\ln(n)$$

(The first term measures fit quality; the second is a complexity penalty.)

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
Bayesian Information Criterion
• After convergence, compute the global BIC:

$$BIC = n\cdot\ln\!\left(\sum_{j=1}^{k}\sum_{i=1}^{n} P(d_i \in \mu_j)\right) + \frac{k}{2}\ln(n)$$

[33] http://en.wikipedia.org/wiki/Bayesian_information_criterion
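One possible reading of the fit-quality-plus-penalty score above, as a sketch (the function name and the `k_params` parameter, the count of free model parameters, are our choices, not from the slides):

```python
import math

def bic_score(membership, k_params):
    """BIC in the slide's form: n * ln(total membership probability)
    plus a (k/2) * ln(n) complexity penalty, so richer models must earn
    their extra parameters with better fit."""
    n = len(membership)
    fit = sum(sum(row) for row in membership)   # fit-quality term
    return n * math.log(fit) + (k_params / 2.0) * math.log(n)

# Two points, two clusters, rows summing to 1, four free parameters
score = bic_score([[0.5, 0.5], [0.25, 0.75]], k_params=4)
```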
[Figure: "E-Coli Initial Clusters (Test Data)" — scatter plot of gene expression at 1 hour (x-axis) vs. 12 hours (y-axis) for GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, and DTR, with numbered initial cluster assignments.]
[Figure: "E-Coli Final Clusters (Test Data)" — the same genes at 1 hour vs. 12 hours, colored by final assignment into Clusters 1–9.]
[Figures: Bayesian clustering results at 1 hour (x-axis) vs. 6 hours (y-axis) for four stimuli — EHEC (Clusters 0–7, 9), S. aureus (Clusters 0–7), E. coli (Clusters 0–7), and S. typhimurium (Clusters 0–7, 10, 12) — each with "Rational Clusters" and "Discrete Clusters" overlays.]
Self-Organizing Maps
• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values, so similar inputs are mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← min(d(v, node)) over every node in SOM
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
    end

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T. and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79–96.
• Kohonen's Self-Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
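The training loop on this slide can be rendered as a minimal runnable sketch. The grid size, decay forms, and constants below are our assumptions for illustration, not values from the slides:

```python
import math
import random

def train_som(vectors, grid=3, iters=200, sigma0=1.5, lr0=0.5, seed=0):
    """Train a grid x grid SOM on equal-length vectors, per the slide's
    loop: pick a random input, find the BMU, then pull the BMU's map
    neighborhood toward the input with a shrinking radius and rate."""
    rng = random.Random(seed)
    p = len(vectors[0])
    nodes = [[rng.random() for _ in range(p)] for _ in range(grid * grid)]
    pos = [(i // grid, i % grid) for i in range(grid * grid)]  # map coords
    for t in range(iters):
        v = rng.choice(vectors)
        # Best Matching Unit: node closest to v in Euclidean distance
        bmu = min(range(len(nodes)),
                  key=lambda j: sum((v[d] - nodes[j][d]) ** 2 for d in range(p)))
        sigma = sigma0 * math.exp(-t / iters)   # shrinking neighborhood
        lr = lr0 * math.exp(-t / iters)         # decaying learning rate
        for j, w in enumerate(nodes):
            # Distance is measured on the 2D map grid, not in input space
            d2 = (pos[j][0] - pos[bmu][0]) ** 2 + (pos[j][1] - pos[bmu][1]) ** 2
            theta = math.exp(-d2 / (2 * sigma * sigma))
            for d in range(p):
                w[d] += theta * lr * (v[d] - w[d])
    return nodes, pos

nodes, pos = train_som([[0.0, 0.0], [1.0, 1.0]])
```

After training, distinct inputs should have distinct nearby nodes on the map, which is the behavior the following slides (BMU, neighborhood, weight adjustment) break down piece by piece.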
Best Matching Unit
• The BMU is the node in the SOM whose weight vector lies at minimum distance from the training vector. It is typically found using the Euclidean distance:

$$d(v, w) = \sqrt{\sum_{i=1}^{k} (v_i - w_i)^2}$$

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T. and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79–96.
• Kohonen's Self-Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
Neighborhood
• The area of the neighborhood shrinks over time; an exponential decay function can be used:

$$\sigma(t) = \sigma_0\, e^{-t/n}$$

where n is the number of iterations the algorithm will run and σ₀ is the initial size of the neighborhood.

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T. and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79–96.
• Kohonen's Self-Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
Weight Adjustment
• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
• One decay function decreases the learning variable over time
• The other decreases the rate of learning for neighbors further away from the BMU

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T. and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79–96.
• Kohonen's Self-Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
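The two decay functions described above can be written out directly. The exponential/Gaussian forms below are a common choice, assumed here rather than specified on the slide:

```python
import math

def learning_rate(t, n, lr0=0.5):
    """Time decay: the learning variable shrinks as training proceeds."""
    return lr0 * math.exp(-t / n)

def neighborhood_weight(dist_to_bmu, sigma):
    """Distance decay: neighbors far from the BMU learn less."""
    return math.exp(-(dist_to_bmu ** 2) / (2 * sigma ** 2))

def adjust_weight(w, v, t, n, dist_to_bmu, sigma):
    """W(t+1) = W(t) + Theta * L(t) * (V - W(t)): both decays multiply
    the difference between the training vector and the SOM node."""
    step = neighborhood_weight(dist_to_bmu, sigma) * learning_rate(t, n)
    return [wi + step * (vi - wi) for wi, vi in zip(w, v)]

# The BMU itself (distance 0) at t = 0 moves halfway toward the input
w_new = adjust_weight([0.0, 0.0], [1.0, 1.0], t=0, n=100, dist_to_bmu=0.0, sigma=1.0)
```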
Mapping Mode (SOM)
• After the learning algorithm is run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs will be near each other on the two-dimensional map.
GenePattern
• http://www.broad.mit.edu/cancer/software/genepattern
• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
• Use online or download
GenePattern
Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0.
Summary: The Self-Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 × 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g., stocks, mutual funds, spectral peaks, etc.).
Final Results
• http://my.fit.edu/~sellings/finalClusters.zip
• Arranged the input data into 9 clusters using the GenePattern software
• Why 9 clusters?
– Used the average of Stephen's output (8) + 1 so as to use a square map (3×3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
• Initialize graph G = (V, E)
– Vertex v: a single gene's "fingerprint" vector
– Edge e: pairwise similarity between genes
– Cut C: a subset of E that partitions the graph
– Cluster c: a subset of V
– The intersection of clusters c_i, c_j (i ≠ j) is ∅
– Fingerprint of a cluster c: mean_vector(c)

Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.
CLICK Preprocessing
• Input: an n × p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure:
– S_vu = v·u = |v||u| cos θ
– Proportional to cos θ only when the norm is fixed for all v ∈ V
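The dot-product similarity above is straightforward to compute for all gene pairs (a sketch with names of our choosing):

```python
def similarity_matrix(fingerprints):
    """Pairwise dot-product similarity S_vu = v . u between gene
    fingerprint vectors.  Assumes the fingerprints were already
    normalized, so the dot product tracks cos(theta)."""
    n = len(fingerprints)
    return [[sum(a * b for a, b in zip(fingerprints[i], fingerprints[j]))
             for j in range(n)] for i in range(n)]

# Unit vectors: identical genes score 1, orthogonal genes score 0
S = similarity_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```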
CLICK Similarity
• Key idea: S follows a normalized mixture distribution:
– For u, v ∈ V in the same cluster: mean μ_T, variance σ_T²; pdf f_T(x | μ_T, σ_T²)
– For u, v ∈ V in different clusters: mean μ_F, variance σ_F²; pdf f_F(x | μ_F, σ_F²)
Basic CLICK
• M: n × p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates

$$f(S_{ij}\mid \text{mates}) = \frac{1}{\sqrt{2\pi}\,\sigma_T}\, e^{-(S_{ij}-\mu_T)^2 / 2\sigma_T^2}$$

$$w_{ij} = \ln\frac{p_{\text{mates}}\, f(S_{ij}\mid \mu_T,\sigma_T)}{(1-p_{\text{mates}})\, f(S_{ij}\mid \mu_F,\sigma_F)} = \ln\frac{p_{\text{mates}}\,\sigma_F}{(1-p_{\text{mates}})\,\sigma_T} + \frac{(S_{ij}-\mu_F)^2}{2\sigma_F^2} - \frac{(S_{ij}-\mu_T)^2}{2\sigma_T^2}$$
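The log-odds edge weight on this slide reduces to a one-liner once the two Gaussians are expanded (a sketch; the parameter values in the example are invented for illustration):

```python
import math

def mate_weight(s, p, mu_t, sigma_t, mu_f, sigma_f):
    """Edge weight w_ij: log-odds that an edge with similarity s joins
    two 'mates', under the two-Gaussian mixture with prior p of being
    mates.  Positive weight favors the same-cluster hypothesis."""
    return (math.log(p * sigma_f / ((1.0 - p) * sigma_t))
            + (s - mu_f) ** 2 / (2.0 * sigma_f ** 2)
            - (s - mu_t) ** 2 / (2.0 * sigma_t ** 2))

# Similarity 0.9 is close to the mates' mean (1.0), far from 0.0
w = mate_weight(s=0.9, p=0.5, mu_t=1.0, sigma_t=0.5, mu_f=0.0, sigma_f=0.5)
```

With equal priors and equal variances the constant term vanishes, so the sign of w depends only on which mean the similarity sits closer to.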
Basic CLICK
R: singleton set
Basic_CLICK(Graph G):
    If V(G) = {v} then R.add(v)          // v is a singleton
    Else if G is a kernel then Output(V(G))
    Else:
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
CLICK Kernel
• Decision problem: is V…
– a singleton (|V| = 1)?
– a subset of 2+ clusters (need to partition further)?
– a subset of a single cluster (a kernel)?
CLICK Kernel
• For every possible cut C connecting V:
– H₀^C: cut C disconnects two clusters
– H₁^C: cut C partitions a kernel
– If H₀^C > H₁^C for any C, then V is not a kernel
• If V is not a kernel, the graph should be partitioned into sub-graphs H, K

$$W(C) = \ln\frac{\Pr(C\mid H_1^C)}{\Pr(C\mid H_0^C)} = \sum_{(i,j)\in C} w_{ij}$$
Minimal Weight Cut
• Choose one vertex as the source and mark it visited
• At each step, mark the most highly connected unvisited vertex next
• The last node t marked represents the cut
– W(C_t) = Σ_i w_it
– Merge t with the second-to-last marked node s
– Remove t from V
– Repeat until |V| = 1

Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
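The merge-and-repeat procedure above is the Stoer-Wagner algorithm; here is a compact sketch of our own (adjacency-matrix representation and names are our choices):

```python
def min_weight_cut(weights):
    """Stoer-Wagner minimum cut.  weights: symmetric n x n matrix
    (list of lists).  Returns (best cut weight, vertices on one side)."""
    n = len(weights)
    w = [row[:] for row in weights]
    groups = [[i] for i in range(n)]        # original vertices merged in
    active = list(range(n))
    best = (float("inf"), [])
    while len(active) > 1:
        # Maximum-adjacency ordering from the first active vertex
        order = [active[0]]
        rest = active[1:]
        conn = {v: w[active[0]][v] for v in rest}
        while rest:
            v = max(rest, key=lambda x: conn[x])  # most connected next
            order.append(v)
            rest.remove(v)
            for u in rest:
                conn[u] += w[v][u]
        s, t = order[-2], order[-1]
        if conn[t] < best[0]:                 # cut-of-the-phase: t vs rest
            best = (conn[t], groups[t][:])
        for u in active:                      # merge t into s
            if u not in (s, t):
                w[s][u] += w[t][u]
                w[u][s] = w[s][u]
        groups[s].extend(groups[t])
        active.remove(t)
    return best

# Two triangles (weight-3 edges) joined by a single weight-1 edge
W = [[0, 3, 3, 1, 0, 0],
     [3, 0, 3, 0, 0, 0],
     [3, 3, 0, 0, 0, 0],
     [1, 0, 0, 0, 3, 3],
     [0, 0, 0, 3, 0, 3],
     [0, 0, 0, 3, 3, 0]]
cut_w, side = min_weight_cut(W)
```

The minimum over all phases' cuts is the global minimum cut; here it correctly isolates one triangle across the light bridging edge.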
MinWeightCut Example
MinWeightCut
CLICK Adoption & Merging
• Basic_CLICK produces kernels, not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
– In both cases, only merge/adopt above some threshold
Full CLICK
R: singleton set
Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
Results & Analysis
• Two metrics:
– Similarity between our result clusters
– Similarity of our clusters to the paper's results
• Map clusters to find regions of similar data
• Compare the best clusters to find relationships
[Figures: per-stimulus comparison scatter plots of expression at 1 hour (x-axis) vs. 6 hours (y-axis), overlaying the Bayesian, CLICK, and SOM cluster results for BCG, E. coli, EHEC, L. monocytogenes, latex beads, M. tuberculosis, S. aureus, S. typhi, and S. typhimurium.]
Error Metric
• Scores are higher for result clusters that contain either very many or very few members of a truth-set cluster
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated with ours
• High error implies the truth clusters were spread across ours

Error: Bayesian 4.33, CLICK 2.44, SOM 1.822
$$E = \frac{\displaystyle\sum_{i=1}^{k}\left[\left(1 - \left|\,1 - \frac{2\,|T_i \cap C_i|}{|T_i|}\right|\right)\cdot |C_i|\right]}{\displaystyle\sum_{i=1}^{k} |C_i|}$$
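Our reading of the (partly garbled) error formula can be sketched as follows: each term is near zero when a result cluster captures almost all or almost none of its truth cluster, and the terms are size-weighted (pairing by index and all names are our assumptions):

```python
def cluster_error(truth, ours):
    """Error metric: 1 - |1 - 2|T_i & C_i| / |T_i|| is near 0 when C_i
    captures almost all or almost none of T_i, and near 1 for a 50/50
    split; each term is weighted by |C_i| and the total is normalized.
    truth, ours: lists of sets of gene ids, paired by index."""
    num = 0.0
    den = 0.0
    for T, C in zip(truth, ours):
        overlap = len(T & C)
        num += (1.0 - abs(1.0 - 2.0 * overlap / len(T))) * len(C)
        den += len(C)
    return num / den

# A perfectly recovered cluster contributes 0; a half-overlap cluster
# contributes its full size
E = cluster_error([{1, 2, 3, 4}, {5, 6}], [{1, 2, 3, 4}, {5, 7}])
```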
[Table: per-cluster comparison of the Bayesian, CLICK, and SOM results (clusters 0–34): percent of our clusters, percent of the paper's clusters, cluster quality, and minimum cluster composition. Error totals: 4.22929 / 2.38409 / 1.78025 and 4.33 / 2.440 / 1.822 (Bayesian / CLICK / SOM).]
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Compute initial probabilities
bull Given the initial clusters initialize the probabilities with the Normal Distribution
bull Since the probabilities will have to be normalized anyway we can skip the constant
2)()(00 00
)|( kiki ddkki edP μμμμ minussdotminusminus=isin
Expectation Bayersquos law
)()()|()|(
EffectPCausePCauseEffectPEffectCauseP sdot
=
Posterior Prior
sum sdotsdot
=)()|(
)()|()|(CausePCauseEffectP
CausePCauseEffectPEffectCauseP
Expectation Bayersquos law
sum sum
sum
= =
minusminus
=
minusminus
isinsdotisin
isinsdotisin=isin C
j
n
i
tji
tj
tj
tji
n
i
tki
tk
tk
tki
tki
tk
dPdP
dPdPdP
1 1
11
1
11
)|()|(
)|()|()|(
μμμμ
μμμμμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Recompute membership probabilities using normal
distribution
New Probability Old Probabilities
Maximization New Clusters
bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster
sum
sum
=
minusminus
=
minusminus
isin
sdotisin= n
i
tki
tk
n
ii
tki
tk
tk
dP
ddP
1
11
1
11
)|(
)|(
μμ
μμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Bayesian Information Criterion
bull BIC measures the efficiency of the parameterized model in terms of predicting the data
bull It is independent to prior knowledge and the model used
)ln()|(
ln 1
2
nkn
dPnBIC
n
i
tk
tki
tk sdot+
⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜
⎝
⎛isin
sdot=sum=
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
Fit quality Complexity penalty
Bayesian Information Criterion
bull After convergence compute global BIC
)ln()|(
ln 1 1
2
nknk
dPnBIC
n
i
k
jjji
sdot+
⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜
⎝
⎛isin
sdot=sumsum= =
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
GCSF
GMCSF
IL12B
IL1RN
IL6
IL6
PBEF
proIL1B
TNFA
IL8
IL8
IP10
MCP1
MGSA
MIP1A
MIP1B
MIP2A
MIP2B
RANTES
CD44
CD44
ICAM1
IFITM1
LAMB3
NINJ1
TNFAIP6
ADORA2A
CCR6
CCR7
CCRL2
DTR
1
2
37
11
6
4
10
5
8 12
9
13
14
15
17
16
18
19
20
22
21
23
24
25
26
27
28
29
30
Initial ClustersE coliTest Data
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Expectation Bayersquos law
)()()|()|(
EffectPCausePCauseEffectPEffectCauseP sdot
=
Posterior Prior
sum sdotsdot
=)()|(
)()|()|(CausePCauseEffectP
CausePCauseEffectPEffectCauseP
Expectation Bayersquos law
sum sum
sum
= =
minusminus
=
minusminus
isinsdotisin
isinsdotisin=isin C
j
n
i
tji
tj
tj
tji
n
i
tki
tk
tk
tki
tki
tk
dPdP
dPdPdP
1 1
11
1
11
)|()|(
)|()|()|(
μμμμ
μμμμμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Recompute membership probabilities using normal
distribution
New Probability Old Probabilities
Maximization New Clusters
bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster
sum
sum
=
minusminus
=
minusminus
isin
sdotisin= n
i
tki
tk
n
ii
tki
tk
tk
dP
ddP
1
11
1
11
)|(
)|(
μμ
μμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Bayesian Information Criterion
bull BIC measures the efficiency of the parameterized model in terms of predicting the data
bull It is independent to prior knowledge and the model used
)ln()|(
ln 1
2
nkn
dPnBIC
n
i
tk
tki
tk sdot+
⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜
⎝
⎛isin
sdot=sum=
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
Fit quality Complexity penalty
Bayesian Information Criterion
bull After convergence compute global BIC
)ln()|(
ln 1 1
2
nknk
dPnBIC
n
i
k
jjji
sdot+
⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜
⎝
⎛isin
sdot=sumsum= =
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
[Chart: E. coli test data, initial clusters — hour-1 vs. hour-12 expression for the 31 test genes (GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, DTR).]
[Chart: E. coli test data, final clusters — the same genes partitioned into clusters 1–9, hour 1 vs. hour 12.]
[Charts: hour-1 vs. hour-6 expression with rational and discrete cluster assignments for EHEC, S. aureus, E. coli, and S. Typhimurium.]
Self-Organizing Maps
• SOMs are based on neural networks. They use a learning algorithm to train a random map to resemble the input values, so similar inputs are mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← node in SOM minimizing d(v, node)
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
    end

• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. "Self-organizing maps in mining gene expression data." Information Sciences 139 (2001), 79–96.
• "Kohonen's Self Organizing Feature Maps." AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
Best Matching Unit
• The BMU is the node at minimum distance from the training vector among all nodes in the SOM. It is typically found with the Euclidean distance:

$$d(v, w) = \sqrt{\sum_{i=1}^{k} (v_i - w_i)^2}$$

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.
Neighborhood
• The area of the neighborhood shrinks over time; the exponential decay function can be used for this:

$$\sigma(t) = \sigma_0\, e^{-t/\lambda}, \qquad \lambda = \frac{n}{\ln(\sigma_0)}$$

where n is the number of iterations the algorithm will run and σ₀ is the initial size of the neighborhood.
Weight Adjustment
• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node:

$$w(t+1) = w(t) + \Theta(t)\, L(t)\, \big(v(t) - w(t)\big)$$

• One decay function, L(t), decreases the learning rate over time.
• The other, Θ(t), decreases the rate of learning for neighbors farther from the BMU.
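Putting the BMU search, the shrinking neighborhood, and the weight-adjustment rule together, a minimal SOM training loop might look like the following sketch. Grid size, decay constants, and variable names are our own assumptions, not the GenePattern implementation:

```python
import numpy as np

def train_som(data, grid=(3, 3), iters=500, sigma0=1.5, lr0=0.5, seed=0):
    """Sketch of the SOM training loop: pick a random training vector,
    find its BMU, and pull the BMU's neighborhood toward the vector."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    w = rng.normal(size=(rows * cols, data.shape[1]))   # random initial map
    lam = iters / np.log(sigma0)                        # neighborhood decay constant
    for t in range(iters):
        v = data[rng.integers(len(data))]               # random training vector
        bmu = np.argmin(((w - v) ** 2).sum(1))          # best matching unit
        sigma = sigma0 * np.exp(-t / lam)               # shrinking neighborhood
        lr = lr0 * np.exp(-t / iters)                   # decaying learning rate
        d2 = ((coords - coords[bmu]) ** 2).sum(1)       # grid distance to BMU
        theta = np.exp(-d2 / (2 * sigma ** 2))          # neighborhood falloff
        w += (lr * theta)[:, None] * (v - w)            # weight adjustment
    return w, coords

def map_input(w, v):
    """Mapping mode: return the grid index of the BMU for input v."""
    return int(np.argmin(((w - v) ** 2).sum(1)))
```

After training, `map_input` places each n-dimensional input onto the two-dimensional grid, so similar inputs land on nearby nodes.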
Mapping Mode (SOM)
• After the learning algorithm has run, the input is mapped onto the SOM; similar n-dimensional inputs end up near each other on the two-dimensional map.
GenePattern
• http://www.broad.mit.edu/cancer/software/genepattern
• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools.
• Use it online or download it.
GenePattern
Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0.
Summary: The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 × 2 grid) and mapped into the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid in which similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or nearly any data, e.g., stocks, mutual funds, spectral peaks).
Final Results
• http://my.fit.edu/~sellings/finalClusters.zip
• Arranged the input data into 9 clusters using the GenePattern software.
• Why 9 clusters? We used the average of Stephen's output (8) + 1 so we could use a square map (3×3 SOM).
CLuster Identification via Connectivity Kernels (CLICK)
• Initialize graph G = (V, E)
  – Vertex v: a single gene's "fingerprint" vector
  – Edge e: pairwise similarity between genes
  – Cut C: a subset of E that partitions the graph
  – Cluster c: a subset of V
  – The intersection of clusters c_i, c_j (i ≠ j) is Ø
  – Fingerprint of a cluster c: mean_vector(c)
Roded Sharan and Ron Shamir. "CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis." Proceedings of Intelligent Systems for Molecular Biology, 2000.
CLICK Preprocessing
• Input: an n × p matrix M of values
• n genes, p tests
• The data must be normalized
• Similarity measure:
  – S_vu = v · u = |v||u| cos θ
  – Proportional to cos θ only when the norm is fixed for all v ∈ V
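With the fingerprints as rows of M, the cosine-based similarity matrix takes only a few lines. This is a sketch; the per-gene mean-centering step is our assumption about the normalization, not stated on the slide:

```python
import numpy as np

def similarity_matrix(M):
    """S[v, u] = cos(theta) between gene fingerprints v and u, after
    normalizing each fingerprint to zero mean and unit norm."""
    Mc = M - M.mean(axis=1, keepdims=True)               # per-gene centering
    Mn = Mc / np.linalg.norm(Mc, axis=1, keepdims=True)  # fixed norm for all v
    return Mn @ Mn.T
```

Because every row's norm is fixed at 1, the dot product S_vu = v · u is exactly cos θ, satisfying the proportionality condition above.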
CLICK Similarity
• Key idea: S is a normalized, mixed distribution:
  – For u, v ∈ V in the same cluster: pdf f(x | μ_T, σ_T²) with mean μ_T, variance σ_T²
  – For u, v ∈ V in different clusters: pdf f(x | μ_F, σ_F²) with mean μ_F, variance σ_F²
Basic CLICK
• M: n × p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates

$$f(S_{ij} \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(S_{ij}-\mu)^2}{2\sigma^2}}$$

$$w_{ij} = \ln\frac{p_{mates}\, f(S_{ij} \mid \mu_T, \sigma_T)}{(1-p_{mates})\, f(S_{ij} \mid \mu_F, \sigma_F)} = \ln\frac{p_{mates}\,\sigma_F}{(1-p_{mates})\,\sigma_T} + \frac{(S_{ij}-\mu_F)^2}{2\sigma_F^2} - \frac{(S_{ij}-\mu_T)^2}{2\sigma_T^2}$$
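The edge weight is easy to evaluate once the mixture parameters are estimated. A small sketch, with the Gaussian normalizers cancelled into a single log term as on the slide (the parameter values in the usage below are made up for illustration):

```python
import numpy as np

def mate_weight(S, p_mates, mu_T, sig_T, mu_F, sig_F):
    """w_ij = ln[ p * f(S|mu_T,sig_T) / ((1 - p) * f(S|mu_F,sig_F)) ],
    expanded so only one log term and two quadratic terms remain."""
    return (np.log(p_mates * sig_F / ((1.0 - p_mates) * sig_T))
            + (S - mu_F) ** 2 / (2.0 * sig_F ** 2)
            - (S - mu_T) ** 2 / (2.0 * sig_T ** 2))
```

Similarities near μ_T give positive weights (the pair is likely mates); similarities near μ_F give negative weights.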
Basic CLICK

R: singleton set
Basic_CLICK(Graph G):
    if V(G) = {v} then            // v is a singleton
        R.add(v)
    else if G is a kernel then
        Output(V(G))
    else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
CLICK Kernel
• Decision problem: is V…
  – a singleton (|V| = 1)?
  – a subset of 2+ clusters (needs further partitioning)?
  – a subset of a single cluster (a kernel)?
CLICK Kernel
• For every possible cut C disconnecting V:
  – H_0^C: cut C disconnects two clusters
  – H_1^C: cut C partitions a kernel
  – If H_0^C is more likely than H_1^C for any C, then V is not a kernel
• If V is not a kernel, the graph should be partitioned into sub-graphs H, K

$$W(C) = \ln\frac{\Pr(C \mid H_1^C)}{\Pr(C \mid H_0^C)} = \sum_{(i,j)\in C} w_{ij}$$
Minimal Weight Cut
• Choose one vertex as the source; mark it visited
• Repeatedly mark the most highly connected unvisited vertex
• The last node t marked represents the cut:
  – W(C_t) = Σ_i w_it
  – Merge t with the second-to-last marked node s
  – Remove t from V
  – Repeat until |V| = 1
Mechthild Stoer and Frank Wagner. "A Simple Min-Cut Algorithm." Journal of the ACM, Vol. 44, No. 4, July 1997.
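The procedure above is the Stoer–Wagner algorithm. A compact sketch on a dense symmetric weight matrix (function and variable names are ours):

```python
def stoer_wagner(adj):
    """Global minimum-weight cut by repeated maximum-adjacency phases.
    adj: symmetric nonnegative weight matrix (list of lists).
    Returns (best cut weight, vertices on one side of that cut)."""
    n = len(adj)
    adj = [row[:] for row in adj]            # work on a copy
    groups = [[i] for i in range(n)]         # original vertices merged into each node
    active = list(range(n))
    best = (float('inf'), None)
    while len(active) > 1:
        # maximum-adjacency ordering: repeatedly add the most connected vertex
        a = [active[0]]
        conn = {v: adj[v][a[0]] for v in active[1:]}
        while conn:
            t = max(conn, key=conn.get)
            a.append(t)
            del conn[t]
            for v in conn:
                conn[v] += adj[v][t]
        s, t = a[-2], a[-1]
        # cut-of-the-phase: weight separating t from the rest
        cut_w = sum(adj[t][v] for v in active if v != t)
        if cut_w < best[0]:
            best = (cut_w, groups[t][:])
        # merge t into the second-to-last node s, then remove t
        for v in active:
            adj[s][v] += adj[t][v]
            adj[v][s] = adj[s][v]
        groups[s] += groups[t]
        active.remove(t)
    return best
```

On a graph with two tightly connected pairs joined by one weak edge, the algorithm returns that weak edge as the minimum cut.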
MinWeightCut Example
[Figures: step-by-step worked example of the minimum-weight-cut procedure — images not recovered.]
CLICK Adoption & Merging
• Basic_CLICK produces kernels, not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
  – In both steps, only merge/adopt above some threshold
Full CLICK

R: singleton set
Full_CLICK(Graph G = (V, E)):
    R <- V
    while |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        let L be the list of kernels produced
        let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
Results & Analysis
• Two metrics:
  – Similarity between our result clusters
  – Similarity of our clusters to the paper's results
• Map clusters to find regions of similar data
• Compare the best clusters to find relationships
[Charts: hour-1 vs. hour-6 expression comparing the Bayesian, CLICK, and SOM cluster assignments for BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. Typhi, and S. Typhimurium.]
Error Metric
• Scores are higher for clusters that contain either very many or very few members of the truth set
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated with ours
• High error implies the truth clusters were distributed across ours

Error: Bayesian 4.33, CLICK 2.44, SOM 18.22
$$E = \frac{1}{\sum_{i=1}^{k}|C_i|}\sum_{i=1}^{k}\left(1 - \left|\frac{2\,|T_i \cap C_i|}{|T_i|} - 1\right|\right)\cdot|C_i|$$
[Table: per-cluster comparison of the three methods — percent of our clusters matched, percent of the paper's clusters matched, cluster quality, and minimum cluster composition — for Bayesian, CLICK, and SOM, with overall error scores (Bayesian 4.33, CLICK 2.44, SOM 18.22).]
- Cluster Analysis for Gene Expression Data (CSE 5615 Class Project, Group 2)
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs Induced by Pathogens
- Project Goals
- K-Means
- Bayesian Clustering
- Normal Distribution
- Expectation: Non-spherical Clusters
- Compute Initial Probabilities
- Expectation: Bayes' Law
- Maximization: New Clusters
- Bayesian Information Criterion
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Mapping Mode (SOM)
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption & Merging
- Full CLICK
- Results & Analysis
- Error Metric
Expectation Bayersquos law
sum sum
sum
= =
minusminus
=
minusminus
isinsdotisin
isinsdotisin=isin C
j
n
i
tji
tj
tj
tji
n
i
tki
tk
tk
tki
tki
tk
dPdP
dPdPdP
1 1
11
1
11
)|()|(
)|()|()|(
μμμμ
μμμμμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Recompute membership probabilities using normal
distribution
New Probability Old Probabilities
Maximization New Clusters
bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster
sum
sum
=
minusminus
=
minusminus
isin
sdotisin= n
i
tki
tk
n
ii
tki
tk
tk
dP
ddP
1
11
1
11
)|(
)|(
μμ
μμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Bayesian Information Criterion
bull BIC measures the efficiency of the parameterized model in terms of predicting the data
bull It is independent to prior knowledge and the model used
)ln()|(
ln 1
2
nkn
dPnBIC
n
i
tk
tki
tk sdot+
⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜
⎝
⎛isin
sdot=sum=
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
Fit quality Complexity penalty
Bayesian Information Criterion
bull After convergence compute global BIC
)ln()|(
ln 1 1
2
nknk
dPnBIC
n
i
k
jjji
sdot+
⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜
⎝
⎛isin
sdot=sumsum= =
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
GCSF
GMCSF
IL12B
IL1RN
IL6
IL6
PBEF
proIL1B
TNFA
IL8
IL8
IP10
MCP1
MGSA
MIP1A
MIP1B
MIP2A
MIP2B
RANTES
CD44
CD44
ICAM1
IFITM1
LAMB3
NINJ1
TNFAIP6
ADORA2A
CCR6
CCR7
CCRL2
DTR
1
2
37
11
6
4
10
5
8 12
9
13
14
15
17
16
18
19
20
22
21
23
24
25
26
27
28
29
30
Initial ClustersE coliTest Data
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Maximization New Clusters
bull Clusters are computed as the weighted average of each point and the probability it belongs to the cluster
sum
sum
=
minusminus
=
minusminus
isin
sdotisin= n
i
tki
tk
n
ii
tki
tk
tk
dP
ddP
1
11
1
11
)|(
)|(
μμ
μμμ
[32] Fraley C and Raftery A How Many Cluster Which Clustering Method Answers Via Model-Based Cluster Analysis
Bayesian Information Criterion
bull BIC measures the efficiency of the parameterized model in terms of predicting the data
bull It is independent to prior knowledge and the model used
)ln()|(
ln 1
2
nkn
dPnBIC
n
i
tk
tki
tk sdot+
⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜
⎝
⎛isin
sdot=sum=
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
Fit quality Complexity penalty
Bayesian Information Criterion
bull After convergence compute global BIC
)ln()|(
ln 1 1
2
nknk
dPnBIC
n
i
k
jjji
sdot+
⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜
⎝
⎛isin
sdot=sumsum= =
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
GCSF
GMCSF
IL12B
IL1RN
IL6
IL6
PBEF
proIL1B
TNFA
IL8
IL8
IP10
MCP1
MGSA
MIP1A
MIP1B
MIP2A
MIP2B
RANTES
CD44
CD44
ICAM1
IFITM1
LAMB3
NINJ1
TNFAIP6
ADORA2A
CCR6
CCR7
CCRL2
DTR
1
2
37
11
6
4
10
5
8 12
9
13
14
15
17
16
18
19
20
22
21
23
24
25
26
27
28
29
30
Initial ClustersE coliTest Data
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
• http://www.broad.mit.edu/cancer/software/genepattern
• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
• Use it online or download it
GenePattern
Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0.
Summary: The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. With the SOM, the geometry of the grid is chosen (e.g., a 3 x 2 grid) and randomly mapped into the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g., stocks, mutual funds, spectral peaks, etc.).
Final Results
• http://my.fit.edu/~sellings/finalClusters.zip
• Arranged the input data into 9 clusters using the GenePattern software
• Why 9 clusters? We used the average of Stephen's output (8) + 1 to get a square map (3x3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
• Initialize Graph G = (V, E)
 – Vertex v: a single gene's "fingerprint" (expression) vector
 – Edge e: pairwise similarity between genes
 – Cut C: subset of E that partitions the graph
 – Cluster c: subset of V
 – Intersection of clusters c_i, c_j (i ≠ j) is ∅
 – Fingerprint of a cluster c: mean_vector(c)
Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology (ISMB), 2000.
CLICK Preprocessing
• Input: n x p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure:
 – S_vu = v · u = |v||u| cos θ
 – Proportional to cos θ only when the norm is fixed for all v ∈ V
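The normalization requirement can be illustrated with a short sketch (plain Python lists; the function names are ours): once every fingerprint has unit norm, the dot product equals the cosine of the angle between the vectors.

```python
import math

def normalize(v):
    # Fix the norm of every fingerprint at 1 so that the dot product
    # S_vu = v . u equals cos(theta) between the two vectors.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity(v, u):
    # Dot-product similarity of two (normalized) fingerprints.
    return sum(a * b for a, b in zip(v, u))
```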
CLICK Similarity
• Key idea: S follows a normalized mixture distribution:
 – For u, v ∈ V in the same cluster (mates): S_uv ~ f_T(x | μ_T, σ_T²)
 – For u, v ∈ V in different clusters: S_uv ~ f_F(x | μ_F, σ_F²)
where f_T and f_F are normal pdfs with means μ_T, μ_F and variances σ_T², σ_F².
Basic CLICK
• M: n x p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: probability weight that v_i, v_j are mates:

f(S_ij | mates) = (1 / (√(2π) σ_T)) · exp(−(S_ij − μ_T)² / (2σ_T²))

w_ij = ln [ p_mates · f(S_ij | mates) / ((1 − p_mates) · f(S_ij | non-mates)) ]
     = ln ( p_mates σ_F / ((1 − p_mates) σ_T) ) + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
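In code, the mate weight for a pair of fingerprints is a log-likelihood ratio of two normal densities. The parameter values below are illustrative, not fitted to any data:

```python
import math

def normal_pdf(x, mu, sigma):
    # f(x | mu, sigma^2) for a univariate normal distribution.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def mate_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    # w_ij = ln( p * f(S_ij | mates) / ((1 - p) * f(S_ij | non-mates)) )
    return math.log(p_mates * normal_pdf(s_ij, mu_t, sigma_t)
                    / ((1 - p_mates) * normal_pdf(s_ij, mu_f, sigma_f)))
```

A similarity near the mates mean gets a positive weight; one near the non-mates mean gets a negative weight.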
Basic CLICK

R: singleton set
Basic_CLICK(Graph G):
    If V(G) = {v} then        # v is a singleton
        R.add(v)
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
CLICK Kernel
• Decision problem: is V…
 – a singleton (|V| = 1)
 – a subset of 2+ clusters (needs further partitioning)
 – a subset of a single cluster (a kernel)
CLICK Kernel
• For every possible cut C disconnecting V:
 – H₀^C: cut C disconnects two clusters
 – H₁^C: cut C partitions a kernel
 – If H₀^C is more likely than H₁^C for some C, then V is not a kernel
• If V is not a kernel, the graph should be partitioned into sub-graphs H, K

W(C) = ln ( Pr(C | H₁^C) / Pr(C | H₀^C) ) = Σ_{(i,j) ∈ C} w_ij
Minimal Weight Cut
• Choose one vertex as the source; mark it visited
• Repeatedly mark the most highly connected unmarked vertex
• The last node t defines the cut of the phase
 – W(C_t) = Σ_i w_it
 – Merge t with the second-to-last marked node s
 – Remove t from V
 – Repeat until |V| = 1
Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
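A compact Python sketch of this contraction procedure (the Stoer-Wagner algorithm cited above); representing the graph as a dense symmetric weight matrix is our choice:

```python
def min_weight_cut(weights):
    """Stoer-Wagner minimum cut on a symmetric weight matrix.
    Returns (cut_weight, original_vertices_on_one_side)."""
    n = len(weights)
    w = [row[:] for row in weights]        # working copy; vertices get merged
    groups = [[i] for i in range(n)]       # original vertices inside each super-vertex
    active = list(range(n))
    best_weight, best_side = float("inf"), []
    while len(active) > 1:
        # Maximum-adjacency order: repeatedly mark the most connected vertex.
        order = [active[0]]
        conn = {v: w[active[0]][v] for v in active[1:]}
        while conn:
            u = max(conn, key=conn.get)
            order.append(u)
            del conn[u]
            for v in conn:
                conn[v] += w[u][v]
        s, t = order[-2], order[-1]
        # Cut of the phase: W(C_t) = sum of w_it over the remaining vertices i.
        cut_weight = sum(w[t][v] for v in active if v != t)
        if cut_weight < best_weight:
            best_weight, best_side = cut_weight, list(groups[t])
        # Merge t into the second-to-last node s, then remove t.
        for v in active:
            if v != s and v != t:
                w[s][v] += w[t][v]
                w[v][s] = w[s][v]
        groups[s].extend(groups[t])
        active.remove(t)
    return best_weight, best_side
```

On a graph made of two heavy triangles joined by a single light edge, the light edge is returned as the minimum cut.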
MinWeightCut Example
MinWeightCut
CLICK Adoption & Merging
• Basic_CLICK produces kernels, not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
 – In both cases, only merge/adopt above some similarity threshold
Full CLICK

R: singleton set
Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
Results amp Analysis
• Two metrics:
 – similarity between our result clusters
 – similarity of our clusters to the paper's results
• Map clusters to find regions of similar data
• Compare the best clusters to find relationships
(Scatter plots: expression at 1 hour vs. 6 hours, with the Bayesian, Click, and SOM cluster assignments overlaid; one plot per condition: BCG, E. coli, EHEC, L. monocytogenes, latex beads, M. tuberculosis, S. aureus, S. typhi, S. typhimurium.)
Error Metric
• Scores higher for result clusters that contain either very many or very few of the truth-set clusters
• Scaled by the number of data points in each cluster
• Low error implies the truth clusters were highly correlated with ours
• High error implies the truth clusters were distributed across our clusters

           Bayesian   Click   SOM
Error        4.33      2.44   1.822
E = ( Σ_{i=1..k} [ (1 − 2·|T_i ∩ C_i| / (|T_i| + |C_i|)) · |C_i| ] ) / Σ_{i=1..k} |C_i|

where T_i is the i-th truth cluster and C_i the matched result cluster.
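The metric can be read as a cluster-size-weighted overlap score. The sketch below is our interpretation (the exact definition on the slide did not survive extraction), using Python sets for clusters:

```python
def cluster_error(truth, ours):
    # For each matched pair (T_i, C_i), take the Dice dissimilarity
    # 1 - 2|T_i & C_i| / (|T_i| + |C_i|), weight it by |C_i|, and
    # average over all data points.  This reconstruction is an
    # assumption, not the authors' exact formula.
    num = sum((1 - 2 * len(t & c) / (len(t) + len(c))) * len(c)
              for t, c in zip(truth, ours))
    return num / sum(len(c) for c in ours)
```

Identical clusterings give an error of 0; completely disjoint matched pairs give 1.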
(Flattened results table; decimal points were lost in extraction. For each result cluster (0-34) it lists, for Bayesian, Click, and SOM: the percent of our clusters matched, the percent of the paper's clusters matched, and a cluster-quality score; a parallel "Min Cluster Composition" table gives, at thresholds from 0% to 100% in 5% steps, the percentage of clusters meeting that composition.)

           Bayesian   Click     SOM
Error       4.22929   2.38409   1.78025
Error       4.33      2.44      1.822
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation: Bayes' Law
- Expectation: Bayes' Law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption & Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
Bayesian Information Criterion
bull BIC measures the efficiency of the parameterized model in terms of predicting the data
bull It is independent to prior knowledge and the model used
)ln()|(
ln 1
2
nkn
dPnBIC
n
i
tk
tki
tk sdot+
⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜
⎝
⎛isin
sdot=sum=
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
Fit quality Complexity penalty
Bayesian Information Criterion
bull After convergence compute global BIC
)ln()|(
ln 1 1
2
nknk
dPnBIC
n
i
k
jjji
sdot+
⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜
⎝
⎛isin
sdot=sumsum= =
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
GCSF
GMCSF
IL12B
IL1RN
IL6
IL6
PBEF
proIL1B
TNFA
IL8
IL8
IP10
MCP1
MGSA
MIP1A
MIP1B
MIP2A
MIP2B
RANTES
CD44
CD44
ICAM1
IFITM1
LAMB3
NINJ1
TNFAIP6
ADORA2A
CCR6
CCR7
CCRL2
DTR
1
2
37
11
6
4
10
5
8 12
9
13
14
15
17
16
18
19
20
22
21
23
24
25
26
27
28
29
30
Initial ClustersE coliTest Data
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Bayesian Information Criterion
bull After convergence compute global BIC
)ln()|(
ln 1 1
2
nknk
dPnBIC
n
i
k
jjji
sdot+
⎟⎟⎟⎟⎟
⎠
⎞
⎜⎜⎜⎜⎜
⎝
⎛isin
sdot=sumsum= =
μμ
[33] httpenwikipediaorgwikiBayesian_information_criterion
GCSF
GMCSF
IL12B
IL1RN
IL6
IL6
PBEF
proIL1B
TNFA
IL8
IL8
IP10
MCP1
MGSA
MIP1A
MIP1B
MIP2A
MIP2B
RANTES
CD44
CD44
ICAM1
IFITM1
LAMB3
NINJ1
TNFAIP6
ADORA2A
CCR6
CCR7
CCRL2
DTR
1
2
37
11
6
4
10
5
8 12
9
13
14
15
17
16
18
19
20
22
21
23
24
25
26
27
28
29
30
Initial ClustersE coliTest Data
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
• Scores higher for result clusters that contain either very many or very few members of the truth-set clusters
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated with our clusters
• High error implies the truth clusters were spread across many of our clusters

         Bayesian  CLICK  SOM
Error    4.33      2.44   1.822
E = (1 / Σ_{i=1}^{k} |C_i|) · Σ_{i=1}^{k} [ (1 − 2|T_i ∩ C_i| / (|T_i| + |C_i|)) · |C_i| ]

where T_i is the truth cluster paired with result cluster C_i.
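This metric can be sketched as a size-weighted Dice distance between each result cluster and a matching truth cluster. The best-overlap pairing and the exact weighting are assumptions of this sketch, not a verified reimplementation of the project's scorer:

```python
def cluster_error(truth, computed):
    """Size-weighted average of (1 - Dice overlap) between each
    computed cluster and its best-matching truth cluster."""
    num = den = 0.0
    for members in computed:
        c = set(members)
        best = max(truth, key=lambda t: len(c & set(t)))   # assumed pairing
        dice = 2.0 * len(c & set(best)) / (len(c) + len(best))
        num += (1.0 - dice) * len(c)   # weight by cluster size
        den += len(c)
    return num / den
```

A perfect match yields error 0; clusters that split or mix truth clusters push the score up.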
[Table: per-cluster comparison of the three methods over clusters 0–34, with columns Percent of Our Clusters, Percent of Paper Cluster, Cluster Quality, and Min Cluster Composition, each reported for Bayesian, CLICK, and SOM; the individual cell values did not survive extraction.]

         Bayesian  CLICK    SOM
Error    4.22929   2.38409  1.78025
Error    4.33      2.440    1.822
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayes' law
- Expectation Bayes' law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption & Merging
- Full CLICK
- Results & Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
[Figure: E. coli test data, hour 1 vs. hour 12 – genes GCSF, GMCSF, IL12B, IL1RN, IL6, PBEF, proIL1B, TNFA, IL8, IP10, MCP1, MGSA, MIP1A, MIP1B, MIP2A, MIP2B, RANTES, CD44, ICAM1, IFITM1, LAMB3, NINJ1, TNFAIP6, ADORA2A, CCR6, CCR7, CCRL2, and DTR, assigned to initial clusters 1–30.]
[Figure: "E-Coli Initial Clusters (Test Data)", hour 1 vs. hour 12 scatter plot of the genes above.]
[Figure: "E-Coli Final Clusters (Test Data)", hour 1 vs. hour 12 – the same genes grouped into clusters 1–9.]
[Figure: EHEC, hour 1 vs. hour 6 – rational vs. discrete clusters 0–9.]
[Figure: S. aureus, hour 1 vs. hour 6 – rational vs. discrete clusters 0–7.]
[Figure: E. coli, hour 1 vs. hour 6 – rational vs. discrete clusters 0–7.]
[Figure: S. typhimurium, hour 1 vs. hour 6 – rational vs. discrete clusters 0–12.]
Self-Organizing Maps
• SOMs are based on neural networks. They use a learning algorithm to train a randomly initialized map to resemble the input values, so similar inputs are mapped close to each other.
n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← V(random)
        BMU ← node of SOM minimizing d(v, node)
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
    end
• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. "Self-organizing maps in mining gene expression data." Information Sciences 139 (2001), 79–96.
• "Kohonen's Self Organizing Feature Maps." AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
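A minimal runnable version of this training loop, in plain Python; the grid size, iteration count, and decay constants are illustrative choices, not values from the project:

```python
import math
import random

def train_som(data, rows=3, cols=3, iters=500, sigma0=2.0, lr0=0.5, seed=0):
    """Kohonen SOM training sketch: pick a random input, find the
    best matching unit (BMU), and pull the BMU's neighborhood
    toward the input with time-decaying radius and learning rate."""
    rng = random.Random(seed)
    dim = len(data[0])
    som = {(r, c): [rng.random() for _ in range(dim)]
           for r in range(rows) for c in range(cols)}
    lam = iters / math.log(sigma0)          # decay constant
    for t in range(iters):
        v = rng.choice(data)
        bmu = min(som, key=lambda n: sum((a - b) ** 2
                                         for a, b in zip(som[n], v)))
        sigma = sigma0 * math.exp(-t / lam)  # shrinking neighborhood
        lr = lr0 * math.exp(-t / lam)        # decaying learning rate
        for node, w in som.items():
            d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
            if d2 <= sigma ** 2:             # inside the neighborhood
                theta = math.exp(-d2 / (2 * sigma ** 2))
                for i in range(dim):
                    w[i] += theta * lr * (v[i] - w[i])
    return som

def map_input(som, v):
    """Mapping mode: return the grid position of the BMU for v."""
    return min(som, key=lambda n: sum((a - b) ** 2
                                      for a, b in zip(som[n], v)))
```

After training, well-separated inputs land on different grid nodes, which is the behavior the mapping-mode slide below describes.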
Best Matching Unit
• The BMU is the node of the SOM whose weight vector has the minimum distance to the training vector. It is typically found using the Euclidean distance formula,
where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.
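The slide's formula image was not preserved; the standard Euclidean distance it refers to is:

```latex
d(v, w) \;=\; \sqrt{\sum_{i=1}^{k} (v_i - w_i)^2}
```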
Neighborhood
• The area of the neighborhood shrinks over time; an exponential decay function can be used for this,
where n is the number of iterations the algorithm will run and σ₀ is the initial size of the neighborhood.
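The decay function itself did not survive extraction; in Kohonen's standard formulation it is:

```latex
\sigma(t) \;=\; \sigma_0 \exp\!\left(-\frac{t}{\lambda}\right),
\qquad
\lambda \;=\; \frac{n}{\ln \sigma_0}
```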
Weight Adjustment
• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node's weight vector
• One decay function decreases the learning rate over time
• The other decreases the amount of learning for neighbors farther from the BMU
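The two decay functions described here combine in the standard Kohonen update (the slide's own formula image is lost, so this is the usual form, with d the grid distance from the node to the BMU):

```latex
W(t+1) \;=\; W(t) + \Theta(t)\, L(t)\,\bigl(V(t) - W(t)\bigr),
\qquad
L(t) = L_0\, e^{-t/\lambda},
\qquad
\Theta(t) = \exp\!\left(-\frac{d^2}{2\,\sigma(t)^2}\right)
```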
Mapping Mode (SOM)
• After the learning algorithm has run, the input is mapped onto the SOM; similar n-dimensional inputs land near each other on the two-dimensional map.
GenePattern
• http://www.broad.mit.edu/cancer/software/genepattern
• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
• Can be used online or downloaded
GenePattern
Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0.
Summary: The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped into the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g., stocks, mutual funds, spectral peaks, etc.).
Final Results
• http://my.fit.edu/~sellings/finalClusters.zip
• Arranged the input data into 9 clusters using the GenePattern software
• Why 9 clusters?
  – Average of Stephen's output (8), plus 1 to use a square map (3 x 3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
• Initialize Graph G = (V, E)
  – Vertex v: a single gene's "fingerprint" vector
  – Edge e: pairwise similarity between genes
  – Cut C: subset of E that partitions the graph
  – Cluster c: subset of V
  – Intersection of clusters c_i, c_j (i ≠ j) is ∅
  – Fingerprint of a cluster c: mean_vector(c)
Roded Sharan and Ron Shamir, "CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis," Proceedings of Intelligent Systems for Molecular Biology, 2000.
CLICK Preprocessing
• Input: n x p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure
  – S_vu = v · u = |v||u| cos θ
  – Proportional to cos θ only when the norm is fixed for all v ∈ V
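The normalization and dot-product similarity above can be sketched as follows (a plain-Python illustration, not the project's code):

```python
import math

def normalize(v):
    """Scale a fingerprint to unit norm so the dot product of two
    normalized vectors equals cos(theta)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity_matrix(fingerprints):
    """Pairwise similarities S_vu = v . u over unit-normalized
    fingerprint vectors."""
    unit = [normalize(v) for v in fingerprints]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit]
            for u in unit]
```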
CLICK Similarity
• Key idea: S follows a normalized mixture distribution
  – For u, v ∈ V in the same cluster: S_uv ~ f_T, a pdf with mean μ_T and variance σ_T²
  – For u, v ∈ V in different clusters: S_uv ~ f_F, a pdf with mean μ_F and variance σ_F²
Basic CLICK
• M: n x p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: weight reflecting the probability that v_i, v_j are mates

f(S_ij | mates) = 1/(√(2π) σ_T) · exp(−(S_ij − μ_T)² / (2σ_T²))

w_ij = ln[ p_mates · f(S_ij | mates) / ((1 − p_mates) · f(S_ij | non-mates)) ]
     = ln( p_mates σ_F / ((1 − p_mates) σ_T) ) + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
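The edge weight can be computed directly from the fitted Gaussian parameters; a sketch, with illustrative parameter names:

```python
import math

def edge_weight(s, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """CLICK edge weight: log-odds that a similarity value s came
    from the mates distribution N(mu_t, sigma_t^2) rather than the
    non-mates distribution N(mu_f, sigma_f^2), with prior p_mates."""
    return (math.log((p_mates * sigma_f) / ((1 - p_mates) * sigma_t))
            + (s - mu_f) ** 2 / (2 * sigma_f ** 2)
            - (s - mu_t) ** 2 / (2 * sigma_t ** 2))
```

Positive weights favor the mates hypothesis; at the midpoint between the two means (equal priors, equal variances) the weight is zero.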
Basic CLICK
R: Singleton Set
Basic_CLICK(Graph G):
    If V(G) = {v} then R.add(v)        // v is a singleton
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
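The recursion above, with the kernel test and the minimum-weight cut treated as assumed helper functions, looks like this:

```python
def basic_click(vertices, edges, is_kernel, min_weight_cut):
    """Recursive Basic-CLICK skeleton: singletons are set aside,
    kernels are reported, and everything else is split at a
    minimum weight cut and recursed on.  `is_kernel` and
    `min_weight_cut` are assumed helpers (the hypothesis test and
    the Stoer-Wagner cut)."""
    kernels, singletons = [], []

    def rec(vs):
        if len(vs) == 1:
            singletons.append(next(iter(vs)))
        elif is_kernel(vs, edges):
            kernels.append(set(vs))
        else:
            h, k = min_weight_cut(vs, edges)
            rec(h)
            rec(k)

    rec(set(vertices))
    return kernels, singletons
```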
CLICK Kernel
• Decision problem: is V…
  – a singleton (|V| = 1)?
  – a subset of 2+ clusters (needs further partitioning)?
  – a subset of a single cluster (a kernel)?
CLICK Kernel
• For every possible cut C of V:
  – H₀^C: cut C disconnects two clusters
  – H₁^C: cut C splits a kernel
  – If H₀^C is more likely than H₁^C for some cut C, then V is not a kernel
• If V is not a kernel, the graph should be partitioned into sub-graphs H, K

W(C) = ln( Pr(H₁^C | C) / Pr(H₀^C | C) ) = Σ_{(i,j) ∈ C} w_ij
-
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR
1 Hour
12 H
ours
E-Coli Initial Clusters (Test Data)
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Hour 1 vs Hour 12
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5 6
GCSF GMCSF IL12B IL1RN IL6 IL6 PBEF proIL1B TNFA IL8 IL8IP10 MCP1 MGSA MIP1A MIP1B MIP2A MIP2B RANTES CD44 CD44 ICAM1IFITM1 LAMB3 NINJ1 TNFAIP6 ADORA2A CCR6 CCR7 CCRL2 DTR Cluster 1 Cluster 2Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9
E-Coli Final Clusters (Test Data)
1 Hour
12 H
ours
EHEC
-6
-4
-2
0
2
4
6
8
-6 -4 -2 0 2 4 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 9 Rational Clusters Discrete Clusters
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node.
• One decay function decreases the learning rate over time.
• The other decreases the amount of learning for neighbors farther away from the BMU.
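A sketch of the two decay functions and the resulting update. The exponential forms and the λ = n / ln(σ₀) time constant follow the cited AI-Junkie tutorial; `lr0` is an illustrative parameter, and σ₀ is assumed greater than 1:

```python
import numpy as np

def neighborhood_radius(t, n_iters, sigma0):
    """Exponentially shrinking neighborhood size over iterations."""
    lam = n_iters / np.log(sigma0)   # time constant derived from sigma0 (> 1)
    return sigma0 * np.exp(-t / lam)

def update_weight(w, v, t, n_iters, dist_to_bmu, sigma0, lr0=0.5):
    """Move node weight w toward training vector v, scaled by two decays:
    a time-decaying learning rate and a distance-decaying neighborhood term."""
    lr = lr0 * np.exp(-t / n_iters)                       # decays with time
    sigma = neighborhood_radius(t, n_iters, sigma0)
    theta = np.exp(-dist_to_bmu ** 2 / (2 * sigma ** 2))  # decays with distance from BMU
    return w + lr * theta * (v - w)
```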
Mapping Mode (SOM)
• After the learning algorithm has been run on the map, the input is mapped onto the SOM; similar n-dimensional inputs land near each other on the two-dimensional map.
GenePattern
• http://www.broad.mit.edu/cancer/software/genepattern
• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
• Use it online or download it
GenePattern
Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0. Summary:
The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near to each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g., stocks, mutual funds, spectral peaks).
Final Results
• http://my.fit.edu/~sellings/finalClusters.zip
• Arranged the input data into 9 clusters using the GenePattern software
• Why 9 clusters? – Took the average of Stephen's output (8) + 1 so that a square map could be used (3 x 3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
• Initialize graph G = (V, E)
– Vertex v: a single gene "fingerprint" vector
– Edge e: pairwise similarity between genes
– Cut C: subset of E that partitions the graph
– Cluster c: subset of V
– The intersection of clusters c_i, c_j (i ≠ j) is ∅
– Fingerprint of a cluster c: mean_vector(c)
Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.
CLICK Preprocessing
• Input: n x p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure
– S_vu = v · u = |v||u| cos θ
– S_vu is proportional to cos θ only when the norm is fixed for all v ∈ V
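A sketch of the preprocessing and similarity computation. Here each fingerprint is mean-centered and scaled to unit norm, one common normalization; the slides do not fix a specific one:

```python
import numpy as np

def similarity_matrix(M):
    """Dot-product similarity between gene fingerprints (rows of M).

    Each row is normalized to zero mean and unit norm, so that
    S_vu = v . u equals cos(theta) for every pair of fingerprints.
    """
    X = M - M.mean(axis=1, keepdims=True)        # center each fingerprint
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # fix the norm for all v in V
    return X @ X.T                               # S[i, j] = v_i . v_j
```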
CLICK Similarity
• Key idea: S follows a normalized mixture of two distributions:

    For u, v ∈ V in the same cluster:    S_uv ~ f_T(x | μ_T, σ_T²)   (pdf f_T with mean μ_T, variance σ_T²)
    For u, v ∈ V in different clusters:  S_uv ~ f_F(x | μ_F, σ_F²)   (pdf f_F with mean μ_F, variance σ_F²)
Basic CLICK
• M: n x p matrix (genes vs. test conditions)
• S_ij: dot product of v_i and v_j
• w_ij: probability that v_i and v_j are mates
    f(S_ij | mates) = 1 / (√(2π) · σ_T) · e^(−(S_ij − μ_T)² / (2σ_T²))

    w_ij = ln ( p_mates · f(S_ij | mates) / ((1 − p_mates) · f(S_ij | non-mates)) )

         = ln ( p_mates · σ_F / ((1 − p_mates) · σ_T) ) + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
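A sketch of the mate-weight computation for a single similarity value, following the log-odds formula above with Gaussian pdfs for f_T and f_F:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def mate_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """Log-odds that genes i and j are mates, given their similarity s_ij:

        w_ij = ln( p * f_T(s) / ((1 - p) * f_F(s)) )
    """
    return math.log(p_mates * gaussian_pdf(s_ij, mu_t, sigma_t)
                    / ((1 - p_mates) * gaussian_pdf(s_ij, mu_f, sigma_f)))
```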
Basic CLICK
R: singleton set
Basic_CLICK(Graph G):
    if V(G) = {v} then          // v is a singleton
        R.add(v)
    else if G is a kernel then
        Output(V(G))
    else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
CLICK Kernel
• Decision problem: is V…
– a singleton (|V| = 1)?
– a subset of 2+ clusters (needs further partitioning)?
– a subset of a single cluster (a kernel)?
CLICK Kernel
• For all possible cuts C connecting V:
– H₀^C: cut C disconnects two clusters
– H₁^C: cut C partitions a kernel
– If Pr(H₀^C) > Pr(H₁^C) for any C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K
    W(C) = ln ( Pr(H₁^C | C) / Pr(H₀^C | C) ) = Σ_{(i,j) ∈ C} w_ij
Minimal Weight Cut
• Choose one vertex as the source; mark it visited
• Repeatedly mark the next most highly connected vertex
• The last node t represents the cut
– W(C_t) = Σ_i w_it
– Merge t with the second-to-last marked node s
– Remove t from V
– Repeat until |V| = 1
Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
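The steps above are the Stoer-Wagner algorithm from the cited paper; a compact sketch operating on a symmetric weight matrix (the list-of-lists representation is an illustrative choice):

```python
def min_weight_cut(w):
    """Stoer-Wagner global minimum weight cut on a symmetric,
    non-negative weight matrix `w` (list of lists).
    Returns (cut_weight, original vertices on one side of the best cut)."""
    n = len(w)
    w = [row[:] for row in w]           # work on a copy; rows get merged
    groups = [[i] for i in range(n)]    # original vertices merged into node i
    active = list(range(n))
    best_weight, best_side = float("inf"), []
    while len(active) > 1:
        # maximum-adjacency ordering from an arbitrary source
        a = [active[0]]
        conn = {u: w[active[0]][u] for u in active[1:]}
        last_weight = 0.0
        while conn:
            u = max(conn, key=conn.get)   # most connected to the marked set
            last_weight = conn.pop(u)
            a.append(u)
            for x in conn:
                conn[x] += w[u][x]
        s, t = a[-2], a[-1]
        # "cut of the phase": t alone vs. everything else, weight sum_i w[i][t]
        if last_weight < best_weight:
            best_weight, best_side = last_weight, groups[t][:]
        # merge t with the second-to-last marked node s, remove t
        for u in active:
            w[s][u] += w[t][u]
            w[u][s] = w[s][u]
        groups[s].extend(groups[t])
        active.remove(t)
    return best_weight, best_side
```

On a graph of two dense triangles joined by one light edge, the returned cut separates the triangles.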
MinWeightCut Example
MinWeightCut
CLICK Adoption & Merging
• Basic_CLICK produces kernels, not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
– In both situations, only merge/adopt above some similarity threshold
Full CLICK
R: singleton set
Full_CLICK(Graph G = (V, E)):
    R <- V
    while |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        let L be the list of kernels produced
        let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
Results & Analysis
• Two metrics:
– Similarity between our result clusters
– Similarity of our clusters to the paper's results
• Map clusters to find regions of similar data
• Compare best clusters to find relationships
[Scatter plots of expression at 1 hour vs. 6 hours comparing Bayesian, CLICK, and SOM cluster assignments, one panel per condition: BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. Typhi, S. Typhimurium]
Error Metric
• Scores higher for result clusters that contain either very many or very few members of the truth-set clusters
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated
• High error implies the truth clusters were well distributed
Error: Bayesian 4.33, CLICK 2.44, SOM 18.22
    E = ( Σ_{i=1..k} (1 − |1 − 2 · |T ∩ C_i| / |C_i||) · |C_i| ) / ( Σ_{i=1..k} |C_i| )

where k is the number of result clusters, C_i is the i-th result cluster, and T is the truth set.
[Results table, one row per cluster: percent of our clusters, percent of the paper's clusters, cluster quality, and minimum cluster composition, each for Bayesian, CLICK, and SOM; the bottom rows list the overall error of each method.]
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation: Non-spherical Clusters
- Compute initial probabilities
- Expectation: Bayes' law
- Expectation: Bayes' law
- Maximization: New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption & Merging
- Full CLICK
- Results & Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
[Scatter plots of expression at 1 hour vs. 6 hours showing cluster assignments (rational clusters vs. discrete clusters), one panel per condition: EHEC, S. aureus, E. coli, S. Typhimurium]
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
S Aureus
-6
-4
-2
0
2
4
6
8
-5 -4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
oura
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Rational Clusters Discrete Clusters
Ecoli
-4
-3
-2
-1
0
1
2
3
4
5
6
-2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Data Cluster 7 Rational Clusters Discrete Clusters
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayes' law
- Expectation Bayes' law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption & Merging
- Full CLICK
- Results & Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
[Scatter plot: E. coli expression at 1 hour (x-axis) vs. 6 hours (y-axis), showing Clusters 0-7, the data, and the rational vs. discrete clusterings.]
[Scatter plot: S. Typhimurium expression at 1 hour (x-axis) vs. 6 hours (y-axis), showing Clusters 0-7, 10, and 12, and the rational vs. discrete clusterings.]
Self-Organizing Maps
• SOMs are based on neural networks: a learning algorithm trains a randomly initialized map to resemble the input values, so similar inputs end up mapped close to each other.

n ← number of iterations for the training algorithm
V ← set of learning vectors (same dimension as the input)

learning_algorithm(n, V):
    SOM ← random values
    for j ← 1 to n:
        v ← random vector from V
        BMU ← node of SOM minimizing d(v, node)
        neighbors ← neighborhood(BMU)
        neighbors.weight ← neighbors.weight + adjustment
• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001) 79-96.
• Kohonen's Self-Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
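The training loop above can be sketched in Python. The grid size, decay schedules, and learning rate below are illustrative choices, not values from the project:

```python
import numpy as np

def train_som(data, grid=(3, 3), n_iter=500, lr0=0.5, seed=0):
    """Train a small SOM: nodes on a 2-D grid, weight vectors in input space."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # node coordinates on the grid, plus random initial weight vectors
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    weights = rng.random((rows * cols, data.shape[1]))
    sigma0 = max(rows, cols)                 # initial neighborhood radius
    lam = n_iter / np.log(sigma0)            # decay constant for the radius
    for t in range(n_iter):
        v = data[rng.integers(len(data))]            # random training vector
        bmu = np.argmin(((weights - v) ** 2).sum(axis=1))  # best matching unit
        sigma = sigma0 * np.exp(-t / lam)            # shrinking neighborhood
        lr = lr0 * np.exp(-t / n_iter)               # decaying learning rate
        d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)     # grid distance^2
        theta = np.exp(-d2 / (2 * sigma ** 2))       # neighborhood influence
        weights += (lr * theta)[:, None] * (v - weights)   # adjust weights
    return weights

# two well-separated input groups end up at different regions of the map
data = np.array([[0.0, 0.0], [0.0, 0.2], [5.0, 5.0], [5.0, 5.2]])
weights = train_som(data)
```

After training, inputs from the two groups resolve to different best-matching nodes, which is the "similar inputs map close together" property the slide describes.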
Best Matching Unit
• The BMU is the node of the SOM whose weight vector is nearest the training vector. It is typically found using the Euclidean distance:

d(v, w) = sqrt( Σ_{i=1..k} (v_i - w_i)² )

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.
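A minimal sketch of the BMU search with NumPy; the function name and node values are illustrative:

```python
import numpy as np

def best_matching_unit(v, weights):
    """Index of the node whose weight vector is closest to v (Euclidean)."""
    return int(np.argmin(np.linalg.norm(weights - v, axis=1)))

nodes = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
best_matching_unit(np.array([0.9, 1.2]), nodes)  # -> 1
```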
Neighborhood
• The area of the neighborhood shrinks over time; an exponential decay function can be used:

σ(t) = σ_0 · exp(-t / λ)

where n is the number of iterations the algorithm will run, σ_0 is the initial size of the neighborhood, and λ is a time constant (commonly n / ln σ_0).
Weight Adjustment
• The weight is adjusted by multiplying the difference between the training vector and the SOM node's weight by two decay functions:

W(t+1) = W(t) + Θ(t) · L(t) · (V(t) - W(t))

• One decay function, L(t), decreases the learning rate over time
• The other, Θ(t), decreases the rate of learning for neighbors farther from the BMU
Mapping Mode (SOM)
• After the learning algorithm has been run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs land near each other on the two-dimensional map.
GenePattern
• http://www.broad.mit.edu/cancer/software/genepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePattern
Module name: SOMClustering
Description: Self-Organizing Maps algorithm
Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu
Date: 10/28/03; Release: 1.0
Summary:
The Self-Organizing Map (SOM) is a clustering algorithm in which a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure of the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid in which similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g., stocks, mutual funds, spectral peaks, etc.).
Final Results
• http://my.fit.edu/~sellings/finalClusters.zip
bull Arranged input data into 9 clusters using GenePattern software
• Why 9 clusters?
 – Used the average of Stephen's output (8) + 1 to make a square map (3x3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
• Initialize graph G = (V, E)
 – Vertex v: a single gene's "fingerprint" vector
 – Edge e: pairwise similarity between genes
 – Cut C: subset of E that partitions the graph
 – Cluster c: subset of V
 – Clusters are disjoint: c_i ∩ c_j = ∅ for i ≠ j
 – Fingerprint of a cluster c: mean_vector(c)
Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology (ISMB), 2000.
CLICK Preprocessing
• Input: n x p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure:
 – S_vu = v·u = |v||u| cos θ
 – Proportional to cos θ only when the norm is fixed for all v ∈ V
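As a sketch of the fixed-norm idea: if every fingerprint is scaled to unit norm, the dot product S_vu equals cos θ directly. The mean-centering step here is an assumption for illustration, not something the slide specifies:

```python
import numpy as np

def similarity_matrix(M):
    """Normalize each row of an n x p expression matrix, then take dot products.

    After scaling each fingerprint to unit norm, S[i, j] is exactly
    cos(theta) between genes i and j.
    """
    X = M - M.mean(axis=1, keepdims=True)          # center each gene (assumption)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # fix the norm to 1
    return X @ X.T

# gene 1 is a scaled copy of gene 0; gene 2 runs in the opposite direction
M = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]])
S = similarity_matrix(M)
```

With the norms fixed, identically shaped profiles score 1 and opposite profiles score -1, regardless of magnitude.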
CLICK Similarity
• Key idea: S follows a normalized mixture distribution
For u, v ∈ V in the same cluster: S_uv ~ f(x | μ_T, σ_T²) (pdf for elements of the same cluster, mean μ_T, variance σ_T²)
For u, v ∈ V in different clusters: S_uv ~ f(x | μ_F, σ_F²) (pdf for elements of different clusters, mean μ_F, variance σ_F²)
Basic CLICK
• M: n x p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: weight reflecting the probability that v_i, v_j are mates (belong to the same cluster)
f(S_ij | μ_T, σ_T) = (1 / (√(2π)·σ_T)) · exp(-(S_ij - μ_T)² / (2σ_T²))

w_ij = ln[ p_mates · f(S_ij | μ_T, σ_T) / ((1 - p_mates) · f(S_ij | μ_F, σ_F)) ]

w_ij = ln( p_mates·σ_F / ((1 - p_mates)·σ_T) ) + (S_ij - μ_F)² / (2σ_F²) - (S_ij - μ_T)² / (2σ_T²)
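The weight w_ij can be computed directly from the two fitted normals; a small sketch, where the parameter values in the demo are made up:

```python
from math import exp, log, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Gaussian density f(x | mu, sigma)."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def mate_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """Log-odds that a pair with similarity s_ij are mates."""
    return log(p_mates * normal_pdf(s_ij, mu_t, sigma_t)
               / ((1 - p_mates) * normal_pdf(s_ij, mu_f, sigma_f)))

# a similarity near the mates mean gets a positive (edge-keeping) weight
w = mate_weight(0.8, p_mates=0.3, mu_t=0.8, sigma_t=0.1, mu_f=0.0, sigma_f=0.2)
```

Expanding the logarithm gives the closed form on this slide, so the two expressions agree term by term.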
Basic CLICK
R: singleton set
Basic_CLICK(Graph G):
    if V(G) = {v} then        (v is a singleton)
        R.add(v)
    else if G is a kernel then
        Output(V(G))
    else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
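A Python skeleton of the recursion, using networkx's Stoer-Wagner minimum cut. The kernel test here is a stand-in (a clique check) for CLICK's likelihood-ratio criterion, and all names and the demo graph are illustrative:

```python
import networkx as nx

def is_clique(G):
    """Stand-in kernel test: true for a complete graph on 2+ nodes."""
    n = G.number_of_nodes()
    return n > 1 and G.number_of_edges() == n * (n - 1) // 2

def basic_click(G, is_kernel, singletons, kernels):
    """Recursively split G by minimum-weight cuts until kernels remain."""
    if G.number_of_nodes() == 1:
        singletons.update(G.nodes)
    elif is_kernel(G):
        kernels.append(set(G.nodes))
    else:
        _, (H, K) = nx.stoer_wagner(G)   # minimum-weight cut of G
        basic_click(G.subgraph(H), is_kernel, singletons, kernels)
        basic_click(G.subgraph(K), is_kernel, singletons, kernels)

# two tight triangles joined by one weak edge split into two kernels
G = nx.Graph()
G.add_weighted_edges_from([(1, 2, 1.0), (2, 3, 1.0), (1, 3, 1.0),
                           (4, 5, 1.0), (5, 6, 1.0), (4, 6, 1.0),
                           (3, 4, 0.1)])
singletons, kernels = set(), []
basic_click(G, is_clique, singletons, kernels)
```

The cheap cut (weight 0.1) is severed first, and each remaining triangle passes the kernel test.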
CLICK Kernel
• Decision problem: is V…
 – a singleton (|V| = 1)?
 – a subset of 2+ clusters (needs further partitioning)?
 – a subset of a single cluster (a kernel)?
CLICK Kernel
• For every cut C disconnecting V:
 – H0^C: cut C disconnects two clusters
 – H1^C: cut C partitions a kernel
 – If H0^C is more likely than H1^C for some cut C, then V is not a kernel
• If V is not a kernel, the graph should be partitioned into sub-graphs H, K
W(C) = ln( Pr(H_1^C | C) / Pr(H_0^C | C) ) = Σ_{(i,j) ∈ C} w_ij
Minimal Weight Cut
• Choose one vertex as the source; mark it visited
• Repeatedly mark the most tightly connected unvisited vertex next
• The last node t marked represents the cut of the phase:
 – W(C_t) = Σ_i w_it
 – Merge t with the second-to-last marked node s
 – Remove t from V
 – Repeat until |V| = 1
Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
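The phase procedure above can be sketched as a self-contained Stoer-Wagner implementation; the data layout (edge weights keyed by frozenset pairs) and names are illustrative:

```python
def min_weight_cut(weights):
    """Stoer-Wagner minimum cut. weights: dict mapping frozenset({u, v}) -> w.
    Returns (value of the minimum cut, vertices on one side of it)."""
    adj = {}
    for e, w in weights.items():
        u, v = tuple(e)
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    groups = {v: {v} for v in adj}          # original vertices merged into v
    best_val, best_cut = float("inf"), set()
    while len(adj) > 1:
        # phase: repeatedly add the most tightly connected vertex to A
        start = next(iter(adj))
        A = [start]
        conn = dict(adj[start])             # connectivity of outside vertices to A
        while len(A) < len(adj):
            z = max((v for v in adj if v not in A),
                    key=lambda v: conn.get(v, 0.0))
            A.append(z)
            for v, w in adj[z].items():
                if v not in A:
                    conn[v] = conn.get(v, 0.0) + w
        s, t = A[-2], A[-1]
        if conn.get(t, 0.0) < best_val:     # cut-of-the-phase: t vs. the rest
            best_val, best_cut = conn.get(t, 0.0), set(groups[t])
        for v, w in adj.pop(t).items():     # merge t into s
            adj[v].pop(t, None)
            if v != s:
                adj[s][v] = adj[s].get(v, 0.0) + w
                adj[v][s] = adj[v].get(s, 0.0) + w
        groups[s] |= groups.pop(t)
    return best_val, best_cut

# two triangles joined by one weak edge: the minimum cut severs that edge
w = {frozenset(e): wt for e, wt in [((1, 2), 1.0), ((2, 3), 1.0), ((1, 3), 1.0),
                                    ((4, 5), 1.0), ((5, 6), 1.0), ((4, 6), 1.0),
                                    ((3, 4), 0.1)]}
val, cut = min_weight_cut(w)
```

Each phase ends with the cut separating the last-added node from the rest; the best cut-of-the-phase over all phases is the global minimum cut, exactly as on this slide.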
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
• Basic_CLICK produces kernels, not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
 – In both cases, only adopt/merge above some similarity threshold
Full CLICK
R: singleton set
Full_CLICK(Graph G = (V, E)):
    R <- V
    while |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        let L be the list of kernels produced
        let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
Results amp Analysis
• Two metrics:
 – similarity between our result clusters
 – similarity of our clusters to the paper's results
• Map clusters to find regions of similar data
• Compare the best clusters to find relationships
[Scatter plots comparing the Bayesian, CLICK, and SOM clusterings for each treatment, plotting expression at 1 hour (x-axis) vs. 6 hours (y-axis): BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. Typhi, S. Typhimurium.]
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
S Typhirium
-4
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 10 Cluster 12 Rational Clusters Discrete Clusters
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Self‐Organizing Mapsbull SOMs are based off of neural networks They use a learning algorithm to train a
random map to be like the input values Thus similar inputs will be mapped close to each other
nlarr number of iterations for training algorithmVlarr set of learning vectors (same dimension as input)
learning_algorith(n V)SOM larr random valuesfor j larr 1 to n
v larr V(random)BMU larr min(d(vSOM)) for every node in SOMneighbors larr neighborhood(BMU)neighborsweight= neighborsweight+adjustment
end
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Best Matching Unit
bull The BMU is the minimum distance between the training vector and all the nodes in the SOM It is typically found using the Euclidean distance formula
Where k is the length of the input vectors v is the training vector and w is the weight vector from the current node in the SOM
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
[Scatter plots of the clustering results, one panel per stimulus: BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. Typhi, and S. Typhimurium. Each panel plots 1-hour vs. 6-hour expression and overlays the Bayesian, CLICK, and SOM cluster assignments.]
Error Metric
• Scores higher for clusters that contain either very many or very few of the truth-set clusters
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated with ours
• High error implies the truth clusters were distributed across our clusters
        Bayesian   CLICK   SOM
Error   4.33       2.44    18.22
E = [ Σ_{i=1}^{k} ( 1/2 − | |T_i ∩ C_i| / |T_i| − 1/2 | ) · |C_i| ] / Σ_{i=1}^{k} |C_i|

where T_i is the paper's (truth) cluster paired with our cluster C_i.
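Assuming each result cluster C_i is paired with a truth cluster T_i and weighted by cluster size, one way to compute an overlap error in this spirit can be sketched as follows (a hypothetical reading, not the project's exact code):

```python
def cluster_error(truth, clusters):
    """Size-weighted overlap error between paired truth and result clusters.

    truth, clusters: equal-length lists of sets (T_i paired with C_i).
    A result cluster that lies almost entirely inside or outside its truth
    cluster contributes little error; a half-in/half-out split contributes most.
    """
    total = sum(len(c) for c in clusters)
    err = 0.0
    for t, c in zip(truth, clusters):
        frac = len(t & c) / len(t)                 # fraction of T_i recovered
        err += (0.5 - abs(frac - 0.5)) * len(c)    # peaks at frac = 1/2
    return err / total
```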
(Per-cluster table: Percent of Our Clusters, Percent of Paper, and Cluster Quality for each of Bayesian, CLICK, and SOM, clusters 0–34.)

Min Cluster Composition

%      Bayesian   CLICK    SOM
0        0.00      0.00     0.00
5       98.48     98.48    98.99
10      98.48     98.48    98.99
15      98.48     94.44    98.99
20      98.48     69.70    98.99
25      84.85     69.70    98.99
30      84.85     69.70    87.37
40      84.85      0.00    41.41
45      69.70      0.00    41.41
50      61.11      0.00    41.41
55      61.11      0.00    41.41
60      41.92      0.00    41.41
65      41.92      0.00    41.41
70      41.92      0.00    41.41
75      34.85      0.00    41.41
80      28.79      0.00    41.41
85      28.79      0.00    41.41
90      28.79      0.00     0.00
95      17.17      0.00     0.00
100    100.00    100.00   100.00

Error: 4.22929 (Bayesian), 2.38409 (CLICK), 17.8025 (SOM)
Error: 4.33 (Bayesian), 2.440 (CLICK), 18.22 (SOM)
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
Best Matching Unit
• The BMU is the node in the SOM whose weight vector has the minimum distance to the training vector. It is typically found using the Euclidean distance:

dist(v, w) = √( Σ_{i=1}^{k} (v_i − w_i)² )

where k is the length of the input vectors, v is the training vector, and w is the weight vector of the current node in the SOM.
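A minimal sketch of the BMU search with Euclidean distance, over a flat list of node weight vectors:

```python
import math

def best_matching_unit(training_vector, weight_vectors):
    """Return the index of the weight vector closest to the training vector."""
    def dist(w):
        return math.sqrt(sum((v - x) ** 2 for v, x in zip(training_vector, w)))
    return min(range(len(weight_vectors)), key=lambda i: dist(weight_vectors[i]))
```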
• Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1997.
• Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., and Ma, C. Self-organizing maps in mining gene expression data. Information Sciences 139 (2001), 79–96.
• Kohonen's Self-Organizing Feature Maps. AI-Junkie, March 11, 2009. http://www.ai-junkie.com/ann/som/som1.html
Neighborhood
• The area of the neighborhood shrinks over time; an exponential decay function can be used:

σ(t) = σ_0 · exp(−t / λ)

where σ_0 is the initial size of the neighborhood and λ is a time constant derived from n, the number of iterations the algorithm will run (e.g., λ = n / ln σ_0).
Weight Adjustment
• The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node's weight vector:

W(t+1) = W(t) + Θ(t) · L(t) · (V(t) − W(t))

• One decay function, L(t), decreases the learning rate over time
• The other, Θ(t), decreases the rate of learning for neighbors further away from the BMU
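The three pieces — BMU search, shrinking neighborhood, and decaying learning rate — can be combined into one illustrative training step. The decay constants and default parameters here are assumptions for the sketch, not the project's settings:

```python
import math

def som_step(nodes, positions, v, t, n_iter, sigma0=3.0, lr0=0.1):
    """One SOM update: find the BMU, then pull every node toward v.

    nodes: list of weight vectors (mutated in place);
    positions: 2-D grid coordinate per node; v: training vector;
    t: current iteration; n_iter: total iterations.
    """
    lam = n_iter / math.log(sigma0)               # time constant
    sigma = sigma0 * math.exp(-t / lam)           # neighborhood radius decay
    lr = lr0 * math.exp(-t / n_iter)              # learning-rate decay
    # BMU: node with minimum (squared) Euclidean distance to v
    bmu = min(range(len(nodes)),
              key=lambda i: sum((a - b) ** 2 for a, b in zip(nodes[i], v)))
    for i, w in enumerate(nodes):
        d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[bmu]))
        theta = math.exp(-d2 / (2 * sigma ** 2))  # distance-from-BMU decay
        nodes[i] = [a + theta * lr * (b - a) for a, b in zip(w, v)]
    return bmu
```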
Mapping Mode (SOM)
• After the learning algorithm is run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs will be near each other on the two-dimensional map
GenePattern
• http://www.broad.mit.edu/cancer/software/genepattern
• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
• Use online or download
GenePattern

Module name: SOMClustering. Description: Self-Organizing Maps algorithm. Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu. Date: 10/28/03. Release: 1.0.

Summary: The Self-Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other and provide an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, e.g., stocks, mutual funds, spectral peaks, etc.).
Final Results
• http://my.fit.edu/~sellings/finalClusters.zip
• Arranged input data into 9 clusters using the GenePattern software
• Why 9 clusters?
  – Used the average of Stephen's output (8) + 1 so a square map could be used (3×3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
• Initialize Graph G = (V, E)
  – Vertex v: a single gene "fingerprint" vector
  – Edge e: pairwise similarity between genes
  – Cut C: a subset of E that partitions the graph
  – Cluster c: a subset of V
  – The intersection of clusters c_i, c_j (i ≠ j) is ∅
  – Fingerprint of a cluster c: mean_vector(c)
Roded Sharan and Ron Shamir. "CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis." Proceedings of Intelligent Systems for Molecular Biology, 2000.
CLICK Preprocessing
• Input: n × p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure:
  – S_vu = v · u = |v||u| cos θ
  – Proportional to cos θ only when the norm is fixed for all v ∈ V
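A small sketch of the fingerprint similarity and the normalization it requires: once every vector is scaled to unit norm, the dot product equals cos θ.

```python
import math

def normalize(v):
    """Scale a fingerprint vector to unit norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity(v, u):
    """Dot product of two fingerprints (= cos theta when both have unit norm)."""
    return sum(a * b for a, b in zip(v, u))
```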
CLICK Similarity
• Key idea: S follows a normalized mixture distribution
  – For u, v ∈ V in the same cluster: mean μ_T, variance σ_T², pdf f(x | μ_T, σ_T²)
  – For u, v ∈ V in different clusters: mean μ_F, variance σ_F², pdf f(x | μ_F, σ_F²)
Basic CLICK
• M: n × p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates
f(S_ij | μ, σ²) = (1 / √(2πσ²)) · exp( −(S_ij − μ)² / (2σ²) )

w_ij = ln [ p_mates · f(S_ij | μ_T, σ_T²) / ((1 − p_mates) · f(S_ij | μ_F, σ_F²)) ]

     = ln [ p_mates σ_F / ((1 − p_mates) σ_T) ] + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
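A sketch of the mate-weight computation under the two-Gaussian model: a log-odds of "same cluster" vs. "different clusters". In practice the parameters come from fitting the similarity distribution; the values used in the usage check below are made up for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density f(x | mu, sigma^2)."""
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / math.sqrt(2 * math.pi * sigma ** 2))

def mate_weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """w_ij = ln[ p * f(S | mu_T, sigma_T) / ((1 - p) * f(S | mu_F, sigma_F)) ]."""
    return math.log(p_mates * gaussian_pdf(s_ij, mu_t, sigma_t)
                    / ((1 - p_mates) * gaussian_pdf(s_ij, mu_f, sigma_f)))
```

For symmetric parameters (p = 0.5, mirrored means, equal variances), a similarity midway between the two means gives weight 0, and a similarity at the "mates" mean gives a positive weight.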
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Neighborhood
bull The area of the neighborhood shrinks over time You can use the exponential decay function for this
Where n is the number of iterations that the algorithm will run and sigma_0 is the initial size of the neighborhood
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Weight Adjustment
bull The weight is adjusted by multiplying two decay functions by the difference between the training vector and the SOM node
bull One decay function decreases the learning variable by time
bull The other decay function decreases the rate of learning for neighbors further away from the BMU
bull Kohonen T Self‐Organizing Maps Springer Berlin 1997bull Torkkola K Gardner RM Kaysser‐Kranich T and Ma C Self‐organizing maps in mining gene expression data Information Sciences 139 (2001) 79‐96bull Kohonenrsquos Self organizing Feature Maps AI ndash Junkie March 11 2009 httpwwwai‐junkiecomannsomsom1html
Mapping Mode (SOM)
bull After the learning algorithm is ran on the map the input is mapped onto the SOM and similar n‐dimensional inputs will be near each other on the two‐dimensional map
GenePattern
bull httpwwwbroadmiteducancersoftwaregenepattern
bull GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
bull Use online or download
GenePatternModule name SOMClustering Description Self‐Organizing Maps algorithm Author Keith Ohm (Broad Institute) gp‐helpbroadmitedu Date 102803 Release 10 Summary
The Self Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset With the SOM the geometry of the grid is randomly chosen (eg a 3 x 2 grid) and mapped to the k‐dimensional gene expression space The mapping is then iteratively adjusted to reflect the natural structure of the data Resulting clusters are organized in a 2D grid where similar clusters lie near to each other and provide an automatic ldquoexecutiverdquo summary of the dataset This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data ie stocks mutual funds spectral peaks etc)
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation: Bayes' Law
- Expectation: Bayes' Law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
Mapping Mode (SOM)
• After the learning algorithm has been run on the map, the input is mapped onto the SOM, and similar n-dimensional inputs will lie near each other on the two-dimensional map
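A minimal sketch of mapping mode: each input is assigned to the grid position of its Best Matching Unit on an already-trained map. The weight values here are toy assumptions, not data from the project.

```python
import math

def bmu_position(x, weights):
    """Return the (row, col) grid position of the Best Matching Unit:
    the node whose weight vector is nearest to input x in Euclidean distance."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = math.dist(x, w)
            if d < best_dist:
                best, best_dist = (r, c), d
    return best

def map_to_som(inputs, weights):
    """Mapping mode: project each n-dimensional input onto the trained 2D map."""
    return [bmu_position(x, weights) for x in inputs]

# Toy example: a trained 2x2 map over 3-dimensional inputs
weights = [[[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
           [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0]]]
print(map_to_som([[0.9, 0.1, 0.1], [0.1, 0.1, 0.9]], weights))
# -> [(1, 0), (0, 1)]: each input lands on the node it most resembles
```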
GenePattern
• http://www.broad.mit.edu/cancer/software/genepattern
• GenePattern combines a powerful scientific workflow platform with more than 100 genomic analysis tools
• Use online or download
GenePattern
Module name: SOMClustering
Description: Self-Organizing Maps algorithm
Author: Keith Ohm (Broad Institute), gp-help@broad.mit.edu
Date: 10/28/03
Release: 1.0
Summary:
The Self-Organizing Map (SOM) is a clustering algorithm where a grid of 2D nodes (clusters) is iteratively adjusted to reflect the global structure in the expression dataset. With the SOM, the geometry of the grid is randomly chosen (e.g., a 3 x 2 grid) and mapped to the k-dimensional gene expression space. The mapping is then iteratively adjusted to reflect the natural structure of the data. The resulting clusters are organized in a 2D grid where similar clusters lie near each other, providing an automatic "executive" summary of the dataset. This module is a standard implementation of the SOM algorithm that can be used to cluster genes or samples (or just about any data, i.e., stocks, mutual funds, spectral peaks, etc.).
Final Results
• http://my.fit.edu/~sellings/finalClusters.zip
• Arranged the input data into 9 clusters using the GenePattern software
• Why 9 clusters?
– Used the average of Stephen's output (8) + 1 to get a square map (3x3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
• Initialize Graph G = (V, E)
– Vertex v: a single gene "fingerprint" vector
– Edge e: pairwise similarity between genes
– Cut C: a subset of E that partitions the graph
– Cluster c: a subset of V
– Intersection of clusters c_i, c_j (i ≠ j) is ∅
– Fingerprint of a cluster c: mean_vector(c)
Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.
CLICK Preprocessing
• Input: an n x p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure
– S(v,u) = v·u = |v||u| cos θ
– Proportional to cos θ only when the norm is fixed for all v ∈ V
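A small sketch of the similarity measure under fixed (unit) norms, where the dot product reduces to cos θ alone; the vectors are illustrative toy fingerprints:

```python
import math

def similarity(v, u):
    """CLICK similarity: the dot product, which equals |v||u| cos(theta)."""
    return sum(a * b for a, b in zip(v, u))

def normalize(v):
    """Fix a fingerprint to unit norm, so dot products compare angles only."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

v = normalize([3.0, 4.0])
u = normalize([4.0, 3.0])
print(similarity(v, u))  # cos(theta) between the two fingerprints (about 0.96)
```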
CLICK Similarity
• Key idea: S is a normalized mixture of two distributions
– For u, v ∈ V in the same cluster: mean μ_T, variance σ_T², pdf f_T(x | μ_T, σ_T²)
– For u, v ∈ V in different clusters: mean μ_F, variance σ_F², pdf f_F(x | μ_F, σ_F²)
Basic CLICK
• M: n x p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: probability that v_i, v_j are mates
f(S_ij | mates) = (1 / (√(2π) σ_T)) · exp(−(S_ij − μ_T)² / (2σ_T²))

w_ij = ln [ p_mates · f(S_ij | mates) / ((1 − p_mates) · f(S_ij | non-mates)) ]

w_ij = ln ( p_mates σ_F / ((1 − p_mates) σ_T) ) + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
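The mate weight has a closed form for normal pdfs; a hedged sketch using hypothetical mixture parameters (the μ, σ, and p values below are assumptions for illustration, not estimates from the paper's data):

```python
import math

def edge_weight(S_ij, p_mates, mu_T, sigma_T, mu_F, sigma_F):
    """Log-likelihood-ratio weight that genes i and j are mates:
    ln[p * f_T(S_ij) / ((1 - p) * f_F(S_ij))] expanded for normal pdfs."""
    return (math.log(p_mates * sigma_F / ((1.0 - p_mates) * sigma_T))
            + (S_ij - mu_F) ** 2 / (2.0 * sigma_F ** 2)
            - (S_ij - mu_T) ** 2 / (2.0 * sigma_T ** 2))

# Hypothetical parameters: mate similarities centered at 0.8, non-mates at 0.0
w = edge_weight(S_ij=0.7, p_mates=0.2, mu_T=0.8, sigma_T=0.2,
                mu_F=0.0, sigma_F=0.3)
print(w)  # positive: this similarity favors the "mates" hypothesis
```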
Basic CLICK
R: Singleton Set
Basic_CLICK(Graph G):
    If V(G) = {v} then        // v is a singleton
        R.add(v)
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
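A runnable skeleton of this recursion, with toy stand-ins for the kernel test and the cut (both are assumptions purely for illustration; the real tests come from the CLICK statistics):

```python
def basic_click(vertices, is_kernel, min_weight_cut, singletons, kernels):
    """Skeleton of Basic_CLICK: recursively split the vertex set until each
    part is a singleton or a kernel. `is_kernel` and `min_weight_cut` are
    callables supplied by the full algorithm."""
    if len(vertices) == 1:
        singletons.add(next(iter(vertices)))   # v is a singleton
    elif is_kernel(vertices):
        kernels.append(set(vertices))          # report a kernel
    else:
        H, K = min_weight_cut(vertices)
        basic_click(H, is_kernel, min_weight_cut, singletons, kernels)
        basic_click(K, is_kernel, min_weight_cut, singletons, kernels)

# Toy stand-ins: a set of >= 2 consecutive ids counts as a "kernel",
# and the "cut" just splits a set into sorted halves
R, L = set(), []
is_kernel = lambda vs: len(vs) >= 2 and max(vs) - min(vs) == len(vs) - 1
cut = lambda vs: (set(sorted(vs)[:len(vs) // 2]), set(sorted(vs)[len(vs) // 2:]))
basic_click({1, 2, 3, 10, 20}, is_kernel, cut, R, L)
print(R, L)  # singletons and kernels found by the recursion
```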
CLICK Kernel
• Decision problem: is V…
– a singleton (|V| = 1)?
– a subset of 2+ clusters (need to partition more)?
– a subset of a single cluster (a kernel)?
CLICK Kernel
• For every possible cut C connecting V:
– H0^C: cut C disconnects two clusters
– H1^C: cut C partitions a kernel
– If Pr(H0^C) > Pr(H1^C) for any C, then V is not a kernel
• If V is not a kernel, then the graph should be partitioned into sub-graphs H, K

W(C) = ln ( Pr(H1^C | C) / Pr(H0^C | C) ) = Σ_{(i,j) ∈ C} w_ij
Minimal Weight Cut
• Choose one vertex as the source; mark it visited
• Repeatedly mark the next most highly connected vertex
• The last node t marked represents the cut
– W(C_t) = Σ_i w_it
– Merge t with the 2nd-to-last marked node s
– Remove t from V
– Repeat until |V| = 1
Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
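The steps above are the Stoer-Wagner algorithm; a compact sketch on a symmetric weight matrix (the toy graph below is an assumption for illustration):

```python
def min_weight_cut(weights):
    """Stoer-Wagner minimum-weight cut on an undirected graph given as an
    n x n symmetric weight matrix. Returns (cut_weight, one side of the cut)."""
    n = len(weights)
    w = [row[:] for row in weights]      # working copy; nodes get merged
    groups = [[i] for i in range(n)]     # original vertices behind each node
    active = list(range(n))
    best_weight, best_cut = float("inf"), []
    while len(active) > 1:
        # maximum-adjacency ordering from an arbitrary start vertex
        order, rest = [active[0]], active[1:]
        conn = {v: w[order[0]][v] for v in rest}
        while rest:
            v = max(rest, key=lambda u: conn[u])  # most connected next vertex
            rest.remove(v)
            order.append(v)
            for u in rest:
                conn[u] += w[v][u]
        s, t = order[-2], order[-1]
        cut_of_phase = sum(w[t][u] for u in active if u != t)
        if cut_of_phase < best_weight:
            best_weight, best_cut = cut_of_phase, list(groups[t])
        for u in active:                 # merge t into the 2nd-last node s
            if u not in (s, t):
                w[s][u] += w[t][u]
                w[u][s] = w[s][u]
        groups[s].extend(groups[t])
        active.remove(t)                 # repeat until one node remains
    return best_weight, best_cut

# Two dense triangles {0,1,2} and {3,4,5} joined by one light edge (2-3)
W = [[0, 3, 3, 0, 0, 0],
     [3, 0, 3, 0, 0, 0],
     [3, 3, 0, 1, 0, 0],
     [0, 0, 1, 0, 3, 3],
     [0, 0, 0, 3, 0, 3],
     [0, 0, 0, 3, 3, 0]]
weight, side = min_weight_cut(W)
print(weight, sorted(side))  # the min cut separates the two triangles
```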
MinWeightCut Example
MinWeightCut
CLICK Adoption & Merging
• Basic_CLICK produces kernels – not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
– In both cases, only merge/adopt above some similarity threshold
Full CLICK
R: Singleton Set
Full_CLICK(Graph G = (V, E)):
    R <- V
    While |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        Let L be the list of kernels produced
        Let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
Results & Analysis
• Two metrics:
– Similarity between result clusters
– Similarity of clusters with the paper's results
• Map clusters to find regions of similar data
• Compare the best clusters to find relationships
[Scatter plots, one per experimental condition (BCG, E. coli, EHEC, L. monocytogenes, Latex, M. tuberculosis, S. aureus, S. Typhi, S. Typhimurium), plotting expression at 1 Hour (x-axis) against 6 Hours (y-axis) for the Bayesian, Click, and SOM clusterings.]
Error Metric
• Scores higher for clusters that contain either very many or very few of a truth-set cluster's members
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated with our clusters
• High error implies the truth clusters were distributed across our clusters

Error: Bayesian 4.33, Click 2.44, SOM 18.22
E = [ Σ_{i=1}^{k} ( 1/2 − | 1/2 − |T_i ∩ C_i| / |C_i| | ) · |C_i| ] / Σ_{i=1}^{k} |C_i|
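A sketch of a split-penalty error of this shape, assuming T_i is the truth cluster that best overlaps C_i; the pairing rule and the toy clusterings below are illustrative assumptions:

```python
def cluster_error(truth, ours):
    """Split-penalty error: for each of our clusters C, penalize overlap
    fractions |T ∩ C|/|C| near 1/2 (a truth cluster split in half), weight
    by |C|, and normalize by the total number of points. `truth` and `ours`
    are lists of sets of gene ids."""
    num, den = 0.0, 0
    for C in ours:
        if not C:
            continue
        T = max(truth, key=lambda t: len(t & C))   # best-matching truth cluster
        frac = len(T & C) / len(C)
        num += (0.5 - abs(0.5 - frac)) * len(C)    # 0 when frac is 0 or 1
        den += len(C)
    return num / den

truth = [{1, 2, 3, 4}, {5, 6, 7, 8}]
perfect = [{1, 2, 3, 4}, {5, 6, 7, 8}]
split = [{1, 2, 5, 6}, {3, 4, 7, 8}]
print(cluster_error(truth, perfect))  # 0.0: clusters match the truth exactly
print(cluster_error(truth, split))    # 0.5: every truth cluster split in half
```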
[Table: for each cluster 0–34, the Percent of Our Clusters, Percent of Paper Cluster, and Cluster Quality under the Bayesian, Click, and SOM methods, alongside the Min Cluster Composition percentages at thresholds from 0 to 100 in steps of 5.]

Error: Bayesian 4.22929, Click 2.38409, SOM 17.8025
Error: Bayesian 4.33, Click 2.440, SOM 18.22
Final Results
bull httpmyfitedu~sellingsfinalClusterszip
bull Arranged input data into 9 clusters using GenePattern software
bull Why 9 clustersndash Used average of Stephenrsquos output (8) + 1 to use a square map (3X3 SOM)
CLuster Identification via Connectivity Kernels (CLICK)
bull Initialize Graph G=(VE)ndash Vertex v single gene ldquofingerprintrdquo vector
ndash Edge e pairwise similarity between genes
ndash Cut C subset of E that partitions the graph
ndash Cluster c subset of V
ndash Intersection of clusters ci cj inej is Oslash
ndash Fingerprint of a cluster c mean_vector(c)
Roded Sharan and Ron Shamir CLICK A Clustering Algorithm with Applications to Gene Expression Analysis Proceedings of the Intelligent Systems for Molecular Biology 2000
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
[Per-cluster results table for clusters 0–34: "Percent of Our Clusters", "Percent of Paper Cluster", and cluster "Quality" under the Bayesian, CLICK, and SOM algorithms.]

Min Cluster Composition (%):
Threshold  Bayesian  CLICK   SOM
0          0.00      0.00    0.00
5          98.48     98.48   98.99
10         98.48     98.48   98.99
15         98.48     94.44   98.99
20         98.48     69.70   98.99
25         84.85     69.70   98.99
30         84.85     69.70   87.37
40         84.85     0.00    41.41
45         69.70     0.00    41.41
50         61.11     0.00    41.41
55         61.11     0.00    41.41
60         41.92     0.00    41.41
65         41.92     0.00    41.41
70         41.92     0.00    41.41
75         34.85     0.00    41.41
80         28.79     0.00    41.41
85         28.79     0.00    41.41
90         28.79     0.00    0.00
95         17.17     0.00    0.00
100        100.00    100.00  100.00

Error:  4.22929   2.38409   17.8025
Error:  4.33      2.44      18.22
- Cluster Analysis for Gene Expression Data: CSE 5615 Class Project, Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation: Non-spherical Clusters
- Compute initial probabilities
- Expectation: Bayes' law
- Expectation: Bayes' law
- Maximization: New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption & Merging
- Full CLICK
- Results & Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
CLuster Identification via Connectivity Kernels (CLICK)
• Initialize graph G = (V, E)
– Vertex v: a single gene "fingerprint" vector
– Edge e: pairwise similarity between genes
– Cut C: a subset of E that partitions the graph
– Cluster c: a subset of V
– The intersection of clusters c_i, c_j (i ≠ j) is ∅
– Fingerprint of a cluster c: mean_vector(c)
Roded Sharan and Ron Shamir. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings of Intelligent Systems for Molecular Biology, 2000.
CLICK Preprocessing
• Input: n × p matrix M of values
• n genes, p tests
• Data must be normalized
• Similarity measure:
– S_vu = v · u = |v||u| cos θ
– Proportional to similarity only when the norm is fixed for all v ∈ V
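As a concrete sketch of this step (the function name and the centering choice are assumptions, not from the slides), the fingerprints can be standardized so that the dot product reduces to cos θ:

```python
import numpy as np

def similarity_matrix(M):
    """Pairwise dot-product similarities between gene fingerprints.

    M: n x p matrix (n genes, p tests). Each row is standardized to
    mean 0 and norm 1, so S[v, u] = cos(theta) between genes v and u;
    fixing the norm makes the dot product a valid similarity for all pairs.
    """
    M = np.asarray(M, dtype=float)
    M = M - M.mean(axis=1, keepdims=True)             # center each fingerprint
    M = M / np.linalg.norm(M, axis=1, keepdims=True)  # fix every norm to 1
    return M @ M.T                                    # S[v, u] = v . u

S = similarity_matrix([[1.0, 2.0, 3.0],    # gene rising over time
                       [2.0, 4.0, 6.0],    # same pattern, scaled
                       [3.0, 2.0, 1.0]])   # opposite pattern
```

With the norms fixed, identical expression patterns get similarity 1 and opposite patterns get −1, regardless of their raw magnitudes.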
CLICK Similarity
• Key idea: S follows a normalized mixture distribution
– For u, v ∈ V in the same cluster: mean μ_T, variance σ_T², pdf f(x | μ_T, σ_T²)
– For u, v ∈ V in different clusters: mean μ_F, variance σ_F², pdf f(x | μ_F, σ_F²)
Basic CLICK
• M: n × p matrix (genes vs. test conditions)
• S_ij: dot product of v_i, v_j
• w_ij: log-odds that v_i, v_j are mates

f(S_ij | μ, σ) = (1 / (√(2π) σ)) · e^(−(S_ij − μ)² / (2σ²))

w_ij = ln [ p_mates · f(S_ij | μ_T, σ_T) / ((1 − p_mates) · f(S_ij | μ_F, σ_F)) ]

w_ij = ln( p_mates σ_F / ((1 − p_mates) σ_T) ) + (S_ij − μ_F)² / (2σ_F²) − (S_ij − μ_T)² / (2σ_T²)
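The edge weight w_ij can be sketched in a few lines (parameter names are assumed); computing the log-odds directly from the Gaussian densities and comparing against the algebraically expanded form is a quick sanity check that the two expressions agree:

```python
import math

def normal_pdf(x, mu, sigma):
    """f(x | mu, sigma): Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def weight(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """w_ij as the log-odds that an edge with similarity s_ij joins mates."""
    return math.log(p_mates * normal_pdf(s_ij, mu_t, sigma_t)
                    / ((1 - p_mates) * normal_pdf(s_ij, mu_f, sigma_f)))

def weight_expanded(s_ij, p_mates, mu_t, sigma_t, mu_f, sigma_f):
    """The same log-odds after expanding the Gaussian densities."""
    return (math.log(p_mates * sigma_f / ((1 - p_mates) * sigma_t))
            + (s_ij - mu_f) ** 2 / (2 * sigma_f ** 2)
            - (s_ij - mu_t) ** 2 / (2 * sigma_t ** 2))
```

Similarities near μ_T yield large positive weights (likely mates); similarities near μ_F yield negative weights.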
Basic CLICK
R: Singleton Set
Basic_CLICK(Graph G):
    If V(G) = {v} then    // v is a singleton
        R.add(v)
    Else if G is a kernel then
        Output(V(G))
    Else
        (H, K) <- MinWeightCut(G)
        Basic_CLICK(H)
        Basic_CLICK(K)
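The recursion above can be made executable as a small skeleton. This is a sketch, not the paper's implementation: `is_kernel` and the min-weight cut are injected as stand-ins since the slides define them separately, and the toy graph and helper names are hypothetical.

```python
def basic_click(vertices, is_kernel, min_weight_cut, kernels, singletons):
    """Recursive Basic_CLICK skeleton.

    vertices: set of vertex ids in the current subgraph.
    is_kernel(vs) and min_weight_cut(vs) are supplied by the caller;
    kernels (list of sets) and singletons (set) collect the output.
    """
    if len(vertices) == 1:
        singletons.update(vertices)       # R.add(v)
    elif is_kernel(vertices):
        kernels.append(set(vertices))     # Output(V(G))
    else:
        h, k = min_weight_cut(vertices)   # (H, K) <- MinWeightCut(G)
        basic_click(h, is_kernel, min_weight_cut, kernels, singletons)
        basic_click(k, is_kernel, min_weight_cut, kernels, singletons)

# toy demo: two tight pairs with negative similarity between the pairs
w = {frozenset(e): s for e, s in [((0, 1), 2.0), ((2, 3), 2.0),
                                  ((0, 2), -1.0), ((0, 3), -1.0),
                                  ((1, 2), -1.0), ((1, 3), -1.0)]}

def all_positive(vs):
    # stand-in kernel test: every internal edge has positive weight
    return all(w[frozenset((a, b))] > 0 for a in vs for b in vs if a < b)

def split_low(vs):
    # stand-in cut for the demo: peel off the two lowest-numbered vertices
    h = set(sorted(vs)[:2])
    return h, set(vs) - h

kernels, singletons = [], set()
basic_click({0, 1, 2, 3}, all_positive, split_low, kernels, singletons)
```

On the toy graph the recursion cuts once and reports the two tight pairs as kernels, leaving no singletons.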
CLICK Kernel
• Decision problem: is V…
– a singleton (|V| = 1)?
– a subset of 2+ clusters (needs further partitioning)?
– a subset of a single cluster (a kernel)?
CLICK Kernel
• For every possible cut C disconnecting V:
– H0^C: cut C disconnects two clusters
– H1^C: cut C partitions a kernel
– If Pr(H0^C) > Pr(H1^C) for any C, then V is not a kernel
• If V is not a kernel, the graph should be partitioned into sub-graphs H, K

W(C) = ln( Pr(H1^C | C) / Pr(H0^C | C) ) = Σ_{(i,j) ∈ C} w_ij
Minimal Weight Cut
• Choose one vertex as the source; mark it visited
• Repeatedly mark the most highly connected unvisited vertex
• The last node marked, t, defines the cut
– W(C_t) = Σ_i w_it
– Merge t with the second-to-last marked node, s
– Remove t from V
– Repeat until |V| = 1
Mechthild Stoer and Frank Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, Vol. 44, No. 4, July 1997.
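The Stoer–Wagner procedure outlined above fits in a compact adjacency-matrix sketch (the interface is an assumption); each phase runs a maximum-adjacency search, records the cut-of-the-phase, and merges the last two vertices:

```python
def stoer_wagner(w):
    """Global minimum weight cut of an undirected graph.

    w: symmetric n x n matrix of edge weights (0 = no edge).
    Returns (cut_weight, one_side_of_the_cut).
    """
    n = len(w)
    w = [row[:] for row in w]              # working copy; rows get merged
    active = list(range(n))
    group = {i: [i] for i in range(n)}     # original vertices merged into i
    best_weight, best_side = float("inf"), None
    while len(active) > 1:
        # minimum-cut phase: maximum-adjacency search from active[0]
        a, rest = active[0], active[1:]
        order = [a]
        conn = {v: w[a][v] for v in rest}
        while rest:
            t = max(rest, key=conn.get)    # most tightly connected vertex
            rest.remove(t)
            order.append(t)
            for v in rest:
                conn[v] += w[t][v]
        s, t = order[-2], order[-1]
        if conn[t] < best_weight:          # cut-of-the-phase: t vs. the rest
            best_weight, best_side = conn[t], list(group[t])
        group[s].extend(group[t])          # merge t into s
        for v in active:
            if v not in (s, t):
                w[s][v] += w[t][v]
                w[v][s] = w[s][v]
        active.remove(t)
    return best_weight, best_side

cut_weight, side = stoer_wagner([[0, 3, 1, 0],
                                 [3, 0, 1, 2],
                                 [1, 1, 0, 4],
                                 [0, 2, 4, 0]])
```

On the 4-vertex demo graph the minimum cut has weight 4, separating vertices {2, 3} from {0, 1}.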
MinWeightCut Example
MinWeightCut
CLICK Adoption & Merging
• Basic_CLICK produces kernels, not full clusters
• Expand kernels by adopting the closest singletons
• Merge kernels with the closest similarity
– In both cases, only merge/adopt above some threshold
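The adoption step can be sketched as follows (the function name, the mean-vector kernel fingerprint, and the dot-product similarity are assumptions drawn from the earlier slides): each remaining singleton joins the kernel whose fingerprint it matches best, provided the similarity clears the threshold.

```python
import numpy as np

def adopt(kernels, singletons, fingerprints, threshold):
    """Attach each remaining singleton to its best-matching kernel.

    kernels: list of sets of vertex ids; singletons: set of vertex ids;
    fingerprints: dict id -> expression vector; threshold: minimum
    similarity required before a singleton is adopted.
    """
    changed = True
    while changed:
        changed = False
        for s in list(singletons):
            best, best_sim = None, threshold
            for kernel in kernels:
                # kernel fingerprint = mean vector of its members
                centroid = np.mean([fingerprints[v] for v in kernel], axis=0)
                sim = float(np.dot(fingerprints[s], centroid))
                if sim > best_sim:
                    best, best_sim = kernel, sim
            if best is not None:
                best.add(s)
                singletons.remove(s)
                changed = True
    return kernels, singletons

# toy demo: genes 0-2 share a pattern; gene 3 matches nothing strongly
fp = {0: np.array([1.0, 0.0]), 1: np.array([1.0, 0.0]),
      2: np.array([1.0, 0.0]), 3: np.array([0.01, 0.0])}
kernels, leftovers = adopt([{0, 1}], {2, 3}, fp, threshold=0.5)
```

Gene 2 clears the threshold and is adopted; gene 3 stays a singleton, matching the rule that adoption only happens above the cutoff.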
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
CLICK Preprocessing
bull Input n x p matrix M of values
bull n Genes p Tests
bull Data must be normalized
bull Similarity measurendash Svu=vu = |v||u| cos θndash Proportional only when norm is fixed for all vєV
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
CLICK Similarity
bull Key idea S is normalized mixed distribution
clustersdifferent in elementsor f pdf)|( variance mean For
cluster samein elementsor f pdf)|( variance mean For
2
2
FF
FF
TT
TT
xfVvu
xfVvu
σμσμ
σμσμ
==isin
==isin
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Basic CLICK
bull M n x p matrix (genes vs test conditions)
bull Sij dot product of vi vjbull wij probability that vi vj are mates
22
2222
2
)(1
2)()(
)1()ln
)|()1()|(
ln
)2()|( 2
2
TF
TijFFijT
Tmates
Fmatesij
FFijmates
TTijmatesij
S
ij
SSp
pw
SfpSfp
w
eSfij
σσμσμσ
σσ
σμσμ
πσσμ σ
μ
minusminusminus+⎟⎟⎠
⎞⎜⎜⎝
⎛minus
=
⎟⎟⎠
⎞⎜⎜⎝
⎛
minus=
=minusminus
minus
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Basic CLICKR Singleton SetBasic_CLICK(Graph G) v is a singletonIf V(G)=v then Radd(v)Else if G is a kernel then
Output(V(G))Else
(HK)lt-MinWeightCut(G)Basic_CLICK(H)Basic_CLICK(K)
CLICK Kernel
bull Decision problem is Vhellipndash a singleton (|V| = 1)
ndash a subset of 2+ clusters (need to partition more)
ndash a subset of a single cluster (kernel)
CLICK Kernel
bull For all possible cuts C connecting Vndash H0
C Cut C disconnects two clusters
ndash H1C Cut C partitions a kernel
ndash If H0C gt H1
C for any C then V is not a kernelbull If V is not a kernel then the graph should be partitioned into sub‐graphs H K
sum=⎟⎟⎠
⎞⎜⎜⎝
⎛=
inCjijiC
C
wCHCHCW
)(
0
1
)|Pr()|Pr(ln)(
Minimal Weight Cut
bull Choose one vertex as source mark visited
bull Mark each next highly connected vertex
bull Last node t represents the cutndash W(Ct) = sumwit
ndash Merge t with 2nd last marked node s
ndash Remove t from V
ndash Repeat until |V| = 1
Mechthild Stoer and Frank Wagner A Simple Min‐Cut Algorithm Journal of the ACM Vol 44 No 4 July 1997
MinWeightCut Example
MinWeightCut
CLICK Adoption amp Merging
bull Basic_CLICK kernels ndash not full clusters
bull Expand kernels by adding closest singletons
bull Merge kernels with closest similarityndash In both situations only mergeadopt over some threshold
Full CLICKR Singleton SetFull_CLICK(Graph G = (VE)) R lt- VWhile |R| is reduced
Basic_CLICK(G = (RER))Let L be the list of Kernels producedLet R be the set of Singletons producedAdopt(L R)
Merge(L)Adopt(L R)
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
• The score is higher for clusters that contain either very many or very few of the truth-set clusters
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated with our clusters
• High error implies the truth clusters were distributed across many of our clusters
        Bayesian   CLICK   SOM
Error     4.33      2.44   1.822

E = \frac{\sum_{i=1}^{k}\left[\left(\frac{1}{2} - \left|\frac{|T_i \cap C_i|}{|C_i|} - \frac{1}{2}\right|\right) \cdot |C_i|\right]}{\sum_{i=1}^{k} |C_i|}
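Under one possible reading of the slide's (partially garbled) formula – a per-cluster penalty that peaks when a truth cluster is split half-and-half across our clusters, weighted by cluster size – the metric could be sketched as follows; this reading is an assumption:

```python
def error_metric(truth, clusters):
    """Hedged sketch of the slide's error metric (assumed reading):
    E = sum_i (1/2 - | |T_i ∩ C_i| / |C_i| - 1/2 |) * |C_i|, normalized
    by sum_i |C_i|.  Zero when every cluster is all-in or all-out of its
    truth cluster; maximal (0.5) for half-in/half-out splits."""
    num = den = 0.0
    for t, c in zip(truth, clusters):
        frac = len(t & c) / len(c)           # fraction of C_i inside T_i
        num += (0.5 - abs(frac - 0.5)) * len(c)
        den += len(c)
    return num / den
```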
[Results table, per cluster 0–34 and per algorithm (Bayesian, CLICK, SOM): percent of our clusters matched, percent of the paper's clusters matched, and a cluster-quality score; plus the minimum cluster composition at match thresholds from 0% to 100%. Overall error: Bayesian 4.229, CLICK 2.384, SOM 1.780.]
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayes' law
- Expectation Bayes' law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption & Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
CLICK Kernel
• Decision problem: is V…
  – a singleton (|V| = 1)?
  – a subset of 2+ clusters (needs further partitioning)?
  – a subset of a single cluster (a kernel)?
CLICK Kernel
• For every possible cut C disconnecting V:
  – H_0^C: cut C disconnects two clusters
  – H_1^C: cut C partitions a kernel
  – If Pr(H_0^C) > Pr(H_1^C) for any C, then V is not a kernel
• If V is not a kernel, the graph should be partitioned into sub-graphs H, K

W(C) = \ln \frac{\Pr(H_1^C \mid C)}{\Pr(H_0^C \mid C)} = \sum_{(i,j) \in C} w_{ij}
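Putting the two slides together: each edge carries the log-odds weight w_ij = ln(Pr(H1)/Pr(H0)), and V passes the kernel test when even its minimum-weight cut has positive total weight W(C). A toy sketch (brute-force min cut over bipartitions; the weights in the test are hypothetical):

```python
from itertools import combinations
import math

def log_odds(p1, p0):
    """Edge weight w_ij = ln(Pr(similarity | H1) / Pr(similarity | H0))."""
    return math.log(p1 / p0)

def min_cut_weight(n, w):
    """Brute-force minimum cut weight over all bipartitions of n vertices.
    w maps ordered pairs (i, j) with i < j to edge weights (fine for tiny n)."""
    best = float("inf")
    nodes = list(range(n))
    for size in range(1, n // 2 + 1):
        for side in combinations(nodes, size):
            s = set(side)
            cut = sum(w.get((min(i, j), max(i, j)), 0.0)
                      for i in s for j in nodes if j not in s)
            best = min(best, cut)
    return best

def is_kernel(n, w):
    # V is a kernel iff even the minimum-weight cut has W(C) > 0,
    # i.e. no cut C where Pr(H0 | C) > Pr(H1 | C).
    return min_cut_weight(n, w) > 0
```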
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Full CLICK

R: singleton set

Full_CLICK(Graph G = (V, E)):
    R <- V
    while |R| is reduced:
        Basic_CLICK(G = (R, E_R))
        let L be the list of kernels produced
        let R be the set of singletons produced
        Adopt(L, R)
    Merge(L)
    Adopt(L, R)
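The driver loop above can be sketched in Python. This is a minimal sketch of the control flow only: Basic_CLICK's real kernel test (a likelihood-based minimum-weight-cut criterion) is replaced here by a hypothetical similarity-threshold stand-in, and the Merge step is stubbed out, so only the iterate-then-adopt structure is shown.

```python
def basic_click(points, sim, threshold):
    """Stand-in for Basic_CLICK: greedily groups points whose similarity to a
    seed meets `threshold`. Groups of size > 1 are reported as kernels; the
    rest come back as singletons. (The real test is a min-weight-cut check.)"""
    kernels, singletons = [], []
    remaining = list(points)
    while remaining:
        seed = remaining.pop(0)
        group = [seed] + [p for p in remaining if sim(seed, p) >= threshold]
        remaining = [p for p in remaining if p not in group]
        if len(group) > 1:
            kernels.append(group)
        else:
            singletons.append(seed)
    return kernels, singletons


def adopt(kernels, singletons, sim, threshold):
    """Move each singleton into its most similar kernel, if similar enough."""
    still_single = []
    for s in singletons:
        best = max(kernels, key=lambda k: max(sim(s, p) for p in k), default=None)
        if best is not None and max(sim(s, p) for p in best) >= threshold:
            best.append(s)
        else:
            still_single.append(s)
    return still_single


def full_click(points, sim, threshold=0.5):
    """Mirror of the pseudocode: rerun Basic_CLICK on the singleton set until
    it stops shrinking, then adopt the leftovers once more. Merge is omitted."""
    kernels = []
    r = list(points)
    while True:
        new_kernels, r_next = basic_click(r, sim, threshold)
        kernels.extend(new_kernels)
        r_next = adopt(kernels, r_next, sim, threshold)
        if len(r_next) >= len(r):   # |R| no longer reduced
            break
        r = r_next
    r = adopt(kernels, r, sim, threshold)
    return kernels, r


# Toy usage: 1-D "expression" values, similarity = inverse distance.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9]
similarity = lambda a, b: 1.0 / (1.0 + abs(a - b))
kernels, singles = full_click(points, similarity, threshold=0.6)
```

On this toy input the two tight groups become kernels and the outlier 9.9 remains a singleton, illustrating why Full CLICK carries a singleton set through every iteration.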
Results & Analysis
• Two metrics:
  – Similarity between the result clusters
  – Similarity of the clusters with the paper's results
• Map clusters to find regions of similar data
• Compare the best clusters to find relationships
[Figure: per-stimulus scatter plots of 1-hour vs. 6-hour expression — BCG, E. coli, EHEC, L. monocytogenes, latex, M. tuberculosis, S. aureus, S. Typhi, S. Typhimurium — each overlaying the Bayesian, CLICK, and SOM cluster assignments.]
Error Metric
• Scores higher for clusters that contain either very many or very few of the truth-set clusters
• Scales with the number of data points in each cluster
• Low error implies the truth clusters were highly correlated
• High error implies the truth clusters were well distributed
Error — Bayesian: 4.33, CLICK: 2.44, SOM: 1.822
E = \frac{\sum_{i=1}^{k}\left[\left(\frac{|T_i| - |T_i \cap C_i|}{|T_i|}\right) \cdot |C_i|\right]}{\sum_{i=1}^{k} |C_i|}

where T_i is the i-th truth cluster and C_i its matched computed cluster.
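A minimal sketch of an error metric in this spirit follows. Two details are our assumptions rather than the slide's exact definition: each truth cluster is paired with its best-overlapping computed cluster, and the size-weighted mismatch is normalized by the summed sizes of the matched clusters.

```python
def cluster_error(truth, computed):
    """Hypothetical size-weighted mismatch score: for each truth cluster T_i,
    take the computed cluster C_i with the largest overlap, weight the
    fraction of T_i not recovered by |C_i|, and normalize by the total |C_i|."""
    num = 0.0
    den = 0.0
    for t in truth:
        t = set(t)
        best = max(computed, key=lambda g: len(t & set(g)))  # best-overlap match
        c = set(best)
        num += (len(t) - len(t & c)) / len(t) * len(c)
        den += len(c)
    return num / den


# Toy usage: one truth cluster is split across the computed clustering.
truth = [[1, 2, 3], [4, 5]]
computed = [[1, 2], [3, 4, 5]]
error = cluster_error(truth, computed)
```

A perfect clustering scores 0, and the score grows as truth clusters are scattered across computed clusters, matching the low/high-error interpretation in the bullets above.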
[Results table, per algorithm (Bayesian, CLICK, SOM): for each cluster 0–34, the fraction of our cluster matched, the fraction of the paper's cluster matched, and a cluster-quality score; plus minimum cluster composition at thresholds from 0% to 100%. Overall error — Bayesian 4.33 (computed 4.229), CLICK 2.44 (2.384), SOM 1.82 (1.780).]
- Cluster Analysis for Gene Expression Data (CSE 5615 Class Project, Group 2)
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- Bayesian Clustering
- Normal Distribution
- Expectation: Non-spherical Clusters
- Compute initial probabilities
- Expectation: Bayes's law
- Maximization: New Clusters
- Bayesian Information Criterion
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Mapping Mode (SOM)
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption & Merging
- Full CLICK
- Results & Analysis
- Error Metric
Results amp Analysis
bull Two metrics ndash Similarity between result clusters
ndash Similarity of clusters with paperrsquos results
bull Map clusters to find regions of similar data
bull Compare best clusters to find relationships
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
BCG
-1
-05
0
05
1
15
2
25
3
-04 -02 0 02 04 06 08 1 12
1 Hour
6 H
ours
Bayesian Click SOM
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
E Coli
-2
-1
0
1
2
3
4
5
6
-1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
EHEC
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5 6
1 Hour
6 H
ours
Bayesian Click SOM
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
L Monocytogenes
-4
-3
-2
-1
0
1
2
3
4
5
6
-4 -3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
Latex
-2
-15
-1
-05
0
05
1
15
2
25
-15 -1 -05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
Percent of Our Clusters Percent of Paper Cluster Quality Min Cluster Composition
Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM Bayesian Click SOM0 0 002381 0 0005051 0 0212121 0 000 000 000
1 0418605 0368771 0 0090909 0560606 0 3909091 1322576 0 5 9848 9848 98992 0 0163333 0013793 0 0247475 0010101 0 7424242 1464646 10 9848 9848 9899
3 1 0 0337278 0005051 0 0287879 0005051 0 4865152 15 9848 9444 9899
4 08 0341772 0390805 0020202 0136364 0171717 010101 1077273 1493939 20 9848 6970 98995 08 0121212 0 0040404 0040404 0 040404 2666667 0 25 8485 6970 9899
6 1 0037037 0 0005051 0005051 0 0005051 0136364 0 30 8485 6970 87377 1 0041667 088172 0025253 0005051 0414141 0126263 0121212 3851515 40 8485 000 4141
8 1 0 0255556 0015152 0 0116162 0045455 0 1045455 45 6970 000 41419 1 0 0 0010101 0020202 18 64 50 6111 000 4141
10 1 0015152 0045455 55 6111 000 4141
11 1 0005051 0005051 60 4192 000 414112 1 0010101 0020202 65 4192 000 4141
13 1 0020202 0080808 70 4192 000 414114 1 0015152 0045455 75 3485 000 4141
15 1 0005051 0005051 80 2879 000 4141
16 0928571 0065657 0919192 85 2879 000 414117 1 0035354 0247475 90 2879 000 000
18 0444444 0020202 0181818 95 1717 000 00019 0 0 0 100 10000 10000 10000
20 0909091 0050505 055555621 0444444 0040404 0727273 Error 422929 238409 178025
22 1 0005051 0005051 Error 433 2440 1822
23 0 0 024 0 0 0
25 0037037 0005051 013636426 0 0 0
27 0 0 0
28 0225 0136364 163636429 0485714 0085859 3005051
30 0736842 0070707 134343431 0004673 0005051 1080808
32 059375 0191919 122828333 0 0 0
34 0008065 0005051 0626263
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation Bayersquos law
- Expectation Bayersquos law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption amp Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-
M Tuberculosis
-2
-1
0
1
2
3
4
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
S Aureus
-3
-2
-1
0
1
2
3
4
5
6
-3 -2 -1 0 1 2 3 4 5
1 Hour
6 H
ours
Bayesian Click SOM
S Typhi
-05
0
05
1
15
2
25
-02 0 02 04 06 08 1 12 14 16
1 Hour
6 H
ours
Bayesian Click SOM
S Typhimirium
-1
-05
0
05
1
15
2
25
3
-05 0 05 1 15 2
1 Hour
6 H
ours
Bayesian Click SOM
Error Metric
bull Scores higher for clusters that either contain very many or very few clusters in the truth set
bull Scales to the number of datums in each cluster
bull Low error implies Truth cluster were highly correlated
bull High error implies Truth Clusters were well distributed
Bayesian Click SOM
Error 433 244 1822
sum
sum
=
= ⎥⎥⎦
⎤
⎢⎢⎣
⎡sdot⎟⎟⎠
⎞⎜⎜⎝
⎛ capminusminus
= k
ii
k
ii
i
C
CT
CT
E
1
1 21
21
[Table — per-cluster results for clusters 0–34: Percent of Our Clusters, Percent of Paper Cluster, and Cluster Quality for each of Bayesian, CLICK, and SOM, plus Min Cluster Composition at thresholds from 0% to 100%. Error totals: Bayesian 4.22929, CLICK 2.38409, SOM 17.8025 (rounded on the slide to 4.33, 2.44, 18.22).]
- Cluster Analysis for Gene Expression Data CSE 5615 Class Project Group 2
- Gene Expression Analysis
- Gene Expression - Data
- Gene Expression - Clustering
- Human Macrophage Activation Programs induced by Pathogens
- Project Goals
- K-Means
- K-Means
- K-Means
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Bayesian Clustering
- Normal Distribution
- Expectation Non-spherical Clusters
- Compute initial probabilities
- Expectation: Bayes' Law
- Expectation: Bayes' Law
- Maximization New Clusters
- Bayesian Information Criterion
- Bayesian Information Criterion
- Slide Number 23
- Slide Number 24
- Slide Number 25
- Slide Number 26
- Slide Number 27
- Slide Number 28
- Slide Number 29
- Self-Organizing Maps
- Best Matching Unit
- Neighborhood
- Weight Adjustment
- Slide Number 34
- Mapping Mode (SOM)
- GenePattern
- GenePattern
- Final Results
- CLuster Identification via Connectivity Kernels (CLICK)
- CLICK Preprocessing
- CLICK Similarity
- Basic CLICK
- Basic CLICK
- Slide Number 44
- CLICK Kernel
- CLICK Kernel
- Minimal Weight Cut
- MinWeightCut Example
- MinWeightCut
- CLICK Adoption & Merging
- Full CLICK
- Results amp Analysis
- Slide Number 54
- Slide Number 55
- Slide Number 56
- Slide Number 57
- Slide Number 58
- Slide Number 59
- Slide Number 60
- Slide Number 61
- Slide Number 62
- Error Metric
- Slide Number 64
-