CZ5211 Topics in Computational Biology
Lecture 5: Clustering Analysis for Microarray Data III
Prof. Chen Yu Zong
Tel: 6874-6877  Email: [email protected]
http://xin.cz3.nus.edu.sg  Room 07-24, level 7, SOC1, NUS
Self-Organizing Maps
• Based on the work of Kohonen on learning/memory in the human brain
• As with k-means, the number of clusters needs to be specified
• Moreover, a topology must also be specified – a 2D grid that gives the geometric relationships between the clusters (i.e., which clusters should be near or distant from each other)
• The algorithm learns a mapping from the high dimensional space of the data points onto the points of the 2D grid (there is one grid point for each cluster)
Self-Organizing Maps
• Creates a map in which similar patterns are plotted next to each other
• Data visualization technique that reduces n dimensions and displays similarities
• More complex than k-means or hierarchical clustering, but more meaningful
• Neural network technique, inspired by the brain
Self-Organizing Maps (SOM)
• Each unit of the SOM has a weighted connection to all inputs
• As the algorithm progresses, neighboring units are grouped by similarity
[Figure: network with an input layer fully connected to an output layer]
Biological Motivation
Nearby areas of the cortex correspond to related brain functions
The brain maps the external multidimensional representation of the world into a similar 1- or 2-dimensional internal representation.
That is, the brain processes external signals in a topology-preserving way.
Mimicking the way the brain learns, our system should be able to do the same thing.
Brain’s self-organization
A Self-Organized Map
Data: vectors X^T = (X1, ..., Xd) from d-dimensional space.
Grid of nodes, with local processor (called neuron) in each node.
Local processor # j has d adaptive parameters W(j).
Goal: change W(j) parameters to recover data clusters in X space.
SOM Network
• Unsupervised learning neural network
• Projects high-dimensional input data onto two-dimensional output map
• Preserves the topology of the input data
• Visualizes structures and clusters of the data
[Figure: input layer with components 1–5 fully connected to output-layer nodes i and c through weights wi1…wi5 and wc1…wc5]
- The input vector is represented by scalar signals x1 to xn: x = (x1 … xn)
- Every unit "i" in the competitive layer has a weight vector associated with it, represented by variable parameters wi1 to win: wi = (wi1 ... win)
- We compute the total input to each neuron by taking the weighted sum of the input signals:

  si = Σj wij xj,  j = 1 … n

- Every weight vector may be regarded as a kind of image that is matched or compared against a corresponding input vector; our aim is to devise an adaptive process in which the weights of all units converge to values such that every unit "i" becomes sensitive to a particular region of the input domain.
SOM Algorithm
- Geometrically, the weighted sum is simply the dot (scalar) product of the input vector and the weight vector: si = x · wi = x1 wi1 + ... + xn win
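As a tiny sketch of this matching step (the input vector and weight vectors below are made-up toy numbers, not values from the lecture):

```python
# s_i = x . w_i : the total input to unit i is the dot product of the
# input vector x with that unit's weight vector w_i.
def dot(x, w):
    return sum(xj * wj for xj, wj in zip(x, w))

x = [1.0, 0.0, 2.0]                    # toy input vector
weights = [                            # toy weight vectors, one per unit
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 1.0],
    [0.0, 0.2, 0.1],
]

scores = [dot(x, w) for w in weights]  # s_i for each unit: [0.9, 2.5, 0.2]
best = max(range(len(scores)), key=lambda i: scores[i])  # unit 1 matches best
```

Note that the update rule later in these slides picks the winner by minimum Euclidean distance ||x − wj|| instead; the two criteria agree when the weight vectors are normalized to equal length.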
SOM Algorithm
[Figure: a 3×4 SOM, a 2-D map of nodes; each row of the data array supplies an input vector xk, and each map node i carries a weight vector mi]

Find the winner:      c = argmin_i || xk − mi ||

Update the weights:   mi(t+1) = mi(t) + α(t) hci(t) [ x(t) − mi(t) ]

Self-organizing: the node weights of the 3×4 SOM move toward the data.
SOM Algorithm
SOM Algorithm
• Learning Algorithm
1. Initialize the weights w
2. Find the winning node:
   i(x) = argmin_j || x(n) - wj(n) ||
3. Update the weights of the winner and its neighbors:
   wj(n+1) = wj(n) + η(n) hj,i(x)(n) [ x(n) - wj(n) ]
4. Reduce the neighborhood and the learning rate η
5. Go to step 2
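As a hedged sketch of these five steps (the grid size, the Gaussian neighborhood h, and the decay schedules for η and the radius σ are illustrative choices, not values from the lecture), in plain Python:

```python
import math
import random

def train_som(data, rows=3, cols=3, epochs=30, seed=0):
    """Sketch of the learning algorithm above: 1. initialize weights,
    2. find the winner, 3. update the winner and its grid neighbors,
    4. shrink the learning rate and neighborhood, 5. repeat."""
    rng = random.Random(seed)
    dim = len(data[0])
    # 1. Initialize the weights randomly; node j sits at grid cell (j // cols, j % cols)
    w = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]

    for t in range(epochs):
        eta = 0.5 * (1 - t / epochs)                     # learning rate, decays toward 0
        sigma = max(0.5, (rows / 2) * (1 - t / epochs))  # neighborhood radius, shrinks
        for x in data:
            # 2. Winner: i(x) = argmin_j || x - w_j ||
            win = min(range(len(w)),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(x, w[j])))
            wr, wc = divmod(win, cols)
            # 3. Pull the winner (h = 1) and, more weakly, its grid neighbors toward x
            for j in range(len(w)):
                r, c = divmod(j, cols)
                h = math.exp(-((r - wr) ** 2 + (c - wc) ** 2) / (2 * sigma ** 2))
                w[j] = [wj + eta * h * (a - wj) for a, wj in zip(x, w[j])]
        # 4./5. eta and sigma shrink on the next pass; loop back to step 2
    return w

# Two well-separated toy clusters in 2-D; after training, some node
# ends up near each cluster.
data = [[0.1, 0.1], [0.0, 0.2], [0.9, 0.9], [1.0, 0.8]]
weights = train_som(data)
```

After training, each input is assigned to the node with the nearest weight vector, so nearby inputs land on nearby grid cells.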
SOM Training Process
[Figure: x = data points, o = positions of the neuron weight vectors; the weights of the 2-D neuron grid point to locations in the N-dimensional data space]
Nearest neighbor vectors are clustered into the same node
Concept of SOM

[Figure: input space (input layer) mapped onto a reduced feature space (map layer); samples s1, s2 and cluster centers labeled Mn, Sr, Ba]

Clustering and ordering of the cluster centers in a two-dimensional grid: the cluster centers (code vectors) and the place of these code vectors in the reduced space.
Concept of SOM

[Figure: trained map SA3 with regions labeled Ba, Mn, Sr, Mg]

The map can be used for visualization, for classification, or for clustering.
SOM Architecture
• The input is connected with each neuron of a lattice.
• The topology of the lattice allows one to define a neighborhood structure on the neurons, like those illustrated below.

[Figure: a 2D topology with two possible neighborhoods, and a 1D topology with a small neighborhood]
Self-Organizing Maps (SOMs)

Idea: Place genes onto a grid so that genes with similar patterns of expression are placed on nearby squares.

[Figure: genes a, b, c, d placed onto nearby grid squares A, B, C, D]
Self-Organizing Maps (SOMs)

         a_1hr a_2hr a_3hr b_1hr b_2hr b_3hr
Gene 1     1     2     4     5     7     9
Gene 2     2     3     7     7     6     3
Gene 3     4     4     5     5     4     4
Gene 4     3     4     3     4     3     3
Gene 5     1     2     3     4     5     6
Gene 6     8     7     7     6     5     3
Gene 7     4     4     4     4     5     4
Gene 8     5     6     5     4     3     2
Gene 9     3     3     1     3     6     8
Gene 10    2     4     8     5     4     2
Gene 11    1     5     6     9     8     7
Gene 12    1     3     5     8     8     6
Gene 13    4     3     3     4     5     6
Gene 14    9     7     5     3     2     1
Gene 15    1     2     2     3     4     4
Gene 16    1     2     5     7     8     9

[Figure: successive snapshots of a 3×3 grid of SOM nodes A–I as the map self-organizes over the 16 gene profiles]
Self-Organizing Maps (SOMs)

[Figure: final 3×3 SOM grid of nodes A–I, with the 16 genes from the previous slide assigned to nodes:]
• Genes 1, 16, and 5
• Genes 6 and 14
• Genes 9 and 13
• Genes 4, 7, and 2
• Gene 3
• Gene 15
• Gene 8
• Gene 10
• Genes 11 and 12
Self-Organizing Maps
• Suppose we have an r × s grid with each grid point associated with a cluster mean μ1,1, …, μr,s
• SOM algorithm moves the cluster means around in the high dimensional space, maintaining the topology specified by the 2D grid (think of a rubber sheet)
• A data point is put into the cluster with the closest mean
• The effect is that nearby data points tend to map to nearby clusters (grid points)
A Simple Example of Self-Organizing Map
This is a 4 x 3 SOM and the mean of each cluster is displayed
SOM Applied to Microarray Analysis
• Consider clustering 10,000 genes
• Each gene was measured in 4 experiments
– Input vectors are 4-dimensional
– The initial data set is 10,000 genes, each described by a 4D vector
• Each of the 10,000 genes is chosen one at a time to train the SOM
SOM Applied to Microarray Analysis
• The pattern found to be closest to the current gene (determined by weight vectors) is selected as the winner
• The weight vector is then modified to become more similar to the current gene, by an amount set by the learning rate (the η of the previous slides)
• The winner then pulls its neighbors closer to the current gene by causing a lesser change in weight
• This process continues for all 10,000 genes
• Process is repeated until over time the learning rate is reduced to zero
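Once training has finished, reading off the clusters is just the winner step applied once more: each gene goes to the node whose weight vector it is closest to. A minimal sketch with a made-up 2×2 map over the 4 experiments (the weight vectors and the gene profile are invented for illustration):

```python
def nearest_node(x, weights):
    """Return the index of the map node whose weight vector is closest
    to the expression profile x (squared Euclidean distance)."""
    return min(range(len(weights)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, weights[j])))

# Toy "trained" 2x2 map: one 4-D weight vector per node (4 experiments)
weights = [
    [1.0, 2.0, 3.0, 4.0],   # node 0: rising profile
    [4.0, 3.0, 2.0, 1.0],   # node 1: falling profile
    [5.0, 5.0, 5.0, 5.0],   # node 2: flat high
    [0.0, 0.0, 0.0, 1.0],   # node 3: low
]

gene = [1.1, 2.0, 2.9, 4.2]            # one gene's expression across 4 experiments
cluster = nearest_node(gene, weights)  # -> 0, the rising-profile node
```

Running this assignment over all genes partitions the data into one cluster per map node.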
SOM Applied to Microarray Analysis of Yeast
• Yeast Cell Cycle SOM. www.pnas.org/cgi/content/full/96/6/2907
• (a) 6 × 5 SOM. The 828 genes that passed the variation filter were grouped into 30 clusters. Each cluster is represented by the centroid (average pattern) for genes in the cluster. Expression level of each gene was normalized to have mean = 0 and SD = 1 across time points. Expression levels are shown on y-axis and time points on x-axis. Error bars indicate the SD of average expression. n indicates the number of genes within each cluster. Note that multiple clusters exhibit periodic behavior and that adjacent clusters have similar behavior. (b) Cluster 29 detail. Cluster 29 contains 76 genes exhibiting periodic behavior with peak expression in late G1. Normalized expression pattern of 30 genes nearest the centroid are shown. (c) Centroids for SOM-derived clusters 29, 14, 1, and 5, corresponding to G1, S, G2 and M phases of the cell cycle, are shown.
SOM Applied to Microarray Analysis of Yeast
• Reduced the data set to 828 genes
• Clustered the data into 30 clusters using a SOFM
• Each cluster is represented by its average (centroid) pattern
• Genes in the same cluster show the same behavior; neighboring clusters exhibit similar behavior
A SOFM Example With Yeast
Benefits of SOM
• SOM contains the set of features extracted from the input patterns (reduces dimensions)
• SOM yields a set of clusters
• A gene will always be more similar to the genes in its immediate neighborhood than to genes further away
Problems of SOM
• The algorithm is complicated and there are a lot of parameters (such as the “learning rate”); these settings will affect the results
• The idea of a topology in high-dimensional gene expression spaces is not exactly obvious
– How do we know what topologies are appropriate?
– In practice people often choose nearly square grids for no particularly good reason
• As with k-means, we still have to worry about how many clusters to specify…
Comparison of SOM and K-means
• K-means is a simple yet effective algorithm for clustering data
• Self-organizing maps are slightly more computationally expensive than K-means, but they additionally capture the spatial relationships among clusters
Other Clustering Algorithms
• Clustering is a very popular method of microarray analysis and also a well established statistical technique – huge amount of literature out there
• Many variations on k-means, including algorithms in which clusters can be split and merged or that allow for soft assignments (multiple clusters can contribute)
• Semi-supervised clustering methods, in which some examples are assigned by hand to clusters and then other membership information is inferred