microarray data analysis (lecture for cs498-cxz algorithms in bioinformatics) oct 13, 2005...
TRANSCRIPT
![Page 1: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/1.jpg)
Microarray Data Analysis
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Oct 13, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
![Page 2: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/2.jpg)
Outline
Microarrays
Hierarchical Clustering
K-Means Clustering
Corrupted Cliques Problem
CAST Clustering Algorithm
![Page 3: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/3.jpg)
Motivation: Inferring Gene Functionality
Researchers want to know the functions of newly sequenced genes
Simply comparing the new gene sequences to known DNA sequences often does not give away the function of gene
For 40% of sequenced genes, functionality cannot be ascertained by only comparing to sequences of other known genes
Microarrays allow biologists to infer gene function even when sequence similarity alone there is insufficient to infer function.
![Page 4: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/4.jpg)
Microarrays and Expression Analysis
Microarrays measure the activity (expression level) of the genes under varying conditions/time points
Expression level is estimated by measuring the amount of mRNA for that particular gene
A gene is active if it is being transcribed
More mRNA usually indicates more gene activity
![Page 5: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/5.jpg)
Microarray Experiments
Produce cDNA from mRNA (DNA is more stable) Attach phosphor to cDNA to see when a particular
gene is expressed Different color phosphors are available to compare
many samples at once Hybridize cDNA over the micro array Scan the microarray with a phosphor-illuminating laser
Illumination reveals transcribed genes
Scan microarray multiple times for the different color phosphor’s
![Page 6: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/6.jpg)
Microarray Experiments (con’t)
www.affymetrix.com
Phosphors can be added here instead
Then instead of staining, laser illumination can be used
![Page 7: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/7.jpg)
Interpretation of Microarray Data
Green: expressed only from control
Red: expressed only from experimental cell
Yellow: equally expressed in both samples
Black: NOT expressed in either control or experimental cells
![Page 8: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/8.jpg)
Microarray Data Microarray data are usually transformed into an intensity
matrix (below)
The intensity matrix allows biologists to make correlations between different genes (even if they are
dissimilar) and to understand how gene functions might be related
Clustering comes into playTime: Time X Time Y Time Z
Gene 1 10 8 10
Gene 2 10 0 9
Gene 3 4 8.6 3
Gene 4 7 8 3
Gene 5 1 2 3
Intensity (expression level) of gene at measured time
![Page 9: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/9.jpg)
Major Analysis Techniques
Single gene analysis Compare the expression levels of the same gene under
different conditions
Main techniques: Significance test (e.g., t-test)
Gene group analysis Find genes that are expressed similarly across many different
conditions (co-expressed genes -> co-regulated genes)
Main techniques: Clustering (many possibilities)
Gene network analysis Analyze gene regulation relationship at a large scale
Main techniques: Bayesian networks, won’t be covered
![Page 10: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/10.jpg)
Clustering of Microarray Data
Clusters
![Page 11: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/11.jpg)
Homogeneity and Separation Principles
• Homogeneity: Elements within a cluster are close to each other
• Separation: Elements in different clusters are further apart from each other
• …clustering is not an easy task!
Given these points a clustering algorithm might make two distinct clusters as follows
![Page 12: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/12.jpg)
Bad Clustering
This clustering violates both Homogeneity and Separation principles
Close distances from points in separate clusters
Far distances from points in the same cluster
![Page 13: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/13.jpg)
Good Clustering
This clustering satisfies both Homogeneity and Separation principles
![Page 14: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/14.jpg)
Clustering Methods
Similarity-based (need a similarity function)
Construct a partition Agglomerative, bottom up
Typically “hard” clustering
Model-based (latent models, probabilistic or algebraic)
First compute the model
Clusters are obtained easily after having a model
Typically “soft” clustering
![Page 15: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/15.jpg)
Similarity-based Clustering
Define a similarity function to measure similarity between two objects
Common clustering criteria: Find a partition to Maximize intra-cluster similarity
Minimize inter-cluster similarity
Challenge: How to combine them? (many possibilities)
How to construct clusters? Agglomerative Hierarchical Clustering (Greedy)
Need a cutoff (either by the number of clusters or by a score threshold)
![Page 16: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/16.jpg)
Method 1 (Similarity-based):
Agglomerative Hierarchical Clustering
![Page 17: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/17.jpg)
Similarity Measures
Given two gene expression vectors, X={X1, X2, …, Xn} and Y={Y1, Y2, …, Yn}, define a similarity function s(X,Y) to measure how likely the two genes are co-regulated/co-expressed However “co-expressed” is hard to quantify
Biology knowledge would help
In practice, many choices Euclidean distance
Pearson correlation coefficient
1 1 1
2 2 2 2
1 1 1 1
( , )
[ ( ) ] [ ( ) ]
n n n
i i i ii i i
n n n n
i i i ii i i i
n x y x ys X Y
n x x n y y
Better measures are possible (e.g., focusing on a subset of values…
2
1
( , ) ( )n
i ii
s X Y x y
![Page 18: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/18.jpg)
Agglomerative Hierachical Clustering
Given a similarity function to measure similarity between two objects
Gradually group similar objects together in a bottom-up fashion
Stop when some stopping criterion is met
Variations: different ways to compute group similarity based on individual object similarity
![Page 19: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/19.jpg)
Hierarchical Clustering
![Page 20: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/20.jpg)
Hierarchical Clustering Algorithm
1. Hierarchical Clustering (d , n)
2. Form n clusters each with one element
3. Construct a graph T by assigning one vertex to each cluster
4. while there is more than one cluster
5. Find the two closest clusters C1 and C2
6. Merge C1 and C2 into new cluster C with |C1| +|C2| elements
7. Compute distance from C to all other clusters
8. Add a new vertex C to T and connect to vertices C1 and C2
9. Remove rows and columns of d corresponding to C1 and C2
10. Add a row and column to d corrsponding to the new cluster C
11. return T
The algorithm takes a nxn distance matrix d of pairwise distances between points as an input.
![Page 21: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/21.jpg)
Hierarchical Clustering Algorithm
1. Hierarchical Clustering (d , n)
2. Form n clusters each with one element
3. Construct a graph T by assigning one vertex to each cluster
4. while there is more than one cluster
5. Find the two closest clusters C1 and C2
6. Merge C1 and C2 into new cluster C with |C1| +|C2| elements
7. Compute distance from C to all other clusters
8. Add a new vertex C to T and connect to vertices C1 and C2
9. Remove rows and columns of d corresponding to C1 and C2
10. Add a row and column to d corrsponding to the new cluster C
11. return T
Different ways to define distances between clusters
lead to different clustering methods…
![Page 22: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/22.jpg)
How to Compute Group Similarity?
Given two groups g1 and g2,
Single-link algorithm: s(g1,g2)= similarity of the closest pair
complete-link algorithm: s(g1,g2)= similarity of the farthest pair
average-link algorithm: s(g1,g2)= average of similarity of all pairs
Three Popular Methods:
![Page 23: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/23.jpg)
Three Methods Illustrated
Single-link algorithm
?
g1 g2
complete-link algorithm
……
average-link algorithm
![Page 24: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/24.jpg)
Group Similarity Formulas
Given C (new cluster) and C* (any other cluster in the current results)
Given a similarity function s(x,y) (or equivalently, a distance function, d(x,y) as in the book)
Single link:
Complete link:
Average link:
max min, *, *
( , *) max ( , ) ( , *) min ( , )x C y Cx C y C
s C C s x y or d C C d x y
, * , *
( , ) ( , )
( , *) , ( , *)| || * | | || * |
x C y C x C y Cavg avg
s x y d x y
s C C or d C CC C C C
min max, * , *
( , *) min ( , ) ( , *) max ( , )x C y C x C y C
s C C s x y or d C C d x y
![Page 25: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/25.jpg)
Comparison of the Three Methods
Single-link “Loose” clusters
Individual decision, sensitive to outliers
Complete-link “Tight” clusters
Individual decision, sensitive to outliers
Average-link “In between”
Group decision, insensitive to outliers
Which one is the best? Depends on what you need!
![Page 26: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/26.jpg)
Method 2 (Model-based):
K-Means
![Page 27: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/27.jpg)
A Model-based View of K-Means
Each cluster is modeled by a centroid point
A partition of the data V={v1…vn} with k clusters is modeled by k centroid points X={x1…xk}; Note that a model X uniquely specifies the clustering results
Define an objective function on X and V to measure how good/bad V is as clustering results of X, f(X,V)
Clustering is reduced to the problem of finding a model X* that optimizes the value f(X*,V)
Once X* is found, the clustering results are simply computed according to our model X*
Major tasks: (1) define d(x,v); (2) define f(X,V); (3) Efficiently search for X*
( ) { | , , ( , ) ( , )}i i j l j i j lC x v V x X l i d v x d v x
![Page 28: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/28.jpg)
Squared Error Distortion● Given a set of n data points V={v1…vn} and a set
of k points X={x1…xk}
● Define d(xi,vj)=Euclidean distance
● Define f(X,V)=d(X,V) (Squared error distortion)
2
1
1
1( , ) ( , )
( , ) min ( , )
n
ii
i i jj k
d X V d v Xn
d v X d v x
The distance from vi to
the closest point in X
![Page 29: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/29.jpg)
K-Means Clustering Problem: Formulation
● Input: A set, V, consisting of n points and a parameter k
● Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X
This problem is NP-complete.
![Page 30: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/30.jpg)
K-Means Clustering
Given a distance function
Start with k randomly selected data points
Assume they are the centroids of k clusters
Assign every data point to a cluster whose centroid is the closest to the data point
Recompute the centroid for each cluster
Repeat this process until the distance-based objective function converges
![Page 31: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/31.jpg)
K-Means Clustering: Lloyd Algorithm
1. Lloyd Algorithm
2. Arbitrarily assign the k cluster centers
3. while the cluster centers keep changing
4. Assign each data point to the cluster Ci (xi) corresponding to the closest
cluster representative (i.e,. the center xi) (1 ≤ i ≤ k)
5. After the assignment of all data points, compute new cluster representatives according to the center of gravity of
each cluster, that is, the new cluster representative is
*This may lead to merely a locally optimal clustering.
'( )| |v C
i ii
vx C
C
![Page 32: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/32.jpg)
0
1
2
3
4
5
0 1 2 3 4 5
expression in condition 1
ex
pre
ss
ion
in
co
nd
itio
n 2
x1
x2
x3
![Page 33: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/33.jpg)
0
1
2
3
4
5
0 1 2 3 4 5
expression in condition 1
exp
ress
ion
in c
on
dit
ion
2
x1
x2
x3
![Page 34: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/34.jpg)
0
1
2
3
4
5
0 1 2 3 4 5
expression in condition 1
exp
ress
ion
in c
on
dit
ion
2
x1
x2
x3
![Page 35: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/35.jpg)
0
1
2
3
4
5
0 1 2 3 4 5
expression in condition 1
exp
ress
ion
in c
on
dit
ion
2
x1
x2 x3
![Page 36: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/36.jpg)
Evaluation of Clustering Results
How many clusters? Score threshold in hierarchical clustering (need to have some confidence in the
scores) Try out multiple numbers, and see which one works the best (best could mean
fitting the data better or discovering some known clusters)
Quality of a single cluster A tight cluster is better (but needs to quantify the “tightness”) Prior knowledge about clusters
Quality of a set of clusters Quality of individual clusters How well different clusters are separated
In general, evaluation of clustering and choosing the right clustering algorithms are difficult
But the good news is that significant clusters tend to show up regardless what clustering algorithms to use
![Page 37: Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of](https://reader036.vdocument.in/reader036/viewer/2022062722/56649f295503460f94c43430/html5/thumbnails/37.jpg)
What You Should Know
Why clustering microarray data is interesting
How hierarchical agglomerative clustering works
The 3 different ways of computing the group similarity/distance
How k-means works