![Page 1: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/1.jpg)
Cluster analysis
Function
• Places genes with similar expression patterns in groups.
• Sometimes genes of unknown function will be grouped with genes of known function.
• The functions that are known allow the investigator to hypothesize regarding the functions of genes not yet characterized.
Examples:
• Identify genes important in cell cycle regulation
• Identify genes that participate in a biosynthetic pathway
• Identify genes involved in a drug response
• Identify genes involved in a disease response
![Page 2: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/2.jpg)
Cluster analysis software is developed to group genes with similar patterns of expression.
In this example, the columns represent different timepoints, and the rows represent the results for a single gene.
The products of the genes expressed in a single cluster may have related or similar functions.
FUNCTIONAL GENOMICS
CLUSTER ANALYSIS OF MICROARRAY DATA
(11)
![Page 3: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/3.jpg)
CLUSTER ANALYSIS
![Page 4: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/4.jpg)
OUTLINE OF TALK
![Page 5: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/5.jpg)
Clustering - Goal
Partition of the genes in the dataset into distinct sets (clusters), according to similarity in their expression profiles across the probed conditions
![Page 6: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/6.jpg)
MATRIX(genes, conditions) = Expression dataset
Rows (genes), Columns (conditions [timepoints, or tissues])
$$
\begin{pmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1n} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\
x_{31} & x_{32} & x_{33} & \cdots & x_{3n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & x_{m3} & \cdots & x_{mn}
\end{pmatrix}
$$
The first gene vector = $(x_{11}, x_{12}, x_{13}, x_{14}, \ldots, x_{1n})$
The leftmost condition vector = $(x_{11}, x_{21}, x_{31}, \ldots, x_{m1})$
![Page 7: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/7.jpg)
Clustering yeast cell cycle dataset VS gene tree ordering
![Page 8: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/8.jpg)
Clustering – why?
• Reduce the dimensionality of the problem: identify the major patterns in the dataset
• Similar expression profiles suggest functional relationship
- Functional annotation of ESTs
- Links among pathways
• Related functions suggest coordinated regulatory control
- Dissection of regulatory networks
![Page 9: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/9.jpg)
Similarity measures
Clustering identifies groups of genes with “similar” expression profiles
How is similarity measured?
• Euclidean distance
• Correlation coefficient
• Others
![Page 10: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/10.jpg)
In an experiment with 10 conditions, the gene expression profiles for two genes X and Y would have this form:
X = (x1, x2, x3, …, x10)
Y = (y1, y2, y3, …, y10)
![Page 11: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/11.jpg)
Similarity measure - Euclidean distance
In general: if there are M experiments:
X = (x1, x2, x3, …, xm)
Y = (y1, y2, y3, …, ym)
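The distance formula itself was shown as a figure and did not survive in the text. A minimal Python sketch, assuming the standard Euclidean distance (the square root of the summed squared differences across the M experiments), might look like this; the function name and the profile values are made up for illustration.

```python
import math

def euclidean_distance(x, y):
    """Standard Euclidean distance between two equal-length expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same experiments")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical profiles of two genes over M = 4 experiments.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [2.0, 4.0, 6.0, 8.0]
d = euclidean_distance(gene_x, gene_y)
print(d)
```

A smaller value means the two profiles are more similar, both in shape and in magnitude.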
![Page 12: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/12.jpg)
Similarity measure – Correlation Coefficient
X = (x1, x2, x3, …, xm)
Y = (y1, y2, y3, …, ym)
-1 ≤ S(X,Y) ≤ 1
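The formula for S(X,Y) was also a figure. The sketch below assumes S is the Pearson correlation coefficient, which is consistent with the stated range of -1 to 1; other correlation variants (for example an uncentered one) would differ slightly. The function name and data are illustrative.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two profiles of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a flat profile")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical profiles: gene_y follows the same trend as gene_x at twice the
# amplitude, gene_z follows the opposite trend.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [2.0, 4.0, 6.0, 8.0]
gene_z = [4.0, 3.0, 2.0, 1.0]
s_xy = pearson_correlation(gene_x, gene_y)
s_xz = pearson_correlation(gene_x, gene_z)
print(s_xy, s_xz)
```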
![Page 13: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/13.jpg)
Euclidean vs Correlation
• Euclidean distance: takes into account the magnitude of the expression
• Correlation distance: insensitive to the amplitude of expression, takes into account the trends of the change
• Common trends are considered biologically relevant and the magnitude is considered less important → correlation
Gene X
Gene Y
![Page 14: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/14.jpg)
What correlation distance sees
What Euclidean distance sees
![Page 15: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/15.jpg)
PCA
A technique for projecting the expression data set onto a reduced (2- or 3-dimensional), easily visualized space
• Dataset: thousands of genes probed in 5 conditions.
• The expression profile of each gene is represented by the vector of its expression levels: X = (X1, X2, X3, X4, X5)
• Imagine each gene X as a point in a 5-dimensional space. Each direction/axis corresponds to a specific condition.
• Genes with similar profiles are close to each other in this space.
• PCA: project this dataset onto 2 dimensions, preserving as much information as possible.
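As a rough illustration of the projection, the sketch below computes the first two principal components of a tiny made-up dataset (six genes, five conditions) in plain Python. It assumes PCA on the covariance matrix and uses power iteration with deflation, which is one simple way to obtain the leading components; real tools typically use an SVD routine instead. All names and values are hypothetical.

```python
import math
import random

def center(data):
    """Subtract the per-condition mean so every column averages to zero."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[v - m for v, m in zip(row, means)] for row in data]

def covariance(centered):
    """Condition-by-condition covariance matrix of centered data."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols] for ci in cols]

def mat_vec(matrix, v):
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]

def leading_eigenvector(matrix, iterations=1000, seed=0):
    """Power iteration: repeated multiplication converges to the top eigenvector."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))
    return v, eigenvalue

def pca_2d(data):
    """Project each gene onto the two directions of largest variance."""
    centered = center(data)
    cov = covariance(centered)
    pc1, ev1 = leading_eigenvector(cov)
    # Deflation removes the variance already explained by PC1.
    size = len(cov)
    deflated = [[cov[i][j] - ev1 * pc1[i] * pc1[j] for j in range(size)] for i in range(size)]
    pc2, _ = leading_eigenvector(deflated, seed=1)
    coords = [(sum(a * b for a, b in zip(row, pc1)),
               sum(a * b for a, b in zip(row, pc2))) for row in centered]
    return coords, pc1, pc2

# Hypothetical data: three rising genes, two falling genes, one irregular gene.
profiles = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.2, 2.1, 2.9, 4.2, 5.1],
    [0.8, 1.9, 3.2, 3.9, 4.8],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [5.1, 3.8, 3.1, 2.2, 0.9],
    [3.0, 3.5, 1.0, 3.2, 2.9],
]
coords, pc1, pc2 = pca_2d(profiles)
for gene_index, (a, b) in enumerate(coords, start=1):
    print(f"gene {gene_index}: PC1={a:.2f}, PC2={b:.2f}")
```

Plotting the two coordinates of each gene gives the kind of 2-dimensional view in which groups of similar genes can be inspected by eye. The sign of each component is arbitrary.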
![Page 16: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/16.jpg)
PCA transformation of a microarray dataset
Visual estimation of the number of clusters in the data
![Page 17: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/17.jpg)
Clustering Algorithms
• K-means
• SOMs
• Hierarchical clustering
![Page 18: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/18.jpg)
K-MEANS
1. The user sets the number of clusters, k.
2. Initialization: each gene is randomly assigned to one of the k clusters.
3. An average expression vector is calculated for each cluster (the cluster’s profile).
4. Iterate over the genes:
• For each gene, compute its similarity to the cluster profiles.
• Move the gene to the cluster it is most similar to.
• Recalculate the cluster profiles.
5. Score the current partition: sum of distances between genes and the profile of the cluster they are assigned to (homogeneity of the solution).
6. Stop criterion: further shuffling of genes results in only minor improvement in the clustering score.
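The steps above can be sketched in plain Python as follows. This is a simplified batch variant, not necessarily what any particular tool does: cluster profiles are recalculated once per pass rather than after every single move, Euclidean distance is assumed as the (dis)similarity, and an empty cluster is re-seeded with a random gene. Function names, parameters, and data are hypothetical.

```python
import math
import random

def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def mean_profile(profiles):
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def k_means(genes, k, max_rounds=100, tolerance=1e-9, seed=0):
    rng = random.Random(seed)
    # Step 2: random initial assignment (balanced so no cluster starts empty).
    labels = [i % k for i in range(len(genes))]
    rng.shuffle(labels)
    previous_score = float("inf")
    for _ in range(max_rounds):
        # Step 3: average expression vector (profile) of each cluster.
        centers = []
        for c in range(k):
            members = [g for g, lab in zip(genes, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random gene.
            centers.append(mean_profile(members) if members else list(rng.choice(genes)))
        # Step 4: move every gene to the cluster whose profile it is closest to.
        labels = [min(range(k), key=lambda c: distance(g, centers[c])) for g in genes]
        # Step 5: score = sum of distances between genes and their cluster profile.
        score = sum(distance(g, centers[lab]) for g, lab in zip(genes, labels))
        # Step 6: stop when the score no longer improves noticeably.
        if previous_score - score < tolerance:
            break
        previous_score = score
    return labels, centers, score

# Hypothetical two-condition profiles forming two obvious groups.
genes = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]]
labels, centers, score = k_means(genes, k=2)
print(labels, round(score, 3))
```

Because the starting assignment is random, different seeds can give different partitions on less clear-cut data, which is one reason to compare several runs.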
![Page 19: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/19.jpg)
How to choose the number of clusters needed to informatively partition the data
• Try several parameters (number of desired clusters, distance metric) and compare the clustering solutions
• Criteria for comparison: Homogeneity vs Separation
• Use PCA (Principal Component Analysis)
![Page 20: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/20.jpg)
K-MEANS example: 4 clusters
Mean profile
Standard deviation in each condition
![Page 21: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/21.jpg)
Evaluating K-means
Cluster 3
Cluster 1
Cluster 4
Cluster 2
Mis-classified
![Page 22: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/22.jpg)
K-means example: 3 clusters
![Page 23: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/23.jpg)
Too few clusters: K=2
![Page 24: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/24.jpg)
SOMs (Self-Organizing Maps)
User sets the number of clusters in the form of a rectangular grid (e.g., 3x2): the ‘map nodes’
Imagine genes as points in (M-dimensional) space
Initialization: map nodes are randomly placed in the data space
![Page 25: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/25.jpg)
Genes – data points
Clusters – map nodes
![Page 26: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/26.jpg)
SOM - Scheme
• Randomly choose a data point (gene).
• Find its closest map node
• Move this map node towards the data point
• Move the neighbor map nodes towards this point, but to a lesser extent (thinner arrows show weaker shift)
• Iterate over data points
![Page 27: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/27.jpg)
• Each successive gene profile (black dot) has less of an influence on the displacement of the nodes.
• Iterate through all profiles several times (10-100)
• When positions of the cluster nodes have stabilized, assign each gene to its closest map node (cluster)
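The SOM scheme described on these slides can be sketched as below. The details are assumptions chosen for illustration, not taken from the presentation: a linearly decaying learning rate (so later profiles move the nodes less), a Gaussian neighborhood on the grid whose radius shrinks over time, and Euclidean distance in the data space. All names and values are hypothetical.

```python
import math
import random

def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def closest_node(profile, nodes):
    return min(range(len(nodes)), key=lambda i: distance(profile, nodes[i]))

def train_som(genes, rows, cols, passes=50, start_rate=0.5, seed=0):
    rng = random.Random(seed)
    dims = len(genes[0])
    lows = [min(g[d] for g in genes) for d in range(dims)]
    highs = [max(g[d] for g in genes) for d in range(dims)]
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialization: map nodes are placed at random inside the data space.
    nodes = [[rng.uniform(lows[d], highs[d]) for d in range(dims)] for _ in grid]
    total_steps = passes * len(genes)
    step = 0
    for _ in range(passes):
        order = list(genes)
        rng.shuffle(order)  # visit the data points in random order
        for gene in order:
            progress = step / total_steps
            rate = start_rate * (1.0 - progress)  # later genes have less influence
            radius = 0.1 + max(rows, cols) / 2.0 * (1.0 - progress)
            win_r, win_c = grid[closest_node(gene, nodes)]
            for i, (r, c) in enumerate(grid):
                # Neighbors on the grid move too, but less the farther they are.
                grid_dist_sq = (r - win_r) ** 2 + (c - win_c) ** 2
                influence = math.exp(-grid_dist_sq / (2.0 * radius ** 2))
                nodes[i] = [n + rate * influence * (x - n) for n, x in zip(nodes[i], gene)]
            step += 1
    # Final step: assign each gene to its closest map node (cluster).
    assignments = [closest_node(g, nodes) for g in genes]
    return nodes, assignments

# Hypothetical two-condition profiles and a 3x2 grid of map nodes.
genes = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.2],
         [5.1, 4.9], [9.0, 1.0], [9.1, 1.2], [8.9, 0.8]]
nodes, assignments = train_som(genes, rows=3, cols=2)
print(assignments)
```

Some map nodes may end up with no genes assigned, which is expected when the grid has more nodes than the data has groups.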
![Page 28: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/28.jpg)
![Page 29: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/29.jpg)
Hierarchical Clustering
Goal #1: Organize the genes in a structure of a hierarchical tree
1) Initial step: each gene is regarded as a cluster with one item
2) Find the 2 most similar clusters and merge them into a common node (red dot)
3) Merge successive nodes until all genes are contained in a single cluster
Goal #2: Collapse branches to group genes into distinct clusters
[Figure: tree over genes g1 g2 g3 g4 g5 with nodes {1,2}, {4,5}, {1,2,3}, {1,2,3,4,5}]
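A small sketch of the merging procedure, using five made-up profiles chosen so that the merge order matches the tree in the figure. The slide does not say how the similarity between two multi-gene clusters is computed; average linkage over Euclidean distances is assumed here, and all names and values are hypothetical.

```python
import math
from itertools import combinations

def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the genes of two clusters (an assumption)."""
    total = sum(distance(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def hierarchical_clustering(profiles):
    # Initial step: each gene is a cluster with one item.
    clusters = [frozenset([name]) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Find the two most similar clusters and merge them into a common node.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], profiles))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(sorted(a | b))
    return merges

# Hypothetical profiles for genes g1..g5 over two conditions.
profiles = {
    1: [1.0, 1.0],
    2: [1.1, 1.0],
    3: [2.0, 2.0],
    4: [9.0, 9.0],
    5: [9.2, 9.0],
}
merges = hierarchical_clustering(profiles)
print(merges)  # [[1, 2], [4, 5], [1, 2, 3], [1, 2, 3, 4, 5]]
```

Cutting the tree before the last merge would give the two clusters {1,2,3} and {4,5}, which corresponds to collapsing branches as in Goal #2.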
![Page 30: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/30.jpg)
Mathematical evaluation of clustering solution
Merits of a ‘good’ clustering solution:
• Homogeneity: genes inside a cluster are highly similar to each other. Measured as the average similarity between a gene and the center (average profile) of its cluster.
• Separation: genes from different clusters have low similarity to each other. Measured as the weighted average similarity between centers of clusters.
These are conflicting features: increasing the number of clusters tends to improve within-cluster homogeneity at the expense of between-cluster separation.
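The two scores can be sketched as follows. Two choices here are assumptions rather than something stated on the slide: similarity is taken to be the Pearson correlation, and the separation average weights each pair of clusters by the product of their sizes. Names and data are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a flat profile")
    return cov / math.sqrt(vx * vy)

def mean_profile(profiles):
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def homogeneity_and_separation(clusters):
    """clusters: list of clusters (at least two), each a list of gene profiles."""
    centers = [mean_profile(c) for c in clusters]
    total_genes = sum(len(c) for c in clusters)
    # Homogeneity: average similarity between a gene and its cluster center.
    homogeneity = sum(pearson(g, center)
                      for c, center in zip(clusters, centers) for g in c) / total_genes
    # Separation: size-weighted average similarity between cluster centers.
    weighted_sum, weight_total = 0.0, 0
    for i, j in combinations(range(len(clusters)), 2):
        weight = len(clusters[i]) * len(clusters[j])
        weighted_sum += weight * pearson(centers[i], centers[j])
        weight_total += weight
    return homogeneity, weighted_sum / weight_total

# Hypothetical solution: one cluster of rising genes, one of falling genes.
rising = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.2, 2.9, 4.1]]
falling = [[4.0, 3.0, 2.0, 1.0], [4.2, 3.1, 1.9, 1.0]]
homogeneity, separation = homogeneity_and_separation([rising, falling])
print(round(homogeneity, 3), round(separation, 3))
```

With similarity-based scores, a good solution has high homogeneity and a low separation value, since the latter measures how similar the cluster centers are to each other.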
![Page 31: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/31.jpg)
Performance on Yeast Cell Cycle Data
[Figure: Homogeneity and Separation for the “True” solution, CAST*, GeneCluster, K-means and CLICK]
698 genes, 72 conditions (Spellman et al. 1998). Each algorithm was run by its authors in a “blind” test.
*Ben-Dor, Shamir, Yakhini 1999
![Page 32: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/32.jpg)
Overall strategy:
• PCA transformation
• Clustering and evaluation of clustering
• Check for bio-significance
![Page 33: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/33.jpg)
Which genes to cluster?
• Apply filtering prior to clustering: focus the analysis on the ‘responding genes’
• Applying controlled statistical tests to identify ‘responding genes’ usually yields too few genes to allow a global characterization of the response.
• Fold change: choose genes that changed by at least M-fold in at least L conditions
• Variance: filter out genes that do not vary greatly among the conditions of the experiment.
• Try various filtering schemes to find the setting that gives the best biological results
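The fold-change and variance filters could be sketched like this. The sketch assumes that each gene's values are log2 expression ratios relative to a reference, so an M-fold change in either direction corresponds to an absolute value of at least log2(M); the thresholds and gene names are made up.

```python
import math
from statistics import variance

def passes_fold_change(log2_ratios, min_fold, min_conditions):
    """True if the gene changed by at least min_fold in at least min_conditions."""
    threshold = math.log2(min_fold)
    changed = sum(1 for value in log2_ratios if abs(value) >= threshold)
    return changed >= min_conditions

def passes_variance(log2_ratios, min_variance):
    """True if the gene varies enough across the conditions (sample variance)."""
    return variance(log2_ratios) >= min_variance

# Hypothetical log2 ratios for three genes over four conditions.
dataset = {
    "geneA": [0.1, -0.2, 0.0, 0.1],
    "geneB": [1.5, 2.0, -0.1, 1.2],
    "geneC": [0.0, 1.1, 0.1, -0.1],
}
by_fold = [name for name, values in dataset.items()
           if passes_fold_change(values, min_fold=2, min_conditions=2)]
by_variance = [name for name, values in dataset.items()
               if passes_variance(values, min_variance=0.5)]
print(by_fold, by_variance)
```

Changing `min_fold`, `min_conditions`, or `min_variance` and re-running the clustering is one way to try the various filtering schemes mentioned above.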
![Page 34: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/34.jpg)
Clustering – Tools
• Cluster (Eisen) – hierarchical
• GeneCluster (Tamayo) – SOM
• TIGR MeV – K-Means, SOM, hierarchical, QTC, CAST
• Expander – CLICK, SOM, K-means, hierarchical
• Many others (e.g. GeneSpring)
![Page 35: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/35.jpg)
CSLA Workshop, Day 2
Presentation created by Rani Elkon and posted at:
http://www.tau.ac.il/lifesci/bioinfo/teaching/2002-2003/DNA_microarray_winter_2003.html
![Page 36: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/36.jpg)
Ascribing Biological Meaning to Clusters
Identify over-represented functional categories in the clusters (i.e., a cluster contains many more genes of a specific biological process than expected by chance)
Requirements for systematic analysis:
• Controlled vocabulary for describing biological processes (protein biosynthesis/translation, apoptosis/programmed cell death)
• Standard assignment of genes into functional categories
![Page 37: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/37.jpg)
Gene Ontology (GO) project
Purpose:
1) Define controlled terms (ontologies) for description of gene products from 3 aspects:
• Biological process (DNA repair, mitosis)
• Molecular function (protein serine/threonine kinase activity, transcription factor activity)
• Cellular component (nucleus, ribosome)
2) Establish a unified framework for organism-independent gene annotation
Characteristics:
1) A gene can have multiple associations in each ontology
2) GO terms are organized in hierarchical structures called directed acyclic graphs (DAGs)
- The most general classifications are at the top levels of the graph
- More specialized classifications are at lower levels
![Page 38: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/38.jpg)
Hierarchical classification scheme for proteins that function in M-phase of mitosis
Each gene can be a member of more than one GO classification
![Page 39: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/39.jpg)
Online Databases that annotate genes by GO
• Human – Entrez http://www.ncbi.nih.gov/entrez/query.fcgi?db=gene and GOA http://www.ebi.ac.uk/GOA/
• Mouse – Mouse Genome Informatics (MGI) http://www.informatics.jax.org/
• Rat – Rat Genome Database http://rgd.mcw.edu/
• Fly – FlyBase http://flybase.bio.indiana.edu/
• Arabidopsis – TAIR http://www.arabidopsis.org/
• Yeast – Saccharomyces Genome Database http://www.yeastgenome.org/
• Affymetrix chips – NetAffx http://www.affymetrix.com
![Page 40: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/40.jpg)
Example: Cluster 3, 95 genes
![Page 41: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/41.jpg)
Identifying enriched GO categories in clusters
In the previous example:
• Total number of the chip’s genes with annotation = 5,000
• Total number of the chip’s genes associated with the metabolism GO category = 3,600 (72%)
• Number of annotated genes in cluster 3 = 73
• Number of metabolic genes in cluster 3 = 50 (68%)
Statistical tests are essential to determine whether enrichment of a certain class of proteins is significant
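The slide does not name a specific test. One common choice for this kind of question is the hypergeometric test, sketched below with the numbers from the example; the function and parameter names are made up for illustration.

```python
from math import comb

def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` genes without replacement from a
    population of `population` genes, `successes` of which are in the category."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favorable = sum(comb(successes, k) * comb(population - successes, draws - k)
                    for k in range(observed, upper + 1))
    return favorable / total

# Numbers from the example: 5,000 annotated genes on the chip, 3,600 of them
# metabolic; cluster 3 has 73 annotated genes, 50 of them metabolic.
p_value = hypergeom_upper_tail(population=5000, successes=3600, draws=73, observed=50)
print(f"P(at least 50 metabolic genes by chance) = {p_value:.3f}")
```

In this example the cluster holds a slightly smaller share of metabolic genes (68%) than the chip as a whole (72%), so the test gives a large p-value: a category that looks dominant within a cluster is not necessarily enriched.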
![Page 42: Cluster analysis Function Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes](https://reader035.vdocument.in/reader035/viewer/2022062407/56649d605503460f94a40d05/html5/thumbnails/42.jpg)
Acknowledgements
SOM figures in this presentation were taken from a presentation by Benedikt Brors