Microarray Data Preprocessing and Clustering Analysis
Liangjiang (LJ) Wang (ljwang@ksu.edu)
KSU Bioinformatics Center, Biology Division
June, 2005
Spotted Microarray Workshop
Outline
• Overview of microarray data analysis.
• Microarray data preprocessing.
• Statistical inference of significant genes.
• Clustering analysis and visualization.
• Microarray databases and standards.
Spotted Microarray
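[Figure: spotted microarray workflow. mRNA is extracted from reference cells and experimental cells, cDNA is made and labeled, and the labeled cDNA is hybridized to the probes on the array. The array image data are converted into a gene expression matrix of ratios (genes × samples).]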
Overview of Microarray Data Analysis
Microarray experiment
Image analysis and data normalization
Statistical inference of significant genes
Sample classification
Clustering analysis of co-expressed genes
List of significant or co-expressed genes
Promoter analysis, gene function prediction, and pathway analysis
Microarray Image Analysis
• Spot finding: place a grid to identify spot locations.
• Segmentation: separate each spot (foreground) from the background.
• Spot intensity extraction: usually the mean or median intensity of all the pixels within a spot.
• Background subtraction: subtract either a local background or a globally estimated background.
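The intensity extraction and local background subtraction steps can be sketched as follows. This is a minimal illustration, not how any particular image analysis package works: the function name `spot_signal`, the toy 5 x 5 image, the pixel masks, and the choice to clip negative signals at zero are all assumptions made for the example.

```python
from statistics import median

def spot_signal(image, spot_pixels, background_pixels):
    """Median spot intensity minus median local background (hypothetical helper)."""
    foreground = median(image[r][c] for r, c in spot_pixels)
    background = median(image[r][c] for r, c in background_pixels)
    # Assumption: negative background-corrected values are clipped to zero.
    return max(foreground - background, 0)

# Toy image: a bright 3 x 3 spot surrounded by a dim background ring.
image = [
    [10, 12, 11, 10, 9],
    [11, 110, 120, 115, 10],
    [10, 118, 130, 112, 12],
    [9, 114, 116, 111, 10],
    [10, 11, 10, 12, 11],
]
spot = [(r, c) for r in range(1, 4) for c in range(1, 4)]
ring = [(r, c) for r in range(5) for c in range(5) if (r, c) not in spot]

signal = spot_signal(image, spot, ring)
print(signal)
```

Using the median rather than the mean makes the estimate less sensitive to a few saturated or dust-contaminated pixels.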
Microarray Data Normalization
• Normalization removes the systematic bias in the data so that meaningful biological comparisons can be made. Sources of bias include:
– Unequal quantities of starting RNA.
– Differences in labeling (e.g., Cy3 versus Cy5).
– Different detection efficiencies between the dyes.
– Differences in hybridization and washing.
– Other experimental variations.
• Normalization is based on some assumptions:
– A subset of genes (housekeeping genes) is assumed to be constant.
– The total intensity or overall intensity distributions between the two channels are comparable.
Global Normalization
• Total intensity normalization:
– A normalization factor is calculated by summing the measured intensities in both channels and then taking the ratio:
$$N_{\text{total}} = \frac{\sum_{i=1}^{N} R_i}{\sum_{i=1}^{N} G_i}$$
– All the intensities in one channel are multiplied by the normalization factor:
$$G_i' = N_{\text{total}} \, G_i \quad \text{and} \quad R_i' = R_i$$
• A subset of genes (housekeeping genes) may also be used for the global normalization.
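As a rough sketch of total intensity normalization, the factor is the ratio of the summed red (R) to the summed green (G) intensities, and the green channel is rescaled by it while the red channel is left unchanged. The function name and the intensity values below are made up for illustration.

```python
def total_intensity_normalize(red, green):
    """Rescale the green channel so both channels have the same total intensity."""
    factor = sum(red) / sum(green)          # normalization factor
    green_scaled = [factor * g for g in green]
    return factor, green_scaled

# Hypothetical raw intensities for three spots.
red = [1200.0, 800.0, 400.0]
green = [600.0, 400.0, 200.0]

factor, green_norm = total_intensity_normalize(red, green)
print(factor, green_norm)
```

To normalize against housekeeping genes instead, the same function could be given only the intensities of those genes to compute the factor, which is then applied to every spot.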
Scatter Plot of Cy3 vs Cy5 Intensities
[Figure: scatter plots of Cy3 vs Cy5 intensities from a “self-self” hybridization, before and after normalization (Quackenbush, 2001).]
Lowess Normalization
• Probably the most widely used approach for spotted microarray normalization.
• A locally weighted linear regression is used to estimate the systematic bias in the data.
[Figure: ratio-intensity (R-I) plots (also called MA plots) of the raw data and after lowess normalization. x-axis: mean log intensity, (1/2) log10(R*G); y-axis: log ratio, log2(R/G) (Quackenbush, 2001).]
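The idea behind lowess normalization can be sketched in plain Python: fit a locally weighted linear regression of the log ratio on the mean log intensity, then subtract the fitted trend from each log ratio. The code below is a simplified illustration, not a reference implementation. It uses tricube weights and omits the robustness iterations of the full lowess algorithm. The function names, the `frac` parameter, and the intensity values are assumptions.

```python
import math

def lowess_fit(x, y, frac=0.5):
    """Locally weighted linear regression with tricube weights (no robustness iterations)."""
    n = len(x)
    k = max(2, math.ceil(frac * n))              # points used in each local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[min(k, n) - 1] or 1e-12        # bandwidth: distance to the k-th nearest point
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))    # local line evaluated at x0
    return fitted

def lowess_normalize(red, green, frac=0.5):
    """Return log2 ratios with the intensity-dependent trend removed."""
    a = [0.5 * math.log10(r * g) for r, g in zip(red, green)]   # mean log intensity
    m = [math.log2(r / g) for r, g in zip(red, green)]          # log ratio
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]

# Hypothetical intensities with a dye bias that grows with intensity.
red = [120.0, 260.0, 510.0, 1100.0, 2300.0, 4200.0, 9000.0, 16000.0]
green = [100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0, 6400.0, 12800.0]
normalized = lowess_normalize(red, green, frac=0.6)
print([round(v, 3) for v in normalized])
```

In practice a tested library routine would normally be preferred over a hand-written fit. The sketch is only meant to show what "locally weighted" means.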
Why Log Transformation?
• log2(R/G) treats up-regulated and down-regulated genes in a similar fashion:
– If R/G = 4, log2(R/G) = 2.
– If R/G = 1/4 = 0.25, log2(R/G) = -2.
• Log transformation makes the distribution of the ratios more symmetric and closer to a normal distribution.
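A short check of this symmetry, using made-up ratios: a 4-fold increase and a 4-fold decrease end up equally far from zero on the log2 scale, whereas the raw ratios 4 and 0.25 are very unevenly spaced around 1.

```python
import math

ratios = [4, 2, 1, 0.5, 0.25]                 # hypothetical R/G ratios
log_ratios = [math.log2(r) for r in ratios]
print(log_ratios)
```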
Finding Significant Genes
• Fold change: uses a single fold change threshold to select genes; does not take into account the biological and experimental variability.
• Statistical tests: t test, SAM and ANOVA; require a number of replicates for each condition.
Volcano Plot
Larger fold changes do not necessarily mean higher significance levels.
[Figure: volcano plot; the vertical axis shows statistical significance, increasing upward (Wolfinger et al., 2001).]
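Student’s t Test
• To test whether there is a significant difference in gene expression measurements between two conditions (A and B):
– H0: no difference in gene expression, $\bar{X}_A = \bar{X}_B$
– H1: the gene is differentially expressed, $\bar{X}_A \neq \bar{X}_B$
• Test statistic:
$$t = \frac{\bar{X}_A - \bar{X}_B}{s_d} = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}$$
where $s_A^2$ and $s_B^2$ are the sample variances and $n_A$ and $n_B$ are the numbers of replicates in the two conditions.
• Calculate the probability (p value) of the t statistic with degrees of freedom df = nA + nB - 2.
• Assume a 95% confidence level (i.e., 5% false positive rate). If p ≤ 0.05, reject the null hypothesis.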
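A minimal sketch of the test statistic for one gene: the difference of the two condition means divided by the standard error sqrt(sA²/nA + sB²/nB), with df = nA + nB - 2. The function name and the replicate values are illustrative. With equal group sizes, as here, this statistic is identical to the pooled-variance t statistic. The p value itself would come from a t distribution table or a statistics library, which is not shown.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """t statistic and degrees of freedom for two groups of replicates."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / standard_error
    df = len(a) + len(b) - 2
    return t, df

# Hypothetical replicate measurements of one gene in conditions A and B.
condition_a = [0.9, 1.1, 1.4]
condition_b = [3.1, 2.9, 2.6]

t, df = t_statistic(condition_a, condition_b)
print(round(t, 2), df)
```

For df = 4, the two-sided critical value at the 5% level is about 2.776, so a gene with |t| above that would be called significant before any correction for multiple testing.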
Problem of Multiple Testing
Suppose that you have 5,000 genes on your microarray, and you select the genes with p ≤ 0.05 (i.e., 5% false positive rate). Because you have applied the t test 5,000 times, you can expect about 5,000 × 0.05 = 250 false positives even if no gene is differentially expressed!
Correction for Multiple Testing
• Bonferroni correction:
– Set the significance cutoff, p' = α / N, where α is the false positive rate, and N is the number of genes.
– For example, if you have 5,000 genes in your microarray and you accept a 5% false positive rate, the significance cutoff is p' = 0.05 / 5000 = 1.0 × 10^-5.
• False Discovery Rate (FDR):
– Rank all the genes by significance (p value) so that the top gene has the most significant p value.
– Start from the top of the list, and accept the genes if
$$p \le \frac{i}{N} \, q$$
where i is the rank of the gene in the list, N is the number of genes in the array, and q is the desired FDR.
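Both corrections can be sketched in a few lines. The FDR rule above closely resembles the Benjamini-Hochberg procedure, so the sketch below implements that procedure: it accepts every gene up to the largest rank i whose p value satisfies p ≤ (i / N) · q. The function names and the ten p values are hypothetical.

```python
def bonferroni_cutoff(alpha, n_genes):
    """Per-gene significance cutoff under the Bonferroni correction."""
    return alpha / n_genes

def fdr_accept(p_values, q):
    """Indices of genes accepted by the Benjamini-Hochberg procedure at FDR q."""
    n = len(p_values)
    order = sorted(range(n), key=lambda g: p_values[g])   # most significant first
    last = 0
    for rank, gene in enumerate(order, start=1):
        if p_values[gene] <= rank / n * q:
            last = rank                                   # largest rank passing the test
    return sorted(order[:last])

print(bonferroni_cutoff(0.05, 5000))

# Hypothetical p values for ten genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
accepted = fdr_accept(p_values, q=0.05)
print(accepted)
```

With these values only the first two genes are accepted, although five genes have p ≤ 0.05 when tested individually.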
SAM: Significance Analysis of Microarrays
• SAM (http://www-stat.stanford.edu/~tibs/SAM/) is a modified t test.
• The observed d statistic is computed from the data, and the expected d statistic is assessed by permutation.
• With a user-defined FDR, SAM derives the significance cutoffs for selecting up- and down-regulated genes.
[Figure: SAM plot of the observed d statistic against the expected d statistic, showing the line observed d = expected d, the significance cutoffs, and the up-regulated and down-regulated genes.]
ANOVA
• ANalysis Of VAriance (ANOVA) is used to find significant genes in more than two conditions:

Replicates A1 to A3: Disease A; B1 to B3: Disease B; C1 to C3: Disease C.

| Gene | A1 | A2 | A3 | B1 | B2 | B3 | C1 | C2 | C3 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| g1 | 0.9 | 1.1 | 1.4 | 1.9 | 2.1 | 2.5 | 3.1 | 2.9 | 2.6 |
| g2 | 4.2 | 3.9 | 3.5 | 5.1 | 4.6 | 4.3 | 1.8 | 2.4 | 1.5 |
| g3 | 0.7 | 1.2 | 0.9 | 1.1 | 0.9 | 0.6 | 1.2 | 0.8 | 1.4 |
| g4 | 2.0 | 1.2 | 1.7 | 4.0 | 3.2 | 2.8 | 6.3 | 5.7 | 5.1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

• For each gene, compute the F statistic.
• Calculate the p value for the F statistic.
• Adjust the significance cutoff for multiple testing.
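The per-gene F statistic of a one-way ANOVA can be sketched as the between-group mean square divided by the within-group mean square. The example uses the values of genes g1 and g3 from the table on this slide; the function name is an assumption, and converting F to a p value (from the F distribution with k - 1 and n - k degrees of freedom) is left to a statistics library.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic for a list of replicate groups."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Genes g1 and g3: replicates for Disease A, Disease B and Disease C.
g1 = [[0.9, 1.1, 1.4], [1.9, 2.1, 2.5], [3.1, 2.9, 2.6]]
g3 = [[0.7, 1.2, 0.9], [1.1, 0.9, 0.6], [1.2, 0.8, 1.4]]

f_g1 = anova_f(g1)
f_g3 = anova_f(g3)
print(round(f_g1, 2), round(f_g3, 2))
```

Gene g1, whose expression rises from Disease A to Disease C, gives a much larger F than g3, whose values overlap across the three conditions.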
Clustering Analysis
• Clustering analysis partitions a dataset into a few groups (clusters) such that:
– Homogeneity: objects in the same cluster are similar to each other.
– Separation: dissimilar objects are placed in different clusters.
• In microarray data analysis, this means finding groups of genes (or samples) with similar gene expression patterns.
• Two key questions:
– How to measure similarity of gene expression?
– How to find these gene clusters?
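Distance Metrics
• Expression vector: each gene can be represented as a vector in the N-dimensional hyperspace, where N is the number of samples.
• Euclidean distance:
$$d = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2}$$
• Vector angle:
$$\cos\alpha = \frac{\sum_{i=1}^{N} a_i b_i}{\sqrt{\sum_{i=1}^{N} a_i^2} \, \sqrt{\sum_{i=1}^{N} b_i^2}}$$
• Pearson correlation coefficient:
$$r = \frac{\sum_{i=1}^{N} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{N} (a_i - \bar{a})^2} \, \sqrt{\sum_{i=1}^{N} (b_i - \bar{b})^2}}, \quad r \in [-1, 1].$$
[Figure: genes A = (a1, a2) and B = (b1, b2) plotted as vectors against Sample 1 and Sample 2, with the Euclidean distance d and the vector angle α between them.]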
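The three measures can be written directly from their definitions. In this sketch the function names and the two expression vectors are made up; note that the Pearson correlation is simply the cosine of the angle between the mean-centered vectors.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two expression vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

# Two hypothetical genes with the same expression pattern at different scales.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]

print(euclidean(gene_a, gene_b), cosine_similarity(gene_a, gene_b), pearson(gene_a, gene_b))
```

The two genes are far apart by Euclidean distance but have a vector angle of zero and a correlation of one, which is why the choice of metric changes which genes cluster together.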
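Z Transformation
• If Euclidean distance is used for clustering analysis, z transformation of the gene expression matrix may be necessary.
• For each gene, calculate the z scores of the expression values:
$$z_i = \frac{x_i - \bar{x}}{\sigma_x}$$
where $\bar{x}$ and $\sigma_x$ are the mean and standard deviation of the gene's expression values.
[Figure: expression profiles of Gene A and Gene B across samples. Plotted as log(ratio), dAB = 3.58; plotted as z scores, dAB = 0.36.]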
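A small sketch of the effect: two genes with the same pattern but different magnitudes are far apart in Euclidean distance before the z transformation and essentially identical after it. The profiles are invented, and using the sample standard deviation (n - 1 denominator) is an assumption, since the slide does not say which form is intended.

```python
import math
from statistics import mean, stdev

def z_scores(values):
    """Z-transform one gene's expression values (sample standard deviation assumed)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical profiles: same shape, tenfold difference in magnitude.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]

distance_raw = euclidean(gene_a, gene_b)
distance_z = euclidean(z_scores(gene_a), z_scores(gene_b))
print(round(distance_raw, 2), round(distance_z, 2))
```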
Hierarchical Clustering
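[Figure: agglomerative clustering of five objects (a to e) in steps 0 to 4. Objects a and b merge first, then d and e, then c joins {d, e}, and finally all objects form the single cluster {a, b, c, d, e}.]
Agglomerative approach:
• Initialization: each object is a cluster.
• Iteration: merge the two clusters that are most similar to each other.
• Stop when all objects are merged into a single cluster.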
Hierarchical Clustering (Cont’d)
• Calculating distances between clusters:
– Single linkage: takes the shortest distance between two clusters.
– Complete linkage: uses the largest distance between two clusters.
– Average linkage: uses the average distance between two clusters.
• The clustering results are visualized using a tree (called a dendrogram) with color-coded gene expression levels.
• Hierarchical clustering can be applied to genes, samples, or both.
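[Figure: the distance between two clusters under single linkage (SL), average linkage (AL), and complete linkage (CL).]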
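The agglomerative procedure and the three linkage rules can be sketched as below. This is a naive illustration that recomputes all pairwise distances at every step, so it is only suitable for tiny examples. The function names and the five one-dimensional points (standing in for objects a to e) are assumptions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# How the pairwise distances between two clusters are combined.
LINKAGE = {
    "single": min,                               # shortest distance
    "complete": max,                             # largest distance
    "average": lambda d: sum(d) / len(d),        # average distance
}

def agglomerate(points, linkage="average"):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    combine = LINKAGE[linkage]
    clusters = [(i,) for i in range(len(points))]    # each object starts as a cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [euclidean(points[p], points[q])
                         for p in clusters[i] for q in clusters[j]]
                d = combine(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

# Hypothetical objects a, b, c, d, e (indices 0 to 4).
points = [(0.0,), (1.0,), (5.0,), (8.0,), (9.5,)]
merges = agglomerate(points, "average")
for cluster_a, cluster_b, distance in merges:
    print(cluster_a, cluster_b, distance)
```

The merge history is exactly the information a dendrogram draws: which clusters were joined and at what distance.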
Sample Clustering
Alizadeh, et al., 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503-511.
k-Means Clustering
• Initialization:
– User-defined k (number of clusters).
– Randomly place k vectors (called centroids) in the data space.
• Iteration:
– Each object is assigned to its closest centroid.
– Re-compute each centroid by taking the mean of the data vectors currently assigned to the cluster.
• Repeat until the cluster centroids no longer change.
[Figure: k-means iterations 0 to 3 for k = 2.]
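The k-means iteration can be sketched as follows. One detail is an assumption rather than something the slide specifies: the initial centroids are placed on k randomly chosen data points (or taken from an optional `init` argument) instead of at arbitrary positions in the data space. The function name, parameters and data are illustrative.

```python
import random

def kmeans(data, k, init=None, seed=0, max_iter=100):
    """Basic k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: start from k random data points unless `init` is given.
    centroids = [list(p) for p in (init if init is not None else rng.sample(data, k))]
    labels = []
    for _ in range(max_iter):
        # Assign each object to its closest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in data
        ]
        # Re-compute each centroid as the mean of its assigned vectors.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(data, labels) if label == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster's centroid
        if new_centroids == centroids:               # centroids no longer change
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical two-sample expression vectors forming two groups.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(data, k=2)
print(labels, centroids)
```

Because the result depends on the starting centroids, k-means is usually run several times with different seeds and the most compact solution is kept.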
Self-Organizing Map (SOM)
• The user defines an initial geometry of nodes (reference vectors) for the partitions such as a 3 x 2 rectangular grid.
• During the iterative “training” process, the nodes migrate to fit the gene expression data.
• The genes are mapped to the most similar reference vector.
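A compact sketch of SOM training on a 3 x 2 grid is shown below. Several details are assumptions chosen for simplicity and are not taken from the slides: reference vectors start on random data points, the learning rate and neighbourhood radius decay linearly, and the neighbourhood function is Gaussian. The function names and expression profiles are hypothetical.

```python
import math
import random

def best_node(nodes, x):
    """Grid position of the reference vector most similar to x."""
    return min(nodes, key=lambda pos: sum((a - b) ** 2 for a, b in zip(x, nodes[pos])))

def train_som(data, rows=3, cols=2, n_iter=500, seed=0):
    """Train a small SOM; returns {grid position: reference vector}."""
    rng = random.Random(seed)
    nodes = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    for t in range(n_iter):
        remaining = 1 - t / n_iter
        rate = 0.5 * remaining                            # learning rate (assumed schedule)
        radius = max(rows, cols) / 2 * remaining + 0.01   # shrinking neighbourhood
        x = rng.choice(data)
        winner = best_node(nodes, x)
        for pos, w in nodes.items():
            grid_d2 = (pos[0] - winner[0]) ** 2 + (pos[1] - winner[1]) ** 2
            influence = math.exp(-grid_d2 / (2 * radius ** 2))
            for i in range(len(w)):
                w[i] += rate * influence * (x[i] - w[i])  # node migrates toward the data
    return nodes

# Hypothetical expression profiles over four time points.
profiles = [
    [2.0, 1.0, -1.0, -2.0], [1.8, 0.9, -0.8, -1.9],
    [-2.0, -1.0, 1.0, 2.0], [-1.7, -1.1, 0.9, 2.1],
    [0.1, 2.0, 0.0, -2.0], [0.0, 1.8, 0.2, -1.9],
]
som = train_som(profiles)
assignment = [best_node(som, p) for p in profiles]   # map each gene to a node
print(assignment)
```

Unlike k-means, neighbouring nodes on the grid are pulled together during training, so adjacent partitions tend to hold similar expression patterns.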
Clustering analysis of a yeast cell cycle time-series dataset
[Figure: example clusters found by k-means (237 genes) and by SOM (194 genes).]
Tools for Microarray Data Analysis
• GenePix (http://www.axon.com/GN_GenePixSoftware.html): commercial software for microarray image analysis.
• GeneSpring (http://www.silicongenetics.com/cgi/SiG.cgi/Products/GeneSpring/index.smf): commercial software for microarray data analysis.
• TIGR MeV (http://www.tm4.org/mev.html): free software for clustering, visualization, classification and statistical analysis of microarray data.
• Bioconductor (http://www.bioconductor.org/): open source, free software for the analysis of genomic data. For microarray data analysis, most of the statistical methods are implemented in R.
Microarray Databases
• Gene Expression Omnibus (GEO) at NCBI (http://www.ncbi.nlm.nih.gov/geo/): a public repository for high throughput gene expression data.
• ArrayExpress at EBI (http://www.ebi.ac.uk/arrayexpress/): a public repository for microarray gene expression data; MIAME compliant.
• Stanford Microarray Database (SMD at http://genome-www5.stanford.edu/): stores raw and normalized microarray data; provides data retrieval and online data processing.
The MIAME Standard
• MIAME (Minimum Information About a Microarray Experiment) is a microarray data standard proposed by the Microarray Gene Expression Database group (MGED, http://www.mged.org/).
• MIAME (http://www.mged.org/Workgroups/MIAME/) describes the information needed to interpret the results of a microarray experiment and potentially to reproduce the experiment.
• The MIAME checklist helps authors, reviewers and editors of scientific journals meet the MIAME requirements and make microarray data available to the community in a useful way.
Summary
• Image analysis and data normalization are important preprocessing steps for microarray data analysis.
• Statistical methods are available for selecting significantly up- or down-regulated genes.
• Clustering analysis is widely used to explore and visualize microarray data.
• The resulting significant or co-expressed genes can be further investigated using Gene Ontology annotation and promoter analysis.