TRANSCRIPT
![Page 1: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/1.jpg)
Microarray normalization
Computational Genomics 10810
Guy Zinman
![Page 2: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/2.jpg)
Different types of microarrays

| Method of generating probes | Single-channel (one sample hybridized) | Dual-channel (two samples hybridized) |
|---|---|---|
| cDNA library | | cDNA spotted arrays (Stanford) |
| Synthesized | Affymetrix (25-mer), NimbleGen (50-75-mer), Agilent (60-mer), Illumina (50-mer) | NimbleGen, Agilent, Illumina |
![Page 3: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/3.jpg)
cDNA microarrays
[Figure: known DNA sequences are spotted on a glass slide; mRNA is isolated from the cells of interest and from a reference sample, and is fluorescently labeled.]
![Page 4: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/4.jpg)
cDNA microarrays
• cDNA microarray experiments typically involve hybridising two mRNA samples, each of which has been converted into cDNA and labelled with its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5), and a green fluorescent dye, Cyanine 3 (Cy3)), on a single glass slide that has been spotted with several thousand cDNA probes.
• Because of competitive binding between the two samples, the ratio of the red and green fluorescence intensities for each spot indicates the relative abundance, in the two samples, of the transcript that matches the spotted probe. Data from cDNA microarrays therefore provide information only on the relative expression of the genes.
![Page 5: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/5.jpg)
Scanning (two-color arrays)
[Figure: the hybridized microarray is excited with laser 1 and laser 2; the emission in each channel is recorded as a monochrome picture, and the two pictures are combined.]
![Page 6: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/6.jpg)
Image Segmentation
[Figure: scanned image, segmentation software, numerical data]
![Page 7: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/7.jpg)
Oligonucleotide (synthesized) arrays
• The probes are short sequences designed to match parts of the sequence of known or predicted open reading frames.
• Oligonucleotide arrays are produced by printing short oligonucleotide sequences designed to represent a single gene or family of gene splice-variants by synthesizing this sequence directly onto the array surface.
![Page 8: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/8.jpg)
Oligonucleotide (synthesized) arrays
• Instead of one or several cDNA clones per gene, an oligo-microarray contains two sets of probes for each gene, arranged as probe pairs.
• One set consists of ‘perfect match’ (PM) oligonucleotides that are designed on non-conserved regions in a gene.
• The other set contains ‘mismatch’ (MM) oligonucleotides that are identical to the ‘perfect match’ oligonucleotides except for the nucleotide in the middle of the sequence.
![Page 9: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/9.jpg)
Oligonucleotide (synthesized) arrays
• In the presence of a specific RNA in the hybridization solution, the perfect match probes will hybridize more strongly on average than their mismatch partners.
• This assumption is used to determine the presence or absence of this specific RNA.
• For each probeset, the value usually taken as representative of the expression level of the corresponding gene (the quantitative RNA abundance) is the average difference: the mean, over the probe pairs, of the perfect match intensity minus the mismatch intensity.
• In addition, the mismatch oligonucleotides are used to estimate cross-hybridisation and local background signals.
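The average difference described above can be sketched in a few lines. This is only an illustration: the function name `average_difference` and the intensity values are made up for the example and do not come from any Affymetrix software.

```python
from statistics import mean


def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one probeset."""
    if len(pm) != len(mm):
        raise ValueError("PM and MM need one value per probe pair")
    return mean(p - m for p, m in zip(pm, mm))


# Hypothetical intensities for one probeset with four probe pairs.
pm = [1200.0, 950.0, 1430.0, 1100.0]
mm = [300.0, 410.0, 520.0, 290.0]
print(average_difference(pm, mm))
```

A plain mean is sensitive to a single outlying probe pair, which is one motivation for the robust summaries (Tukey biweight, median polish) used by later methods.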
![Page 10: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/10.jpg)
Other applications of microarray technology (besides measuring gene expression)
• DNA copy number analysis
• SNP analysis
• ChIP-chip (interaction data)
• Competitive growth assays
• …
![Page 11: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/11.jpg)
The DNA Array Analysis Pipeline
[Figure: pipeline diagram. Stages shown: Question; Experimental Design; Array design and Probe design, or Buy Chip/Array; Sample Preparation; Hybridization; Image analysis; Normalization; Expression Index Calculation; Comparable Gene Expression Data; Statistical Analysis; Fit to Model (time series); Advanced Data Analysis (Clustering, PCA, Classification, Promoter Analysis, Meta analysis, Survival analysis, Regulatory Network).]
![Page 12: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/12.jpg)
Data Normalisation

| Technique | Choices | Aim (Real) | Aim (Ideal) |
|---|---|---|---|
| Transformation (“To Near Normality”) | Log2; Lin-Log | Numerically tractable | Gaussian |
| Normalisation (“Location”) | Location parameter: 1. Mean, 2. Median, 3. Regression(s) (LOWESS) | Account for systematic effects | Gaussian |
| Standardisation (“Scale”) | Scale parameter | Stabilise variance | Gaussian |
![Page 13: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/13.jpg)
Data Normalisation
Transformation … to near normality
[Figure: raw data (exponential-like) next to log2-transformed data (normal-like)]
Armidale Animal Breeding Summer Course, UNE, Feb. 2006
![Page 14: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/14.jpg)
The Starting Point: The Ratio (2-color arrays)
[Figure: ratio histogram; x-axis: ratio (0 to 2), y-axis: frequency]
![Page 15: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/15.jpg)
Log(ratio) Histogram (two-color arrays)
Log ratios treat up- and down-regulated genes equally:
log2(1) = 0, log2(2) = 1, log2(1/2) = -1
[Figure: log(ratio) histogram; x-axis: log(ratio) (-2 to 2), y-axis: frequency]
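A small sketch of why the log ratio is the preferred starting point: a two-fold increase and a two-fold decrease sit at 2 and 0.5 on the ratio scale but at +1 and -1 on the log2 scale. The helper name `log2_ratio` and the intensities are illustrative only.

```python
import math


def log2_ratio(red, green):
    """log2 of the red/green intensity ratio for one spot."""
    return math.log2(red / green)


# Unchanged, two-fold up, and two-fold down (hypothetical intensities).
for red, green in [(100.0, 100.0), (200.0, 100.0), (100.0, 200.0)]:
    ratio = red / green
    print(f"ratio={ratio:.2f}  log2(ratio)={log2_ratio(red, green):+.1f}")
```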
![Page 16: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/16.jpg)
Two kinds of variation in the signal
Gene-specific variation (stochastic)
• Spotting (size and shape)
• Cross-hybridization
• Dye
• Biological variation
  • Effect
  • Noise
Global variation (systematic)
• RNA quality
• Sample preparation
• Dye
• Hybridization
• Photodetection
![Page 17: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/17.jpg)
Sources of Non-Biological Variation
• Dye bias: differences in heat and light sensitivity, efficiency of dye incorporation
• Differences in the amount of labeled cDNA hybridized to each channel in a microarray experiment (here "channel" refers to a particular slide/dye combination)
• Variation across replicate slides
• Variation across hybridization conditions
• Variation in scanning conditions
• Variation among technicians doing the lab work
• etc.
![Page 18: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/18.jpg)
Spatial variations
• Regional variations
• Up (red) and down (green) regulated genes should be randomly distributed across the slide (but they’re not)
[Figure: array image with a green corner]
![Page 19: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/19.jpg)
Sources of variation
Global variation (systematic):
• Similar effect on many measurements
• Corrections can be estimated from data
• Handled by: normalization
Gene-specific variation (stochastic):
• Too random to be explicitly accounted for
• “noise”
• Handled by: statistical testing
![Page 20: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/20.jpg)
What is Normalization?
• Calibration = Normalization = Scaling
• Normalization describes the process of removing (or minimizing) non-biological variation in measured signal intensity levels so that biological differences in gene expression can be appropriately detected.
• Typically normalization attempts to remove global effects, i.e., effects that can be seen by examining plots that show all the data for a slide or slides.
![Page 21: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/21.jpg)
Normalization - two problems
I. How do we detect biases? Which genes should we use for estimating biases among chips/channels?
II. How do we remove the biases?
![Page 22: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/22.jpg)
Which Genes to use for bias detection?
1. All genes on the chip
• Assumption: most of the genes are equally expressed in the compared samples; the proportion of differentially expressed genes is low (<20%).
• Limits:
  • Not appropriate when comparing highly heterogeneous samples (different tissues)
  • Not appropriate for analysis of ‘dedicated chips’ (apoptosis chips, inflammation chips, etc.)
![Page 23: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/23.jpg)
Which Genes to use for bias detection?
2. Housekeeping genes
• Assumption: based on prior knowledge, a set of genes can be regarded as equally expressed in the compared samples
• In other words, one or more genes are assumed to be expressed at a constant rate.
• Affy novel chips: ‘normalization set’ of 100 genes
• NHGRI’s cDNA microarrays: set of 70 "house-keeping" genes
• Limits:
  • The validity of the assumption is questionable
  • Housekeeping genes are usually expressed at high levels, so they are not informative for the low-intensity range
![Page 24: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/24.jpg)
Which Genes to use for bias detection?
3. Spiked-in controls from another organism, over a range of concentrations
• Limits:
  • Low number of controls, so less robust
  • Can’t detect biases due to differences in RNA extraction protocols
4. “Invariant set”
• Trying to identify genes that are expressed at similar levels in the compared samples without relying on any prior knowledge:
  • Rank the genes in each chip according to their expression level
  • Find genes with a small change in ranks
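The two invariant-set steps (rank, then keep genes whose rank barely moves) can be sketched as a single pass over two chips. This is a simplified illustration, not the dChip procedure: the threshold `max_rank_change`, the function names, and the intensities are assumptions made for the example, and ties are broken by position.

```python
def ranks(values):
    """Rank of each value within its chip (0 = lowest intensity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0] * len(values)
    for r, i in enumerate(order):
        rank[i] = r
    return rank


def invariant_set(chip_a, chip_b, max_rank_change):
    """Indices of genes whose rank differs by at most max_rank_change."""
    ra, rb = ranks(chip_a), ranks(chip_b)
    return [g for g in range(len(chip_a)) if abs(ra[g] - rb[g]) <= max_rank_change]


# Hypothetical intensities for six genes on two chips.
chip_a = [50.0, 400.0, 120.0, 900.0, 75.0, 300.0]
chip_b = [60.0, 100.0, 150.0, 1100.0, 700.0, 350.0]
print(invariant_set(chip_a, chip_b, max_rank_change=1))
```

Genes 1 and 4 swap places in the ranking between the two chips, so they are excluded and the remaining genes would be used to estimate the bias.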
![Page 25: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/25.jpg)
Normalization methods
![Page 26: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/26.jpg)
1. Global normalization (Scaling)
• A single normalization factor (k) is computed for balancing chips/channels:
$X_i^{\text{norm}} = k \cdot X_i$, or, for 2-color arrays, $\log_2(R/G) \rightarrow \log_2(R/G) - c$
• Multiplying intensities by this factor equalizes the mean (median) intensity among the compared chips.
• Assumption: the total RNA (mass) used is the same for both samples.
• So, averaged across thousands of genes, total hybridization should be the same for both samples.
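A minimal sketch of both forms of global normalization, assuming the median is used as the location parameter. The function names and intensities are hypothetical. For the 2-color case, subtracting c from every log ratio is equivalent to dividing every R/G ratio by 2^c.

```python
import math
from statistics import median


def scale_to_reference(chip, reference):
    """Single-channel form: find k so the chip median matches the reference median."""
    k = median(reference) / median(chip)
    return k, [k * x for x in chip]


def center_log_ratios(red, green):
    """Two-color form: subtract the median log2(R/G) so the log ratios center on 0."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    c = median(m)
    return c, [v - c for v in m]


# Hypothetical intensities.
k, scaled = scale_to_reference([100.0, 200.0, 400.0], [150.0, 300.0, 600.0])
print(k, scaled)

c, centered = center_log_ratios([200.0, 400.0, 1600.0, 100.0],
                                [100.0, 200.0, 400.0, 100.0])
print(c, centered)
```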
![Page 27: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/27.jpg)
Global Normalization (2-color)
[Figure: before and after]
![Page 28: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/28.jpg)
Linear normalization – doesn’t work well
![Page 29: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/29.jpg)
2. Quantile normalization
• A quantile has a fixed fraction of genes below it in intensity. The first quantile could have 1% of the genes below it in intensity, and so on.
• Assumption: the intensity distribution should be the same on every chip, so the entire distribution is equalized.
• Sort intensities in each chip
• Compute mean intensity in each rank across the chips
• Replace each intensity by the mean intensity at its rank
[Figure: Chip #1, Chip #2, Chip #3 and the average chip]
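The three quantile-normalization steps (sort, average per rank, substitute) can be sketched directly. The function name and the toy intensities are illustrative; ties are broken by position here, which is a simplifying assumption.

```python
def quantile_normalize(chips):
    """chips: list of equal-length intensity lists, one list per chip."""
    n = len(chips[0])
    if any(len(chip) != n for chip in chips):
        raise ValueError("all chips must measure the same number of genes")

    # Step 1: sort intensities within each chip.
    sorted_chips = [sorted(chip) for chip in chips]
    # Step 2: mean intensity at each rank across chips (the "average chip").
    rank_means = [sum(s[r] for s in sorted_chips) / len(chips) for r in range(n)]

    # Step 3: replace each intensity by the mean intensity at its rank.
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        for r, i in enumerate(order):
            out[i] = rank_means[r]
        normalized.append(out)
    return normalized


# Three hypothetical chips, three genes each.
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 7.0], [3.0, 9.0, 5.0]]
for row in quantile_normalize(chips):
    print(row)
```

After normalization every chip contains exactly the same set of values, only assigned to different genes according to each gene's original rank on that chip.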
![Page 30: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/30.jpg)
Side-by-side boxplots show examples of variation across channels.
![Page 31: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/31.jpg)
[Figure: side-by-side boxplots of the Cy3 and Cy5 channels for Slide 1 and Slide 2; each box marks the minimum, Q1 = 25th percentile, median, Q3 = 75th percentile, and maximum.]
![Page 32: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/32.jpg)
Log Mean Signal Centered at 0
![Page 33: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/33.jpg)
![Page 34: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/34.jpg)
3. Intensity-dependent normalization
• Compensate for intensity-dependent biases
• M vs A plots (also called R-I plots)
• X axis: A, the average intensity: A = 0.5*log(Cy3*Cy5)
• Y axis: M, the log ratio: M = log(Cy3/Cy5)
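The M and A values defined above can be computed per spot as follows. The slides write `log` without a base; base 2 is assumed here, and the function name and intensities are illustrative.

```python
import math


def ma_values(cy3, cy5):
    """Return (M, A) lists: M = log2(Cy3/Cy5), A = 0.5 * log2(Cy3*Cy5)."""
    m = [math.log2(g / r) for g, r in zip(cy3, cy5)]
    a = [0.5 * math.log2(g * r) for g, r in zip(cy3, cy5)]
    return m, a


# Hypothetical spot intensities.
cy3 = [400.0, 100.0, 64.0]
cy5 = [100.0, 100.0, 256.0]
m, a = ma_values(cy3, cy5)
for mi, ai in zip(m, a):
    print(f"A={ai:.2f}  M={mi:+.2f}")
```

Plotting M against A rotates the usual Cy3-versus-Cy5 scatterplot by 45 degrees, which makes any dependence of the log ratio on overall intensity easier to see.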
![Page 35: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/35.jpg)
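Intensity-dependent bias
[Figure: M vs A plot with M = log(Cy3/Cy5) on the y-axis and A on the x-axis. Annotations: M>0 means Cy3>Cy5; M<0 means Cy3<Cy5; the low-intensity and high-intensity regions are marked along A.]
* Global normalization cannot remove intensity-dependent biases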
![Page 36: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/36.jpg)
We expect the M vs A plot to look like:
[Figure: M = log(Cy3/Cy5) plotted against A]
![Page 37: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/37.jpg)
The Qspline method (Workman et al. 2002)
• From the empirical distribution, a number of quantiles are calculated for each of the channels to be normalized (one channel shown in red) and for the reference distribution (shown in black).
• A QQ-plot is made and a normalization curve is constructed by fitting a cubic spline function.
• As reference one can use an artificial “median array” for a set of arrays or use a log-normal distribution, which is a good approximation.
![Page 38: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/38.jpg)
The Qspline method (Workman et al. 2002)
• Assumes overall intensity between two arrays should be comparable (majority of genes assumed constant). This means that the quantiles of the distributions, plotted in a quantile-quantile plot, should lie on the diagonal. If they do not, they form a normalization curve that is signal dependent.
• Even if the assumption does not hold, and less than a majority of genes are constant, the normalization still works provided that the numbers of up-regulated and down-regulated genes are roughly equal.
![Page 39: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/39.jpg)
dChip (Li & Wong)
• Assumption: a subset of genes is assumed constant.
• The constant control genes are not known a priori but are instead identified as genes whose intensity rank does not differ by more than a threshold value between two arrays.
• This invariant set is defined iteratively and used to draw a normalization curve that is signal dependent.
• In practice it works well.
(Li and Wong, 2001)
![Page 40: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/40.jpg)
RMA
• Robust Multi-array Average (RMA) expression measure (Irizarry et al., Biostatistics, 2003); ignores MM probes.
• Assumes all chips have the same background and the same distribution of values.
• Fit the additive model by iteratively re-weighted least-squares or median polish.
![Page 41: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/41.jpg)
RMA
![Page 42: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/42.jpg)
RMA Model
From a presentation by Ben Bolstad: http://bmbolstad.com/talks/Bolstad-%20AffymetrixTalkAug072006.pdf
![Page 43: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/43.jpg)
RMA: Fitting Model
• Median polish (figure)
• Robust regression
From a presentation by Ben Bolstad: http://bmbolstad.com/talks/Bolstad-%20AffymetrixTalkAug072006.pdf
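The model figure is not reproduced in this transcript, so the following is a hedged sketch of median polish under the usual additive assumption: each log2 probe intensity is an overall level plus a probe (row) effect plus a chip (column) effect plus a residual. The function name, the fixed iteration count, and the toy table are assumptions for illustration; this is not the RMA reference implementation.

```python
from statistics import median


def median_polish(table, n_iter=10):
    """table[i][j]: log2 intensity of probe i on chip j.

    Returns (overall, row_effects, col_effects, residuals).
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [row[:] for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep the median out of each row (probe).
        for i in range(n_rows):
            d = median(resid[i])
            row_eff[i] += d
            resid[i] = [v - d for v in resid[i]]
        d = median(col_eff)
        col_eff = [c - d for c in col_eff]
        overall += d
        # Sweep the median out of each column (chip).
        for j in range(n_cols):
            d = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += d
            for i in range(n_rows):
                resid[i][j] -= d
        d = median(row_eff)
        row_eff = [r - d for r in row_eff]
        overall += d
    return overall, row_eff, col_eff, resid


# Hypothetical probeset: three probes (rows) measured on two chips (columns).
table = [[7.0, 9.0], [8.0, 10.0], [9.0, 11.0]]
overall, row_eff, col_eff, resid = median_polish(table)
# Chip-level expression estimate: overall level plus the chip effect.
print([overall + c for c in col_eff])
```

Because medians are used in place of means, a single aberrant probe has little influence on the chip-level estimates.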
![Page 44: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/44.jpg)
RMA: One Drawback
From a presentation by Ben Bolstad: http://bmbolstad.com/talks/Bolstad-%20AffymetrixTalkAug072006.pdf
![Page 45: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/45.jpg)
MAS5
• MAS5 (now GCOS/GDAS) by Affymetrix
• Unlike RMA, it compares PM and MM probes
• MM* is an adjusted MM that is never bigger than PM
• Signal = TukeyBiweight{log(PM_j - MM*_j)}
• Tukey biweight is a robust averaging procedure with weights and outlier rejection
• A robust measure of center, like the median or trimmed mean
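A sketch of a one-step Tukey biweight average, to make "weights and outlier rejection" concrete. The tuning constants `c=5.0` and `eps=1e-4`, the use of log base 2, and the function names are assumptions for illustration, and the computation of the adjusted mismatch MM* is not shown; `mm_star` is taken as given.

```python
import math
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    m = median(values)
    s = median(abs(v - m) for v in values)  # median absolute deviation
    total_w = 0.0
    total_wx = 0.0
    for v in values:
        u = (v - m) / (c * s + eps)
        # Points more than c MADs from the median get zero weight.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        total_w += w
        total_wx += w * v
    return total_wx / total_w


def mas5_like_signal(pm, mm_star):
    """Tukey biweight of log2(PM_j - MM*_j); requires PM_j > MM*_j for every pair."""
    return tukey_biweight([math.log2(p - m) for p, m in zip(pm, mm_star)])


# Hypothetical log-scale probe values with one gross outlier (12.0).
print(tukey_biweight([8.0, 8.1, 7.9, 8.0, 12.0]))
```

In this example the value 12.0 lies far more than five MADs from the median, so it receives zero weight and the result stays close to 8.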
![Page 46: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/46.jpg)
MAS5: p-value and calls
• First calculate the discriminant for each probe pair: R = (PM - MM)/(PM + MM)
• A one-sided Wilcoxon signed-rank test is used to compare R against τ and determine the p-value
• Present/Marginal/Absent calls are obtained by thresholding the p-value:
  • Present: p-value ≤ α1
  • Marginal: α1 < p-value < α2
  • Absent: p-value ≥ α2
• Default: α1 = 0.04, α2 = 0.06, τ = 0.015
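The call procedure above can be sketched end to end: compute the discriminants, test them against τ with an exact one-sided signed-rank test, and threshold the p-value. The function names and the probe intensities are hypothetical, and the exact enumeration used here is one standard way to obtain the p-value, not necessarily what the Affymetrix software does internally.

```python
def discriminant(pm, mm):
    """R = (PM - MM) / (PM + MM) for one probe pair."""
    return (pm - mm) / (pm + mm)


def wilcoxon_one_sided_p(diffs):
    """Exact P(W+ >= observed) for the signed-rank statistic (H1: median > 0)."""
    d = [x for x in diffs if x != 0]  # zero differences carry no sign
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    rank2 = [0] * n  # doubled mid-ranks, so tied ranks stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            rank2[order[k]] = (i + 1) + (j + 1)
        i = j + 1
    w_plus = sum(r for r, x in zip(rank2, d) if x > 0)
    counts = {0: 1}  # doubled rank sum -> number of sign patterns producing it
    for r in rank2:
        nxt = {}
        for s, cnt in counts.items():
            nxt[s] = nxt.get(s, 0) + cnt
            nxt[s + r] = nxt.get(s + r, 0) + cnt
        counts = nxt
    return sum(cnt for s, cnt in counts.items() if s >= w_plus) / 2 ** n


def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    if p_value <= alpha1:
        return "Present"
    if p_value < alpha2:
        return "Marginal"
    return "Absent"


# Hypothetical probeset with five probe pairs.
pm = [1200.0, 950.0, 1430.0, 1100.0, 800.0]
mm = [300.0, 410.0, 520.0, 290.0, 350.0]
tau = 0.015
p = wilcoxon_one_sided_p([discriminant(a, b) - tau for a, b in zip(pm, mm)])
print(p, detection_call(p))
```

With only five probe pairs the smallest attainable p-value is 1/32, which illustrates why the test needs a reasonable number of probe pairs per probeset to be useful.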
![Page 47: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/47.jpg)
M*A Plot of MAS5
![Page 48: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/48.jpg)
MAS5: Summary
• Good:
  • Usable with single chips (though replicates are preferable)
  • Gives a p-value for expression data
• Bad:
  • Lots of fudge factors in the algorithm
  • M*A plots can be better
• Misc:
  • Commonly used processing method for Affy chips
  • Highly dependent on mismatch probes
![Page 49: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/49.jpg)
Methods compared on expression variance
[Figure: standard deviation of gene measures from 20 replicate arrays, plotted against expression level. RMA: blue and red; MAS5: green; dChip: black.]
From Terry Speed
![Page 50: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/50.jpg)
LOWESS (Locally Weighted Scatterplot Smoothing)
• Local linear regression model
• Tri-cube weight function
• Least squares
Estimated values of log2(Cy5/Cy3) as a function of log10(Cy3*Cy5)
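The three ingredients listed above (local linear model, tri-cube weights, least squares) fit in a short pure-Python sketch. It omits the robustness iterations of full LOWESS; the window fraction `frac`, the function names, and the intensities are assumptions for illustration.

```python
import math


def lowess(x, y, frac=0.5):
    """Local linear fit with tri-cube weights, evaluated at every x."""
    n = len(x)
    k = min(n, max(2, round(frac * n)))  # points per local window
    fitted = []
    for x0 in x:
        # Bandwidth: distance to the k-th nearest point.
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
        w = [(1.0 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        # Weighted least-squares line through the window.
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def lowess_normalize(cy3, cy5, frac=0.5):
    """Subtract the fitted log2(Cy5/Cy3) trend, estimated against log10(Cy3*Cy5)."""
    a = [math.log10(g * r) for g, r in zip(cy3, cy5)]
    m = [math.log2(r / g) for g, r in zip(cy3, cy5)]
    trend = lowess(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]


# Hypothetical spot intensities for eight spots.
cy3 = [120.0, 250.0, 400.0, 900.0, 1500.0, 3000.0, 7000.0, 15000.0]
cy5 = [90.0, 210.0, 380.0, 950.0, 1700.0, 3600.0, 9000.0, 21000.0]
print([round(v, 3) for v in lowess_normalize(cy3, cy5, frac=0.6)])
```

Unlike global normalization, the correction subtracted here changes with intensity, so a curved trend in the M vs A plot can be flattened.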
![Page 51: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/51.jpg)
Location Normalisation
LOWESS = Locally WEighted Regression and Smoothing Scatterplots
[Figure: normalised intensities]
Source: G Rosa 2003.
![Page 52: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/52.jpg)
4. Logit normalization (Lemon et al. 2003)
$\operatorname{logit}(y) = \log\left(\frac{y - \text{background}}{\text{saturation} - y}\right)$
• Background is calculated as the minimum intensity measured on the chip minus 0.1% of the signal intensity range: min - 0.001*(max - min). The result is Z-transformed to mean zero and sd one.
• Performance is surprisingly good.
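A sketch of the logit transform followed by the Z-transform. The slide defines only the background; the saturation is assumed here to be defined symmetrically as max + 0.001*(max - min), and the population standard deviation is used for the Z-transform. The function name and intensities are hypothetical.

```python
import math
from statistics import mean, pstdev


def logit_normalize(intensities):
    """Logit-transform one chip, then Z-transform to mean 0 and sd 1."""
    lo, hi = min(intensities), max(intensities)
    margin = 0.001 * (hi - lo)
    background = lo - margin
    saturation = hi + margin  # assumption: mirrors the background definition
    logits = [math.log((y - background) / (saturation - y)) for y in intensities]
    mu, sd = mean(logits), pstdev(logits)
    return [(v - mu) / sd for v in logits]


# Hypothetical chip intensities.
chip = [50.0, 120.0, 800.0, 2500.0, 10000.0, 40000.0]
print([round(z, 3) for z in logit_normalize(chip)])
```

Because of the final Z-transform, the base of the logarithm does not affect the result.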
![Page 53: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/53.jpg)
5. Spike-in controls
• If none of the assumptions seems applicable to the experiment, add a spiked control to the mRNA preparation.
• The idea is to measure the amount of mRNA or total RNA extracted from the cell, and then add a known transcript of known concentration to the pool.
• This spiked transcript is assumed to be amplified and labeled the same way as the other transcripts and is detected with a unique probe on the array.
• The spiked transcript must not match any gene in the RNA preparation. (For a human preparation, use E. coli.)
• After scanning, you multiply all measurements on one array by a factor chosen so that the spiked control matches that on the other array.
![Page 54: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/54.jpg)
Spike-in controls
• This approach has the limitation that it results in a linear normalization that does not correct signal-dependent bias (unless you use many spiked control genes with different concentrations).
• Finally, it is only as accurate as the accuracy of measuring the total amount of RNA and accuracy of adding an exact amount of spiked transcript.
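With a single spiked control, the normalization reduces to one multiplicative factor, as in this sketch. The function names and numbers are hypothetical.

```python
def spike_in_factor(spike_on_array, spike_on_reference):
    """Factor that makes the spiked control on this array match the other array."""
    return spike_on_reference / spike_on_array


def scale_array(intensities, factor):
    return [factor * x for x in intensities]


# Hypothetical: the spike reads 500 on this array and 750 on the other array.
k = spike_in_factor(500.0, 750.0)
print(k, scale_array([100.0, 250.0, 900.0], k))
```

The same factor is applied at every intensity, which is exactly why a single spike cannot correct a signal-dependent bias.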
![Page 55: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/55.jpg)
Summary for normalization
• Systematic biases exist in microarray data
• Normalization can remove (reduce) these biases
• There are many different ways to normalize data
  • Global median, dChip, MAS 5.0, RMA, LOWESS, etc.
• Each normalization method makes different assumptions regarding the data.
• BUT: don’t expect it to fix bad data!
  • Won’t make up for lack of replicates
  • Won’t make up for horrible slides
![Page 56: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/56.jpg)
The missing values problem
• Microarrays can have systematic or random missing values due to experimental artifacts/problems.
• Some algorithms aren't robust to missing values.
• How do we "guess" at these values?
![Page 57: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/57.jpg)
Why Estimate Missing Values?
[Figure: three panels: the complete data set; the data set with 30% of entries missing (missing values appear black); the data set with missing values estimated by the KNNimpute algorithm.]
![Page 58: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/58.jpg)
KNNimpute Algorithm
• Idea: use genes with similar expression profiles to estimate missing values.
• Replaces NaNs in the data with a weighted mean of the k nearest-neighbor columns. The weights are inversely proportional to the distances from the neighboring columns.

Before imputation (the value in column j is missing):
Gene X | 2 | 4 | 5 | 7 | 3 | 2
Gene B | 2 |   | 5 | 7 | 3 | 1
Gene C | 3 | 5 | 6 | 7 | 3 | 2

After imputation:
Gene X | 2 | 4   | 5 | 7 | 3 | 2
Gene B | 2 | 4.3 | 5 | 7 | 3 | 1
Gene C | 3 | 5   | 6 | 7 | 3 | 2
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB. Missing value estimation methods for DNA microarrays. Bioinformatics. 2001 Jun;17(6):520-5.
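A minimal sketch of the weighted nearest-neighbor estimate, applied to the three-gene example from the slide. It makes several assumptions: genes are stored as rows here (the slide's wording speaks of columns), distance is Euclidean over the columns observed in the target gene, neighbor genes are complete, and the function name `knn_impute_row` is hypothetical. It is not the published KNNimpute implementation.

```python
import math


def knn_impute_row(target, neighbors, k=2):
    """Fill None entries of `target` from its k nearest neighbor rows.

    Distance is Euclidean over the columns observed in `target`;
    each neighbor is weighted by 1 / distance.
    """
    observed = [j for j, v in enumerate(target) if v is not None]
    scored = []
    for row in neighbors:
        dist = math.sqrt(sum((target[j] - row[j]) ** 2 for j in observed))
        scored.append((dist, row))
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]

    # Small floor on the distance avoids division by zero for identical profiles.
    weights = [1.0 / max(dist, 1e-9) for dist, _ in nearest]
    filled = list(target)
    for j, v in enumerate(target):
        if v is None:
            total = sum(w * row[j] for w, (_, row) in zip(weights, nearest))
            filled[j] = total / sum(weights)
    return filled


gene_x = [2, 4, 5, 7, 3, 2]
gene_b = [2, None, 5, 7, 3, 1]   # missing value in the second column
gene_c = [3, 5, 6, 7, 3, 2]

imputed = knn_impute_row(gene_b, [gene_x, gene_c], k=2)
print(imputed)
```

With these choices the estimate comes out at about 4.37, close to the 4.3 shown on the slide; the nearer gene (Gene X, value 4) pulls the estimate more strongly than Gene C (value 5).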
![Page 59: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/59.jpg)
References
• Li and Wong (2001). Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biology 2:1–11.
• Irizarry, Bolstad, Collin, Cope, Hobbs and Speed (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 31(4):e15.
• Affymetrix (2001). Affymetrix Microarray Suite User Guide, version 5 edition. Affymetrix, Santa Clara, CA.
• Gautier, Cope, Bolstad and Irizarry (2003). affy - an R package for the analysis of Affymetrix GeneChip data at the probe level. Bioinformatics.
![Page 60: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/60.jpg)
2. Identification of Differential Genes
![Page 61: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/61.jpg)
Identification of differential genes
• The most basic experimental design: a comparison between 2 conditions, 'treatment' vs. control.
• The goal: to identify genes that are differentially expressed in the examined conditions.
• The number of replicates is usually low (n = 2-4).
![Page 62: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/62.jpg)
1. Fold Change
• Consider genes whose mean expression level changed by at least 1.75- to 2-fold as differential genes.
• Limits:
  • Usually no estimate of the false positive rate is provided.
  • Biased toward genes with low expression levels.
  • Ignores the variability of gene levels over replicates.
![Page 63: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/63.jpg)
Fold Change limit: ignores variability over replicates

(C1-C3: control replicates; t1-t3: treatment replicates)
gene | C1 | C2  | C3  | mean_c | t1  | t2  | t3  | mean_t
g1   | 90 | 100 | 110 | 100    | 190 | 200 | 210 | 200
g2   | 50 | 100 | 150 | 100    | 100 | 150 | 350 | 200

• Seek a score that 'punishes' genes with high variability over replicates.
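To make the limitation concrete, here is a small sketch that applies a fold-change rule to the two genes in the table. The 2-fold threshold is an assumption taken from the upper end of the range mentioned earlier, and the symmetric check for down-regulation is added for illustration.

```python
from statistics import mean

# Replicate values from the slide's table.
control = {"g1": [90, 100, 110], "g2": [50, 100, 150]}
treatment = {"g1": [190, 200, 210], "g2": [100, 150, 350]}


def fold_change(control_values, treatment_values):
    """Ratio of mean treatment level to mean control level."""
    return mean(treatment_values) / mean(control_values)


fold = {gene: fold_change(control[gene], treatment[gene]) for gene in control}

# Call a gene differential if it changes at least 2-fold in either direction.
differential = [g for g, fc in fold.items() if fc >= 2 or fc <= 0.5]

print(fold)
print(differential)
```

Both genes get the same fold change and both are called differential, even though the replicates of g2 are far noisier than those of g1.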
![Page 64: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/64.jpg)
2. T-test
• Compute a t-score for each gene:
$$t = \frac{m_t - m_c}{\sqrt{S_c^2/n_c + S_t^2/n_t}}$$
$m_c$, $m_t$: mean levels in Control and Treatment
$S_c^2$, $S_t^2$: variance estimates in Control and Treatment
$n_c$, $n_t$: number of replicates in Control and Treatment

(C1-C3: control replicates; t1-t3: treatment replicates)
gene | C1 | C2  | C3  | mean_c | t1  | t2  | t3  | mean_t | t
g1   | 90 | 100 | 110 | 100    | 190 | 200 | 210 | 200    | 12
g2   | 50 | 100 | 150 | 100    | 100 | 150 | 350 | 200    | 1.3
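A short sketch of the per-gene t-score, assuming the usual two-sample form t = (m_t - m_c) / sqrt(S_c^2/n_c + S_t^2/n_t) with sample variances (divisor n - 1). The function name `t_score` is illustrative.

```python
import math
from statistics import mean, variance


def t_score(control_values, treatment_values):
    """Two-sample t-score using per-group sample variances."""
    se = math.sqrt(
        variance(control_values) / len(control_values)
        + variance(treatment_values) / len(treatment_values)
    )
    return (mean(treatment_values) - mean(control_values)) / se


t_g1 = t_score([90, 100, 110], [190, 200, 210])
t_g2 = t_score([50, 100, 150], [100, 150, 350])
print(round(t_g1, 2), round(t_g2, 2))
```

Under these assumptions g1 scores about 12.2 and g2 about 1.2, in line with the table's 12 and 1.3 up to rounding: the same 2-fold change is now scored very differently once replicate variability is taken into account.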
![Page 65: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/65.jpg)
T-test
• t-scores can be associated with a p-value (under the assumption that expression levels follow a normal distribution).
  • Log-transformation
• Set a cut-off for the p-value (α = 0.01).
• Consider all genes with p-value < α as differential genes.

gene | C1 | C2  | C3  | mean_c | t1  | t2  | t3  | mean_t | t    | p-val
g1   | 90 | 100 | 110 | 100    | 190 | 200 | 210 | 200    | 12.2 | 0.0001
g2   | 50 | 100 | 150 | 100    | 100 | 150 | 350 | 200    | 1.3  | 0.14
![Page 66: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/66.jpg)
Multiple Testing
• The p-value P-val_g associated with the t-score T_g is the probability of obtaining by chance a t-score that is at least as extreme as T_g.
• Multiplicity problem: thousands of genes are tested simultaneously.
• E.g., suppose:
  • There are 10,000 genes on a chip.
  • If your p-value cut-off is 0.05, then you expect 5% type I errors.
  • 10,000 × 0.05 = 500 genes are expected to be called DE just by chance.
  • This is not acceptable!
![Page 67: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/67.jpg)
Multiple testing
• Need to adjust for multiple testing when assessing the statistical significance of findings.
• Corrections:
![Page 68: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/68.jpg)
Bonferroni
• The p-value of each gene is multiplied by the number of genes in the gene list. If the corrected p-value is still below the error rate, the gene is significant:
• Corrected p-value = p-value × n (number of genes in the test) < 0.05
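The rule can be sketched as follows. The p-values are invented for illustration, and capping the corrected value at 1 is a common convention that the slide does not mention.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (corrected p-values, significance flags) in input order."""
    n = len(p_values)
    corrected = [min(1.0, p * n) for p in p_values]   # cap at 1 (assumed convention)
    significant = [p_corr < alpha for p_corr in corrected]
    return corrected, significant


# Made-up p-values for five genes.
p_values = [0.00001, 0.004, 0.02, 0.03, 0.5]
corrected, significant = bonferroni(p_values)
print(corrected)
print(significant)
```

Only the two smallest p-values survive here, although four of the five raw p-values are below 0.05.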
![Page 69: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/69.jpg)
Holm's correction (Bonferroni Step-down)
• This correction is very similar to the Bonferroni, but a little less stringent:
• 1) The p-values of the genes are ranked from the smallest to the largest.
• 2) The first p-value is multiplied by the number of genes present in the gene list; if the result is less than 0.05, the gene is significant: Corrected p-value = p-value × n < 0.05
• 3) The second p-value is multiplied by the number of genes less 1: Corrected p-value = p-value × (n − 1) < 0.05
• 4) Continue in the same way (n − 2, n − 3, ...) until a corrected p-value is no longer below 0.05; that gene and all genes ranked after it are not significant.
![Page 70: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/70.jpg)
Holm's correction
Example: let n = 1000, error rate = 0.05
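A sketch of the step-down procedure for n = 1000 and an error rate of 0.05. The slide's own example table did not survive extraction, so the p-values below are invented; they are chosen so that one gene passes Holm's correction but fails plain Bonferroni.

```python
def holm(p_values, alpha=0.05):
    """Holm step-down: return significance flags in input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    significant = [False] * n
    for rank, idx in enumerate(order):          # rank 0 = smallest p-value
        if p_values[idx] * (n - rank) < alpha:
            significant[idx] = True
        else:
            break                               # stop at the first failure
    return significant


# Two small made-up p-values plus 998 clearly non-significant ones.
p_values = [0.00001, 0.00005005] + [0.5] * 998
n = len(p_values)

holm_flags = holm(p_values)
bonferroni_flags = [p * n < 0.05 for p in p_values]

print(sum(holm_flags), sum(bonferroni_flags))
```

The second gene has p × 1000 just above 0.05 but p × 999 just below it, which is exactly the margin by which Holm is less stringent than Bonferroni.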
![Page 71: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/71.jpg)
Westfall and Young Permutation (minP)
• The Westfall and Young permutation follows a step-down procedure similar to the Holm method, combined with a bootstrapping method to compute the p-value distribution:
• 1) P-values are calculated for each gene based on the original data set and ranked.
• 2) The permutation method creates a pseudo-data set by dividing the data into artificial treatment and control groups.
• 3) P-values for all genes are computed on the pseudo-data set.
• 4) The successive minima of the new p-values are retained and compared to the original ones.
• 5) This process is repeated a large number of times, and the proportion of resampled data sets where the minimum pseudo-p-value is less than the original p-value is the adjusted p-value.
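The sketch below illustrates the resampling idea with a simplified relative of the procedure: the single-step maxT variant, which compares each gene's observed |t| with the maximum |t| over all genes in every relabeled data set. It is not the step-down minP procedure itself. The data are the two genes used in the fold-change example, and all 20 ways of labeling 3 of the 6 samples as treatment are enumerated instead of sampling random permutations.

```python
import math
from itertools import combinations
from statistics import mean, variance


def abs_t(values, treat_idx):
    """|t| for one gene, given the sample indices labeled as treatment."""
    treat = [values[i] for i in treat_idx]
    ctrl = [values[i] for i in range(len(values)) if i not in treat_idx]
    se = math.sqrt(variance(ctrl) / len(ctrl) + variance(treat) / len(treat))
    return abs(mean(treat) - mean(ctrl)) / se


# Samples 0-2 are control, samples 3-5 are treatment.
data = {
    "g1": [90, 100, 110, 190, 200, 210],
    "g2": [50, 100, 150, 100, 150, 350],
}
observed_labels = (3, 4, 5)
observed = {gene: abs_t(values, observed_labels) for gene, values in data.items()}

# Every artificial treatment/control split (includes the observed one).
relabelings = list(combinations(range(6), 3))

# Largest |t| over all genes for each relabeling.
max_stats = [max(abs_t(v, labels) for v in data.values()) for labels in relabelings]

# Adjusted p-value: fraction of relabelings whose maximum reaches the observed |t|.
adjusted = {
    gene: sum(m >= obs - 1e-12 for m in max_stats) / len(max_stats)
    for gene, obs in observed.items()
}
print(adjusted)
```

With only six samples the smallest attainable adjusted p-value is 2/20, which shows why permutation-based corrections need enough replicates to be useful.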
![Page 72: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/72.jpg)
Benjamini, Hochberg (1995)
• Controlling the FDR at a specified level:
• After ranking the genes according to their significance (p-values) and starting at the top of the list, you accept all genes where:
• P <= (i/m) × q
• (where i is the rank of the gene in the list, i.e. the number of genes accepted so far including the current one, m is the total number of genes tested, and q is the desired FDR).
• For i > 1 this correction is less strict than Bonferroni.
![Page 73: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/73.jpg)
Benjamini Hochberg
• Let n = 1000, error rate = 0.05
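A sketch for n = 1000 and a desired FDR of 0.05, using the standard step-up form of the procedure: find the largest rank i with P <= (i/m) × q and accept every gene up to that rank. The p-values are invented, since the slide's example table is not available in the transcript.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up: return acceptance flags in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest 1-based rank whose p-value is under its threshold (i/m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank

    accepted = [False] * m
    for idx in order[:cutoff_rank]:
        accepted[idx] = True
    return accepted


# Four small made-up p-values plus 996 clearly non-significant ones.
p_values = [0.00001, 0.00008, 0.00012, 0.0003] + [0.5] * 996
accepted = benjamini_hochberg(p_values)
bonferroni_flags = [p * len(p_values) < 0.05 for p in p_values]

print(sum(accepted), sum(bonferroni_flags))
```

Three genes are accepted here, whereas a Bonferroni cut-off of 0.05/1000 would keep only the first.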
![Page 74: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/74.jpg)
Permutation tests
• FDR can also be assessed by permutation.
• If you permute the measurements from the sample and control and repeat the t-test for all genes, you get an estimate of the number of type I errors that can be expected at a chosen significance cutoff.
• Dividing this number by the number of genes that pass the t-test on the unpermuted data gives the FDR (the approach of SAM).
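A toy sketch of this estimate. Several things are assumptions: a fixed |t| cutoff of 5 stands in for "passing the t-test", the data are the two example genes from the fold-change table, and all 20 relabelings of the six samples (including the original labeling) are enumerated rather than sampled.

```python
import math
from itertools import combinations
from statistics import mean, variance


def abs_t(values, treat_idx):
    """|t| for one gene, given the sample indices labeled as treatment."""
    treat = [values[i] for i in treat_idx]
    ctrl = [values[i] for i in range(len(values)) if i not in treat_idx]
    se = math.sqrt(variance(ctrl) / len(ctrl) + variance(treat) / len(treat))
    return abs(mean(treat) - mean(ctrl)) / se


data = {
    "g1": [90, 100, 110, 190, 200, 210],
    "g2": [50, 100, 150, 100, 150, 350],
}
observed_labels = (3, 4, 5)   # samples 3-5 are the real treatment group
cutoff = 5.0                  # hypothetical significance cutoff on |t|


def count_passing(labels):
    """Number of genes whose |t| reaches the cutoff under this labeling."""
    return sum(abs_t(values, labels) >= cutoff for values in data.values())


called = count_passing(observed_labels)

# Average number of genes passing under relabeled data = expected type I errors.
relabelings = list(combinations(range(6), 3))
expected_false = mean([count_passing(labels) for labels in relabelings])

fdr = min(1.0, expected_false / called) if called else 0.0
print(called, expected_false, fdr)
```

One gene is called on the real labels, and on average 0.1 genes pass under relabeling, so the estimated FDR at this cutoff is 0.1.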
![Page 75: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/75.jpg)
Differential Genes - Tools
• Cyber-T
• SAM (Significance Analysis of Microarrays) (Tusher et al. 2001).
![Page 76: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/76.jpg)
Testing for Differential Expression:
Significance Analysis of Microarrays (SAM)
• Two-group comparison: t-test with shrinkage
$$d = \frac{\bar{Y}_1 - \bar{Y}_2}{s + s_0}$$
• $s_0$ shrinks t
• H0
• Empirical Bayes like
• Multiple testing
• Smooth distribution of test statistic
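A sketch of the shrunken statistic d = (Ȳ1 − Ȳ2) / (s + s0). The slide does not spell out s, so the pooled standard error of the two-sample t-test is assumed here; the s0 value and the low-expression gene are invented for illustration.

```python
import math
from statistics import mean


def sam_d(y1, y2, s0):
    """Shrunken t-like statistic: difference in means over (s + s0).

    s is the pooled standard error of the difference (an assumption;
    the slide only names it s).
    """
    n1, n2 = len(y1), len(y2)
    m1, m2 = mean(y1), mean(y2)
    ss = sum((x - m1) ** 2 for x in y1) + sum((x - m2) ** 2 for x in y2)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)


high_treat, high_ctrl = [190, 200, 210], [90, 100, 110]   # g1 from the earlier table
low_treat, low_ctrl = [1.3, 1.4, 1.2], [1.0, 1.1, 0.9]    # made-up low-expression gene

for s0 in (0.0, 5.0):
    print(s0, sam_d(high_treat, high_ctrl, s0), sam_d(low_treat, low_ctrl, s0))
```

With s0 = 0 the statistic is the ordinary t-score. Adding a positive s0 barely dents the strongly expressed gene but collapses the score of the low-expression gene, whose large t came mostly from a tiny s.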
![Page 77: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/77.jpg)
SAM: Computation of s0
1. $s_\alpha$ = α percentile of the $s_i$ (s for the ith gene)
2. Compute the 100 quantiles of the $s_i$ values: $q_1 < q_2 < \dots < q_{100}$
3. Let α be a value in (0, .05, .10, ..., 1.0), and let
$$d_i^{\alpha} = \frac{\bar{Y}_1 - \bar{Y}_2}{s_i + s_\alpha}$$
   a) Compute the median absolute deviation from the median, divided by .64, of the $d_i$'s whose $s_i$ fall between each pair of adjacent quantiles → $v_j$
   b) Compute cv(α) = coefficient of variation of the $v_j$
4. Choose the α with the smallest cv → $s_0 = s_\alpha$
![Page 78: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/78.jpg)
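SAM: Multiclass Comparison
$$d_i = \left\{ \frac{\sum_k n_k}{\prod_k n_k} \sum_{k=1}^{K} n_k \left( \bar{x}_{ik} - \bar{x}_i \right)^2 \right\}^{1/2}$$
$$s_i = \left\{ \frac{1}{\sum_k (n_k - 1)} \left( \sum_k \frac{1}{n_k} \right) \sum_{k=1}^{K} \sum_{j \in C_k} \left( x_{ij} - \bar{x}_{ik} \right)^2 \right\}^{1/2}$$
• Shrunken square root of F from one-way ANOVA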
![Page 79: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/79.jpg)
SAM: Testing for Differential Expression
• Permute class labels with balance
• Rank all observed $d_i$'s
• Rank all $d_i$'s for each permutation separately
• Average permuted d's at given quantile
• Compare observed quantile d to average of permutations

Obs  | P1   | P2   | … | Perm. Avg.
2.23 | 1.85 | 1.92 | … | 1.94
2.11 | 1.78 | 1.90 | … | 1.93
1.91 | 1.74 | 1.83 | … | 1.76
…    | …    | …    | … | …
![Page 80: Computational Genomics 10810 Guy Zinmanepxing/Class/10810/lecture/recitation7.pdf · cDNAand labelledwith its own fluorescent dye (usually a red fluorescent dye, Cyanine 5 (Cy5) and](https://reader034.vdocument.in/reader034/viewer/2022043004/5f894eae65ddbb4beb061c4a/html5/thumbnails/80.jpg)
SAM: Differential Expression & FDR
• Choose Δ → "significant" genes
  • Asymmetric cutoff for + & − d's
• Find the smallest observed +d and the least negative −d among the significant genes (the critical values)
• Each permutation → count the number of +d's ≥ critical obs +d and the number of −d's ≤ critical obs −d
• FDR = π0 × (average # of false discoveries) / (# of discoveries)
  • π0 = prior prob. of not differentially expressed = 1 − proportion differentially expressed