statistical analysis of microarray data
DESCRIPTION
Statistical Analysis of Microarray Data. By H. Bjørn Nielsen & Hanne Jarmer. Question Experimental Design. The DNA Array Analysis Pipeline. Array design Probe design. Sample Preparation Hybridization. Buy Chip/Array. Image analysis. Normalization. Expression Index Calculation.
![Page 1: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/1.jpg)
Statistical Analysis of
Microarray Data
By H. Bjørn Nielsen & Hanne Jarmer
![Page 2: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/2.jpg)
The DNA Array Analysis Pipeline
Question / Experimental Design
Array design / Probe design
Buy Chip/Array
Sample Preparation / Hybridization
Image analysis
Normalization
Expression Index Calculation
Comparable Gene Expression Data
Statistical Analysis / Fit to Model (time series)
Advanced Data Analysis: Clustering, PCA, Classification, Promoter Analysis, Meta analysis, Survival analysis, Regulatory Network
![Page 3: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/3.jpg)
What's the question?
Typically we want to identify differentially expressed genes
Example: alcohol dehydrogenase is expressed at a higher level when alcohol is added to the media
[Figure: alcohol dehydrogenase expression without alcohol and with alcohol]
![Page 4: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/4.jpg)
He’s going to say it
However, the measurements contain stochastic noise
There is no way around it
Statistics
![Page 5: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/5.jpg)
Noisy measurements → statistics → p-value
You can choose to think of statistics as a black box
But, you still need to understand how to interpret the results
![Page 6: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/6.jpg)
The output of the statistics
P-value: the probability, if the null hypothesis is true, of obtaining a result at least as extreme as the one observed
For gene expression analysis we can say: the probability that a gene with no real change in expression would look at least this different by coincidence
![Page 7: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/7.jpg)
The statistics gives us a p-value for each gene
We can rank the genes according to the p-value
But, we can’t really trust the p-value in a strict statistical way!
![Page 8: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/8.jpg)
Why not?
For two reasons:
1. We rarely fulfil all the assumptions of the statistical test
2. We have to take multi-testing into account
![Page 9: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/9.jpg)
The t-test Assumptions
1. The observations in the two categories must be independent
2. The observations should be normally distributed
3. The sample size must be ‘large’ (>30 replicates)
![Page 10: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/10.jpg)
Multi-testing?
In a typical microarray analysis we test thousands of genes
If we use a significance level of 0.05 and test 1000 genes, we expect 50 genes to be significant by chance, even if no gene is truly differentially expressed
1000 x 0.05 = 50
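The arithmetic can be checked with a small simulation. This is a minimal sketch that assumes no gene is truly differentially expressed, so that every p-value is uniformly distributed between 0 and 1; the function name `count_false_positives` and the seed are illustrative choices, not part of the slides.

```python
import random


def count_false_positives(n_genes, alpha, seed=0):
    """Count genes called significant when no gene is truly changed.

    Assumption: under the null hypothesis a p-value is uniform on [0, 1],
    so each gene passes the threshold with probability `alpha`.
    """
    rng = random.Random(seed)
    pvalues = [rng.random() for _ in range(n_genes)]
    return sum(p <= alpha for p in pvalues)


n_genes, alpha = 1000, 0.05
expected = n_genes * alpha  # 1000 x 0.05 = 50
observed = count_false_positives(n_genes, alpha)
print(f"expected {expected:.0f} false positives, simulated {observed}")
```

The simulated count varies from run to run (and with the seed), but it stays close to the expected 50.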
![Page 11: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/11.jpg)
Correction for multiple testing
Confidence level of 99%
Bonferroni: $P \le \frac{0.01}{N}$
Benjamini-Hochberg: $P \le \frac{i}{N} \cdot 0.01$
N = number of genes
i = number of accepted genes
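A minimal sketch of the two corrections, Bonferroni (P ≤ 0.01/N) and Benjamini-Hochberg (P ≤ (i/N) · 0.01), written with the Python standard library only. The p-values are made up for illustration, and the function names are not taken from any particular package.

```python
def bonferroni(pvalues, alpha=0.01):
    """Return the indices of genes with P <= alpha / N."""
    n = len(pvalues)
    return [g for g, p in enumerate(pvalues) if p <= alpha / n]


def benjamini_hochberg(pvalues, alpha=0.01):
    """Return the indices of genes accepted by the step-up procedure.

    Sort the p-values, find the largest rank i with P(i) <= (i / N) * alpha,
    and accept the i genes with the smallest p-values.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda g: pvalues[g])
    n_accepted = 0
    for rank, g in enumerate(order, start=1):
        if pvalues[g] <= rank / n * alpha:
            n_accepted = rank
    return sorted(order[:n_accepted])


# Hypothetical p-values for N = 10 genes.
pvalues = [0.0001, 0.0015, 0.0028, 0.02, 0.3, 0.0039, 0.6, 0.9, 0.45, 0.07]
print("Bonferroni:", bonferroni(pvalues))
print("Benjamini-Hochberg:", benjamini_hochberg(pvalues))
```

On this toy input Bonferroni accepts only the gene with the smallest p-value, while Benjamini-Hochberg accepts four genes, which illustrates that Bonferroni is the stricter of the two.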
![Page 12: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/12.jpg)
But really, those methods are too strict
What we can trust is the ability of the statistical test to rank the genes according to their reliability
If we permute the samples we can get an estimate of the False Discovery Rate (FDR) in this set
The number of genes that are needed or can be handled in downstream processes can be used to set the cutoff
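One way to read the permutation idea is sketched below. It assumes a two-group design, ranks genes by the absolute Welch t statistic, takes the cutoff from the number of genes wanted downstream (`n_top`), and then counts how many genes pass the same cutoff after the sample labels are shuffled. The data are simulated, and all names (`abs_t`, `permutation_fdr`, `n_top`) are illustrative assumptions rather than the authors' procedure.

```python
import random


def abs_t(values, labels):
    """Absolute Welch t statistic for one gene; labels are 0 or 1."""
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return abs(mean_a - mean_b) / (var_a / len(a) + var_b / len(b)) ** 0.5


def permutation_fdr(data, labels, n_top, n_perm=200, seed=0):
    """Estimate the FDR among the n_top best-ranked genes."""
    rng = random.Random(seed)
    observed = sorted((abs_t(gene, labels) for gene in data), reverse=True)
    cutoff = observed[n_top - 1]  # statistic of the last accepted gene
    false_calls = 0
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between sample and group
        false_calls += sum(abs_t(gene, shuffled) >= cutoff for gene in data)
    # Average number of genes passing by chance, relative to n_top.
    return min(1.0, false_calls / n_perm / n_top)


# Simulated data: 200 genes, 4 + 4 samples, the first 10 genes truly shifted.
rng = random.Random(1)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
data = []
for gene in range(200):
    shift = 4.0 if gene < 10 else 0.0
    data.append([rng.gauss(0, 1) + shift * lab for lab in labels])

fdr = permutation_fdr(data, labels, n_top=10)
print(f"estimated FDR among the top 10 genes: {fdr:.2f}")
```

Raising `n_top` moves the cutoff further down the ranked list, so the estimated FDR generally grows with the number of genes accepted.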
![Page 13: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/13.jpg)
Volcano Plot
[Figure: P-value plotted against log2 fold change (M)]
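The coordinates of one point in such a plot can be computed as sketched below. Two things are assumptions here: that the mean intensities are on the raw (not log) scale, and that the p-value axis is drawn as -log10(p), which is a common convention but is not stated on the slide. The gene names and numbers are hypothetical.

```python
import math


def volcano_point(mean_a, mean_b, pvalue):
    """Return (M, -log10 p) for one gene.

    mean_a, mean_b: mean intensities in the two conditions (raw scale).
    M is the log2 fold change of condition b relative to condition a.
    """
    m = math.log2(mean_b / mean_a)
    return m, -math.log10(pvalue)


# Hypothetical genes: (mean without alcohol, mean with alcohol, p-value).
genes = {"gene1": (100.0, 800.0, 0.0001), "gene2": (250.0, 240.0, 0.6)}
for name, (a, b, p) in genes.items():
    m, y = volcano_point(a, b, p)
    print(f"{name}: M = {m:+.2f}, -log10(p) = {y:.2f}")
```

Genes that are both strongly changed and reliably measured end up in the upper left and upper right corners of the plot.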
![Page 14: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/14.jpg)
What's inside the black box ‘statistics’
t-test or ANOVA
![Page 15: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/15.jpg)
The t-test
Calculate T
Lookup T in a table
![Page 16: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/16.jpg)
The t-test II
The t-test tests for a difference in means (μ)
[Figure: density of the intensity of gene x for wt and mutant, with the means μ_wt and μ_mut marked]
![Page 17: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/17.jpg)
The t-test III
The t statistic is based on the sample mean and variance
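A minimal sketch of the "calculate T" step. The slide text does not say which variant of the t-test is meant, so this assumes Welch's unequal-variance form, t = (mean_mut − mean_wt) / sqrt(s²_wt/n_wt + s²_mut/n_mut). The intensities are made-up log2 values.

```python
import math


def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def t_statistic(wt, mut):
    """Welch t statistic and its approximate degrees of freedom."""
    m1, v1 = mean_and_variance(wt)
    m2, v2 = mean_and_variance(mut)
    n1, n2 = len(wt), len(mut)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the difference
    t = (m2 - m1) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Hypothetical log2 intensities of gene x.
wt = [7.1, 6.9, 7.3, 7.0]
mut = [8.2, 8.0, 8.5, 8.1]
t, df = t_statistic(wt, mut)
print(f"t = {t:.2f} with about {df:.1f} degrees of freedom")
```

The pair (t, df) is what gets looked up in a t table to obtain the p-value.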
![Page 18: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/18.jpg)
ANOVA
ANalysis Of Variance
Very similar to the t-test, but can test multiple categories
Ex: Is gene x differentially expressed between wt, mutant 1 and mutant 2?
Advantage: it uses all categories at once, which gives it more ‘power’ than pairwise t-tests
![Page 19: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/19.jpg)
ANOVA II
[Figure: density of intensity per group, contrasting the variance between groups with the variance within groups]
![Page 20: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/20.jpg)
Blocks and paired tests
Some undesired factors may influence the experiments. Their effect can be greatly reduced if they are blocked out or if the experiment is paired.
Some possible blocks:
- Dye
- Patient
- Technician
- Batch
- Day of experiment
- Array
![Page 21: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/21.jpg)
Paired t-test
Hypothesis: There is no difference between the mean BLUE and RED
[Figure: paired BLUE and RED measurements in Block 1, Block 2 and Block 3]
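A minimal sketch of the paired t statistic: the test is run on the within-block differences, so variation between blocks drops out. The three blocks and their BLUE and RED values are made up, chosen so that the blocks differ a lot from each other while the RED minus BLUE difference is stable.

```python
import math


def paired_t(blue, red):
    """Paired t statistic on the per-block differences (red - blue)."""
    diffs = [r - b for b, r in zip(blue, red)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n), n - 1


# Hypothetical measurements, one BLUE and one RED value per block.
blue = [5.0, 7.0, 9.0]
red = [5.6, 7.5, 9.7]
t_paired, df_paired = paired_t(blue, red)
print(f"paired t = {t_paired:.1f} with {df_paired} degrees of freedom")
```

An unpaired test on the same numbers would give a much smaller t, because the large block-to-block spread would count as noise.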
![Page 22: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/22.jpg)
An example: (2-way ANOVA with blocks)
Experimental Design:
A187: Glucose, Ethanol
CreA: Glucose, Ethanol
Each combination 3x
Everything was done in batches to capture the systematic noise
![Page 23: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/23.jpg)
Example: Batch to batch variation
Within batch variation is lower than the between batch variation
![Page 24: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/24.jpg)
Example: Data analysis - Blocking
• We can capture the batch variation by blocking
Two-way ANOVA
[Figure: A187 and CreA grown on Glucose and Ethanol in batches A, B and C, with the two factors marked as Effect 1 and Effect 2]
![Page 25: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/25.jpg)
Ex: Result of the 2-way ANOVA (3 p-values)
Two-way ANOVA
[Figure: A187 and CreA grown on Glucose and Ethanol in batches A, B and C, with the media effect and the genotype effect marked]
A genotype p-value
A growth media p-value
An interaction p-value
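A sketch of how the three tests can be computed for one gene in a balanced design with batch as a block. It assumes one measurement per genotype, medium and batch, and returns the three F statistics; turning each F into a p-value needs the F distribution (a table, or for example `scipy.stats.f.sf`). The numbers are simulated with hand-picked effects and are not the authors' data.

```python
from itertools import product


def level_ss(data, key, grand_mean):
    """Sum of squares between the groups defined by `key`."""
    groups = {}
    for cell, y in data.items():
        groups.setdefault(key(cell), []).append(y)
    return sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values())


def two_way_anova_blocked(data):
    """data maps (genotype, medium, batch) -> expression value (balanced)."""
    grand_mean = sum(data.values()) / len(data)
    a = len({c[0] for c in data})  # number of genotypes
    b = len({c[1] for c in data})  # number of media
    r = len({c[2] for c in data})  # number of batches (blocks)
    ss_total = sum((y - grand_mean) ** 2 for y in data.values())
    ss_genotype = level_ss(data, lambda c: c[0], grand_mean)
    ss_media = level_ss(data, lambda c: c[1], grand_mean)
    ss_batch = level_ss(data, lambda c: c[2], grand_mean)
    ss_cells = level_ss(data, lambda c: c[:2], grand_mean)
    ss_interaction = ss_cells - ss_genotype - ss_media
    # Blocking: the batch variation is removed from the error term.
    ss_error = ss_total - ss_cells - ss_batch
    ms_error = ss_error / ((a * b - 1) * (r - 1))
    return {
        "genotype": ss_genotype / (a - 1) / ms_error,
        "media": ss_media / (b - 1) / ms_error,
        "interaction": ss_interaction / ((a - 1) * (b - 1)) / ms_error,
    }


# Simulated gene: hand-picked effects plus small fixed noise (hypothetical).
genotype_effect = {"A187": 0.0, "CreA": 0.5}
media_effect = {"Glucose": 0.0, "Ethanol": 2.0}
batch_effect = {"A": 0.0, "B": 0.8, "C": -0.5}
noise = [0.05, -0.03, 0.02, -0.06, 0.04, 0.01,
         -0.02, 0.03, -0.05, 0.06, -0.01, -0.04]
data = {}
for (g, m, bt), e in zip(product(genotype_effect, media_effect, batch_effect), noise):
    data[(g, m, bt)] = 8.0 + genotype_effect[g] + media_effect[m] + batch_effect[bt] + e

f_stats = two_way_anova_blocked(data)
for effect, f_value in f_stats.items():
    print(f"{effect}: F = {f_value:.1f}")
```

In this simulated gene the media and genotype effects give large F values, while the interaction, which was not simulated, gives a small one. Without the batch term the batch-to-batch variation would end up in the error and shrink all three F values.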
![Page 26: Statistical Analysis of Microarray Data](https://reader030.vdocument.in/reader030/viewer/2022013122/56815f6c550346895dce71df/html5/thumbnails/26.jpg)
Conclusion
• Array data contains stochastic noise
– Therefore statistics is needed to conclude on differential expression
• We can’t really trust the p-value
• But the statistics can rank genes
• The capacity/needs of downstream processes can be used to set cutoff
• FDR can be estimated
• t-test is used for two category tests
• ANOVA is used for multiple categories