Statistical Analysis of Microarray Data
By H. Bjørn Nielsen
The DNA Array Analysis Pipeline
Question, Experimental Design
Array design, Probe design (or: Buy Chip/Array)
Sample Preparation, Hybridization
Image analysis
Normalization
Expression Index Calculation
Comparable Gene Expression Data
Statistical Analysis, Fit to Model (time series)
Advanced Data Analysis: Clustering, PCA, Classification, Promoter Analysis, Meta analysis, Survival analysis, Regulatory Network
What's the question?
Typically we want to identify differentially expressed genes
[Figure: alcohol dehydrogenase expression, without alcohol vs. with alcohol]
Example: alcohol dehydrogenase is expressed at a higher level when alcohol is added to the media
He’s going to say it
However, the measurements contain stochastic noise
There is no way around it
Statistics
Noisy measurements → statistics → p-value
You can choose to think of statistics as a black box
But, you still need to understand how to interpret the results
P-value
The chance of rejecting the null hypothesis by coincidence (that is, when the null hypothesis is actually true)
For gene expression analysis we can say: the chance that a gene is categorized as differentially expressed by coincidence
The output of the statistics
The statistics gives us a p-value for each gene
We can rank the genes according to the p-value
But, we can’t really trust the p-value in a strict statistical way!
Why not?
For two reasons:
1. We rarely fulfil all the assumptions of the statistical test
2. We have to take multi-testing into account
The t-test Assumptions
1. The observations in the two categories must be independent
2. The observations should be normally distributed
3. The sample size must be ‘large’ (>30 replicates)
Multi-testing?
In a typical microarray analysis we test thousands of genes
If we use a significance level of 0.05 and test 1000 genes, we expect 50 genes to be significant by chance
1000 x 0.05 = 50
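The arithmetic can be checked with a small simulation. This is only a sketch: it assumes that no gene is truly differentially expressed and that p-values under the null hypothesis are uniformly distributed between 0 and 1. The function names and the random seed are illustrative and not part of the slides.

```python
import random

ALPHA = 0.05      # significance level used on the slide
N_GENES = 1000    # number of genes tested


def expected_false_positives(n_tests, alpha):
    """Expected number of tests called significant by chance alone."""
    return n_tests * alpha


def simulate_false_positives(n_tests, alpha, seed=0):
    """Count p-values below alpha when no gene is truly different.

    Assumption: under the null hypothesis p-values are uniform on [0, 1],
    so a uniform random number stands in for each gene's p-value.
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(n_tests) if rng.random() < alpha)


print(expected_false_positives(N_GENES, ALPHA))   # n_tests * alpha
print(simulate_false_positives(N_GENES, ALPHA))   # close to the expectation, varies with the seed
```

Changing the seed gives a different count each time, but it stays in the neighbourhood of the expected value.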
Volcano Plot
[Figure: volcano plot; x-axis: log2 fold change (M), y-axis: p-value]
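A volcano plot needs two numbers per gene: the log2 fold change (M) and the p-value. The sketch below computes both coordinates for a few made-up genes. It assumes the p-value is drawn as -log10(p), which is a common convention but is not stated on the slide, and that the means are positive intensities on the raw scale. The gene names and values are hypothetical.

```python
import math


def volcano_point(mean_a, mean_b, p_value):
    """Return (M, -log10 p) for one gene.

    M is the log2 fold change of condition b over condition a.
    Assumption: both means are positive, raw-scale (not log) intensities.
    """
    m = math.log2(mean_b / mean_a)
    return m, -math.log10(p_value)


# Hypothetical genes: (name, mean without alcohol, mean with alcohol, p-value)
genes = [
    ("gene_1", 100.0, 400.0, 0.001),
    ("gene_2", 200.0, 210.0, 0.60),
    ("gene_3", 800.0, 200.0, 0.01),
]

for name, mean_a, mean_b, p in genes:
    m, y = volcano_point(mean_a, mean_b, p)
    print(f"{name}: M = {m:+.2f}, -log10(p) = {y:.2f}")
```

Genes that end up far to the left or right and high up are the ones with both a large fold change and a small p-value.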
What's inside the black box ‘statistics’
t-test or ANOVA
The t-test
Calculate T
Lookup T in a table
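The two steps can be sketched in a few lines. This is a minimal illustration, not the method used in the course: it assumes the pooled-variance (Student) form of the two-sample t-test and a two-sided test at the 0.05 level. The table holds only a handful of rows from a standard t table, and the intensities are invented.

```python
import math
import statistics

# Two-sided critical values of Student's t at the 0.05 level, keyed by
# degrees of freedom. Only a few rows of a standard t table are included,
# so other degrees of freedom raise a KeyError.
T_CRITICAL_05 = {4: 2.776, 6: 2.447, 8: 2.306, 10: 2.228}


def t_statistic(x, y):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / df
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(x) - statistics.mean(y)) / se, df


def is_significant(x, y):
    """Step 1: calculate T. Step 2: look it up in the table."""
    t, df = t_statistic(x, y)
    return abs(t) > T_CRITICAL_05[df]


# Hypothetical log2 intensities of one gene, four replicates per condition.
without_alcohol = [7.1, 6.9, 7.3, 7.0]
with_alcohol = [8.4, 8.8, 8.1, 8.6]

print(t_statistic(without_alcohol, with_alcohol))
print(is_significant(without_alcohol, with_alcohol))
```

In practice a statistics package replaces the table lookup and returns the p-value directly.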
The t-test II
The t-test tests for a difference in means (μ)
[Figure: density curves of the intensity of gene x for wt and mutant; label: t]
The t statistic is based on the sample mean and variance
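The slide's own formula did not survive in this transcript. As a reminder of how the sample mean and variance enter, one standard textbook form (the pooled-variance two-sample t statistic, assumed here rather than taken from the slides) is given below. The symbols \bar{x}, s^2 and n denote the sample mean, sample variance and number of replicates in each group.

```latex
s_p^2 = \frac{(n_{wt} - 1)\, s_{wt}^2 + (n_{mut} - 1)\, s_{mut}^2}{n_{wt} + n_{mut} - 2},
\qquad
t = \frac{\bar{x}_{wt} - \bar{x}_{mut}}{s_p \sqrt{\dfrac{1}{n_{wt}} + \dfrac{1}{n_{mut}}}}
```

A large |t| means the two group means are far apart relative to the noise within the groups.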
The t-test III
Conclusion
• Array data contains stochastic noise
  – Therefore statistics is needed to conclude on differential expression
• We can’t really trust the p-value
• But the statistics can rank genes
• The capacity/needs of downstream processes can be used to set the cutoff
• FDR can be estimated
• t-test is used for two-category tests
• ANOVA is used for multiple categories
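One common way to estimate the FDR from a list of p-values is the Benjamini-Hochberg procedure. The slides do not say which method is meant, so treat this as an assumed example rather than the author's choice. The p-values below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the genes, sorted from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five genes.
p = [0.001, 0.04, 0.03, 0.20, 0.008]
for gene, (raw, adj) in enumerate(zip(p, benjamini_hochberg(p)), start=1):
    print(f"gene_{gene}: p = {raw:.3f}, BH-adjusted = {adj:.3f}")
```

Keeping the genes whose adjusted value is below, say, 0.05 gives a list in which roughly 5% of the entries are expected to be false positives. The ranking of the genes is the same as with the raw p-values.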