analysis of microarray data - dtu bioinformatics · buy chip/array statistical analysis fit to...
TRANSCRIPT
Sample PreparationHybridization
Array designProbe design
QuestionExperimental Design
Buy Chip/Array
Statistical AnalysisFit to Model (time series)
Expression IndexCalculation
Advanced Data AnalysisClustering PCA Classification Promoter AnalysisMeta analysis Survival analysis Regulatory Network
ComparableGene Expression Data
Normalization
Image analysis
The DNA Array Analysis Pipeline
What's the question?
Typically we want to identify differentially expressed genes
without alcohol with alcohol
alcohol dehydrogenase
Example:alcohol dehydrogenase is expressed at a higher levelwhen alcohol is added to the media
Intensities are not just mRNAconcentrations
Sample preparation
• RNA degradation• RNA purification• Reverse transcription• Amplification efficiency• Dye effect (cy3/cy5)
Array / hybridization
• Spotting• DNA-support binding• Spatial effects (washing etc.)• Image segmentation
Gene-specific variationGlobal variation
• Spotting efficiency,– Spot size– Spot shape
• Cross-/unspecifichybridization
• Biological variation– Effect– Noise
•Amount of RNA in the biopsy
•Efficiencies of:– RNA extraction– Reverse transcription– amplification– Labeling– Photodetection
Systematic
Two kinds of variation
Stochastic
Sample PreparationHybridization
Array designProbe design
QuestionExperimental Design
Buy Chip/Array
Statistical AnalysisFit to Model (time series)
Expression IndexCalculation
Advanced Data AnalysisClustering PCA Classification Promoter AnalysisMeta analysis Survival analysis Regulatory Network
ComparableGene Expression Data
Normalization
Image analysis
The DNA Array Analysis Pipeline
Lowess Normalization
• One of the most commonly utilized normalizationtechniques is the LOcally Weighted ScatterplotSmoothing (LOWESS) algorithm.
M
A
* * * *** *
Sample PreparationHybridization
Array designProbe design
QuestionExperimental Design
Buy Chip/Array
Statistical AnalysisFit to Model (time series)
Expression IndexCalculation
Advanced Data AnalysisClustering PCA Classification Promoter AnalysisMeta analysis Survival analysis Regulatory Network
ComparableGene Expression Data
Normalization
Image analysis
The DNA Array Analysis Pipeline
Expression index value• Some microarrays have multiple probes
addressing the expression of the same gene
– Affymetrix chips have 11-20 probe pairs pr. Gene
- Perfect Match (PM)- MisMatch (MM)
PM: CGATCAATTGCACTATGTCATTTCTMM: CGATCAATTGCAGTATGTCATTTCT
Why use gene expression index?
For most downstream analysis only onevalue pr. gene are used.
Therefore we collapse the intensities frommany probes into one value:a gene expression index value
Expression index calculation• Simplest method?
Median
But more sophisticated methods exists:dChip, RMA and MAS 5 (from Affymetrix)
StochasticSystematic
• Too random to be explicitly accounted for
• “noise”
• Similar effect on manymeasurements
• Corrections can beestimated from data
Normalization Statistical testing
Sources of variation
Global variation Gene-specific variation
Sample PreparationHybridization
Array designProbe design
QuestionExperimental Design
Buy Chip/Array
Statistical AnalysisFit to Model (time series)
Expression IndexCalculation
Advanced Data AnalysisClustering PCA Classification Promoter AnalysisMeta analysis Survival analysis Regulatory Network
ComparableGene Expression Data
Normalization
Image analysis
The DNA Array Analysis Pipeline
He’s going to say it
However, the measurements containstochastic noise
There is no way around it
Statistics
Noisymeasurements p-value
statistics
You can choose to think ofstatistics as a black box
But, you still need to understand how tointerpret the results
P-valueThe chance of rejecting the nullhypothesis by coincidence----------------------------For gene expression analysis wecan say:the chance that a gene iscategorized as differentiallyexpressed by coincidence
The output of the statistics
The statistics gives us a p-valuefor each gene
We can rank the genes according to the p-value
But, we can’t really trust the p-value in a strictstatistical way!
Why not!
For two reasons:
1. We are rarely fulfilling all theassumptions of the statistical test
2. We have to take multi-testing intoaccount
The t-test Assumptions
1. The observations in the two categoriesmust be independent
2. The observations should benormal distributed
3. The sample size must be ‘large’(>30 replicates)
The t-test tests for difference in means (µ)
Intensityof gene x
Density µwt wt µmut mutant
The t-test II
Conclusion
• Array data contains stochastic noise– Therefore statistics is needed to conclude on differential
expression• We can’t really trust the p-value• But the statistics can rank genes• The capacity/needs of downstream processes can be
used to set cutoff• FDR can be estimated• t-test is used for two category tests• ANOVA is used for multiple categories