Practical Issues in Microarray Data Analysis
Mark Reimers
National Cancer Institute
Bethesda Maryland
Overview
• Scales for analysis
• Systematic errors
• Sample outliers & experimental consistency
• Useful graphics
• Implications for experimental design
• Platform consistency
• Individual differences
Distribution of Signals
• Most genes are expressed at very low levels
• Even after log-transform the distribution is skewed
• NB: the signal-to-abundance ratio is NOT the same for different genes on the chip
Explanation of Distribution Shape
• The steep bell curve on the left-hand side is probably due to measurement noise
• The underlying real distribution is probably even steeper
[Figure: abundances + noise = observed values]
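The "abundances + noise = observed values" picture can be sketched with a toy simulation. Everything here is an assumption made for illustration: the log-normal abundance model, the constant background of 200, the noise SD of 30, and the function names. None of these values comes from the slides.

```python
import math
import random
import statistics

def simulate_observed(n_genes=5000, background=200.0, noise_sd=30.0, seed=0):
    """Return (true_abundances, observed_values) for a toy chip."""
    rng = random.Random(seed)
    # Assumption: true abundances are log-normal, so most genes are low.
    true = [rng.lognormvariate(3.0, 1.5) for _ in range(n_genes)]
    # Observed value = background + abundance + additive measurement noise.
    # Clip at 1 only so that the log transform below is always defined.
    observed = [max(1.0, background + t + rng.gauss(0.0, noise_sd)) for t in true]
    return true, observed

def log_skew_gap(values):
    """Mean minus median of log2(values); positive means a longer right tail."""
    logs = [math.log2(v) for v in values]
    return statistics.mean(logs) - statistics.median(logs)

if __name__ == "__main__":
    true, observed = simulate_observed()
    print("mean - median of log2(observed):", round(log_skew_gap(observed), 3))
```

In this toy model the bulk of the observed log values sits near the background level, where noise dominates, and the long right tail comes from the few highly expressed genes.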
Variation Between Chips
• Technical variation: differences between measures of transcript abundance in the same samples
  • Causes: sample preparation, slide, hybridization, measurement
• Individual variation: variation between samples or individuals
  • Healthy individuals really do have consistently different levels of gene expression!
Replicates in True Scale
• Signals vary more between replicates at the high end
• The level of 'noise' increases with signal
[Figure: comparison of chips (Affy), chip 1 vs. chip 2]
[Figure: SD as a function of mean signal across all chips; red line is lowess fit]
Replicates on Log Scale
• Measures fold-change identically across genes
• Noise at the lower end is higher after log transform
[Figure: SD vs. signal after log transform]
[Figure: chip 1 vs. chip 2 after log transform]
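The two replicate slides can be illustrated with a small simulation: on the raw scale the SD grows with signal, while after a log transform the low-signal genes are the noisier ones. This is a sketch under an assumed error model (multiplicative plus additive noise) with made-up parameter values; `replicate_sd` is a hypothetical helper, not something from the talk.

```python
import math
import random
import statistics

def replicate_sd(true_signal, n_reps=2000, mult_sd=0.15, add_sd=20.0, seed=1):
    """Simulate replicate measurements of one gene; return (raw SD, log2 SD)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_reps):
        # Assumed model: multiplicative error on the signal plus additive noise.
        v = true_signal * math.exp(rng.gauss(0.0, mult_sd)) + rng.gauss(0.0, add_sd)
        # Clip at 1 so the log is defined (a common but crude convention).
        values.append(max(v, 1.0))
    raw_sd = statistics.stdev(values)
    log_sd = statistics.stdev([math.log2(v) for v in values])
    return raw_sd, log_sd

if __name__ == "__main__":
    for signal in (50.0, 500.0, 5000.0):
        raw_sd, log_sd = replicate_sd(signal)
        print(f"signal={signal:7.0f}  raw SD={raw_sd:8.1f}  log2 SD={log_sd:.3f}")
```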
Ratio-Intensity (R-I) plots
• Log scale makes it convenient to represent fold-changes up or down symmetrically
• R = log(Red/Green); I = (1/2) log(Red*Green)
• Also known as MA (minus, add) plots
[Figure: R-I plot; axes: (log) Intensity vs. (log) Ratio]
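A minimal sketch of the R-I coordinates for one spot. The slide does not state the base of the logarithm; base 2 is assumed here because it makes a two-fold change equal to one unit. The function name is hypothetical.

```python
import math

def ratio_intensity(red, green):
    """Return (R, I) for one spot, using base-2 logs (an assumption)."""
    r = math.log2(red / green)          # "minus": log red - log green
    i = 0.5 * math.log2(red * green)    # "add": average of the two logs
    return r, i

if __name__ == "__main__":
    # A four-fold change up and the same change down are symmetric in R
    # and share the same intensity I.
    print(ratio_intensity(400.0, 100.0))
    print(ratio_intensity(100.0, 400.0))
```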
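Variance Stabilization
• Simple power transforms (Box-Cox) often nearly stabilize variance
• Durbin and Huber derived a variance-stabilizing transform from a theoretical model:
  $y = \alpha + \mu e^{\eta} + \varepsilon$
  where $\alpha$ is the background, $\mu$ is the true signal, $\eta$ is the multiplicative error and $\varepsilon$ is the static (additive) error; $\eta$ and $\varepsilon$ have $N(0, \sigma_\eta^2)$ and $N(0, \sigma_\varepsilon^2)$ distributions
• Transform: $g(y) = \log\left( (y - \alpha) + \sqrt{(y - \alpha)^2 + c} \right)$, where $c$ depends on the ratio of the additive to the multiplicative error variance
• Could estimate $\alpha$ (background) and the error variances empirically
• In practice the best effect on variance often comes from parameters different from the empirical estimates
• Huber's parameters are harder to estimate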
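A minimal sketch of a generalized-log transform of the kind described on this slide, written as g(y) = log((y - background) + sqrt((y - background)^2 + c)). The parameter names `background` and `c` and their default values are assumptions for illustration; in practice they would be estimated or tuned, as the slide notes.

```python
import math

def glog(y, background=0.0, c=1.0):
    """Generalized log: behaves like log(2z) for large z = y - background,
    but stays finite and roughly linear near zero."""
    z = y - background
    return math.log(z + math.sqrt(z * z + c))

if __name__ == "__main__":
    for y in (-5.0, 0.0, 10.0, 1000.0):
        print(y, round(glog(y, background=0.0, c=25.0), 4))
```

Unlike a plain log, this transform is defined for background-subtracted values that are zero or negative, which is where the log transform inflates noise.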
Box-Cox Transforms
• Simple power transformations (including log as the limiting case), e.g. cube root
• Often work almost as well as the variance-stabilizing transform
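A short sketch of the Box-Cox family in its usual form, (y^lambda - 1) / lambda, with the log as the limit when lambda goes to 0. The function name is an assumption; lambda = 1/3 corresponds to the cube root mentioned above, up to a shift and scale.

```python
import math

def box_cox(y, lam):
    """Box-Cox power transform of a positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(y)              # limiting case as lam -> 0
    return (y ** lam - 1.0) / lam

if __name__ == "__main__":
    for lam in (1.0, 0.5, 1.0 / 3.0, 0.0):
        print(lam, [round(box_cox(y, lam), 3) for y in (10.0, 100.0, 1000.0)])
```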
Should you use Transforms?
• Transforms change the list of genes that are differentially regulated
• The common argument is that bright genes have higher variability. However, you aren't comparing different genes
• Log transform expands the variability of repressed genes
• Strong transforms (e.g. log) are most suitable for situations where large fold-changes occur (e.g. cancers)
• Weak transforms are more suited to situations where small changes are of interest (e.g. neurobiology)
Graphical methods
Aims:
• Exploratory analysis, to see natural groupings and to detect outliers
• To identify combinations of features that usefully characterize samples or genes
• Not really suitable for quantitative measures of confidence
Principal Components Analysis (PCA)
• Standard procedure of finding combinations with greatest variance
Multi-dimensional scaling (MDS)
• Represent distances between samples as a two- or three-dimensional distance
• Easy to visualize
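As a rough illustration of "finding combinations with greatest variance", the leading principal component can be found by power iteration on the centered data. This is a sketch, not a production PCA: the function name and toy data are made up, and the fixed starting vector would fail in the unlikely case that it is exactly orthogonal to the leading component.

```python
import math

def first_principal_component(samples, n_iter=100):
    """Power iteration for the leading principal component.

    samples: list of rows (one per sample), each a list of feature values.
    Returns (direction, scores), where scores are the sample projections.
    """
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]

    def project(v):
        return [sum(r[j] * v[j] for j in range(p)) for r in centered]

    v = [1.0 / math.sqrt(p)] * p        # arbitrary unit-length start
    for _ in range(n_iter):
        scores = project(v)
        # Multiply v by X^T X (proportional to the covariance matrix).
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v, project(v)

if __name__ == "__main__":
    # Toy data: three samples of one kind, two of another, three genes.
    data = [[10, 5, 1], [11, 6, 1], [10, 6, 1], [2, 1, 1], [1, 0, 1]]
    direction, scores = first_principal_component(data)
    print([round(x, 3) for x in direction])
    print([round(s, 2) for s in scores])
```

In the toy data the two groups land on opposite sides of zero along the first component, which is the kind of natural grouping the plots are meant to reveal.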
Different Metrics – Same Scale
• 8 tumor and 2 normal tissue samples
• Distances are similar in each tree
• Normals are close
• Tree topologies appear different
• Take with a grain of salt!
Volcano Plot
Displays both biological importance and statistical significance
[Figure: volcano plot; axes: log2(fold change) vs. log2(p-value) or t-score]
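A minimal sketch of how one gene's position on a volcano plot could be computed, using the t-score option for the vertical axis. It assumes the expression values are already on a log2 scale and uses a Welch-style t-score; the choice of test and the function name are assumptions, not taken from the slides.

```python
import math
import statistics

def volcano_point(group_a, group_b):
    """Return (log2 fold change, t-score) for one gene.

    group_a, group_b: log2 expression values in the two conditions,
    each with at least two replicates.
    """
    log_fc = statistics.mean(group_b) - statistics.mean(group_a)
    se = math.sqrt(statistics.variance(group_a) / len(group_a)
                   + statistics.variance(group_b) / len(group_b))
    if se == 0.0:
        raise ValueError("no variation within groups; t-score undefined")
    return log_fc, log_fc / se

if __name__ == "__main__":
    print(volcano_point([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

Genes with a large fold change but a small t-score, or the reverse, end up in different corners of the plot, which is what lets it display importance and significance together.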
Quantile Plot
Plot sample t-scores against t-scores under random hypothesis
Statistically significant genes stand out
[Figure: quantile plot; axes: corresponding quantiles of t-distribution vs. sample t-scores]
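A sketch of how the pairs for such a quantile plot might be built. The standard library has no t-distribution quantile function, so the reference quantiles here are approximated by simulating t-scores from pure noise; that Monte Carlo step is a stand-in for a t-table. All names and defaults are assumptions.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch-style t-score comparing two groups of values."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se

def null_quantiles(n_points, n_a, n_b, n_sim=20000, seed=0):
    """Approximate quantiles of the t-score when there is no real difference."""
    rng = random.Random(seed)
    null = sorted(
        welch_t([rng.gauss(0.0, 1.0) for _ in range(n_a)],
                [rng.gauss(0.0, 1.0) for _ in range(n_b)])
        for _ in range(n_sim)
    )
    # Quantile at probability (i + 0.5) / n_points for each plotted point.
    return [null[int((i + 0.5) / n_points * n_sim)] for i in range(n_points)]

def quantile_plot_pairs(t_scores, n_a, n_b):
    """Pair each sorted sample t-score with its reference quantile."""
    reference = null_quantiles(len(t_scores), n_a, n_b)
    return list(zip(reference, sorted(t_scores)))

if __name__ == "__main__":
    for ref, obs in quantile_plot_pairs([0.5, -1.0, 12.0, 0.2], 3, 3):
        print(round(ref, 2), obs)
```

Points far from the diagonal, such as the t-score of 12 in the toy input, are the ones that stand out as statistically significant.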
Systematic Variation
Intensity-dependent dye bias due to ‘quenching’
Stringency (specificity) of hybridization due to ionic strength of hyb solution
How far hybridization reaction progresses due to variation in mixing efficiency
Spatial variation in all of the above
Relevance for Experimental Designs
Balanced designs with several replicates built in have smaller standard errors than reference design with same number of chips – Kerr & Churchill
Assuming error is random!
[Figure: design diagram with Samples 1 to 5]
In practice it is very hard to deal with systematic errors in a symmetric design
No two slides with comparable fold-changes
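The variance claim for balanced designs can be checked in the simplest case, which is not the five-sample design in the figure: two samples A and B, two chips, and a log ratio M on each chip carrying an independent random error of variance sigma^2. Those are exactly the "error is random" assumptions, and the notation is introduced here for illustration. Write Delta for the true log-scale difference A - B.

```latex
% Reference design: chip 1 hybridizes A vs. reference R, chip 2 hybridizes B vs. R
M_1 = A - R + e_1, \quad M_2 = B - R + e_2
\widehat{\Delta}_{\mathrm{ref}} = M_1 - M_2, \qquad
\operatorname{Var}\bigl(\widehat{\Delta}_{\mathrm{ref}}\bigr) = \sigma^2 + \sigma^2 = 2\sigma^2

% Direct comparison with dye swap: chip 1 hybridizes A vs. B, chip 2 hybridizes B vs. A
M_1 = A - B + e_1, \quad M_2 = B - A + e_2
\widehat{\Delta}_{\mathrm{direct}} = \tfrac{1}{2}(M_1 - M_2), \qquad
\operatorname{Var}\bigl(\widehat{\Delta}_{\mathrm{direct}}\bigr) = \tfrac{1}{4}\,(2\sigma^2) = \tfrac{1}{2}\sigma^2
```

With the same two chips the direct design has a quarter of the variance, but only while the errors are purely random. A systematic bias shared by both chips is not reduced by this averaging.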
Critique of Optimal Designs
• Optimal for reduction of variance, if:
  • All chips are good quality
  • No systematic errors, only random noise
• In fact systematic error is almost as great as random noise in many microarray experiments
• With loop designs, single chip failures cause more loss of information than with reference designs
Individual Variation
• Numerous genes show high levels of inter-individual variation
• Level of variation also depends on tissue
• Donors or experimental animals may be infected or under social stress
• Tissues are hypoxic or ischemic for variable times before freezing