
Practical Issues in Microarray Data Analysis

Mark Reimers

National Cancer Institute

Bethesda, Maryland

Overview

• Scales for analysis
• Systematic errors
• Sample outliers & experimental consistency
• Useful graphics
• Implications for experimental design
• Platform consistency
• Individual differences

Distribution of Signals

• Most genes are expressed at very low levels
• Even after log-transform the distribution is skewed
• NB: Signal to abundance ratio is NOT the same for different genes on the chip

Explanation of Distribution Shape

Left hand steep bell curve probably due to measurement noise

Underlying real distribution probably even steeper

[Figure: abundances + noise = observed values]

Variation Between Chips

Technical variation: differences between measures of transcript abundance in the same samples

Causes:
• Sample preparation
• Slide
• Hybridization
• Measurement

Individual variation: variation between samples or individuals

Healthy individuals really do have consistently different levels of gene expression!

Replicates in True Scale

• Signals vary more between replicates at high end
• Level of ‘noise’ increases with signal

[Figure: Comparison of chips (Affy), chip 1 vs chip 2]

[Figure: Std Dev as a function of mean signal across all chips; red line is lowess fit]

Replicates on Log Scale

• Measures fold-change identically across genes
• Noise at lower end is higher in log transform

[Figure: SD vs signal after log transform]

[Figure: chip 1 vs chip 2 after log transform]
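The contrast between the two scales can be reproduced with a small simulation. The sketch below assumes a multiplicative-plus-additive error model (observed = true signal × exp(multiplicative error) + additive error); the function names, signal levels, and error sizes are illustrative choices, not values from the talk.

```python
import math
import random
import statistics

def simulate_replicates(true_signal, n_chips, sd_mult, sd_add, rng):
    """Simulate replicate measurements of one gene across n_chips chips.

    Assumed model: true_signal * exp(multiplicative error) + additive error.
    """
    return [
        true_signal * math.exp(rng.gauss(0.0, sd_mult)) + rng.gauss(0.0, sd_add)
        for _ in range(n_chips)
    ]

def sd_raw_and_log(values):
    """Return the SD of the values on the raw scale and on the log2 scale."""
    raw_sd = statistics.stdev(values)
    log_sd = statistics.stdev([math.log2(v) for v in values])
    return raw_sd, log_sd

rng = random.Random(0)  # fixed seed so the run is repeatable
low = simulate_replicates(200.0, 500, 0.1, 30.0, rng)     # dim gene
high = simulate_replicates(20000.0, 500, 0.1, 30.0, rng)  # bright gene

raw_sd_low, log_sd_low = sd_raw_and_log(low)
raw_sd_high, log_sd_high = sd_raw_and_log(high)
print("raw SD  (dim, bright):", raw_sd_low, raw_sd_high)
print("log2 SD (dim, bright):", log_sd_low, log_sd_high)
```

Under these assumptions the bright gene has the larger SD on the raw scale, while the dim gene has the larger SD after the log transform.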

Ratio-Intensity (R-I) plots

Log scale makes it convenient to represent fold-changes up or down symmetrically

R = log(Red/Green); I = (1/2) log(Red*Green)

Also known as MA (minus, add) plots

[Figure: R-I plot; x-axis (log) Intensity, y-axis (log) Ratio]
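A minimal sketch of the R and I calculation for a single spot. The slide does not state the log base; base 2 is assumed here, and the function name is hypothetical.

```python
import math

def ratio_intensity(red, green):
    """Return (R, I) for one spot, using log base 2 (an assumption).

    R = log2(red / green) is the log ratio ("minus": log red - log green).
    I = 0.5 * log2(red * green) is the mean log intensity ("add").
    """
    r = math.log2(red / green)
    i = 0.5 * math.log2(red * green)
    return r, i

# A 4-fold change up and a 4-fold change down are symmetric about zero.
r_up, i_up = ratio_intensity(400.0, 100.0)
r_down, i_down = ratio_intensity(100.0, 400.0)
print(r_up, r_down, i_up, i_down)
```

Swapping the two channels flips the sign of R and leaves I unchanged, which is the symmetry the log scale provides.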

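Variance Stabilization

Simple power transforms (Box-Cox) often nearly stabilize variance

Durbin and Huber derived a variance-stabilizing transform from a theoretical model:

$y = \alpha + m e^{\eta} + \varepsilon$

$\alpha$ is the background, $m$ is the true signal, $\eta$ is the multiplicative error and $\varepsilon$ is the static error; $\eta \sim N(0, \sigma_\eta^2)$ and $\varepsilon \sim N(0, \sigma_\varepsilon^2)$

Transform:

$g(y) = \log\left( (y - \alpha) + \sqrt{(y - \alpha)^2 + c^2} \right)$, where $c$ depends on the ratio of $\sigma_\varepsilon$ to $\sigma_\eta$

Could estimate $\alpha$ (background) and $\sigma_\varepsilon$ empirically

In practice the best effect on variance often comes from parameters different from the empirical estimates

Huber’s is harder to estimate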
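One commonly used form of such a variance-stabilizing transform is the generalized log. The sketch below assumes that form; `background` and `c` are tuning parameters chosen by the analyst (the values in the example are illustrative, not estimates from any data set).

```python
import math

def glog(y, background, c):
    """Generalized log: log(z + sqrt(z^2 + c^2)) with z = y - background.

    Behaves like log(2 * z) for z much larger than c, and stays finite
    for z at or below zero, where the ordinary log is undefined.
    """
    z = y - background
    return math.log(z + math.sqrt(z * z + c * c))

# Illustrative values only: background 50, c 10.
at_background = glog(50.0, 50.0, 10.0)        # equals log(c)
below_background = glog(40.0, 50.0, 10.0)     # still defined
bright = glog(1_000_050.0, 50.0, 10.0)        # close to log(2 * 1e6)
print(at_background, below_background, bright)
```

The practical appeal is that signals near or below background can be transformed without being dropped or clipped.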

Box-Cox Transforms

• Simple power transformations (including log as extreme case), e.g. cube root
• Often work almost as well as variance-stabilizing transform
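A short sketch of the standard Box-Cox family, $(y^\lambda - 1)/\lambda$, with the log as the limiting case $\lambda \to 0$. The function name is hypothetical; a cube root corresponds to $\lambda = 1/3$.

```python
import math

def box_cox(y, lam):
    """Box-Cox power transform of a positive value y.

    lam = 1/3 is a (shifted, scaled) cube root; lam = 0 is the natural log,
    which is the limit of (y**lam - 1) / lam as lam goes to 0.
    """
    if y <= 0:
        raise ValueError("Box-Cox requires a positive value")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

cube_root_case = box_cox(8.0, 1.0 / 3.0)   # (2 - 1) / (1/3)
log_case = box_cox(5.0, 0.0)
near_log_case = box_cox(5.0, 1e-6)         # small lam approaches the log
print(cube_root_case, log_case, near_log_case)
```

Intermediate values of `lam` give transforms weaker than the log, which is the sense in which the log is the extreme case.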

Should you use Transforms?

Transforms change the list of genes that are differentially regulated

The common argument is that bright genes have higher variability

However you aren’t comparing different genes

Log transform expands the variability of repressed genes

Strong transforms (e.g. log) are most suitable for situations where large fold-changes occur (e.g. cancers)

Weak transforms are more suited to situations where small changes are of interest (e.g. neurobiology)

Graphical methods

Aims:
• Exploratory analysis, to see natural groupings, and to detect outliers
• To identify combinations of features that usefully characterize samples or genes
• Not really suitable for quantitative measures of confidence

Principal Components Analysis (PCA)
• Standard procedure of finding combinations with greatest variance

Multi-dimensional scaling (MDS)
• Represent distances between samples as distances in two or three dimensions
• Easy to visualize
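To make "combinations with greatest variance" concrete, here is a minimal sketch that finds only the first principal component by power iteration, using the standard library. It is an illustration under simplifying assumptions (rows are samples, columns are genes; one component only), not a substitute for a full PCA routine, and the toy data are made up.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (direction, scores) for the first principal component.

    rows: list of samples, each a list of feature (gene) values.
    Power iteration on X^T X, where X is the column-centered data.
    The start vector is arbitrary; if it happened to be exactly orthogonal
    to the leading direction, the iteration would not find it.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]

    start = [float(j + 1) for j in range(p)]
    start_norm = math.sqrt(sum(x * x for x in start))
    v = [x / start_norm for x in start]

    for _ in range(n_iter):
        scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]  # X v
        w = [sum(centered[i][j] * scores[i] for i in range(n))
             for j in range(p)]                                          # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]

    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

# Toy data lying close to the line y = 2x.
toy = [[0.0, 0.1], [1.0, 1.9], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
direction, scores = first_principal_component(toy)
print(direction, scores)
```

The returned `scores` are the sample coordinates along that direction, which is what would be plotted on the first axis of a PCA plot.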

MDS Plots

Representing Groups

[Figure: Day 1 chips shown as a cluster diagram and as a multi-dimensional scaling plot]

Different Metrics – Same Scale

8 tumor; 2 normal tissue samples

Distances are similar in each tree

Normals are close

Tree topologies appear different

Take with a grain of salt!

Volcano Plot

Displays both biological importance and statistical significance

[Figure: volcano plot; x-axis log2(fold change), y-axis log2(p-value) or t-score]
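A minimal sketch of the two coordinates of one point on a volcano plot, using the t-score option for the vertical axis. It assumes the inputs are already log2 expression values for a single gene in two groups, and uses a Welch (unequal-variance) t-score; the function name and example values are illustrative.

```python
import math
import statistics

def volcano_point(group1, group2):
    """Return (log2 fold change, Welch t-score) for one gene.

    group1, group2: log2 expression values, so the difference of the
    group means is the log2 fold change.
    """
    diff = statistics.mean(group1) - statistics.mean(group2)
    se = math.sqrt(
        statistics.variance(group1) / len(group1)
        + statistics.variance(group2) / len(group2)
    )
    if se == 0.0:
        raise ValueError("no within-group variation; t-score undefined")
    return diff, diff / se

log2_fc, t = volcano_point([8.0, 9.0, 10.0], [6.0, 7.0, 8.0])
print(log2_fc, t)
```

Repeating this for every gene gives the scatter: genes far to the left or right have large fold changes, and genes far from zero vertically have strong statistical evidence.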

Quantile Plot

Plot sample t-scores against t-scores under random hypothesis

Statistically significant genes stand out

[Figure: quantile plot; x-axis corresponding quantiles of t-distribution, y-axis sample t-scores]

Systematic Variation

Intensity-dependent dye bias due to ‘quenching’

Stringency (specificity) of hybridization due to ionic strength of hyb solution

How far hybridization reaction progresses due to variation in mixing efficiency

Spatial variation in all of the above

Relevance for Experimental Designs

Balanced designs with several replicates built in have smaller standard errors than reference design with same number of chips – Kerr & Churchill

Assuming error is random!

[Figure: design diagram linking Sample 1 through Sample 5]

In practice it is very hard to deal with systematic errors in a symmetric design

No two slides with comparable fold-changes

Critique of Optimal Designs

Optimal for reduction of variance, if
• All chips are good quality
• No systematic errors, only random noise

In fact systematic error is almost as great as random noise in many microarray experiments

With loop designs single chip failures cause more loss of information than with reference designs

Individual Variation

Numerous genes show high levels of inter-individual variation

Level of variation depends on tissue also

Donors or experimental animals may be infected, or under social stress

Tissues are hypoxic or ischemic for variable times before freezing

Frequent False Positives

Immunoglobulins and stress response proteins are often 5-10X higher than typical in one or two samples

Permutation p-values will be insignificant, even if t-score appears large
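A sketch of an exact two-sided permutation p-value for one gene, enumerating every relabeling of the samples (feasible only for small groups). The Welch t-score as the test statistic and the example values are assumptions for illustration, not data from the talk.

```python
import itertools
import math
import statistics

def t_score(a, b):
    """Welch t-score for two groups of values."""
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    if se == 0.0:
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se

def permutation_p_value(group1, group2):
    """Fraction of relabelings whose |t| is at least the observed |t|."""
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    observed = abs(t_score(group1, group2))
    count = 0
    total = 0
    for idx in itertools.combinations(range(len(pooled)), n1):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so the observed labeling always counts itself.
        if abs(t_score(a, b)) >= observed - 1e-12:
            count += 1
    return count / total

# Made-up values: one sample in group 1 is about 10X higher than the rest.
p_outlier = permutation_p_value([5.1, 4.9, 5.0, 5.2, 50.0],
                                [5.0, 4.8, 5.1, 4.9, 5.0])
# Made-up values: two cleanly separated groups.
p_separated = permutation_p_value([10.0, 11.0, 12.0, 13.0, 14.0],
                                  [1.0, 2.0, 3.0, 4.0, 5.0])
print(p_outlier, p_separated)
```

When a difference is driven by a single extreme sample, many relabelings produce a statistic just as extreme, so the permutation p-value stays large; a consistent separation between groups is matched by very few relabelings.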

[Figure: frequency of gene levels in Group 1 and Group 2]
