microarray data analysis tutorial at ismb 2008 mark reimers virginia commonwealth university

158
Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Upload: claribel-tyler

Post on 27-Dec-2015

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Microarray Data Analysis

Tutorial at ISMB 2008

Mark Reimers

Virginia Commonwealth University

Page 2: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Quality assessment

Normalizing expression arrays

Normalizing other array types

Selecting genes

Identifying functional groups

Outline

Page 3: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Array Quality Assessment

ISMB 2008 Microarray Tutorial

Mark Reimers

Virginia Commonwealth University

Page 4: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Approaches to Quality Control

• Careful Technique

• Practice!

• Testing the RNA quality

• Examining the chips and images

• Spot metrics

• Statistical approaches– Examining deviations as functions of technical

variables

Page 5: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Simple QA for Spotted Arrays

• Spot Measures– Signal/Noise

• Foreground / background or – foreground / SD

– Uniformity– Spot Area

• Global Measures– Qualitative assessments – Averages of spot measures

• Inspect images for artifacts– Streaks of dye, scratches etc.

• Are there biases in regions?

Page 6: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Using Statistics for QA• Significant artifacts may not be obvious

from visual inspection or bulk statistics• General approach: plot deviations from

average or residuals from fit against any technical variable:– Intensity– CG content or Tm– Probe position relative to 3’ end (for poly-T

primed)– Location on chip

• Color-code deviations on chip image

Page 7: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Simple Portrait - Boxplots

• Boxplot of 16 chips from Cheung et al Nature 2005

45

67

89

Page 8: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Another Portrait - Densities

4 6 8 10 12 14

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Density Plots:before and after

log(Signal)

De

nsity

Chips

GSM25524.CELGSM25525.CELGSM25526.CELGSM25527.CELGSM25528.CELGSM25529.CELGSM25530.CELGSM25531.CEL

GSM25540.CELGSM25541.CELGSM25542.CELGSM25543.CELGSM25548.CELGSM25549.CELGSM25550.CELGSM25551.CEL

Page 9: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Ratio vs Intensity Plots: Saturation & Quenching

• Saturation– Decreasing rate of

binding of RNA at higher occupancies on probe

• Quenching:– Light emitted by one

dye molecule may be re-absorbed by a nearby dye molecule

– Then lost as heat– Effect proportional to

square of density

Plot of log ratio against average log intensity across chips

GSM25377 from the CEPH expression data GSE2552

Page 10: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

How Much Variability on R-I?

• Ratio-Intensity plots for six arrays at random from Cheung et al Nature (2005)

Page 11: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Covariation with Probe Tm

• MAQC project

• Agilent 44K– Array 1C3– Performed by

Agilent

•Plot of log ratios to average against Tm •Bimodal distribution because two samples are very different

Page 12: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Covariation with Probe Position

• RNA degrades from 5’ end

• Intensity should decrease from 3’ end uniformly across chips

• affyRNAdeg plots in affy package

Plot of average intensity for each probe position across all genes against probe position

Page 13: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Spatial Variation Across Chips

Red/Green ratios show variation-probably concentrated

Ratios of ratios on slide to ratios on standard show consistent biases

Page 14: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

In House Spotted Arrays

Ratio of ratios shows much clearer concentration of red spots on some slides

Note non-random but highly irregular concentration of red

Legend

Page 15: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Background Subtraction (1)

• We think that local background contributes to bias

• Does subtracting background remove bias?

Local off-spot background may not be the best estimate of spot background (non-specific hyb)

Spots BG subtracted

Page 16: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Background Subtraction (2)

Raw spot ratios show a mild bias relative to averageAfter subtracting a high green bg in the center a red bias results

Raw Ratios Background BG-subtracted

Page 17: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Other Bias Patterns

This spotted oligo array shows strong biases at the beginning and end of each print-tip group

The background shows a milder version of this effect

Subtracting background compensates for about half this effect

Processed Raw Spot Background

Page 18: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Local Bias on Affymetrix ChipsImage of raw data on a log2 scale shows striations but no obvious artifacts

Image of ratios of probes to standard shows a smudge

Non-coding probes

Images show high values as red, low values as yellow

Page 19: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Variation in Affy Chips

Page 20: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

QC in Bioconductor

• Robust Multi-chip Analysis (RMA) – fits a linear model to each probe set– High residuals show regional patterns

High residuals in green

Available in affyPLM package at www.bioconductor.org

Portion of dChip QA image High residuals in pinkwww.dchip.org

Page 21: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Affy QC Metrics in Bioconductor

• affyPLM package fits probe level model to Affymetrix raw data

• NUSE - Normalized Unscaled Standard Errors – normalized relative to

each gene

• How many big errors?

Page 22: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Conclusions

• Technical bias is a significant source of error in microarray studies

• Technical bias can be visualized and quantified

• Normalization can only partially compensate these problems

• It is best to drop chips with extreme technical deviations

Page 23: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Are Microarrays Reliable?• Microarray studies have an uneven track

record of replication• Huang et al (2003) replicated few of the markers

for breast cancer survival identified by van t’Veer et al (2002)

• Two Science papers in 2003 on stem cell gene expression describing parallel experiments identify only 2 genes in common out of hundreds

• The MAQC project papers in fall 2006 claimed to validate microarrays. Measures of consistency were highly variable across any two platforms

• Maybe both are true: – careful microarray studies are accurate

Page 24: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Further Questions

Page 25: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Quality assessment

Normalizing expression arrays

Normalizing other array types

Selecting genes

Identifying functional groups

Page 26: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Normalizing Expression Data

ISMB 2008 Tutorial

Page 27: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

QA and Normalization

• Technical differences cause changes in measures

• QA flags big technical differences in arrays– Often sporadic – e.g. scratches on chip

• Normalization attempts to compensate for modest but systematic differences– e.g. intensity-dependent bias (quenching)

• Both must be done together

Page 28: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Variations in Technique with Broad Consequences for Measures

• Temperature of hybridization

• Amount of RNA

• Degradation of RNA

• Yield of conversion to cDNA or cRNA

• Yield of labeling reaction

• Strength of ionic buffers

• Stringency of wash

Page 29: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Many Normalization Approaches

• One Parameter– Total or median brightness

• Two parameter– Variance stabilizing– Lowess for two-color arrays

• Non-parametric– Distribution (quantile) matching

• Regression on technical covariates

Page 30: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

• Can only measure relative levels of expression: per g RNA

• Assume: only difference between chips is due to different weights of RNA hybridized

• Set:

• For each chip, normalized values are:

• For 2-color cDNA use separate constant for each channel on each chip

• More consistent results if use a robust estimator, such as median or 1/3 – trimmed mean (Quackenbush): take mean of middle 2/3 of probes

N

ii

N

ii fCfC

1

greengreen

1

redred ;

One Parameter – Overall Mean

greengreen*

redred* /;/ CffCff iiii

Page 31: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

One-Parameter Limitations

4 6 8 10 12 14

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Density Plots:before and after

log(Signal)

De

nsity

A centering transform can shift the density of log2 data but can’t get around the differences in shape of distributions

Page 32: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Intensity Dependent Bias

Ratio – Intensity (M-A) plot of raw data:

M = log2(R/G) ; A = (log2(R) + log2(G)) / 2

Page 33: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Global (lowess) Normalization

Same data set normalized by: Mnorm = M-c(A)

where c(A) is an intensity dependent functionestimated by local regression.

Page 34: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Print-tip Normalization

Separate lowess curves for each of 16 print-tips

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

Print-tip layout

Box plots for log ratios in each of 16 print-tip groups

Page 35: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Scaled Print-tip Normalization

There still remain apparent technical artifacts

Try to fix them by scaling after print-tip lowess: Mp,norm = sp·(Mp-cp(A));

Box plot after print-tip normalization Box plot after scaled print-tip normalization

Page 36: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Spatial Effects

No normalization Global normalization

Print-tip normalization Scaled Print-tip normalization

Page 37: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

This is too Complex!

• We are piling fudge factor upon fudge factor

• We don’t know what errors these fudges are introducing…

• … and still haven’t removed all artifacts

• How about something simpler!

Page 38: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Quantile Normalization

• Currently most widely used ‘best’ method

• Implemented in BioC affy, oligo, limma

• Ignores causes of variation and technical covariates

• Shoehorns all data into the same shape distribution – matching quantiles

Page 39: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Motivation: Probe Intensities in 23 Replicates on Affy chips

Densities of intensities from GeneLogic spikein study; black is composite

Page 40: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Quantile Normalization (Irizarry et al 2002)

The mapping by quantile normalization

1. Map values in each curve separately to their quantile within the distribution

2. Map quantiles of each distribution to quantiles of the reference curve

Page 41: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Ratio Intensity Correction?

M-A plots of raw data from Affy chip pairs

Page 42: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Ratio-Intensity: After Quantile Norm

Page 43: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Critiques of Quantile Normalization

• Artificially compresses variation of highly expressed genes

• Confounds systematic changes due to cross-hybridization with changes in abundance to genes of low expression

• Induces artifactual correlations in low-intensity genes

Page 44: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Technical Variable Regression

• Hypothesis: 1. Most technical variation between chips is

caused by a few systemic factors

2. Probes with similar technical characteristics (Tm, position in gene, location on chip, intensity) will be distorted by similar amounts

• Normalization: estimate bias due to technical factors by local averaging of deviations from reference profile

Page 45: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Further Questions

Page 46: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Models for Multiple-Probe Affymetrix and NimbleGen Data

Page 47: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Many Probes for One Gene

GeneGeneSequenceSequence

Multiple Multiple oligo probesoligo probes

Perfect MatchPerfect MatchMismatchMismatch

5´5´ 3´3´

How to combine signals from multiple probes into a single gene abundance estimate?

Page 48: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Probe Variation

• Individual probes don’t agree on fold changes

• Probes vary by two orders of magnitude on each chip– CG content is most important factor in signal strength

Signal from 16 probes along one gene on one chip

Page 49: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Probe Measure Variation

•Typical probes are two orders of magnitude different!•CG content is most important factor•RNA target folding also affects hybridization

3x104

0

Page 50: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Many Approaches

• Affymetrix MicroArray Suite

• dChip - Li and Wong, HSPH

• Bioconductor: – RMA - Bolstad, Irizarry, Speed, et al– affyPLM – Bolstad– gcRMA – Wu

• Physical chemistry models – Zhang et al

Page 51: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Critique of Averaging (MAS5)

• Not clear what an average of different probes should mean

• Tukey bi-weight can be unstable when data cluster at either end – frequently the conditions here

• No ‘learning’ based on cross-chip performance of individual probes

Page 52: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Motivation for multi-chip models:

Probe level data from spike-in study ( log scale ) note parallel trend of all probes

Courtesy of Terry Speed

Page 53: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Model for Probe Signal

• Each probe signal is proportional to – i) the amount of target sample – a – ii) the affinity of the specific probe sequence to the target – f

• NB: High affinity is not the same as Specificity– Probe can give high signal to intended target and also to

other transcripts

a1

a2

Probes 1 2 3

chip 1

chip 2 f1 f2 f3

Page 54: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Multiplicative Model

• For each gene, a set of probes p1,…,pk

• Each probe pj binds the gene with efficiency fj

• In each sample there is an amount ai.

• Probe intensity should be proportional to fjxai

• Always some noise!

Page 55: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Robust Linear Models

• Criterion of fit– Least median squares– Sum of weighted squares– Least squares and throw out outliers

• Method for finding fit– High-dimensional search – Iteratively re-weighted least squares– Median Polish

Page 56: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

• For each probe set, take the log transform of

PMij = aifj:

• i.e. fit the model:

• Fit this additive model by iteratively re-weighted least-squares or median polish

ijjiij baPM )bg(log

Bolstad, Irizarry, Speed – (RMA)

Critique: Model assumes probe noise is constant (homoschedastic) on log scale

After normalization!

)log()log()(log jiij faPM

Page 57: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Comparing Measures

20 replicate arrays – variance should be smallStandard deviations of expression estimates on arraysarranged in four groups of genesby increasing mean expression level

Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA

Courtesy of Terry Speed

Page 58: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Current Multi-chip Approaches

• PLIER – Affymetrix’ own multi-chip model• RMA – Bioconductor standard

– Affy package– To be replaced by oligo package?

• affyPLM– More sophisticated RMA– Careful with more sophisticated models!

• gcRMA– Estimates background

Page 59: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Quality assessment

Normalizing expression arrays

Normalizing other array types

Selecting genes

Identifying functional groups

Page 60: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Further Questions

Page 61: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Outline of Section

• General Issues

• Normalizing CGH arrays

• Normalizing ChIP-chip arrays

• Normalizing SNP arrays

• Normalizing methylation (HELP) arrays

Page 62: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

General Issues

• Quantile (and other expression normalizations) assume that overall distribution of values is invariant across samples– Roughly true for most expression sets– Untrue for copy number, DNA methylation, ChiP-chip

data

• Many choices for expression probes• Tiling, exon, and promoter arrays must work with

what is there

Page 63: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Array CGH Technology

Page 64: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Array CGH Data:Chromosome 8 (241 BAC’s)

10 cell lines and 45 tumor samples

Page 65: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

CGH Array Pre-processing

• Dye bias effects are present but because the ranges are usually smaller they have less impact– Most programs don’t do a dye bias adjustment

• Probe bias effects– Most probes show consistent biases in

relation to their neighbors– About 50% of total variance is such bias– Few programs adjust for that– See Bilke et al (Bioinformatics 2005)

Page 66: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

CGH Array Segmentation

• Key idea: Most probe targets have same copy number as their next neighbors

• Can average over neighbors

• Key issue: when is a difference real?

• Recommended Programs:

• DNACopy – Solid statistical basis; slow

• StepGram – Heuristic ; fast

Page 67: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Further Questions

Page 68: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

ChIP-chips

Page 69: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Chromatin Immuno-precipitation

Page 70: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Tiling Array

• One probe every n base pairs over some length of chromosome

Page 71: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

What the data look like

Page 72: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Superimposing channels

Giresi et al, Genome Res. 10

Page 73: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Methods and Issues

• Normalization– Different enrichment ratios– Different probe thermodynamics– Dye and probe bias

• Estimation– Categorical or continuous?– Individual values are noisy:

• Where is the peak?

Page 74: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

TileMap

• Ignores normalization

• ‘Shrinkage’ estimator of variance– Improves individual scores

• Smooths noise by moving average

• Alternative HMM

Page 75: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

MAT

• One color array– Needs separate array for comparison

• Normalizes probe thermodynamics & enrichment ratio

• Estimation by (robust) moving average

Page 76: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Further Questions

Page 77: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

SNP chips

Page 78: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Affy SNP chips

Page 79: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

SNP Chip Probe Design

• 10 25-mers overlapping the SNP

• Alleles A & B• Sense and Anti-sense

– or PM and MM (old)

Page 80: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

RMA for SNP chips

• Early Affy software wasn’t very accurate• Rabbee & Speed (2006) proposed RLMM,

an RMA-like method using:– Quantile normalization– Two variables ( A & B signals)– Discriminant analysis (Mahalanobis metric) to

tell apart alleles

• Much better than Affy software• Has been adopted by Affy

Page 81: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Discriminating SNPs

SNP A 1753037 SNP A 1698299

Page 82: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Improvements

• RLMM still had problems with several hundred SNPs

• Also problems comparing across labs

• CRLMM addresses these by careful normalization

Page 83: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

• Model intensity on each chip separately by

• Summarize ,

, ,

by median polish: M+ = -

; M- =

-

• Model log ratio bias on each chip by

• Estimate log ratio bias using E-M

– Zi indexes which SNP state is likely

CRLMM Overview

Page 84: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Normalization – Step 1

• Regress (PM) intensity on sequence predictors and fragment length

Page 85: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Normalization – Step 1• Too many hb(t)’s

– Impose constraint:• hb(t) is a cubic spline with 5 df on [1,25]

• Forces neighboring values of h to be close• Allows variation in smoothness (unlike loess)

• Subtract fitted values from signal

• BUT: bias still present

Page 86: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Summary Method

• Median Polish

• For arrays of numbers

• Iterative method

• Subtract medians of each row and each column (and accumulate) until medians converge

Page 87: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Normalization Step 2

• Fit bias function:

• But what is k?

Page 88: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

E-M Algorithm

• Systematic way to ‘guess and improve’

• Start with putative assignments to classes

• Estimate bias and fit

• Use residuals from fit to classify again

Page 89: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Final Step: Calling

• Separation in two-dimensional log-ratio space:

Page 90: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Further Questions

Page 91: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

BREAKReturn at 3:45

Page 92: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Exploratory Data Analysis

Page 93: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Clustering

• Can be useful to detect unsuspected affiliations among clinical samples or genes

• For most designed experiments significance testing or multivariate analysis are better choices

Page 94: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

When is Clustering Useful?

• To examine data for artifacts

• To try to identify sub-categories in undifferentiated clinical samples

• To impute roles / functions for genes by association

Page 95: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Caveats for Clustering

• Ordering of samples in dendrogram is arbitrary – not meaningful relation

• Distance metrics are usually arbitrary

• Correlation ‘distance’ usually improves if increase dynamic range

Samples 3 & 6 could be on the right

Page 96: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Multivariate Exploration

Page 97: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Example – Ape Brain Gene Exp.

• Enard et al, Science (2002)

• Nice separation between species

Page 98: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Caveats for PCA

• Multivariate methods are sensitive– Pick up subtle patterns – Can easily pick up artifacts in data

1. The clustering does not reflect phylogeny

2. The principal components don’t load heavily onto distinct subsets of genes

Page 99: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Further Questions

Page 100: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Quality assessment

Normalizing expression arrays

Normalizing other array types

Selecting genes

Selecting functional groups

Page 101: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Selecting Genes Multiple Comparisons Issues

ISMB 2008 Tutorial

Page 102: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Outline

• Multiple Testing– Family wide error rates– False discovery rates

• Application to microarray data– Practical issues– Computing FDR by permutation procedures– Conditioning t-scores

Page 103: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Goals

• To identify genes most likely to be changed or affected

• To prioritize candidates for focused follow-up studies

• To characterize functional changes consequent on changes in gene expression

Page 104: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Individual Variation Within Groups

• Numerous genes show high levels of inter-individual variation

• Level of variation depends on tissue also• Donors or experimental animals may be infected

or under social stress• Tissues are hypoxic or ischemic for variable

times before freezing• Tissues are mixtures of different types of cells –

proportions may vary• Inflammatory cells recruited differently

Page 105: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Technical Variation

• Temperature of hybridization• Amount of RNA• Degradation of RNA• Yield and lengths of cDNA conversion• Strength of ionic buffers• Stringency of wash• Scratches on chip• Random variations in number of molecules

Page 106: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Multiple comparisons

• Suppose no genes really changed– (as if random samples from same population)

• 10,000 genes on a chip

• Each gene has a 5% chance of exceeding the threshold at a p-value of .05– Type I error

• The test statistics for 500 genes should exceed .05 threshold ‘by chance’

Page 107: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Distributions of p-valuesRandom Data Real Microarray Data

Page 108: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Characterizing False Positives

• Family-Wide Error Rate (FWE)– probability of at least one false positive

arising from the selection procedure

• Strong control of FWE:– Bound on the probability

• False Discovery Rate: – Proportion of false positives arising from

selection procedure

Page 109: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

‘Corrected’ p-Values for FWE

• Sidak (exact correction for independent tests)– pi* = 1 – (1 – pi)N

– Still conservative if genes are co-regulated (correlated)

– pi* = 1 – (1 – Npi + …) gives Bonferroni

• Bonferroni correction– pi* = Npi, if Npi < 1, otherwise 1– Too conservative!

Page 110: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Increasing Power by Lower N

• If we select a small subset of genes for tests we will have better p-values

• Selecting well-measured or biologically more probable genes is reasonable– e.g. high S/N ratio, high variance

• Selecting genes by statistical criteria related to the test that will be performed is not reasonable.– e.g. selecting genes obtained by clustering

Page 111: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Simes’ Lemma

• Suppose we order the p-values from N independent tests using random data: – p(1), p(2), …, p(N) are order statistics from

uniform

• Pick a target threshold – Then: P( p(1) < /N | p(2) < 2 /N | p(3) < 3 /N | … ) =

• Hochberg’s procedure:– If any p(k) < k /N – Then select genes (1) to (k)

Page 112: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

False Discovery Rate

• At threshold t* what fraction of genes are likely to be true positives?

• Illustration: 10,000 independent genest p #sig E(FP) FDR*

1.96 .05 600 500 87%

2.57 .01 200 100 50%

3.29 .001 40 10 20%

In practice use permutation algorithm to compute FDR

Page 113: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Benjamini-Hochberg

• Can’t know what FDR is for a particular sample• B-H suggest procedure specifying Average FDR

– Order the p-values : p(1), p(2), …, p(N)

– If any p(k) < k /N

– Then select genes (1) to (k)

• q-value: smallest FDR at which the gene becomes ‘significant’

• NB: acceptable FDR may be much larger than acceptable p-value (e.g. 0.10 )

Page 114: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Practical Issues

• Actual proportion of false positives varies from data set to data set

• Mean FDR could be low but could be high in your data set

Page 115: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

The Effect of Correlation

• If all genes are uncorrelated, Sidak is exact• If all genes were perfectly correlated

– p-values for one are p-values for all – No multiple-comparisons correction needed

• Typical gene data is highly correlated – First eigenvalue of SVD may be more than half

the variance– Distribution of p-values may differ from uniform

• True FDR more variable

Page 116: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

P-Values from Correlated Genes

.5 .3 .9

.7 .03 .1

.4 .9 .05

.6 .8 .4

.2 .2 .9

Null distribution from

independent genes

.5 .3 .9

.5 .3 .9

.5 .3 .9

.5 .3 .9

.5 .3 .9

Null distribution from

perfectly correlated genes

Rows: genes; columns: samples;

entries: p-values from randomized distribution

.5 .3 .9

.45 .2 .95

.65 .25 .8

.4 .35 .75

.5 .4 .85

Null distribution from

highly correlated genes

Page 117: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Re-formulating the Question

• Independent: ~5% of genes exceed .05 threshold, all the time

• Perfectly correlated: all genes exceed .05 threshold ~5% of the time

• Partially correlated: .05 < f1 < 1 of genes exceeds .05 threshold, .05 < f2 < 1 of the cases

• New question: for a given f1 and , how likely is it that a fraction f1 of genes will exceed the threshold?

Page 118: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Symptoms of Correlated Tests

Typical P-value histograms with NO real changes

P-values from random but correlated data P-values from random but

correlated data -- permuted

Page 119: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Distributions of numbers of p-values below threshold

• 10,000 genes;

• 10,000 random drawings

• L: Uncorrelated R: Highly correlated

Page 120: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Permutation Tests

• We don’t know the true distribution of gene expression measures within groups

• We simulate the distribution of samples drawn from the same group by pooling the two groups, and selecting randomly two groups of the same size we are testing.

• Need at least 5 in each group to do this!

Page 121: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Permutation Tests – How To

• Suppose samples 1,2,…,10 are in group 1 and samples 11 – 20 are from group 2

• Permute 1,2,…,20: say– 13,4,7,20,9,11,17,3,8,19,2,5,16,14,6,18,12,15,10

– Construct t-scores for each gene based on these groups

• Repeat many times to obtain Null distribution of t-scores

• This will be a t-distribution original distribution has no outliers

Page 122: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Multivariate Permutation Tests

• Want a null distribution with same correlation structure as given data but no real differences between groups

• Permute group labels among samples

• redo tests with pseudo-groups

• repeat ad infinitum (10,000 times)

Page 123: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Critiques of Permutations

• Variances of permuted values for truly changed genes are inflated – artificially low p-values

• Permuted t -scores for many genes may be lower than from random samples from the same population

Page 124: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Recurrent Aberrations

• Many genetic diseases are highly variable

• Large numbers of genes show changes:– mutation (Sjoblom et al, Science 2006)– differential expression (many)– amplifications/ deletions (many) – methylation (Figueroa & Reimers, submitted)

• Few show consistency!

Page 125: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Categorical Analysis: Mutations

• Mutations occur randomly– But at different rates in different samples

• Null model IID occurrence: Poisson– Estimated parameters: sample mutation rate

• Compare numbers of genes mutated in k samples to prediction

Page 126: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Copy Number Patterns

• Neighboring loci are dependent because long regions typically get amplified/deleted

• Different samples have widely differing rates of genomic aberrations

• Null model (like permutation)– same number of aberrant regions and lengths

as actually occur in each sample – but occurring at random positions

Page 127: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Segmented Data

Page 128: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Recurrent Loci

Page 129: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Statistical Significance

• ‘Permute’ aberrations in all samples

• For each n: what fraction of loci end up with > n aberrations in each permutation?

• Select n such that fraction exceeding n is less than x% of that observed in y% of the permutations

Page 130: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Further Questions

Page 131: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Quality assessment

Normalizing expression arrays

Normalizing other array types

Selecting genes

Selecting functional groups

Page 132: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Goals

• Characterize biological meaning of joint changes in gene expression

• Organize expression (or other) changes into meaningful ‘chunks’ (themes)

• Identify crucial points in process where intervention could make a difference

• Why? Biology is Redundant! Often sets of genes doing related functions are changed

Page 133: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Gene Sets

• Gene Ontology– Biological Process– Molecular Function– Cellular Location

• Pathway Databases– KEGG– BioCarta– Broad Institute

Page 134: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Other Gene Sets

• Transcription factor targets– All the genes regulated by particular TF’s

• Protein complex components– Sets of genes whose protein products

function together• Ion channel receptors• RNA / DNA Polymerase

• Paralogs– Families of genes descended (in eukaryotic

times) from a common ancestor

Page 135: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Approaches

• Univariate:– Derive summary statistics for each gene

independently– Group statistics of genes by gene group

• Multivariate:– Analyze covariation of genes in groups across

individuals– More adaptable to continuous statistics

Page 136: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Univariate Approaches

• Discrete tests: enrichment for groups in gene lists– Select genes differentially expressed at some cutoff– For each gene group cross-tabulate– Test for significance (Hyper-geometric or Fisher test)

• Continuous tests: from gene scores to group scores– Compare distribution of scores within each group to

random selections– GSEA (Gene Set Enrichment Analysis)– PAGE (Parametric Analysis of Gene Expression)

Page 137: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Multivariate Approaches

• Classical multivariate methods– Multi-dimensional Scaling– Hotelling’s T2

• Informativeness– Topological score relative to network– Prediction by machine learning tool

• e.g. ‘random forest’

Page 138: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Contingency Table: Category X Significance

Signif. Genes

NS Genes

All

Group of Interest

k n-k n

Others K-k (N-n)-(K-k)

N-n

Totals K N-K N

P =

CombinatorialProbability

for each table

Page 139: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Categorical Analysis• Fisher’s Exact Test

– Of all tables with same margins, how many show as much or more enrichment?

• Tedious to compute when n or k are large

• Approximations– Binomial (when k / n is small)– Chi-square (when expected values > 5 )– G2 (log-likelihood ratio; compare to 2)

• Approx. not needed for modern computersSignif. genes NS Genes All

Group k n-k n

Others K-k (N-n)-(K-k) N-n

Totals K N-K N

Page 140: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Issues in Assessing Significance

• P-value or FDR?– FDR more powerful but strong correlations

• What is appropriate Null Distribution?– Random sets of genes? Or– Random assignments of samples?

• If a child category is significant, how to assess significance of parent category?– Include child category– Consider only genes outside child category

Page 141: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Critiques of Discrete Approach

• Discrete approaches are simple – often good for a first pass at a novel situation

• No use of size of change information

• Continuous procedures have twice the power of analogous discrete procedures on discretized continuous data

• No account of gene covariation – Covariation changes p-values greatly

Page 142: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Continuous Scores

• Gene Set Enrichment Analysis– Subramanian et al (2005) improves by hack – Mootha et al (2003) uses distributional test

• PAGE– Simple z-score test– Adds up changes in all genes in a pathway

• T-profiler– T-version of PAGE

Page 143: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

GSEA• Uses Kolmogorov-Smirnov (K-S) test of

distribution equality to compare t-scores for selected gene group with all genes

Page 144: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Group Z- or T- Scores

• Form z-scores for each gene: – Z = ( mG – m ) / SAll

• Each gene’s z-score (zi) is distributed N(0,1)

• If z-scores are assigned randomly in groups– Hence the sum over genes in a group G:

• Compare group scores to Normal distribution– Remember multiple comparisons! (many groups)

Gi

i NGz )1,0(~/

Page 145: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Issues for Pathway Methods

• How to assess significance?– Null distribution by permutations– Permute genes or samples?

• How to handle activators and inhibitors in the same pathway?– Variance Test– Other approaches

• Critique: all genes treated as independent– No account of gene covariation – Covariation changes p-values greatly

Page 146: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Further Questions

Page 147: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Multivariate Approaches to Gene Set Selection

Page 148: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Key Multivariate Ideas

• PCA (Principal Components Analysis)– Identify major directions of covariation in data

• MDS (Multi-dimensional Scaling)– Easy graphical representation of data in terms

of many variables

• Hotelling T2

– Test for differences between two sample means

Page 149: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

PCA

Three correlated variablesPCA1 lies along the direction ofmaximal correlation; PCA 2 atright angles with the next highest variation.

Page 150: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

MDS Representation of Expression in Pathways

• BAD pathwayNormal

IBC

Other BC

• Clear separation between groups

• Variation differences

Page 151: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

• Compute distance between sample means using (common) metric of covariation

• Where

• Multidimensional analog of t (actually F) statistic

Hotelling’s T2

Page 152: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Principles of Kong et al Method

• Normal covariation generally acts to preserve homeostasis

• The transcription of genes that participate in many processes will be changed

• The joint changes in genes will be most distinctive for those genes active in pathways that are working differently

Page 153: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Critiques of Hotelling’s T

• Small samples: unreliable estimates– N < p

• Estimates of x and not robust to outliers

• Assumes same covariance in each sample– = ? Usually not in disease

– Kong et al propose analog of Welch t-test– Permutation in samples for significance

Page 154: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Making it Stable

1. Insufficient information to capture all relationships – too much correlation!

– Power of Hotelling’s method comes from identifying directions of rare variation

– Many (spurious) directions of 0 variation

2. Random variation in data leads to random variation in PCA

• Regularization strategy: force covariance to be more like independent variables

Page 155: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Making it Robust

• Microarray data has many outliers

• Multivariate methods are very much distorted by outliers

• Robust estimates of covariance could give robust PCA

• Simple approach: trim outliers

Page 156: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Handling Changes of Covariance

• Power of Hotelling’s method comes from identifying directions of rare variation

• If one group shows little covariation in one direction but the other does – how to test for changes?

• If one group is control then its covariance should be taken as standard– Robust measure of means in both groups

Page 157: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Summary

• All arrays subject to systemic error – Often hidden from view!

• Best normalizations for expression arrays • Normalization methods for non-expression

arrays very different• Selecting individual genes problematic in

presence of correlated variation• Selection by pathways often more

powerful but often spurious

Page 158: Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

Additional Information

• An Opinionated Guide to Microarray Data Analysis (coming August 2008)

• www.people.vcu.edu/~mreimers/OGMDA

• BioConductor: www.bioconductor.org