microarray data analysis tutorial at ismb 2008 mark reimers virginia commonwealth university

Microarray Data Analysis

Tutorial at ISMB 2008

Mark Reimers

Virginia Commonwealth University

Quality assessment

Normalizing expression arrays

Normalizing other array types

Selecting genes

Identifying functional groups

Outline

Array Quality Assessment

ISMB 2008 Microarray Tutorial

Mark Reimers

Virginia Commonwealth University

Approaches to Quality Control

• Careful Technique

• Practice!

• Testing the RNA quality

• Examining the chips and images

• Spot metrics

• Statistical approaches– Examining deviations as functions of technical

variables

Simple QA for Spotted Arrays

• Spot Measures– Signal/Noise

• Foreground / background or – foreground / SD

– Uniformity– Spot Area

• Global Measures– Qualitative assessments – Averages of spot measures

• Inspect images for artifacts– Streaks of dye, scratches etc.

• Are there biases in regions?

Using Statistics for QA• Significant artifacts may not be obvious

from visual inspection or bulk statistics• General approach: plot deviations from

average or residuals from fit against any technical variable:– Intensity– CG content or Tm– Probe position relative to 3’ end (for poly-T

primed)– Location on chip

• Color-code deviations on chip image

Simple Portrait - Boxplots

• Boxplot of 16 chips from Cheung et al Nature 2005

45

67

89

Another Portrait - Densities

4 6 8 10 12 14

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Density Plots:before and after

log(Signal)

De

nsity

Chips

GSM25524.CELGSM25525.CELGSM25526.CELGSM25527.CELGSM25528.CELGSM25529.CELGSM25530.CELGSM25531.CEL

GSM25540.CELGSM25541.CELGSM25542.CELGSM25543.CELGSM25548.CELGSM25549.CELGSM25550.CELGSM25551.CEL

Ratio vs Intensity Plots: Saturation & Quenching

• Saturation– Decreasing rate of

binding of RNA at higher occupancies on probe

• Quenching:– Light emitted by one

dye molecule may be re-absorbed by a nearby dye molecule

– Then lost as heat– Effect proportional to

square of density

Plot of log ratio against average log intensity across chips

GSM25377 from the CEPH expression data GSE2552

How Much Variability on R-I?

• Ratio-Intensity plots for six arrays at random from Cheung et al Nature (2005)

Covariation with Probe Tm

• MAQC project

• Agilent 44K– Array 1C3– Performed by

Agilent

•Plot of log ratios to average against Tm •Bimodal distribution because two samples are very different

Covariation with Probe Position

• RNA degrades from 5’ end

• Intensity should decrease from 3’ end uniformly across chips

• affyRNAdeg plots in affy package

Plot of average intensity for each probe position across all genes against probe position

Spatial Variation Across Chips

Red/Green ratios show variation-probably concentrated

Ratios of ratios on slide to ratios on standard show consistent biases

In House Spotted Arrays

Ratio of ratios shows much clearer concentration of red spots on some slides

Note non-random but highly irregular concentration of red

Legend

Background Subtraction (1)

• We think that local background contributes to bias

• Does subtracting background remove bias?

Local off-spot background may not be the best estimate of spot background (non-specific hyb)

Spots BG subtracted

Background Subtraction (2)

Raw spot ratios show a mild bias relative to averageAfter subtracting a high green bg in the center a red bias results

Raw Ratios Background BG-subtracted

Other Bias Patterns

This spotted oligo array shows strong biases at the beginning and end of each print-tip group

The background shows a milder version of this effect

Subtracting background compensates for about half this effect

Processed Raw Spot Background

Local Bias on Affymetrix ChipsImage of raw data on a log2 scale shows striations but no obvious artifacts

Image of ratios of probes to standard shows a smudge

Non-coding probes

Images show high values as red, low values as yellow

Variation in Affy Chips

QC in Bioconductor

• Robust Multi-chip Analysis (RMA) – fits a linear model to each probe set– High residuals show regional patterns

High residuals in green

Available in affyPLM package at www.bioconductor.org

Portion of dChip QA image High residuals in pinkwww.dchip.org

Affy QC Metrics in Bioconductor

• affyPLM package fits probe level model to Affymetrix raw data

• NUSE - Normalized Unscaled Standard Errors – normalized relative to

each gene

• How many big errors?

Conclusions

• Technical bias is a significant source of error in microarray studies

• Technical bias can be visualized and quantified

• Normalization can only partially compensate these problems

• It is best to drop chips with extreme technical deviations

Are Microarrays Reliable?• Microarray studies have an uneven track

record of replication• Huang et al (2003) replicated few of the markers

for breast cancer survival identified by van t’Veer et al (2002)

• Two Science papers in 2003 on stem cell gene expression describing parallel experiments identify only 2 genes in common out of hundreds

• The MAQC project papers in fall 2006 claimed to validate microarrays. Measures of consistency were highly variable across any two platforms

• Maybe both are true: – careful microarray studies are accurate

Further Questions

Quality assessment



Selecting genes


Normalizing Expression Data

ISMB 2008 Tutorial

QA and Normalization

• Technical differences cause changes in measures

• QA flags big technical differences in arrays– Often sporadic – e.g. scratches on chip

• Normalization attempts to compensate for modest but systematic differences– e.g. intensity-dependent bias (quenching)

• Both must be done together

Variations in Technique with Broad Consequences for Measures

• Temperature of hybridization

• Amount of RNA

• Degradation of RNA

• Yield of conversion to cDNA or cRNA

• Yield of labeling reaction

• Strength of ionic buffers

• Stringency of wash

Many Normalization Approaches

• One Parameter– Total or median brightness

• Two parameter– Variance stabilizing– Lowess for two-color arrays

• Non-parametric– Distribution (quantile) matching

• Regression on technical covariates

• Can only measure relative levels of expression: per g RNA

• Assume: only difference between chips is due to different weights of RNA hybridized

• Set:

• For each chip, normalized values are:

• For 2-color cDNA use separate constant for each channel on each chip

• More consistent results if use a robust estimator, such as median or 1/3 – trimmed mean (Quackenbush): take mean of middle 2/3 of probes

N

ii

N

ii fCfC

1

greengreen

1

redred ;

One Parameter – Overall Mean

greengreen*

redred* /;/ CffCff iiii

One-Parameter Limitations

4 6 8 10 12 14

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Density Plots:before and after

log(Signal)

De

nsity

A centering transform can shift the density of log2 data but can’t get around the differences in shape of distributions

Intensity Dependent Bias

Ratio – Intensity (M-A) plot of raw data:

M = log2(R/G) ; A = (log2(R) + log2(G)) / 2

Global (lowess) Normalization

Same data set normalized by: Mnorm = M-c(A)

where c(A) is an intensity dependent functionestimated by local regression.

Print-tip Normalization

Separate lowess curves for each of 16 print-tips

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

Print-tip layout

Box plots for log ratios in each of 16 print-tip groups

Scaled Print-tip Normalization

There still remain apparent technical artifacts

Try to fix them by scaling after print-tip lowess: Mp,norm = sp·(Mp-cp(A));

Box plot after print-tip normalization Box plot after scaled print-tip normalization

Spatial Effects

No normalization Global normalization

Print-tip normalization Scaled Print-tip normalization

This is too Complex!

• We are piling fudge factor upon fudge factor

• We don’t know what errors these fudges are introducing…

• … and still haven’t removed all artifacts

• How about something simpler!

Quantile Normalization

• Currently most widely used ‘best’ method

• Implemented in BioC affy, oligo, limma

• Ignores causes of variation and technical covariates

• Shoehorns all data into the same shape distribution – matching quantiles

Motivation: Probe Intensities in 23 Replicates on Affy chips

Densities of intensities from GeneLogic spikein study; black is composite

Quantile Normalization (Irizarry et al 2002)

The mapping by quantile normalization

1. Map values in each curve separately to their quantile within the distribution

2. Map quantiles of each distribution to quantiles of the reference curve

Ratio Intensity Correction?

M-A plots of raw data from Affy chip pairs

Ratio-Intensity: After Quantile Norm

Critiques of Quantile Normalization

• Artificially compresses variation of highly expressed genes

• Confounds systematic changes due to cross-hybridization with changes in abundance to genes of low expression

• Induces artifactual correlations in low-intensity genes

Technical Variable Regression

• Hypothesis: 1. Most technical variation between chips is

caused by a few systemic factors

2. Probes with similar technical characteristics (Tm, position in gene, location on chip, intensity) will be distorted by similar amounts

• Normalization: estimate bias due to technical factors by local averaging of deviations from reference profile

Further Questions

Models for Multiple-Probe Affymetrix and NimbleGen Data

Many Probes for One Gene

GeneGeneSequenceSequence

Multiple Multiple oligo probesoligo probes

Perfect MatchPerfect MatchMismatchMismatch

5´5´ 3´3´

How to combine signals from multiple probes into a single gene abundance estimate?

Probe Variation

• Individual probes don’t agree on fold changes

• Probes vary by two orders of magnitude on each chip– CG content is most important factor in signal strength

Signal from 16 probes along one gene on one chip

Probe Measure Variation

•Typical probes are two orders of magnitude different!•CG content is most important factor•RNA target folding also affects hybridization

3x104

0

Many Approaches

• Affymetrix MicroArray Suite

• dChip - Li and Wong, HSPH

• Bioconductor: – RMA - Bolstad, Irizarry, Speed, et al– affyPLM – Bolstad– gcRMA – Wu

• Physical chemistry models – Zhang et al

Critique of Averaging (MAS5)

• Not clear what an average of different probes should mean

• Tukey bi-weight can be unstable when data cluster at either end – frequently the conditions here

• No ‘learning’ based on cross-chip performance of individual probes

Motivation for multi-chip models:

Probe level data from spike-in study ( log scale ) note parallel trend of all probes

Courtesy of Terry Speed

Model for Probe Signal

• Each probe signal is proportional to – i) the amount of target sample – a – ii) the affinity of the specific probe sequence to the target – f

• NB: High affinity is not the same as Specificity– Probe can give high signal to intended target and also to

other transcripts

a1

a2

Probes 1 2 3

chip 1

chip 2 f1 f2 f3

Multiplicative Model

• For each gene, a set of probes p1,…,pk

• Each probe pj binds the gene with efficiency fj

• In each sample there is an amount ai.

• Probe intensity should be proportional to fjxai

• Always some noise!

Robust Linear Models

• Criterion of fit– Least median squares– Sum of weighted squares– Least squares and throw out outliers

• Method for finding fit– High-dimensional search – Iteratively re-weighted least squares– Median Polish

• For each probe set, take the log transform of

PMij = aifj:

• i.e. fit the model:

• Fit this additive model by iteratively re-weighted least-squares or median polish

ijjiij baPM )bg(log

Bolstad, Irizarry, Speed – (RMA)

Critique: Model assumes probe noise is constant (homoschedastic) on log scale

After normalization!

)log()log()(log jiij faPM

Comparing Measures

20 replicate arrays – variance should be smallStandard deviations of expression estimates on arraysarranged in four groups of genesby increasing mean expression level

Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA

Courtesy of Terry Speed

Current Multi-chip Approaches

• PLIER – Affymetrix’ own multi-chip model• RMA – Bioconductor standard

– Affy package– To be replaced by oligo package?

• affyPLM– More sophisticated RMA– Careful with more sophisticated models!

• gcRMA– Estimates background

Quality assessment



Selecting genes


Further Questions

Outline of Section

• General Issues

• Normalizing CGH arrays

• Normalizing ChIP-chip arrays

• Normalizing SNP arrays

• Normalizing methylation (HELP) arrays

General Issues

• Quantile (and other expression normalizations) assume that overall distribution of values is invariant across samples– Roughly true for most expression sets– Untrue for copy number, DNA methylation, ChiP-chip

data

• Many choices for expression probes• Tiling, exon, and promoter arrays must work with

what is there

Array CGH Technology

Array CGH Data:Chromosome 8 (241 BAC’s)

10 cell lines and 45 tumor samples

CGH Array Pre-processing

• Dye bias effects are present but because the ranges are usually smaller they have less impact– Most programs don’t do a dye bias adjustment

• Probe bias effects– Most probes show consistent biases in

relation to their neighbors– About 50% of total variance is such bias– Few programs adjust for that– See Bilke et al (Bioinformatics 2005)

CGH Array Segmentation

• Key idea: Most probe targets have same copy number as their next neighbors

• Can average over neighbors

• Key issue: when is a difference real?

• Recommended Programs:

• DNACopy – Solid statistical basis; slow

• StepGram – Heuristic ; fast

Further Questions

ChIP-chips

Chromatin Immuno-precipitation

Tiling Array

• One probe every n base pairs over some length of chromosome

What the data look like

Superimposing channels

Giresi et al, Genome Res. 10

Methods and Issues

• Normalization– Different enrichment ratios– Different probe thermodynamics– Dye and probe bias

• Estimation– Categorical or continuous?– Individual values are noisy:

• Where is the peak?

TileMap

• Ignores normalization

• ‘Shrinkage’ estimator of variance– Improves individual scores

• Smooths noise by moving average

• Alternative HMM

MAT

• One color array– Needs separate array for comparison

• Normalizes probe thermodynamics & enrichment ratio

• Estimation by (robust) moving average

Further Questions

SNP chips

Affy SNP chips

SNP Chip Probe Design

• 10 25-mers overlapping the SNP

• Alleles A & B• Sense and Anti-sense

– or PM and MM (old)

RMA for SNP chips

• Early Affy software wasn’t very accurate• Rabbee & Speed (2006) proposed RLMM,

an RMA-like method using:– Quantile normalization– Two variables ( A & B signals)– Discriminant analysis (Mahalanobis metric) to

tell apart alleles

• Much better than Affy software• Has been adopted by Affy

Discriminating SNPs

SNP A 1753037 SNP A 1698299

Improvements

• RLMM still had problems with several hundred SNPs

• Also problems comparing across labs

• CRLMM addresses these by careful normalization

• Model intensity on each chip separately by

• Summarize ,

, ,

by median polish: M+ = -

; M- =

-

• Model log ratio bias on each chip by

• Estimate log ratio bias using E-M

– Zi indexes which SNP state is likely

CRLMM Overview

Normalization – Step 1

• Regress (PM) intensity on sequence predictors and fragment length

Normalization – Step 1• Too many hb(t)’s

– Impose constraint:• hb(t) is a cubic spline with 5 df on [1,25]

• Forces neighboring values of h to be close• Allows variation in smoothness (unlike loess)

• Subtract fitted values from signal

• BUT: bias still present

Summary Method

• Median Polish

• For arrays of numbers

• Iterative method

• Subtract medians of each row and each column (and accumulate) until medians converge

Normalization Step 2

• Fit bias function:

• But what is k?

E-M Algorithm

• Systematic way to ‘guess and improve’

• Start with putative assignments to classes

• Estimate bias and fit

• Use residuals from fit to classify again

Final Step: Calling

• Separation in two-dimensional log-ratio space:

Further Questions

BREAKReturn at 3:45

Exploratory Data Analysis

Clustering

• Can be useful to detect unsuspected affiliations among clinical samples or genes

• For most designed experiments significance testing or multivariate analysis are better choices

When is Clustering Useful?

• To examine data for artifacts

• To try to identify sub-categories in undifferentiated clinical samples

• To impute roles / functions for genes by association

Caveats for Clustering

• Ordering of samples in dendrogram is arbitrary – not meaningful relation

• Distance metrics are usually arbitrary

• Correlation ‘distance’ usually improves if increase dynamic range

Samples 3 & 6 could be on the right

Multivariate Exploration

Example – Ape Brain Gene Exp.

• Enard et al, Science (2002)

• Nice separation between species

Caveats for PCA

• Multivariate methods are sensitive– Pick up subtle patterns – Can easily pick up artifacts in data

1. The clustering does not reflect phylogeny

2. The principal components don’t load heavily onto distinct subsets of genes

Further Questions

Quality assessment



Selecting genes

Selecting functional groups

Selecting Genes Multiple Comparisons Issues

ISMB 2008 Tutorial

Outline

• Multiple Testing– Family wide error rates– False discovery rates

• Application to microarray data– Practical issues– Computing FDR by permutation procedures– Conditioning t-scores

Goals

• To identify genes most likely to be changed or affected

• To prioritize candidates for focused follow-up studies

• To characterize functional changes consequent on changes in gene expression

Individual Variation Within Groups

• Numerous genes show high levels of inter-individual variation

• Level of variation depends on tissue also• Donors or experimental animals may be infected

or under social stress• Tissues are hypoxic or ischemic for variable

times before freezing• Tissues are mixtures of different types of cells –

proportions may vary• Inflammatory cells recruited differently

Technical Variation

• Temperature of hybridization• Amount of RNA• Degradation of RNA• Yield and lengths of cDNA conversion• Strength of ionic buffers• Stringency of wash• Scratches on chip• Random variations in number of molecules

Multiple comparisons

• Suppose no genes really changed– (as if random samples from same population)

• 10,000 genes on a chip

• Each gene has a 5% chance of exceeding the threshold at a p-value of .05– Type I error

• The test statistics for 500 genes should exceed .05 threshold ‘by chance’

Distributions of p-valuesRandom Data Real Microarray Data

Characterizing False Positives

• Family-Wide Error Rate (FWE)– probability of at least one false positive

arising from the selection procedure

• Strong control of FWE:– Bound on the probability

• False Discovery Rate: – Proportion of false positives arising from

selection procedure

‘Corrected’ p-Values for FWE

• Sidak (exact correction for independent tests)– pi* = 1 – (1 – pi)N

– Still conservative if genes are co-regulated (correlated)

– pi* = 1 – (1 – Npi + …) gives Bonferroni

• Bonferroni correction– pi* = Npi, if Npi < 1, otherwise 1– Too conservative!

Increasing Power by Lower N

• If we select a small subset of genes for tests we will have better p-values

• Selecting well-measured or biologically more probable genes is reasonable– e.g. high S/N ratio, high variance

• Selecting genes by statistical criteria related to the test that will be performed is not reasonable.– e.g. selecting genes obtained by clustering

Simes’ Lemma

• Suppose we order the p-values from N independent tests using random data: – p(1), p(2), …, p(N) are order statistics from

uniform

• Pick a target threshold – Then: P( p(1) < /N | p(2) < 2 /N | p(3) < 3 /N | … ) =

• Hochberg’s procedure:– If any p(k) < k /N – Then select genes (1) to (k)

False Discovery Rate

• At threshold t* what fraction of genes are likely to be true positives?

• Illustration: 10,000 independent genest p #sig E(FP) FDR*

1.96 .05 600 500 87%

2.57 .01 200 100 50%

3.29 .001 40 10 20%

In practice use permutation algorithm to compute FDR

Benjamini-Hochberg

• Can’t know what FDR is for a particular sample• B-H suggest procedure specifying Average FDR

– Order the p-values : p(1), p(2), …, p(N)

– If any p(k) < k /N

– Then select genes (1) to (k)

• q-value: smallest FDR at which the gene becomes ‘significant’

• NB: acceptable FDR may be much larger than acceptable p-value (e.g. 0.10 )

Practical Issues

• Actual proportion of false positives varies from data set to data set

• Mean FDR could be low but could be high in your data set

The Effect of Correlation

• If all genes are uncorrelated, Sidak is exact• If all genes were perfectly correlated

– p-values for one are p-values for all – No multiple-comparisons correction needed

• Typical gene data is highly correlated – First eigenvalue of SVD may be more than half

the variance– Distribution of p-values may differ from uniform

• True FDR more variable

P-Values from Correlated Genes

.5 .3 .9

.7 .03 .1

.4 .9 .05

.6 .8 .4

.2 .2 .9

Null distribution from

independent genes

.5 .3 .9

.5 .3 .9

.5 .3 .9

.5 .3 .9

.5 .3 .9


perfectly correlated genes

Rows: genes; columns: samples;

entries: p-values from randomized distribution

.5 .3 .9

.45 .2 .95

.65 .25 .8

.4 .35 .75

.5 .4 .85


highly correlated genes

Re-formulating the Question

• Independent: ~5% of genes exceed .05 threshold, all the time

• Perfectly correlated: all genes exceed .05 threshold ~5% of the time

• Partially correlated: .05 < f1 < 1 of genes exceeds .05 threshold, .05 < f2 < 1 of the cases

• New question: for a given f1 and , how likely is it that a fraction f1 of genes will exceed the threshold?

Symptoms of Correlated Tests

Typical P-value histograms with NO real changes

P-values from random but correlated data P-values from random but

correlated data -- permuted

Distributions of numbers of p-values below threshold

• 10,000 genes;

• 10,000 random drawings

• L: Uncorrelated R: Highly correlated

Permutation Tests

• We don’t know the true distribution of gene expression measures within groups

• We simulate the distribution of samples drawn from the same group by pooling the two groups, and selecting randomly two groups of the same size we are testing.

• Need at least 5 in each group to do this!

Permutation Tests – How To

• Suppose samples 1,2,…,10 are in group 1 and samples 11 – 20 are from group 2

• Permute 1,2,…,20: say– 13,4,7,20,9,11,17,3,8,19,2,5,16,14,6,18,12,15,10

– Construct t-scores for each gene based on these groups

• Repeat many times to obtain Null distribution of t-scores

• This will be a t-distribution original distribution has no outliers

Multivariate Permutation Tests

• Want a null distribution with same correlation structure as given data but no real differences between groups

• Permute group labels among samples

• redo tests with pseudo-groups

• repeat ad infinitum (10,000 times)

Critiques of Permutations

• Variances of permuted values for truly changed genes are inflated – artificially low p-values

• Permuted t -scores for many genes may be lower than from random samples from the same population

Recurrent Aberrations

• Many genetic diseases are highly variable

• Large numbers of genes show changes:– mutation (Sjoblom et al, Science 2006)– differential expression (many)– amplifications/ deletions (many) – methylation (Figueroa & Reimers, submitted)

• Few show consistency!

Categorical Analysis: Mutations

• Mutations occur randomly– But at different rates in different samples

• Null model IID occurrence: Poisson– Estimated parameters: sample mutation rate

• Compare numbers of genes mutated in k samples to prediction

Copy Number Patterns

• Neighboring loci are dependent because long regions typically get amplified/deleted

• Different samples have widely differing rates of genomic aberrations

• Null model (like permutation)– same number of aberrant regions and lengths

as actually occur in each sample – but occurring at random positions

Segmented Data

Recurrent Loci

Statistical Significance

• ‘Permute’ aberrations in all samples

• For each n: what fraction of loci end up with > n aberrations in each permutation?

• Select n such that fraction exceeding n is less than x% of that observed in y% of the permutations

Further Questions

Quality assessment



Selecting genes

Selecting functional groups

Goals

• Characterize biological meaning of joint changes in gene expression

• Organize expression (or other) changes into meaningful ‘chunks’ (themes)

• Identify crucial points in process where intervention could make a difference

• Why? Biology is Redundant! Often sets of genes doing related functions are changed

Gene Sets

• Gene Ontology– Biological Process– Molecular Function– Cellular Location

• Pathway Databases– KEGG– BioCarta– Broad Institute

Other Gene Sets

• Transcription factor targets– All the genes regulated by particular TF’s

• Protein complex components– Sets of genes whose protein products

function together• Ion channel receptors• RNA / DNA Polymerase

• Paralogs– Families of genes descended (in eukaryotic

times) from a common ancestor

Approaches

• Univariate:– Derive summary statistics for each gene

independently– Group statistics of genes by gene group

• Multivariate:– Analyze covariation of genes in groups across

individuals– More adaptable to continuous statistics

Univariate Approaches

• Discrete tests: enrichment for groups in gene lists– Select genes differentially expressed at some cutoff– For each gene group cross-tabulate– Test for significance (Hyper-geometric or Fisher test)

• Continuous tests: from gene scores to group scores– Compare distribution of scores within each group to

random selections– GSEA (Gene Set Enrichment Analysis)– PAGE (Parametric Analysis of Gene Expression)

Multivariate Approaches

• Classical multivariate methods– Multi-dimensional Scaling– Hotelling’s T2

• Informativeness– Topological score relative to network– Prediction by machine learning tool

• e.g. ‘random forest’

Contingency Table: Category X Significance

Signif. Genes

NS Genes

All

Group of Interest

k n-k n

Others K-k (N-n)-(K-k)

N-n

Totals K N-K N

P =

CombinatorialProbability

for each table

Categorical Analysis• Fisher’s Exact Test

– Of all tables with same margins, how many show as much or more enrichment?

• Tedious to compute when n or k are large

• Approximations– Binomial (when k / n is small)– Chi-square (when expected values > 5 )– G2 (log-likelihood ratio; compare to 2)

• Approx. not needed for modern computersSignif. genes NS Genes All

Group k n-k n

Others K-k (N-n)-(K-k) N-n

Totals K N-K N

Issues in Assessing Significance

• P-value or FDR?– FDR more powerful but strong correlations

• What is appropriate Null Distribution?– Random sets of genes? Or– Random assignments of samples?

• If a child category is significant, how to assess significance of parent category?– Include child category– Consider only genes outside child category

Critiques of Discrete Approach

• Discrete approaches are simple – often good for a first pass at a novel situation

• No use of size of change information

• Continuous procedures have twice the power of analogous discrete procedures on discretized continuous data

• No account of gene covariation – Covariation changes p-values greatly

Continuous Scores

• Gene Set Enrichment Analysis– Subramanian et al (2005) improves by hack – Mootha et al (2003) uses distributional test

• PAGE– Simple z-score test– Adds up changes in all genes in a pathway

• T-profiler– T-version of PAGE

GSEA• Uses Kolmogorov-Smirnov (K-S) test of

distribution equality to compare t-scores for selected gene group with all genes

Group Z- or T- Scores

• Form z-scores for each gene: – Z = ( mG – m ) / SAll

• Each gene’s z-score (zi) is distributed N(0,1)

• If z-scores are assigned randomly in groups– Hence the sum over genes in a group G:

• Compare group scores to Normal distribution– Remember multiple comparisons! (many groups)

Gi

i NGz )1,0(~/

Issues for Pathway Methods

• How to assess significance?– Null distribution by permutations– Permute genes or samples?

• How to handle activators and inhibitors in the same pathway?– Variance Test– Other approaches

• Critique: all genes treated as independent– No account of gene covariation – Covariation changes p-values greatly

Further Questions

Multivariate Approaches to Gene Set Selection

Key Multivariate Ideas

• PCA (Principal Components Analysis)– Identify major directions of covariation in data

• MDS (Multi-dimensional Scaling)– Easy graphical representation of data in terms

of many variables

• Hotelling T2

– Test for differences between two sample means

PCA

Three correlated variablesPCA1 lies along the direction ofmaximal correlation; PCA 2 atright angles with the next highest variation.

MDS Representation of Expression in Pathways

• BAD pathwayNormal

IBC

Other BC

• Clear separation between groups

• Variation differences

• Compute distance between sample means using (common) metric of covariation

• Where

• Multidimensional analog of t (actually F) statistic

Hotelling’s T2

Principles of Kong et al Method

• Normal covariation generally acts to preserve homeostasis

• The transcription of genes that participate in many processes will be changed

• The joint changes in genes will be most distinctive for those genes active in pathways that are working differently

Critiques of Hotelling’s T

• Small samples: unreliable estimates– N < p

• Estimates of x and not robust to outliers

• Assumes same covariance in each sample– = ? Usually not in disease

– Kong et al propose analog of Welch t-test– Permutation in samples for significance

Making it Stable

1. Insufficient information to capture all relationships – too much correlation!

– Power of Hotelling’s method comes from identifying directions of rare variation

– Many (spurious) directions of 0 variation

2. Random variation in data leads to random variation in PCA

• Regularization strategy: force covariance to be more like independent variables

Making it Robust

• Microarray data has many outliers

• Multivariate methods are very much distorted by outliers

• Robust estimates of covariance could give robust PCA

• Simple approach: trim outliers

Handling Changes of Covariance

• Power of Hotelling’s method comes from identifying directions of rare variation

• If one group shows little covariation in one direction but the other does – how to test for changes?

• If one group is control then its covariance should be taken as standard– Robust measure of means in both groups

Summary

• All arrays subject to systemic error – Often hidden from view!

• Best normalizations for expression arrays • Normalization methods for non-expression

arrays very different• Selecting individual genes problematic in

presence of correlated variation• Selection by pathways often more

powerful but often spurious

Additional Information

• An Opinionated Guide to Microarray Data Analysis (coming August 2008)

• www.people.vcu.edu/~mreimers/OGMDA

• BioConductor: www.bioconductor.org

http://www.people.vcu.edu/~mreimers/OGMDA

microarray data analysis tutorial at ismb 2008 mark reimers virginia commonwealth university

Documents

raw spot ratios

local background

arraysratio of ratios

chipsredgreen ratios

concentratedratios of

effectsubtracting background

agilentplot of log ratios

average log intensity