limma = linear models for microarray data linear models and · pdf file ·...


1

1

Linear models and Limma

Københavns Universitet, 19 August 2009

Mark D. Robinson

Bioinformatics, Walter+Eliza Hall Institute

Epigenetics Laboratory, Garvan Institute

(with many slides taken from Gordon Smyth)

2

2

Limma = linear models for microarray data

o Morning Theory

- Introduction

- Background correction

- Moderated t-tests

- Simple linear models

o Morning Practical

- Demonstration of smoothing

- Limma objects (beta7)

- Background correction and normalization (beta7)

- Simple experimental designs

- 2-colour example (beta7)

- Affymetrix example (cancer)

o Afternoon Theory

- More advanced designs / linear modeling

- Moderated F-tests

- Gene set tests

- Other analyses limma can do

o Afternoon Practical

- Factorial design (estrogen)

- Gene set testing (cancer)

- Time course experiment (SAHA/depsipeptide)

3

3

Expression measures

For probe or gene g on array a:

Two-colour: $y_{ga} = \log_2(R/G)$

Affymetrix: $y_{ga}$ = log-intensity (summarized over probes)

Illumina: $y_{ga}$ = log-intensity (summarized over beads)

4

4

Questions of Interest

o What genes have changed in expression? (e.g. between disease/normal, affected by treatment)

- Gene discovery, differential expression

o Is a specified group of genes all up-regulated in a particular condition?

- Gene set differential expression

o Can the expression profile predict outcome?

- Class prediction, classification

o Are there tumour sub-types not previously identified? Do my genes group into previously undiscovered pathways?

- Class discovery, clustering

Today will cover first two questions - differential expression


5

5

Microarray Differential Expression Studies

o $10^3$ – $10^6$ genes/probes/exons on a chip or glass slide

o Inputs to limma: log-intensities (1-colour data) or log(R/G) log-ratios (for 2-colour data)

o Several steps to go from raw data to table of “expression”: background correction, normalization

o Idea: Fit a linear model to the expression data for each gene

6

6

Two-colour microarrays

http://en.wikipedia.org/wiki/DNA_microarray

7

7

Two-colour data: Log-Intensities

For each probe:

$R = \log_2(R_f - R_b)$

$G = \log_2(G_f - G_b)$

Various ways to calculate background.

Will often modify to ensure: $R_f - R_b > 0$ and $G_f - G_b > 0$.

8

8

Two-colour data: Means and Differences

For each probe:

$M = R - G$ (“minus”)

$A = (R + G)/2$ (“add”)
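A minimal base-R sketch of the two slides above: background-subtract, take log2, then form M and A. The intensity values and variable names are made up for illustration and are not from the course data.

```r
# Hypothetical foreground (f) and background (b) intensities for four probes
Rf <- c(1200, 340, 95, 5000); Rb <- c(100, 90, 80, 110)
Gf <- c(600, 330, 180, 1300); Gb <- c(95, 85, 75, 100)

# Background-corrected log2 intensities (all differences are positive here)
R <- log2(Rf - Rb)
G <- log2(Gf - Gb)

M <- R - G        # log-ratio ("minus")
A <- (R + G) / 2  # average log-intensity ("add")

data.frame(M = round(M, 2), A = round(A, 2))
```

Plotting M against A gives the MA plot of the following slides.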


9

9

Data Summaries

G1 R1 G2 R2 G3 R3 G4 R4

A1 M1 A2 M2 A3 M3 A4 M4

For each gene

10

10

MA Plot

11

11

Log-Ratios or Single Channel Intensities?

o Traditional analysis treats log-ratios M = log(R/G) as the primary data, i.e., gene expression measurements are relative

o Alternative approach treats individual channel intensities R and G as primary data, i.e., gene expression measures are absolute (Wolfinger, Churchill, Kerr)

o Single channel approach makes new analyses possible but

- makes stronger assumptions

- requires more complex models (mixed models in place of ordinary linear models) to accommodate correlation between R and G on same spot

12

12

BG correction affects DE results

o Importance of careful pre-processing and quality control cannot be over-emphasized for microarray data

o Can have dramatic effect on differential expression results

o Consider here the normexp method of adaptive background correction

- Background correction step of the RMA algorithm

- Can also be applied to two-colour data


13

13

Additive + multiplicative error model

Observe intensity for one probe on one array:

Intensity = background + signal

$I = B + S$

Background contributes additive errors; signal contributes multiplicative errors.

This idea underlies variance stabilizing transformations: vsn (two colour data) and vst (for Illumina data)

14

14

normexp convolution model

Intensity = Background + Signal

Background ~ $N(\mu, \sigma^2)$, Signal ~ Exponential($\alpha$)

15

15

Conditional expectation under normexp model

Then

with
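A sketch of the conditional expectation in base R. It assumes B ~ N(mu, sigma^2) and S exponential with mean alpha (this parameterization is an assumption here), in which case E(S | I = x) = m + sigma * phi(m/sigma) / Phi(m/sigma) with m = x - mu - sigma^2/alpha. The function name and the parameter values are illustrative only.

```r
# E(S | I = x) for I = B + S, B ~ N(mu, sigma^2), S ~ Exponential with mean alpha
normexp_signal <- function(x, mu, sigma, alpha) {
  m <- x - mu - sigma^2 / alpha
  z <- m / sigma
  # sigma * dnorm(z) / pnorm(z), evaluated on the log scale for stability
  m + sigma * exp(dnorm(z, log = TRUE) - pnorm(z, log.p = TRUE))
}

x <- c(-50, 0, 100, 1000)   # hypothetical background-subtracted intensities
normexp_signal(x, mu = 100, sigma = 20, alpha = 200)
```

The result is positive even when the observed intensity is below the background mean, and approaches x - mu - sigma^2/alpha at high intensities.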

16

16

normexp background correction

o Estimate the three parameters

o Replace I with E(S|I)

o For Affymetrix data, I is the “Perfect Match” data intensity

o For two-colour data, I=Rf-Rb or I=Gf-Gb

o In the RMA algorithm, parameter estimation uses an ad hoc density kernel method

o In limma (two colour), parameter estimation maximises the saddlepoint approximation to the likelihood


17

17

PM data on log2 scale: raw and fitted model

18

18

Background corrected intensity = E(Signal | Observed Intensity)

[Figure: E(Signal | Intensity) plotted against Observed Intensity]

Adaptive background correction produces positive signal

19

19

Offsets to stabilise the variance

Offset reduces variability at low intensities

Log-ratios

Background correction

20

20

Why do offsets stabilize the variance?


21

21

Why do offsets stabilize the variance?

o log2( 800/100 ) = log2 ( 8/1 ) = 3 (8-fold)

o Additive noise affects small numbers more

o Offsets introduce bias:

o log2[(80+10)/(10+10)] = 2.17

o But the tradeoff (drop in variance for increase in bias) is usually worth it
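The bias and variance effects can be checked with a few lines of base R. The helper name, the offset of 10 and the perturbation of plus or minus 5 units are illustrative assumptions.

```r
offset_logratio <- function(R, G, offset = 0) log2((R + offset) / (G + offset))

# Bias: a true 8-fold change is reported as a smaller fold change
offset_logratio(80, 10)                # log2(8)
offset_logratio(80, 10, offset = 10)   # log2(90/20), about 2.17

# Variance: perturb the low green intensity by +/- 5 units of additive noise
G_noisy <- c(5, 15)
spread_raw    <- diff(range(offset_logratio(80, G_noisy)))
spread_offset <- diff(range(offset_logratio(80, G_noisy, offset = 10)))
c(raw = spread_raw, offset = spread_offset)
```

The same additive perturbation moves the log-ratio much less once the offset has been added, at the price of shrinking the log-ratio towards zero.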

22

22

A self-self experiment: two background methods

553 spots not plotted

First statistical point – choose background correction to stabilise the variance as a function of intensity

23

23

Comparison of 2-colour BG correction methods

[Figure: false discoveries (limma) plotted against genes selected; Ritchie et al. 2007]

24

24

References

o Silver et al. (2009). Microarray background correction: maximum likelihood estimation for the normal-exponential convolution. Biostatistics. [complete mathematical development of the saddle point approximation]

o Ritchie et al. (2007). A comparison of background correction methods for two-colour microarrays. Bioinformatics. [shows “normexp” performs best for 2-colour data]

o Irizarry et al. (2003). Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics. [Describes RMA BG correction, but doesn't give much detail of the normexp convolution model.]


25

25

Normalization

26

26

Normalization

Two-colour

BG correction and normalization are closely connected

Even after BG correction, some effects remain.

27

27

Normalization

One-colour

Similarly for single channel data, adjustments need to be made for all samples to be comparable.

28

28

Moderated t-tests


29

29

Borrowing information across genes

o Small data sets: few arrays, inference for individual genes is uncertain

o Curse of dimensionality: many tests, need to adjust for multiple testing, loss of power

o Benefit of parallelism: same model is fitted for every gene. Can borrow information from one gene to another

30

30

Hard and soft shrinkage

o Hard: simplest way to borrow information is to assume that one or more parameters are constant across genes

o Soft: smooth genewise parameters towards a common value in a graduated way, e.g., Bayes, empirical Bayes, Stein shrinkage …

31

31

A very common experiment

Wild-type mouse x 2 Mutant mouse x 2

Which genes are differentially expressed?

n1 = n2 = 2 Affymetrix arrays

25,000 probe-sets

Gene X

32

32

Ordinary t-tests

give very high false discovery rates

Residual df = 2


33

33

Another very common experiment

Which genes are differentially expressed?

n = 2 two-colour arrays

30,000 probes

Wild-type mouse 1 Mutant mouse 1

Wild-type mouse 2 Mutant mouse 2

34

34

Ordinary t-tests

give very high false discovery rates

Residual df = 1

35

35

Small sample size, many tests

The problem:

o These experiments would be under-powered even with just one gene

o Yet we want to test differential expression for each of 50k genes, hence lots of multiple testing and further loss of power

The solution:

The same statistical model is being fitted for every gene in parallel. Can borrow strength from other genes.

36

36

t-tests with common variance

with residual standard deviation pooled across genes

More stable, but ignores gene-specific variability


37

37

A better compromise

Moderated t-statistics

Shrink standard deviations towards common value

d = degrees of freedom
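A base-R sketch of the shrinkage, using the posterior variance of Smyth (2004): the moderated variance is a degrees-of-freedom-weighted average of the prior value s0^2 and the genewise s^2. Here the prior values `s0sq` and `d0` are simply assumed; in practice they are estimated from all genes. The function name, argument names and numbers are illustrative.

```r
# v is the unscaled variance of the coefficient, e.g. 1/n1 + 1/n2
moderated_t <- function(betahat, s2, v, d, s0sq, d0) {
  s2_post <- (d0 * s0sq + d * s2) / (d0 + d)   # shrunken variance
  betahat / sqrt(s2_post * v)
}

# One gene with an unusually small sample variance (2 + 2 arrays, so d = 2)
betahat <- 1; s2 <- 0.01; v <- 1/2 + 1/2; d <- 2
s0sq <- 0.05; d0 <- 4                          # assumed prior values

t_ordinary  <- betahat / sqrt(s2 * v)
t_pooled    <- betahat / sqrt(s0sq * v)
t_moderated <- moderated_t(betahat, s2, v, d, s0sq, d0)
c(ordinary = t_ordinary, pooled = t_pooled, moderated = t_moderated)
```

The moderated statistic falls between the pooled and the ordinary t; with d0 = 0 it reduces to the ordinary t, and as d0 grows it approaches the pooled version.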

38

38

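Shrinkage of standard deviations

[Figure: the genewise standard deviations $s_1, s_2, \ldots, s_G$ are shrunk towards the common value $s_0$ to give $\tilde s_1, \tilde s_2, \ldots, \tilde s_G$; the moderated statistic $\tilde t_g$ lies between $t_{g,\mathrm{pooled}}$ and $t_g$, according to $d$ and $d_0$]

The data decides whether $\tilde t_g$ should be closer to $t_{g,\mathrm{pooled}}$ or to $t_g$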

39

39

Why does it work?

o We learn what is the typical variability level by looking at all genes, but allow some flexibility from this for individual genes

o Adaptive

40

40

Hierarchical model for variances

Data

Prior

Posterior
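For reference, the three ingredients can be written out following Smyth (2004), which is cited on the next slide. The notation ($s_g^2$ for the genewise sample variance on $d_g$ degrees of freedom, $s_0^2$ and $d_0$ for the prior) is that of the paper and is assumed to match the slide.

```latex
\begin{align*}
\text{Data:}      \quad & s_g^2 \mid \sigma_g^2 \sim \frac{\sigma_g^2}{d_g}\,\chi^2_{d_g} \\
\text{Prior:}     \quad & \frac{1}{\sigma_g^2} \sim \frac{1}{d_0 s_0^2}\,\chi^2_{d_0} \\
\text{Posterior:} \quad & E\!\left(\frac{1}{\sigma_g^2} \,\Big|\, s_g^2\right) = \frac{1}{\tilde s_g^2},
  \qquad \tilde s_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g}
\end{align*}
```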


41

41

Posterior Statistics

Moderated t-statistics

Posterior variance estimators

Baldi & Long 2001, Wright & Simon 2003, Smyth 2004

42

42

Exact distribution for moderated t

An unexpected piece of mathematics shows that, under the null hypothesis,

$\tilde t_g \sim t_{d_0 + d_g}$

The degrees of freedom add!

The Bayes prior in effect adds $d_0$ extra arrays for estimating the variance.

Wright and Simon 2003, Smyth 2004
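The practical effect of the added degrees of freedom shows up in the p-values. A small base-R illustration, holding the statistic fixed at 4 purely for comparison (in a real analysis the moderated and ordinary statistics differ) and assuming a prior d0 of 4:

```r
t_stat <- 4
d_g <- 2   # residual df from 2 + 2 arrays
d0  <- 4   # assumed prior df

p_ordinary  <- 2 * pt(-abs(t_stat), df = d_g)
p_moderated <- 2 * pt(-abs(t_stat), df = d0 + d_g)
c(ordinary = p_ordinary, moderated = p_moderated)
```

With only 2 degrees of freedom the reference distribution has very heavy tails, so the same statistic is far less significant than with 6.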

43

43

More on empirical Bayes statistics

44

44

Hierarchical model for means

Data

Prior

Lönnstedt and Speed 2002, Smyth 2004


45

45

Posterior Odds

Posterior odds of differential expression

Lönnstedt and Speed 2002, Smyth 2004

Monotonic function of $|\tilde t|$

Hence $\tilde t$ gives the best possible ranking of genes

46

46

Estimating Hyper-Parameters

Closed form estimators with good properties are available:

for $c_0$ in terms of quantiles of the $|\tilde t_g|$

for $s_0$ and $d_0$ in terms of the first two moments of $\log s^2$

47

47

Marginal Distributions

Under usual likelihood model, $s_g$ is independent of the estimated coefficients.

Under the hierarchical model, $s_g$ is independent of the moderated t-statistics instead:

$$\tilde t_g \sim \begin{cases} t_{d_0+d} & \text{with prob } 1-p \\ \sqrt{1 + c_0/c}\; t_{d_0+d} & \text{with prob } p \end{cases}$$

48

48

Moment estimators for s0 and d0

Marginal moments of $\log s^2$ lead to estimators of $s_0$ and $d_0$:

Estimate d0 by solving

where

Finally
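A base-R sketch of these moment estimators, following the derivation in Smyth (2004) and assuming every gene has the same residual degrees of freedom d. With e_g = log s_g^2 - digamma(d/2) + log(d/2), the variance of e_g is trigamma(d/2) + trigamma(d0/2), which is solved for d0; s0^2 then follows from the mean of e_g. The function name and the simulation settings are illustrative; limma has its own implementation.

```r
estimate_prior <- function(s2, d) {
  e <- log(s2) - digamma(d / 2) + log(d / 2)
  target <- var(e) - trigamma(d / 2)        # estimates trigamma(d0 / 2)
  if (target <= trigamma(1e6)) {
    return(list(d0 = Inf, s0sq = exp(mean(e))))   # no extra spread: d0 infinite
  }
  half_d0 <- uniroot(function(x) trigamma(x) - target,
                     interval = c(1e-3, 1e6), tol = 1e-10)$root
  d0 <- 2 * half_d0
  list(d0 = d0, s0sq = exp(mean(e) + digamma(d0 / 2) - log(d0 / 2)))
}

# Simulate from the hierarchical model and recover the hyper-parameters
set.seed(1)
G <- 20000; d <- 4; d0_true <- 4; s0sq_true <- 0.05
sigma2 <- d0_true * s0sq_true / rchisq(G, df = d0_true)  # prior on variances
s2 <- sigma2 * rchisq(G, df = d) / d                     # sample variances
est <- estimate_prior(s2, d)
est
```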


49

49

Quantile Estimation of c0

Let r be the rank of $|\tilde t_g|$ in descending order, and let F(;) be the distribution function of the t-distribution. Can estimate $c_0$ by equating empirical to theoretical quantiles:

Get overall estimator of $c_0$ by averaging the individual estimators from the top p/2 proportion of the $|\tilde t_g|$

50

50

Short note on multiple testing

51

51

Multiple testing and adjusted p-values

o Traditional method in statistics is to control family-wise error rate, e.g., by Bonferroni.

o Holm’s method is an improved (step-down) modification of Bonferroni.

o Controlling the false discovery rate (FDR) is more appropriate in microarray studies

o Benjamini and Hochberg method controls expected FDR for independent or weakly dependent test statistics. Simulation studies support use for microarray data.

o All methods can be implemented in terms of adjusted p-values.
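All three adjustments are available through `p.adjust()` in base R. A short comparison on five made-up p-values:

```r
p <- c(0.001, 0.008, 0.039, 0.041, 0.6)   # hypothetical raw p-values

adj <- data.frame(
  raw        = p,
  bonferroni = p.adjust(p, method = "bonferroni"),
  holm       = p.adjust(p, method = "holm"),
  BH         = p.adjust(p, method = "BH")
)
adj

sum(adj$BH < 0.1)   # number of genes selected at 10% FDR
```

For every gene the BH-adjusted value is no larger than the Holm value, which in turn is no larger than the Bonferroni value, reflecting the progressively weaker error criterion.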

52

52

End of morning theory - Summary

o Background correction, normalization are important considerations -- normexp, offsets

o Moderation generally helps -- moderation of variances is very effective

o Convenient model gives known null distribution

o Multiple testing


53

53

Linear models

54

54

More complex experiments

o More complex microarray experiments can be represented by linear models

o For one-channel platforms, the linear model can be set up using the usual univariate linear model formulae

o For two-colour platforms, the linear models have some special properties

55

55

Linear Models

o In general, need to specify:

- Dependent variable

- Explanatory variables (experimental design, covariates, etc.)

o More generally: $E(y) = X\beta$

- y: vector of observed data

- X: design matrix

- $\beta$: vector of parameters to estimate

56

56

Linear Models for microarrays

o Analyse all arrays together combining information in optimal way

o Combined estimation of precision

o Extensible to arbitrarily complicated experiments

o Design matrix: specifies RNA targets used on arrays

o Contrast matrix: specifies which comparisons are of interest


57

57

Design → Linear models

Wild-type mouse x 2 Mutant mouse x 2

$\beta_1$ = wt log-expression

$\beta_2$ = mutant − wt

$E[y_1] = E[y_2] = \beta_1$, $E[y_3] = E[y_4] = \beta_1 + \beta_2$

Design matrices for 1-colour arrays are easier to specify!
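For this one-colour design the design matrix comes straight from `model.matrix()`. A base-R sketch for a single gene; the expression values are made up, and limma's `lmFit()` fits the same kind of model to every gene at once.

```r
group  <- factor(c("wt", "wt", "mu", "mu"), levels = c("wt", "mu"))
design <- model.matrix(~ group)   # column 1: wt level, column 2: mu - wt
design

y   <- c(7.1, 6.9, 9.0, 9.2)      # hypothetical log-expression for one gene
fit <- lm.fit(design, y)
fit$coefficients                  # wt mean, and mutant minus wt
fit$df.residual                   # 4 arrays - 2 coefficients
```

The residual degrees of freedom match the "Residual df = 2" quoted for this experiment on the earlier slide.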

58

58

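Designs → Linear Models

Each array gives $y = \log_2(R/G)$.

Direct comparison of A and B (arrays 1 and 2, dye-swapped):

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix} \beta, \qquad \beta = B - A$$

Ref, A and B (arrays 1 to 3):

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \qquad \beta_1 = A - \mathrm{Ref}, \quad \beta_2 = B - A$$

A, B and C (arrays 1 to 3):

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \qquad \beta_1 = B - A, \quad \beta_2 = C - A$$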

59

59

Matrix Multiplication

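Direct comparison of A and B (arrays 1 and 2):

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix} \beta = \begin{pmatrix} \beta \\ -\beta \end{pmatrix}$$

Contrast: $\beta = B - A$

Ref, A and B (arrays 1 to 3), with $\beta_1 = A - \mathrm{Ref}$ and $\beta_2 = B - A$:

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \beta_1 \\ -\beta_1 \\ \beta_1 + \beta_2 \end{pmatrix}$$

A, B and C (arrays 1 to 3), with $\beta_1 = B - A$ and $\beta_2 = C - A$:

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \beta_1 \\ -\beta_1 + \beta_2 \\ -\beta_2 \end{pmatrix}$$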
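The matrix products can be checked numerically in base R. This sketch uses the three-array design over A, B and C with coefficients B - A and C - A; the array orientations in the comments and the coefficient values are assumptions made for illustration.

```r
X <- rbind(c( 1,  0),   # array 1: B vs A, expected log-ratio  beta1
           c(-1,  1),   # array 2: C vs B, expected log-ratio -beta1 + beta2
           c( 0, -1))   # array 3: A vs C, expected log-ratio -beta2
colnames(X) <- c("B-A", "C-A")

beta <- c(1, 3)             # hypothetical true coefficients
y <- drop(X %*% beta)       # expected log-ratios for the three arrays
y

# Least squares recovers the coefficients from noiseless data
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))

# Any other comparison is a contrast of the coefficients, e.g. C - B
contrast <- c(-1, 1)
sum(contrast * beta_hat)
```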

60

60

Linear Model Estimates

Obtain a linear model for each gene g

Estimate models to get

coefficients

standard deviations

standard errors


61

61

Contrasts

A contrast is any linear combination of the coefficients $\beta_j$ which we want to test equal to zero.

Define contrasts

Want to test

vs

where C is the contrast matrix.

62

62

Pax5: example of saturated design

[Figure: four RNA sources (Wt, Pax5-/-, IL-7 removed, Rag1-/-) connected by 12 arrays, numbered 1 to 12]

Robust design – can tolerate failure of some of the arrays

63

63

Regression Analysis

Choose 3 comparisons between the 4 RNA sources to be the coefficients of the linear model, e.g.,

- PW: Pax5-/- vs Wt

- RW: Rag1-/- vs Wt

- IW: IL-7 withdrawn vs Wt

For each gene, fit a linear model with a coefficient for each contrast

Any other comparisons of interest can be extracted from the linear model as contrasts

64

64

$$E\begin{pmatrix} m_1 \\ m_2 \\ \vdots \\ m_{12} \end{pmatrix} = X \begin{pmatrix} PW \\ RW \\ IW \end{pmatrix}$$

where X is the 12 × 3 design matrix.

Exercise: Fill in the design matrix.

Can be fitted using robust regression, but problems with se’s, as Patty Solomon has observed


65

65

Example of factorial design

Expected log-expression for each RNA source:

WT.P1: µ

MT.P1: µ + b

WT.P11: µ + a1

MT.P11: µ + a1 + b + a1.b

WT.P21: µ + a1 + a2

MT.P21: µ + (a1+a2) + b + (a1+a2)b

[Figure: seven arrays, numbered 1 to 7, connecting the six RNA sources; the log-ratios $y_1, \ldots, y_7$ are related to the parameters through a 7 × 5 design matrix]

66

66

Moderated F-statistics

67

67

Moderated F-Statistic

Moderated F-statistic

MST=Mean Sum of squares between Treatments

The idea of shrinking the variance extends immediately to multiple contrasts

Wright & Simon 2003, Smyth 2004

68

68

Doubly shrunk F statistic

The moderated F is not a monotonic function of the posterior odds

A doubly shrunk F statistic can be shown to have the desired relationship to the posterior odds

Tai and Speed 2006, 2007

Improves further the gene ranking


69

69

Single or double shrinkage

o Shrinking the variances only is enough when comparing two groups

o When comparing 3 or more groups, further gain can be had by shrinking the $\beta$ also (recall Stein estimator needs at least 3 means)

70

70

Functional category analysis

71

71

Functional category analysis

o Used on a set of genes deemed to be differentially expressed

o Asks the question: is my set of genes enriched for a particular molecular function?

o Useful for establishing what pathways / types of genes are affected

o Nowadays largely superseded by gene set tests

72

72

Overlap statistics

o Question: Say you have a set of 85 genes (of a total 20000 genes) known to be associated with function X. Calculate the probability of randomly selecting 40 or more of those genes in a list of 100 DE genes.

o Answer: ?


73

73

Overlap statistics

o Question: Say you have a set of 85 genes (of a total 20000 genes) known to be associated with some function. Calculate the probability of randomly selecting 40 or more of those genes in a list of 100 DE genes.

o Answer: Hypergeometric (i.e. the “urn” problem).

N = 19915 “white”, m = 85 black, n = 100, k = 40
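In base R the tail probability comes from `phyper()`. Note that `phyper()` names its arguments differently from the slide: its `m` is the number of genes in the category (85), its `n` the number of other genes (19915), and its `k` the size of the DE list (100).

```r
# P(X >= 40) = P(X > 39): category genes among 100 drawn from 20000
p <- phyper(39, m = 85, n = 19915, k = 100, lower.tail = FALSE)
p

# Expected overlap under random selection, for comparison
100 * 85 / 20000
```

With an expected overlap of well under one gene, an overlap of 40 or more has a vanishingly small probability under random selection.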

74

74

Gene set tests

75

75

Gene sets

o Test significance of an (a priori specified) group of genes

o The genes might belong to a known pathway or might be the top genes from a related experiment

o The set might be significant even if individual genes are not

o Gene set enrichment analysis (GSEA) originated by Mootha et al PNAS 2003 and Subramanian et al PNAS 2005

76

76

Available gene set methods

o GSEA: gene set enrichment analysis. Complex method using Kolmogorov-Smirnov type tests and sample permutation. Needs two groups, many arrays, many genes and many sets.

o GSA: gene set analysis. Uses combination of permutation of samples and standardization across genes. More powerful. Still needs two groups, many genes and many sets.

o GST: gene set tests using Wilcoxon test. Uses randomization over genes. Applicable to linear models and small samples, but can be over-optimistic if the genes in the set are highly correlated.

o Now, rotation-based gene set tests.


77

77

Gene set tests

All microarray probes, ranked by a test statistic of interest: t1, t2, t3, t4, …

A priori subset of genes: X1, X2, X3 … Xn

Look for ranks for set genes amongst test statistics

78

78

Viewing gene sets

Cell adhesion genes

Genes regulated by MYB

79

79

What’s the hypothesis?

o Two major types of gene set tests: competitive or self-contained

o Competitive: Genes in the set tend to be more strongly DE than randomly chosen genes

o Self-contained: At least some genes in the set are truly DE

Goeman & Bühlmann, Bioinformatics 2007

80

80

Permutation

o Competitive gene set tests are usually tested by permuting genes, but this ignores inter-gene correlations

o Self-contained gene set tests are usually tested by permuting arrays, but this is limited to two-group comparisons with large numbers


81

81

Rotation gene set tests (ROAST)

o Self-contained hypothesis

o The first test suitable for small samples which correctly accounts for inter-gene correlation

o Can handle complex linear model designs, including array weights, random effects etc

82

82

Two steps: projection and rotation

o Project data onto space orthogonal to nuisance parameters in the linear model

o Random rotation of the orthogonal residuals provides fractional permutation, avoids granularity of p-values

o Assumes multivariate normality, but proves to be highly robust against deviations from normality

83

83

Set summary statistics

o Compute empirical Bayes t-statistics for each gene in set

o Convert to z-statistics

o Mean of $z^2$:

- good power, even when only a subset of genes respond

- not robust against non-normality

o Mean50: mean of top half of |z|:

- good power

- robust
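A base-R sketch of the two set summaries. The conversion from t to z used here (matching tail probabilities through `pt()` and `qnorm()`) is one reasonable choice and is an assumption; the helper names are made up. In ROAST the p-value for the summary then comes from the rotations, which are not shown.

```r
# Convert t-statistics on df degrees of freedom to equivalent z-statistics
t_to_z <- function(t, df) qnorm(pt(t, df = df, log.p = TRUE), log.p = TRUE)

# Set summaries: mean of z^2, and mean of the top half of |z| ("Mean50")
set_stats <- function(z) {
  n_top <- ceiling(length(z) / 2)
  top_half <- sort(abs(z), decreasing = TRUE)[seq_len(n_top)]
  c(msq = mean(z^2), mean50 = mean(top_half))
}

z <- c(-3, 0.5, 2, -1)   # hypothetical z-statistics for a four-gene set
set_stats(z)
```

Both summaries ignore the sign of z, so a set with a mixture of up- and down-regulated genes still scores highly.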

84

84

References

o Subramanian et al (2005). A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102, 15545-15550.

o Efron and Tibshirani (2007). On testing the significance of sets of genes. Ann Appl Stat 1, 107-129. http://www-stat.stanford.edu/~tibs/GSA/


85

85

Multi-level models I: duplicate spots

86

86

Hard shrinking: examples

o Common correlation model for within-array duplicate spots

- Smyth et al (2005). The use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics 21, 2067-2075.

o Common correlation models for single channel analysis of two-color microarray data

- Smyth, G. K. (2005). Individual channel analysis of two-colour microarray data. 55th Session of the International Statistics Institute, 5-12 April 2005, Sydney, Australia.

87

87

Common correlation model

Given a blocking factor with variance component $\sigma_b^2$, focus on within-block correlations

Common correlation model assumes the within-block correlation is the same for every gene

Has proved effective for technical blocking factors for which correlations are high

88

88

Duplicate Spots

o If the clone library is not too large, it is often possible to print each gene more than once on an array

o Duplicates are always side-by-side or a fixed distance apart


89

89

Genes printed in duplicate pairs

One pair

90

90

Duplicate spots are correlated

o Duplicate spots are a form of technical replication, share lots of common causes

o Cannot be treated as replicates on separate arrays, log-ratios from duplicate spots are correlated

o How best to use duplicate spots? Usual approach is simply to average them

91

91

Common Correlation Model

o Assume the between-duplicate correlation is the same for every gene

o Justified by the belief that the correlation springs mainly from spatial proximity

o Improves estimation of variances

92

92

Consequences for individual genes

o If the number of genes is large, then the estimator of $\rho$ is very accurate, so $\rho$ may be treated as known as far as inference for each individual gene is concerned

o This doesn’t change estimation of $\beta_g$ but greatly changes estimation of $\sigma_g$


93

93

Validation with Spike-In Data

o Does the idea of using common correlations work in practice?

o Check the ability of the common-correlation t-statistic to distinguish calibration from ratio spike-in spots

o Scorecard system includes calibration controls, 3-fold up and down ratio controls, and 10-fold up and down ratio controls

94

94

95

95

96

96

Multilevel models II: Separate channel analysis of two-colour data


97

97

Separate channel analysis of two-colour microarray data

G1 R1 G2 R2 G3 R3 G4 R4

Each spot is a block of size 2

The two channels give correlated pair of values

Common correlation model sets intra-spot correlation equal across genes

98

98

Why Use Means and Differences?

If $\mathrm{var}(R) = \mathrm{var}(G)$ then $\mathrm{cor}(M, A) = 0$ regardless of the correlation between R and G.

An old idea dating back to Tukey, Altman & Bland
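The result follows in one line from the bilinearity of covariance, with $M = R - G$ and $A = (R + G)/2$:

```latex
\begin{align*}
\mathrm{cov}(M, A)
  &= \tfrac{1}{2}\,\mathrm{cov}(R - G,\; R + G) \\
  &= \tfrac{1}{2}\left[\mathrm{var}(R) + \mathrm{cov}(R, G) - \mathrm{cov}(G, R) - \mathrm{var}(G)\right] \\
  &= \tfrac{1}{2}\left[\mathrm{var}(R) - \mathrm{var}(G)\right] = 0
  \quad \text{when } \mathrm{var}(R) = \mathrm{var}(G).
\end{align*}
```

The cross terms cancel whatever the value of $\mathrm{cov}(R, G)$, which is why the correlation between the channels does not matter.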

99

99

Common Reference Experiment

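Arrays: Ref vs B, Ref vs C, Ref vs B, Ref vs C

Ref vs B arrays: $\mu_M = B - \mathrm{Ref}$, $\mu_A = (B + \mathrm{Ref})/2$

Ref vs C arrays: $\mu_M = C - \mathrm{Ref}$, $\mu_A = (C + \mathrm{Ref})/2$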

Why not use the A-values as well as M-values?

100

100

A simple normal model

$\rho_g$ is the intra-spot correlation

Gene g, array i


101

101

Models in terms of M and A

102

102

M and A parameters

$\mu_{Mgi} = \mu_{Rgi} - \mu_{Ggi}$

$\mu_{Agi} = (\mu_{Rgi} + \mu_{Ggi})/2$

$\sigma^2_{Mgi} = 2\sigma_g^2(1 - \rho_g)$

$\sigma^2_{Agi} = \sigma_g^2(1 + \rho_g)/2$

103

103

Correlation

M and A are independent but have different variances.

Have converted a correlated problem into a heteroscedastic problem

Can estimate correlation by estimating variances:

$$\rho_g = \tanh\left(\frac{1}{2}\log\frac{4\sigma^2_{Ag}}{\sigma^2_{Mg}}\right)$$
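A quick base-R check of the variance-to-correlation formula. With var(M) = 2 sigma^2 (1 - rho) and var(A) = sigma^2 (1 + rho)/2, the ratio 4 var(A)/var(M) equals (1 + rho)/(1 - rho), so tanh of half its log returns rho. The helper name is hypothetical (limma's own estimation routine is more elaborate), and the values of sigma^2 and rho are chosen for illustration.

```r
intraspot_cor <- function(var_M, var_A) tanh(0.5 * log(4 * var_A / var_M))

sigma2 <- 1; rho <- 0.85
var_M <- 2 * sigma2 * (1 - rho)    # variance of M = R - G
var_A <- sigma2 * (1 + rho) / 2    # variance of A = (R + G) / 2

intraspot_cor(var_M, var_A)        # recovers rho
```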

104

104

Common correlation model

o Assume the intra-spot correlation is constant across genes

o Justified by (i) variance components are observed to be positively correlated and (ii) standard errors for coefficients are not sensitive to correlation value

o Common correlation can then be assumed to be known at individual gene level

o Converts a mixed model into a weighted regression

lmscFit() in limma


105

105

ApoAI Experiment

8 wild type arrays and 8 ApoAI-/- arrays, all relative to a common reference

Median intraspot correlation is estimated as $\hat\rho = 0.85$

The efficiency gain from using A-values is $(1 - \hat\rho)/(1 + \hat\rho) = 0.08$

106

106

Individual Channel Normalization

Using A-values in the analysis requires that they be normalized to have comparable values between arrays: “single channel normalization”

For the ApoAI data:

Within-array loess normalization only: $\hat\rho = 0.85$

A-quantile normalization: $\hat\rho = 0.89$

Quantile normalization: $\hat\rho = 0.84$

107

107

Ignoring the Reference

Why not ignore the common reference channel? If we use only the red channel in the common reference experiment,

$$\mathrm{var}(\bar R_C - \bar R_B) = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$$

so

$$\frac{\mathrm{var}(\bar R_C - \bar R_B)}{\mathrm{var}(\bar M_C - \bar M_B)} = \frac{1}{2(1 - \rho)}$$

so adjusting for a common reference is worthwhile whenever $\rho > 0.5$

108

108

Disconnected Design

B C

D E

A-values make no contribution to estimating B vs C comparison (direct)

M-values make no contribution to estimating B vs D comparison (indirect)

Relative efficiency of indirect comparison compared to direct is $1 - \rho$


109

109

Observed false discovery rates

[Figure: known false positives (0 to 400) plotted against tests rejected (0 to 4000) for three methods: Mixed model, Ignore Cor, SpotCor]

Data from Holloway et al (2006) BMC Bioinformatics 7, Article 511.

110

110

Why common correlation is effective

o Bias introduced into the variance estimation seems to be offset by increase in precision

o Great simplification of mathematical model

o Penalizes genes with large within-block variances

111

111

A Shrinkage Hierarchy

o Fold changes – shrinkage may not be required (unless more than 2 groups)

o Genewise variances – soft smoothing gives spectacular improvement

o Technical replicate correlations – hard smoothing has proved successful

As we move from parameters of interest to higher order nuisance parameters, bias decreases in importance relative to noise

Other bits and pieces

Some other common variations

o Technical replicates

o Paired samples

o Array weights

Summary

o Borrowing strength is essential in small-scale microarray experiments

o Information can be shared across genes or across arrays

o Parameters may be set common between genes (correlations) or shrunk in a graduated way (standard errors)

o Power can be increased by testing hypotheses for sets of genes

Acknowledgements

WEHI Bioinformatics

o Gordon Smyth

o Matt Ritchie

o Alicia Oshlack

o Terry Speed

Garvan Epigenetics

o Susan Clark

o Aaron Statham

o Marcel Coolen

Computer Laboratories

Proposed schedule

a.m.

o ModerationDemo.R

o LimmaObjects.R

o BGNormalization.R

o SimpleExperiments.R

p.m.

o FactorialDesign.R

o GeneSetAnalysis.R

o TimeCourse.R

Getting started

o Grab the files from the ‘KU-August2009-LIMMA’ directory (or archive) and copy/move them to a convenient location on your computer

o Set the variable ‘rootDir’ to that directory. For example:

rootDir <- "~/Desktop/KU-August2009-LIMMA/data"

o Make rootDir the working directory of your R session
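The last step can be done with `setwd()`. A minimal sketch, assuming the files were copied to the path used in the example above; the existence check is an addition so that the snippet fails gently on machines where the directory is somewhere else.

```r
# Path taken from the example; adjust to wherever the files were copied
rootDir <- "~/Desktop/KU-August2009-LIMMA/data"

# Change directory only if the folder exists
if (dir.exists(rootDir)) {
  setwd(rootDir)
} else {
  message("Directory not found: ", rootDir)
}

# Confirm which directory the session is now using
print(getwd())
```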

limma package documentation

o Function help pages: ?lmFit, ?eBayes

o Class help pages: ?"RGList-class", ?"MArrayLM-class"

o Group help pages: help("06.LinearModels")

o User’s Guide: limmaUsersGuide()

o The R html help system is a good top view

Moderation Demo

o Illustration of sampling from the model

o Reduction in false discoveries

o Empirical Bayes differential expression

Limma Objects

o RGList

o MAList

o MArrayLM

BG correction / normalization demo

o Various procedures for BG correction, normalization

o Jurkat data: same vs. same

Simple Experiment 1: Integrin beta7+ vs beta7–

[Diagram: beta7- versus beta7+]

o Reading two-color data

o Control spots

o Background correction

o Dye-swaps

o Empirical Bayes differential expression
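For the dye-swap item, a common way to encode swapped arrays in a two-colour direct comparison is a design vector of 1 and -1. The sketch below uses made-up numbers for a single gene on a hypothetical four-array layout (not the actual beta7 arrays) to show why the sign flip matters: with balanced swaps a constant dye bias cancels out of the estimate.

```r
# Hypothetical layout: 4 arrays, arrays 2 and 4 are dye-swapped
design <- c(1, -1, 1, -1)

# Made-up values for one gene
true_logFC <- 1    # beta7+ vs beta7- log2 fold change
dye_bias <- 0.3    # constant red-vs-green bias on every array

# Observed log-ratios M: the biology flips sign with the swap, the dye bias does not
M <- design * true_logFC + dye_bias

# Fitting the design vector with no intercept averages the sign-corrected log-ratios
fit <- lm(M ~ 0 + design)
est_swap <- unname(coef(fit))

# Using only the unswapped arrays leaves the dye bias in the estimate
est_naive <- mean(M[design == 1])

print(c(with_swaps = est_swap, unswapped_only = est_naive))
```

In limma the same design vector would typically be passed to `lmFit` together with the normalized log-ratios.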

Simple Experiment 2: Cancer cells versus normal cells

[Diagram: normal cell line versus cancer cell line]

o Reading Affymetrix data

o Simple design versus contrasts
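The "simple design versus contrasts" item refers to two equivalent ways of setting up a two-group comparison. The sketch below shows both in base R for one made-up gene; the group sizes and expression values are illustrative and are not taken from the course data.

```r
# Hypothetical two-group layout with 3 arrays per group
group <- factor(rep(c("normal", "cancer"), each = 3),
                levels = c("normal", "cancer"))

# Made-up log2 expression values for a single gene
y <- c(7.1, 6.9, 7.0, 8.2, 7.9, 8.0)

# Route 1 (simple design): intercept plus a cancer-minus-normal coefficient
design1 <- model.matrix(~ group)
fit1 <- lm.fit(design1, y)
diff1 <- unname(fit1$coefficients["groupcancer"])

# Route 2 (contrasts): one coefficient per group, then a contrast of the two
design2 <- model.matrix(~ 0 + group)
colnames(design2) <- levels(group)
fit2 <- lm.fit(design2, y)
contrast <- c(normal = -1, cancer = 1)
diff2 <- sum(contrast * fit2$coefficients[names(contrast)])

# Both routes give the same estimated log2 fold change
print(c(simple_design = diff1, contrast = diff2))
```

In limma the two routes would typically both start with `lmFit`, with the second adding `makeContrasts` and `contrasts.fit` before `eBayes`. The contrast route costs an extra step but extends more easily to more than two groups.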

Factorial Experiment: Estrogen

[Design grid: estrogen (absent, present) by time (10 hours, 48 hours)]

o Reading Affymetrix data

o Factorial designs, contrasts
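A 2 x 2 factorial design like this one can be handled by fitting one coefficient per treatment combination and then forming contrasts. The sketch below assumes two replicate arrays per combination and uses made-up group means; the names `Est10`, `Est48` and `Interaction` are illustrative.

```r
# Hypothetical layout: 2 replicate arrays for each estrogen-by-time combination
estrogen <- rep(c("absent", "present"), each = 2, times = 2)
time <- rep(c("10h", "48h"), each = 4)
group <- factor(paste(estrogen, time, sep = "."),
                levels = c("absent.10h", "present.10h", "absent.48h", "present.48h"))

# One column per treatment combination
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Contrasts: estrogen effect at each time, and their difference (interaction)
contrasts <- cbind(
  Est10 = c(-1, 1, 0, 0),
  Est48 = c(0, 0, -1, 1),
  Interaction = c(1, -1, -1, 1)
)
rownames(contrasts) <- levels(group)

# Made-up group means for one gene, to show what each contrast measures
mu <- setNames(c(7, 8, 7, 9.5), levels(group))
effects <- drop(t(contrasts) %*% mu)
print(effects)
```

For these made-up means the estrogen effect is 1 at 10 hours and 2.5 at 48 hours, so the interaction contrast is 1.5.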

Gene set analysis

o Convenient “functional category analysis”

o Newer, flexible set-based testing

Targets of SAHA and depsipeptide: A time course experiment

o Study effects of SAHA and depsipeptide on the acute T-cell leukemia cell line CEM

o SAHA and depsipeptide are structurally different but have similar biological effects (induce death through intrinsic apoptotic pathway)

o Prising out subtle differences is of great interest

SAHA/depsipeptide: Experimental design

SAHA    Vehicle only    depsipeptide
0hr     0hr             0hr
1hr     1hr             1hr
2hr     2hr             2hr
4hr     4hr             4hr
8hr     8hr             8hr
16hr    16hr            16hr

Time courses of 6 arrays were done at each time.

Aims of experiment

o Identify common responders: genes which respond similarly to SAHA and depsipeptide

o Identify specific responders: genes which respond to one of SAHA or depsipeptide, but not to the other

o Different responders, genes which respond to both SAHA and depsipeptide but differently, are of lesser interest

Linear model analysis

o Fit genewise linear models to all the arrays simultaneously

o Include effects for drug x time

o Allow for probe-specific dye-effects

o Treat each time series of 6 arrays as a randomized block, i.e., allow arrays hybridized together to be correlated

2nd statistical point – analyse all arrays together
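The drug x time effects and the blocking structure can be sketched as a targets table and a design matrix. The layout below is hypothetical: it assumes two replicate time series per drug, each series of 6 arrays forming one block, and it leaves out the probe-specific dye effects. The actual array layout of the experiment is not given on the slides.

```r
drug <- c("SAHA", "vehicle", "depsipeptide")
hours <- c(0, 1, 2, 4, 8, 16)

# Hypothetical targets table: 3 drugs x 6 times x 2 replicate series = 36 arrays
targets <- expand.grid(time = hours, drug = drug, series = 1:2)

# One group per drug-by-time combination
targets$group <- factor(paste(targets$drug, targets$time, sep = "_"))

# One block per time series of 6 arrays
targets$block <- factor(paste(targets$drug, targets$series, sep = "_"))

# Design matrix with one coefficient per drug-by-time combination
design <- model.matrix(~ 0 + group, data = targets)
colnames(design) <- levels(targets$group)

print(dim(design))
print(table(targets$block))
```

In limma, a block factor like this would typically be passed to `duplicateCorrelation` to estimate a common within-block correlation, and then to `lmFit` through its `block` and `correlation` arguments, so that all arrays are analysed together.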