
Linear models and limma Copenhagen, 19 August 2009

Mark D. Robinson, WEHI/Garvan 1

1

Linear models and Limma

Københavns Universitet, 19 August 2009

Mark D. Robinson
Bioinformatics, Walter+Eliza Hall Institute
Epigenetics Laboratory, Garvan Institute

(with many slides taken from Gordon Smyth)

2

Limma = linear models for microarray data

o Morning Theory
- Introduction
- Background correction
- Moderated t-tests
- Simple linear models

o Morning Practical
- Demonstration of smoothing
- Limma objects (beta7)
- Background correction and normalization (beta7)
- Simple experimental designs: 2-colour example (beta7), Affymetrix example (cancer)

o Afternoon Theory
- More advanced designs / linear modeling
- Moderated F-tests
- Gene set tests
- Other analyses limma can do

o Afternoon Practical
- Factorial design (estrogen)
- Gene set testing (cancer)
- Time course experiment (SAHA/depsipeptide)

3

Expression measures

$y_{ga}$ = expression measure for probe or gene g on array a

Two-colour: $y_{ga} = \log_2(R/G)$

Affymetrix: $y_{ga}$ = log-intensity (summarized over probes)

Illumina: $y_{ga}$ = log-intensity (summarized over beads)

4

Questions of Interest

o What genes have changed in expression? (e.g. between disease/normal, affected by treatment)
- Gene discovery, differential expression

o Is a specified group of genes all up-regulated in a particular condition?
- Gene set differential expression

o Can the expression profile predict outcome?
- Class prediction, classification

o Are there tumour sub-types not previously identified? Do my genes group into previously undiscovered pathways?
- Class discovery, clustering

Today will cover first two questions - differential expression

5

Microarray Differential Expression Studies

o $10^3$ – $10^6$ genes/probes/exons on a chip or glass slide

o Inputs to limma: log-intensities (1-colour data) or log(R/G) log-ratios (for 2-colour data)

o Several steps to go from raw data to table of “expression”: background correction, normalization

o Idea: Fit a linear model to the expression data for each gene

6

Two colour microarrays

http://en.wikipedia.org/wiki/DNA_microarray


7

Two-colour data: Log-Intensities

For each probe:

$R = \log_2(R_f - R_b)$

$G = \log_2(G_f - G_b)$

Various ways to calculate background. Will often modify to ensure: Rf – Rb > 0 and Gf – Gb > 0.

8

Two-colour data: Means and Differences

For each probe:

$M = R - G$ (“minus”)

$A = (R + G)/2$ (“add”)
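As a minimal sketch of the two definitions above, the following base R snippet computes M and A for three probes. The intensity values are hypothetical and chosen only so that the arithmetic is easy to follow; the simple subtraction assumes every background-subtracted intensity is positive.

```r
# Hypothetical foreground (f) and background (b) intensities for three probes
Rf <- c(1200, 340, 5600); Rb <- c(200, 100, 600)
Gf <- c(700, 220, 1850);  Gb <- c(200, 100, 600)

# Background-subtracted log2-intensities (requires Rf - Rb > 0 and Gf - Gb > 0)
R <- log2(Rf - Rb)
G <- log2(Gf - Gb)

M <- R - G        # log-ratio ("minus")
A <- (R + G) / 2  # average log-intensity ("add")

print(cbind(M = M, A = A))
```

The first two probes show a 2-fold change (M = 1) and the third a 4-fold change (M = 2), at different average intensities A.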

9

Data Summaries

For each gene:

G1 R1 | G2 R2 | G3 R3 | G4 R4

A1 M1 | A2 M2 | A3 M3 | A4 M4

10

MA Plot

11

Log-Ratios or Single Channel Intensities?

o Traditional analysis treats log-ratios M=log(R/G) as the primary data, i.e., gene expression measurements are relative

o Alternative approach treats individual channel intensities R and G as primary data, i.e., gene expression measures are absolute (Wolfinger, Churchill, Kerr)

o Single channel approach makes new analyses possible but
- makes stronger assumptions
- requires more complex models (mixed models in place of ordinary linear models) to accommodate correlation between R and G on same spot

12

BG correction affects DE results

o Importance of careful pre-processing and quality control cannot be over-emphasized for microarray data

o Can have dramatic effect on differential expression results

o Consider here the normexp method of adaptive background correction
- Background correction step of the RMA algorithm
- Can also be applied to two colour data


13

Additive + multiplicative error model

Observe intensity for one probe on one array:

Intensity = background + signal

$I = B + S$

The background carries additive errors; the signal carries multiplicative errors.

This idea underlies the variance stabilizing transformations vsn (two colour data) and vst (for Illumina data)

14

normexp convolution model

Intensity = Background + Signal

Background ~ $N(\mu, \sigma^2)$, Signal ~ Exponential($\alpha$)

15

Conditional expectation undernormexp model

Then

with
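A sketch of the conditional expectation under the normexp model, under a stated assumption: $\alpha$ is taken to be the mean of the exponential signal (the slide does not say whether $\alpha$ is a mean or a rate). With that parametrization, the signal given the observed intensity $x$ is a normal distribution with mean $x - \mu - \sigma^2/\alpha$ and variance $\sigma^2$, truncated to be positive, and its mean follows from the standard truncated-normal formula. The function name `normexp_signal` and the parameter values are hypothetical.

```r
# E(S | I = x) when I = B + S, B ~ N(mu, sigma^2), S ~ Exponential with mean alpha
# (assumption: alpha is the exponential mean, not the rate)
normexp_signal <- function(x, mu, sigma, alpha) {
  m <- x - mu - sigma^2 / alpha                    # mean of the untruncated normal
  m + sigma * dnorm(m / sigma) / pnorm(m / sigma)  # mean after truncation at zero
}

# Hypothetical parameter values; observed intensities below and above the background mean
x <- c(50, 100, 200, 1000)
s <- normexp_signal(x, mu = 100, sigma = 20, alpha = 300)
print(round(s, 2))
```

The corrected signal stays positive even when the observed intensity is below the background mean, and it approaches $x - \mu - \sigma^2/\alpha$ for bright spots.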

16

normexp background correction

o Estimate the three parameters
o Replace I with E(S|I)
o For Affymetrix data, I is the “Perfect Match” data intensity
o For two-colour data, I=Rf-Rb or I=Gf-Gb

o In the RMA algorithm, parameter estimation uses an ad hoc density kernel method

o In limma (two colour), parameter estimation maximises the saddlepoint approximation to the likelihood

17

PM data on log2 scale: raw and fitted model

18

Background corrected intensity = E(Signal | Observed Intensity)

[Figure: E(Signal | Intensity) plotted against Observed Intensity]

Adaptive background correction produces positive signal


19

Offsets to stabilise the variance

Offset reduces variability at low intensities

Log-ratios

Background correction

20

Why do offsets stabilize the variance?

21

Why do offsets stabilize the variance?

o log2(800/100) = log2(8/1) = 3 (8-fold)
o Additive noise affects small numbers more
o Offsets introduce bias:
- log2[(80+10)/(10+10)] = 2.17

o But the tradeoff (drop in variance for increase in bias) is usually worth it
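The bias/variance tradeoff above can be illustrated with a few lines of base R. This is a sketch only: the offset of 10 is the one used on the slide, while the additive disturbance of ±5 units on the weaker channel is a hypothetical choice.

```r
offset <- 10

# Bias: the offset pulls the true 8-fold log-ratio towards zero
lr_true   <- log2(80 / 10)                        # 3
lr_offset <- log2((80 + offset) / (10 + offset))  # about 2.17

# Variance: perturb the weak channel by -5 and +5 units (hypothetical additive noise)
noise <- c(-5, 5)
spread_raw    <- diff(range(log2(80 / (10 + noise))))
spread_offset <- diff(range(log2((80 + offset) / (10 + offset + noise))))

print(c(raw = spread_raw, with_offset = spread_offset))
```

With the offset, the same additive disturbance moves the log-ratio less than half as far, at the cost of the bias shown in the first two lines.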

22

A self-self experiment: two background methods

553 spots not plotted

23

Comparison of 2-colour BG correction methods

[Figure: false discoveries (limma) plotted against genes selected. Ritchie et al. 2007]

24

References

o Silver et al. (2009). Microarray background correction: maximum likelihood estimation for the normal-exponential convolution. Biostatistics. [complete mathematical development of the saddlepoint approximation]

o Ritchie et al. (2007). A comparison of background correction methods for two-colour microarrays. Bioinformatics. [shows “normexp” performs best for 2-colour data]

o Irizarry et al. (2003). Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics. [describes RMA BG correction, but doesn't give much detail of the normexp convolution model]


25

Normalization

26

Normalization: Two-colour

BG correction and normalization are closely connected

Even after BG correction, some effects remain.

27

Normalization: One-colour

Similarly for single channel data, adjustments need to be made for all samples to be comparable.

28

Moderated t-tests

29

Borrowing information across genes

o Small data sets: few arrays, inference for individual genes is uncertain

o Curse of dimensionality: many tests, need to adjust for multiple testing, loss of power

o Benefit of parallelism: same model is fitted for every gene. Can borrow information from one gene to another

30

Hard and soft shrinkage

o Hard: simplest way to borrow information is to assume that one or more parameters are constant across genes

o Soft: smooth genewise parameters towards a common value in a graduated way, e.g., Bayes, empirical Bayes, Stein shrinkage …


31

A very common experiment

Wild-type mouse x 2 Mutant mouse x 2

Which genes are differentially expressed?

n1 = n2 = 2 Affymetrix arrays

25,000 probe-sets

Gene X

32

Ordinary t-tests

give very high false discovery rates

Residual df = 2

33

Another very common experiment

Which genes are differentially expressed?

n = 2 two-colour arrays

30,000 probes

Wild-type mouse 1 Mutant mouse 1

Wild-type mouse 2 Mutant mouse 2

34

Ordinary t-tests

give very high false discovery rates

Residual df = 1

35

Small sample size, many tests

The problem:

o These experiments would be under-powered even with just one gene

o Yet we want to test differential expression for each of 50k genes, hence lots of multiple testing and further loss of power

The solution:

The same statistical model is being fitted for every gene in parallel. Can borrow strength from other genes.

36

t-tests with common variance

with residual standard deviation pooled across genes

More stable, but ignores gene-specific variability


37

A better compromise

Moderated t-statistics

Shrink standard deviations towards common value

= degrees of freedom
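The shrinkage idea can be sketched in base R using the posterior variance estimator of Smyth (2004), $\tilde s_g^2 = (d_0 s_0^2 + d\, s_g^2)/(d_0 + d)$, where $d$ is the residual degrees of freedom and $d_0$, $s_0$ are the prior degrees of freedom and prior standard deviation. All numbers below are hypothetical, and $d_0$ and $s_0$ are simply fixed here, whereas limma estimates them from the data.

```r
# Hypothetical summaries for four genes from a 2 vs 2 comparison
beta_hat <- c(1, 1, 1, 1)          # estimated log2 fold changes
s        <- c(0.1, 0.5, 1.0, 2.0)  # genewise residual standard deviations
v        <- 1/2 + 1/2              # unscaled variance of a difference of two group means
d        <- 2                      # residual degrees of freedom

# Prior values fixed for illustration (assumption: not estimated from data here)
d0 <- 4
s0 <- 0.5

s2_tilde <- (d0 * s0^2 + d * s^2) / (d0 + d)  # shrunk variances
t_ord <- beta_hat / (s * sqrt(v))             # ordinary t-statistics
t_mod <- beta_hat / sqrt(s2_tilde * v)        # moderated t-statistics
p_mod <- 2 * pt(-abs(t_mod), df = d0 + d)     # p-values on d0 + d degrees of freedom

print(round(cbind(s, t_ord, t_mod, p_mod), 3))
```

The gene with an implausibly small standard deviation (0.1) no longer dominates the ranking: its ordinary t of 10 becomes a moderated t of about 2.4, while the gene whose standard deviation equals $s_0$ is unchanged.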

38

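Shrinkage of standard deviations

[Figure: the genewise standard deviations $s_1, s_2, \ldots, s_G$ are shrunk towards the common value $s_0$, giving $\tilde s_1, \tilde s_2, \ldots, \tilde s_G$; the moderated $\tilde t_g$ lies between $t_{g,\text{pooled}}$ and $t_g$; labels $d$ and $d_0$]

The data decides whether $\tilde t_g$ should be closer to $t_{g,\text{pooled}}$ or to $t_g$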

39

Why does it work?

o We learn what is the typical variability level by looking at all genes, but allow some flexibility from this for individual genes

o Adaptive

40

Hierarchical model for variances

Data

Prior

Posterior

41

Posterior Statistics

Moderated t-statistics

Posterior variance estimators

Baldi & Long 2001, Wright & Simon 2003, Smyth 2004

42

Exact distribution for moderated t

An unexpected piece of mathematics shows that, under the null hypothesis,

$\tilde t_g \sim t_{d_0 + d_g}$

The degrees of freedom add!

The Bayes prior in effect adds d0 extra arrays for estimating the variance.

Wright and Simon 2003, Smyth 2004
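A small sketch of what the added degrees of freedom buy in practice. The residual degrees of freedom $d = 2$ match the 2 vs 2 experiment earlier in the slides; the prior degrees of freedom $d_0 = 4$ are a hypothetical value chosen for illustration.

```r
d  <- 2  # residual degrees of freedom (2 vs 2 arrays)
d0 <- 4  # hypothetical prior degrees of freedom

# Two-sided 5% critical values for the ordinary and the moderated t-statistic
crit_ordinary  <- qt(0.975, df = d)       # about 4.30
crit_moderated <- qt(0.975, df = d0 + d)  # about 2.45

print(c(ordinary = crit_ordinary, moderated = crit_moderated))
```

The moderated statistic is compared with a much less extreme critical value, which is one source of the gain in power.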


43

More on empirical Bayes statistics

44

Hierarchical model for means

Data

Prior

Lönnstedt and Speed 2002, Smyth 2004

45

Posterior Odds

Posterior odds of differential expression

Lönnstedt and Speed 2002, Smyth 2004

Monotonic function of $\tilde t$

Hence $\tilde t$ gives the best possible ranking of genes

46

Estimating Hyper-Parameters

Closed form estimators with good properties are available:

for c0 in terms of quantiles of the $|\tilde t_g|$

for s0 and d0 in terms of the first two moments of $\log s^2$

47

Marginal Distributions

Under usual likelihood model, $s_g$ is independent of the estimated coefficients.

Under the hierarchical model, $s_g$ is independent of the moderated t-statistics instead:

$$\tilde t_g \sim \begin{cases} t_{d_0+d} & \text{with prob } 1-p \\ \sqrt{1 + c_0/c}\; t_{d_0+d} & \text{with prob } p \end{cases}$$

48

Moment estimators for s0 and d0

Marginal moments of $\log s^2$ lead to estimators of s0 and d0:

Estimate d0 by solving

where

Finally


49

Quantile Estimation of c0

Let r be rank of $|\tilde t_g|$ in descending order, and let F(·;·) be the distribution function of the t-distribution. Can estimate c0 by equating empirical to theoretical quantiles:

Get overall estimator of c0 by averaging the individual estimators from the top p/2 proportion of the $|\tilde t_g|$

50

Short note on multiple testing

51

Multiple testing and adjusted p-values

o Traditional method in statistics is to control family wise error rate, e.g., by Bonferroni.

o Holm’s method is improved (step-down) modification of Bonferroni.

o Controlling the false discovery rate (FDR) is more appropriate in microarray studies

o Benjamini and Hochberg method controls expected FDR for independent or weakly dependent test statistics. Simulation studies support use for microarray data.

o All methods can be implemented in terms of adjusted p-values.
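The three adjustments named above are available in base R through `p.adjust()`. The sketch below applies them to six hypothetical p-values so the difference in stringency is visible.

```r
# Hypothetical raw p-values for six genes, sorted for readability
p <- c(0.0001, 0.001, 0.01, 0.02, 0.04, 0.2)

bonf <- p.adjust(p, method = "bonferroni")  # family-wise error rate
holm <- p.adjust(p, method = "holm")        # step-down modification of Bonferroni
bh   <- p.adjust(p, method = "BH")          # Benjamini and Hochberg FDR

print(cbind(raw = p, bonferroni = bonf, holm = holm, BH = bh))
```

At a cutoff of 0.05 on the adjusted p-values, Bonferroni keeps two genes, Holm three and Benjamini-Hochberg five.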

52

End of morning theory - Summary

o Background correction and normalization are important considerations: normexp, offsets

o Moderation will generally help; moderation of variances is very effective

o Convenient model gives known null distribution

o Multiple testing

53

Linear models

54

More complex experiments

o More complex microarray experiments can be represented by linear models

o For one-channel platforms, the linear model can be set up using the usual univariate linear model formulae

o For two-colour platforms, the linear models have some special properties


55

Linear Models

o In general, need to specify:
- Dependent variable
- Explanatory variables (experimental design, covariates, etc.)

o More generally:

$E[y] = X\beta$

where y is the vector of observed data, X is the design matrix and $\beta$ is the vector of parameters to estimate

56

Linear Models for microarrays

o Analyse all arrays together combining information in optimal way
o Combined estimation of precision
o Extensible to arbitrarily complicated experiments
o Design matrix: specifies RNA targets used on arrays
o Contrast matrix: specifies which comparisons are of interest

57

Design → Linear models

Wild-type mouse x 2 Mutant mouse x 2

β1 = wt log-expression

β2 = mutant − wt

Design matrices for 1-colour arrays are easier to specify!

E[y1] = E[y2] = β1

E[y3] = E[y4] = β1 + β2
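For one-colour data this parametrization is what base R's `model.matrix()` produces with its default treatment contrasts. A minimal sketch, with hypothetical coefficient values:

```r
# Two wild-type and two mutant arrays; "wt" is the reference level
group <- factor(c("wt", "wt", "mut", "mut"), levels = c("wt", "mut"))
X <- model.matrix(~ group)  # columns: intercept (beta1) and mutant - wt (beta2)
print(X)

# Hypothetical coefficients: wt log-expression 7, mutant - wt difference 1.5
beta <- c(7, 1.5)
expected <- drop(X %*% beta)  # E[y1], ..., E[y4]
print(expected)
```

The first two arrays have expected value β1 and the last two β1 + β2, matching the equations on the slide.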

58

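Designs → Linear Models

For each two-colour array, $y = \log_2(R/G)$.

A and B, arrays 1 and 2 (dye swap):

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix} \beta, \qquad \beta = B - A$$

Ref, A and B, arrays 1 to 3:

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \qquad \beta_1 = A - \text{Ref}, \quad \beta_2 = B - A$$

A, B and C, arrays 1 to 3:

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \qquad \beta_1 = B - A, \quad \beta_2 = C - A$$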

59

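Matrix Multiplication

A and B, arrays 1 and 2, with $\beta = B - A$:

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix} \beta = \begin{pmatrix} \beta \\ -\beta \end{pmatrix}$$

Ref, A and B, arrays 1 to 3, with $\beta_1 = A - \text{Ref}$ and $\beta_2 = B - A$:

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \beta_1 \\ -\beta_1 \\ \beta_1 + \beta_2 \end{pmatrix}$$

A, B and C, arrays 1 to 3, with $\beta_1 = B - A$ and $\beta_2 = C - A$:

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \beta_1 \\ -\beta_1 + \beta_2 \\ -\beta_2 \end{pmatrix}$$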

Contrast:
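The matrix products on this slide, together with one contrast of the fitted coefficients, can be checked in base R. This is a sketch under assumptions: the signs in the design matrices are as read from the slides, the coefficient values are hypothetical, and the contrast shown (C − B in the three-sample loop) is an illustrative choice rather than the one printed on the slide.

```r
# Design matrices as read from the slides (signs reconstructed; treat as assumptions)
X_swap <- matrix(c(1, -1), ncol = 1)          # A and B, dye swap; beta = B - A
X_ref  <- rbind(c(1, 0), c(-1, 0), c(1, 1))   # Ref, A, B; beta1 = A - Ref, beta2 = B - A
X_loop <- rbind(c(1, 0), c(-1, 1), c(0, -1))  # A, B, C; beta1 = B - A, beta2 = C - A

beta <- c(0.5, 2)  # hypothetical coefficient values

fit_ref  <- drop(X_ref %*% beta)   # beta1, -beta1, beta1 + beta2
fit_loop <- drop(X_loop %*% beta)  # beta1, -beta1 + beta2, -beta2
print(fit_ref)
print(fit_loop)

# Illustrative contrast in the loop design: C - B = beta2 - beta1
contrast_CB <- sum(c(-1, 1) * beta)
print(contrast_CB)
```

In the loop design the three expected log-ratios sum to zero for any coefficients, because each sample appears once in each channel.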

60

Linear Model Estimates

Obtain a linear model for each gene g

Estimate models to get

coefficients

standard deviations

standard errors


61

Contrasts

A contrast is any linear combination of the coefficients αj which we want to test equal to zero.

Define contrasts $\beta_g = C^T \alpha_g$

Want to test $H_0: \beta_{gj} = 0$ vs $H_a: \beta_{gj} \neq 0$

where C is the contrast matrix.
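A minimal base R sketch of forming contrasts as linear combinations of coefficients. It assumes a one-colour group-means parametrization with three hypothetical groups A, B and C; the coefficient values and the names `Cmat`, `BvsA`, `CvsA`, `CvsB` are illustrative only.

```r
# Hypothetical group-mean coefficients for one gene
alpha <- c(A = 6.0, B = 7.5, C = 6.5)

# Contrast matrix: one column per comparison of interest
Cmat <- cbind(BvsA = c(-1, 1, 0),
              CvsA = c(-1, 0, 1),
              CvsB = c(0, -1, 1))
rownames(Cmat) <- names(alpha)

# Each contrast is a linear combination of the coefficients
beta <- drop(t(Cmat) %*% alpha)
print(beta)
```

Every column of the contrast matrix sums to zero, so each contrast compares groups rather than measuring overall expression level.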

62

Pax5: example of saturated design

[Figure: design graph of 12 two-colour arrays (numbered 1 to 12) connecting the four RNA sources Wt, Pax5-/-, Rag1-/- and IL-7 removed]

63

Regression Analysis

Choose 3 comparisons between the 4 RNA sources to be the coefficients of the linear model, e.g.,
- PW: Pax5-/- vs Wt
- RW: Rag1-/- vs Wt
- IW: IL-7 withdrawn vs Wt

For each gene, fit a linear model with a coefficient for each contrast

Any other comparisons of interest can be extracted from the linear model as contrasts

64

$$E\begin{pmatrix} m_1 \\ m_2 \\ \vdots \\ m_{12} \end{pmatrix} = X \begin{pmatrix} PW \\ RW \\ IW \end{pmatrix}$$

where X is the 12 × 3 design matrix, with one row per array and entries 0, 1 or −1.

Exercise: Fill in the design matrix.

65

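Example of factorial design

Expected log-expression for each RNA source:

WT.P1: µ
MT.P1: µ + b
WT.P11: µ + a1
MT.P11: µ + a1 + b + a1.b
WT.P21: µ + a1 + a2
MT.P21: µ + (a1+a2) + b + (a1+a2)b

Seven two-colour arrays (numbered 1 to 7):

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ b \\ a_1 b \\ a_2 b \end{pmatrix}$$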

66

Moderated F-statistics


67

Moderated F-Statistic

Moderated F-statistic

MST=Mean Sum of squares between Treatments

The idea of shrinking the variance extends immediately to multiple contrasts

Wright & Simon 2003, Smyth 2004

68

Doubly shrunk F statistic

The moderated F is not a monotonic function of the posterior odds

A doubly shrunk F statistic can be shown to have the desired relationship to the posterior odds

Tai and Speed 2006, 2007

Improves further the gene ranking

69

Single or double shrinkage

o Shrinking the variances only is enough when comparing two groups

o When comparing 3 or more groups, further gain can be had by shrinking the β also (recall Stein estimator needs at least 3 means)

70

Functional category analysis

71

Functional category analysis

o Used on a set of genes deemed to be differentially expressed

o Asks the question: is my set of genes enriched for a particular molecular function?

o Useful for establishing what pathways / types of genes are affected

o Nowadays largely superseded by gene set tests

72

Overlap statistics

o Question: Say you have a set of 85 genes (of a total 20000 genes) known to be associated with function X. Calculate the probability of randomly selecting 40 or more of those genes in a list of 100 DE genes.

o Answer: ?


73

Overlap statistics

o Question: Say you have a set of 85 genes (of a total 20000 genes) known to be associated with some function. Calculate the probability of randomly selecting 40 or more of those genes in a list of 100 DE genes.

o Answer: Hypergeometric (i.e. the “urn” problem).

N = 19915 “white”, m = 85 “black”

n = 100 drawn, k = 40 (or more) “black” among them
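The urn calculation can be done with base R's `phyper()`. Note that `phyper()` has its own argument convention (number of "success" balls, number of other balls, number drawn), so the sketch below uses descriptive variable names, which are not from the slides, instead of the slide's letters.

```r
set_size   <- 85             # genes associated with the function ("black" balls)
other      <- 20000 - 85     # remaining genes ("white" balls): 19915
list_size  <- 100            # number of DE genes drawn
overlap    <- 40             # observed overlap

# P(overlap >= 40) = P(X > 39) under the hypergeometric distribution
p_overlap <- phyper(overlap - 1, set_size, other, list_size, lower.tail = FALSE)

# Overlap expected by chance alone
expected_overlap <- list_size * set_size / 20000

print(p_overlap)
print(expected_overlap)
```

With fewer than one gene expected to overlap by chance, an overlap of 40 gives a vanishingly small probability.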

74

Gene set tests

75

Gene sets

o Test significance of a (a priori specified) group of genes

o The genes might belong to a known pathway or might be the top genes from a related experiment

o The set might be significant even if individual genes are not

o Gene set enrichment analysis (GSEA) originated by Mootha et al PNAS 2003 and Subramanian et al PNAS 2005

76

Available gene set methods

o GSEA: gene set enrichment analysis. Complex method using Kolmogorov-Smirnov type tests and sample permutation. Needs two groups, many arrays, many genes and many sets.

o GSA: gene set analysis. Uses combination of permutation of samples and standardization across genes. More powerful. Still needs two groups, many genes and many sets.

o GST: gene set tests using Wilcoxon test. Use randomization over genes. Applicable to linear models and small samples, but can be over-optimistic if the genes in the set are highly correlated.

o Now, rotation-based gene set tests.

77

Gene set tests

All microarray probes, ranked by a test statistic of interest: t1, t2, t3, t4, …

A priori subset of genes: X1, X2, X3 … Xn

Look for ranks for set genes amongst test statistics

78

Viewing gene sets

Cell adhesion genes

Genes regulated by MYB


79

What’s the hypothesis?

o Two major types of gene set tests: competitive or self-contained

o Competitive: Genes in the set tend to be more strongly DE than randomly chosen genes

o Self-contained: At least some genes in the set are truly DE

Goeman & Bühlmann, Bioinformatics 2007

80

Permutation

o Competitive gene set tests are usually tested by permuting genes, but this ignores inter-gene correlations

o Self-contained gene set tests are usually tested by permuting arrays, but this is limited to two-group comparisons with large numbers

81

Rotation gene set tests (ROAST)

o Self-contained hypothesis
o The first test suitable for small samples which correctly accounts for inter-gene correlation
o Can handle complex linear model designs, including array weights, random effects etc

82

Two steps: projection and rotation

o Project data onto space orthogonal to nuisance parameters in the linear model

o Random rotation of the orthogonal residuals provides fractional permutation, avoids granularity of p-values

o Assumes multivariate normality, but proves to be highly robust against deviations from normality

83

Set summary statistics

o Compute empirical Bayes t-statistics for each gene in set
o Convert to z-statistics
o Mean of z²:
- good power, even when only a subset of genes respond
- not robust against non-normality
o Mean50: mean of top half of |z|:
- good power
- robust
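The two set summaries can be sketched in base R. The moderated t-statistics and total degrees of freedom below are hypothetical, and the t-to-z conversion by matching tail probabilities is one reasonable choice assumed here for illustration; it is not claimed to be the exact conversion used inside limma.

```r
# Hypothetical moderated t-statistics for the six genes in a set
t_mod    <- c(3.1, -2.4, 0.3, 1.8, -0.5, 2.9)
df_total <- 10  # hypothetical total degrees of freedom (d0 + d)

# Convert to z-statistics by matching tail probabilities (assumed conversion)
z <- qnorm(pt(t_mod, df = df_total))

# Mean of squared z-statistics
msq <- mean(z^2)

# Mean50: mean of the top half of |z|
abs_sorted <- sort(abs(z), decreasing = TRUE)
mean50 <- mean(abs_sorted[seq_len(ceiling(length(abs_sorted) / 2))])

print(c(msq = msq, mean50 = mean50))
```

In a rotation test, these summaries would be recomputed for each rotated data set to obtain a p-value for the set.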

84

References

o Subramanian et al (2005). A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102, 15545-15550.

o Efron and Tibshirani (2007). On testing the significance of sets of genes. Ann Appl Stat 1, 107-129. http://www-stat.stanford.edu/~tibs/GSA/


85

Multi-level models I: duplicate spots

86

Hard shrinking: examples

o Common correlation model for within-array duplicate spots
Smyth et al (2005). The use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics 21, 2067-2075.

o Common correlation models for single channel analysis of two-color microarray data
Smyth, G. K. (2005). Individual channel analysis of two-colour microarray data. 55th Session of the International Statistics Institute, 5-12 April 2005, Sydney, Australia.

87

Common correlation model

Given a blocking factor with variance component $\sigma_b^2$, focus on within-block correlations

Common correlation model assumes

Has proved effective for technical blocking factors for which correlations are high

88

Duplicate Spots

o If the clone library is not too large, it is often possible to print each gene more than once on an array

o Duplicates are always side-by-side or a fixed distance apart

89

Genes printed in duplicate pairs

One pair

90

Duplicate spots are correlated

o Duplicate spots are a form of technical replication, share lots of common causes

o Cannot be treated as replicates on separate arrays, log-ratios from duplicate spots are correlated

o How best to use duplicate spots? Usual approach is simply to average them


91

Common Correlation Model

o Assume the between-duplicate correlation is the same for every gene

o Justified by the belief that the correlation springs mainly from spatial proximity

o Improves estimation of variances

92

Consequences for individual genes

o If the number of genes is large, then the estimator of ρ is very accurate, so ρ may be treated as known as far as inference for each individual gene is concerned

o This doesn’t change estimation of μg but greatly changes estimation of σg

93

Validation with Spike-In Data

o Does the idea of using common correlations work in practice?

o Check the ability of the common-correlation t-statistic to distinguish calibration from ratio spike-in spots

o Scorecard system includes calibration controls, 3-fold up and down ratio controls, and 10-fold up and down ratio controls

94

95

96

Multilevel models II: Separate channel analysis of two-colour data


97

Separate channel analysis of two-colour microarray data

G1 R1 G2 R2 G3 R3 G4 R4

Each spot is a block of size 2

The two channels give correlated pair of values

Common correlation model sets intra-spot correlation equal across genes

98

Why Use Means and Differences?

An old idea dating back to Tukey, Altman & Bland

If $\mathrm{var}(R) = \mathrm{var}(G)$ then $\mathrm{cor}(M, A) = 0$ regardless of the correlation between R and G, since $\mathrm{cov}(M, A) = [\mathrm{var}(R) - \mathrm{var}(G)]/2$.

99

Common Reference Experiment

Arrays: Ref vs B, Ref vs C, Ref vs B, Ref vs C

Ref vs B arrays: $\mu_M = B - \text{Ref}$, $\mu_A = (B + \text{Ref})/2$

Ref vs C arrays: $\mu_M = C - \text{Ref}$, $\mu_A = (C + \text{Ref})/2$

Why not use the A-values as well as M-values?

100

A simple normal model

Gene g, array i

$\rho_g$ is the intra-spot correlation

101

Models in terms of M and A

102

M and A parameters

$\mu_{Mgi} = \mu_{Rgi} - \mu_{Ggi}$

$\mu_{Agi} = (\mu_{Rgi} + \mu_{Ggi})/2$

$\sigma^2_{Mg} = 2\sigma_g^2 (1 - \rho_g)$

$\sigma^2_{Ag} = \sigma_g^2 (1 + \rho_g)/2$


103

Correlation

M and A are independent but have different variances.

Have converted a correlated problem into a heteroscedastic problem

Can estimate correlation by estimating variances:

$$\tanh^{-1} \rho_g = \frac{1}{2} \log \frac{4\sigma^2_{Ag}}{\sigma^2_{Mg}}$$
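A short base R check of the relationship between the intra-spot correlation and the variances of M and A: starting from a known correlation, compute the implied variances $2\sigma^2(1-\rho)$ and $\sigma^2(1+\rho)/2$ and recover the correlation with the tanh formula. The values of $\sigma^2$ and $\rho$ are hypothetical; in practice the two variances would be estimated from the data.

```r
sigma2 <- 1     # hypothetical single-channel variance
rho    <- 0.85  # hypothetical intra-spot correlation

# Variances of M = R - G and A = (R + G)/2 implied by the model
var_M <- 2 * sigma2 * (1 - rho)
var_A <- sigma2 * (1 + rho) / 2

# Recover the correlation from the two variances
rho_back <- tanh(0.5 * log(4 * var_A / var_M))
print(rho_back)
```

The round trip returns the original correlation because $4\sigma^2_A/\sigma^2_M = (1+\rho)/(1-\rho)$, whose half-log is $\tanh^{-1}\rho$.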

104

Common correlation model

o Assume the intra-spot correlation is constant across genes

o Justified by (i) variance components are observed to be positively correlated and (ii) standard errors for coefficients are not sensitive to correlation value

o Common correlation can then be assumed to be known at individual gene level

o Converts a mixed model into a weighted regression

lmscFit() in limma

105

ApoAI Experiment

8 wild type arrays and 8 ApoAI-/- arrays, all relative to a common reference

Median intraspot correlation is estimated as $\hat\rho = 0.85$

The efficiency gain from using A-values is

$$\frac{1 - \hat\rho}{1 + \hat\rho} = 0.08$$

106

Individual Channel Normalization

Using A-values in the analysis requires that they be normalized to have comparable values between arrays: “single channel normalization”

For the ApoAI data:

Within-array loess normalization only: $\hat\rho = 0.85$

A-quantile normalization: $\hat\rho = 0.89$

Quantile normalization: $\hat\rho = 0.84$

107

Ignoring the Reference

Why not ignore the common reference channel? If we use only the red channel in the common reference experiment,

$$\mathrm{var}(\bar R_C - \bar R_B) = \sigma^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)$$

so

$$\frac{\mathrm{var}(\bar R_C - \bar R_B)}{\mathrm{var}(\bar M_C - \bar M_B)} = \frac{1}{2(1 - \rho)}$$

so adjusting for a common reference is worthwhile whenever ρ > 0.5

108

Disconnected Design

Arrays: B vs C, D vs E

A-values make no contribution to estimating BvsC comparison (direct)

M-values make no contribution to estimating BvsD comparison (indirect)

Relative efficiency of indirect comparison compared to direct is $1 - \rho$
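The two efficiency results, the variance ratio 1/(2(1 − ρ)) from the "Ignoring the Reference" slide and the relative efficiency 1 − ρ above, can be tabulated in base R for a few correlations. The function names and the chosen correlation values are illustrative only.

```r
# Variance of the red-channel-only comparison relative to the M-value comparison
var_ratio <- function(rho) 1 / (2 * (1 - rho))

# Efficiency of an indirect (between-array) comparison relative to a direct one
rel_eff_indirect <- function(rho) 1 - rho

rho <- c(0.2, 0.5, 0.85)
tab <- cbind(rho = rho,
             var_ratio = var_ratio(rho),
             rel_eff_indirect = rel_eff_indirect(rho))
print(tab)
```

A variance ratio above 1 means that using the reference channel helps; this happens exactly when ρ exceeds 0.5, and at ρ = 0.85 ignoring the reference would more than triple the variance.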


109

Observed false discovery rates

[Figure: Known False Positives (0 to 400) plotted against Tests Rejected (0 to 4000) for three analyses: Mixed model, Ignore Cor, SpotCor]

Data from Holloway et al (2006) BMC Bioinformatics 7, Article 511.

110

Why common correlation is effective

o Bias introduced into the variance estimation seems to be offset by increase in precision

o Great simplification of mathematical model
o Penalizes genes with large within-block variances

111

A Shrinkage Hierarchy

o Fold changes – shrinkage may not be required (unless more than 2 groups)

o Genewise variances – soft smoothing gives spectacular improvement

o Technical replicate correlations – hard smoothing has proved successful

As we move from parameters of interest to higher order nuisance parameters, bias decreases in importance relative to noise

112

Other bits and pieces

113

Some other common variations

o Technical replicates
o Paired samples
o Array weights

114

Summary

o Borrowing strength is essential in small-scale microarray experiments

o Information can be shared across genes or across arrays

o Parameters may be set common between genes (correlations) or shrunk in a graduated way (standard errors)

o Power can be increased by testing hypotheses for sets of genes


115

Acknowledgements

WEHI Bioinformatics
o Gordon Smyth
o Matt Ritchie
o Alicia Oshlack
o Terry Speed

Garvan Epigenetics
o Susan Clark
o Aaron Statham
o Marcel Coolen

116

Computer Laboratories

117

Proposed schedule

a.m.

o ModerationDemo.R
o LimmaObjects.R
o BGNormalization.R
o SimpleExperiments.R

p.m.

o FactorialDesign.R
o GeneSetAnalysis.R
o TimeCourse.R

118

Getting started

o Grab the files from the ‘KU-August2009-LIMMA’ directory (or archive) and copy/move to a convenient location on your computer

o Set the variable ‘rootDir’ to that directory. For example:

rootDir <- "~/Desktop/KU-August2009-LIMMA/data"

o Make rootDir the working directory of your R session

119

limma package documentation

o Function help pages

?lmFit, ?eBayes

o Class help pages

?"RGList-class"
?"MArrayLM-class"

o Group help pages

help("06.LinearModels")

o User’s Guide

limmaUsersGuide()

o The R html help system gives a good top-level view

120

Moderation Demo

o Illustration of sampling from the model
o Reduction in false discoveries
o Empirical Bayes differential expression


121

Limma Objects

o RGList
o MAList
o MArrayLM

122

BG correction / normalization demo

o Various procedures for BG correction, normalization

o Jurkat data: same vs. same

123

Simple Experiment 1: Integrin beta7+ vs beta7–

beta7- vs beta7+

o Reading two-color data
o Control spots
o Background correction
o Dye-swaps
o Empirical Bayes differential expression

124

Simple Experiment 2: Cancer cells versus normal cells

Normal cell line vs Cancer cell line

o Reading Affymetrix data
o Simple design versus contrasts

125

Factorial Experiment: Estrogen

Estrogen: Absent or Present

Time: 10 hours or 48 hours

o Reading Affymetrix data
o Factorial designs, contrasts

126

Gene set analysis

o Convenient “functional category analysis”
o Newer, flexible set-based testing


127

Targets of SAHA and depsipeptide: A time course experiment

o Study effects of SAHA and depsipeptide on the acute T-cell leukemia cell line CEM

o SAHA and depsipeptide are structurally different but have similar biological effects (induce death through intrinsic apoptotic pathway)

o Prising out subtle differences is of great interest

128

SAHA/depsipeptide: Experimental design

SAHA Vehicle only depsipeptide

0hr 0hr 0hr

1hr 1hr 1hr

2hr 2hr 2hr

4hr 4hr 4hr

8hr 8hr 8hr

16hr 16hr 16hr

129

Aims of experiment

o Identify common responders: genes which respond similarly to SAHA and depsipeptide

o Identify specific responders: genes which respond to one of SAHA or depsipeptide, but not to the other

o Different responders, genes which respond to both SAHA and depsipeptide but differently, are of lesser interest

130

Linear model analysis

o Fit genewise linear models to all the arrays simultaneously

o Include effects for drug x time
o Allow for probe-specific dye-effects
o Treat each time series of 6 arrays as a randomized block, i.e., allow arrays hybridized together to be correlated