Lecture 6 Design Matrices and ANOVA and how this is done in LIMMA

Upload: ernest-caldwell

Post on 18-Jan-2016



ANOVA: Some Examples

• Is there a difference in the mean hourly wages for three different ethnic groups?

• Is there a difference in the mean sugar content of five different brands of cereal?

• Is there a difference between the mutant and wild-type versions of an organism?

• Is there a dye effect as well as a treatment effect?

• For a time-course experiment, are there significant differences in gene expression between the time points?


Model for ANOVA

• The general linear model, which covers ANOVA, regression and ANCOVA, is written as:

• $Y = X\beta + \varepsilon$, with dimensions $(n \times 1) = (n \times p)(p \times 1) + (n \times 1)$

• This is the matrix formulation of the model.
• $Y$: response vector (observed)
• $X$: design matrix (observed)
• $\beta$: parameter vector (to be estimated)
• $\varepsilon$: error vector (unobserved, random)
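As a rough illustration of the matrix formulation, the sketch below simulates a small data set and estimates the parameter vector by least squares. All numbers are made up for illustration, and the names (`beta_true`, `beta_hat`) are hypothetical, not part of the lecture.

```r
# Simulated toy data (hypothetical values, for illustration only)
set.seed(1)
n <- 6
X <- cbind(intercept = 1, x = 1:n)   # design matrix, n x p with p = 2
beta_true <- c(2, 0.5)               # parameter vector, p x 1
eps <- rnorm(n, sd = 0.1)            # error vector, n x 1
Y <- X %*% beta_true + eps           # response vector, n x 1

# Least-squares estimate: solve the normal equations (X'X) b = X'Y
beta_hat <- solve(crossprod(X), crossprod(X, Y))

# The same estimate from R's built-in least-squares fitter
beta_lm <- lm.fit(X, as.vector(Y))$coefficients

dim(Y)      # n x 1
beta_hat    # close to beta_true, but not equal because of the noise
```

In practice one would call `lm()` rather than solving the normal equations by hand; the explicit version is shown only to mirror the dimensions written under the model.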


How to write a Design Matrix

• Consider a data set in which we compare 3 different fertilizers, A, B and C. For each fertilizer we have two plots of land.

• Data:

Plot  Fertilizer  Yield (tonnes)
1     A           12
2     A           15
3     B           21
4     B           18
5     C           10
6     C           9


Models: cell-means model

• We can write this as:
• $Y_{ij} = \mu_i + \varepsilon_{ij}$
• This is the cell-means model, where $\mu_i$ is the mean for treatment $i$ and $j$ indexes the units within a treatment.

• The corresponding design matrix is:

1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
0 0 1

Each row corresponds to a unit and each column corresponds to a treatment.
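The cell-means design matrix for the fertilizer data can be produced in R by removing the intercept from the model formula. The sketch below is a minimal illustration; the variable names (`fertilizer`, `yield`, `X_cell`) are chosen here and are not from the lecture.

```r
# Fertilizer data from the example: two plots per fertilizer
fertilizer <- factor(c("A", "A", "B", "B", "C", "C"))
yield <- c(12, 15, 21, 18, 10, 9)

# "~ 0 + fertilizer" drops the intercept, giving one indicator column per treatment
X_cell <- model.matrix(~ 0 + fertilizer)
X_cell

# With this design the fitted coefficients are simply the treatment means
fit_cell <- lm(yield ~ 0 + fertilizer)
coef(fit_cell)   # 13.5, 19.5 and 9.5 for A, B and C
```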


Model: factor-effects model

• We can write this as:
• $Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$
• This is the factor-effects model. Here $\mu$ is the OVERALL mean and the $\tau_i$ are the differences of each treatment level from the overall mean, with the added requirement that $\sum_i \tau_i = 0$.

• The corresponding design matrix is:

1  1  0
1  1  0
1  0  1
1  0  1
1 -1 -1
1 -1 -1

Each row corresponds to a unit. The first column corresponds to the overall mean and the other two columns to the treatment effects $\tau_1$ and $\tau_2$; the last treatment is expressed in terms of the other treatments, since $\tau_3 = -\tau_1 - \tau_2$.
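One way to obtain a factor-effects design matrix in R is sum-to-zero coding (`contr.sum`), in which level A is coded (1, 0), level B is coded (0, 1) and the last level C is coded (-1, -1) after the intercept column. This is a sketch using the fertilizer data; the variable names are illustrative.

```r
fertilizer <- factor(c("A", "A", "B", "B", "C", "C"))
yield <- c(12, 15, 21, 18, 10, 9)

# Sum-to-zero coding: intercept = overall mean, remaining columns = tau_1, tau_2
X_fx <- model.matrix(~ fertilizer,
                     contrasts.arg = list(fertilizer = "contr.sum"))
X_fx

fit_fx <- lm(yield ~ fertilizer, contrasts = list(fertilizer = "contr.sum"))
mu_hat <- unname(coef(fit_fx)[1])      # overall mean (mean of the treatment means)
tau_12 <- unname(coef(fit_fx)[2:3])    # tau_1 and tau_2
tau_hat <- c(tau_12, -sum(tau_12))     # tau_3 follows from the sum-to-zero constraint

mu_hat + tau_hat                       # recovers the three treatment means
```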


Parameter Vectors

• For the cell-means model:

$\beta = (\mu_1, \mu_2, \mu_3)'$

$H_0: \mu_1 = \mu_2 = \mu_3$

• For the factor-effects model:

$\beta = (\mu, \tau_1, \tau_2)'$

$H_0: \tau_1 = \tau_2 = \tau_3 = 0$
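Both null hypotheses describe the same situation (no difference between treatments), and both are tested by the usual one-way ANOVA F-test. A minimal sketch with the fertilizer data, using base R's `lm()` and `anova()`; the variable names are illustrative.

```r
fertilizer <- factor(c("A", "A", "B", "B", "C", "C"))
yield <- c(12, 15, 21, 18, 10, 9)

# One-way ANOVA: the F-test in the first row tests H0 of equal treatment means
fit <- lm(yield ~ fertilizer)
tab <- anova(fit)
tab

f_stat <- tab[["F value"]][1]   # between-treatment MS divided by residual MS
p_val  <- tab[["Pr(>F)"]][1]    # p-value on 2 and 3 degrees of freedom
```

The F statistic does not depend on whether the cell-means or the factor-effects parameterization is used, because both describe the same fitted values.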


Usage

• Most of statistics uses the factor-effects model, because it makes the hypothesis easier to interpret: the null hypothesis is that all the treatment effects are 0.

• However, in LIMMA in R we will use the cell-means model, which makes the design matrix easier to construct, and we then need to define a contrast matrix.


LIMMA and Design Matrices

• This is what LIMMA says about constructing design matrices:

• “The package limma uses an approach called linear models to analyse designed microarray experiments. This approach allows very general experiments to be analysed just as easily as a simple replicated experiment.

• The approach requires one or two matrices to be specified. The first is the design matrix which indicates in effect which RNA samples have been applied to each array. The second is the contrast matrix which specifies which comparisons you would like to make between the RNA samples. For very simple experiments, you may not need to specify the contrast matrix.”


More on Design Matrices

• The philosophy of the approach is as follows. You have to start by fitting a linear model to your data which fully models the systematic part of your data. The model is specified by the design matrix. Each row of the design matrix corresponds to an array in your experiment and each column corresponds to a coefficient which is used to describe the RNA sources in your experiment. With Affymetrix or single-channel data, or with two-color data with a common reference, you will need as many coefficients as you have distinct RNA sources, no more and no less.

• With direct-design two-color data you will need one fewer coefficient than you have distinct RNA sources, unless you wish to estimate a dye effect for each gene, in which case the number of RNA sources and the number of coefficients will be the same. Any set of independent coefficients will do, provided they describe all your treatments. The main purpose of this step is to estimate the variability in the data, hence the systematic part needs to be modeled so it can be distinguished from random variation.


LIMMA: contrasts

• In practice the requirement to have exactly as many coefficients as RNA sources is too restrictive in terms of the questions you might want to answer. You might be interested in more or fewer comparisons between the RNA sources. Hence the contrasts step is provided so that you can take the initial coefficients and compare them in as many ways as you want to answer any questions you might have, regardless of how many or how few these might be.


Writing out Design and Contrast Matrices:

• Example 1:

• This is a one-factor ANOVA with 4 levels.
• The model is $Y_{ij} = \mu_i + \varepsilon_{ij}$, $i = 1, \dots, 4$, $j = 1, \dots, 3$.
• Write out the contrast matrix if we are interested in comparing level 1 to level 2, and level 3 to the mean of levels 1 and 2.


Example 1: Designs and Contrast Matrices

array  $\mu_1$  $\mu_2$  $\mu_3$  $\mu_4$
1      1        0        0        0
2      1        0        0        0
3      1        0        0        0
4      0        1        0        0
5      0        1        0        0
6      0        1        0        0
7      0        0        1        0
8      0        0        1        0
9      0        0        1        0
10     0        0        0        1
11     0        0        0        1
12     0        0        0        1

• The contrast matrix C' (so that B = C'D) for comparing level 1 to level 2, and level 3 to the mean of levels 1 and 2, is:

      $\mu_1$  $\mu_2$  $\mu_3$  $\mu_4$
c1    1        -1       0        0
c2    -1/2     -1/2     1        0
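Example 1 can be written out in R as follows. This is a sketch in base R only; the column names `mu1` to `mu4` and the values in `mu_example` are hypothetical and serve only to show what each contrast computes.

```r
# One factor with 4 levels and 3 arrays per level (12 arrays in total)
level <- factor(rep(1:4, each = 3))

# Cell-means design matrix: one indicator column per level
design <- model.matrix(~ 0 + level)
colnames(design) <- paste0("mu", 1:4)

# Rows are the contrasts c1 and c2 (this is the transposed contrast matrix)
contrasts_t <- rbind(c1 = c(1, -1, 0, 0),
                     c2 = c(-1/2, -1/2, 1, 0))
colnames(contrasts_t) <- colnames(design)

# Hypothetical level means, to see what the contrasts evaluate to
mu_example <- c(mu1 = 10, mu2 = 8, mu3 = 12, mu4 = 7)
contrasts_t %*% mu_example   # c1 = 10 - 8 = 2, c2 = 12 - (10 + 8)/2 = 3
```

Each row of a contrast sums to zero, which is what makes it a comparison between levels rather than an estimate of a level mean.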


Example 2

• This is a two-factor ANOVA with 3 levels for Factor A and 2 levels for Factor B.

• The model is $Y_{ij} = \mu_i + \beta_j + \varepsilon_{ij}$, $i = 1, \dots, 3$, $j = 1, 2$.
• Write out the contrast matrix for comparing Factor A levels 2 and 3, and Factor B levels 1 and 2.


Example 2: Design and Contrast Matrix

The Design Matrix

array  a1b1  a1b2  a2b1  a2b2  a3b1  a3b2
1      1     0     0     0     0     0
2      1     0     0     0     0     0
3      0     1     0     0     0     0
4      0     1     0     0     0     0
5      0     0     1     0     0     0
6      0     0     1     0     0     0
7      0     0     0     1     0     0
8      0     0     0     1     0     0
9      0     0     0     0     1     0
10     0     0     0     0     1     0
11     0     0     0     0     0     1
12     0     0     0     0     0     1

Write out the contrast matrix for comparing:
• Factor A, levels 2 and 3
• Factor A, levels 1 and 3
• Factor B, levels 1 and 2

• Contrasts:

C1:  0  0 -1 -1  1  1
C2: -1 -1  0  0  1  1
C3: -1  1 -1  1 -1  1
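Example 2 can be sketched in base R by treating each combination of the two factors as one cell, assuming two arrays per cell (12 arrays in total). The object names (`cell`, `contrast_matrix`) are illustrative, not from the lecture.

```r
# 3 levels of Factor A, 2 levels of Factor B, 2 arrays per combination (assumed)
A <- rep(1:3, each = 4)
B <- rep(rep(1:2, each = 2), times = 3)
cell_levels <- c("a1b1", "a1b2", "a2b1", "a2b2", "a3b1", "a3b2")
cell <- factor(paste0("a", A, "b", B), levels = cell_levels)

# Cell-means design matrix: one indicator column per factor combination
design <- model.matrix(~ 0 + cell)
colnames(design) <- cell_levels

# One contrast per column, one row per design-matrix column
contrast_matrix <- cbind(
  C1 = c( 0,  0, -1, -1,  1, 1),   # Factor A: level 3 minus level 2
  C2 = c(-1, -1,  0,  0,  1, 1),   # Factor A: level 3 minus level 1
  C3 = c(-1,  1, -1,  1, -1, 1)    # Factor B: level 2 minus level 1
)
rownames(contrast_matrix) <- colnames(design)
contrast_matrix
```

Because each contrast sums over the levels of the other factor, C1 and C2 compare levels of Factor A across both levels of Factor B, and C3 compares levels of Factor B across all levels of Factor A.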


Differential Expressions for Factorial Designs: Design Matrices and Contrasts, using R.

• Example: the Estrogen data set
• Let us consider the Estrogen data set, and look at how we use R to study differential expression using design matrices.

Name      FileName      Target
Abs10.1   low10-1.cel   EstAbsent10
Abs10.2   low10-2.cel   EstAbsent10
Pres10.1  high10-1.cel  EstPresent10
Pres10.2  high10-2.cel  EstPresent10
Abs48.1   low48-1.cel   EstAbsent48
Abs48.2   low48-2.cel   EstAbsent48
Pres48.1  high48-1.cel  EstPresent48
Pres48.2  high48-2.cel  EstPresent48


Description of Experiment

• There are 8 files in all, coming from a 2×2 factorial design, that is, a design with 2 factors, each at 2 levels. The study measured the changes in gene expression in a breast cancer cell line due to estrogen (two levels, presence and absence) at two time points (10 hours and 48 hours). The data from this experiment are available from the Bioconductor website.


Contrasts of Interest

• It is of interest to compare:
1. the effect of estrogen at 10 hours (compare presence to absence at 10 hours),
2. the effect of estrogen at 48 hours (compare presence to absence at 48 hours),
3. the effect of time in the absence of estrogen (compare Absent 10 to Absent 48).


Targets File Method

• There are different ways to do this in R. Let's use the Targets file method, as we did for the two-condition comparison before.

• So let's first put together a tab-delimited text file like the one above. I call it EstrogenTargets.txt. It gives a name, the file name, and the target, which contains the factor level information.


Design matrix method

• One way to do this in R (to me it’s the simplest one in terms of design matrices) is to write a design matrix using the factor combinations, WITHOUT the intercept term.

• R (at least LIMMA) writes the design matrix as:

EstAbsent10  EstPresent10  EstAbsent48  EstPresent48
1            0             0            0
1            0             0            0
0            1             0            0
0            1             0            0
0            0             1            0
0            0             1            0
0            0             0            1
0            0             0            1

• So our model is $Y = X\alpha_g + \varepsilon$, where $\alpha_g$ is the vector of coefficients for gene $g$.


Contrast Matrix

• Now, to define the contrasts, we need to look at the transformation:

• $\beta_g = C'\alpha_g$

• So we define C through its transpose:

C' =
-1   1   0   0
 0   0  -1   1
-1   0   1   0

• This defines the three comparisons:
(EstPresent10 - EstAbsent10)
(EstPresent48 - EstAbsent48)
(EstAbsent48 - EstAbsent10)
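The transformation from cell-means coefficients to contrasts is just a matrix product. The sketch below applies it to one hypothetical gene in base R; the coefficient values in `alpha_g` and the contrast names (`Est10`, `Est48`, `TimeAbs`) are made up for illustration.

```r
targets <- c("EstAbsent10", "EstPresent10", "EstAbsent48", "EstPresent48")

# Transposed contrast matrix: one row per comparison, one column per coefficient
C_t <- rbind(Est10   = c(-1, 1,  0, 0),   # EstPresent10 - EstAbsent10
             Est48   = c( 0, 0, -1, 1),   # EstPresent48 - EstAbsent48
             TimeAbs = c(-1, 0,  1, 0))   # EstAbsent48  - EstAbsent10
colnames(C_t) <- targets

# Hypothetical log-expression coefficients for a single gene
alpha_g <- c(EstAbsent10 = 7.0, EstPresent10 = 9.5,
             EstAbsent48 = 7.5, EstPresent48 = 11.0)

# Contrast values for that gene
beta_g <- C_t %*% alpha_g
beta_g   # Est10 = 2.5, Est48 = 3.5, TimeAbs = 0.5
```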


In R using Targets file

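library(limma)
# Assumes EstrogenTargets.txt (Name, FileName, Target) is in the working directory
targets <- readTargets("EstrogenTargets.txt")

# Cell-means design: one column per Target, no intercept
design <- model.matrix(~ -1 + factor(targets$Target, levels = unique(targets$Target)))
colnames(design) <- unique(targets$Target)
numParameters <- ncol(design)
parameterNames <- colnames(design)

# One contrast per column; the values are filled in column by column
contrastMatrix <- matrix(c(-1,1,0,0, 0,0,-1,1, -1,0,1,0), nrow = ncol(design))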

Using the Targets file is efficient if you know how R works, because you do not have to type in the design matrix by hand.


In R using the design matrix directly

design <- matrix(c(1,1,0,0,0,0,0,0, 0,0,1,1,0,0,0,0, 0,0,0,0,1,1,0,0, 0,0,0,0,0,0,1,1), nrow = 8)

contrastMatrix <- matrix(c(-1,1,0,0, 0,0,-1,1, -1,0,1,0), nrow = ncol(design))

• R fills a matrix column by column, so each group of values above is one column of the matrix.
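The column-wise filling is easy to get wrong when a matrix is typed in by hand, so it is worth checking on a small example. The sketch below contrasts the default with `byrow = TRUE` and then builds the 8 x 4 estrogen design both ways; the object names are illustrative.

```r
vals <- 1:6
by_col <- matrix(vals, nrow = 2)                 # default: fill column by column
by_row <- matrix(vals, nrow = 2, byrow = TRUE)   # fill row by row instead
by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3

# The 8 x 4 design typed one ROW (array) at a time, which matches how it is read
design_rows <- matrix(c(1,0,0,0,
                        1,0,0,0,
                        0,1,0,0,
                        0,1,0,0,
                        0,0,1,0,
                        0,0,1,0,
                        0,0,0,1,
                        0,0,0,1), nrow = 8, byrow = TRUE)

# The same design typed one COLUMN (coefficient) at a time
design_cols <- matrix(c(1,1,0,0,0,0,0,0, 0,0,1,1,0,0,0,0,
                        0,0,0,0,1,1,0,0, 0,0,0,0,0,0,1,1), nrow = 8)

identical(design_rows, design_cols)   # TRUE
```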


An example for Optimal Designs

• Suppose we have 12 arrays in a single-channel framework and 5 conditions that we want to compare.

• Because 12 arrays cannot be split evenly across 5 conditions, the design is unbalanced, and it is harder to construct an orthogonal design here.

• Sometimes people use classes of designs that are already available and have properties like orthogonality.

• Designs in this class include Margolin designs (fewer than 6 conditions), Plackett-Burman designs and other such designs.


Consider the following Margolin Design: orthogonal for 6 conditions and 12 arrays

array  col1  col2  col3  col4  col5  col6  col7
1       1     1     1     1     1     1     1
2       1     1    -1    -1    -1    -1     1
3       1    -1     1    -1    -1    -1     1
4       1    -1    -1     1     1    -1     1
5       1    -1    -1     1    -1     1     1
6       1    -1    -1    -1     1     1     1
7       1    -1    -1    -1    -1    -1    -1
8       1    -1     1     1     1     1    -1
9       1     1    -1     1     1     1    -1
10      1     1     1    -1    -1     1    -1
11      1     1     1    -1     1    -1    -1
12      1     1     1     1    -1    -1    -1


What if I have 5 conditions

• In some ways we could drop one column and use the Design matrix with the dropped column to preserve some optimality conditions.

• Question is which column to drop?

• The following R-code helps us decide whether we drop column 2 or 3 or 4.


A <- matrix(c(1,1,1,1,1,1,1,
              1,1,-1,-1,-1,-1,1,
              1,-1,1,-1,-1,-1,1,
              1,-1,-1,1,1,-1,1,
              1,-1,-1,1,-1,1,1,
              1,-1,-1,-1,1,1,1,
              1,-1,-1,-1,-1,-1,-1,
              1,-1,1,1,1,1,-1,
              1,1,-1,1,1,1,-1,
              1,1,1,-1,-1,1,-1,
              1,1,1,-1,1,-1,-1,
              1,1,1,1,-1,-1,-1), nrow = 12)
# Note: matrix() fills column by column unless byrow = TRUE is given
B <- t(A)
C <- B %*% A     # X'X
D <- solve(C)    # inverse of X'X
det(D)
# [1] 5.353961e-07
sum(diag(D))
# [1] 2.065789


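A1 <- A[, -2]   # design with column 2 dropped
A2 <- A[, -3]   # design with column 3 dropped
A4 <- A[, -4]   # design with column 4 dropped

# X'X for each reduced design
a1ta1 <- t(A1) %*% A1
a2ta2 <- t(A2) %*% A2
a4ta4 <- t(A4) %*% A4

# Inverse of X'X for each reduced design
b1 <- solve(a1ta1)
b2 <- solve(a2ta2)
b3 <- solve(a4ta4)

# Trace of each inverse
aa1 <- sum(diag(b1))
aa2 <- sum(diag(b2))
aa4 <- sum(diag(b3))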


Results from dropping columns

> aa1
[1] 1.256966 (trace after dropping col 2)
> aa2
[1] 0.8231631 (trace after dropping col 3)
> aa4
[1] 1.322289 (trace after dropping col 4)
> det(b1)
[1] 3.023413e-06 (determinant after dropping col 2)
> det(b2)
[1] 1.216143e-06 (determinant after dropping col 3)
> det(b3)
[1] 2.941453e-06 (determinant after dropping col 4)

• Dropping column 3 gives both the smallest trace and the smallest determinant of the inverse of X'X, so of the three candidates it is the column to drop.
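The drop-one-column comparison can be wrapped in a small helper so that the trace and determinant of the inverse of X'X are computed the same way for every candidate. This is a sketch: `design_criteria` and `drop_column_criteria` are hypothetical helper names, and a small 2 x 2 factorial design stands in for the 12-array design so that the numbers are easy to check by hand.

```r
# Trace and determinant of (X'X)^{-1}; smaller values mean more precise estimates
design_criteria <- function(X) {
  V <- solve(crossprod(X))          # crossprod(X) is t(X) %*% X
  c(trace = sum(diag(V)), determinant = det(V))
}

# Apply the criteria to X with each candidate column removed in turn
drop_column_criteria <- function(X, cols) {
  out <- t(sapply(cols, function(j) design_criteria(X[, -j, drop = FALSE])))
  rownames(out) <- paste0("drop_col_", cols)
  out
}

# Stand-in design: intercept plus two orthogonal +/-1 columns, 4 runs
X <- cbind(1, c(-1, 1, -1, 1), c(-1, -1, 1, 1))

full <- design_criteria(X)                  # X'X = 4 * I, so trace = 3/4, det = 1/64
dropped <- drop_column_criteria(X, 2:3)     # each gives trace = 1/2, det = 1/16
full
dropped
```

For an orthogonal stand-in like this one every column is equivalent, so both rows of `dropped` agree; with a non-orthogonal design the rows differ, and the row with the smallest values identifies the column to drop.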