lecture 6 design matrices and anova and how this is done in limma

Lecture 6

Design Matrices and ANOVA and how this is done in LIMMA

ANOVA: Some Examples

• Is there a difference in the mean hourly wages for three different ethnic groups?

• Is there a difference in the mean sugar content of five different brands of cereal?

• Is there a difference between the Mutant and Wild Type versions of an organism?

• Is there a dye effect, as well as a treatment effect?

• For a time course experiment, are there significant differences in gene expression between the different time points?

Model for ANOVA

• The general linear model which applies for ANOVA, Regression as well as ANCOVA is written as:

• $Y = X\beta + \varepsilon$, with dimensions $(n \times 1) = (n \times p)(p \times 1) + (n \times 1)$

• This is the matrix formulation of the model.
• $Y$: response vector (observed)
• $X$: design matrix (observed)
• $\beta$: parameter vector (to be estimated)
• $\varepsilon$: error vector (unobserved, random)

How to write a Design Matrix

• Consider a data set where we compare 3 different fertilizers, A, B and C. For each fertilizer we have two plots of land.

• Data:

Plot  Fertilizer  Yield (tonnes)
1     A           12
2     A           15
3     B           21
4     B           18
5     C           10
6     C           9

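Models: cell means model

• We can write this as:
• $Y_{ij} = \mu_i + \varepsilon_{ij}$, where $Y_{ij}$ is the yield of plot $j$ under fertilizer $i$ and $\mu_i$ is the mean for fertilizer $i$.
• This is the cell-means model.

• The corresponding design matrix is:

1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
0 0 1

Each row corresponds to a unit (plot) and each column corresponds to a treatment (fertilizer).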
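A minimal R sketch of fitting the cell-means model to the fertilizer data above. The variable names (`yield`, `fert`, `X`, `b_hat`) are illustrative and not from the lecture; the estimate is ordinary least squares, obtained by solving the normal equations.

```r
# Fertilizer data from the table above (plots 1-6).
yield <- c(12, 15, 21, 18, 10, 9)
fert  <- factor(c("A", "A", "B", "B", "C", "C"))

# Cell-means design matrix: no intercept, one indicator column per fertilizer.
X <- model.matrix(~ 0 + fert)
colnames(X) <- levels(fert)

# Least-squares estimate: solve (X'X) b = X'Y.
b_hat <- solve(crossprod(X), crossprod(X, yield))

print(X)
print(b_hat)  # each estimate is the mean yield of that fertilizer
```

With this design matrix the estimated parameters are simply the three group means, which is why the cell-means form is convenient when contrasts are defined afterwards.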

Model: Factor effect Model

• We can write this as:
• $Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$
• This is the factor effects model. Here $\mu$ is an OVERALL mean and the $\tau_i$ are the differences of each treatment level (factor level) from the overall mean. We add the requirement that $\sum_i \tau_i = 0$.

• The corresponding design matrix is:

1 1 0
1 1 0
1 0 1
1 0 1
1 -1 -1
1 -1 -1

Each row corresponds to a unit. The columns correspond to $\mu$, $\tau_1$ and $\tau_2$; the last treatment is expressed in terms of the other treatments, since $\tau_3 = -\tau_1 - \tau_2$.
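A minimal R sketch of the factor effects parameterization for the fertilizer data, assuming the sum-to-zero coding that base R provides as `contr.sum`. The variable names are illustrative and not from the lecture.

```r
yield <- c(12, 15, 21, 18, 10, 9)
fert  <- factor(c("A", "A", "B", "B", "C", "C"))

# Sum-to-zero coding: columns are the overall mean, tau_1 and tau_2.
# Rows for fertilizer C are coded (1, -1, -1) because tau_3 = -(tau_1 + tau_2).
X <- model.matrix(~ fert, contrasts.arg = list(fert = "contr.sum"))

b_hat <- solve(crossprod(X), crossprod(X, yield))
mu    <- b_hat[1]
tau   <- c(b_hat[2], b_hat[3], -(b_hat[2] + b_hat[3]))

print(X)
print(mu)        # overall mean of the three cell means
print(tau)       # treatment effects, which sum to zero
print(mu + tau)  # recovers the cell means of the three fertilizers
```

Both parameterizations describe the same fitted means; only the meaning of the coefficients changes.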

Parameter Vectors

• For the cell means model:

$\beta = (\mu_1, \mu_2, \mu_3)'$

$H_0: \mu_1 = \mu_2 = \mu_3$

• For the factor effects model:

$\beta = (\mu, \tau_1, \tau_2)'$

$H_0: \tau_1 = \tau_2 = \tau_3 = 0$
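Both forms of the null hypothesis lead to the same F-test of no treatment differences. The sketch below, for the fertilizer data, computes the F statistic by hand and compares it with R's `anova(lm(...))`. The variable names are illustrative and not from the lecture.

```r
yield <- c(12, 15, 21, 18, 10, 9)
fert  <- factor(c("A", "A", "B", "B", "C", "C"))

group_means <- tapply(yield, fert, mean)
n_per_group <- tapply(yield, fert, length)

# Between-treatment and within-treatment sums of squares.
ss_between <- sum(n_per_group * (group_means - mean(yield))^2)
ss_within  <- sum((yield - group_means[as.character(fert)])^2)

df_between <- nlevels(fert) - 1
df_within  <- length(yield) - nlevels(fert)

f_stat  <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# The same test from R's built-in linear model machinery.
fit_table <- anova(lm(yield ~ fert))

print(c(F = f_stat, p = p_value))
print(fit_table)
```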

Usage

• Most of Statistics uses the Factor effects model as it makes the interpretation of the hypothesis easier as we are testing our null that all the treatment effects are 0.

• However, in LIMMA in R we will use the easier cell-means model for design matrix construction and we need to define a contrast matrix.

LIMMA and Design Matrices

• This is what LIMMA says about constructing design matrices:

• “The package limma uses an approach called linear models to analyse designed microarray experiments. This approach allows very general experiments to be analysed just as easily as a simple replicated experiment.

• The approach requires one or two matrices to be specified. The first is the design matrix which indicates in effect which RNA samples have been applied to each array. The second is the contrast matrix which specifies which comparisons you would like to make between the RNA samples. For very simple experiments, you may not need to specify the contrast matrix.”

More on Design Matrices

• The philosophy of the approach is as follows. You have to start by fitting a linear model to your data which fully models the systematic part of your data. The model is specified by the design matrix. Each row of the design matrix corresponds to an array in your experiment and each column corresponds to a coefficient which is used to describe the RNA sources in your experiment. With Affymetrix or single-channel data, or with two-color with a common reference, you will need as many coefficients as you have distinct RNA sources, no more and no less.

• With direct-design two-color data you will need one fewer coefficient than you have distinct RNA sources, unless you wish to estimate a dye-effect for each gene, in which case the number of RNA sources and the number of coefficients will be the same. Any set of independent coefficients will do, providing they describe all your treatments. The main purpose of this step is to estimate the variability in the data, hence the systematic part needs to be modeled so it can be distinguished from random variation.

LIMMA: contrasts

• In practice the requirement to have exactly as many coefficients as RNA sources is too restrictive in terms of questions you might want to answer. You might be interested in more or fewer comparisons between the RNA sources. Hence the contrasts step is provided so that you can take the initial coefficients and compare them in as many ways as you want to answer any questions you might have, regardless of how many or how few these might be.

Writing out Design and Contrast Matrices:

• Example 1:

• This is a one-factor ANOVA with 4 levels.
• The model is $Y_{ij} = \mu_i + \varepsilon_{ij}$, i = 1,…,4, j = 1,…,3.
• Write out the contrast matrix if we were interested in comparing level 1 to level 2, and level 3 to the mean of levels 1 and 2.

Example 1: Designs and Contrast Matrices

array m1 m2 m3 m4

1 1 0 0 0

2 1 0 0 0

3 1 0 0 0

4 0 1 0 0

5 0 1 0 0

6 0 1 0 0

7 0 0 1 0

8 0 0 1 0

9 0 0 1 0

10 0 0 0 1

11 0 0 0 1

12 0 0 0 1

• The contrast matrix for comparing level 1 to level 2, and level 3 to the mean of levels 1 and 2, is defined so that B = C'D.

• The rows of C' are:

     m1    m2    m3  m4
c1   1     -1    0   0
c2   -1/2  -1/2  1   0
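A minimal R sketch of Example 1, building the 12-array design matrix and applying the two contrasts. The cell means in `m` are hypothetical values chosen only to illustrate the arithmetic; the object names are not from the lecture.

```r
# One-factor layout: 4 levels, 3 arrays per level (12 arrays in total).
level  <- factor(rep(1:4, each = 3))
design <- model.matrix(~ 0 + level)
colnames(design) <- paste0("m", 1:4)

# Rows of C' as on the slide: c1 = m1 - m2, c2 = m3 - (m1 + m2) / 2.
Ct <- rbind(c1 = c(1, -1, 0, 0),
            c2 = c(-1/2, -1/2, 1, 0))
colnames(Ct) <- colnames(design)

# Hypothetical cell means, used only to show what each contrast measures.
m <- c(m1 = 10, m2 = 8, m3 = 12, m4 = 7)

contrast_values <- drop(Ct %*% m)
print(design)
print(contrast_values)
```

Each row of C' sums to zero, which is what makes it a comparison between levels rather than an estimate of an overall level.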

Example 2

• This is a two-factor ANOVA with 3 levels for Factor A and 2 levels for Factor B.

• The model is $Y_{ij} = \mu_i + \beta_j + \varepsilon_{ij}$, i = 1,…,3, j = 1,…,2.
• Write out the contrast matrix for comparing Factor A, levels 2 and 3, and Factor B, levels 1 and 2.

Example 2: Design and Contrast Matrix

The Design Matrix

array a1b1 a1b2 a2b1 a2b2 a3b1 a3b2

1 1 0 0 0 0 0

2 1 0 0 0 0 0

3 0 1 0 0 0 0

4 0 1 0 0 0 0

5 0 0 1 0 0 0

6 0 0 1 0 0 0

7 0 0 0 1 0 0

8 0 0 0 1 0 0

9 0 0 0 0 1 0

10 0 0 0 0 1 0

11 0 0 0 0 0 1

12 0 0 0 0 0 1

Write out the contrast matrix for comparing:
• Factor A, levels 2 and 3
• Factor A, levels 1 and 3
• Factor B, levels 1 and 2

• Contrasts:

C1: 0 0 -1 -1 1 1
C2: -1 -1 0 0 1 1
C3: -1 1 -1 1 -1 1
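A minimal R sketch of Example 2, assuming two arrays for each of the six factor combinations. The cell means are hypothetical (additive effects chosen for illustration) and the object names are not from the lecture.

```r
# Two arrays for each of the 3 x 2 = 6 factor combinations.
cells  <- c("a1b1", "a1b2", "a2b1", "a2b2", "a3b1", "a3b2")
cell   <- factor(rep(cells, each = 2), levels = cells)
design <- model.matrix(~ 0 + cell)
colnames(design) <- cells

# Contrast rows: A level 3 vs 2, A level 3 vs 1, B level 2 vs 1.
Ct <- rbind(C1 = c( 0,  0, -1, -1,  1,  1),
            C2 = c(-1, -1,  0,  0,  1,  1),
            C3 = c(-1,  1, -1,  1, -1,  1))
colnames(Ct) <- cells

# Hypothetical additive cell means: A effects (0, 1, 3), B effects (0, 2).
cell_means <- c(a1b1 = 0, a1b2 = 2, a2b1 = 1, a2b2 = 3, a3b1 = 3, a3b2 = 5)

contrast_values <- drop(Ct %*% cell_means)
print(dim(design))
print(contrast_values)
```

Because these rows add cells up rather than averaging them, C1 and C2 equal twice the difference between the Factor A levels and C3 equals three times the Factor B difference. Dividing the rows by 2 and 3 would put them on the scale of a single level difference.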

Differential Expressions for Factorial Designs: Design Matrices and Contrasts, using R.

• Example: the Estrogen data set
• Let us consider the Estrogen data set, and look at how we use R to study differential expression using design matrices.

Name      FileName      Target
Abs10.1   low10-1.cel   EstAbsent10
Abs10.2   low10-2.cel   EstAbsent10
Pres10.1  high10-1.cel  EstPresent10
Pres10.2  high10-2.cel  EstPresent10
Abs48.1   low48-1.cel   EstAbsent48
Abs48.2   low48-2.cel   EstAbsent48
Pres48.1  high48-1.cel  EstPresent48
Pres48.2  high48-2.cel  EstPresent48

Description of Experiment

• There are 8 files in all, coming from a 2×2 factorial design, i.e. a design with 2 factors each at 2 levels. The study measured the changes in gene expression in breast cancer cells due to estrogen (two levels: Presence and Absence) at two time points (10hr and 48hr). The data from this experiment are available at the Bioconductor website.

Contrasts of Interest

• It is of interest to compare:
1. the effect of estrogen at 10 hours (compare presence to absence at 10 hours),
2. the effect of estrogen at 48 hours (compare presence to absence at 48 hours),
3. the effect of time in the absence of estrogen (compare Absent 10 to Absent 48).

Targets File Method

• There are different ways to do this in R. Let's use the Targets file method, as we did for the 2-condition comparison before.

• So let's first put together a tab-delimited text file like the one above. I call it EstrogenTargets.txt; it gives a name, the filename and the target, which contains the factor level information.

Design matrix method

• One way to do this in R (to me it’s the simplest one in terms of design matrices) is to write a design matrix using the factor combinations, WITHOUT the intercept term.

• R (at least LIMMA) writes the design matrix as:

EstAbsent10  EstPresent10  EstAbsent48  EstPresent48
1            0             0            0
1            0             0            0
0            1             0            0
0            1             0            0
0            0             1            0
0            0             1            0
0            0             0            1
0            0             0            1

• So our model is $Y = X\alpha_g + \varepsilon$, where $\alpha_g$ is the vector of coefficients for gene $g$.

Contrast Matrix

• Now to define the contrasts we need to look at the transformation:

• $\beta_g = C'\alpha_g$

• So we define C through:

C' = -1  1  0  0
      0  0 -1  1
     -1  0  1  0

• This will define:
(EstPresent10 - EstAbsent10)
(EstPresent48 - EstAbsent48)
(EstAbsent48 - EstAbsent10)
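A minimal R sketch of the transformation from group coefficients to contrasts for a single gene. The group means in `alpha_g` are hypothetical log2 expression values, and the contrast names (`Est10`, `Est48`, `TimeAbs`) are labels invented here for readability.

```r
groups <- c("EstAbsent10", "EstPresent10", "EstAbsent48", "EstPresent48")

# C has one column per contrast; t(C) is the C' written on the slide.
C <- cbind(Est10   = c(-1,  1,  0, 0),
           Est48   = c( 0,  0, -1, 1),
           TimeAbs = c(-1,  0,  1, 0))
rownames(C) <- groups

# Hypothetical group means (log2 expression) for one gene g.
alpha_g <- c(EstAbsent10 = 7.0, EstPresent10 = 9.0,
             EstAbsent48 = 7.5, EstPresent48 = 10.5)

beta_g <- drop(t(C) %*% alpha_g)
print(t(C))
print(beta_g)  # estrogen effect at 10h, at 48h, and time effect without estrogen
```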

In R using Targets file

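# assumes the targets file described above is read with limma's readTargets()
library(limma)
targets=readTargets("EstrogenTargets.txt")

# one column per Target level, no intercept, columns in order of first appearance
design=model.matrix(~-1+factor(targets$Target,levels=unique(targets$Target)))

colnames(design)=unique(targets$Target)
numParameters=ncol(design)
parameterNames=colnames(design)

# one column per contrast (filled column by column)
contrastMatrix=matrix(c(-1,1,0,0,0,0,-1,1,-1,0,1,0),nrow=ncol(design))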

Using the Targets file is efficient if you know how R works, since you don’t have to type in the design matrix by hand.
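A sketch, on simulated data, of the computation that the design and contrast matrices drive: per-gene least-squares coefficients followed by contrasts. The expression values are simulated and the targets table is typed in rather than read from EstrogenTargets.txt, so nothing here is a real result. In limma the corresponding steps are `lmFit()` and `contrasts.fit()`, followed by `eBayes()` for moderated statistics; base R is used here only to make the arithmetic visible.

```r
set.seed(1)

# Stand-in for the targets file (typed in here instead of read from disk).
targets <- data.frame(
  Name   = c("Abs10.1", "Abs10.2", "Pres10.1", "Pres10.2",
             "Abs48.1", "Abs48.2", "Pres48.1", "Pres48.2"),
  Target = rep(c("EstAbsent10", "EstPresent10",
                 "EstAbsent48", "EstPresent48"), each = 2),
  stringsAsFactors = FALSE
)

groups <- unique(targets$Target)
design <- model.matrix(~ 0 + factor(targets$Target, levels = groups))
colnames(design) <- groups

contrastMatrix <- matrix(c(-1, 1, 0, 0,  0, 0, -1, 1,  -1, 0, 1, 0),
                         nrow = ncol(design),
                         dimnames = list(groups, c("Est10", "Est48", "TimeAbs")))

# Simulated log-expression matrix: 5 genes (rows) by 8 arrays (columns).
expr <- matrix(rnorm(5 * 8, mean = 8), nrow = 5,
               dimnames = list(paste0("gene", 1:5), targets$Name))

# Per-gene coefficients alpha_g (group means), then contrasts beta_g = C' alpha_g.
alpha <- t(solve(crossprod(design), crossprod(design, t(expr))))
beta  <- alpha %*% contrastMatrix

print(round(beta, 3))
```

Each row of `beta` holds the three contrasts for one gene; the first column, for example, is the mean of the two EstPresent10 arrays minus the mean of the two EstAbsent10 arrays.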

In R using the design matrix directly

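• design<-matrix(c(1,0,0,0, 1,0,0,0, 0,1,0,0, 0,1,0,0, 0,0,1,0, 0,0,1,0, 0,0,0,1, 0,0,0,1),nrow=8,byrow=TRUE)

• contrastMatrix=matrix(c(-1,1,0,0,0,0,-1,1,-1,0,1,0),nrow=ncol(design))

• R fills matrices column by column by default. The design matrix above is typed row by row (one array per row), so it needs byrow=TRUE; the contrast matrix is typed column by column (one contrast per column), so the default fill order is what we want.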
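A small illustration of R's fill order, using a hypothetical design with 4 arrays and 2 groups (not from the lecture). `matrix()` reads its values down the columns unless `byrow = TRUE` is given, so values typed one array per row only land in the intended positions with `byrow = TRUE`.

```r
# Typed row by row: arrays 1-2 belong to group 1, arrays 3-4 to group 2.
values <- c(1, 0,
            1, 0,
            0, 1,
            0, 1)

by_column <- matrix(values, nrow = 4)                # default: fill down the columns
by_row    <- matrix(values, nrow = 4, byrow = TRUE)  # fill across the rows, as typed

print(by_column)  # arrays 1 and 3 end up in group 1: not the intended design
print(by_row)     # arrays 1 and 2 are in group 1: the intended design
```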

An example for Optimal Designs

• Suppose we have 12 arrays in a single channel framework and we have 5 conditions that we want to compare.

• Because 12 arrays cannot be split evenly across 5 conditions, the design is unbalanced, which makes it harder to construct an orthogonal design here.

• Sometimes people use classes of design that are already available and have properties like orthogonality.

• Designs in this class include: Margolin Designs (less than 6 conditions), Plackett-Burman designs and other such designs.

Consider the following Margolin Design: orthogonal for 6 conditions and 12 arrays

array  col1  col2  col3  col4  col5  col6  col7
1       1     1     1     1     1     1     1
2       1     1    -1    -1    -1    -1     1
3       1    -1     1    -1    -1    -1     1
4       1    -1    -1     1     1    -1     1
5       1    -1    -1     1    -1     1     1
6       1    -1    -1    -1     1     1     1
7       1    -1    -1    -1    -1    -1    -1
8       1    -1     1     1     1     1    -1
9       1     1    -1     1     1     1    -1
10      1     1     1    -1    -1     1    -1
11      1     1     1    -1     1    -1    -1
12      1     1     1     1    -1    -1    -1

What if I have 5 conditions

• In some ways we could drop one column and use the Design matrix with the dropped column to preserve some optimality conditions.

• Question is which column to drop?

• The following R-code helps us decide whether we drop column 2 or 3 or 4.

> # note: matrix() fills column by column unless byrow=TRUE is given
> A<-matrix(c(1,1,1,1,1,1,1,
+ 1,1,-1,-1,-1,-1,1,
+ 1,-1,1,-1,-1,-1,1,
+ 1,-1,-1,1,1,-1,1,
+ 1,-1,-1,1,-1,1,1,
+ 1,-1,-1,-1,1,1,1,
+ 1,-1,-1,-1,-1,-1,-1,
+ 1,-1,1,1,1,1,-1,
+ 1,1,-1,1,1,1,-1,
+ 1,1,1,-1,-1,1,-1,
+ 1,1,1,-1,1,-1,-1,
+ 1,1,1,1,-1,-1,-1), nrow=12)
> B<-t(A)
> C<-B%*%A
> D<-solve(C)
> det(D)
[1] 5.353961e-07
> sum(diag(D))
[1] 2.065789

> A1<-A[,-2]
> A2<-A[,-3]
> A4<-A[,-4]
> A1t=t(A1)
> A2t=t(A2)
> A4t=t(A4)
> a1ta1=A1t%*%A1
> a2ta2=A2t%*%A2
> a4ta4=A4t%*%A4
> b1=solve(a1ta1)
> b2=solve(a2ta2)
> b3=solve(a4ta4)
> aa1=sum(diag(b1))
> aa2=sum(diag(b2))
> aa4=sum(diag(b3))

Results from dropping columns

> aa1
[1] 1.256966      (trace after dropping col 2)
> aa2
[1] 0.8231631     (trace after dropping col 3)
> aa4
[1] 1.322289      (trace after dropping col 4)
> det(b1)
[1] 3.023413e-06  (determinant after dropping col 2)
> det(b2)
[1] 1.216143e-06  (determinant after dropping col 3)
> det(b3)
[1] 2.941453e-06  (determinant after dropping col 4)
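Why the trace and the determinant: under the linear model the covariance of the least-squares estimate is proportional to (X'X)^-1, so its trace (the A-criterion, the sum of the coefficient variances) and its determinant (the D-criterion, a generalized variance) measure how precisely a design estimates the coefficients, with smaller being better. The sketch below wraps this in a hypothetical helper, `design_criteria` (a name invented here), and enters the 12-array design row by row with `byrow = TRUE`. Its output is therefore not expected to match the transcript above, whose `matrix()` call relies on R's default column-wise fill.

```r
# Hypothetical helper: A- and D-criteria of a design matrix X,
# i.e. the trace and determinant of (X'X)^{-1}.
design_criteria <- function(X) {
  V <- solve(crossprod(X))
  c(trace = sum(diag(V)), det = det(V))
}

# The 12-array design from the slide, typed row by row (hence byrow = TRUE).
A <- matrix(c(1,  1,  1,  1,  1,  1,  1,
              1,  1, -1, -1, -1, -1,  1,
              1, -1,  1, -1, -1, -1,  1,
              1, -1, -1,  1,  1, -1,  1,
              1, -1, -1,  1, -1,  1,  1,
              1, -1, -1, -1,  1,  1,  1,
              1, -1, -1, -1, -1, -1, -1,
              1, -1,  1,  1,  1,  1, -1,
              1,  1, -1,  1,  1,  1, -1,
              1,  1,  1, -1, -1,  1, -1,
              1,  1,  1, -1,  1, -1, -1,
              1,  1,  1,  1, -1, -1, -1),
            nrow = 12, byrow = TRUE)

full    <- design_criteria(A)
dropped <- sapply(c(col2 = 2, col3 = 3, col4 = 4),
                  function(j) design_criteria(A[, -j]))

print(full)
print(dropped)  # one column of criteria per dropped design column
```

The same helper can be applied to any candidate design, which makes it easy to compare alternatives before committing arrays to conditions.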
