design of microarray experiments - college of agriculture ... · pdf filesame rna; multiple...
TRANSCRIPT
Design of Microarray Experiments
Xiangqin Cui
Experimental design
• Experimental design: is a term used about efficient methods for planning the collection ofdata, in order to obtain the maximum amount of information for the least amount of work. Anyone collecting and analyzing data, be it in the lab, the field or the production plant, can benefit from knowledge about experimental design.http://www.stat.sdu.dk/matstat/Design/index.html
Experimental Design Principles
• Randomization• Replication• Blocking• Use of factorial experiments instead of the
one-factor-at-a-time methods.• Orthogonality
Randomization
• The experimental treatments are assigned to the experimental units (subjects) in a random fashion. It helps to eliminate effect of "lurking variables", uncontrolled factors which might vary over the length of the experiment.
Commonly Used Randomization Methods
• Number the subjects to be randomized and then randomly draw the numbers using paper pieces in a hat or computer random number generator.
1 2 3 4 5 6
Hormone treatment : (1,3,4); (1,2,6)
Control : (2,5,6); (3,4,5)
Plan 1 plan 2
Randomization in Microarray Experiments
• Randomize arrays in respect to samples• Randomize samples in respect to
treatments• Randomize the order of performing the
labeling, hybridizations, scanning ……
Replication• Replication is repeating the creation of a
phenomenon, so that the variability associated with the phenomenon can be estimated.
Replications should not be confused with repeated measurements which refer to literally taking several measurements of a single occurrence of a phenomenon.
Replication in microarray experiments
• What to replicate?– Biological replicates (replicates at the experimental
unit level, e.g. mouse, plant, pot of plants…)• Experimental unit is the unit that the experiment treatment or
condition is directly applied to, e.g. a plant if hormone is sprayed to individual plants; a pot of seedlings if different fertilizers are applied to different pots.
– Technical replicates• Any replicates below the experimental unit, e.g. different
leaves from the same plant sprayed with one hormone level; different seedlings from the same pot; Different aliquots of the same RNA extraction; multiple arrays hybridized to the same RNA; multiple spots on the same array.
Power: the probability for a real positive to be identified
It depends on: 1) The size of difference (log ratio) to be detected
The bigger the difference, the higher the power.
2) The sample sizeThe larger the sample size, the higher the power.
3) The error variance of the treatmentThe bigger the error variance, the lower the power.
How Many Replicates
How Many Replicates
* Minimum number of replication (r) per group: 3 – 5 (two groups)
(reliable variance estimates, permutation test, df error, etc.)
rrB2
rrB1
A2A1
Factorial 2 x 2
1173Total
840Error
111A*B
111B
111A
(r=3)(r=2)(r=1)S.V.
* Sample size calculation for experiments with two groups
REFERENCE:
BALANCED BLOCK:
( )2
2)1()2/1(
)/(zz4
nσδ
+= β−α−
( )2
2)1()2/1(
)/(zz
nτδ
+= β−α−
# arrays (total)log(FC)
SD
(in general, τ > σ)
Example (Dobbin and Simon, 2003):
δ=1, α=0.001, β=0.05:σ ≈ 0.50 → n = 30 (i.e. 15A + 15B + 30R)
τ ≈ 0.67 → n = 17 (i.e. 17A + 17B)
Error variance (EV) of the estimated treatment effect:
)(2 22
422
eA
eeM
mnmnmEV
σσσσσ
+−+=
m : number of mice per treatmentn : number of array pairs per mouse
Reference DesignTrt A Trt B
M1 M2 M3 M4
R R R R
Error variance of the estimated treatment effect:
mnmEV eM
22 σσ+=Approximately
Resource allocation:
AM nCmmCCost ⋅+=
m mice / trtn array pairs / mouse
CM cost / mouseCA cost / array pair
The optimum number of array pairs per mouse:
A
M
M
e
CCn ⋅= 2
2
σσ
mnmEV eM
22 σσ+=
Examples for resource allocation
Mouse price Array price/pair # of array pairsper mouse
$15 $600 1
$300 $600 1
$1500 $600 2
• Using variance components estimated from kidney in Project Normal data.• r = 1 ( no replicated spots on array )• Reference design
More efficient array level designs, such as direct comparisons and loop designs, can reduce the optimum number of arrays per mouse.
Pooling Samples
22 1Mpool k
σσ α=
k : pool size
α : constant for the effect of pooling. 0 < α < 1α = 0, pooling has no effect.α = 1, pooling has maximum effect.
Pooling can reduce the biological variance but not the technicalvariances. The biological variance will be replaced by:
Note: More pools with fewer arrays per pool is better than fewer pools with more arrays per pools when fixed number of arrays and mice are to be used.
Power Increase to Detect 2 fold change by Pooling
2 pools / trt4 pools / trt
6 pools / trt8 pools / trt
2 mice / trt4 mice / trt
6 mice / trt8 mice / trt
( Pool size k = 3, α = 1 )
Significance level:0.05 after Bonferroni correction
PowerAtlas• It is an online tool to calculate sample size based on
similar data set in GEO or user provided pilot data. http://www.poweratlas.org/
Page et al., BMC Bioinformatics. 2006 Feb 22;7:84.
Gadbury GL, et al. Stat Meth Med Res 13: 325-338, 2004.
EDR is the proportion of genes that are truly differentially expressed that will be called significant at the significance level of 0.05. This is analogous to average power. TP is the proportion of genes called 'significant' that are truly differentially expressed. TN is the proportion of genes called 'not significant' that are truly not differentially expressed between the conditions.
Blocking
• Some identified uninteresting but varying factors can be controlled through blocking.
• Completely randomized design
• Complete block design
• Incompletely block designs
Completely Randomized Design
There is no blocking
Example
Compare two hormone treatments (trt and control) using 6 Arabidopsis plants.
1 2 3 4 5 6
Hormone trt: (1,3,4); (1,2,6)Control : (2,5,6); (3,4,5)
Note: Design with one-color microarrays is often completely randomized design.
Complete Block Design
There is blocking and the block size is equal to the number of treatments.
Example:
Compare two hormone treatments (trt and control) using 6 Arabidopsis plants. For some reason plant 1 and 2 are taller, plant 5 and 6 are thinner.
Randomization within blocks
1 2 3 4 5 6Hormone treatment: (1,4,5); (1,3,6)Control : (2,3,6); (2,4,5)
Incomplete Block Design
There is blocking and the block size is smaller than the number of treatments.
Example:
Compare three hormone treatments (hormone level 1, hormone level 2, and control) using 6 Arabidopsis plants. For some reason plant 1 and 2 are taller, plant 5 and 6 are thinner.
Randomization within blocks
1 2 3 4 5 6Hormone level1: (1,4); (2,4)Hormone level2: (2,5); (1,6)Control : (3,6); (3.5)
Designs for two-color microarrays
• Experimental design for one color microarray array is straight forward with out the required blocking at the array level.
• For two color microarrays, we need to pair the samples and label then with Cy3 and Cy5 for hybridization to one array.
• How do we pair?
Blocking in microarray experiments• In two-color platform, the arrays vary due to the loading of DNA quantity, spot morphology, hybridization condition, scanning setting…It is treated as a block of size 2, the samples are compared within each array.
• If there are two lots of chips in the experiment and there is large variation across chip lots. We can treat chip lots as a blocking factor.
Example: compare three samples: A, B, C
Block (array) 1 Block (array) 2 Block (array) 3
A
B
B
C
C
A
A1 B1Cy5 Cy3
A1 B1
Dye-swap
• Dye swappingSeparates the dye effect from the sample effect.Results are less sensitive to the data transformation processes
Kerr and Churchill(2001) Biostatistics, 2:183-201.
A1 B1
Replicated Dye-Swap
A2 B2
Reference designAll samples are compared to a single reference sample.
Single reference
A B C
R
Double reference
A B C
R
Example Microarray Experiment
comparing 3 treatments
Note: Arrows are used to represent a two-color microarray. The arrow head represents the red channel and the tail represents the green channel. T1 to T3, the three treatments; M1 to M15, the 15 plants; R, reference sample.
T1 T2 T3
M1 M2M3 M4M5 M6 M7 M8M9 M10 M11 M12M13 M14 M15
R R R R R R R R R R R R R R R
Loop Design
A
C
D B
Basic types of loop designs
A B
C
Single loop Double loop
A B
C
Complex loop designs
A1 B1
C1 F1
D1 E1
Pera I DBAHigh fat
Low fat
A B C
FE D
A2 B2
C2 F2
D2 E2
Two-factor factorial design at the high level
Reference or Loop
• Reference design: It has lower overall efficiency, but it has equal efficiency for all sample comparisons, and is easy to extend.
• Loop design: higher overall efficiency, harder to extend, unequal efficiency for all comparisons,
Comparing the efficiency of reference versus loop designs
Yang and Speed (2002) Nat. Rev. Genet 3:579
A B
C
Loop:
Reference:
A
R
A
R
B
R
B
R
A1 B1
C1
Connected Loop
A1 B2
C1 C2
A2 B1
Regular Loop
A1
R1
A2
R2
B1
R3
B2
R4
Classical Reference
A1
R1
A2
R1
B1
R1
B2
R1 Common Reference
Alternative Designs
Experiments With Two Groups
Reference with dye-swap
A1 B1
R R R R
A2 B2
A1 B1
R R R R
A2 B2A3 B3
R R R R
A4 B4Reference with alternating dye
Is alternating dye really necessary ?
Experiments with Two Groups (Direct comparison)
“Complete” block with dye-swap
“Complete” block design; Row-column design
A1 A2
B1 B2
A3 A4
B3 B4
A1 A5A2 A6A3 A7A4 A8
B1 B5B2 B6B3 B7B4 B8
Experiments with Three Groups
Reference with dye-swap
Reference with alternating dye
A1 B1
R R R R
A2 B2 C1
R R
C2
A1 B1
R R R R
A2 B2A3 B3
R R R R
A4 B4 C1
R R
C2 C3
R R
C4
Experiments With Three Groups
A1 B1
C1
A3 B3
C3
A4 B4
C4
A2 B2
C2
A1 B2
C1 C2
A2 B1A1 B1
B2 C2
C1
A2
=
Multiple connected loops
Multiple regular loops = Incomplete block structure
Pooling mice
22 1Mpool k
σσ α=
k : pool size
α : constant for the effect of pooling. 0 < α < 1α = 0, pooling has no effect.α = 1, pooling has maximum effect.
Pooling can reduce the mouse variance but not the technical variances. The mouse variance will be replaced by:
Note: More pools with fewer arrays per pool is better than fewer pools with more arrays per pools when fixed number of arrays and mice are to be used.
Power Increase to Detect 2 fold change by Pooling
2 pools / trt4 pools / trt
6 pools / trt8 pools / trt
2 mice / trt4 mice / trt
6 mice / trt8 mice / trt
( Pool size k = 3, α = 1 )
Significance level:0.05 after Bonferroni correction
• Use of factorial experiments instead of the one-factor-at-a-time methods.
• Orthogonality: each level of one factor has all levels of the other factor.
Example: To compare two treatments (T1, T2) and two strains (S1, S2)
T2S2T1S2S2
T2S1T1S1S1
T2T1
Design Principle Continues
ReferencesChurchill GA. Fundamentals of experimental design for cDNA microarrays. Nature Genet. 32: 490-495, 2002.
Cui X and Churchill GA. How many mice and how many arrays? Replication in mouse cDNA microarray experiments, in "Methods of Microarray Data Analysis III", Edited by KF Johnson and SM Lin. Kluwer Academic Publishers, Norwell, MA. pp 139-154, 2003.
Gadbury GL, et al. Power and sample size estimation in high dimensional biology. Stat Meth Med Res 13: 325-338, 2004.
Kerr MK. Design considerations for efficient and effective microarray studies. Biometrics 59: 822-828, 2003.
Kerr MK and Churchill GA. Statistical design and the analysis of gene expression microarray data. Genet. Res. 77: 123-128, 2001.
Kuehl RO. Design of experiments: statistical principles of research design and analysis, 2nd ed., 1994, (Brooks/cole) Duxbry Press, Pacific Grove, CA.
Page GP et al. The PowerAtlas: a power and sample size atlas for microarray experimental design and research. BMC Bioinformatics. 2006 Feb 22;7:84.
Rosa GJM, et al. Reassessing design and analysis of two-colour microarray experiments using mixed effects models. Comp. Funct. Genomics 6: 123-131. 2005.
Wit E, et al. Near-optimal designs for dual channel microarray studies. Appl. Statist. 54: 817-830, 2005.
Yand YH and Speed T. Design issues for cDNA microarray experiments. Nat. Rev. Genet. 3: 570-588, 2002.