Empirical Bayes Analysis of Variance Component Models
for Microarray Data
S. Feng,1 R. Wolfinger,2 T. Chu,2 G. Gibson,3 L. McGraw4
1. Department of Statistics, North Carolina State University; 2. SAS Institute, Cary, NC 27513; 3. Department of Genetics, North Carolina State University; 4. Department of Genetics and Development, Cornell University.
Microarray: Genome-wide gene expression
1. Originally just an engineering term: millions of small electrodes arrayed on a (silicon) slide.
2. Introduced to genetics/genomics in 1996: thousands of DNA sequences arrayed on a glass slide, so that genome-wide gene expression profiles can be investigated simultaneously.
[Figure: bar chart of the number of scientific papers (ISI) with title including "Microarray", by year, 1996 to 2003; vertical axis from 0 to 4000.]
At JSM 2004: more than 30 sections and more than 100 statistics papers/posters.
Two major types of Microarray
Nature cell biology Aug. 2001 v3 (8)
More refs: Nature Review of Genetics, May. 2004 v5 (5)
Oligonucleotide array; cDNA array.
Oligonucleotide array: PM/MM probe pairs, 16-20 probe pairs per probe set, 1 probe set per gene.
This is “per gene”. The PM/MM effects are considered as fixed effects.
Chu, T., Weir, B., and Wolfinger, R. (02, 04).
Lipshutz et al; 1999; Nature Genetics, 21(1): 20-24.
Oligonucleotide Array
Statistical problems in Microarray?
- Planning of experiments: design, sample sizes.
- Quality control: variation in RNA samples, variation among arrays.
- Background subtraction.
- Normalization.
- Significance analysis; supervised vs. non-supervised.
- Multiple testing: P-values.
- Variable selection.
- Discrimination.
- Clustering: of samples, of genes.
- Time course experiments.
- Clinical trials: Merck, GSK …
- Gene networks, pathways.
Terry Speed homepage: www.stat.berkeley.edu/users/terry/zarray/Html
Statistical Computing
Significance Analysis (challenge)
Question: For the contrasts of interest (e.g., Trt1 vs. Trt2), which genes are differentially (significantly) expressed?
[Data layout: rows Gene 1, Gene 2, Gene 3, …, Gene n; columns Trt 1, Trt 2, …, Trt k.]
Oligonucleotide array (supervised)
An example of a common problem in genome-wide studies: the “large p, small n” problem.
Small n: number of replications; low statistical power.
Large p: number of features (genes, probes, biomarkers, …); multiple-testing problems? Computation …?
Data from McGraw Lab., Cornell Univ.
Research interest: to investigate the effects of different male Drosophila genotypes (5 lines) on post-mating gene expression in female flies.
[Design diagram: males from 5 lines (line1 to line5) are crossed (X) with females, giving 5 Trt. Female flies are killed and mRNA is prepared: preps (1), (2), (3), (4), …, (9), (10) (random effect 1). Chips: Chip1, Chip2, …, Chip19, Chip20 (random effect 2). ~15000 genes; for each gene, PM/MM probes (fixed effect). (… random effect 3).]
Data from McGraw Lab., Cornell
Indices i (treatment): Trt1, Trt2, Trt3, Trt4, Trt5
Indices j (prep): Prep1, Prep2
Indices k (chip): Chip1, Chip2
Indices l (probe): 1, 2, 3, …, 19, 20
Genes g: Gene 1, Gene 2, …, Gene 15000
Total: 5x2x2x20 = 400 data points for each gene g
Observation: ygijkl. Variance components: σgij (prep), σgijk (chip), σgijkl (residual).
Linear Mixed Model: (for each gene g)
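$$Y_{gijkl} = G_g + (G \times \mathrm{trt})_{gi} + (G \times \mathrm{Probe})_{gl} + (G \times \mathrm{trt} \times \mathrm{prep})_{gij} + (G \times \mathrm{trt} \times \mathrm{prep} \times \mathrm{chip})_{gijk} + \gamma_{gijkl}$$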
Significantly Expressed Genes: by SGA
Number of significantly expressed genes for the 10 possible contrasts:

Contrast         Bonferroni (.05)   F.D.R. (.05)
Trt1 vs. Trt2           0                0
Trt1 vs. Trt3           0                0
Trt1 vs. Trt4           0                0
Trt1 vs. Trt5           0                0
Trt2 vs. Trt3           0                0
Trt2 vs. Trt4           0                0
Trt2 vs. Trt5           0                0
Trt3 vs. Trt4           0                0
Trt3 vs. Trt5           0                0
Trt4 vs. Trt5           0                0
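The two multiple-testing criteria used in the table can be sketched as follows. This is a minimal illustration only: the slides do not say which FDR procedure was used, so the Benjamini-Hochberg step-up rule is an assumption here, and the p-values are hypothetical, not taken from the Cornell data.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0 where p <= alpha / m (controls the family-wise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule (assumed FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * alpha / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values, for illustration only.
pvals = [0.001, 0.012, 0.014, 0.035, 0.2]
n_bonf = sum(bonferroni(pvals))
n_bh = sum(benjamini_hochberg(pvals))
print(n_bonf, n_bh)
```

With 15000x10 tests the Bonferroni cutoff becomes extremely small, which is one reason a low-powered single test rarely survives it.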
Possible Reasons:
1. … lower level analysis …
2. Large p: multiple testing problems (15000x10 tests); family-wise error rate (FWR) vs. false discovery rate (FDR)? (Not addressed in this study.)
3. Small n: low power in each single test: poorly estimated variance components (VC); small d.f. in testing …
In this study, we try to improve the power of each single test …
Our Idea: Taking advantage from “large p”
Perform SGA (by mixed model) and obtain VC estimates for every gene:
Gene 1 (VC1, VC2, VC3); Gene 2 (VC1, VC2, VC3); … …; Gene 15000 (VC1, VC2, VC3).
[Figure: density plot of the estimated VCs; x axis from 0 to 0.25, density axis from 0 to 50. Black: 15000 estimated VC1; red: 15000 estimated VC2; blue: 15000 estimated VC3.]
This plot does contain useful “global” information on each VC (range, “HDR”, …).
Note: this is not the “density” of each VC …
The “global” information is taken as the “prior” (SGA serves as a pilot analysis).
Our Empirical Bayes Approach
A 7-step algorithm:
1. Apply SGA to get 15000 VC estimates;
2. Transform to the “ANOVA Components (AC)”;
3. Apply Jeffreys' prior (non-informative);
4. Fit an Inverted Gamma (IG) to each AC (prior density);
5. Derive the posterior density (and the posterior estimate) of each AC;
6. Transform the posterior estimate of AC back to VC (reverse step 2);
7. Mixed model analysis: fix the VC values at the posterior estimates of VC (the EB estimator of VC), and approximate the distribution of the test statistic by the standard normal distribution.
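The shrinkage idea behind steps 4 and 5 can be sketched in a simplified setting. This is not the authors' algorithm: it skips the AC transform and the Jeffreys prior step, handles a single variance per gene with a known number of degrees of freedom, and fits the IG prior by a method of moments. All function names and simulation settings are hypothetical.

```python
import random
import statistics


def fit_inverse_gamma_moments(s2, df):
    """Method-of-moments fit of an IG(a, b) prior to noisy variance estimates.

    Assumes s2_g | sigma2_g ~ sigma2_g * chi2(df) / df and sigma2_g ~ IG(a, b).
    """
    m = statistics.fmean(s2)
    r = statistics.variance(s2) / m ** 2  # squared coefficient of variation
    if r <= 2.0 / df:
        raise ValueError("estimates vary no more than sampling noise alone")
    a = (2.0 * r + 1.0 - 2.0 / df) / (r - 2.0 / df)
    b = m * (a - 1.0)
    return a, b


def posterior_mean(s2_g, df, a, b):
    """Posterior mean of sigma2_g; the posterior is IG(a + df/2, b + df*s2_g/2)."""
    return (b + 0.5 * df * s2_g) / (a + 0.5 * df - 1.0)


# Simulated check with hypothetical settings (not the Cornell data).
rng = random.Random(0)
df, a_true, b_true, n_genes = 5, 10.0, 0.09, 5000
sigma2 = [b_true / rng.gammavariate(a_true, 1.0) for _ in range(n_genes)]
s2 = [v * rng.gammavariate(df / 2.0, 2.0) / df for v in sigma2]

a_hat, b_hat = fit_inverse_gamma_moments(s2, df)
eb = [posterior_mean(x, df, a_hat, b_hat) for x in s2]

mse_raw = statistics.fmean([(x - v) ** 2 for x, v in zip(s2, sigma2)])
mse_eb = statistics.fmean([(x - v) ** 2 for x, v in zip(eb, sigma2)])
print(mse_eb < mse_raw)
```

Each gene's estimate is pulled toward the prior fitted from all genes, which is how “large p” helps when n is small.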
Real Data Example: Cornell Data
Number of significant genes

                 Bonferroni (.05)     False Discovery Rate (.05)
Contrast         S.G.A.    E.B.       S.G.A.    E.B.
Trt1 vs. Trt2      0         5          0        38
Trt1 vs. Trt3      0        24          0        91
Trt1 vs. Trt4      0         9          0        28
Trt1 vs. Trt5      0         7          0        48
Trt2 vs. Trt3      0        33          0       276
Trt2 vs. Trt4      0        22          0       209
Trt2 vs. Trt5      0        37          0       183
Trt3 vs. Trt4      0        50          0       184
Trt3 vs. Trt5      0        48          0       190
Trt4 vs. Trt5      0         9          0        68
Significance Test:
Simulation Studies
Design: the structure mimics the true data. Parameters are set to the values estimated from the true data set. For the 3 VC: σgij=0.01, σgijk=0.015, σgijkl=0.072.
5500 genes are simulated, of which 500 are “significantly expressed” and 5000 are “non-significantly expressed”, with Trt means:

                              Trt1   Trt2   Trt3   Trt4   Trt5
Significantly expressed        0     0.15   0.30   0.45   0.60
Non-significantly expressed    0     0      0      0      0
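A minimal sketch of how one simulated gene could be generated under this design. Several points are assumptions, not stated in the slides: the three VC values are treated as variances, the random effects are taken as normal, and the gene and probe effects are set to zero. The function name `simulate_gene` is hypothetical.

```python
import random

TRT_MEANS_SIG = [0.0, 0.15, 0.30, 0.45, 0.60]    # "significantly expressed" gene
TRT_MEANS_NULL = [0.0, 0.0, 0.0, 0.0, 0.0]       # "non-significantly expressed" gene
VC_PREP, VC_CHIP, VC_RESID = 0.01, 0.015, 0.072  # assumed to be variances


def simulate_gene(trt_means, rng, n_prep=2, n_chip=2, n_probe=20):
    """Return (trt, prep, chip, probe, y) records for one gene.

    Assumptions: normal random effects; gene and probe effects are zero.
    """
    records = []
    for trt, mu in enumerate(trt_means):
        for prep in range(n_prep):
            prep_eff = rng.gauss(0.0, VC_PREP ** 0.5)      # prep within treatment
            for chip in range(n_chip):
                chip_eff = rng.gauss(0.0, VC_CHIP ** 0.5)  # chip within prep
                for probe in range(n_probe):
                    resid = rng.gauss(0.0, VC_RESID ** 0.5)
                    y = mu + prep_eff + chip_eff + resid
                    records.append((trt, prep, chip, probe, y))
    return records


rng = random.Random(1)
gene = simulate_gene(TRT_MEANS_SIG, rng)
print(len(gene))  # 5 x 2 x 2 x 20 observations
```

Calling `simulate_gene` 500 times with `TRT_MEANS_SIG` and 5000 times with `TRT_MEANS_NULL` would give a data set of the size described above.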
Simulation Results (1)
EB estimator vs. REML estimator: bias, variance and MSE

            VC1 (0.01)              VC2 (0.015)             VC3 (0.072)
            SGA        EB           SGA        EB           SGA        EB
Bias        1.3x10^-3  4.3x10^-4    9.2x10^-4  1.4x10^-4    4.4x10^-5  4.2x10^-5
Variance    1.6x10^-4  4.6x10^-5    6.9x10^-5  2.1x10^-5    3.2x10^-5  8.0x10^-6
MSE         1.6x10^-4  4.6x10^-5    6.9x10^-5  2.1x10^-5    3.2x10^-5  8.0x10^-6
The bias, variance and MSE of EB are only fractions of those of SGA
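The MSE row coincides with the variance row at the reported precision. A short check using the standard identity MSE = bias^2 + variance, with the values copied from the table, shows that the squared bias is a very small share of the MSE in every column:

```python
# (bias, variance) pairs copied from the table: SGA and EB for VC1, VC2, VC3.
table = {
    ("VC1", "SGA"): (1.3e-3, 1.6e-4),
    ("VC1", "EB"): (4.3e-4, 4.6e-5),
    ("VC2", "SGA"): (9.2e-4, 6.9e-5),
    ("VC2", "EB"): (1.4e-4, 2.1e-5),
    ("VC3", "SGA"): (4.4e-5, 3.2e-5),
    ("VC3", "EB"): (4.2e-5, 8.0e-6),
}

bias2_share = {}
for key, (bias, variance) in table.items():
    mse = bias ** 2 + variance          # MSE = bias^2 + variance
    bias2_share[key] = bias ** 2 / mse  # fraction of the MSE due to squared bias
    print(key, f"MSE={mse:.2e}", f"bias^2 share={bias2_share[key]:.2%}")
```

So the gain of EB over SGA in MSE comes almost entirely from the reduction in variance.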
Simulation Results (2):
[Figure: density curves of the null distribution of the t test statistic; x axis from -6 to 6.]
The null distribution of the t test statistic:
1. SGA (red, expected to be a t distribution with df=5);
2. EB with df=30 (blue);
3. EB with df=1000 (green);
4. Truth (black, expected to be the standard normal distribution).
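Why the degrees of freedom matter can be illustrated with a small Monte Carlo sketch. It is a generic illustration, not the authors' simulation: a t statistic with df=5 has heavier tails than the standard normal, so the normal 5% cutoff of 1.96 is exceeded far more often than 5% of the time under the null.

```python
import random

rng = random.Random(2)
n_sim, df, cutoff = 20000, 5, 1.96  # 1.96: two-sided 5% point of N(0, 1)

exceed_t = 0
exceed_z = 0
for _ in range(n_sim):
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    t = z / (chi2 / df) ** 0.5              # Student t with df degrees of freedom
    exceed_t += abs(t) > cutoff
    exceed_z += abs(z) > cutoff

frac_t = exceed_t / n_sim  # tail mass of t(5) beyond the normal cutoff
frac_z = exceed_z / n_sim  # tail mass of N(0, 1) beyond the same cutoff
print(frac_t, frac_z)
```

A test based on t with df=5 therefore needs a larger critical value to keep its size, which costs power; a better-estimated variance moves the reference distribution closer to the normal.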
Simulation Results (3):
Test Size and Power Calculation:

Size
Contrast (Trt. Diff)   SGA      EB(30)   EB(1000)   Truth
T1 vs. T2 (0.15)       0.047    0.044    0.051      0.044
T1 vs. T3 (0.30)       0.051    0.051    0.060      0.053
T1 vs. T4 (0.45)       0.050    0.051    0.057      0.047
T1 vs. T5 (0.60)       0.054    0.047    0.055      0.048
T2 vs. T3 (0.15)       0.051    0.054    0.063      0.054
T2 vs. T4 (0.30)       0.048    0.051    0.060      0.049
T2 vs. T5 (0.45)       0.050    0.048    0.057      0.052
T3 vs. T4 (0.15)       0.050    0.051    0.060      0.051
T3 vs. T5 (0.30)       0.049    0.047    0.055      0.050
T4 vs. T5 (0.15)       0.048    0.046    0.054      0.049
Mean                   0.0498   0.0490   0.0572     0.0497

Power (% Power, with Truth = 100%)
Contrast (Trt. Diff)   SGA             EB(30)          EB(1000)         Truth
T1 vs. T2 (0.15)       0.116 (66.7%)   0.172 (98.9%)   0.176 (101.1%)   0.174 (100%)
T1 vs. T3 (0.30)       0.418 (70.1%)   0.558 (93.6%)   0.596 (100%)     0.596 (100%)
T1 vs. T4 (0.45)       0.692 (76.5%)   0.870 (96.2%)   0.894 (98.9%)    0.904 (100%)
T1 vs. T5 (0.60)       0.890 (91.8%)   0.964 (99.4%)   0.972 (100.2%)   0.970 (100%)
T2 vs. T3 (0.15)       0.126 (67.7%)   0.164 (88.2%)   0.188 (101.1%)   0.186 (100%)
T2 vs. T4 (0.30)       0.388 (68.3%)   0.530 (93.3%)   0.554 (97.5%)    0.568 (100%)
T2 vs. T5 (0.45)       0.672 (83.4%)   0.772 (95.8%)   0.796 (98.6%)    0.806 (100%)
T3 vs. T4 (0.15)       0.158 (81.4%)   0.194 (100%)    0.202 (104.1%)   0.194 (100%)
T3 vs. T5 (0.30)       0.380 (78.2%)   0.446 (91.8%)   0.468 (96.3%)    0.486 (100%)
T4 vs. T5 (0.15)       0.120 (75.0%)   0.144 (90.0%)   0.156 (97.5%)    0.160 (100%)
Mean (% Power)         75.91%          94.72%          99.53%           100%
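The “% Power” entries are each method's power divided by the power of the Truth test for the same contrast. A short sketch reproducing the SGA column from the reported powers (values copied from the table; the variable names are illustrative):

```python
# Power columns copied from the table, one entry per contrast (T1 vs. T2 first).
power = {
    "SGA": [0.116, 0.418, 0.692, 0.890, 0.126, 0.388, 0.672, 0.158, 0.380, 0.120],
    "Truth": [0.174, 0.596, 0.904, 0.970, 0.186, 0.568, 0.806, 0.194, 0.486, 0.160],
}

# Relative power of SGA, in percent of the Truth power, per contrast.
pct_sga = [100.0 * s / t for s, t in zip(power["SGA"], power["Truth"])]
mean_pct_sga = sum(pct_sga) / len(pct_sga)

print([round(p, 1) for p in pct_sga])
print(round(mean_pct_sga, 2))
```

The same calculation applied to the EB(30) and EB(1000) columns gives their percentages.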
Discussion
Why does the EB estimator “beat” the REML estimator?
- The prior density contains “truth” information.
- However, the prior is estimated from the data: with “large p” the prior is likely to be good; with “small p” the prior may not be good …
- … for “small p”, the EB estimator can be biased, and a gain in MSE is not guaranteed!
Q: How to control (large p vs. small p)? (A controlling system for the EB method: determine the shrinkage process to get the maximum gain in MSE.)
Applications
Especially designed for microarrays:
- Microarray (cDNA, oligonucleotide);
- Proteomics.
Extension to general data sets (mixed model), if a controlling system is built (in the near future).