Empirical Bayes Analysis of Variance Component Models
for Microarray Data
S. Feng,1 R. Wolfinger,2 T. Chu,2 G. Gibson,3 L. McGraw4
1. Department of Statistics, North Carolina State University; 2. SAS Institute, Cary, NC 27513; 3. Department of Genetics, North Carolina State University; 4. Department of Genetics and Development, Cornell University.
Microarray: Genome-wide gene expression
1. Originally just a term in engineering: millions of small electrodes arrayed on a slide (silicon).
2. Introduced to genetics/genomics in 1996: thousands of DNA sequences arrayed on a glass slide, so that the genome-wide gene expression profiles can be investigated simultaneously.
[Figure: bar chart of the number of scientific papers (ISI) with "Microarray" in the title, by year, 1996 to 2003; vertical axis from 0 to 4000.]
In JSM 2004:
- More than 30 sections
- More than 100 stat. papers/posters
Two major types of Microarray
Nature Cell Biology, Aug. 2001, v3 (8)
More refs: Nature Reviews Genetics, May 2004, v5 (5)
- Oligonucleotide array
- cDNA array
Oligonucleotide Array
- PM / MM (perfect match / mismatch) probe pairs
- 16-20 probe pairs per probe set
- 1 probe set per gene
This is “per gene”. The PM/MM effects are considered as fixed effects.
Chu, T., Weir, B., and Wolfinger, R. (2002, 2004).
Lipshutz et al. (1999), Nature Genetics, 21(1): 20-24.
Statistical problems in Microarray?
- Multiple testing: p-values.
- Variable selection.
- Discrimination.
- Clustering: of samples, of genes.
- Time course experiments.
- Clinical trials: Merck, GSK ...
- Gene networks
- Pathways
Terry Speed homepage: www.stat.berkeley.edu/users/terry/zarray/Html
- Planning of experiments: design, sample sizes.
- Quality control: variation in RNA samples, variation among arrays.
- Background subtraction.
- Normalization.
- Significance analysis.
- Supervised vs. non-supervised.
Statistical Computing
Significance Analysis (challenge)
Question: For the contrasts of interest (e.g., Trt1 vs. Trt2), which genes are differentially (significantly) expressed?
[Diagram: data matrix with genes as rows (Gene 1, Gene 2, Gene 3, ..., Gene n) and treatments as columns (Trt 1, Trt 2, ..., Trt k).]
Oligonucleotide array (supervised)
An example of a common problem in genome-wide studies: the “large p, small n” problem.
- Small n: number of replications. Consequence: low statistical power.
- Large p: number of features (genes, probes, bio-markers ...). Consequences: multiple-testing problems? Computation ...?
Data from McGraw Lab., Cornell Univ.
Research interest: to investigate the effects of different male Drosophila genotypes (5 lines) on post-mating gene expression in female flies.
[Diagram of the experimental design:]
- Male: line1, line2, line3, line4, line5, each crossed (X) with Female: 5 Trt.
- Female flies killed, mRNA prepared: preps (1), (2), (3), (4), ..., (9), (10) (random effect 1).
- Chips: Chip1, Chip2, ..., Chip19, Chip20 (random effect 2).
- ~15000 genes; for each gene: PM/MM (fixed effect).
- (... random effect 3)
Data from McGraw Lab., Cornell
Indices i: Trt1, Trt2, Trt3, Trt4, Trt5
Indices j: Prep1, Prep2
Indices k: Chip1, Chip2
Indices l: 1, 2, 3, ..., 19, 20
Genes g: Gene 1, Gene 2, ..., Gene 15000
Total: 5 x 2 x 2 x 20 = 400 points for each gene g
Nested levels of variation leading to each observation:
  σ_{gij} -> σ_{gijk} -> σ_{gijkl} -> y_{gijkl}
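The count of 400 points per gene can be checked by enumerating the index combinations. The sketch below is only an illustration: the label strings are hypothetical names, and it assumes a fully balanced layout (every treatment has 2 preps, every prep has 2 chips, every chip carries 20 values indexed by l).

```python
from itertools import product

# Index ranges as listed on the slide; the label strings are illustrative only.
treatments = [f"Trt{i}" for i in range(1, 6)]   # i: 5 treatments
preps = [f"Prep{j}" for j in range(1, 3)]       # j: 2 mRNA preps per treatment
chips = [f"Chip{k}" for k in range(1, 3)]       # k: 2 chips per prep
probes = list(range(1, 21))                     # l: 1, ..., 20

# One tuple per observation y_{gijkl} of a single gene g (balanced layout assumed).
cells = list(product(treatments, preps, chips, probes))

n_points = len(cells)                           # data points per gene
n_preps = len(treatments) * len(preps)          # distinct mRNA preps
n_chips = n_preps * len(chips)                  # distinct chips

print(n_points, n_preps, n_chips)
```

Under these assumptions the totals are 400 points, 10 preps and 20 chips, which agrees with the preps numbered (1) to (10) and Chip1 to Chip20 in the design diagram.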
Linear Mixed Model (for each gene g):
Y_{gijkl} = G_g + (G*trt)_{gi} + (G*Probe)_{gl} + (G*trt*prep)_{gij} + (G*trt*prep*chip)_{gijk} + γ_{gijkl}.
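To make the structure of the model concrete, here is a minimal sketch that draws the 400 observations of one gene. It rests on assumptions not stated on this slide: the three random terms are taken to be independent zero-mean Gaussians, the treatment means and variance values are borrowed from the simulation design later in the talk and treated as variances, and the probe effects are made-up numbers. The function name `simulate_gene` is hypothetical.

```python
import random

def simulate_gene(trt_means, probe_effects, var_prep, var_chip, var_resid,
                  n_prep=2, n_chip=2, gene_mean=0.0, rng=None):
    """Draw y_{gijkl} for one gene from the nested model (Gaussian effects assumed)."""
    rng = rng or random.Random()
    rows = []
    for i, trt in enumerate(trt_means):                  # fixed: (G*trt)_{gi}
        for j in range(n_prep):
            prep_eff = rng.gauss(0.0, var_prep ** 0.5)   # random: (G*trt*prep)_{gij}
            for k in range(n_chip):
                chip_eff = rng.gauss(0.0, var_chip ** 0.5)   # random: (G*trt*prep*chip)_{gijk}
                for l, probe in enumerate(probe_effects):    # fixed: (G*Probe)_{gl}
                    resid = rng.gauss(0.0, var_resid ** 0.5) # residual: gamma_{gijkl}
                    y = gene_mean + trt + probe + prep_eff + chip_eff + resid
                    rows.append((i, j, k, l, y))
    return rows

rng = random.Random(1)
probe_effects = [rng.gauss(0.0, 0.5) for _ in range(20)]   # hypothetical probe effects
data = simulate_gene([0.0, 0.15, 0.30, 0.45, 0.60], probe_effects,
                     var_prep=0.01, var_chip=0.015, var_resid=0.072, rng=rng)
print(len(data))
```

The prep effect is drawn once per (i, j) and shared by both chips of that prep, and the chip effect is drawn once per (i, j, k) and shared by all 20 values on that chip; this sharing is what makes the design nested.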
Significantly Expressed Genes: by SGA
Number of significantly expressed genes for the 10 possible contrasts:

Contrast         Bonferroni (.05)   F.D.R. (.05)
Trt1 vs. Trt2    0                  0
Trt1 vs. Trt3    0                  0
Trt1 vs. Trt4    0                  0
Trt1 vs. Trt5    0                  0
Trt2 vs. Trt3    0                  0
Trt2 vs. Trt4    0                  0
Trt2 vs. Trt5    0                  0
Trt3 vs. Trt4    0                  0
Trt3 vs. Trt5    0                  0
Trt4 vs. Trt5    0                  0
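For readers unfamiliar with the two criteria in the table, the sketch below counts rejections under a Bonferroni threshold and under a false discovery rate rule. The slides do not say which FDR procedure was used; the Benjamini-Hochberg step-up rule is assumed here, and the p-values are made up for illustration.

```python
def bonferroni_count(pvalues, alpha=0.05):
    """Number of tests with p <= alpha / m (family-wise error control)."""
    m = len(pvalues)
    return sum(p <= alpha / m for p in pvalues)

def benjamini_hochberg_count(pvalues, alpha=0.05):
    """Number of rejections under the Benjamini-Hochberg step-up rule:
    reject the r smallest p-values, where r is the largest rank with
    p_(r) <= alpha * r / m."""
    m = len(pvalues)
    n_reject = 0
    for rank, p in enumerate(sorted(pvalues), start=1):
        if p <= alpha * rank / m:
            n_reject = rank
    return n_reject

# Made-up p-values, not taken from the study.
pvals = [0.0001, 0.004, 0.015, 0.024, 0.2, 0.5, 0.7, 0.9]
print(bonferroni_count(pvals), benjamini_hochberg_count(pvals))
```

With these eight made-up p-values Bonferroni rejects 2 tests and Benjamini-Hochberg rejects 4. With thousands of genes the Bonferroni threshold becomes very small, so low-powered single-gene tests can easily yield no rejections at all.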
Possible Reasons:
1. ... lower level analysis ...
2. Large p: multiple testing problems (15000 x 10 tests). Family-wise error rate (FWR) vs. FDR? (Not addressed in this study.)
3. Small n: low power in each single test:
   - poorly estimated VC;
   - small d.f. in testing ...
   This study tries to improve the power of each single test ...
Our Idea: Taking advantage of “large p”
Perform SGA (by mixed model) and obtain the VC estimates:
  Gene 1 (VC1, VC2, VC3);
  Gene 2 (VC1, VC2, VC3);
  ...
  Gene 15000 (VC1, VC2, VC3).
[Figure: density plot of the estimated VCs; x axis from 0 to 0.25, density axis from 0 to 50. Black: 15000 estimated VC1; red: 15000 estimated VC2; blue: 15000 estimated VC3.]
Note: this is not the “density” of each VC ...
This plot does contain useful “global” information on each VC (range, “HDR” ...).
The “global” information is taken as the “prior” (SGA serves as a pilot analysis).
Our Empirical Bayes Approach
A 7-step algorithm:
1. Apply SGA to get 15000 VC estimates;
2. Transform to the “ANOVA Components (AC)”;
3. Apply the Jeffreys prior (non-informative);
4. Fit an Inverted Gamma (IG) to each AC (prior density);
-----
5. Derive the posterior density (and the posterior estimate) of each AC;
6. Transform the posterior estimate of AC back to VC (reverse step 2);
7. Mixed model analysis: fix the VC values at the posterior estimates of VC (the EB estimator of VC), and approximate the null distribution of the test statistic by the standard normal distribution.
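The core of steps 4 and 5 can be illustrated with a deliberately simplified, single-variance sketch. It is not the authors' implementation: it skips the VC-to-AC transform (steps 2 and 6) and the Jeffreys prior (step 3), fits the inverted gamma by the method of moments (the slides do not say how the fit is done), and assumes each raw estimate s2 satisfies df*s2/sigma2 ~ chi-square(df). All names and numbers are hypothetical.

```python
import random
from statistics import mean, variance

def fit_inverse_gamma_moments(values):
    """Method-of-moments fit of IG(shape a, scale b), using
    mean = b / (a - 1) and variance = mean^2 / (a - 2)."""
    m, v = mean(values), variance(values)
    a = m * m / v + 2.0
    b = m * (a - 1.0)
    return a, b

def posterior_mean(s2, df, a, b):
    """Conjugate update: if df*s2/sigma2 ~ chi-square(df) and sigma2 ~ IG(a, b),
    then sigma2 | s2 ~ IG(a + df/2, b + df*s2/2); return the posterior mean."""
    return (b + 0.5 * df * s2) / (a + 0.5 * df - 1.0)

rng = random.Random(0)
df = 5          # small d.f. per gene (hypothetical choice)
n_genes = 5000  # "large p"

# Hypothetical true variances: sigma2 = 1 / Gamma(shape 6, scale 1/0.05), i.e. IG(6, 0.05).
true_var = [1.0 / rng.gammavariate(6.0, 1.0 / 0.05) for _ in range(n_genes)]
# Raw per-gene estimates: sigma2 * chi-square(df) / df, chi-square(df) = Gamma(df/2, scale 2).
raw = [s2 * rng.gammavariate(df / 2.0, 2.0) / df for s2 in true_var]

a_hat, b_hat = fit_inverse_gamma_moments(raw)   # prior estimated from all genes
eb = [posterior_mean(s2, df, a_hat, b_hat) for s2 in raw]

def mse(estimates):
    return mean([(e - t) ** 2 for e, t in zip(estimates, true_var)])

print(mse(raw), mse(eb))
```

Each EB estimate is a weighted average of the gene's own estimate and a value determined by the fitted prior, so noisy per-gene estimates are pulled toward the bulk of the estimates across genes. In this toy setting that shrinkage lowers the mean squared error relative to the raw estimates.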
Real Data Example: Cornell Data
Significance test: number of significant genes

                 Bonferroni (.05)     False Discovery Rate (.05)
Contrast         S.G.A.   E.B.        S.G.A.   E.B.
Trt1 vs. Trt2    0        5           0        38
Trt1 vs. Trt3    0        24          0        91
Trt1 vs. Trt4    0        9           0        28
Trt1 vs. Trt5    0        7           0        48
Trt2 vs. Trt3    0        33          0        276
Trt2 vs. Trt4    0        22          0        209
Trt2 vs. Trt5    0        37          0        183
Trt3 vs. Trt4    0        50          0        184
Trt3 vs. Trt5    0        48          0        190
Trt4 vs. Trt5    0        9           0        68
Simulation Studies
Design: the structure mimics the true data. Parameters are set to the values estimated from the true data set. For the 3 VC: σ_{gij} = 0.01, σ_{gijk} = 0.015, σ_{gijkl} = 0.072.
5500 genes are simulated, among which 500 are “significantly expressed” and 5000 are “non-significantly expressed”, with Trt means:

                              Trt1   Trt2   Trt3   Trt4   Trt5
Significantly expressed       0      0.15   0.30   0.45   0.60
Non-significantly expressed   0      0      0      0      0
Simulation Results (1)
EB estimator vs. REML estimator: bias, variance and MSE

            VC1 (0.01)             VC2 (0.015)            VC3 (0.072)
            SGA        EB          SGA        EB          SGA        EB
Bias        1.3x10^-3  4.3x10^-4   9.2x10^-4  1.4x10^-4   4.4x10^-5  4.2x10^-5
Variance    1.6x10^-4  4.6x10^-5   6.9x10^-5  2.1x10^-5   3.2x10^-5  8.0x10^-6
MSE         1.6x10^-4  4.6x10^-5   6.9x10^-5  2.1x10^-5   3.2x10^-5  8.0x10^-6

The variance and MSE of EB are only fractions (roughly a quarter to a third) of those of SGA for all three VC. The bias of EB is also much smaller for VC1 and VC2, and about the same for VC3.
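The MSE row can be cross-checked against the other two rows through the identity MSE = variance + bias^2. The sketch below copies the rounded bias and variance values from the table and recomputes the MSE and the EB-to-SGA ratios; because the inputs are rounded to two significant digits, the results are approximate.

```python
# (bias, variance) pairs copied from the table; keys are (component, method).
reported = {
    ("VC1", "SGA"): (1.3e-3, 1.6e-4),
    ("VC1", "EB"): (4.3e-4, 4.6e-5),
    ("VC2", "SGA"): (9.2e-4, 6.9e-5),
    ("VC2", "EB"): (1.4e-4, 2.1e-5),
    ("VC3", "SGA"): (4.4e-5, 3.2e-5),
    ("VC3", "EB"): (4.2e-5, 8.0e-6),
}

# MSE = variance + bias^2 for every cell of the table.
mse = {key: var + bias ** 2 for key, (bias, var) in reported.items()}

# Ratio of EB MSE to SGA MSE for each variance component.
ratios = {vc: mse[(vc, "EB")] / mse[(vc, "SGA")] for vc in ("VC1", "VC2", "VC3")}

for vc, ratio in ratios.items():
    print(vc, f"{ratio:.2f}")
```

In every cell the squared bias is small compared with the variance, which is why the reported MSE coincides with the variance at two significant digits.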
Simulation Results (2):
[Figure: density plot; x axis from -6 to 6, density axis from 0.0 to 0.5.]
The null distribution of the test t statistics:
1. SGA (red, expected to be t distribution with df=5);
2. EB with df=30 (blue);
3. EB with df=1000 (green);
4. Truth (black, expected to be standard normal distribution).
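Why the reference distribution matters can be seen from a small Monte Carlo comparison of tail mass. The sketch below draws from the theoretical t distributions with 5 and 30 degrees of freedom and from the standard normal, and reports how often a draw exceeds 1.96 in absolute value. It uses the textbook distributions only, not the simulated test statistics of the study; the cutoff, seed and sample size are arbitrary choices.

```python
import random

rng = random.Random(42)

def draw_normal():
    return rng.gauss(0.0, 1.0)

def draw_t(df):
    # Student t = Z / sqrt(chi-square(df) / df); chi-square(df) = Gamma(shape df/2, scale 2).
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return z / (chi2 / df) ** 0.5

def tail_fraction(draw, cutoff=1.96, n=100_000):
    """Monte Carlo estimate of P(|X| > cutoff)."""
    return sum(abs(draw()) > cutoff for _ in range(n)) / n

frac_normal = tail_fraction(draw_normal)
frac_t30 = tail_fraction(lambda: draw_t(30))
frac_t5 = tail_fraction(lambda: draw_t(5))

print(frac_normal, frac_t30, frac_t5)
```

The t distribution with 5 degrees of freedom puts noticeably more mass beyond the normal cutoff than the other two, so a test that must use it needs a larger critical value and loses power for the same observed statistic.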
Simulation Results (3):
Test Size and Power Calculation:

Size
Contrast (Trt. Diff)   SGA      EB(30)   EB(1000)   Truth
T1 vs. T2 (0.15)       0.047    0.044    0.051      0.044
T1 vs. T3 (0.30)       0.051    0.051    0.060      0.053
T1 vs. T4 (0.45)       0.050    0.051    0.057      0.047
T1 vs. T5 (0.60)       0.054    0.047    0.055      0.048
T2 vs. T3 (0.15)       0.051    0.054    0.063      0.054
T2 vs. T4 (0.30)       0.048    0.051    0.060      0.049
T2 vs. T5 (0.45)       0.050    0.048    0.057      0.052
T3 vs. T4 (0.15)       0.050    0.051    0.060      0.051
T3 vs. T5 (0.30)       0.049    0.047    0.055      0.050
T4 vs. T5 (0.15)       0.048    0.046    0.054      0.049
Mean                   0.0498   0.0490   0.0572     0.0497

Power (% Power, relative to Truth = 100%)
Contrast (Trt. Diff)   SGA             EB(30)          EB(1000)         Truth
T1 vs. T2 (0.15)       0.116 (66.7%)   0.172 (98.9%)   0.176 (101.1%)   0.174 (100%)
T1 vs. T3 (0.30)       0.418 (70.1%)   0.558 (93.6%)   0.596 (100%)     0.596 (100%)
T1 vs. T4 (0.45)       0.692 (76.5%)   0.870 (96.2%)   0.894 (98.9%)    0.904 (100%)
T1 vs. T5 (0.60)       0.890 (91.8%)   0.964 (99.4%)   0.972 (100.2%)   0.970 (100%)
T2 vs. T3 (0.15)       0.126 (67.7%)   0.164 (88.2%)   0.188 (101.1%)   0.186 (100%)
T2 vs. T4 (0.30)       0.388 (68.3%)   0.530 (93.3%)   0.554 (97.5%)    0.568 (100%)
T2 vs. T5 (0.45)       0.672 (83.4%)   0.772 (95.8%)   0.796 (98.6%)    0.806 (100%)
T3 vs. T4 (0.15)       0.158 (81.4%)   0.194 (100%)    0.202 (104.1%)   0.194 (100%)
T3 vs. T5 (0.30)       0.380 (78.2%)   0.446 (91.8%)   0.468 (96.3%)    0.486 (100%)
T4 vs. T5 (0.15)       0.120 (75.0%)   0.144 (90.0%)   0.156 (97.5%)    0.160 (100%)
Mean                   75.91%          94.72%          99.53%           100%
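As a rough plausibility check on the “Truth” power column, the power of a two-sided z-test with known variance components can be computed directly from the simulation design. This is a sketch under stated assumptions, not the calculation used in the talk: the three VC values are treated as variances, the design is taken as balanced (2 preps per treatment, 2 chips per prep, 20 values per chip), and the fixed probe effects are assumed to cancel in a difference of treatment means.

```python
from statistics import NormalDist

# Values from the simulation design, interpreted here as variances (an assumption).
VC_PREP, VC_CHIP, VC_RESID = 0.01, 0.015, 0.072
N_PREP, N_CHIP, N_PROBE = 2, 2, 20

def contrast_se():
    """Standard error of a difference between two treatment means
    in the balanced nested design."""
    var_mean = (VC_PREP / N_PREP
                + VC_CHIP / (N_PREP * N_CHIP)
                + VC_RESID / (N_PREP * N_CHIP * N_PROBE))
    return (2.0 * var_mean) ** 0.5

def z_test_power(diff, alpha=0.05):
    """Power of the two-sided z-test when the true mean difference is `diff`."""
    nd = NormalDist()
    crit = nd.inv_cdf(1.0 - alpha / 2.0)
    shift = diff / contrast_se()
    return nd.cdf(shift - crit) + nd.cdf(-shift - crit)

for diff in (0.15, 0.30, 0.45, 0.60):
    print(diff, round(z_test_power(diff), 3))
```

Under these assumptions the computed power is roughly 0.19, 0.58, 0.90 and 0.99 for differences of 0.15, 0.30, 0.45 and 0.60, which is broadly in line with the simulated “Truth” column; exact agreement is not expected because the table values are Monte Carlo estimates.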
Discussion
Why does the EB estimator “beat” the REML estimator?
- The prior density contains “truth” information.
- However, the prior is estimated from the data: with “large p” the prior is likely to be good; with “small p” the prior may not be good ...
- ... for “small p”, the EB estimator can be biased! A gain in MSE is not guaranteed!
Q: How to control this (large p vs. small p)? (A controlling system for the EB method: determine the shrinkage process that gives the maximum gain in MSE.)
Applications
Especially designed for microarray data:
- Microarray (cDNA, oligonucleotide);
- Proteomics.
Extension to general data sets (mixed models), if a controlling system is built (in the near future).