empirical bayes analysis of variance component models for microarray data s. feng, 1 r.wolfinger, 2...

Empirical Bayes Analysis of Variance Component Models for Microarray Data

S. Feng,1 R. Wolfinger,2 T. Chu,2 G. Gibson,3 L. McGraw4

1. Department of Statistics, North Carolina State University; 2. SAS Institute, Cary, NC 27513; 3. Department of Genetics, North Carolina State University; 4. Department of Genetics and Development, Cornell University.

Microarray: Genome-wide gene expression

1. Originally just a term in engineering: millions of small electrodes arrayed on a (silicon) slide.

2. Introduced to genetics/genomics in 1996: thousands of DNA sequences arrayed on a glass slide, so that genome-wide gene expression profiles can be investigated simultaneously.

Scientific Papers (ISI) with Title Including "Microarray"

[Figure: bar chart of the number of papers per year, 1996 to 2003; vertical axis from 0 to 4000.]

At JSM 2004:
- More than 30 sections
- More than 100 statistical papers/posters

Two major types of Microarray

Nature cell biology Aug. 2001 v3 (8)

More refs: Nature Review of Genetics, May. 2004 v5 (5)

Oligonucleotide Array | cDNA Array

Oligonucleotide array: PM / MM (perfect match / mismatch) probe pairs; 16-20 probe pairs per probe set; 1 probe set per gene.

This is “per gene”. The PM/MM effects are considered as fixed effects.

Chu, T., Weir, B., and Wolfinger, R. (2002, 2004).

Lipshutz et al. (1999). Nature Genetics, 21(1): 20-24.

Oligonucleotide Array

Statistical problems in Microarray?

- Multiple testing: p-values.
- Variable selection.
- Discrimination.
- Clustering: of samples, of genes.
- Time course experiments.
- Clinical trials: Merck, GSK …
- Gene networks.
- Pathways.

Terry Speed homepage: www.stat.berkeley.edu/users/terry/zarray/Html

- Planning of experiments: design, sample sizes.
- Quality control: variability in RNA samples, variability among arrays.
- Background subtraction.
- Normalization.
- Significance analysis: supervised vs. unsupervised.

Statistical Computing

Significance Analysis (challenge)

Question: For the contrasts of interest (e.g., Trt1 vs. Trt2), which genes are differentially (significantly) expressed?

[Schematic: expression data matrix with one row per gene (Gene 1, Gene 2, Gene 3, …, Gene n) and one column per treatment (Trt 1, Trt 2, …, Trt k).]

Oligonucleotide array (supervised)

An example of a common problem in genome-wide studies: the “large p, small n” problem.

Small n: number of replications (low statistical power).

Large p: number of features (genes, probes, biomarkers …): multiple-testing problems? Computation …?

Data from the McGraw Lab, Cornell University.

Research interest: to investigate the effects of different male Drosophila genotypes (5 lines) on post-mating gene expression in female flies.

[Diagram: experimental design.]
- Males from line1, line2, line3, line4 and line5, each crossed (X) with females: 5 Trt.
- Female flies killed, mRNA prepared: preparations (1), (2), (3), (4), …, (9), (10) (random effect 1).
- Chip1, Chip2, …, Chip19, Chip20 (random effect 2).
- ~15000 genes; for each gene: PM/MM probes (fixed effect).
- (… random effect 3)

Data from McGraw Lab., Cornell

Index i: Trt1, Trt2, Trt3, Trt4, Trt5
Index j: Prep1, Prep2
Index k: Chip1, Chip2
Index l: 1, 2, 3, …, 19, 20
Gene g: Gene 1, Gene 2, …, Gene 15000

Total: 5 x 2 x 2 x 20 = 400 data points for each gene g.

[Diagram: nested hierarchy for gene g, with σgij at the preparation level, σgijk at the chip level, and σgijkl at the level of the individual observation ygijkl.]

Linear Mixed Model (for each gene g):

$$y_{gijkl} = G_g + (G \times trt)_{gi} + (G \times Probe)_{gl} + (G \times trt \times prep)_{gij} + (G \times trt \times prep \times chip)_{gijk} + \gamma_{gijkl}$$

Significantly Expressed Genes: by SGA

10 possible contrasts: number of significantly expressed genes

Contrast         Bonferroni (.05)   F.D.R. (.05)
Trt1 vs. Trt2    0                  0
Trt1 vs. Trt3    0                  0
Trt1 vs. Trt4    0                  0
Trt1 vs. Trt5    0                  0
Trt2 vs. Trt3    0                  0
Trt2 vs. Trt4    0                  0
Trt2 vs. Trt5    0                  0
Trt3 vs. Trt4    0                  0
Trt3 vs. Trt5    0                  0
Trt4 vs. Trt5    0                  0
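The two columns of the table correspond to two ways of thresholding a list of p-values. A minimal sketch follows, assuming the F.D.R. column uses the Benjamini-Hochberg step-up procedure (the slides do not say which FDR procedure was used); the function names are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject a test when its p-value is at most alpha / (number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up rule (assumed FDR procedure, see lead-in).

    Sort the p-values, find the largest rank k with p_(k) <= k * alpha / m,
    and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[idx] = True
    return reject


if __name__ == "__main__":
    # Made-up p-values, for illustration only (not from the Cornell data).
    demo = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(sum(bonferroni_reject(demo)), sum(benjamini_hochberg_reject(demo)))
```

With 15000 x 10 tests the Bonferroni cutoff is extremely small, which is one reason a low-power single-gene test can end up with no rejections at all.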

Possible Reasons:

1. … lower level analysis …

2. Large p: multiple testing problems (15000 x 10 tests). FWR vs. FDR? (Not addressed in this study.)

3. Small n: low power in each single test:
- poorly estimated VC;
- small d.f. in testing …

In this study, we try to improve the power of each single test …

Our Idea: Taking advantage from “large p”

Perform SGA (by mixed model) and obtain the VC estimates:
Gene 1 (VC1, VC2, VC3);
Gene 2 (VC1, VC2, VC3);
… …
Gene 15000 (VC1, VC2, VC3).

[Figure: density plot of the estimated VCs; horizontal axis x from 0 to 0.25, vertical axis Density from 0 to 50. Black: 15000 estimated VC1; red: 15000 estimated VC2; blue: 15000 estimated VC3.]

Note: this is not the “density” of each VC …

This plot does contain useful “global” information on each VC (range, “HDR” …).

The “global” information is taken as the “prior” (SGA serves as a pilot analysis).

Our Empirical Bayes Approach

A 7-step algorithm:

1. Apply SGA to get 15000 VC estimates;
2. Transform to the “ANOVA Components (AC)”;
3. Apply Jeffreys prior (non-informative);
4. Fit an Inverted Gamma (IG) to each AC (prior density);
-----
5. Derive the posterior density (and the posterior estimate) of each AC;
6. Transform the posterior estimate of AC back to VC (reverse step 2);
7. Mixed model analysis: fix the VC value at the posterior estimate of VC (the EB estimator of VC), and approximate by the standard normal distribution.
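Steps 4 and 5 can be sketched for a single AC as follows. This is only an illustration under stated assumptions, not the authors' implementation: each AC is treated as a mean square MS with df degrees of freedom, so that df * MS / theta is chi-square distributed; the IG prior is fitted by moment matching (the slides do not say how the IG is fitted); and the posterior mean is used as the posterior estimate. Steps 2, 3 and 6 are not shown, and all function names are hypothetical.

```python
from statistics import mean, variance


def fit_inverse_gamma_prior(mean_squares, df):
    """Moment-match an inverse gamma prior IG(shape, scale) to per-gene mean squares.

    Assumption: the spread of the observed mean squares is the prior spread
    plus chi-square sampling noise, so the noise part is subtracted first.
    """
    m = mean(mean_squares)
    v = variance(mean_squares)
    prior_var = (v - 2.0 * m * m / df) / (1.0 + 2.0 / df)
    if prior_var <= 0.0:
        raise ValueError("observed spread is no larger than sampling noise")
    shape = m * m / prior_var + 2.0   # IG variance = mean^2 / (shape - 2)
    scale = m * (shape - 1.0)         # IG mean = scale / (shape - 1)
    return shape, scale


def posterior_mean(mean_square, df, shape, scale):
    """Posterior mean of theta: the posterior is IG(shape + df/2, scale + df*MS/2)."""
    return (scale + 0.5 * df * mean_square) / (shape + 0.5 * df - 1.0)
```

The posterior mean is a weighted compromise between the gene's own mean square and the prior mean, so genes with extreme single-gene estimates are pulled toward the bulk of the 15000 estimates; the smaller df is, the stronger the pull.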

Real Data Example: Cornell Data

Number of significant genes

                 Bonferroni (.05)     False Discovery Rate (.05)
Contrast         S.G.A.    E.B.       S.G.A.    E.B.
Trt1 vs. Trt2    0         5          0         38
Trt1 vs. Trt3    0         24         0         91
Trt1 vs. Trt4    0         9          0         28
Trt1 vs. Trt5    0         7          0         48
Trt2 vs. Trt3    0         33         0         276
Trt2 vs. Trt4    0         22         0         209
Trt2 vs. Trt5    0         37         0         183
Trt3 vs. Trt4    0         50         0         184
Trt3 vs. Trt5    0         48         0         190
Trt4 vs. Trt5    0         9          0         68

Significance Test:

Simulation Studies

Design: the structure mimics the true data. Parameters are set to the values estimated from the true data set. For the 3 VCs: σgij = 0.01, σgijk = 0.015, σgijkl = 0.072.

5500 genes are simulated, of which 500 are “significantly expressed” and 5000 are “non-significantly expressed”, with Trt means:

                              Trt1   Trt2   Trt3   Trt4   Trt5
Significantly expressed       0      0.15   0.30   0.45   0.60
Non-significantly expressed   0      0      0      0      0
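A minimal sketch of how one simulated gene could be generated under this design. Assumptions not stated on the slides: the three VC values are treated as variances of normally distributed effects, the gene mean and the probe (fixed) effects are set to zero, and the function names are hypothetical.

```python
import random
from math import sqrt


def simulate_gene(trt_means, vc_prep, vc_chip, vc_resid, rng,
                  n_prep=2, n_chip=2, n_probe=20):
    """Return {(i, j, k, l): y} for one gene: 5 x 2 x 2 x 20 = 400 values by default.

    Preparations are nested in treatments and chips are nested in preparations.
    Probe effects are omitted (assumed zero) in this sketch.
    """
    data = {}
    for i, mu in enumerate(trt_means):
        for j in range(n_prep):
            prep_effect = rng.gauss(0.0, sqrt(vc_prep))
            for k in range(n_chip):
                chip_effect = rng.gauss(0.0, sqrt(vc_chip))
                for l in range(n_probe):
                    noise = rng.gauss(0.0, sqrt(vc_resid))
                    data[(i, j, k, l)] = mu + prep_effect + chip_effect + noise
    return data


def treatment_means(data, n_trt):
    """Average the simulated values within each treatment."""
    sums = [0.0] * n_trt
    counts = [0] * n_trt
    for (i, _, _, _), y in data.items():
        sums[i] += y
        counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


if __name__ == "__main__":
    rng = random.Random(2004)
    gene = simulate_gene([0.0, 0.15, 0.30, 0.45, 0.60], 0.01, 0.015, 0.072, rng)
    print(len(gene), treatment_means(gene, 5))
```

Repeating the call 500 times with the non-zero means and 5000 times with all-zero means would give a data set with the layout described above.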

Simulation Results (1)

EB estimator vs. REML estimator: bias, variance and MSE

            VC1 (0.01)              VC2 (0.015)             VC3 (0.072)
            SGA         EB          SGA         EB          SGA         EB
Bias        1.3x10^-3   4.3x10^-4   9.2x10^-4   1.4x10^-4   4.4x10^-5   4.2x10^-5
Variance    1.6x10^-4   4.6x10^-5   6.9x10^-5   2.1x10^-5   3.2x10^-5   8.0x10^-6
MSE         1.6x10^-4   4.6x10^-5   6.9x10^-5   2.1x10^-5   3.2x10^-5   8.0x10^-6

The bias, variance and MSE of EB are only fractions of those of SGA.
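The MSE row equals the Variance row to two significant digits because MSE = bias^2 + variance and the squared biases are one to two orders of magnitude smaller than the variances. A small check using the reported values (the helper name is hypothetical):

```python
def mse(bias, variance):
    """Mean squared error decomposition: MSE = bias^2 + variance."""
    return bias ** 2 + variance


# (bias, variance) pairs as reported in the table above.
reported = {
    ("VC1", "SGA"): (1.3e-3, 1.6e-4),
    ("VC1", "EB"): (4.3e-4, 4.6e-5),
    ("VC2", "SGA"): (9.2e-4, 6.9e-5),
    ("VC2", "EB"): (1.4e-4, 2.1e-5),
    ("VC3", "SGA"): (4.4e-5, 3.2e-5),
    ("VC3", "EB"): (4.2e-5, 8.0e-6),
}

if __name__ == "__main__":
    for (vc, method), (bias, var) in reported.items():
        print(vc, method, f"{mse(bias, var):.2e}")
```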

Simulation Results (2):

[Figure: density curves of the test statistics; horizontal axis x from -6 to 6, vertical axis Density from 0.0 to 0.5.]

The null distribution of the t test statistics:
1. SGA (red, expected to be a t distribution with df=5);
2. EB with df=30 (blue);
3. EB with df=1000 (green);
4. Truth (black, expected to be the standard normal distribution).
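The practical difference between the curves is in the tails. A small Monte Carlo sketch, assuming the textbook construction of a t variate as a standard normal divided by the square root of an independent chi-square over its df (function names are hypothetical):

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_null_t(df, n_draws, rng):
    """Draw t-distributed values as Z / sqrt(chi2_df / df)."""
    draws = []
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
        draws.append(z / sqrt(chi2 / df))
    return draws


def tail_fraction(draws, cutoff):
    """Fraction of draws whose absolute value exceeds the cutoff."""
    return sum(abs(t) > cutoff for t in draws) / len(draws)


if __name__ == "__main__":
    rng = random.Random(0)
    normal_cutoff = NormalDist().inv_cdf(0.975)  # two-sided 5% cutoff for N(0, 1)
    for df in (5, 30, 1000):
        print(df, tail_fraction(simulate_null_t(df, 20000, rng), normal_cutoff))
```

With df=5 a noticeably larger share of null statistics falls beyond the normal cutoff than with df=30 or df=1000, so a df=5 test needs a wider critical value to hold its size, and that costs power.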

Simulation Results (3):

Test Size and Power Calculation:

                        Size                                 Power (% of Truth)
Contrast (Trt. diff)    SGA      EB(30)   EB(1000)  Truth    SGA              EB(30)           EB(1000)         Truth
T1 vs. T2 (0.15)        0.047    0.044    0.051     0.044    0.116 (66.7%)    0.172 (98.9%)    0.176 (101.1%)   0.174 (100%)
T1 vs. T3 (0.30)        0.051    0.051    0.060     0.053    0.418 (70.1%)    0.558 (93.6%)    0.596 (100%)     0.596 (100%)
T1 vs. T4 (0.45)        0.050    0.051    0.057     0.047    0.692 (76.5%)    0.870 (96.2%)    0.894 (98.9%)    0.904 (100%)
T1 vs. T5 (0.60)        0.054    0.047    0.055     0.048    0.890 (91.8%)    0.964 (99.4%)    0.972 (100.2%)   0.970 (100%)
T2 vs. T3 (0.15)        0.051    0.054    0.063     0.054    0.126 (67.7%)    0.164 (88.2%)    0.188 (101.1%)   0.186 (100%)
T2 vs. T4 (0.30)        0.048    0.051    0.060     0.049    0.388 (68.3%)    0.530 (93.3%)    0.554 (97.5%)    0.568 (100%)
T2 vs. T5 (0.45)        0.050    0.048    0.057     0.052    0.672 (83.4%)    0.772 (95.8%)    0.796 (98.6%)    0.806 (100%)
T3 vs. T4 (0.15)        0.050    0.051    0.060     0.051    0.158 (81.4%)    0.194 (100%)     0.202 (104.1%)   0.194 (100%)
T3 vs. T5 (0.30)        0.049    0.047    0.055     0.050    0.380 (78.2%)    0.446 (91.8%)    0.468 (96.3%)    0.486 (100%)
T4 vs. T5 (0.15)        0.048    0.046    0.054     0.049    0.120 (75.0%)    0.144 (90.0%)    0.156 (97.5%)    0.160 (100%)
Mean                    0.0498   0.0490   0.0572    0.0497   75.91%           94.72%           99.53%           100%
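The Truth column can be compared with a back-of-the-envelope calculation. A sketch under assumptions not stated on the slides: the three VC values are variances, the design is balanced (2 preps per treatment, 2 chips per prep, 20 probes per chip), probe effects cancel in a treatment difference, and the test is a two-sided z-test at level 0.05 with known VCs. Function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def contrast_variance(vc_prep, vc_chip, vc_resid, n_prep=2, n_chip=2, n_probe=20):
    """Variance of the difference between two treatment means in the balanced design."""
    per_treatment = (vc_prep / n_prep
                     + vc_chip / (n_prep * n_chip)
                     + vc_resid / (n_prep * n_chip * n_probe))
    return 2.0 * per_treatment


def z_test_power(diff, variance, alpha=0.05):
    """Power of a two-sided z-test for a true difference `diff`."""
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = diff / sqrt(variance)
    return z.cdf(-crit + shift) + z.cdf(-crit - shift)


if __name__ == "__main__":
    var = contrast_variance(0.01, 0.015, 0.072)
    for diff in (0.0, 0.15, 0.30, 0.45, 0.60):
        print(diff, round(z_test_power(diff, var), 3))
```

Under these assumptions the prep-level VC dominates the contrast variance, which fits the picture that the few degrees of freedom at the prep level are what limit the single-gene test.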

Discussion

Why does the EB estimator “beat” the REML estimator?

- The prior density contains “truth” information.

However, the prior is estimated from the data: with “large p” the prior is likely to be good; with “small p” the prior may not be good …

… for “small p”, the EB estimator can be biased, and a gain in MSE is not guaranteed!

Q: How to control (large p vs. small p)? (A controlling system for the EB method: determine the shrinkage process to get the maximum gain in MSE.)

Applications

Especially designed for microarrays:
- Microarray (cDNA, oligonucleotide);
- Proteomics.

Extension to general data sets (mixed model), if a controlling system is built (in the near future).
