Empirical Bayes Analysis of Variance Component Models for Microarray Data

S. Feng (1), R. Wolfinger (2), T. Chu (2), G. Gibson (3), L. McGraw (4)

1. Department of Statistics, North Carolina State University; 2. SAS Institute, Cary, NC 27513; 3. Department of Genetics, North Carolina State University; 4. Department of Genetics and Development, Cornell University.

Microarray: Genome-wide gene expression

1. Originally just a term in engineering: millions of small electrodes arrayed on a (silicon) slide.

2. Introduced to genetics/genomics in 1996: thousands of DNA sequences arrayed on a glass slide, so that genome-wide gene expression profiles can be investigated simultaneously.

Scientific Papers (ISI) with Title Including "Microarray"

[Figure: bar chart of the number of papers per year, 1996 to 2003; vertical axis from 0 to 4000.]

At JSM 2004:
- More than 30 sections
- More than 100 statistical papers/posters

Two major types of Microarray

Nature cell biology Aug. 2001 v3 (8)

More refs: Nature Review of Genetics, May. 2004 v5 (5)

Oligonucleotide Array | cDNA Array

Oligonucleotide Array

PM (perfect match) / MM (mismatch)

16-20 probe pairs per probe set; 1 probe set per gene.

This is “per gene”. The PM/MM effects are considered as fixed effects.

Chu, T., Weir, B., and Wolfinger, R. (02, 04).

Lipshutz et al. (1999), Nature Genetics, 21(1): 20-24.

Statistical problems in Microarray?

- Multiple testing: p-values.
- Variable selection.
- Discrimination.
- Clustering of samples; of genes.
- Time course experiments.
- Clinical trials (Merck, GSK, …)
- Gene networks, pathways

Terry Speed homepage: www.stat.berkeley.edu/users/terry/zarray/Html

- Planning of experiments: design, sample sizes.
- Quality control: variation in RNA samples, variation among arrays.
- Background subtraction.
- Normalization.
- Significance analysis (supervised vs. non-supervised).

Statistical Computing

Significance Analysis (challenge)

Question: For the contrasts of interest (e.g., Trt1 vs. Trt2), which genes are differentially (significantly) expressed?

[Data layout: one row per gene (Gene 1, Gene 2, Gene 3, …, Gene n) and one column per treatment (Trt 1, Trt 2, …, Trt k).]

Oligonucleotide array (supervised)

An example of a common problem in genome-wide studies: the “large p, small n” problem.

Large p: number of features (genes, probes, bio-markers, …). Multiple-testing problems? Computation …?

Small n: number of replications. Low statistical power.

Data from McGraw Lab., Cornell Univ.

Research interest: to investigate the effects of different male Drosophila genotypes (5 lines) on post-mating gene expression in female flies.

[Diagram: experimental design. Males from 5 lines (line1 to line5; the 5 Trt) are crossed (X) with females. Female flies killed, mRNA prepared: preps (1), (2), (3), (4), …, (9), (10) (random effect 1). Chips: Chip1, Chip2, …, Chip19, Chip20 (random effect 2). ~15000 genes; for each gene, PM/MM probes (fixed effect). (… random effect 3)]

Data from McGraw Lab., Cornell

Index i: Trt1, Trt2, Trt3, Trt4, Trt5

Index j: Prep1, Prep2

Index k: Chip1, Chip2

Index l: 1, 2, 3, …, 19, 20

Genes g: Gene 1, Gene 2, …, Gene 15000

Total: 5 x 2 x 2 x 20 = 400 points y_{gijkl} for each gene g

Variance components at the nested levels: σ_{gij}, σ_{gijk}, σ_{gijkl}

Linear Mixed Model (for each gene g):

Y_{gijkl} = G_g + (G*trt)_{gi} + (G*Probe)_{gl} + (G*trt*prep)_{gij} + (G*trt*prep*chip)_{gijk} + γ_{gijkl}.

Significantly Expressed Genes: by SGA

Number of significantly expressed genes for the 10 possible contrasts:

Contrast         Bonferroni (.05)   F.D.R. (.05)
Trt1 vs. Trt2    0                  0
Trt1 vs. Trt3    0                  0
Trt1 vs. Trt4    0                  0
Trt1 vs. Trt5    0                  0
Trt2 vs. Trt3    0                  0
Trt2 vs. Trt4    0                  0
Trt2 vs. Trt5    0                  0
Trt3 vs. Trt4    0                  0
Trt3 vs. Trt5    0                  0
Trt4 vs. Trt5    0                  0
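The two column headings correspond to two ways of correcting for multiplicity. The slides do not say which FDR procedure was used; the sketch below assumes the Benjamini-Hochberg step-up rule, a common choice, and the function names and p-values are made up for illustration.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject H0 for every p-value at or below alpha / (number of tests)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up rule (assumed here; not confirmed by the slides)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank r such that the r-th smallest p-value is <= alpha * r / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


# Hypothetical p-values for five tests.
pvals = [0.001, 0.012, 0.02, 0.03, 0.5]
print(sum(bonferroni_reject(pvals)), sum(benjamini_hochberg_reject(pvals)))
```

With 15000 x 10 tests, a joint Bonferroni correction would compare each p-value with 0.05 / 150000, which shows how demanding the correction is when every single test already has low power.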

Possible Reasons:

1. … lower level analysis …

2. Large p: multiple testing problems (15000 x 10 tests). FWR vs. FDR? (Not addressed in this study.)

3. Small n: low power in each single test:
   - poorly estimated VC;
   - small d.f. in testing …

In this study, we try to improve the power of each single test …

Our Idea: Taking advantage of “large p”

Perform SGA (by mixed model) and obtain the VC estimates:
Gene 1 (VC1, VC2, VC3);
Gene 2 (VC1, VC2, VC3);
… …
Gene 15000 (VC1, VC2, VC3).

[Figure: density plot of the estimated VCs; x axis from 0 to 0.25, density axis from 0 to 50. Black: 15000 estimated VC1. Red: 15000 estimated VC2. Blue: 15000 estimated VC3.]

Note: Not the “density” of each VC …

This plot does contain useful “global” information on each VC (range, “HDR”, …).

The “global” information is taken as the “prior” (SGA as a pilot analysis).

Our Empirical Bayes Approach

A 7-step algorithm:

1. Apply SGA to get 15000 VC estimates;
2. Transform to the “ANOVA Components (AC)”;
3. Apply the Jeffreys prior (non-informative);
4. Fit an Inverted Gamma (IG) to each AC (prior density);
--------------------------------------------------
5. Derive the posterior density (and the posterior estimate) of each AC;
6. Transform the posterior estimate of AC back to VC (reverse step 2);
7. Mixed model analysis: fix the VC values at the posterior estimates of VC (the EB estimator of VC), and approximate the test statistic by a standard normal distribution.
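The shrinkage idea behind steps 4 and 5 can be sketched with the standard conjugate result for a single variance: if a variance estimate s2 has df degrees of freedom and the true variance has an inverse gamma IG(shape, scale) prior, the posterior is IG(shape + df/2, scale + df*s2/2). This is a simplified illustration, not the authors' algorithm: it skips the AC transformation and the Jeffreys-prior step, fits the prior by moment matching directly to the raw estimates, and uses hypothetical values (df = 5, a common true variance of 0.072) and hypothetical function names.

```python
import random
import statistics


def fit_inverse_gamma_by_moments(values):
    """Moment-match IG(shape, scale): mean = scale/(shape-1), var = mean^2/(shape-2)."""
    mean = statistics.fmean(values)
    var = statistics.variance(values)
    shape = mean * mean / var + 2.0
    scale = mean * (shape - 1.0)
    return shape, scale


def posterior_mean_variance(s2, df, shape, scale):
    """Posterior mean of the variance under an IG prior and a chi-square likelihood."""
    return (scale + 0.5 * df * s2) / (shape + 0.5 * df - 1.0)


def mse(estimates, truth):
    return statistics.fmean([(e - truth) ** 2 for e in estimates])


# Hypothetical setting: 15000 genes sharing one true variance, each estimated with 5 d.f.
rng = random.Random(0)
TRUE_VAR, DF, N_GENES = 0.072, 5, 15000
# s2 is distributed as TRUE_VAR * chi-square(DF) / DF.
raw = [TRUE_VAR * rng.gammavariate(DF / 2.0, 2.0) / DF for _ in range(N_GENES)]

shape, scale = fit_inverse_gamma_by_moments(raw)
shrunk = [posterior_mean_variance(s2, DF, shape, scale) for s2 in raw]

print(f"MSE raw: {mse(raw, TRUE_VAR):.2e}   MSE shrunk: {mse(shrunk, TRUE_VAR):.2e}")
```

Each shrunken value is a weighted average of the gene's own estimate and the prior mean, which is why its spread across genes is smaller than that of the raw estimates.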

Real Data Example: Cornell Data

Number of significant genes

                 Bonferroni (.05)     False Discovery Rate (.05)
Contrast         S.G.A.    E.B.       S.G.A.    E.B.
Trt1 vs. Trt2    0         5          0         38
Trt1 vs. Trt3    0         24         0         91
Trt1 vs. Trt4    0         9          0         28
Trt1 vs. Trt5    0         7          0         48
Trt2 vs. Trt3    0         33         0         276
Trt2 vs. Trt4    0         22         0         209
Trt2 vs. Trt5    0         37         0         183
Trt3 vs. Trt4    0         50         0         184
Trt3 vs. Trt5    0         48         0         190
Trt4 vs. Trt5    0         9          0         68

Significance Test:

Simulation Studies

Design: the structure mimics the true data. Parameters are set to the values estimated from the true data set. For the 3 VC: σ_{gij} = 0.01, σ_{gijk} = 0.015, σ_{gijkl} = 0.072.

5500 genes are simulated, among which 500 are “significantly expressed” and 5000 are “non-significantly expressed”, with Trt means:

                               Trt1   Trt2   Trt3   Trt4   Trt5
Significantly expressed        0      0.15   0.30   0.45   0.60
Non-significantly expressed    0      0      0      0      0
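A minimal sketch of how one gene's 400 observations could be generated under this design. It makes several assumptions not stated on the slide: the three values 0.01, 0.015 and 0.072 are treated as variances of normally distributed prep, chip and residual effects; the gene mean and probe effects are set to zero; and the function and constant names are hypothetical.

```python
import random

# Assumed to be variances (the slide calls them VC1, VC2, VC3).
VC_PREP, VC_CHIP, VC_RESID = 0.01, 0.015, 0.072
N_PREP, N_CHIP, N_PROBE = 2, 2, 20


def simulate_gene(trt_means, rng):
    """Return rows (i, j, k, l, y) for one gene: trt i, prep j, chip k, probe l."""
    rows = []
    for i, mu in enumerate(trt_means):
        for j in range(N_PREP):
            prep_effect = rng.gauss(0.0, VC_PREP ** 0.5)      # shared by both chips of a prep
            for k in range(N_CHIP):
                chip_effect = rng.gauss(0.0, VC_CHIP ** 0.5)  # shared by all probes on a chip
                for l in range(N_PROBE):
                    resid = rng.gauss(0.0, VC_RESID ** 0.5)
                    rows.append((i, j, k, l, mu + prep_effect + chip_effect + resid))
    return rows


rng = random.Random(1)
null_gene = simulate_gene([0, 0, 0, 0, 0], rng)
signif_gene = simulate_gene([0, 0.15, 0.30, 0.45, 0.60], rng)
print(len(null_gene), len(signif_gene))
```

Repeating the call 5000 times with zero means and 500 times with the non-zero means would give a data set with the same shape as the one described above.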

Simulation Results (1)

EB estimator vs. REML estimator: bias, variance and MSE

            VC1 (0.01)               VC2 (0.015)              VC3 (0.072)
            SGA         EB           SGA         EB           SGA         EB
Bias        1.3x10^-3   4.3x10^-4    9.2x10^-4   1.4x10^-4    4.4x10^-5   4.2x10^-5
Variance    1.6x10^-4   4.6x10^-5    6.9x10^-5   2.1x10^-5    3.2x10^-5   8.0x10^-6
MSE         1.6x10^-4   4.6x10^-5    6.9x10^-5   2.1x10^-5    3.2x10^-5   8.0x10^-6

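For every VC, the variance and MSE of EB are only fractions (roughly one quarter to one third) of those of SGA; the bias of EB is also smaller, markedly so for VC1 and VC2.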
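The Variance and MSE rows agree to two significant figures because the squared bias is negligible next to the variance. A worked check using the standard decomposition, with the SGA values for VC1:

```latex
\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta) + \mathrm{Bias}(\hat\theta)^2,
\qquad
\text{VC1, SGA: } 1.6\times10^{-4} + (1.3\times10^{-3})^2
= 1.6\times10^{-4} + 1.69\times10^{-6} \approx 1.6\times10^{-4}.
```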

Simulation Results (2):

[Figure: density plot of the test statistics under the null; x axis from -6 to 6.]

The null distribution of the t test statistics:
1. SGA (red, expected to be t distribution with df=5);
2. EB with df=30 (blue);
3. EB with df=1000 (green);
4. Truth (black, expected to be standard normal distribution).
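To see why the reference degrees of freedom matter, the following Monte Carlo sketch estimates how much probability a Student t distribution puts beyond the two-sided 5% normal cutoff of 1.96 for df = 5, 30 and 1000. It only illustrates the tail behaviour of the reference distributions; it does not reproduce the simulation on the slide, and the function name, seed and number of draws are arbitrary choices.

```python
import random


def tail_rate_beyond_cutoff(df, n_draws, rng, cutoff=1.96):
    """Monte Carlo estimate of P(|T| > cutoff) for T ~ Student t with df degrees of freedom."""
    exceed = 0
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
        t = z / (chi2 / df) ** 0.5
        if abs(t) > cutoff:
            exceed += 1
    return exceed / n_draws


rng = random.Random(2004)
rates = {df: tail_rate_beyond_cutoff(df, 100_000, rng) for df in (5, 30, 1000)}
for df, rate in rates.items():
    print(f"df={df:5d}   P(|t| > 1.96) ~ {rate:.3f}")
```

The heavier tails at df = 5 mean a test referred to that distribution needs a much larger critical value than one referred to a distribution close to the standard normal, which is one source of the power difference between SGA and EB.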

Simulation Results (3):

Test Size and Power Calculation:

Size
Contrast (Trt. Diff)   SGA      EB(30)   EB(1000)   Truth
T1 vs. T2 (0.15)       0.047    0.044    0.051      0.044
T1 vs. T3 (0.30)       0.051    0.051    0.060      0.053
T1 vs. T4 (0.45)       0.050    0.051    0.057      0.047
T1 vs. T5 (0.60)       0.054    0.047    0.055      0.048
T2 vs. T3 (0.15)       0.051    0.054    0.063      0.054
T2 vs. T4 (0.30)       0.048    0.051    0.060      0.049
T2 vs. T5 (0.45)       0.050    0.048    0.057      0.052
T3 vs. T4 (0.15)       0.050    0.051    0.060      0.051
T3 vs. T5 (0.30)       0.049    0.047    0.055      0.050
T4 vs. T5 (0.15)       0.048    0.046    0.054      0.049
Mean                   0.0498   0.0490   0.0572     0.0497

Power (% Power, relative to Truth)
Contrast (Trt. Diff)   SGA             EB(30)          EB(1000)         Truth
T1 vs. T2 (0.15)       0.116 (66.7%)   0.172 (98.9%)   0.176 (101.1%)   0.174 (100%)
T1 vs. T3 (0.30)       0.418 (70.1%)   0.558 (93.6%)   0.596 (100%)     0.596 (100%)
T1 vs. T4 (0.45)       0.692 (76.5%)   0.870 (96.2%)   0.894 (98.9%)    0.904 (100%)
T1 vs. T5 (0.60)       0.890 (91.8%)   0.964 (99.4%)   0.972 (100.2%)   0.970 (100%)
T2 vs. T3 (0.15)       0.126 (67.7%)   0.164 (88.2%)   0.188 (101.1%)   0.186 (100%)
T2 vs. T4 (0.30)       0.388 (68.3%)   0.530 (93.3%)   0.554 (97.5%)    0.568 (100%)
T2 vs. T5 (0.45)       0.672 (83.4%)   0.772 (95.8%)   0.796 (98.6%)    0.806 (100%)
T3 vs. T4 (0.15)       0.158 (81.4%)   0.194 (100%)    0.202 (104.1%)   0.194 (100%)
T3 vs. T5 (0.30)       0.380 (78.2%)   0.446 (91.8%)   0.468 (96.3%)    0.486 (100%)
T4 vs. T5 (0.15)       0.120 (75.0%)   0.144 (90.0%)   0.156 (97.5%)    0.160 (100%)
Mean (% Power)         75.91%          94.72%          99.53%           100%
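The percentages in parentheses appear to be each method's power divided by the Truth power for the same contrast. A small check for the T1 vs. T2 contrast, using the values reported above (the helper name is hypothetical):

```python
def relative_power(power, truth_power):
    """Power expressed as a percentage of the power of the 'Truth' test."""
    return 100.0 * power / truth_power


# Power values for T1 vs. T2 (Trt. Diff 0.15), copied from the results above.
t1_vs_t2 = {"SGA": 0.116, "EB(30)": 0.172, "EB(1000)": 0.176, "Truth": 0.174}
for method, power in t1_vs_t2.items():
    print(f"{method:9s} {relative_power(power, t1_vs_t2['Truth']):6.1f}%")
```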

Discussion

Why does the EB estimator “beat” the REML estimator?

- The prior density contains “truth” information.

However, the prior is estimated from the data: with “large p” the prior is likely to be good; with “small p” the prior may not be good …

… for “small p”, the EB estimator can be biased! Gain in MSE is not guaranteed!

Q: How to control (large p vs. small p)? (A controlling system for the EB method: determine the shrinkage process to get the maximum gain in MSE.)

Applications

Especially designed for Microarray:

– Microarray (cDNA, Oligonucleotide);

– Proteomics

Extension to general data sets (Mixed model), if controlling system built (in the near future).
