test of significance for small samples javier cabrera
Post on 09-Jan-2016
1
Test of significance for small samples
Javier Cabrera, Director, Biostatistics Institute, Rutgers University
Dhammika Amaratunga, Johnson & Johnson Pharmaceutical Research & Development
2
Outline
• Microarray Experiments and Differential expression
• Small sample size issues
• Conditional t approach
• Comparison with other methods
• Extensions
Reference: Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, 2004. Amaratunga, Cabrera.
Software: DNAMR and DNAMRweb
http://www.rci.rutgers.edu/~cabrera/DNAMR
3
A gene is expressed via the process:
DNA → (transcription) → mRNA → (translation) → protein
DNA → (replication) → DNA
The central dogma of molecular biology
Genes: A gene is a segment of DNA whose sequence of bases (nucleotides) codes for a specific protein.
AKAP6: CATCATGCAGCAGGTCAAACAAGGCATCTCCTAGTATTGCATCCTACA……
4
Microarray experiment
Microarray: print or synthesize the cDNA or oligonucleotide preparation onto a glass slide.
Sample: take mRNA from the biological sample, then reverse transcribe and label.
Microarray + Sample → hybridize, wash and scan → image → quantify spot intensities → gene expression data.
5k-50k genes arrayed in rectangular grid; one spot per gene.
5
Differential gene expression
An organism’s genome is the complete set of genes in each of its cells. Given an organism, every one of its cells has a copy of the exact same genome, but
• different cells express different genes
• different genes express under different conditions
• differential gene expression leads to altered cell states
6
        C1      C2      C3      T1      T2      T3
G1     4.67    4.44    4.42    4.73    4.85    4.69
G2     3.13    2.54    1.96    0.97    2.38    3.36
G3     6.22    6.77    5.32    6.40    6.94    6.87
G4    10.74   10.81   10.69   10.75   10.68   10.68
G5     3.76    4.16    5.27    3.05    3.20    2.85
G6     6.95    6.78    6.33    6.81    6.95    7.01
G7     4.98    4.61    4.56    4.57    4.90    4.44
G8     2.72    3.30    3.24    3.22    3.42    3.22
G9     5.29    4.79    5.13    3.31    4.67    5.27
G10    5.12    4.85    3.79    4.13    3.12    4.79
G11    4.67    3.50    4.77    4.09    3.86    2.88
G12    6.22    6.42    5.02    6.38    6.54    6.80
G13    2.88    3.76    2.78    2.98    4.81    4.15
...
Differential Expression for small samples
1. Preprocessed data.
2. Perform a t-test for each gene.
3. Select the most significant subset.
7
The t test statistic for testing for a mean effect is:
$$T_g = \frac{\bar{X}_{g2} - \bar{X}_{g1}}{s_g\,(1/n_1 + 1/n_2)^{1/2}}$$
where $s_g$, the pooled standard error, is the positive square root of:
$$s_g^2 = \frac{(n_1 - 1)\,s_{g1}^2 + (n_2 - 1)\,s_{g2}^2}{n_1 + n_2 - 2}$$
If there is no mean effect,
$$T_g \sim t_{n_1 + n_2 - 2}$$
(Student / Fisher)
The pooled variances T-test
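As a minimal sketch of the pooled-variance statistic on this slide, the function below computes T_g and s_g for the first five genes of the example table (three controls, three treatments). The function name `pooled_t` and the dictionary layout are illustrative choices, not part of the DNAMR software.

```python
import math
import statistics

def pooled_t(group1, group2):
    """Return (T_g, s_g): pooled-variance t statistic and pooled standard error."""
    n1, n2 = len(group1), len(group2)
    v1 = statistics.variance(group1)   # s_g1^2
    v2 = statistics.variance(group2)   # s_g2^2
    s = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    diff = statistics.mean(group2) - statistics.mean(group1)
    return diff / (s * math.sqrt(1 / n1 + 1 / n2)), s

# First five genes of the example table: (C1, C2, C3), (T1, T2, T3)
genes = {
    "G1": ([4.67, 4.44, 4.42], [4.73, 4.85, 4.69]),
    "G2": ([3.13, 2.54, 1.96], [0.97, 2.38, 3.36]),
    "G3": ([6.22, 6.77, 5.32], [6.40, 6.94, 6.87]),
    "G4": ([10.74, 10.81, 10.69], [10.75, 10.68, 10.68]),
    "G5": ([3.76, 4.16, 5.27], [3.05, 3.20, 2.85]),
}
results = {g: pooled_t(c, t) for g, (c, t) in genes.items()}

# Rank genes by |T_g|, largest first
for g, (t, s) in sorted(results.items(), key=lambda kv: -abs(kv[1][0])):
    print(f"{g}: T = {t:+.2f}, s_g = {s:.3f}")
```

Under the null hypothesis each T_g would be referred to a t distribution with n1 + n2 - 2 = 4 degrees of freedom; the sketch stops at the statistic because the standard library has no t distribution.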
8
[Figure: plot of t vs sp and distribution of sp, random data; panel labels 300 and 21983]
Differentially expressed genes have smaller sp.
Is this effect Statistical or Biological?
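The statistical part of this effect can be illustrated with purely random data. The sketch below assumes 20,000 null genes with 3 arrays per group and unit variance (all assumed values, not taken from the slides), selects genes with |t| above the two-sided 5% critical value for 4 degrees of freedom, and compares the average sp of selected and unselected genes.

```python
import math
import random

random.seed(0)
N_GENES = 20000      # assumed number of null genes
N_PER_GROUP = 3      # assumed arrays per group, so t has 4 degrees of freedom
CUTOFF = 2.776       # two-sided 5% critical value of Student's t with 4 df

def t_and_sp(x1, x2):
    """Pooled-variance t statistic and pooled standard error sp."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss = sum((x - m1) ** 2 for x in x1) + sum((x - m2) ** 2 for x in x2)
    sp = math.sqrt(ss / (n1 + n2 - 2))
    return (m2 - m1) / (sp * math.sqrt(1 / n1 + 1 / n2)), sp

selected, rest = [], []
for _ in range(N_GENES):
    x1 = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    x2 = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    t, sp = t_and_sp(x1, x2)
    (selected if abs(t) > CUTOFF else rest).append(sp)

mean_sp_selected = sum(selected) / len(selected)
mean_sp_rest = sum(rest) / len(rest)
print(f"selected: {len(selected)} genes, mean sp = {mean_sp_selected:.3f}")
print(f"others:   {len(rest)} genes, mean sp = {mean_sp_rest:.3f}")
```

No gene here is truly differentially expressed, so any tendency of the selected genes toward small sp comes from the t statistic dividing by sp, not from biology.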
9
500 Simulation: 1000 genes, 4 controls + 4 treatments, iid Normal(0, σ²)
100 genes are differentially expressed with mean diff = +1 or -1

σ² = 1, constant:
          False Discoveries   True Discoveries
T-test           44                  22
z-test           43                  29

σ² from Chi-square(df=3):
          False Discoveries   True Discoveries
T-test           43                  28
z-test           53                  13
10
The effect of small sample size
Often the sample size per group is small. Consequences:
• unreliable variances (inferences)
• dependence between the test statistics (tg) and the standard error estimates (sg)
Approaches:
• borrow strength across genes (LPE/EB)
• regularize the test statistics (SAM)
• work with tg|sg (Conditional t).
11
Analysis results
Top 10 genes (sorted by t-test p-value)
Gene      Fold   Dir   p          p(Bonf)
G6546      2.36  D     0.000004   0.0964
G19945     3.25  U     0.000005   0.1102
G21586     1.64  U     0.000008   0.1765
G18970     2.52  U     0.000019   0.4220
G7432      3.70  D     0.000033   0.7248
G19057     1.85  U     0.000046   1.0000
G17361     4.34  D     0.000067   1.0000
G8525      5.57  D     0.000067   1.0000
G425      18.11  D     0.000078   1.0000
G8524      4.74  D     0.000109   1.0000
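The p(Bonf) column appears to be a Bonferroni adjustment, which multiplies each p-value by the number of tests and caps the result at 1. The slide does not state the number of genes tested, so the sketch below takes it as a parameter `m`; the function name and the example values are hypothetical.

```python
def bonferroni(p, m):
    """Bonferroni-adjusted p-value for one of m tests, capped at 1."""
    return min(1.0, p * m)

# Hypothetical example: three raw p-values out of m = 1000 tests
raw = [0.00001, 0.0002, 0.004]
adjusted = [bonferroni(p, 1000) for p in raw]
print(adjusted)
```

With tens of thousands of genes, a raw p-value has to be of the order of 0.000001 before the adjusted value falls below 0.05, which is why most of the table's adjusted values sit at or near 1.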
12
SAM: Determining c
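Regularized statistic: $T_g(c) = r_g / (s_g + c)$
For each candidate $c = s^{\alpha}$ (the $\alpha$ percentile of the $s_g$ values):
• group the genes by $s_g$ and compute $v_j(\alpha) = \mathrm{mad}\{T_g(s^{\alpha})\}$ within group $j$, for $j = 1, \dots, 7$
• compute $cv(\alpha)$, the coefficient of variation of $v_1(\alpha), \dots, v_7(\alpha)$
Evaluate $cv(\alpha_1), \dots, cv(\alpha_7)$ at $s^{\alpha_1}, \dots, s^{\alpha_7}$ and choose the $c$ that minimizes $cv$.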
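The choice of c can be sketched as follows. This is a rough illustration in the spirit of the slide, not the SAM software: the candidate percentiles, the seven equal-sized groups of s_g, and the simulated inputs are all assumptions made for the example.

```python
import math
import random
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def choose_c(r, s, alphas=(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0), n_groups=7):
    """Pick c among percentiles of s to minimise the CV of the group-wise MADs."""
    n = len(s)
    order = sorted(range(n), key=lambda g: s[g])            # genes sorted by s_g
    groups = [order[j * n // n_groups:(j + 1) * n // n_groups]
              for j in range(n_groups)]
    s_sorted = [s[g] for g in order]
    cv_by_c = {}
    for a in alphas:
        c = s_sorted[min(n - 1, int(a * n))]                # alpha percentile of s
        v = [mad([r[g] / (s[g] + c) for g in grp]) for grp in groups]
        cv_by_c[c] = statistics.stdev(v) / statistics.mean(v)
    best_c = min(cv_by_c, key=cv_by_c.get)
    return best_c, cv_by_c

# Simulated null data (assumed): sigma^2 ~ chi-square(3), 4 vs 4 arrays
random.seed(0)
r, s = [], []
for _ in range(2000):
    sigma = math.sqrt(random.gammavariate(1.5, 2.0))
    s.append(sigma * math.sqrt(random.gammavariate(3.0, 2.0) / 6))  # pooled s, 6 df
    r.append(random.gauss(0, sigma * math.sqrt(0.5)))               # mean difference
best_c, cv_by_c = choose_c(r, s)
print(f"chosen c = {best_c:.3f}, cv = {cv_by_c[best_c]:.3f}")
```

The aim of the criterion is that the spread of T_g(c) should depend as little as possible on s_g, so the c with the flattest group-wise MADs wins.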
13
SAM: Gene selection
$\bar{T}_{(g)}(\hat{c})$ = Expected value of $T_{(g)}(\hat{c})$ under permutations
[Figure: $T_{(g)}(\hat{c})$ plotted against $\bar{T}_{(g)}(\hat{c})$ (axes labelled d and db); second panel: P(|SAM| > t) plotted against Pooled Sd]
14
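Let $X_{gij}$ denote the preprocessed intensity measurement for gene g in array i of group j.
Model: $X_{gij} = \mu_{gj} + \sigma_g \epsilon_{gij}$
Effect of interest: $\tau_g = \mu_{g2} - \mu_{g1}$
Error model: $\epsilon_{gij} \sim F_\epsilon$ (location = 0, scale = 1)
Gene mean-variance model: $(\mu_{g1}, \sigma_g^2) \sim F_{\mu,\sigma}$
with marginals: $\mu_{g1} \sim F_\mu$ and $\sigma_g^2 \sim F_\sigma$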
Conditional t: Basic Model
15
Parametric: Assume functional forms for $F_\epsilon$ and $F_{\mu,\sigma}$ and apply either a Bayes or Empirical Bayes procedure.
Nonparametric:
1. Estimate $F_{\mu,\sigma}$: edf, $\hat{F}_{\mu,\sigma}$, of $\{(\bar{X}_{g1}, s_g^2)\}$.
   Estimate $F_\epsilon$: edf, $\hat{F}_\epsilon$, of $\{(X_{gij} - \bar{X}_{gj})/s_g\}$.
   For small samples $\hat{F}_\sigma$ is not a good estimator of $F_\sigma$. Use method of moments = Target estimation.
2. Proceed via resampling and estimate the distribution: t | sp (Conditional t).
Possible approaches
16
(1) Draw a gene, g, at random from {1, …, G}. Call it g*. Then $(\bar{X}_{g^*1}, s_{g^*}^2) \sim \hat{F}_{\mu,\sigma}$.
(2) Take a random sample (with replacement) of size $n_1 + n_2$ from $\hat{F}_\epsilon$: $r^*_{ij} \sim \hat{F}_\epsilon$.
(3) Combine these to form pseudo-data: $X^*_{ij} = \bar{X}_{g^*1} + s_{g^*} r^*_{ij}$.
(4) Calculate the pooled standard error s* and t test statistic t* for the pseudo-data $\{X^*_{ij}\}$.
Procedure
17
(5) Repeat steps (1)-(4) a large number (10,000) of times.
(6) Given $\alpha$, estimate the “critical envelope”, $t_\alpha(s_g)$, as the $(\alpha/2)$ and $(1-\alpha/2)$ quantile curves in the $t_g$ vs $s_g$ relationship.
(7) Genes that fall outside the critical envelope defined by $t_\alpha(s_g)$ are deemed significant at level $\alpha$. (Overall unconditional Type I error rate = $\alpha$)
Procedure (cont.)
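Steps (1)-(7) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the DNAMR implementation: the quantile curves are approximated by binning the pseudo-data on s* and taking empirical quantiles of t* within each bin (the slides do not say how the curves are estimated), and the input data are simulated null genes.

```python
import math
import random

def mean(xs):
    return sum(xs) / len(xs)

def pooled_t(x1, x2):
    """Return (t, s) with s the pooled standard error of the two groups."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    ss = sum((x - m1) ** 2 for x in x1) + sum((x - m2) ** 2 for x in x2)
    s = math.sqrt(ss / (n1 + n2 - 2))
    return (m2 - m1) / (s * math.sqrt(1 / n1 + 1 / n2)), s

def critical_envelope(data, alpha=0.05, n_boot=10000, n_bins=10, rng=random):
    """data: list of (group1, group2) per gene.
    Returns [(s_upper, t_low, t_high), ...], one entry per bin of s*."""
    n1, n2 = len(data[0][0]), len(data[0][1])
    gene_stats, residuals = [], []
    for x1, x2 in data:
        _, s = pooled_t(x1, x2)
        m1, m2 = mean(x1), mean(x2)
        gene_stats.append((m1, s))                    # edf of (group-1 mean, s_g)
        residuals += [(x - m1) / s for x in x1]       # edf of standardized residuals
        residuals += [(x - m2) / s for x in x2]
    pseudo = []
    while len(pseudo) < n_boot:                       # step (5)
        m_star, s_star = rng.choice(gene_stats)       # step (1)
        r = [rng.choice(residuals) for _ in range(n1 + n2)]   # step (2)
        x = [m_star + s_star * rij for rij in r]      # step (3)
        try:
            t_star, s_boot = pooled_t(x[:n1], x[n1:]) # step (4)
        except ZeroDivisionError:                     # degenerate resample, redraw
            continue
        pseudo.append((s_boot, t_star))
    pseudo.sort()                                     # order by s*
    envelope = []
    for b in range(n_bins):                           # step (6), binned approximation
        chunk = pseudo[b * n_boot // n_bins:(b + 1) * n_boot // n_bins]
        ts = sorted(t for _, t in chunk)
        lo = ts[int(alpha / 2 * len(ts))]
        hi = ts[min(len(ts) - 1, int((1 - alpha / 2) * len(ts)))]
        envelope.append((chunk[-1][0], lo, hi))
    return envelope

def is_significant(t, s, envelope):
    """Step (7): True if (s, t) falls outside the envelope of the bin containing s."""
    for s_upper, lo, hi in envelope:
        if s <= s_upper:
            return t < lo or t > hi
    _, lo, hi = envelope[-1]                          # s beyond the last bin
    return t < lo or t > hi

# Simulated null genes (assumed): 4 vs 4 arrays, sigma^2 ~ chi-square(3)
random.seed(0)
data = []
for _ in range(500):
    sd = math.sqrt(random.gammavariate(1.5, 2.0))
    data.append(([random.gauss(0, sd) for _ in range(4)],
                 [random.gauss(0, sd) for _ in range(4)]))
envelope = critical_envelope(data)
flagged = sum(is_significant(*pooled_t(x1, x2), envelope) for x1, x2 in data)
print(f"{flagged} of {len(data)} null genes flagged at alpha = 0.05")
```

A smoother estimate of the quantile curves (for example a quantile smoother over s*) would be a natural refinement; fixed bins are used here only to keep the sketch short.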
18
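$\hat{F}_\sigma(t)$ is not a good estimator of $F_\sigma(t)$.
Let $\{X_{ij}\}$ be a sample from the model with $\sigma^2 \sim F_\sigma$
and let the variance obtained from the $\{X_{ij}\}$ be $s^2$.
Then $\mathrm{Var}(s^2) > \mathrm{Var}(\sigma^2)$.
For example, if we assume that $F_\sigma = \chi^2_3$, n = 4 and
$\epsilon \sim N(0,1)$, then $\mathrm{Var}(\sigma^2) = 6$ and $\mathrm{Var}(s^2) = 15$.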
Fix by target estimation: Method of moments.
Shrink towards the center
Roadblock
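The inflation of Var(s²) over Var(σ²) is easy to check by simulation. The sketch below assumes σ² ~ chi-square(3), normal errors, and s² computed as the ordinary variance of a single sample of size n = 4 (3 degrees of freedom). Under that assumption the law of total variance gives Var(s²) = 6 + (2/3)(15) = 16, so the simulated value will be near 16 and not exactly the slide's 15; the slide may rest on a slightly different setup.

```python
import random

random.seed(1)
N_SIM = 100000   # number of simulated genes (assumed)
N = 4            # sample size per gene, as in the slide's example

def sample_var(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# chi-square(3) is a gamma distribution with shape 1.5 and scale 2
sigma2 = [random.gammavariate(1.5, 2.0) for _ in range(N_SIM)]
s2 = []
for v in sigma2:
    sd = v ** 0.5
    s2.append(sample_var([random.gauss(0, sd) for _ in range(N)]))

var_sigma2 = sample_var(sigma2)
var_s2 = sample_var(s2)
print(f"Var(sigma^2) ~ {var_sigma2:.2f}, Var(s^2) ~ {var_s2:.2f}")
```

The observed variances are more spread out than the true ones, which is why the empirical distribution of the s_g² overstates the dispersion of F_σ and needs to be shrunk towards the center.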
19
Example: Checking for the distribution of $\sigma_g$
Mice Data (panel label: E7 Data)
Compare the distr. of sg vs simulation with: 1. $\sigma^2 \sim \chi^2_{0.5}$, 2. $\sigma^2 \sim \chi^2_{2}$, 3. $\sigma^2 \sim \chi^2_{6}$
[Figure: histograms of Sp (frequency) for the E7 data and for simulated Cases 1-3 (Df = 0.5, 2, 6); plots of simulated S1, S2, S3 against E7 for Cases 1-3]
20
Another Example
Tox Data
Compare the distr. of sg vs simulation with: 1. $\sigma^2 \sim \chi^2_{0.5}$, 2. $\sigma^2 \sim \chi^2_{3}$, 3. $\sigma^2 \sim \chi^2_{6}$
[Figure: histograms of Sp (frequency) for the Tox data and for simulated Cases 1-3 (Df = 0.5, 3, 6); plots of simulated S1, S2, S3 against Tox for Cases 1-3; plot of mean diff vs Sp]
21
Fixing the variance distribution
The idea is to estimate the function $h: [0,1] \to [0,1]$ defined by
$h(F_\sigma(x)) = F_s(x)$, where $F_s$ denotes the distribution of the observed variances $s_g^2$. Since h is strictly monotonic, it can be inverted
in order to obtain an estimate of $F_\sigma(x)$. Procedure:
(1) Assume that $\hat{F}_\sigma(x)$ is the true distribution of $\sigma^2$ and draw a random sample, $s^{*2}$, from $\hat{F}_\sigma$.
(2) Take a random sample (with replacement) of size N from $\hat{F}_\epsilon$: $r^*_{ij} \sim \hat{F}_\epsilon$ for $i = 1, \dots, n_j$, $j = 1, 2$.
(3) Combine these to form pseudo-data: $X^*_{ij} = s^* r^*_{ij}$
22
(4) Calculate the pooled standard error $s^{**}$ for the pseudo-data $\{X^*_{ij}\}$.
(5) Repeat steps (1)-(4) a large number (say 100,000) of times and record, for each iteration, the pair of values $\{(s^{*2}, s^{**2})\}$.
(6) Let $\hat{F}_*(x)$ be the empirical distribution of the $s^{**2}$'s. Then the estimator of h is obtained by mapping the empirical distribution $\hat{F}_\sigma$ into $\hat{F}_*$. More precisely, with $y = \hat{F}_\sigma(x)$,
$$\hat{h}(y) = \hat{F}_*(x) = \hat{F}_*(\hat{F}_\sigma^{-1}(y))$$
and
$$\hat{h}^{-1}(y) = \hat{F}_\sigma(\hat{F}_*^{-1}(y)).$$
Hence the bias-corrected estimator of $F_\sigma$ is:
$$\tilde{F}_\sigma(x) = \hat{F}_\sigma(\hat{F}_*^{-1}(\hat{F}_\sigma(x))).$$
Fixing the variance distribution (contd)
Proceed as before …
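One way to read the correction is as a quantile mapping: an observed variance x is replaced by the quantile of the observed variances at level F*(x), where F* is the empirical distribution of the re-simulated s**². The sketch below follows steps (1)-(6) under assumptions that are not in the slides: a pool of N(0,1) draws stands in for the residual edf, the design is 4 vs 4, and the observed variances are simulated with σ² ~ chi-square(3).

```python
import bisect
import math
import random

def pooled_var(x1, x2):
    """Pooled variance of two groups."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss = sum((x - m1) ** 2 for x in x1) + sum((x - m2) ** 2 for x in x2)
    return ss / (n1 + n2 - 2)

def corrected_variances(s2_obs, residuals, n1, n2, n_sim=20000, rng=random):
    """Map each observed s^2 to the observed-variance quantile at level F*(s^2)."""
    obs_sorted = sorted(s2_obs)                              # edf of observed s^2
    star2 = []
    for _ in range(n_sim):                                   # step (5)
        s_star = math.sqrt(rng.choice(s2_obs))               # step (1)
        r = [rng.choice(residuals) for _ in range(n1 + n2)]  # step (2)
        x = [s_star * rij for rij in r]                      # step (3)
        star2.append(pooled_var(x[:n1], x[n1:]))             # step (4): s**^2
    star2.sort()                                             # step (6): edf F*
    m = len(obs_sorted)
    out = []
    for x in s2_obs:
        u = bisect.bisect_right(star2, x) / n_sim            # F*(x)
        idx = min(m - 1, max(0, math.ceil(u * m) - 1))
        out.append(obs_sorted[idx])                          # observed quantile at u
    return out

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Simulated input (assumed): sigma^2 ~ chi-square(3), pooled s^2 on 6 df
random.seed(0)
sigma2 = [random.gammavariate(1.5, 2.0) for _ in range(2000)]
s2_obs = [v * random.gammavariate(3.0, 2.0) / 6 for v in sigma2]
residuals = [random.gauss(0, 1) for _ in range(5000)]        # stand-in residual pool
s2_corr = corrected_variances(s2_obs, residuals, 4, 4)
print(f"Var(observed s^2)  = {variance(s2_obs):.2f}")
print(f"Var(corrected s^2) = {variance(s2_corr):.2f}")
```

Because the re-simulated variances are more dispersed than the observed ones, large observed values are mapped downwards and small ones upwards, which is the shrinkage towards the center mentioned on the Roadblock slide.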
23
[Figure: plot of t vs sp; panel labels 191 and 22092]
Differentially expressed genes may have large sp
24
500 Simulation: 1000 genes, 4 controls + 4 treatments, iid Normal(0, σ²)
100 genes are differentially expressed with mean diff = +1 or -1

σ² = 1, constant:
          False Discoveries   True Discoveries
T-test           44                  22
z-test           43                  29
C-t              45                  30

σ² from Chi-square(df=3):
          False Discoveries   True Discoveries
T-test           43                  28
z-test           53                  13
C-t              42                  38
25
Using 8 iid samples from Khan Data, we make changes to 50 genes to make them differentially expressed for high level.
[Figure: results for T-test, SAM and Ct]
26
To generate p-values, recall that the Ct procedure generates curves, $c_\alpha(s)$. Start with a set of curves,
$c_{\alpha_1}(s_g), \dots, c_{\alpha_k}(s_g)$, for a set of
prespecified values, $\alpha_1, \dots, \alpha_k$.
Now consider the relationship between $v_i = \log(-\log(\alpha_i))$ and $u_i = \log(c_{\alpha_i}(s_g))$.
To assign an approximate p-value to the gth gene, if $|t_g| \le c_{\alpha_k}(s_g)$,
interpolate the relationship between the $\{u_i\}$ and the $\{v_i\}$.
Generating p-values
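The interpolation can be sketched as below, assuming linear interpolation of v against u and assuming that a statistic outside the range of the curves is simply assigned the nearest prespecified level as a bound (the slide does not say what happens there). The levels and critical values in the example are hypothetical.

```python
import math

def approx_p_value(t_abs, alphas, crit_values):
    """Interpolate v = log(-log(alpha)) linearly in u = log(c_alpha(s_g)) at u = log|t|.

    alphas: prespecified levels; crit_values: c_alpha(s_g) for this gene's s_g,
    in the same order as alphas.
    """
    pts = sorted((math.log(c), math.log(-math.log(a)))
                 for a, c in zip(alphas, crit_values))
    u = math.log(t_abs)
    if u <= pts[0][0]:
        v = pts[0][1]        # below the lowest curve: largest alpha as a bound
    elif u >= pts[-1][0]:
        v = pts[-1][1]       # above the highest curve: smallest alpha as a bound
    else:
        for (u0, v0), (u1, v1) in zip(pts, pts[1:]):
            if u0 <= u <= u1:
                v = v0 + (v1 - v0) * (u - u0) / (u1 - u0)
                break
    return math.exp(-math.exp(v))   # invert v = log(-log(p))

# Hypothetical curves evaluated at one gene's s_g
alphas = [0.10, 0.05, 0.01, 0.001]
crit = [2.1, 2.6, 4.0, 6.5]
p_mid = approx_p_value(3.0, alphas, crit)
print(f"approximate p-value for |t| = 3.0: {p_mid:.4f}")
```

Working on the log(-log) scale keeps the interpolated p-value strictly between 0 and 1 whatever the interpolated v turns out to be.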
27
Extensions
F test:
- Condition on the sqrt(MSE)
Multiple comparisons:
- Tukey, Dunnett, Bump.
- Condition on the sqrt(MSE)
Gene Ontology:
- Test for the significance of groups.
- Use Hypergeometric Statistic, mean t, mean p-value, or other.
- Condition on log of the number of genes per group
28
Conditional F
[Figure: Sqrt(F) plotted against Sqrt(MSE)]
29
GO Ontology: Conditioning on log(n)
[Figure: |T| plotted against Sd; Abs(T) plotted against Log(n)]
30
The Details:
Reference: Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, Jan 2004. Amaratunga, Cabrera.
Email: cabrera@stat.rutgers.edu, damaratu@prdus.jnj.com
Webpage for DNAMR and DNAMRweb: http://www.rci.rutgers.edu/~cabrera/DNAMR
31
Target Estimation:
Cabrera, Fernholz (1999)
- Bias Reduction.
- MSE reduction.
Recent Applications:
- Ellipse Estimation (Multivariate Target).
- Logistic Regression:
• Cabrera, Fernholz, Devas (2003)
• Patel (2003) Target Conditional MLE (TCMLE)
Implementation in StatXact (CYTEL) and
logXact Proc’s in SAS(by CYTEL).
Target Estimation
32
Target Estimation
An estimator $T(x_1, x_2, \dots, x_n)$ has expectation $E_\theta(T)$; write $E_\theta(T) = g(\theta)$.
33
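Target Estimation:
Suppose we have an estimator $\hat\theta = T(x_1, \dots, x_n)$ of a parameter $\theta$.
Target estimator $\tilde\theta$: Solve $E_{\tilde\theta}(T) = \hat\theta$.
If $h(\theta) = E_\theta(T)$ then $\tilde\theta = h^{-1}(\hat\theta)$.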
Algorithms: - Stochastic approximation.
- Simulation and iteration.
- Exact algorithm for TCMLE
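As a worked illustration (a standard textbook case chosen for this note, not taken from the slides), consider the variance estimator with divisor n for an iid sample with variance σ². Its expectation is known in closed form, so the target equation can be solved exactly and no simulation is needed.

```latex
T(x_1,\dots,x_n) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2, \qquad
h(\sigma^2) = E_{\sigma^2}(T) = \frac{n-1}{n}\,\sigma^2 .
% Solving h(\tilde\sigma^2) = T for the target estimator:
\tilde{\sigma}^2 = h^{-1}(T) = \frac{n}{n-1}\,T
                 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 .
```

In this case the target estimator coincides with the usual unbiased sample variance. When h has no closed form, it has to be approximated, which is where the simulation-based algorithms listed above come in.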