Theoretical and experimental comparisons of gene expression indexes for oligonucleotide microarrays
Division of Human Cancer Genetics, Ohio State University
William J. Lemon, Jeffrey J.T. Palatini, Ralf Krahe, Fred A. Wright
Measuring gene expression with the Affymetrix GeneChip
Perfect Match (PM): 25 bases complementary to a region of the gene
Mismatch (MM): middle base is different
[Figure: probe pairs along the coding portion of gene X, followed by polyA]
•cRNA from sample mRNA is put on the chip
•intensity of binding reflects gene expression
Reproducibility of Probe Sensitivities
Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.
The Li-Wong Model
Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.
Li-Wong Full (LWF)
$$MM_{ij} = \nu_j + \theta_i \alpha_j + \epsilon$$
$$PM_{ij} = \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)$$
Li-Wong Reduced (LWR)
$$y_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, 2\sigma^2)$$
Identifiability constraint
$$\sum_j \phi_j^2 = J$$
i: ith array; j: jth probe pair; J: total no. probe pairs
$\theta_i$: expression; $\phi_j$: sensitivities
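A minimal sketch of how the reduced model (probe-pair differences modeled as array expression times probe sensitivity, with the squared sensitivities summing to J) could be fitted by alternating least squares. The function name `fit_lwr`, the starting values and the fixed iteration count are assumptions made for illustration, and the sketch omits any outlier handling.

```python
import math

def fit_lwr(y, n_iter=50):
    """Alternating least squares for y[i][j] ~ theta[i] * phi[j].

    y is a list of rows (one per array), each a list of J PM-MM differences.
    The constraint sum(phi_j^2) = J is re-imposed after every phi update.
    """
    n_arrays, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes  # assumed starting guess: equal sensitivities
    for _ in range(n_iter):
        # Given phi, each theta_i is a least-squares slope through the origin.
        ss_phi = sum(p * p for p in phi)
        theta = [sum(row[j] * phi[j] for j in range(n_probes)) / ss_phi
                 for row in y]
        # Given theta, each phi_j is a least-squares slope through the origin.
        ss_theta = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(n_arrays)) / ss_theta
               for j in range(n_probes)]
        # Rescale so that sum(phi_j^2) = J (identifiability constraint).
        scale = math.sqrt(n_probes / sum(p * p for p in phi))
        phi = [p * scale for p in phi]
    # Final theta update so that theta matches the rescaled phi.
    ss_phi = sum(p * p for p in phi)
    theta = [sum(row[j] * phi[j] for j in range(n_probes)) / ss_phi
             for row in y]
    return theta, phi
```

On noise-free data built from known expression values and sensitivities, this recovers both sets of parameters; with real data the estimates depend on the noise and on the number of arrays.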
How to compare gene expression indexes?
•We get maximum likelihood estimates for $\theta$ using either the full data (LWF) or the reduced data (LWR)
•The Affymetrix software computes:
Average Difference (AD): $AD = \sum_j y_j / J$
Log-Average (LA): $LA = 10 \sum_j \log(PM_j / MM_j) / J$
•We gain insight by assuming the Li-Wong model is true. Then what are the consequences?
•For large sample sizes, the probe-specific parameters will be well estimated
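As an illustration, the two Affymetrix-style summaries for one probe set on one array can be sketched as below. The function names are hypothetical, the base-10 logarithm is an assumption (the slide writes only "log"), and no trimming of outlying probe pairs is applied.

```python
import math

def average_difference(pm, mm):
    """Untrimmed AD: mean of PM - MM over the J probe pairs."""
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

def log_average(pm, mm):
    """Untrimmed LA: 10 times the mean of log10(PM / MM) over the J probe pairs.

    Assumes every PM and MM intensity is strictly positive.
    """
    return 10.0 * sum(math.log10(p / m) for p, m in zip(pm, mm)) / len(pm)
```

For example, `average_difference([200, 400], [100, 100])` averages the differences 100 and 300.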
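Compare LW estimators directly:
$$RE(\text{full}, \text{reduced}) = \frac{\mathrm{var}(\hat\theta_{\text{reduced}})}{\mathrm{var}(\hat\theta_{\text{full}})} = \frac{2 \sum_j \left[ \alpha_j^2 + (\alpha_j + \phi_j)^2 \right]}{J} \ge 2.0$$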
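Comparing to AD is tricky, but with a correction factor AD is also an unbiased estimate of $\theta$:
$$\hat{\hat\theta} = \frac{J \cdot AD}{\sum_j \hat\phi_j}$$
$$RE(\text{reduced}, AD) = \frac{\mathrm{var}(\hat{\hat\theta})}{\mathrm{var}(\hat\theta_{\text{reduced}})} = \frac{1}{1 - \mathrm{var}(\phi)} \ge 1.0$$
where $\mathrm{var}(\phi)$ is the variance of the $\phi_j$ across the J probe pairs.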
•This also gives insight into “perfect match only” analyses:
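$$RE(\text{full}, \text{PM-only}) = \frac{\mathrm{var}(\hat\theta_{PM})}{\mathrm{var}(\hat\theta_{\text{full}})} = 1 + \frac{\sum_j \alpha_j^2}{\sum_j (\alpha_j + \phi_j)^2}, \quad \text{and } 1 \le RE \le 2$$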
Furthermore, PM-only is always at least twice as efficient as LWR
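The theoretical comparisons can be checked numerically. This sketch assumes the probe parameters are known, the errors are independent with a common variance, and the alpha and phi values are non-negative. Under those assumptions each estimator's variance is proportional to the expressions in the comments, so the common variance cancels in every ratio. The function name and the dictionary keys are hypothetical.

```python
def theoretical_efficiencies(alpha, phi):
    """Relative efficiencies implied by the Li-Wong model with known probe terms.

    alpha, phi: per-probe-pair parameters (lists of equal length J).
    Each RE is a ratio of estimator variances; values above 1 favor the
    estimator named first in the key.
    """
    J = len(phi)
    sum_phi = sum(phi)
    sum_phi_sq = sum(p * p for p in phi)
    sum_alpha_sq = sum(a * a for a in alpha)
    sum_pm_sq = sum((a + p) ** 2 for a, p in zip(alpha, phi))

    var_full = 1.0 / (sum_alpha_sq + sum_pm_sq)  # uses both PM and MM
    var_pm_only = 1.0 / sum_pm_sq                # uses PM alone
    var_reduced = 2.0 / sum_phi_sq               # PM - MM has twice the variance
    var_ad = 2.0 * J / sum_phi ** 2              # AD rescaled to be unbiased

    return {
        "full_vs_reduced": var_reduced / var_full,
        "reduced_vs_AD": var_ad / var_reduced,
        "full_vs_PM_only": var_pm_only / var_full,
        "PM_only_vs_reduced": var_reduced / var_pm_only,
    }
```

When the phi values satisfy the identifiability constraint (squares summing to J), these ratios reduce to the closed forms on the slides.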
Empirical Comparisons
•We propose that an expression index is “good” if it has a high correlation with the underlying true expression (which is usually unknown).
•this correlation can be estimated using a specially designed mixing experiment
•if r is the correlation coefficient between the measured index and true expression, the “relative efficiency” of two indexes (labeled 1 and 2 here) can be estimated as
$$\frac{r_1^2 / (1 - r_1^2)}{r_2^2 / (1 - r_2^2)}$$
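A small sketch of the empirical relative efficiency, computed from the correlation of each index with true expression. The function name is hypothetical and the correlations in the usage note are made-up illustrative numbers, not results from the experiment.

```python
def relative_efficiency(r1, r2):
    """Ratio of r^2 / (1 - r^2) for index 1 over index 2.

    r1, r2: correlations of each index with true expression, with |r| < 1.
    Values above 1 favor index 1.
    """
    signal_to_noise_1 = r1 ** 2 / (1.0 - r1 ** 2)
    signal_to_noise_2 = r2 ** 2 / (1.0 - r2 ** 2)
    return signal_to_noise_1 / signal_to_noise_2
```

For instance, `relative_efficiency(0.9, 0.8)` compares 0.81/0.19 with 0.64/0.36, so a modest gain in correlation near 1 translates into a large gain in efficiency.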
Experimental Design
[Flow diagram; labels as extracted]
Cell culture: human fibroblasts (GM 08330), 20% FBS, 5 passages
Serum starvation: 0.1% FBS
Serum stimulation: 0.1% FBS (Starved) or 20% FBS (Stimulated)
Timing labels: 48h, 24h
RNA extraction: harvest total RNA
Add bacterial control genes to the Stimulated and Starved RNA (spike sets: Lys, Phe and Dap, Thr)
Produce 50:50 group (50:50 mix; Dap, Thr, Lys, Phe)
Produce duplicates each day for 3 d (6 replicates for each condition)
Synthesize cDNA, cRNA; fragment
Add hybridization control genes: BioB, BioC, BioD, Cre
Hybridize HuGeneFL
Data reduction: gene expression indexes
Mean probe intensity per array
[Figure: mean probe intensity per array by group: Stim, 50:50, Starved]
Overall intensity higher in Stimulated
BIN1 expression
[Figure: $\hat\theta_{\text{full}}$ for BIN1 by group: Stim, 50:50, Starved]
True expression = average of Stim, Starved
Coefficients of variation for assay (individual probes) and gene expression indexes
[Figure: three histograms of CV (x-axis 0.0 to 2.0), each annotated with a value: Assay Stim (y-axis: # probes), 0.121; LWF Stim (y-axis: # genes), 0.149; Affymetrix AD Stim (y-axis: # genes), 0.293]
Correlation matrix of 18 arrays as a colorized image for each expression index.
[Figure: four panels (LWF, AD, LWR, LA); rows and columns grouped as Stim, 50:50, Starved]
Comparing Models: Cluster Analysis
[Figure: cluster analysis of the 18 arrays, one panel per index; panel titles: Affymetrix Log Ave, Full Model, Reduced Model, Affymetrix Ave Diff]
Leaf orders as extracted, one line per panel (the panel-to-title assignment is not recoverable):
Strv 1, Strv 4, Strv 2, Strv 5, Strv 3, Strv 6, 50:50 3, 50:50 5, 50:50 4, 50:50 2, 50:50 1, 50:50 6, Stim 4, Stim 6, Stim 5, Stim 3, Stim 1, Stim 2
Stim 2, Strv 1, Strv 3, Strv 2, Strv 6, Strv 5, Strv 4, Stim 1, Stim 6, Stim 3, Stim 5, Stim 4, 50:50 5, 50:50 4, 50:50 3, 50:50 2, 50:50 1, 50:50 6
Strv 3, Strv 4, Strv 6, Strv 5, Strv 2, Strv 1, Stim 2, Stim 1, Stim 4, Stim 5, Stim 6, Stim 3, 50:50 5, 50:50 4, 50:50 2, 50:50 1, 50:50 6, 50:50 3
Strv 2, Strv 3, Strv 1, Strv 6, Strv 5, Strv 4, Stim 2, Stim 4, 50:50 1, Stim 1, Stim 6, Stim 3, Stim 5, 50:50 3, 50:50 5, 50:50 4, 50:50 2, 50:50 6
Relative Efficiency
[Figure: bar charts of Median(r^2/(1-r^2)) for LWF, LWR, AD, LA; panels: Unscaled, Scaled; y-axis 0.0 to 1.5]
Correlation of duplicate measurements of 149 genes
LWF median r=.74
LWR median r=.43
LWF median r=.08
LWF median r=.17
Number of unexpressed genes
•Only 0.2% of the LW estimates are negative
•the 50:50 group has the fewest negative estimates
•could this indicate very few unexpressed genes?
[Figure, by group: Stim, 50:50, Starved]
A conservative approach to estimating number of unexpressed genes
•Let U denote number of unexpressed genes
•genes are ranked according to expression index
$$U \le 2 \times \mathrm{median}(\text{rank of } U \text{ genes among all genes})$$
•This is useful if we can get a random sample of unexpressed genes
[Figure: distribution of the gene expression index, with the unexpressed population marked]
•We use the spiked-out bacterial control genes as a sample of “unexpressed” genes
•the 4 genes are represented 3 times each (different portions of mRNA), for a total of 12 probe sets
•Based on this reasoning, we estimate that greater than 88% of the genes are expressed, even in the Starved samples
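The rank argument can be sketched as follows: rank all genes by expression index (rank 1 = lowest), take the median rank of a sample of known unexpressed genes, and double it to bound the number of unexpressed genes. The function name and arguments are hypothetical, and ties are broken by position in the list, which is an arbitrary choice.

```python
import statistics

def unexpressed_upper_bound(index_values, control_positions):
    """Conservative bound on the number of unexpressed genes.

    index_values: expression index for every gene on the array.
    control_positions: list positions of genes known to be unexpressed.
    Returns (upper bound on U, lower bound on the fraction expressed).
    """
    n = len(index_values)
    # Rank genes from lowest (rank 1) to highest (rank n) index value.
    order = sorted(range(n), key=lambda k: index_values[k])
    rank = [0] * n
    for r, k in enumerate(order, start=1):
        rank[k] = r
    median_rank = statistics.median(rank[k] for k in control_positions)
    bound = min(n, 2 * median_rank)  # U cannot exceed the number of genes
    return bound, 1 - bound / n
```

With 100 genes and six controls occupying the six lowest ranks, the median control rank is 3.5, so the bound is 7 unexpressed genes and at least 93% of genes are counted as expressed.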
Rank of expression index variance across the 6 Stimulated arrays versus rank of index mean
[Figure: two scatter plots of Rank(var) versus Rank(mean), both axes 0 to 6000; panels: AD, LWF; legend: Dap, Thr, Phe, Lys; annotation: "Truly absent in stim group"]
Very low estimated expression for truly absent genes when using LWF
Present/absent calls
•We use the statistic $z = \hat\theta / SE(\hat\theta)$ to declare genes present/absent (absolute call)
•we find the vast majority of genes on the array appear to be present
•for the spiked in/out genes, we find vastly improved present/absent calling using LW estimates
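A minimal sketch of a present/absent call based on the ratio of the estimated expression to its standard error. The function name and the default cutoff of 2.0 are assumptions for illustration; the slides do not state which cutoff was used.

```python
def present_calls(theta_hat, standard_errors, threshold=2.0):
    """Return (z statistics, present flags) for a list of genes.

    theta_hat: estimated expression per gene.
    standard_errors: standard error of each estimate (all positive).
    threshold: hypothetical cutoff; a gene is called present when z exceeds it.
    """
    z = [t / se for t, se in zip(theta_hat, standard_errors)]
    present = [value > threshold for value in z]
    return z, present
```

Raising or lowering the cutoff trades false positives against false negatives, which is what an ROC curve summarizes.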
ROC curve - spiked in/out genes
[Figure: x-axis: False Positive Rate (1 - Specificity), y-axis: 1 - False Negative Rate (Sensitivity), both 0.0 to 1.0; curves: LWF-Z, LWR-Z, Untrimmed AD, Untrimmed LA, LA, AD, Absolute Call]
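For genes whose true status is known (as with spiked in/out controls), an ROC curve can be traced by sweeping the call threshold over the observed scores. The sketch below is a generic construction, not the authors' code. The function names are hypothetical and tied scores are not given special treatment.

```python
def roc_points(scores, is_present):
    """Return ROC points (false positive rate, sensitivity), starting at (0, 0).

    scores: call statistic per gene (higher means more likely present).
    is_present: True for genes that are truly present, False otherwise.
    Requires at least one present and one absent gene.
    """
    n_pos = sum(1 for flag in is_present if flag)
    n_neg = len(is_present) - n_pos
    # Lower the threshold one gene at a time, from highest score to lowest.
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for k in order:
        if is_present[k]:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a list of ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A statistic that ranks every present gene above every absent gene gives an area of 1; a statistic unrelated to true status gives an area near 0.5.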
Variability in estimates
[Figure: log(variance) versus log(mean); panels: Full Model, Reduced Model; groups: Stim, 50:50, Starved]
Conclusions
• Model-based estimators are superior to simple averaging
• we have demonstrated this using both analytic considerations and experimental data
• a carefully designed experiment can be used to address many issues
• Many more genes may be expressed than previously thought
Other issues/ future work
•Spiking genes might be used to calibrate and normalize arrays
•relationship between variance and mean of expression indexes may be useful in planning experiments
•our data may be useful for future work, especially in producing indexes that are resistant to probe saturation
•all primary data, this Powerpoint presentation and a preprint are available at http://thinker.med.ohio-state.edu