a statistical framework for expression trait loci …kendzior/forjd/prelim/mc...a statistical...
TRANSCRIPT
A Statistical Framework for Expression Trait Loci (ETL) Mapping
Meng Chen
Prelim Paper in partial fulfillment of the requirements
for the Ph.D. program
in the
Department of Statistics
University of Wisconsin-Madison
Committee Members:Professor Christina KendziorskiProfessor Alan AttieProfessor Michael NewtonProfessor Brian Yandell
1
Contents
1 Introduction 2
2 ETL mapping experiments 3
3 QTL Mapping Methods 5
3.1 Single Phenotype - Single QTL Models . . . . . . . . . . . . . . . . . . . . . . . 5
3.2 Single Phenotype - Multiple QTL Models . . . . . . . . . . . . . . . . . . . . . . 7
3.3 Multiple Phenotype - Single or Multiple QTL Models . . . . . . . . . . . . . . . . 8
4 ETL Mapping Methods 9
4.1 Transcript Based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.2 Transcript Based Approach with FDR control . . . . . . . . . . . . . . . . . . . . 9
4.3 Marker Based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.4 Mixture Over Markers Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.5 Current Status of ETL Mapping Methods . . . . . . . . . . . . . . . . . . . . . . 11
5 Research Plan 13
5.1 ETL Interval Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5.1.1 Pseudomarker-MOM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.1.2 Two-Stage Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.1.3 Theoretical Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.1.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.2 Multiple ETL mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.2.1 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6 Future Research Questions 20
References 23
Appendix 27
1
1 Introduction
Identifying the genetic loci responsible for variation in quantitative traits is of great importance to
biologists. Although quantitative trait loci (QTL) mapping studies have been going on for over 80
years starting with Sax in 1923 (Sax 1923; Rasmusson 1933; Thoday 1961), where he proposed
that the association between seed weight and seed coat color in beans was due to the linkage
between the genes controlling weight and the genes controlling color, the vast majority of studies
have taken place in the last 20 years. The increased rate was due largely to two major advances in
the 1980s: the advent of restriction fragment length polymorphisms (RFLPs) (Botstein et al. 1980)
so that it’s possible to genotype markers on a large scale and the advent of statistical methods for
data analysis (Lander and Botstein 1989).
A recent advance of comparable significance has been made in the area of phenotyping.
With high throughput technologies now widely available, investigators can measure thousands
of phenotypes at once. Gene expression measurements are particularly amenable to QTL mapping
and much excitement abounds for this field of “genetical genomics” (Jansen and Nap 2001; Jansen
2003; Cox 2004; Broman 2005).
The so called expression QTL (eQTL) or expression trait loci (ETL) studies have been used to
identify candidate genes (Dumas et al. 2000; Eaves et al. 2002; Karp et al. 2000; Wayne et al.
2003; Schadt et al. 2003; Brstrykh et al. 2005; Hubner et al. 2005), to infer not only correlative
but also causal relationships among modulator and modulated genes (Brem et al. 2002; Schadt et
al. 2003; Yvert et al. 2003), to better define traditional phenotypes (Schadt et al. 2003), and to
serve as a bridge between genetic variation and the traditional complex traits of interest (Schadt et
al. 2003).
Although successful in many ways, the results obtained from ETL studies to date are limited.
In the early published studies, the ETL mapping problem had been addressed by treating each
transcript separately as a phenotype for QTL mapping. Single trait QTL analysis was then carried
out thousands of times (Brem et al. 2002; Schadt et al. 2003; Yvert et al. 2003). Notably, although
adjustments were made for multiple tests across the genome, no adjustments had been considered
for multiple tests across transcripts. There are hundreds of test locations across the genome but
tens of thousands of transcripts leading to a potentially serious multiple testing problem and an
inflated false discovery rate (FDR). For some labs, an inflated FDR is tolerable as many genes can
be tested quickly for certain properties and discarded if found to be false positives. However, for
many labs, such tests are prohibitively expensive. Statistical methods that control error rates and
2
that are more sensitive and more specific are needed. In a few recent studies, there has been some
effort in attempting to account for both sets of multiplicities (Chesler et al. 2005; Hubner et al.
2005; Bystrykh et al. 2005). Permutation tests were performed to derive the genome-wide LOD
score threshold and q-values were computed for the set of transcripts declared significant using the
corresponding genome-wide empirical p-values. As discussed in Section 4.2, this last approach
may not properly control FDR and may suffer from very low power.
The main aim of the proposed thesis is to develop a statistical framework for ETL mapping
that properly accounts for multiplicities while maintaining or improving upon the operating
characteristics of currently used approaches. Section 2 provides a brief background on the
questions addressed and data collected in ETL mapping experiments. Statistical methods for ETL
mapping are reviewed in Section 4. As discussed there, the mixture over markers (MOM) model is
the only statistically rigorous ETL mapping method developed to account for multiplicities across
both markers and transcripts. However, MOM has a number of shortcomings. For experiments
with sparse maps, the MOM model is lacking as information between markers is not available.
When dense maps are available, the MOM model by itself may not be applicable as the number of
mixture components is too big to fit. Finally, MOM does not allow for multiple ETL. Statistical
methods to address each of these shortcomings are detailed in Section 5 and preliminary results
are demonstrated on simulated data and data from a study of diabetes in mouse. Future directions
are discussed in Section 6.
2 ETL mapping experiments
The general data collected in an ETL mapping experiment consists of a genetic map, marker
genotypes, and microarray data (phenotypes) collected on a set of individuals. A genetic marker is
a region of the genome of known location. These locations make up the genetic map. The distance
between markers is given by genetic distance, in the unit of centimorgan (cM). It is defined as the
expected percentage of crossovers between two loci during meiosis. At each marker, genotypes
are obtained. ETL mapping studies take place in both human and experimental populations. We
focus on the latter. For these populations, the possibilities of marker genotypes are simplified.
For example, studies with experimental populations most often involve arranging a cross
between two inbred strains differing substantially in some trait of interest to produce F1 offspring.
Segregating progeny are then typically derived from a B1 backcross (F1 x Parent) or an F2
3
intercross (F1 x F1). Repeated intercrossing (Fn x Fn) can also be done to generate so-called
recombinant inbred (RI) lines. For simplicity of notation, we focus on a backcross population.
Consider two inbred parental populations P1 and P2, genotyped as AA and aa, respectively, at M
markers. The offspring of the first generation (F1) have genotype Aa at each marker (allele A from
parent P1 and a from parent P2). In a backcross, the F1 offspring are crossed back to a parental
line, say P1, resulting in a population with genotypes AA or Aa at a given marker. We denote AA
by 0 and Aa by 1.
For each member of the backcross population, phenotypes are collected via microarrays.
Microarrays allow us to snapshot the expressions of thousands of genes at the same time. The
oligonucleotide and cDNA microarrays are the two types of technology that are most widely used.
A nice review of the microarray technologies can be found in Nguyen et al. (2002). We present a
very brief, by no means complete review here.
Affymetrix is one company that produces oligonucleotide chips which contain tens of
thousands of probe sets, or DNA sequences related to a gene. We will refer to these sequences
throughout this paper as “transcripts”. Each gene is represented by some number (usually 11-
20) of features. Many Affymetrix arrays use 20 features (Nguyen et al. 2002). Each feature is
a short sequence of oligonucleotides. Present in the features are pairs of perfect match (PM) and
mismatch (MM) sequences. The PM is a piece of gene, 25 nucleotides in length; the corresponding
MM is identical to PM except for the middle (i.e., 13th) position. After some pre-processing and
normalization, one summary score is derived for each probe set. There are a number of methods
for processing the probe set intensities and for normalization, such as DNA chip Analyzer by Li
and Wong (2001), and Robust Multi-array Analysis (RMA) by Irizarry et al. (2003).
In a cDNA array experiment, a gene is represented by a long cDNA fragment (500 to 1000
bases). The experimental sample of interest is often labeled with a red fluorescent dye, and a
reference sample is labeled green. The amount of cDNA hybridized to each probe can be captured
through some imaging device which measures the amount of the fluorescent intensity. Image files
are processed to give a summary expression score, log2(R/G). Yang et al. (2002) propose methods
for cDNA array data normalization and compare their methods with a number of other approaches.
With proper pre-processing and normalization, from either technology we obtain a single summary
score of expression for each transcript on each array.
4
3 QTL Mapping Methods
As noted above, ETL studies are very similar to QTL studies, but with thousands of phenotypes.
Perhaps it is not surprising then that early ETL studies repeatedly applied methods developed for
QTL mapping to each transcript. The literature on QTL mapping methods is quite large. We here
review only those methods relevant to this proposal and refer the interested reader to Doerge et al.
(1997) or Lynch and Walsh (1998) for more information on QTL mapping methods.
3.1 Single Phenotype - Single QTL Models
Consider a backcross with n progeny with univariate phenotypes yj measured on all the individuals,
j = 1, . . . , n, together with genotypes for a set of M markers. Let mij = 0 or 1 according to
whether the individual j has genotype AA or Aa at the ith marker, i = 1, . . . , M , The simplest
method to test for trait-marker association is marker regression (MR), to test mean trait value
differences between different marker groups for a particular marker. Specifically, for a test at the
ith marker, the single QTL model is:
yj = µ + βi mij + εj (3.1)
where εj are independent and identically distributed (iid) as Normal(0, σ2) and one can test
H0 : βi = 0 vs. H1 : βi 6= 0
or equivalently H0 : µ0 = µ1 vs. µ0 6= µ1 where µ0 = µ and µ1 = µ + βi.
This is equivalent to an analysis of variance (ANOVA) at each marker (Soller et al. 1976)
when there are more than two marker genotype groups (a t-test for two genotype groups as in a
backcross). Usually, instead of F or t-statistics, geneticists prefer to report a LOD score, which
is defined as the (base 10) log-likelihood ratio comparing the two hypotheses. A LOD score is
calculated at each marker position, and marker loci giving significant LOD scores are identified
as putative QTLs. For these putative QTLs, we loosely say the phenotype is “linked” to them.
This approach is conceptually very simple, and clearly there are problems with it (Lander and
Botstein 1989). First, if the true QTL is not located exactly at the marker, its effect will likely be
underestimated because of recombination between the marker and the true QTL. Second, because
of the confounding effects, the power for QTL detection will decrease, especially when the markers
are widely spread, requiring more individuals for the test. Third, this approach considers one
5
marker loci at a time, which is not very powerful comparing with multiple QTL models in the
presence of more than one QTL.
Lander and Botstein (1989) proposed interval mapping (IM) which addresses the first two
problems above. Their approach also assumes a single segregating QTL, but allows for tests
between markers where genotypes are not known. Specifically, for a backcross, the proposed
model to test for a QTL in the marker i and i + 1 interval is
yj = µ + β∗ m∗kj + ej (3.2)
where m∗k is the genotype at the test position between marker i and i + 1. It takes value 0 or 1 with
probability depending on the genotypes of the flanking markers and the test position (see Table 1).
β∗ is the effect of the putative QTL. Technically, (3.2) is a mixture of two normal distributions,
since p(yj) = p(m∗j = 0|m, r)p(yj|m
∗kj = 0) + p(m∗
j = 1|m, r)p(yj|m∗kj = 1) (here, r denotes
flanking marker distances). As in MR, tests are done at each location to test
H0 : β∗ = 0 vs. H1 : β∗ 6= 0
The test compares the hypothesis of a single QTL at the current locus to the null hypothesis of
no QTL. The two likelihood functions must be maximized over their respective parameters. The
procedure described above is repeated for each locus in the genome. In practice, test loci are set
up every 1cM or some other user-defined distance. The likelihood under the alternative varies with
the test locus, so the EM algorithm (Dempster et al. 1977) must be applied at each locus. A LOD
score profile can be constructed by plotting the LOD scores against the test positions.
The LOD score is then compared to a genome-wide threshold. Whenever the LOD profile
exceeds such threshold, we infer there exists a QTL. Generally, the genome-wide threshold is
obtained using the 95th percentile of the distribution of the maximum (genome-wide) LOD score,
under the null hypothesis of no segregating QTLs. Much effort has been expended to derive the
appropriate genome-wide LOD score cutoff value (Churchill and Doerge 1994; Dupuis etl al.
1995; Dupuis and Siegmund 1999; Feingold et al. 1993; Lander and Botstein 1989; Rebai et al.
1994; Rebai et al. 1995).
The major advantage of IM over MR at marker loci is that it gives more precise estimates of the
QTL locations and effects. However, the computational cost involved in IM is bigger, and in the
case of dense genetic markers and complete genotype data, the advantage tends to be very little.
More importantly, IM, like MR at marker loci, is still a single-QTL model.
6
3.2 Single Phenotype - Multiple QTL Models
The principal reasons for modelling multiple QTls are to increase sensitivity and to achieve better
separation of linked QTLs. Also, epistasis (i.e., interactions between alleles at different QTLs) can
only be identified through multiple-QTL models. When a single QTL model is used when really
there are multiple QTLs, the genetic variation due to other segregating QTLs is incorporated into
the environmental variation. This reduces sensitivity.
A straight-forward extension of MR to model multiple QTLs is multiple regression, which
includes a number of different markers in the model, rather than looking at them one at a time. Let
mij = 0 or 1, according to whether individual j has genotype Aa or AA at marker i. The model
becomes
yj = µ +∑
i∈S
βi mij + ej (3.3)
where S is the set of markers for which βi is not 0. To implement this, one must find a way to
search through the model space to find such a set of S. As the number of markers gets larger,
it would be impossible to consider every possible model in the model space. In addition, there
remains a question of how to choose between candidate models; some form of criterion is needed.
Broman (2002) looked at this problem in a model selection framework.
Direct use of the multiple regression analysis is not easy. Moreover, Zeng (1993) showed
that the partial regression coefficient is generally a biased estimate of the relevant QTL effect.
An approach for multiple QTL mapping that combines ideas from IM and multiple regression is
composite IM (CIM) (Zeng 1993, 1994). The method attempts to reduce the multidimensional
search for QTLs to a series of one-dimensional searches. It conditions on markers outside the
region of interest while performing IM to control for the effects of QTLs in other intervals. That
way, there will be better power for QTL detection and also the QTL effects can be estimated more
accurately.
CIM can be described as follows. One chooses a subset of markers, S, to control background
genetic variation. As in IM, suppose one wants to test for QTL between marker i and i + 1. For a
test at the kth location, between markers i and i + 1, the statistical model can be written as
yj = µ + β m∗kj +
∑
l 6=i,i+1
βl mlj + ej
where β is the effect of the putative QTL. A general guideline for practice would be to drop those
markers that are within 10cM of the test position (Broman 2002 calls this subset of markers S ∗).
7
Under this model, the contribution of each individual to the likelihood has the form of a mixture
of two normal distributions with means µ +∑
l∈S∗ βl mlj and µ + β +∑
l∈S∗ βl mlj with mixing
proportions equal to the conditional probabilities of the individual having QTL genotype 0 or 1,
given flanking marker genotypes and test position. Zeng (1994) used the ECM algorithm (Meng
and Rubin 1993) to obtain the maximum likelihood estimates.
As in IM, a LOD score is calculated at each test position, comparing the likelihood assuming
there is a QTL at the putative test locus, to the likelihood assuming that there is no QTL there.
The LOD score is then plotted as a function of test positions in the genome, and is compared to a
genome-wide threshold to declare significance.
Jansen independently developed a similar approach to handle multiple QTL combining IM and
MR, multiple-QTL mapping (MQM) (Jansen 1993; Jansen and Stam 1994). It fits single QTL
models with selected markers as cofactors in the regression to eliminate the effects of possible
QTLs in other intervals.
The major problem with these approaches is how to choose the set of markers to be included
in the model. Too many markers will give low power for QTL detection, and too few will cause
low accuracy. Zeng (1994) compared the performance of including all other markers to including
only unlinked markers through simulation. He then recommended some combinations of deleting
or inserting some linked markers in practice. Jansen (1993) and Jansen and Stam (1994) used
backward elimination with AIC (Akaike 1969), or a slight variant, to pick the subset of markers
in the model. Broman (2002) recommended the use of the BICδ criterion, with the value δ chosen
by the approximate correspondence between BICδ and a genome-wide threshold on the LOD
score. Kao et al. (1999) proposed multiple IM (MIM) which uses multiple marker intervals
simultaneously to construct multiple QTL in the model. In MIM, Kao et al. (1999) adopted
stepwise selection with Likelihood Ratio Test (LRT) as the selection criterion to identify QTLs.
3.3 Multiple Phenotype - Single or Multiple QTL Models
In many QTL mapping studies there is more than one trait being measured. Performing single
trait analysis repeatedly is not optimal clearly because it doesn’t take into account the correlation
structure among the traits. It has been shown that analyzing traits jointly will increase the power of
QTL detection (Jiang and Zeng 1995; Knott and Haley 2000). In these studies, joint distribution
of the multi-trait is imposed, which requires specification of the covariance structure of the traits
(for a review of multi-trait QTL mapping methods, see Lund et al. 2003 and references therein.)
8
Multi-trait methods are very attractive in that they try to capture the inner structure among traits,
because many traits are genetically or environmentally correlated. However, as the number of traits
gets bigger, so does the number of parameters that need to be estimated.
4 ETL Mapping Methods
To this date, most ETL mapping methods consider single transcripts at a time. Multi-trait mapping
using available methods has not been attempted. Most recognize that this would be impossible
with thousands of traits since estimation of a phenotype covariance matrix is not feasible. We
detail exact methods used below.
4.1 Transcript Based Approach
The earliest ETL mapping studies applied single phenotype-single QTL mapping methods to every
transcript (Brem et al. 2003; Schadt et al. 2003). We call this type of approach transcript
based (TB). In Brem et al. (2002), a Wilcoxon-Mann-Whitney rank sum test was applied to
every transcript and marker pair. Nominal p-values were reported and the number of linkages
expected by chanced was estimated by permuation tests (Churchill and Doerge 1994). In Schadt
et al. (2003), transcript specific LOD score profiles were obtained using standard QTL IM. A
common genome-wide LOD score threshold was chosen to account for the potential increase of
type I error induced by testing across multiple markers. Neither study accounted for multiple tests
across transcripts. We view this as a problem that needs serious attention because in ETL studies,
we usually deal with thousands of traits.
4.2 Transcript Based Approach with FDR control
Recently, investigators have made attempts to account for multiplicities across transcripts (Chesler
et al. 2005; Hubner et al. 2005). They first computed genome-wide empirical p-values of the
maximum LOD score for every transcript using permutation tests and then estimated the q-values
(Storey and Tibshirani 2003) accordingly. It has to be pointed out that even though effort to adjust
for multiple testings across the transcripts was presented, this is by no means a systematic way
to approach the multiple testing problem. Preliminary results from simulations such as those
described in Kendziorski et al. (2004) show that FDR is not properly controlled and power is
9
relatively lower than other approaches. This is consistent with the results of Chesler et al. (2005),
where the q-value threshold was increased to 0.25 so that a reasonable number of transcripts could
be identified.
4.3 Marker Based Approach
Instead of conducting the analysis at every transcript, ETL mapping could be done by conducting
the analysis at every marker. We call these marker based (MB) approaches. They consist of
identifying differentially expressed (DE) transcripts across groups of animals where groups are
determined by the genotype at a given marker. Any DE methods could be used. Usually, the DE
evidence threshold can be chosen such that multiple testing performed across the transcripts can
be accounted for. However, the MB approach does not consider multiplicities across the genome.
As was shown in Kendziorski et al. (2004), both TB and MB approaches share similar flaws.
In TB, separate tests are conducted for each transcript. In MB, each marker is tested separately.
And for both, the evidence that a transcript maps to a marker is measured against the evidence
that it doesn’t map there. Since in reality a transcript can map to any of the marker locations, the
evidence that a transcript maps to a particular marker should be judged relative to the possibility
that it maps nowhere or to some other marker. This idea motivates what we call the Mixture Over
Markers (MOM) model (Kendziorski et al. 2004).
4.4 Mixture Over Markers Model
Let yt be the expression level for tth transcript, yt = {yt1, yt2, . . . , ytn}, where n is the number of
animals in the ETL study. The MOM model assumes a transcript t maps nowhere with probability
p0 and maps to marker m with probability pm, such that∑M
m=0 pm = 1, where M denotes the total
number of markers. The marginal distribution of yt is then given by
p0f0(yt) +
M∑
m=1
pmfm(yt) (4.4)
where fm is the predictive density of the data if transcript t maps to marker m; f0 is the
predictive density when the transcript maps to nowhere. Specifically, suppose transcript abundance
measurements ytj arise independently from some observation distribution fobs(·|µt,·, θ). The
dependence among the underlying means µt,· is captured by a distribution π(µ). With this setting,
f0(yt) =∫
(
∏nj=1 fobs(ytj|µ)
)
π(µ)dµ. For a transcript that maps to marker m say, the underlying
10
expression means defined by the marker genotype groups are not equal (µt,0 6= µt,1), but they both
are assumed to come from π(µ). The governing distribution for yt is then:
fm(yt) = f0(y0t ) f0(y
1t )
where y0(1)t denotes the set of transcript t values for animals with genotype 0(1).
Model fit proceeds via the EM algorithm. Once the parameter estimates are obtained, posterior
probabilities of mapping nowhere or to any of the M locations can be calculated via Bayes rule.
For instance, the posterior probability that transcript t maps to location l, l = 0, . . . , M is given by
plfl(yt)
p0f0(yt) +∑M
m=1 pmfm(yt)(4.5)
With the MOM approach, a transcript is identified to be DE if the posterior probability of
DE exceeds some threshold, where the threshold is chosen to control the expected posterior false
discovery rate (Newton et al. 2004). In order to make a transcript specific call, the highest posterior
density (HPD) region can be constructed in a straightforward fashion. A 1 − α HPD region is
obtained by including those marker locations until the sum of the posterior probability exceeds
1 − α.
4.5 Current Status of ETL Mapping Methods
Here we present some of our early simulation results comparing different approaches. We
considered a single ETL simulation with two chromosomes. Marker data was obtained from chr 2
and chr 3 from the F2 data from Dr. Alan Attie’s lab on campus. Chromosome 2 has 17 markers
and there are 6 markers on chromosome 3. A single ETL was simulated at marker 5 of chromosome
2. We generated 20 data sets for each of the seven values of ν0, which is a tuning parameter to
control the variance pattern in the simulated data (for details, see Kendziorski et al. 2004) so that
operating characteristics could be evaludated without biasing towards one method.
We consider applying MR for every transcript as transcript based MR (TB-MR). For TB-MR,
the genome wide type I error rate per transcript is controlled at 5% (Dupuis and Siegmund 1999).
We also have four marker based approaches. The first is EBarrays with LogNormal-Normal (LNN)
model where posterior probabilities of DE are computed for all the transcripts at every marker
(MB-EB). See Kendziorski et al. (2003) for details of the LNN model. The second one is a t-test
of equal means for every transcript, followed by calculations of q-values (Storey and Tibshirani
11
2003) at every marker (MB-Q). The third and fourth methods calculate moderated t-statistics using
SAM (Tusher et al. 2003) and LIMMA (Smyth 2004), followed by q-value calculation at every
marker. We refer to these as MB-SAM and MB-LIMMA. For MB-EB, MB-Q, MB-SAM, and
MB-LIMMA, the false discovery rate per marker is controlled at 5%.
A fifth method considered attempts to test all the transcripts and all the markers simultaneously.
P-values from t-tests for every transcript and marker pair were used to calculate q-values (Storey
and Tibshirani 2003) at once. Note that by doing so, we assume that a certain dependence structure
among tests is satisfied (Storey 2003), which is likely to be not true here. We, nevertheless, include
this method (Q-ALL) in our simulation, because we’d like to see the performance of this kind of
ad-hoc procedure. In addition, Storey and Tibshirani (2003) use the ETL mapping data of Brem
et al. (2002) as motivation for considering calculation of all q-values simultaneously. For Q-ALL,
FDR control is targeted at 5%.
Power is defined as the probability of calling marker 4, 5 or 6 (flanking region of the true
ETL) on chromosome 2 for mapping transcripts. FDR is the proportion of transcripts identified
incorrectly as mapping to chromosome 2 or 3, i.e., they were EE or they were DE but mapped
outside the flanking region of the true ETL. Figure 1 shows the average power and FDR (over the
20 simulated data sets) against each ν0 value, together with 95% point-wise confidence interval.
As can be seen, the power is around 80% and 90% for all of the methods; however, for all methods
except MOM, the FDR is well above 5% for most values of ν0. These simple and somewhat ad-hoc
approaches fail to control the FDR at their claimed level, because they couldn’t adjust for multiple
tests across the markers and the transcripts simultaneously.
Although MOM does adjust appropriately for these multiplicities, the approach has a number
of shortcomings. For experiments with sparse maps, the MOM model is lacking as information
between markers is not available. Also, MOM does not allow for multiple ETL. Some dense
maps are currently available or under development, particularly those that use single nucleotide
polymorphisms (SNPs) (see The SNP Consortium). Because of its proximity, SNPs may be shared
among groups of people with harmful but unknown mutations and serve as markers for them.
Such markers can help to reveal the mutations and expedite therapeutic drug discovery. When
dense maps are available, the MOM model as it is may not be applicable as the number of mixture
components to fit will be huge.
12
5 Research Plan
The main objective of the proposed thesis is to develop a powerful statistical framework within
which ETL can be localized. The framework relies on the MOM model, in that it simultaneously
controls for tests done across the genome locations and across all the transcripts. However, the
proposed methods, detailed below, significantly increase the utility and applicability of MOM by
addressing the first two shortcomings listed above. The effort of developing a method to handle
dense maps is ongoing.
5.1 ETL Interval Mapping
The genomic regions identified using MOM are limited by their size, which may be large as
analysis is conducted at genotyped markers only. When dense maps are not available (Attie lab
data, Schadt et al. 2003), this limitation can be a serious one. The biological techniques currently
available to search for genes in large genomic regions (e.g. candidate gene approach, congenic
lines) can take years and, as a result, additional statistical methods capable of narrowing down
regions are necessary.
We here propose a method for IM of ETL. Consider for simplicity of notation a backcross
population genotyped as 0 or 1 at M markers. The observed phenotype data y is a T × n matrix
of transcript abundance levels for transcript t = 1, . . . , T and individuals j = 1, . . . , n; m is an
M × n matrix of marker genotypes for markers i = 1, . . . , M and individuals j = 1, . . . , n.
Consider a set of L locations spanning the entire genome, we model the expression data for
transcript t as a L + 1 component mixture. To be specific, we imagine that the transcript may map
to nowhere with probability p0, and to any of the L locations with probability pl, l = 1, 2, . . . , L.
The p’s are mixing proportions. As noted in Section 3.1, transcript t is said to be linked (mapped)
to location l if µ0t,l 6= µ1
t,l, where µ0(1)t,l denotes the latent mean level of expression for transcript
t for the population of individuals with genotype 0(1) at location l. Let z lt be an indicator of
whether transcript t maps to location l. If l is at the markers, then we can decompose the predictive
density under the alternative hypothesis such that fl(yt) = f0(y0t ) f0(y
1t ), where the grouping is
determined by genotypes at that marker. However, when l is between markers, the decomposition
13
is no longer valid. Instead, we have
fl(yt) =
∫ ∫ ∫
∏
j∈Gl0
fobs(ytj|µt,0)∏
j∈Gl1
fobs(ytj|µt,1) π(µt,0) π(µt,1) p(gl|m) dµt,0 dµt,1 dgl
=
∫
fl(yt|gl) p(gl|m) dgl
where gl = (gl1, g
l2, . . . , g
ln) denotes the unknown genotype vector at location l; G l
0(1) denotes the
set of population having genotype 0(1) at location l. Under the null hypothesis, the predictive
density of the data, f0(yt) can be calculated as before since it doesn’t rely on genotype groupings.
Parameter estimates for mixing proportions and hyper parameters can be obtained via the EM
algorithm (see Appendix A). We show that the posterior probability for transcript t to be mapped
to location l, after integrating out the µ’s, is given by
p(zlt = 1|y, m) =
p(zlt = 1)
∫
fl(yt|gl)p(gl|m)dgl
p(yt|m)(5.6)
where p(zlt = 1) is the prior probability that transcript t maps to location l.
At a particular location l, the conditional distribution of genotype given the expression and
marker data p(gl|m) is assumed to only depend on the two markers flanking l. Notice that g l is a
vector of length n. Theoretically, there are 2n possible genotype vectors. So the integral in (5.6)
is a huge mixture. In practice, one can restrict to consider a smaller number (Table 1, Zeng 1994,
reproduced here as Table 2) since a lot of them have small probabilities. However, as the number
of individuals in the study is large, this 2n problem quickly becomes computationally infeasible.
5.1.1 Pseudomarker-MOM
To get around with the 2n problem, here we propose a general framework of ETL IM using
importance sampling and pseudomarker generation. The idea of pseudomarkers was introduced
by Sen and Churchill (2001) (see Appendix B). Let us introduce some slightly more general
notation. Recall that a transcript t is linked to location l if µ0t,l 6= µ1
t,l, where µ0(1)t,l denotes the
latent mean level of expression for transcript t for the populations of individuals with genotype
0(1) at location l. When this is the case, we say that transcript t is in expression pattern P1
at location l, denoted as P1lt; similarly P0l
t denotes the null pattern of expression at location
l (µ0t,l = µ1
t,l). It is useful to introduce specific patterns in our framework, since when an F2
population is considered, for example, there are numerous patterns. Each defines a way in which
the latent means can be different across genotype groups at a location. All of the non-null patterns
14
imply linkage to that location). Two T × L matrices, θ0 and θ1, contain the latent mean levels
of expression (θ = (θ0, θ1)); L denotes the total number of locations considered. Let z lt = 1 if
transcript t is in expression pattern P1 at location l and 0 otherwise. Then z is a T × L indicator
matrix specifying QTL locations for each transcript.
Let us first consider a simple case where a transcript is associated with at most one genomic
location l and consider inference at location l. This assumption simplifies algebraic development
and will be relaxed later. At location l, of primary interest is the posterior probability that z lt∗ = 1
for transcript t∗. Reexpressing (5.6) in terms of the patterns, we have
p(Pklt∗|y, m) ∝ pl
t∗,Pk
∫
fPk
(
yt∗|gl)
p(gl|m)dgl (5.7)
where plt∗,Pk denotes the prior probability that transcript t∗ is in pattern k at location l and fPk
describes the predictive density of the data. k = 0, 1 for backcross.
A normalizing constant is not required if further calculations of operational characteristics such
as FDR is not of interest, which may be the case if a simple ranking of the genes is desired. For the
model we propose here, we are interested in the calculation of estimated FDR. Therefore, we must
specify the normalizing constant p(yt∗|m). Expanding the probability in terms of different possible
mapping locations, we have p(yt∗|m) = p(yt∗|m, z·t∗ = 0)p(z·t∗ = 0) +∑L
l′=1 p(yt∗|m, zl′
t∗ =
1)p(zl′
t∗ = 1), where p(z·t∗) +∑L
l′=1 p(zl′
t∗ = 1) = 1 and p(z·t∗ = 0) implies that the transcript does
not map to any of the L locations, i.e., the transcript is in pattern P0 at every location. Therefore,
the exact form of (5.7) is given by
p(Pklt∗|y, m) =
plt∗,Pk
∫
fPk
(
yt∗|gl)
p(gl|m) dgl
∫ (
pt∗,P0fP0 (yt∗) +∑L
l′=1 pl′
t∗,P1fP1 (yt∗|gl′))
p(gl′|m) dgl′(5.8)
The pt∗,P0 and plt∗,P1’s are unknown. We estimate them from the data, with the average of posterior
probability that each transcript belongs to a particular pattern. Due to the 2n problem, calculation of
p(gl|y, m), the posterior distribution of the unobserved genotype at location l given the expression
and marker data, becomes computationally prohibitive. Here we use importance sampling by first
simulating multiple versions of pseudomarkers from p(g l|m), and then replace the exact integral
with its Monte Carlo approximation.
In simulating the pseudomarkers, one can use a simple Markov Chain structure where the
putative QTL genotype at a given location only depends on two flanking markers. However, this
might not work well if there is genotyping error or noninformative markers in the marker data.
A hidden Markov Model (HMM) can be considered where the “true” marker genotypes follow a
15
Markov Chain, and the observed marker genotypes are characterized by distributions conditional
on the underlying state process. Using an HMM, one can account for genotyping error and missing
marker data in a coherent way. R/qtl (Broman et al. 2003) has this option as well. Figure 11 gives
an example. The upper stripe gives the marker data for one animal. Blue shows AA, and yellow
for Aa. There are some missing marker data, represented by light blue. The lower panel has 20
realizations of sampling from the HMM model trained by the marker data.
Suppose for each location l′, Q genotype vectors are sampled from the proposal distribution
p(g|m), where g = {g1, g2, . . . , gL} to give (gl′
1 , gl′
2 , . . . , gl′
Q) for l = 1, . . . , L. Then equation (5.8)
can be approximated by Monte Carlo integration using importance samples (see Appendix C.)
p(Pklt∗|y, m) ≈
plt∗,Pk
∑Q
q=1 fPk
(
yt∗|glq
)
pt∗,P0
∑Q
q=1 fP0 (yt∗) +∑L
l′=1 pl′
t∗,P1
∑Q
q=1 fP1
(
yt∗|gl′q
) (5.9)
This approach is an extension of the MOM model evaluated by averaging over the pseudomarkers.
We call it pseudomarker-MOM.
5.1.2 Two-Stage Approach
Pseudomarker-MOM scans the genome at some small distance step in order to find potential
locations to which transcripts are mapped. But applying it over the entire genome, even for each
chromosome separately, can be computationally prohibitive. We propose a two-stage approach
where we first apply MOM at markers to identify interesting regions, then follow up by applying
pseudomarker-MOM to those regions to better localize ETLs.
MOM calculates for each transcript, the posterior probability that it maps to a particular marker,
or it doesn’t map at all. To identify potential ETL regions, we average the linkage evidence across
all the transcripts, to give a marker specific linkage score. We can choose those hot-spot markers
by thresholding the marker specific linkage evidence, using an HPD region.
5.1.3 Theoretical Result
We here justify the fact that picking those regions with the highest average linkage evidence is
indeed the correct thing to do under simplified conditions.
Theorem 1: If we assume transcripts map to at most one genomic location and the prior
probability of mapping to a particular marker is known and equal for all the markers, and
hyperparameter values are known, then the expected posterior probability of a transcript mapping
16
to a particular marker is a non-increasing function of the recombination frequency between that
marker and the ETL.
Proof: See Appendix D.
As a result of Theorem 1, we can be sure that under these conditions, the interesting marker
regions picked by using MOM will be those regions that are the closest to the ETL. Once these
regions are defined, we set up some equally-spaced pseudomarker grid and use pseudomarker-
MOM to help localize the ETL with greater accuracy. We show by simulation studies in the next
section that this approach works quite well, even under more general conditions.
5.1.4 Simulation Results
A simulation was set up where there are 5000 transcripts and 100 individuals. The proportion
of differential expression (DE) was 10%. The hypothetical marker map is composed of one
chromosome and is equally spaced with 10cM in between. There are 10 markers in total (i.e,
from 0cM to 90cM). Intensity values are simulated as described in Section 4.4. Two simulations
were considered: one with a single ETL at 35cM and one with two ETLs: one at 35cM and one at
75cM.
The two-stage approach was applied in the simulation. Specifically, two hotspot marker regions
were selected from the average posterior probability profile at every marker, obtained by MOM.
Pseudomarker-MOM was then applied across a 2cM grid within the hot-spot regions, with 50
pseudomarker realizations (Q = 50). For comparison, we applied traditional QTL IM transcript
by transcript on the same data set and obtained genome-wide cutoffs on LOD scores based on an
approximation formula from Rebai et al. (1994). Here we show results from the two simulations.
Figure 2 (left panel) shows the average posterior probability profile, averaged across the
mapping transcripts. The ETL region is identified both by MOM and pseudomarker-MOM.
However, MOM picks up a wide peak between marker 4 and marker 5, whereas pseudomarker-
MOM identifies the ETL location with much better accuracy. To ensure that non-mapping
transcripts were not falsely identified, we considered the posterior probability profiles averaged
across non-mapping transcripts. They show little structure, as expected (figure not shown).
Figure 2 (right panel) shows the 96.8% HPD regions for true mapping transcripts (96.8% is
used to compare with IM results below). As shown, the ETL is identified correctly for most of
the mapping transcripts. Just as shown in left panel, pseudomarker-MOM provides good ETL
localization.
17
In comparison, we also considered IM on the same simulated data set. This was implemented
using R/qtl (Broman et al. 2003). Figure 3 (left panel) is the average LOD score, averaged across
mapping transcripts. The region containing the ETL has the highest average LOD, but the average
LOD scores are overall very high for mapping transcripts, and it’s not clear what cutoff one should
use in order to correctly identify the ETL region. For example, if we use 5, then it chooses almost
the entire simulated chromosome.
To compare with Figure 2, a “confidence interval” was constructed around the ETL using a
1-LOD drop interval around peak LOD score (Mangin et al. 1994). This procedure is designed
to approximate a 96.8% confidence interval, but in general, these intervals can be biased in that
they are too small and bootstrap procedure has been recommended (Visscher et al. 1996). In the
ETL setting, obtaining bootstrap samples for thousands of transcripts doesn’t seem feasible. On
the other hand, confidence intervals that are slightly too small favors IM here as ETL appear to be
better localized. It is not always clear which peaks to construct confidence intervals around. To
give IM the best results, we consider a 10 cM window around the true ETL (35cM) and define the
LOD peak as the highest LOD within the window. The 1-LOD drop interval is then constructed.
Of course, in practice, one does not have the luxury of knowing where to choose these peaks and
perhaps only the largest peak would be identified. For these reasons, this method of identifying
ETL regions favors IM. Even so, in comparison to pseudomarker MOM, IM does not provide as
precise estimates of ETL location.
Similar results were obtained for the 2-ETL case. As shown in Figure 4 (left panel), for a
particular simulation, the two ETL regions are identified both by MOM and pseudomarker-MOM.
However, spurious peaks show up in different places from different simulations, but in general
have low average posterior probability compared with the main peaks (e.g., in the bottom row). As
shown, pseudomarker-MOM increases localization of the ETL compared with MOM. In Figure
4 (right panel), the distinct ETLs are identified for most of the mapping transcripts. We see that
pseudomarker-MOM provides good ETL localization.
We again considered IM on the same two simulated data sets. Figure 5 (left panel) shows the
average LOD score profile, averaged across mapping transcripts. The regions containing the ETLs
has the highest average LOD, but distinct ETLs are not as clear as in Figure 3. As before, the 96.8
% “confidence intervals” were around each ETL using a 1-LOD drop interval around peak LOD
scores, with peak chosen from the two 10cM windows around the true ETLs (35cM and 75cM).
IM suffers from the same problem as before in that has relatively more false positive calls.
18
5.2 Multiple ETL mapping
The MOM approach can be extended to account for multiple ETL. For example, if transcript t
is possibly affected by two genotype locations l1 and l2, then four latent means are of interest:
µ0,0t,(l1,l2)
, µ0,1t,(l1,l2)
, µ1,0t,(l1,l2) and µ1,0
t,(l1,l2), where µj,k
t,(l1,l2)denotes the latent mean level of expression
for transcript t for the populations of individuals with genotype (j, k) at locations l1 and l2. These
latent means can be arranged in 15 possible expression patterns, all of which may be of interest.
For simplicity, we consider:
P0: µ0,0t,(l1,l2)
= µ1,0t,(l1,l2)
= µ0,1t,(l1,l2)
= µ1,1t,(l1,l2)
P1: µ0,0t,(l1,l2)
= µ0,1t,(l1,l2)
6= µ1,0t,(l1,l2)
= µ1,1t,(l1,l2)
P2: µ0,0t,(l1,l2)
= µ1,0t,(l1,l2)
6= µ0,1t,(l1,l2)
= µ1,1t,(l1,l2)
P3: µ0,0t,(l1,l2)
6= µ1,0t,(l1,l2)
= µ0,1t,(l1,l2)
6= µ1,1t,(l1,l2)
P4: µ0,0t,(l1,l2)
6= µ1,0t,(l1,l2)
= µ0,1t,(l1,l2)
= µ1,1t,(l1,l2)
P5: µ0,0t,(l1,l2)
6= µ1,0t,(l1,l2)
6= µ0,1t,(l1,l2)
6= µ1,1t,(l1,l2)
Pattern P0 allows for the possibility that a transcript maps to neither location. The latent means of
a transcript mapping only to location l1 would satisfy pattern P1 since only allelic differences at l1
affect the mean level of expression. Similarly, the means of transcripts mapping only to location l2
satisfy P2. Patterns P3-P5 describe possible ways in which the allelic variation at both locations
can act and interact to affect expression level. P3 describes a scenario in which the alleles at each
location have equal, but not dominant, effects. A dominant model would be described by P4 and
an additive model by P5.
The multiple ETL MOM model (M-MOM) has 5(
M
2
)
+ 1 patterns. As before, of primary
interest is the posterior probability of particular expression patterns. They can be calculated
similarly as in (4.5) for any pattern of interest, where k = 0, 1, . . . , 5.
5.2.1 Simulation Results
We apply M-MOM to the simulated 2-ETL data described in Section 5.1.4. We perform a
two dimensional scan, by looking at every possible marker pair and all the expression patterns
simultaneously. In the simulation, the two ETL effect sizes are generated to be the same, thus
corresponding to pattern 3. If we look at the average posterior probabilities for all non-null patterns,
most of them are negligible as expected, except for pattern 3 (figure not shown).
Looking closely at pattern 3, we plot the average posterior probabilities for every marker pair
in Figure 6. The first ETL location is on the Y axis and the second ETL location on the X axis. As
19
shown, the plot locates the two ETLs at 30cM and 80cM, which is very close to the truth and the
best accuracy that M-MOM can achieve, since the true ETLs are not at markers.
For comparison purposes, we implement a 2-D marker regression scan on the same data set
(see Figure 7.) On the upper triangle, we plot the average LOD score for epistasis, which has very
low probability, as expected. The diagonal is the average LOD sore from 1-D IM scan. The lower
triangle gives the average joint LOD scores. Also shown in the plot are the contour lines over
the range of the LOD scores obtained. The region between 30cM, 40cM and 70cM to 90cM has
relatively high average probabilities compared to the others. In order to assess the significance,
we randomly sample 10 transcripts corresponding to every 10th percentile of the log of expression
means, and perform permutation tests of size 1000 on each of them. The average 95th percentile of
the LOD scores from each of the 1000 permutation tests is about 3.2. Using this as our 2-D LOD
score cutoff, we see from the contour lines that it gives a much wider region than the actual ETL.
In line with the previous comparisons, using traditional QTL mapping techniques on the ETL data
seems to yield more false positives.
6 Future Research Questions
The proposed framework for ETL mapping enables IM of single ETL and the identification of
multiple ETL at markers. Preliminary results from simulated data sets are encouraging. There are
a lot of questions that need to be explored further.
1. Our methods are extensions of the MOM model and a number of questions of
implementation remain open. One question that is important to our extension is:
How sparse is sparse? In other words, how many markers can MOM handle relative
to the number of transcripts? Generally, one might also be concerned about whether
MOM could fit mixtures with so many components. We did a series of simulations,
trying to shed some light on this. Consider a backcross, and suppose the genome has 400
equally spaced markers with 5cM in between. There are 25% of DE transcripts, mapping
to 80% of the markers with equal probabilities. We varied the number of transcripts
to be 100, 1000, 5000 and 40000, representing very small, small, moderate and large
number of transcripts in a realistic experiment, with animal number being 10, 60 and 100.
Here 10 might be a little unrealistic, but we were curious about how the results will look like.
20
We plot the posterior probabilities for all the transcripts of a particular mixture component
in the MOM model in Figures 8, 9 and 10. The number of transcripts mapping to that
mixture component is 1, 1, 4 and 22 corresponding to the four values of total transcripts
number. The true DE transcripts are colored in red. Surprisingly, it seems that MOM can
fit a mixture model with the number of components bigger than the number of transcripts
pretty well (first panel in Figure 9 and Figure 10), as long as there are enough animals in the
experiment. Our impression is that when the number of animals is small, there tends to be
very few recombinations. The degree of distincitiveness between the MOM components is
not very high, resulting in relatively low power (see Figure 8.)
When the number of transcripts is small, there are also very few DE transcripts mapped
to every marker. We might expect the posterior probabilities of those transcripts not to be
very accurate. But as seen from Figure 9, this does not seem to be the case. There are
60 animals, but even when we only have 100 transcripts, MOM still detects the one DE
transcript with posterior probabilitiy being 1.0. When there are 40000 transcripts, among the
top 22 transcripts with the highest posterior probabilities, 20 of them are true DE (22 DE in
total). The posterior probabilities range from 0.714 to 1.0. In Figure 10, when there are 100
animals and 40000 transcripts, 21 out of the top 22 transcripts are DE, and their posterior
probabilities range from 0.76 to 1.0. Much more work is required to investigate these results
and define precise conditions under which parameters are well estimated.
2. Importance sampling is applied in pseudomarker-MOM. Importance samples are taken from
the proposal distribution p(gl|m) to approximate what we really desire p(gl|y, m). We
will investigate ways to choose Q so that we balance between computational burden and
accuracy. Perhaps a lower bound on Q can be obtained.
3. The proposed two-stage approach relies on the ability of MOM to identify the correct
“interesting” regions for follow-up. We have shown theoretically that under simplified
conditions, the highest posterior probability region obtained using MOM is that region
closest to ETL. The simplified conditions need to be relaxed. In particular, we will
investigate the case of multiple ETL and varying prior probabilities.
4. We have developed a method for mapping multiple-ETL. Preliminary results from
simulations investigating a 2 ETL system are encouraging. We would like to extend this
21
approach so that interval mapping can be done. All the questions we want to answer for
pseudomarker-MOM in the single ETL case will be applicable here, with greater complexity.
5. SNP maps are coming. When dense maps are available, fitting all the marker components
in the MOM model would be impossible. Time permitting, some model selection tool to
proceed with MOM followed by pseudomarker MOM will be considered.
22
References
1. Brem, R.B., G. Yvert, R. Clinton, and L. Kruglyak. (2002). Genetic Dissection of
Transcriptional Regulation in Budding Yeast. Science 296, 752-755.
2. Broman, K.W., and T.P. Speed. (2002). A model selection approach for the identification
of quantitative trait loci in experimental crosses (with discussion). Journal of the Royal
Statistical Society Series B 64, 641-656 and 737-775 (discussion).
3. Broman, K.W., H. Wu, S. Sen, and G.A. Churchill. (2003). R/qtl: QTL mapping in
experimental crosses. Bioinformatics 19 (7), pages 889-890.
4. Carlin, B., and T. Louis. (1998). Bayes and Empirical Bayes Methods for Data Analysis.
Chapman & Hall, New York, New York.
5. Churchill, G.A. and R.W. Doerge. (1994). Empirical threshold values for quantitative trait
mapping. Genetics, 138, 963-971.
6. Cox, N.J. (2004). An expression of interest. Nature 430, 733-734.
7. Doerge, R. W., Zeng, Z.-B. and Weir, B. S. (1997). Statistical issues in the search for genes
affecting quantitative traits in experimental populations. Statis. Sci., 12, 195-219.
8. Dupuis, J., P.O. Brown and D. Siegmund. (1995). Statistical methods for linkage analysis of
complex traits from high resolution maps of identity by descent. Genetics, 140: 843-856.
9. Dupuis, J., and D. Siegmund. (1999). Statistical Methods for Mapping Quantitative Trait
Loci From a Dense Set of Markers. Genetics 151, 373-386.
10. Feingold, E., P.O. Brown and D. Siegmund. (1993). Gaussian models for genetic linkage
analysis using complete high resolution maps of identity-by-descent. Am. J. Jum. Genet.
53: 234-251.
11. Haley, C., and S. Knott. (1992). A simple regression method for mapping quantitative trait
loci in line crosses using flanking markers. Heredity 69, 315-324.
12. Hubner, N., Wallace, C.A., Zimdahl, H., Petretto, E., Schulz, H., Maciver F., Mueller M.,
Hummel, O., Monti, J., Zidek, V., Musilova, A., Kren, V., Causton, H., Game, L., Born, G.,
23
Schmidt, S., Muller, A., Cook, S.A., Kurtz, T.W., Whittaker, J., Pravenec, M., and Aitman,
T.J. (2005). Integrated transcriptional profiling and linkage analysis for identification of
genes underlying disease. Nature Genetics, Vol 37, Number 3, 243-253.
13. Irizarry, R.A., B. Hobbs, F. Collin, Y.D. Beazer-Barclay, K.J. Antonellis, U. Scherf,
and T.P. Speed. (2003). Exploration, Normalization, and Summaries of High Density
Oligonucleotide Array Probe Level Data. Biostatistics 4(2), 249-264.
14. Jansen, R. (1993). A general mixture model for mapping quantitative trait loci by using
molecular markers. Theoretical and Applied Genetics 85, 252-260.
15. Jansen, R., and P. Stam. (1994). High resolution of quantitative traits into multiple loci via
interval mapping. Genetics 136, 1447-1455.
16. Jansen, R., and J.P. Nap (2001). Genetical genomics: the added value from segregation.
Trends in Genetics 17 388-391.
17. Jansen, R. C. (2003). Studying complex biological systems using multifactorial perturbation.
Nature Rev. Genet. 4: 145-151.
18. Jiang, C., and Z-B. Zeng. (1995). Multiple trait analysis of genetic mapping of quantitative
trait loci. Genetics 140: 1111-1127.
19. Kendziorski, C.M., Chen, M., Yuan, M., Lan, H., and A.D. Attie. (2004). Statistical
Methods for Expression Trait Loci (ETL) Mapping. Department of Biostatistics and Medical
Informatics Technical Report #184, submitted.
20. Kendziorski, C.M., Newton, M.A., Lan, H., and M.N. Gould. (2003). On parametric
empirical Bayes methods for comparing multiple groups using replicated gene expression
profiles. Statistics in Medicine, 22, 3899-3914.
21. Knott, S.A., and C.S. Haley. (2000). Multitrait least squares for quantitative trait loci
detection. Genetics 156: 899-911.
22. Lan, H., Rabaglia, M.E., Stoehr, J.P., Nadler, S.T., Schueler, K.L., Zou, F., Yandell, B.S.,
and A.D. Attie. (2003a). Gene expression profiles of non-diabetic and diabetic obese mice
suggest a role of hepatic lipogenic capacity in diabetes susceptibility. Diabetes, 52(3), 688-
700.
24
23. Lan, H., Stoehr, J.P., Nadler, S.T., Schueler, K.L., Yandell, B.S., and A.D. Attie. (2003b).
Dimension reduction for mapping mRNA abundance as quantitative traits. Genetics 164,
1607-1614.
24. Lander, E. S., and D. Botstein. (1989). Mapping mendelian factors underlying quantitative
traits using RFLP linkage maps. Genetics 121, 185-199.
25. Lund, M.S., Sorenson, P., Guldbrandtsen, B., and D.A. Sorensen. (2003). Multitrait Fine
Mapping of Quantitative Trait Loci Using Combined Linkage Disequilibria and Linkage
Analysis. Genetics 163(1), 405-410.
26. Lynch, M. and B. Walsh (1998). Genetics and Analysis of Quantitative Traits. Sunderland:
Sinauer.
27. Mangin, B., Goffinet, B., and A. Rebai. (1994). Constructing Confidence Intervals for QTL
Location. Genetics 138, 1301-1308.
28. Morley, M., C.M. Molony, T.M. Weber, J.L. Devlin, K.G. Ewens, R.S. Spielman, and V.G.
Cheung. (2004). Genetic analysis of genome-wide variation in human gene expression.
Nature 430, 743-747.
29. Nguyen, D.V., Arpat, A.B., Wang N., and R.J. Carroll. (2002). DNA Microarray
Experiments: Biological and Technological Aspects. Biometrics 58, 701-717.
30. Newton, M.A., Kendziorski, C.M, Richmond, C.S., Blattner, F.R., and K.W. Tsui. (2001).
On differential variability of expression ratios: Improving statistical inference about gene
expression changes from microarray data. Journal of Computational Biology, 8, 37-52.
31. Newton, M.A., Noueiry, A., Sarkar, D., and P. Ahlquist. (2004). Detecting differential gene
expression with a semiparametric hierarchical mixture method. Biostatistics 5, 155-176.
32. R Development Core Team (2004). R: A language and environment for statistical computing.
R Foundation for Statistical Computing, Vienna, Austria.
33. Rebai, A., Goffinet, B. and B. Mangin. (1994). Approximate thresholds for interval mapping
test for QTL detection. Genetics 138: 235-240.
25
34. Rebai, A., Goffinet, B. and B. Mangin. (1995). Comparing Power of Different Methods of
QTL Detection. Biometrics 51, 87-99.
35. Sax, K. (1923). The association of size differences with seed-coat pattern and pigmentation
in Phaseolusvulgaris. Genetics 8, 552 - 560.
36. Schadt, E., Monks, S., Drake, T.A., Lusis, A.J., Che, N., Collnayo, V., Ruff, T.G., Milligan,
S.B., Lamb, J.R., Cavet, G., Linsley, P.S., Mao, M., Stoughton, R.B., and S.H. Friend.
(2003). Genetics of gene expression surveyed in maize, mouse and man. Nature 422, 297-
302.
37. Sen, S., and G.A. Churchill. (2001). A Statistical Framework for Quantitative Trait
Mapping. Genetics 159, 371-387.
38. Storey, J.D., and R. Tibshirani. (2003). Statistical significance for genomewide studies.
Proceedings of the National Academy of Sciences 100(16), 9440-9445.
39. Storey JD. (2003). The positive false discovery rate: A Bayesian interpretation and the
q-value. Annals of Statistics, 31, 2013-2035.
40. Thoday, J.M. (1961). Location of Polygenes. Nature 191: 368-370.
41. Yvert, G., R.B. Brem, J. Whittle, J.M. Akey, E. Foss, E.N. Smith, R. Mackelprang, and L.
Kruglyak. (2003). Trans-acting regulatory variation in Saccharomyces cerevisiae and the
role of transcription factors. Nature Genetics 35(1), 57-64.
42. Zeng, Z.B. (1993). Theoretical basis of separation of multiple linked gene effects on
mapping quantitative trait loci. Proceedings of the National Academy of Sciences 90, 10972-
10976.
43. Zeng, Z.B. (1994). Precision of mapping of quantitative trait loci. Genetics 136, 1457-1468.
26
Table 1: Probability of QTL genotype conditional on the flanking markers.
QTL genotype
Mi M(i+1) Aa AA
Aa Aa (1−rL)(1−rR)1−r
rLrR
1−r
Aa AA (1−rL)rR
r
(1−rR)rL
r
AA Aa (1−rR)rL
r
(1−rL)rR
r
AA AA rLrR
1−r
(1−rL)(1−rR)1−r
where rL is the recombination frequency between marker i
and the putative QTL, rR is the recombination frequency
between marker (i + 1) and the putative QTL, r is the
recombination frequencey between marker i and i + 1.
Table 2: Reproduced from Table 1 of Zeng (1994)
Marker genotype
Group i i + 1 Pr(x∗ = 1)
1 + + 1
2 + -
1 with prob (1 − p)
0 with prob p
3 - +
1 with prob p
0 with prob (1 − p)
4 - - 0
where x∗ is the unknown genotype in the marker interval
and p = rL/r. Double recombination within the marker
the marker interval is ignored.
27
ν0
Pow
er
5−5 5−3 5−1 50 51 53 55
0.80
0.85
0.90
0.95
1.00
TB−MRMB−EBMB−QQ−ALLMOMSAMLIMMA
ν0
FD
R
5−5 5−3 5−1 50 51 53 55
0.0
0.1
0.2
0.3
0.4
TB−MRMB−EBMB−QQ−ALLMOMSAMLIMMA
Figure 1: Average power and FDR (over 20 simulated data sets) for each ν0, with 95% point-wise
confidence intervals.
28
0 20 40 60 80
0.0
0.2
0.4
0.6
0.8
cM
Ave
rage
Pos
terio
r P
roba
bilit
y
MOMpseudo−MOM
0 10 20 30 40 50 60 70 80 90
cM
Map
ping
Tra
nscr
ipts
0 10 20 30 40 50 60 70 80 90
Figure 2: The left panel gives the average posterior probability profiles calculated across mapping
transcripts. The ETL is marked by a vertical line. The right panels show the 500 mapping
transcripts (rows) from the left. For each transcript, genome locations inside (outside) the 96.8%
HPD region for pseudomarker-MOM are colored red (blue). Note that 96.8% is used to compare
with confidence intervals in Figure 3, which are based on existing methods.
29
0 20 40 60 80
510
1520
cM
Ave
rage
LO
D S
core
TB−MRIM
0 10 20 30 40 50 60 70 80 90
cM
Map
ping
Tra
nscr
ipts
0 10 20 30 40 50 60 70 80 90
Figure 3: The left panel gives the average LOD score calculated across mapping transcripts. The
ETL is marked by a vertical line. The right panels show the 500 mapping transcripts (rows). For
each transcript, genome locations within (outside) the 96.8% confidence region for the IM method
are colored red (blue).
30
0 20 40 60 80
0.0
0.1
0.2
0.3
0.4
cM
Ave
rage
Pos
terio
r P
roba
bilit
y
MOMpseudo−MOM
0 10 20 30 40 50 60 70 80 90
cMM
appi
ng T
rans
crip
ts
0 10 20 30 40 50 60 70 80 90
0 20 40 60 80
0.0
0.1
0.2
0.3
0.4
0.5
cM
Ave
rage
Pos
terio
r P
roba
bilit
y
MOMpseudo−MOM
0 10 20 30 40 50 60 70 80 90
cM
Map
ping
Tra
nscr
ipts
0 10 20 30 40 50 60 70 80 90
Figure 4: The left panel gives the average posterior probability profiles calculated across mapping
transcripts for two simulations (top and bottom). Each ETL is marked by a vertical line. The
right panels show the 500 mapping transcripts (rows) from each simulation on the left. For each
transcript, genome locations inside (outside) the 96.8% HPD region for pseudomarker-MOM are
colored red (blue). Note that 96.8% is used to compare with confidence intervals in Figure 5, which
are based on existing methods.
31
0 20 40 60 80
510
1520
cM
Ave
rage
LO
D S
core
TB−MRIM
0 10 20 30 40 50 60 70 80 90
cM
Map
ping
Tra
nscr
ipts
0 10 18 30 40 50 60 70
0 20 40 60 80
810
1214
1618
20
cM
Ave
rage
LO
D S
core
TB−MRIM
0 10 20 30 40 50 60 70 80 90
cM
Map
ping
Tra
nscr
ipts
0 10 18 30 40 50 60 70
Figure 5: The left panel gives the average LOD score calculated across mapping transcripts for
the two simulations shown in Figure 4. Each ETL is marked by a vertical line. The right panels
show the 500 mapping transcripts (rows). For each transcript, genome locations within (outside)
the 96.8% confidence region for the IM method are colored red (blue).
32
0 20 40 60 80
0
20
40
60
80
cM (QTL 2)
cM (
QT
L 1)
2−D MOM Scan
0 10 20 30 40 50 60 70 80 90
0
10
20
30
40
50
60
70
80
90
0
0.02
0.04
0.06
0.08
Figure 6: Average posterior probability of pattern 3 from the 2-D MOM scan, for all marker pairs.
The Y axis is for the first ETL, and the X axis is for the second. The diagonal is the average
posterior probability from the 1-D MOM scan.
33
0 20 40 60 80
0
20
40
60
80
cM (QTL 2)
cM (
QT
L 1)
2−D Marker Regression Scan
0 10 20 30 40 50 60 70 80 90
0
10
20
30
40
50
60
70
80
90
0
0.5
1
1.5
2
2.5
3
3.5
Figure 7: Average LOD scores from 2-D marker regression scan. Upper triangle is the average
LOD scores for epistasis; the lower triangle is the average LOD scores comparing the full model
with the null model; the diagonal is the average LOD score from a 1-D marker regression scan.
The contour lines are also shown.
34
0 20 40 60 80 100
0e+
002e
−06
4e−
066e
−06
8e−
06
100 transcripts
0 200 400 600 800 1000
0.00
0.05
0.10
0.15
0.20
0.25
1000 transcripts
post
erio
r pr
ob
0 1000 3000 5000
0.00
0.05
0.10
0.15
0.20
0.25
0.30
5000 transcripts
post
erio
r pr
ob
0 10000 20000 30000 40000
0.00
0.05
0.10
0.15
0.20
0.25
0.30
40000 transcripts
post
erio
r pr
ob
Figure 8: Posterior probabilities of a mixture component. The data consists of 10 animals. The
number of mapping DE transcripts are 1, 1, 4 and 22 for the four panels. The red dots are true DE
transcripts.
35
0 20 40 60 80 100
0.0
0.2
0.4
0.6
0.8
1.0
100 transcripts
0 200 400 600 800 1000
0.0
0.2
0.4
0.6
0.8
1.0
1000 transcripts
post
erio
r pr
ob
0 1000 3000 5000
0.0
0.2
0.4
0.6
0.8
1.0
5000 transcripts
post
erio
r pr
ob
0 10000 20000 30000 40000
0.0
0.2
0.4
0.6
0.8
1.0
40000 transcripts
post
erio
r pr
ob
Figure 9: Posterior probabilities of a mixture component. The data consists of 60 animals. The
number of mapping DE transcripts are 1, 1, 4 and 22 for the four panels. The red dots are true DE
transcripts.
36
0 20 40 60 80 100
0.0
0.2
0.4
0.6
0.8
1.0
100 transcripts
0 200 400 600 800 1000
0.0
0.2
0.4
0.6
0.8
1.0
1000 transcripts
post
erio
r pr
ob
0 1000 3000 5000
0.0
0.2
0.4
0.6
0.8
1.0
5000 transcripts
post
erio
r pr
ob
0 10000 20000 30000 40000
0.0
0.2
0.4
0.6
0.8
1.0
40000 transcripts
post
erio
r pr
ob
Figure 10: Posterior probabilities of a mixture component. The data consists of 100 animals. The
number of mapping DE transcripts are 1, 1, 4 and 22 for the four panels. The red dots are true DE
transcripts.
37
maps.out
0.0 19.6 25.5 42.9 50.1 62.9 69.6 77.9 85.4 98.7 110.5 121.6
2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 66 70 74 78 82 86 90 94 98 104 110 116
Figure 11: Example of genotype sampling using the HMM. The upper thin panel consists of the
marker data for one animal. The lower panel has 20 samples of pseudomarkers. Blue indicates
AA, yellow for Aa and light blue for missing data.
38
Appendix A
EM in interval mapping
Denote zt and gt to be the mapping location and QTL genotype for transcript t, t = 1, . . . , T .
zt can take on value among 0, 1, . . . , L. When zt = 0, we consider the transcript doens’t map
anywhere in the genome. gt is a vector of length n (number of animals in the experiment.) Denote
z = {z1, . . . , zT}, g = {g1, . . . , gT} and y = {y1, . . . ,yT}.
The complete joint likelihood of the data is:
Lc(θ) = p(m)p(θ)T∏
t=1
p(yt|zt, gt, θ)p(zt)p(gt|m)
= p(m)p(θ)
T∏
t=1
(
[
p0f0(yt)]I(zt=0)
L∏
l=1
[
plfl(yt|glt)]I(zt=l)
p(gt|m)
)
The Q function can be written as:
Q(θ, θ(i−1)) = E[log(Lc(θ))|y, m, θ(i−1)]
=∑
z1
· · ·∑
zT
∫
g1
· · ·
∫
gT
[
T∑
t=1
log(p(zt))
]
p(z, g|y, m, θ(i−1))
+∑
z1
· · ·∑
zT
∫
g1
· · ·
∫
gT
[
T∑
t=1
log(p(gt))
]
p(z, g|y, m, θ(i−1))
+∑
z1
· · ·∑
zT
∫
g1
· · ·
∫
gT
[
T∑
t=1
log(p(yt|zt, gt))
]
p(z, g|y, m, θ(i−1)) + const
= (I) + (II) + (III) + const
We can maximize Q(θ, θ(i−1)) by maximizing (I), (II) and (III) individually. It can be shown that
in the M-step, p̂(gt) = p(gt|y, m) and p̂(zt) =∫
p(zt|yt, gt)p(gt|y, m)dgt. (III) can be maximized
using numerical methods in R.
The mixing proportion pl, l = 1, . . . , L thus can be estimated as 1T
∑T
t=1
∫
p(zt =
l|yt, gt)p(gt|y, m)dgt. And the estimate for p0 can be expressed as 1T
∑Tt=1 p(zt = 0|yt) since
it doesn’t depend on the QTL genotype.
The posterior probability of transcript t mapping to the lth location can be written as:
p(zlt = 1|y, m) =
p(zlt = 1)p(yt|z
lt = 1, m)
p(yt|m)
=p(zl
t = 1)∫
fl(yt|gl)p(gl|m)dgl
p(yt|m)
39
where p(zlt = 1) is the prior probability.
Appendix B
Multiple Imputation: Pseudomarker Algorithm
Sen and Churchill (2001) proposed the pseudomarker algorithm to implement Bayesian QTL
analysis. Suppose that quantitative traits are measured for n members of an inbred line cross.
Denote the traits by y = (y1, y2, . . . , yn)′ and denote the corresponding marker data by the n × M
matrix m where M denotes the total number of markers. Marker location and genetic distances
are assumed known, although in practice these quantities are estimated.
A genetic model H describes the way in which QTL genotypes determine phenotype. A model
is prescribed by the number of QTL, their locations, and the way in which they act and interact
to affect the phenotype. Let µ denote the parameters of the genetic model. Assuming there are p
QTL in the genetic model, let γ denote the p-dimensional vector of QTL locations and g denote
the n × p matrix of QTL genotypes. The authors observed that one can decompose the problem
into two parts conditional on the unknown QTL genotypes.
p(y, m, g, µ, γ) =(
p(y|g, µ)p(µ))(
p(g|m, γ)p(m)p(γ))
(6.10)
Conditional on the QTL genotypes, the genetic part of the problem can be solved independently
from the linkage part.
Of primary interest is the posterior distribution of QTL locations, p(γ|y, m) given by:
p(γ|y, m) ∝ p(γ)
∫
p(y|g) p(g|m, γ) dg (6.11)
An exact evaluation of the above equation is computationally prohibitive, but a Monte Carlo
approximation can be obtained by first sampling multiple versions of the putative QTL genotypes
and then averaging the results. The detailed pseudomarker algorithm is as follows:
1. Select a regularly spaced grid G of pseudomarker locations and create q realizations of
the pseudomarkers by sampling from p(g|m). Assuming known genetic distances and no
crossover interference, a Markov chain sampling scheme can be used. Each realization of
pseudomarker genotypes is an n × G matrix.
40
2. For the assumed genetic model H , a p-dimensional vector of pseduomarker locations
corresponding to the QTL, γH , is prescribed; and the ith realization of pseudomarker
genotypes provides gi(γH), an n×p matrix of pseudomarker genotypes at the QTL locations.
3. For each realization, calculate the weight under the assumed genetic model H . The weight
for the ith realization is:
WH(gi(γH)) = p(y|g = gi(γ
H))p(γ = γH) (6.12)
If the prior on the QTL locations is uniform, then p(γ = γH) can be dropped.
4. An average of these weights approximates (6.11):
p(γ|y, m) ≈ C ·
q∑
i=1
WH(gi(γH))
for some constant of proportionality C.
Appendix C
Importance Sampling in pseudomarker-MOM
In pseudomarker-MOM, we are interested in sampling all the unknown genotypes over the L
locations simultaneously, in order to fit pseudomarker0-MOM. Denote g = {g1, g2, . . . , gn},
where gl is the unknown genotype vector at location l, l = 1, . . . , L. Specifically, we want to
sample from p(g|y, m). Direct sampling is computationally prohibitive. One can get around this
difficulty using importance sampling. Note that
p(g|y, m) ∝ p(y|g)p(g|m)
One can instead sample from p(g|m) and weight each sample by the importance weight p(y|g)
which can be obtained as∫
p(y|zt, g)p(zt)dzt.
p(zlt|yt, m) =
∫
p(zlt|yt, g)p(g|yt, m)dg
≈
∑Qq=1 p(zl
t|yt, gq)∑Q
q=1 p(y|gq)
=pl
∑Q
q=1 fl(yt|glq)
∑Q
q=1
(
p0f0(yt) +∑L
l=1 plfl(yt|glq))
where gq = {g1q , g
2q , . . . , g
Lq } is the qth sample from p(g|m).
41
Appendix D
Theorem 1
Assumptions:
1. For each mapping transcript, there is only 1 ETL.
2. The prior probability of a transcript mapping to the ith component is known, and equal for
all the components.
3. The hyper-parameters are assumed to be known.
Result:
For each mapping transcript, the posterior probability of the pattern that is closest to the true pattern
will have the highest posterior probability among all the patterns fit in the MOM model.
Proof:
For algebraic simplicity, consider a backcross.
We assume the LNN model described in Kendziorski et al. (2003) and use that same notation. The
log predictive density of mRNA expression for transcript t, assuming the null pattern, is:
log(f0(yt)) = −n
2log(2π) −
n − 1
2log(σ2) −
1
2log(σ2 + nτ 2
0 )
−
∑n
i=1(yit − µ0)2
2(σ2 + nτ 20 )
+τ 20 [(∑
yit)2 − n
∑
y2it]
2σ2(σ2 + nτ 20 )
(A-1)
For non-null patterns, the log predictive density is: log(f1(yt)) = log(f0(y0t ))+ log(f0(y
1t )), where
y0t and y1
t denote the tth transcript intensity for population that have genotype 0 and 1, respectively.
Suppose at the ETL, the segregating population has n1 and n2 animals having genotype 0 and 1,
respectively. Then, the log predictive density of pattern 1 at ETL can be written as
log(f ∗1 (yt)) = −
n1
2log(2π) −
n1 − 1
2log(σ2) −
1
2log(σ2 + n1τ
20 )
−
∑
i∈P 0(y0it − µ0)
2
2(σ2 + n1τ 20 )
+τ 20 [(∑
y0it)
2 − n1
∑
(y0it)
2]
2σ2(σ2 + n1τ 20 )
−n2
2log(2π) −
n2 − 1
2log(σ2) −
1
2log(σ2 + n2τ
20 )
−
∑
i∈P 1(y1it − µ0)
2
2(σ2 + n2τ 20 )
+τ 20 [(∑
y1it)
2 − n2
∑
(y1it)
2]
2σ2(σ2 + n2τ 20 )
(A-2)
42
Since n1 + n2 is fixed for a given experiment, comparison of (A-2) is equivalent to comparison of
−1
2log(σ2 + n1τ
20 ) −
1
2log(σ2 + n2τ
20 ) +
τ 20 (∑
y0it + µ0σ2
τ2
0
)2
2σ2(σ2 + n1τ 20 )
+τ 20 (∑
y1it + µ0σ2
τ2
0
)2
2σ2(σ2 + n2τ 20 )
(A-3)
for n1, n2 and y0it, y
1it evaluated at each MOM component.
Now, consider a MOM pattern that is described according to a test location l that has recombination
frequency r to the ETL. The assignment of y0it and y1
it has changed according to the genotype at
that location, denote as y′0it and y
′1it , respectively. (A-3) evaluated at l becomes
−1
2log(σ2 + (n1(1 − r) + n2r)τ
20 ) −
1
2log(σ2 + (n2(1 − r) + n1r)τ
20 )
+τ 20 (∑
y′0it + µ0σ2
τ2
0
)2
2σ2(σ2 + (n1(1 − r) + n2r)τ 2)+
τ 20 (∑
y′1it + µ0σ2
τ2
0
)2
2σ2(σ2 + (n2(1 − r) + n1r)τ 2)
(A-4)
We therefore have
∑
i
y′0it +
µ0σ2
τ 20
∼ N(µ0(n1(1 − r) + n2r) +µ0σ
2
τ 20
, v20)
∑
i
y′1it +
µ0σ2
τ 20
∼ N(µ0(n1(1 − r) + n2r) +µ0σ
2
τ 20
, v21)
where v20 = (σ2 + τ 2
0 )(n1(1 − r) + n2r) + τ 20 (n1(n1 − 1)(1 − r)2 + n2(n2 − 1)r2) and v2
1 =
(σ2 + τ 20 )(n2(1 − r) + n1r) + τ 2
0 (n2(n2 − 1)(1 − r)2 + n1(n1 − 1)r2).
Thus,(∑
i y′0it + µ0σ2
τ2
0
)2
v0
∼ χ21(δ).
with non-central parameter δ =
(
µ0(n1(1−r)+n2r)+µ0σ2
τ20
)2
v2
0
. Similary for(P
i y′1
it +µ0σ2
τ20
)2
v1.
Taking the expection of (A-4), we are left with
−1
2log(
σ2 + (n1(1 − r) + n2r)τ20
)
−1
2log(
σ2 + (n2(1 − r) + n1r)τ20
)
+ (I) + (II) (A-5)
43
where
(I) =τ 20
2σ2
(σ2 + τ 20 )(n1(1 − r) + n2r) + τ 2
0 (n1(n1 − 1)(1 − r)2 + n2(n2 − 1)r2)
σ2 + τ 20 (n1(1 − r) + n2r)
+µ2
0
(
σ2 + τ 20 (n1(1 − r) + n2r)
)
2σ2τ 20
(II) =τ 20
2σ2
(σ2 + τ 20 )(n2(1 − r) + n1r) + τ 2
0 (n2(n2 − 1)(1 − r)2 + n1(n1 − 1)r2)
σ2 + τ 20 (n2(1 − r) + n1r)
+µ2
0
(
σ2 + τ 20 (n2(1 − r) + n1r)
)
2σ2τ 20
We want to show (A-5) is decreasing in r, in other words, it reaches its maximum when r = 0.
Taking the partial derivative with respect to r of the first two terms, we have
−12
τ4
0(n1−n2)2(1−2r)
(
σ2+τ2
0(n1(1−r)+n2r)
)(
σ2+τ2
0(n2(1−r)+n1r)
) , which is ≤ 0 since 0 ≤ r ≤ 12.
Taking the partial derivative with respect to r of (I) + (II) is a little messy. It can be shown that
∂(I)
∂r+
∂(II)
∂r=
τ 20
2σ2
[ rB + C
(σ2 + τ 20 (n1(1 − r) + n2r))2(σ2 + τ 2
0 (n2(1 − r) + n1r))2
]
(A-6)
where B = 2τ 20 n1(n1 − 1)(σ2 + n2τ
20 )2(2σ2 + (n1 + n2)τ
20 ) + 2τ 2
0 n2(n2 − 1)(σ2 + n1τ20 )2(2σ2 +
(n1 + n2)τ20 ) − 2σ2τ 2
0 (σ2 + τ 20 )(n2 − n1)
2(2σ2 + (n1 + n2)τ20 ) and C = −B
2.
When r = 12, (A-6)= 0, and when r = 0, (A-6)< 0, since
B = (2n1n2 − n1 − n2)(n1n2τ40 + σ2τ 2
0 (n1 + n2) + σ4)
> 0
Therefore, the partial derivative of (A-4) with respect to r is < 0. (A-2) is maximized in expectation
when r = 0. Assuming equal prior probabilities of patterns, we show that the posterior probability
of the pattern described by a test location that is closest to the ETL is the highest.
44