taking the broad view of model selection for qtl in...
Post on 21-Mar-2020
3 Views
Preview:
TRANSCRIPT
January 2003 PAG XI © Brian S. Yandell 1
Taking the Broad Viewof Model Selection for QTL
in Experimental Crosses
Brian S. YandellUniversity of Wisconsin-Madison
www.stat.wisc.edu/~yandell/statgenwith Chunfang “Amy” Jin, UW-Madison,
Patrick J. Gaffney, Lubrizol,and Jaya M. Satagopan, Sloan-Kettering
Plant & Animal Genome XI, January 2003
January 2003 PAG XI © Brian S. Yandell 2
0 5 10 15 20 25 30
01
23
rank order of QTL
addi
tive
effe
ct
Pareto diagram of QTL effects
54
3
2
1
major QTL onlinkage map
majorQTL
minorQTL
polygenes
(modifiers)
January 2003 PAG XI © Brian S. Yandell 3
how many (detectable) QTL?• build m = number of QTL detected into model
– directly allow uncertainty in genetic architecture– model selection over number of QTL, architecture– use Bayes factors and model averaging
• to identify “better” models• many, many QTL may affect most any trait
– how many QTL are detectable with these data?• limits to useful detection (Bernardo 2000)• depends on sample size, heritability, environmental variation
– consider probability that a QTL is in the model• avoid sharp in/out dichotomy• major QTL usually selected, minor QTL sampled infrequently
January 2003 PAG XI © Brian S. Yandell 4
interval mapping basics• observed measurements
– Y = phenotypic trait– X = markers & linkage map
• i = individual index 1,…,n• missing data
– missing marker data– Q = QT genotypes
• alleles QQ, Qq, or qq at locus• unknown genetic architecture
– λ = QT locus (or loci)– θ = genetic action– m = number of QTL
• pr(Q|X,λ,m) recombination model– grounded by linkage map, experimental cross– recombination yields multinomial for Q given X
• pr(Y|Q,θ,m) phenotype model– distribution shape (assumed normal here) – unknown parameters θ (could be non-parametric)
observed X Y
missing Q
unknown λ θafter
Sen Churchill (2001)
January 2003 PAG XI © Brian S. Yandell 5
Classical vs. Bayesian IM• MIM: classical LOD: mix over genotypes Q
– L(λ,θ|Y,m) = pr(Y|X,λ,θ,m)= producti [sumQ pr(Q|Xi,λ,m) pr(Yi|Q,θ,m)]
– maximize LOD(λ) = 2.3log(LR(λ)) = maxθ log10 L(λ,θ|Y,m)/L(µ|Y)– threshold for testing presence of QTL– Kao Zeng Teasdale 1999; Zeng et al. 2000; Broman Speed 2002
• BIM: Bayesian posterior: Q as missing data– sample genotypes Q, loci λ, effects θ and number of QTL m
• pr(λ,Q,θ,m|Y,X) = [producti pr(Qi|Xi,λ,m) pr(Yi|Qi,θ,m)] pr(λ,θ|X,m)pr(m)– study marginal posteriors
• pr(λ,θ|Y,X,m) = sumQ pr(λ,Q,θ|Y,X,m) with m fixed• pr(m|Y,X) = sum(λ,θ) pr(λ,θ|Y,X,m)pr(m)
– threshold for posterior “power” (positive false discovery rate)– Satagopan et al. 1996; Gaffney 2001; Yi Xu 2002
January 2003 PAG XI © Brian S. Yandell 6
Model Selection for QTL• what is the genetic architecture?
– M = model = (λ,θ,m)– λ = QT locus (or loci)– θ = genetic action (additive, dominance, epistasis)– m = number of QTL
• how to assess models?– MIM: various flavors of AIC, BIC– BIM: Bayes factors
• how to search model space?– MIM: sequential forward selection/backward elimination
• scan loci systematically across genome– BIM: sample forward/backward: transdimensional MCMC
• sample loci at random across genome
January 2003 PAG XI © Brian S. Yandell 7
Bayes factors to assess models• Bayes factor: which model best supports the data?
– ratio of posterior odds to prior odds– ratio of model likelihoods
• equivalent to LR statistic when– comparing two nested models– simple hypotheses (e.g. 1 vs 2 QTL)
• related to Bayes Information Criteria (BIC)– Schwartz introduced for model selection in general settings– penalty to balance model size (p = number of parameters)
)log(2)log(2)log(2),1|pr(
),|pr()1pr(/)pr(
),|1(pr/),|(pr
nLRBFXmY
XmYmm
XYmXYmBF
−−=−+
=++=
January 2003 PAG XI © Brian S. Yandell 8
QTL Bayes factors & RJ-MCMC• easy to compute Bayes factors from samples
– posterior pr(m|Y,X) is marginal histogram– posterior affected by prior pr(m)
• BF insensitive to shape of prior– geometric, Poisson, uniform– precision improves when prior mimics posterior
• BF sensitivity to prior variance on effects θ– prior variance should reflect data variability– resolved by using hyper-priors
• automatic algorithm; no need for user tuning
e
e
e
ee
ee e e e e
0 2 4 6 8 10
0.00
0.10
0.20
0.30
p
p
p p
p
p
pp p p p
u u u u u u u
u u u u
m = number of QTL
prio
r pro
babi
lity
epu
exponentialPoissonuniform
January 2003 PAG XI © Brian S. Yandell 9
• Y = µ + GQ + environment• partition genotypic effect into separate QTL effects
GQ = main QTL effects + epistatic interactionsGQ = θ1Q + . . . + θmQ + θ12Q + . . .
• priors on mean and effectsGQ ~ N(0, h2s2) model independent genotypic priorθjQ ~ N(0, κ1s2/m.) effects and interactionsθj2Q ~ N(0, κ2s2/m.) down-weighted
• hyperparameters (to reduce sensitivity of Bayes factors to prior)– s2 = total sample variance– m.= m+m2 = number of QTL effects and interactions– h2 = κ1+κ2 = unknown heritability, h2/2~Beta(a,b)
multiple QTL phenotype model
January 2003 PAG XI © Brian S. Yandell 10
Markov chain Monte Carlo idea
0 2 4 6 8 10
0.0
0.1
0.2
0.3
0.4
θ
pr(θ
|Y)
have posterior pr(θ |Y)want to draw samples
propose θ ~ pr(θ |Y)(ideal: Gibbs sample)
propose new θ “nearby”accept if more probabletoss coin if less probablebased on relative heights(Metropolis-Hastings)
January 2003 PAG XI © Brian S. Yandell 11
MCMC realization
0 2 4 6 8 10
020
0040
0060
0080
0010
000
θ
mcm
c se
quen
ce
2 3 4 5 6 7 8
0.0
0.1
0.2
0.3
0.4
0.5
0.6
θ
pr(θ
|Y)
added twist: occasionally propose from whole domain
January 2003 PAG XI © Brian S. Yandell 12
a complicated simulation•simulated F2 intercross, 8 QTL
– (Stephens, Fisch 1998)– n=200, heritability = 50%– detected 3 QTL
•increase to detect all 8– n=500, heritability to 97%
QTL chr loci effect1 1 11 –32 1 50 –53 3 62 +24 6 107 –35 6 152 +36 8 32 –47 8 54 +18 9 195 +2
7 8 9 10 11 12 13
010
2030
40
number of QTL
frequ
ency
in %
0 50 100 150 200
ch1ch2ch3ch4ch5ch6ch7ch8ch9
ch10
Genetic map
posterior
January 2003 PAG XI © Brian S. Yandell 13
loci pattern across genome• notice which chromosomes have persistent loci• best pattern found 42% of the time
Chromosome m 1 2 3 4 5 6 7 8 9 10 Count of 80008 2 0 1 0 0 2 0 2 1 0 33719 3 0 1 0 0 2 0 2 1 0 7517 2 0 1 0 0 2 0 1 1 0 3779 2 0 1 0 0 2 0 2 1 0 2189 2 0 1 0 0 3 0 2 1 0 2189 2 0 1 0 0 2 0 2 2 0 198
January 2003 PAG XI © Brian S. Yandell 14
Bmapqtl: our RJ-MCMC software• www.stat.wisc.edu/~yandell/qtl/software/Bmapqtl
– module using QtlCart format– compiled in C for Windows/NT– extensions in progress– R post-processing graphics
• library(bim) is cross-compatible with library(qtl)
• Bayes factor and reversible jump MCMC computation• enhances MCMCQTL and revjump software
– initially designed by JM Satagopan (1996)– major revision and extension by PJ Gaffney (2001)
• whole genome• multivariate update of effects; long range position updates• substantial improvements in speed, efficiency• pre-burnin: initial prior number of QTL very large
January 2003 PAG XI © Brian S. Yandell 15
B. napus 8-week vernalizationwhole genome study
• 108 plants from double haploid– similar genetics to backcross: follow 1 gamete– parents are Major (biennial) and Stellar (annual)
• 300 markers across genome– 19 chromosomes– average 6cM between markers
• median 3.8cM, max 34cM– 83% markers genotyped
• phenotype is days to flowering– after 8 weeks of vernalization (cooling)– Stellar parent requires vernalization to flower
January 2003 PAG XI © Brian S. Yandell 16
Markov chain Monte Carlo sequence
burnin (sets up chain)mcmc sequence
number of QTLenvironmental varianceh2 = heritability(genetic/total variance)LOD = likelihood
-50000 -40000 -30000 -20000 -10000 0
05
1015
2025
burnin sequence
nqtl
-50000 -40000 -30000 -20000 -10000 00.
006
0.01
00.
014
burnin sequenceen
vvar
-50000 -40000 -30000 -20000 -10000 0
0.0
0.2
0.4
0.6
burnin sequence
herit
-50000 -40000 -30000 -20000 -10000 0
05
1015
20
burnin sequence
LOD
0 e+00 2 e+05 4 e+05 6 e+05 8 e+05 1 e+0
24
68
10
mcmc sequence
nqtl
0 e+00 2 e+05 4 e+05 6 e+05 8 e+05 1 e+0
0.00
40.
010
0.01
6
mcmc sequence
envv
ar
0 e+00 2 e+05 4 e+05 6 e+05 8 e+05 1 e+0
0.1
0.3
0.5
mcmc sequence
herit
0 e+00 2 e+05 4 e+05 6 e+05 8 e+05 1 e+05
1015
20mcmc sequence
LOD
January 2003 PAG XI © Brian S. Yandell 17
MCMC sampled loci
subset of chromosomesN2, N3, N16
points jittered for viewblue lines at markers
note concentrationon chromosome N2
020
4060
8010
012
0
N2 N3 N16chromosome
MC
MC
sam
pled
loci
ec2e5a
E33M49.117
ec3b12wg2d11b
wg1g8a
E32M49.73wg3g11ec3a8wg7f3aE33M59.59
E35M62.117wg6b10
wg8g1bwg5a5
tg6a12
Aca1E38M50.133
E35M59.117
ec2d1a
wg1a10tg2h10b
tg2f12
ec3e12b
wg5a1bwg2a3c
ec5e12awg7f5bE32M47.159
tg5d9bslg6E35M59.85
E33M49.491E38M62.229wg1g10bec4h7
wg4d10
E32M50.409
E38M50.159
E33M47.338ec2d8a
E33M49.165wg4f4cE35M48.148wg6c6ec4g7bwg7a11
wg5b1aec4e7aE32M61.218
wg4a4bE33M50.183E33M62.140
wg9c7
wg6b2
E33M49.211wg2e12bisoIdh
wg3f7c
January 2003 PAG XI © Brian S. Yandell 18
Bayesian model assessment
row 1: # QTLrow 2: pattern
col 1: posteriorcol 2: Bayes factornote error bars on bf
evidence suggests4-5 QTLN2(2-3),N3,N16
1 3 5 7 9 110.
00.
10.
20.
3number of QTL
QTL
pos
terio
r
QTL posterior
15
5050
0
number of QTL
post
erio
r / p
rior
Bayes factor ratios
1 3 5 7 9 11
weak
moderate
strong
1 3 5 7 9 11 13
0.0
0.1
0.2
0.3
2*2
2:2,
123:
2*2,
122:
2,13
2:2,
32:
2,16
2:2,
113:
2*2,
32:
2,15
4:2*
2,3,
163:
2*2,
132:
2,14
2
model index
mod
el p
oste
rior
pattern posterior
5 e
-01
1 e
+01
5 e
+02
model indexpo
ster
ior /
prio
r
Bayes factor ratios
1 3 5 7 9 11 13
12
23
2 22
23
24
32
weak
moderate
strong
January 2003 PAG XI © Brian S. Yandell 19
Bayesian model diagnostics pattern: N2(2),N3,N16col 1: densitycol 2: boxplots by m
environmental varianceσ2 = .008, σ = .09
heritabilityh2 = 52%
LOD = 16(highly significant)
but note change with m
0.004 0.006 0.008 0.010 0.012
050
100
200
dens
ity
marginal envvar, m ≥ 44 5 6 7 8 9 11 12
0.00
60.
008
0.01
0
envvar conditional on number of QTL
envv
ar
0.2 0.3 0.4 0.5 0.6 0.7
01
23
45
dens
ity
marginal heritability, m ≥ 44 5 6 7 8 9 11 12
0.30
0.40
0.50
0.60
heritability conditional on number of QT
herit
abilit
y5 10 15 20 25
0.00
0.04
0.08
0.12
dens
ity
marginal LOD, m ≥ 44 5 6 7 8 9 11 12
1012
1416
1820
LOD conditional on number of QTLLO
D
January 2003 PAG XI © Brian S. Yandell 20
Bayesian estimates of loci & effects
histogram of lociblue line is densityred lines at estimates
estimate additive effects(red circles)
grey points sampledfrom posterior
blue line is cubic splinedashed line for 2 SD
0 50 100 150 200 2500.
000.
020.
040.
06lo
ci h
isto
gram
napus8 summaries with pattern 1,1,2,3 and m ≥ 4
N2 N3 N16
0 50 100 150 200 250
-0.0
8-0
.02
0.02
addi
tive
N2 N3 N16
January 2003 PAG XI © Brian S. Yandell 21
loci marginal posteriors
unlinked loci linked loci
January 2003 PAG XI © Brian S. Yandell 22
mapping gene expression
• 108 F2 mice• mRNA to RT-PCR• multivariate screen
• clustering• PC analysis
• highlight SCD• Lan et al. (2003)
• ch2 dominance
January 2003 PAG XI © Brian S. Yandell 23
false detection rates and posteriors
• multiple comparisons: test QTL across genome– size = Pr( LOD(λ) > t | no QTL at λ )– genome-wise threshold
• theoretical value or permutation value (Churchill Doerge 1995)
– threshold guards against a single false detection– difficult to extend to multiple QTL
• positive false discovery rate (Storey 2001)– pFDR = Pr( no QTL at λ | LOD(λ) > t )– consider proportion of false detections for threshold– related to Bayesian posterior– extends naturally to multiple QTL
January 2003 PAG XI © Brian S. Yandell 24
pFDR and QTL posterior• single QTL case
– pick a rejection region R = {λ|LOD(λ)>t} for some t– pFDR = Pr(m=0)*size/[Pr(m=0)*size+Pr(m=1)*power]– power = Pr(λ in R | Y, X, m = 1 )– size = (length of R) / (length of genome)
• multiple QTL case– pFDR = Pr(m=0)*size/[Pr(m=0)*size+Pr(m>1)*power]– power = Pr(λ in R | Y, X, m > 1 )
• extends to other null hypotheses– pFDR = Pr(m=1)*size/[Pr(m=1)*size+Pr(m>2)*power]
January 2003 PAG XI © Brian S. Yandell 25
B napus with m~Poisson(1)
January 2003 PAG XI © Brian S. Yandell 26
Summary• Bayesian posteriors and Bayes factors
– Bayes factors for model assessment– posteriors can reveal subtle hints of QTL
• graphical tools for model selection– Bayes factor ratios on log scale– model identified by m or genetic architecture
• connection to false discovery rate– whole genome evaluation– calibrate posterior region with pFDR
top related