taking the broad view of model selection for qtl in...

January 2003 PAG XI © Brian S. Yandell 1

Taking the Broad Viewof Model Selection for QTL

in Experimental Crosses

Brian S. YandellUniversity of Wisconsin-Madison

www.stat.wisc.edu/~yandell/statgenwith Chunfang “Amy” Jin, UW-Madison,

Patrick J. Gaffney, Lubrizol,and Jaya M. Satagopan, Sloan-Kettering

Plant & Animal Genome XI, January 2003

0 5 10 15 20 25 30

rank order of QTL

Pareto diagram of QTL effects

major QTL onlinkage map

majorQTL

minorQTL

polygenes

(modifiers)

how many (detectable) QTL?• build m = number of QTL detected into model

– directly allow uncertainty in genetic architecture– model selection over number of QTL, architecture– use Bayes factors and model averaging

• to identify “better” models• many, many QTL may affect most any trait

– how many QTL are detectable with these data?• limits to useful detection (Bernardo 2000)• depends on sample size, heritability, environmental variation

– consider probability that a QTL is in the model• avoid sharp in/out dichotomy• major QTL usually selected, minor QTL sampled infrequently

interval mapping basics• observed measurements

– Y = phenotypic trait– X = markers & linkage map

• i = individual index 1,…,n• missing data

– missing marker data– Q = QT genotypes

• alleles QQ, Qq, or qq at locus• unknown genetic architecture

– λ = QT locus (or loci)– θ = genetic action– m = number of QTL

• pr(Q|X,λ,m) recombination model– grounded by linkage map, experimental cross– recombination yields multinomial for Q given X

• pr(Y|Q,θ,m) phenotype model– distribution shape (assumed normal here) – unknown parameters θ (could be non-parametric)

observed X Y

missing Q

unknown λ θafter

Sen Churchill (2001)

Classical vs. Bayesian IM• MIM: classical LOD: mix over genotypes Q

– L(λ,θ|Y,m) = pr(Y|X,λ,θ,m)= producti [sumQ pr(Q|Xi,λ,m) pr(Yi|Q,θ,m)]

– maximize LOD(λ) = 2.3log(LR(λ)) = maxθ log10 L(λ,θ|Y,m)/L(µ|Y)– threshold for testing presence of QTL– Kao Zeng Teasdale 1999; Zeng et al. 2000; Broman Speed 2002

• BIM: Bayesian posterior: Q as missing data– sample genotypes Q, loci λ, effects θ and number of QTL m

• pr(λ,Q,θ,m|Y,X) = [producti pr(Qi|Xi,λ,m) pr(Yi|Qi,θ,m)] pr(λ,θ|X,m)pr(m)– study marginal posteriors

• pr(λ,θ|Y,X,m) = sumQ pr(λ,Q,θ|Y,X,m) with m fixed• pr(m|Y,X) = sum(λ,θ) pr(λ,θ|Y,X,m)pr(m)

– threshold for posterior “power” (positive false discovery rate)– Satagopan et al. 1996; Gaffney 2001; Yi Xu 2002

Model Selection for QTL• what is the genetic architecture?

– M = model = (λ,θ,m)– λ = QT locus (or loci)– θ = genetic action (additive, dominance, epistasis)– m = number of QTL

• how to assess models?– MIM: various flavors of AIC, BIC– BIM: Bayes factors

• how to search model space?– MIM: sequential forward selection/backward elimination

• scan loci systematically across genome– BIM: sample forward/backward: transdimensional MCMC

• sample loci at random across genome

Bayes factors to assess models• Bayes factor: which model best supports the data?

– ratio of posterior odds to prior odds– ratio of model likelihoods

• equivalent to LR statistic when– comparing two nested models– simple hypotheses (e.g. 1 vs 2 QTL)

• related to Bayes Information Criteria (BIC)– Schwartz introduced for model selection in general settings– penalty to balance model size (p = number of parameters)

)log(2)log(2)log(2),1|pr(

),|pr()1pr(/)pr(

),|1(pr/),|(pr

nLRBFXmY

XYmXYmBF

−−=−+

QTL Bayes factors & RJ-MCMC• easy to compute Bayes factors from samples

– posterior pr(m|Y,X) is marginal histogram– posterior affected by prior pr(m)

• BF insensitive to shape of prior– geometric, Poisson, uniform– precision improves when prior mimics posterior

• BF sensitivity to prior variance on effects θ– prior variance should reflect data variability– resolved by using hyper-priors

• automatic algorithm; no need for user tuning

ee e e e e

0 2 4 6 8 10

pp p p p

u u u u u u u

u u u u

m = number of QTL

exponentialPoissonuniform

• Y = µ + GQ + environment• partition genotypic effect into separate QTL effects

GQ = main QTL effects + epistatic interactionsGQ = θ1Q + . . . + θmQ + θ12Q + . . .

• priors on mean and effectsGQ ~ N(0, h2s2) model independent genotypic priorθjQ ~ N(0, κ1s2/m.) effects and interactionsθj2Q ~ N(0, κ2s2/m.) down-weighted

• hyperparameters (to reduce sensitivity of Bayes factors to prior)– s2 = total sample variance– m.= m+m2 = number of QTL effects and interactions– h2 = κ1+κ2 = unknown heritability, h2/2~Beta(a,b)

multiple QTL phenotype model

Markov chain Monte Carlo idea

0 2 4 6 8 10

have posterior pr(θ |Y)want to draw samples

propose θ ~ pr(θ |Y)(ideal: Gibbs sample)

propose new θ “nearby”accept if more probabletoss coin if less probablebased on relative heights(Metropolis-Hastings)

MCMC realization

0 2 4 6 8 10

2 3 4 5 6 7 8

added twist: occasionally propose from whole domain

a complicated simulation•simulated F2 intercross, 8 QTL

– (Stephens, Fisch 1998)– n=200, heritability = 50%– detected 3 QTL

•increase to detect all 8– n=500, heritability to 97%

QTL chr loci effect1 1 11 –32 1 50 –53 3 62 +24 6 107 –35 6 152 +36 8 32 –47 8 54 +18 9 195 +2

7 8 9 10 11 12 13

number of QTL

0 50 100 150 200

ch1ch2ch3ch4ch5ch6ch7ch8ch9

Genetic map

posterior

loci pattern across genome• notice which chromosomes have persistent loci• best pattern found 42% of the time

Chromosome m 1 2 3 4 5 6 7 8 9 10 Count of 80008 2 0 1 0 0 2 0 2 1 0 33719 3 0 1 0 0 2 0 2 1 0 7517 2 0 1 0 0 2 0 1 1 0 3779 2 0 1 0 0 2 0 2 1 0 2189 2 0 1 0 0 3 0 2 1 0 2189 2 0 1 0 0 2 0 2 2 0 198

Bmapqtl: our RJ-MCMC software• www.stat.wisc.edu/~yandell/qtl/software/Bmapqtl

– module using QtlCart format– compiled in C for Windows/NT– extensions in progress– R post-processing graphics

• library(bim) is cross-compatible with library(qtl)

• Bayes factor and reversible jump MCMC computation• enhances MCMCQTL and revjump software

– initially designed by JM Satagopan (1996)– major revision and extension by PJ Gaffney (2001)

• whole genome• multivariate update of effects; long range position updates• substantial improvements in speed, efficiency• pre-burnin: initial prior number of QTL very large

B. napus 8-week vernalizationwhole genome study

• 108 plants from double haploid– similar genetics to backcross: follow 1 gamete– parents are Major (biennial) and Stellar (annual)

• 300 markers across genome– 19 chromosomes– average 6cM between markers

• median 3.8cM, max 34cM– 83% markers genotyped

• phenotype is days to flowering– after 8 weeks of vernalization (cooling)– Stellar parent requires vernalization to flower

Markov chain Monte Carlo sequence

burnin (sets up chain)mcmc sequence

number of QTLenvironmental varianceh2 = heritability(genetic/total variance)LOD = likelihood

-50000 -40000 -30000 -20000 -10000 0

burnin sequence

-50000 -40000 -30000 -20000 -10000 00.

burnin sequenceen

-50000 -40000 -30000 -20000 -10000 0

burnin sequence

-50000 -40000 -30000 -20000 -10000 0

burnin sequence

0 e+00 2 e+05 4 e+05 6 e+05 8 e+05 1 e+0

mcmc sequence

0 e+00 2 e+05 4 e+05 6 e+05 8 e+05 1 e+0

mcmc sequence

0 e+00 2 e+05 4 e+05 6 e+05 8 e+05 1 e+0

mcmc sequence

0 e+00 2 e+05 4 e+05 6 e+05 8 e+05 1 e+05

20mcmc sequence

MCMC sampled loci

subset of chromosomesN2, N3, N16

points jittered for viewblue lines at markers

note concentrationon chromosome N2

N2 N3 N16chromosome

ec2e5a

E33M49.117

ec3b12wg2d11b

wg1g8a

E32M49.73wg3g11ec3a8wg7f3aE33M59.59

E35M62.117wg6b10

wg8g1bwg5a5

tg6a12

Aca1E38M50.133

E35M59.117

ec2d1a

wg1a10tg2h10b

tg2f12

ec3e12b

wg5a1bwg2a3c

ec5e12awg7f5bE32M47.159

tg5d9bslg6E35M59.85

E33M49.491E38M62.229wg1g10bec4h7

wg4d10

E32M50.409

E38M50.159

E33M47.338ec2d8a

E33M49.165wg4f4cE35M48.148wg6c6ec4g7bwg7a11

wg5b1aec4e7aE32M61.218

wg4a4bE33M50.183E33M62.140

E33M49.211wg2e12bisoIdh

wg3f7c

Bayesian model assessment

row 1: # QTLrow 2: pattern

col 1: posteriorcol 2: Bayes factornote error bars on bf

evidence suggests4-5 QTLN2(2-3),N3,N16

1 3 5 7 9 110.

3number of QTL

QTL posterior

number of QTL

Bayes factor ratios

1 3 5 7 9 11

moderate

strong

1 3 5 7 9 11 13

model index

pattern posterior

model indexpo

Bayes factor ratios

1 3 5 7 9 11 13

moderate

strong

Bayesian model diagnostics pattern: N2(2),N3,N16col 1: densitycol 2: boxplots by m

environmental varianceσ2 = .008, σ = .09

heritabilityh2 = 52%

LOD = 16(highly significant)

but note change with m

0.004 0.006 0.008 0.010 0.012

marginal envvar, m ≥ 44 5 6 7 8 9 11 12

envvar conditional on number of QTL

0.2 0.3 0.4 0.5 0.6 0.7

marginal heritability, m ≥ 44 5 6 7 8 9 11 12

heritability conditional on number of QT

abilit

y5 10 15 20 25

marginal LOD, m ≥ 44 5 6 7 8 9 11 12

LOD conditional on number of QTLLO

Bayesian estimates of loci & effects

histogram of lociblue line is densityred lines at estimates

estimate additive effects(red circles)

grey points sampledfrom posterior

blue line is cubic splinedashed line for 2 SD

0 50 100 150 200 2500.

napus8 summaries with pattern 1,1,2,3 and m ≥ 4

N2 N3 N16

0 50 100 150 200 250

N2 N3 N16

loci marginal posteriors

unlinked loci linked loci

mapping gene expression

• 108 F2 mice• mRNA to RT-PCR• multivariate screen

• clustering• PC analysis

• highlight SCD• Lan et al. (2003)

• ch2 dominance

false detection rates and posteriors

• multiple comparisons: test QTL across genome– size = Pr( LOD(λ) > t | no QTL at λ )– genome-wise threshold

• theoretical value or permutation value (Churchill Doerge 1995)

– threshold guards against a single false detection– difficult to extend to multiple QTL

• positive false discovery rate (Storey 2001)– pFDR = Pr( no QTL at λ | LOD(λ) > t )– consider proportion of false detections for threshold– related to Bayesian posterior– extends naturally to multiple QTL

pFDR and QTL posterior• single QTL case

– pick a rejection region R = {λ|LOD(λ)>t} for some t– pFDR = Pr(m=0)*size/[Pr(m=0)*size+Pr(m=1)*power]– power = Pr(λ in R | Y, X, m = 1 )– size = (length of R) / (length of genome)

• multiple QTL case– pFDR = Pr(m=0)*size/[Pr(m=0)*size+Pr(m>1)*power]– power = Pr(λ in R | Y, X, m > 1 )

• extends to other null hypotheses– pFDR = Pr(m=1)*size/[Pr(m=1)*size+Pr(m>2)*power]

B napus with m~Poisson(1)

Summary• Bayesian posteriors and Bayes factors

– Bayes factors for model assessment– posteriors can reveal subtle hints of QTL

• graphical tools for model selection– Bayes factor ratios on log scale– model identified by m or genetic architecture

• connection to false discovery rate– whole genome evaluation– calibrate posterior region with pFDR

taking the broad view of model selection for qtl in...

Documents

qtl statics

causal network models for correlated quantitative traits...

june 2002ncsu qtl ii © brian s. yandell1 efficient and...

fund performance · 2019-02-28 · citywire rating rayner...

qtl 2: overviewseattle sisg: yandell © 20101 seattle summer...

a guide to qtl mapping with r/qtl

qtl model selection: key...

qtl cartographer

qtl 2: tutorialseattle sisg: yandell © 20081 r/qtl &...

gcvpack routines for generalized cross validation douglas m....

yandell © june 20051 bayesian analysis of microarray traits...

quantile-based permutation thresholds for qtl...

april 2008uw-madison © brian s. yandell1 bayesian qtl...

r/qtl & r/qtlbim tutorials -...

qtl mapping of powdery mildew resistance in wi 2757...

pmfs - pages.cs.wisc.edu

quantile-based permutation thresholds for qtl hotspots brian...

overview of multiple qtl - university of...

augustus courts yandell

model selectionseattle sisg: yandell © 20121 qtl model...