Statistical Methods for Quantitative Trait Loci (QTL) Mapping II

Lecture 5 (Oct 12, 2011)
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee
TA: Christopher Miles
Monday & Wednesday 12:00-1:20, Johnson Hall (JHN) 022

Course Announcements
  HW #1 is out
  Project proposal
    Due next Wed
    1 paragraph describing what you’d like to work on for the class project.


Page 1: Statistical Methods for Quantitative Trait Loci (QTL ...suinlee/cse527/notes/lecture5-eQTLmapping-annotated.pdf6 A simulated example LOD score curves 11 Genetic markers Interval mapping



Why are we so different? Human genetic diversity

Different “phenotype”
  Appearance
  Disease susceptibility
  Drug responses
  …

Different “genotype”
  Individual-specific DNA: a 3-billion-long string

……ACTGTTAGGCTGAGCTAGCCCAAAATTTATAGC

GTCGACTGCAGGGTCCACCAAAGCTCGACTGCAGTCGACGACCTAAAATTTAACCGACTACGAGATGGGCACGTCACTTTTACGCAGCTTGATGATGCTAGCTGATCGTAGCTAAATGCATCAGCTGATGATCGTAGCTAAATGCATCAGCTGATGATCGTAGCTAAATGCATCAGCTGATGATCGTAGCTAAATGCATCAGCTGATTCACTTTTACGCAGCTTGATGACGACTACGAGATGGGCACGTTCACCATCTACTACTACTCATCTACTCATCAACCAAAAACACTACTCATCATCATCATCTACATCTATCATCATCACATCTACTGGGGGTGGGATAGATAGTGTGCTCGATCGATCGATCGTCAGCTGATCGACGGCAG……

Phenotype: any observable characteristic or trait.

TGATCGAAGCTAAATGCATCAGCTGATGATCCTAGC…

TGATCGTAGCTAAATGCATCAGCTGATGATCGTAGC…

TGATCGCAGCTAAATGCAGCAGCTGATGATCGTAGC…


Motivation

Which sequence variation affects a trait?
  Better understanding of disease mechanisms
  Personalized medicine

[Figure: the DNA (3 billion letters long) of one person and of a different person, differing at a few sequence variations (e.g. A vs. G). A variation changes an "instruction" (e.g. GTC) into a "different instruction", affecting appearance, personality, disease susceptibility, drug responses, … Example risk profile for one person: Obese? 15%; Bold? 30%; Diabetes? 6.2%; Parkinson’s disease? 0.3%; Heart disease? 20.1%; Colon cancer? 6.5%.]


QTL mapping

Data
  Phenotypes: $y_i$ = trait value for mouse $i$
  Genotypes: $x_{ik}$ = 1/0 (i.e. AB/AA) of mouse $i$ at marker $k$
  Genetic map: locations of genetic markers

Goal
  Identify the genomic regions (QTLs) contributing to variation in the phenotype.

[Figure: genotype data shown as a binary matrix (rows: mouse individuals; columns: markers 1 to 3,000), next to a column of phenotype data.]

Outline

Statistical methods for mapping QTL
  What is QTL?
  Experimental animals
  Analysis of variance (marker regression)
  Interval mapping (EM)

[Figure: the genotype matrix (mouse individuals by markers 1 to 3,000) and the phenotype data again, with the question "QTL?".]


Interval mapping [Lander and Botstein, 1989]

Consider any one position in the genome as the location for a putative QTL.

For a particular mouse, let $z$ = 1/0 if the (unobserved) genotype at the QTL is AB/AA.

Calculate $P(z = 1 \mid \text{marker data})$.
  Need only consider nearby genotyped markers.
  May allow for the presence of genotypic errors.


Given the genotype at the QTL, the phenotype is distributed as $N(\mu + \Delta z, \sigma^2)$.

Given the marker data, the phenotype follows a mixture of normal distributions.


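IM: the mixture model

Let’s say that the mice with QTL genotype AA have average phenotype $\mu_A$, while the mice with QTL genotype AB have average phenotype $\mu_B$.

The QTL has effect $\Delta = \mu_B - \mu_A$.

What are the unknowns?
  $\mu_A$ and $\mu_B$
  Genotype of the QTL

[Figure: a putative QTL at position 7 between its nearest flanking markers M1 (position 0) and M2 (position 20). QTL genotype probabilities given the flanking marker genotypes M1/M2: AB/AB gives 99% AB; AB/AA gives 65% AB, 35% AA; AA/AB gives 35% AB, 65% AA; AA/AA gives 99% AA.]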
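The percentages in the flanking-marker figure can be reproduced with a short calculation. This is only a sketch: it assumes a backcross, positions measured in cM, no crossover interference, and the Haldane map function, none of which the slide states explicitly. The function names `haldane_r` and `prob_qtl_ab` are illustrative.

```python
import math

def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function, assumed)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def prob_qtl_ab(m1, m2, pos_m1, pos_qtl, pos_m2):
    """P(z = 1 | flanking marker genotypes) in a backcross.

    m1, m2: genotypes at the flanking markers (1 = AB, 0 = AA).
    Positions are in cM, with pos_m1 <= pos_qtl <= pos_m2.
    """
    r1 = haldane_r(pos_qtl - pos_m1)  # recombination fraction, M1 to QTL
    r2 = haldane_r(pos_m2 - pos_qtl)  # recombination fraction, QTL to M2

    def transmit(a, b, r):
        # Probability of genotype b at a locus given genotype a at its neighbour.
        return 1.0 - r if a == b else r

    # P(z | M1, M2) is proportional to P(z | M1) * P(M2 | z); normalise over z.
    w1 = transmit(m1, 1, r1) * transmit(1, m2, r2)
    w0 = transmit(m1, 0, r1) * transmit(0, m2, r2)
    return w1 / (w1 + w0)

# M1 at 0 cM, putative QTL at 7 cM, M2 at 20 cM, as in the figure.
for m1, m2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    p = prob_qtl_ab(m1, m2, 0.0, 7.0, 20.0)
    print(f"M1={m1} M2={m2}: P(QTL genotype is AB) = {p:.2f}")
```

Under these assumptions the four cases come out at roughly 99%, 65%, 35% and 1% AB, in line with the figure.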

IM: estimation and LOD scores

Use a version of the EM algorithm (an iterative algorithm) to obtain estimates of $\mu_A$, $\mu_B$, $\sigma$ and the expectation of $z$.

Calculate the LOD score.

Repeat for all other genomic positions (in practice, at 0.5 cM steps along the genome).
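The estimation step at a single putative position can be sketched as follows. This is a minimal illustration, not the lecture's implementation: it treats $p_i = P(z_i = 1 \mid \text{marker data})$ as already computed, uses a common $\sigma$ for both genotype groups, and compares against a single-normal null model to get the LOD score. All function names and the simulated data are assumptions made for the example.

```python
import math
import random

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def em_interval_mapping(y, p, n_iter=500, tol=1e-9):
    """EM for y_i ~ p_i N(mu_B, sigma^2) + (1 - p_i) N(mu_A, sigma^2).

    y: phenotypes; p: prior probabilities P(z_i = 1 | marker data), each in (0, 1).
    Returns (mu_A, mu_B, sigma, log-likelihood at the returned estimates).
    """
    n = len(y)
    w = list(p)                     # initial E[z_i]: the marker-based probabilities
    loglik_old = -math.inf
    for _ in range(n_iter):
        # M-step: weighted means and a pooled variance.
        sw = sum(w)
        mu_b = sum(wi * yi for wi, yi in zip(w, y)) / sw
        mu_a = sum((1.0 - wi) * yi for wi, yi in zip(w, y)) / (n - sw)
        var = sum(wi * (yi - mu_b) ** 2 + (1.0 - wi) * (yi - mu_a) ** 2
                  for wi, yi in zip(w, y)) / n
        sigma = math.sqrt(var)
        # E-step: posterior probability that z_i = 1, plus the log-likelihood.
        loglik = 0.0
        new_w = []
        for yi, pi in zip(y, p):
            fb = pi * normal_pdf(yi, mu_b, sigma)
            fa = (1.0 - pi) * normal_pdf(yi, mu_a, sigma)
            new_w.append(fb / (fa + fb))
            loglik += math.log(fa + fb)
        w = new_w
        if loglik - loglik_old < tol:
            break
        loglik_old = loglik
    return mu_a, mu_b, sigma, loglik

def lod_score(y, p):
    """log10 likelihood ratio: QTL at this position vs. no QTL (single normal)."""
    _, _, _, loglik_qtl = em_interval_mapping(y, p)
    n = len(y)
    mean = sum(y) / n
    var0 = sum((yi - mean) ** 2 for yi in y) / n
    loglik_null = -0.5 * n * (math.log(2.0 * math.pi * var0) + 1.0)
    return (loglik_qtl - loglik_null) / math.log(10.0)

# Hypothetical simulated backcross: 200 mice, true effect Delta = 2, sigma = 1.
rng = random.Random(0)
p = [rng.choice([0.99, 0.65, 0.35, 0.01]) for _ in range(200)]
z = [1 if rng.random() < pi else 0 for pi in p]
y = [rng.gauss(10.0 + 2.0 * zi, 1.0) for zi in z]

mu_a, mu_b, sigma, _ = em_interval_mapping(y, p)
lod = lod_score(y, p)
print(f"mu_A={mu_a:.2f} mu_B={mu_b:.2f} sigma={sigma:.2f} LOD={lod:.1f}")
```

A genome scan would repeat `lod_score` at each position on the grid, recomputing `p` from the flanking markers each time.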


A simulated example

[Figure: LOD score curves along the genome, with the genetic markers indicated.]

Interval mapping

Advantages
  Takes proper account of missing data
  Can allow for the presence of genotypic errors
  Pretty pictures
  High power in low-density scans
  Improved estimate of QTL location

Disadvantages
  Greater computational effort (doing EM for each position)
  Requires specialized software
  More difficult to include covariates
  Only considers one QTL at a time


Statistical significance

Large LOD score → evidence for QTL
Question: How large is large?
Answer 1: Consider the distribution of the LOD score if there were no QTL.
Answer 2: Consider the distribution of the maximum LOD score.

[Figure: null distribution of the LOD score at a particular genomic position (solid curve).]

Null hypothesis: there are no QTLs segregating in the population.

$$\mathrm{LOD} = \log_{10} \frac{P(D \mid \text{QTL at the position})}{P(D \mid \text{no QTL})}$$

Under the null hypothesis, there is only a ~3% chance that a given genomic position gets a LOD score ≥ 1.
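The ~3% figure can be checked with a standard large-sample approximation. The sketch below assumes (this is not stated on the slide) that, at one fixed position in a backcross, $2 \ln(10) \cdot \mathrm{LOD}$ is approximately chi-square distributed with 1 degree of freedom under the null. The function name is illustrative.

```python
import math

def pointwise_null_tail(lod):
    """Approximate P(LOD >= lod) at one fixed position when there is no QTL.

    Assumes 2 * ln(10) * LOD ~ chi-square with 1 degree of freedom.
    For 1 df, P(chi2 >= x) = erfc(sqrt(x / 2)).
    """
    x = 2.0 * math.log(10.0) * lod
    return math.erfc(math.sqrt(x / 2.0))

p_lod1 = pointwise_null_tail(1.0)
print(f"P(LOD >= 1 | no QTL) is about {p_lod1:.3f}")
```

This pointwise probability applies to a single position only; it does not account for scanning the whole genome.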

[Figure: null distribution of the LOD score at a particular genomic position (solid curve) and of the maximum LOD score from a genome scan (dashed curve).]

LOD thresholds

To account for the genome-wide search, compare the observed LOD scores to the null distribution of the maximum LOD score, genome-wide, that would be obtained if there were no QTL anywhere.

LOD threshold = 95th percentile of the distribution of the genome-wide maximum LOD score when there are no QTL anywhere.

Methods for obtaining thresholds
  Analytical calculations, assuming a dense map of markers (Lander & Botstein, 1989)
  Computer simulations
  Permutation/randomization test (Churchill & Doerge, 1994)
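The permutation approach can be sketched as follows. To keep the example short it scores each marker with a simple two-group (marker regression) LOD rather than interval mapping, and the simulated markers are generated independently rather than linked on a genetic map; both are simplifications made for illustration, and all names are hypothetical.

```python
import math
import random

def marker_lod(y, g):
    """LOD score for a two-group comparison at one marker (g_i in {0, 1}).

    Uses LOD = (n / 2) * log10(RSS0 / RSS1) under a normal model.
    """
    n = len(y)
    mean = sum(y) / n
    rss0 = sum((yi - mean) ** 2 for yi in y)          # no-QTL model
    rss1 = 0.0                                        # separate mean per genotype
    for k in (0, 1):
        grp = [yi for yi, gi in zip(y, g) if gi == k]
        if grp:
            m = sum(grp) / len(grp)
            rss1 += sum((yi - m) ** 2 for yi in grp)
    return 0.5 * n * math.log10(rss0 / rss1)

def max_lod(y, genotypes):
    """Genome-wide maximum LOD; genotypes[k] is the genotype vector of marker k."""
    return max(marker_lod(y, g) for g in genotypes)

def permutation_threshold(y, genotypes, n_perm=1000, level=0.95, seed=0):
    """Percentile of the genome-wide max LOD over shuffled phenotypes.

    Shuffling y breaks any genotype-phenotype association while keeping
    the marker data (and its missingness/correlation pattern) intact.
    """
    rng = random.Random(seed)
    y_perm = list(y)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        maxima.append(max_lod(y_perm, genotypes))
    maxima.sort()
    idx = min(n_perm - 1, int(math.ceil(level * n_perm)) - 1)
    return maxima[idx]

# Hypothetical data: 100 backcross mice, 50 markers, phenotype with no QTL.
rng = random.Random(42)
n_mice, n_markers = 100, 50
genotypes = [[rng.randint(0, 1) for _ in range(n_mice)] for _ in range(n_markers)]
y = [rng.gauss(0.0, 1.0) for _ in range(n_mice)]

threshold = permutation_threshold(y, genotypes, n_perm=200, seed=1)
print(f"95% genome-wide LOD threshold: {threshold:.2f}")
```

An observed LOD peak would be declared significant at the 5% genome-wide level if it exceeds `threshold`.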


More on LOD thresholds

The appropriate threshold depends on:
  Size of the genome
  Number of typed markers
  Pattern of missing data
  Stringency of the significance threshold
  Type of cross (e.g. F2 intercross vs. backcross)
  Etc.

An example

[Figure: permutation distribution for a trait.]

Modeling multiple QTLs

Advantages
  Reduce the residual variation (trait variation that is not explained by a detected putative QTL) and obtain greater power to detect additional QTLs.
  Identification of (epistatic) interactions between QTLs requires the joint modeling of multiple QTLs.

Interactions between two loci
  No interaction: the effect of QTL 1 is the same, irrespective of the genotype of QTL 2, and vice versa.
  Interaction: the effect of QTL 1 depends on the genotype of QTL 2, and vice versa.

Multiple marker model

Let $y$ = phenotype, $x$ = genotype data.

Imagine a small number of QTL with genotypes $x_1, \dots, x_p$.
  $2^p$ or $3^p$ distinct genotypes for backcross and intercross, respectively

We assume that
  $E(y \mid x) = \mu(x_1, \dots, x_p)$, $\mathrm{var}(y \mid x) = \sigma^2(x_1, \dots, x_p)$


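Multiple marker model

Constant variance
  $\sigma^2(x_1, \dots, x_p) = \sigma^2$

Assuming normality
  $y \mid x \sim N(\mu_g, \sigma^2)$

Additivity
  $\mu(x_1, \dots, x_p) = \mu + \sum_j \Delta_j x_j$

Epistasis
  $\mu(x_1, \dots, x_p) = \mu + \sum_j \Delta_j x_j + \sum_{j,k} w_{j,k} x_j x_k$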
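A small numerical illustration of the additive and epistatic mean functions may help. The coefficients below are made up for the example, and the interaction sum is taken over pairs with j < k, which is one reading of the double sum.

```python
def mean_additive(x, mu, delta):
    """mu + sum_j delta_j * x_j for a genotype vector x of 0/1 values."""
    return mu + sum(d * xj for d, xj in zip(delta, x))

def mean_epistatic(x, mu, delta, w):
    """Additive mean plus pairwise interaction terms.

    w maps a pair (j, k), j < k, to the interaction coefficient w_{j,k}.
    """
    m = mean_additive(x, mu, delta)
    for (j, k), wjk in w.items():
        m += wjk * x[j] * x[k]
    return m

# Hypothetical two-QTL backcross: baseline 10, main effects 2 and 1, interaction 3.
mu, delta, w = 10, [2, 1], {(0, 1): 3}

for label, f in [("additive", lambda x: mean_additive(x, mu, delta)),
                 ("epistatic", lambda x: mean_epistatic(x, mu, delta, w))]:
    effect_when_q2_is_0 = f([1, 0]) - f([0, 0])
    effect_when_q2_is_1 = f([1, 1]) - f([0, 1])
    print(label, "effect of QTL 1:", effect_when_q2_is_0, "vs", effect_when_q2_is_1)
```

In the additive model the effect of QTL 1 is the same for both genotypes of QTL 2; with the interaction term it differs by $w_{1,2}$.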

Computational problem

$N$ backcross individuals, $M$ markers in all, with at most a handful expected to be near QTL

  $x_{ij}$ = genotype (0/1) of mouse $i$ at marker $j$
  $y_i$ = phenotype (trait value) of mouse $i$

Assuming additivity,
  $y_i = \mu + \sum_j \Delta_j x_{ij} + e$
Which $\Delta_j \neq 0$?
  Variable selection in linear regression models


Mapping QTL as model selection

Select the class of models
  Additive models
  Additive with pairwise interactions
  Regression trees

Linear regression
  $y = w_1 x_1 + \dots + w_N x_N + \varepsilon$
  $\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - y)^2 + \text{model complexity}$

[Figure: linear regression diagram in which the inputs $x_1, x_2, \dots, x_N$ are connected to the phenotype ($y$) through the parameters $w_1, w_2, \dots, w_N$.]

Search model space
  Forward selection (FS)
  Backward deletion (BE)
  FS followed by BE
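Forward selection with a complexity penalty can be sketched as follows. The penalty used here is BIC-style, $n \log(\mathrm{RSS}/n) + k \log n$, which is an assumption for the example; the slides only say "model complexity". The least-squares solver is a bare-bones one that assumes the chosen columns are not collinear, and the data are simulated.

```python
import math
import random

def solve(a, b):
    """Solve a small linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def rss(y, cols):
    """Residual sum of squares of y regressed on an intercept plus the given columns."""
    design = [[1.0] * len(y)] + cols
    xtx = [[sum(u * v for u, v in zip(a, b)) for b in design] for a in design]
    xty = [sum(u * v for u, v in zip(a, y)) for a in design]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * col[i] for b, col in zip(beta, design))) ** 2
               for i, yi in enumerate(y))

def forward_selection(y, markers, max_terms=5):
    """Greedily add the marker that most improves the penalised fit; stop when none does."""
    n = len(y)

    def score(cols):
        # BIC-style criterion (assumed): lower is better.
        return n * math.log(rss(y, cols) / n) + (len(cols) + 1) * math.log(n)

    chosen = []
    best = score([])
    while len(chosen) < max_terms:
        candidates = [j for j in range(len(markers)) if j not in chosen]
        if not candidates:
            break
        s, j = min((score([markers[k] for k in chosen + [j]]), j) for j in candidates)
        if s >= best:
            break
        chosen.append(j)
        best = s
    return chosen

# Hypothetical data: 150 mice, 20 unlinked markers, true QTL at markers 3 and 11.
rng = random.Random(1)
n_mice, n_markers = 150, 20
markers = [[rng.randint(0, 1) for _ in range(n_mice)] for _ in range(n_markers)]
y = [5.0 + 2.0 * markers[3][i] + 1.5 * markers[11][i] + rng.gauss(0.0, 1.0)
     for i in range(n_mice)]

chosen = forward_selection(y, markers)
print("selected markers:", chosen)
```

Backward deletion would run the same comparison in reverse, starting from a larger model and dropping the term whose removal most improves the score.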


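Lasso* (L1) regression

  $\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - y)^2 + C \sum_i |w_i|$

Here $C \sum_i |w_i|$ is the L1 term.

Induces sparsity in the solution $w$ (many $w_i$'s set to zero)
  Provably selects the “right” features when many features are irrelevant

Convex optimization problem
  No combinatorial search
  Unique global optimum
  Efficient optimization

[Figure: the linear regression diagram ($x_1, x_2, \dots, x_N$ connected to the phenotype $y$ through the parameters $w_1, w_2, \dots, w_N$), and a comparison of L2 and L1.]

* Tibshirani, 1996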
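How the L1 term sets coefficients exactly to zero can be seen in a small solver. The sketch below uses cyclic coordinate descent with soft-thresholding, which is one common way to solve the lasso problem and not necessarily the method used in the lecture. It minimises $\sum_i (\sum_j w_j x_{ij} - y_i)^2 + C \sum_j |w_j|$ without an intercept, so the data are centred first; the names and simulated data are illustrative.

```python
import random

def soft_threshold(a, t):
    """Shrink a towards zero by t; values within [-t, t] become exactly zero."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0

def lasso_cd(x, y, c, n_sweeps=200):
    """Coordinate descent for sum_i (x_i . w - y_i)^2 + c * sum_j |w_j|.

    x: list of rows (one per individual); y: list of phenotypes (both centred).
    """
    n, p = len(x), len(x[0])
    w = [0.0] * p
    resid = list(y)                                  # y - x w, with w = 0 initially
    col_ss = [sum(x[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes w_j's own term.
            rho = sum(x[i][j] * resid[i] for i in range(n)) + col_ss[j] * w[j]
            new_wj = soft_threshold(rho, c / 2.0) / col_ss[j]
            d = new_wj - w[j]
            if d != 0.0:
                for i in range(n):
                    resid[i] -= d * x[i][j]
                w[j] = new_wj
    return w

# Hypothetical data: 100 mice, 30 markers, only markers 2 and 7 affect the trait.
rng = random.Random(3)
n, p = 100, 30
raw = [[float(rng.randint(0, 1)) for _ in range(p)] for _ in range(n)]
col_mean = [sum(row[j] for row in raw) / n for j in range(p)]
x = [[row[j] - col_mean[j] for j in range(p)] for row in raw]
y_raw = [3.0 * row[2] - 3.0 * row[7] + rng.gauss(0.0, 1.0) for row in raw]
y_mean = sum(y_raw) / n
y = [v - y_mean for v in y_raw]

w = lasso_cd(x, y, c=40.0)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("markers with non-zero weight:", selected)
```

Larger values of `c` zero out more coefficients and shrink the surviving ones further towards zero, so `c` itself has to be chosen, for example by cross-validation.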

Model selection

Compare models
  Likelihood function + model complexity (e.g. # QTLs)
  Cross-validation test
  Sequential permutation tests

Assess performance
  Maximize the number of QTL found
  Control the false positive rate


Outline

Basic concepts
  Haplotype, haplotype frequency
  Recombination rate
  Linkage disequilibrium

Haplotype reconstruction
  Parsimony-based approach
  EM-based approach

Review: genetic variation

Single nucleotide polymorphism (SNP)
  Each variant is called an allele; each allele has a frequency.

Hardy-Weinberg equilibrium (HWE)
  Relationship between allele and genotype frequencies

How about the relationship between alleles of neighboring SNPs?
  We need to know about linkage (dis)equilibrium.
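As a reminder of the HWE relationship between allele and genotype frequencies: for a biallelic SNP with allele frequencies $p$ and $q = 1 - p$, the expected genotype frequencies are $p^2$, $2pq$ and $q^2$. A minimal sketch, with illustrative allele labels A and a:

```python
def hwe_genotype_frequencies(p):
    """Expected frequencies of genotypes (AA, Aa, aa) under HWE, given freq(A) = p."""
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q

freq_AA, freq_Aa, freq_aa = hwe_genotype_frequencies(0.3)
print(f"AA: {freq_AA:.2f}, Aa: {freq_Aa:.2f}, aa: {freq_aa:.2f}")
```

HWE relates alleles at a single SNP; it says nothing about how alleles at neighboring SNPs co-occur.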


Let’s consider the history of two neighboring alleles…

History of two neighboring alleles

Alleles that exist today arose through ancient mutation events…

[Figure: a chromosome before mutation (allele A) and after mutation, where a mutation has introduced allele C alongside the original A.]


History of two neighboring alleles

One allele arose first, and then the other…

[Figure: chromosomes before and after a second mutation (G to C) at a neighboring site; the first site carries allele A or C.]

Haplotype: combination of alleles present in a chromosome

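Recombination can create more haplotypes

[Figure: two chromosomes with haplotypes CC and GA. With no recombination (or $2n$ recombination events, i.e. an even number) between the two sites, the haplotypes remain CC and GA; with recombination, the haplotypes GC and CA are produced.]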


[Figure: haplotypes at the two neighboring sites without recombination and with recombination; recombination adds the recombinant haplotype CA.]

Haplotype

A combination of alleles present in a chromosome
Each haplotype has a frequency, which is the proportion of chromosomes of that type in the population

Consider $N$ binary SNPs in a genomic region
  There are $2^N$ possible haplotypes
  But in fact, far fewer are seen in the human population
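Haplotype frequencies, and the gap between possible and observed haplotypes, can be illustrated with a toy sample. The chromosomes below are made up; each string gives the alleles (coded 0/1) at three SNPs on one chromosome.

```python
from collections import Counter

def haplotype_frequencies(chromosomes):
    """Proportion of chromosomes carrying each haplotype."""
    counts = Counter(chromosomes)
    total = len(chromosomes)
    return {hap: c / total for hap, c in counts.items()}

# Hypothetical sample of 10 chromosomes typed at 3 binary SNPs.
chromosomes = ["000"] * 5 + ["011"] * 3 + ["111"] * 2
freqs = haplotype_frequencies(chromosomes)

n_snps = len(chromosomes[0])
n_possible = 2 ** n_snps
print(f"{len(freqs)} of {n_possible} possible haplotypes observed:", freqs)
```

In this toy sample only 3 of the 8 possible haplotypes occur, the kind of pattern that haplotype reconstruction methods exploit.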