Population Genetics using Trees Peter Beerli Genome Sciences University of Washington Seattle WA

Post on 14-Jul-2020

Page 1: Peter Beerli Genome Sciences University of Washington ...evolution.genetics.washington.edu/PBhtmls/courses/course-I.pdf · Seattle WA. Outline 1. Introduction to the basic coalescent

Population Genetics using Trees

Peter Beerli
Genome Sciences
University of Washington
Seattle WA

Outline

1. Introduction to the basic coalescent

– Population models
– The coalescent
– Likelihood estimation of parameters of interest
– Why do we need Markov chain Monte Carlo

2. Extensions and examples

Population genetics can help us to find answers

The PCR revolution allows us to generate lots of data from many individuals and many loci.

We are still interested in questions like

– Where are we or other species coming from?
– How big are populations?
– Are these populations species?
– What is the recombination rate in species x?

Population genetics in the age of genomics

Why do we need theoretical population genetics when we can have the complete sequences of our favorite organism?

Basics: Wright-Fisher population model

All individuals release many gametes, and new individuals for the next generation are formed randomly from these.

Wright-Fisher population model

Population size N is constant through time.

Each individual gets replaced every generation.

Next generation is drawn randomly from a large gamete pool.

Only genetic drift changes the allele frequencies.
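
The four assumptions above can be turned into a small simulation. This is a minimal sketch, not code from the course: the function name `wright_fisher`, the starting frequency, and the choice of 2N gene copies for N diploid individuals are assumptions made for illustration.

```python
import random

def wright_fisher(n_individuals, p0, generations, rng):
    """Follow the frequency of one allele under pure genetic drift."""
    n_copies = 2 * n_individuals      # assumption: diploid, so 2N gene copies
    freq = p0
    trajectory = [freq]
    for _ in range(generations):
        # Every gene copy of the next generation draws its parent at random
        # from the gamete pool, so the new allele count is a binomial draw.
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        freq = count / n_copies
        trajectory.append(freq)
    return trajectory

rng = random.Random(1)
traj = wright_fisher(n_individuals=50, p0=0.5, generations=200, rng=rng)
print(traj[:5], traj[-1])
```

Running it with different seeds shows how the frequency wanders and eventually gets stuck at 0 or 1, even though no selection or mutation is involved.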

The Coalescent

Sewall Wright showed that the probability that 2 gene copies come from the same gene copy in the preceding generation is

$$\mathrm{Prob}(\text{two genes share a parent}) = \frac{1}{2N}$$

The Coalescent

[Figure: lineages traced backwards in time, from Present to Past]

In every generation, there is a chance of 1/(2N) to coalesce. Following the sampled lineages backwards in time through the generations, we see that the waiting time until coalescence follows a geometric distribution with

$$E(u) = 2N$$

[the expectation of the time of coalescence u of two tips is 2N generations]
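
The geometric waiting time can be checked by brute force. The sketch below is only an illustration: the function name, the value N = 50, and the number of replicates are arbitrary assumptions.

```python
import random

def generations_until_coalescence(n_individuals, rng):
    """Count generations until two lineages pick the same parent copy."""
    p = 1.0 / (2 * n_individuals)     # chance to coalesce in one generation
    t = 1
    while rng.random() >= p:          # no coalescence: go back one more generation
        t += 1
    return t

rng = random.Random(42)
N = 50
times = [generations_until_coalescence(N, rng) for _ in range(20000)]
mean_time = sum(times) / len(times)
print(mean_time)                      # should be close to 2N = 100
```

The individual waiting times scatter widely around the mean, which is the first hint of how noisy the coalescent is.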

The Coalescent

J. F. C. Kingman generalized this for k gene copies.

$$\mathrm{Prob}(k \text{ copies are reduced to } k-1 \text{ copies}) = \frac{k(k-1)}{4N}$$

Kingman’s n-coalescent

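[Figure: genealogy drawn from Present to Past, with the time intervals u_4, u_3, u_2 between coalescent events]

$$p(G \mid N) = \prod_i \exp\left(-u_i \frac{k_i(k_i-1)}{4N}\right) \frac{1}{2N}$$

where k_i is the number of lineages present during interval u_i.

The expectation for the time interval u_k is

$$E(u_k) = \frac{4N}{k(k-1)}$$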
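
To make the two formulas concrete, here is a minimal sketch that evaluates them for a genealogy given as a list of (interval length, number of lineages) pairs. That data layout and both function names are assumptions chosen for this example.

```python
import math

def expected_interval(k, n_individuals):
    """E(u_k) = 4N / (k (k - 1)), in generations."""
    return 4.0 * n_individuals / (k * (k - 1))

def log_prob_genealogy(intervals, n_individuals):
    """log p(G|N) for intervals given as (u, k) pairs."""
    logp = 0.0
    for u, k in intervals:
        # exponent of the waiting time, then the factor 1/(2N) for the event
        logp += -u * k * (k - 1) / (4.0 * n_individuals)
        logp -= math.log(2.0 * n_individuals)
    return logp

N = 1000
# A genealogy of 4 tips whose intervals sit exactly at their expectations.
intervals = [(expected_interval(k, N), k) for k in (4, 3, 2)]
print(intervals)
print(log_prob_genealogy(intervals, N))
```

Working with the logarithm avoids numerical underflow once genealogies have more than a handful of tips.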

But what is this good for?

Naively we could estimate: 1. Time of the most recent common ancestor

For a given population size we can calculate the time of the most recent common ancestor [MRCA].

1. Get a TRUE genealogy (topology and branch lengths) from an infallible oracle.

2. Get the population size from the same oracle.

3. Calculate the time of the MRCA by summing over all time intervals.

1. Time of the most recent common ancestor [Shortcut]

1. Get the population size from another oracle.

2. Use the expectation for your data type to get an estimate of the time of the MRCA.

The expectation for the time of the MRCA is

E(u) = 4N for diploid organisms

E(u) = 2N for haploid organisms

E(u) = N for maternally transmitted mtDNA and the paternally transmitted Y-chromosome

[assumption: sex-ratio is 1:1]
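
The diploid value can be recovered from the interval expectations E(u_k) = 4N / (k(k − 1)) by summing them from the sample down to the last two lineages. The symbol n for the number of sampled gene copies is an assumption introduced here; the slides do not name it.

```latex
\begin{aligned}
E(T_{\mathrm{MRCA}})
  &= \sum_{k=2}^{n} E(u_k)
   = \sum_{k=2}^{n} \frac{4N}{k(k-1)} \\
  &= 4N \sum_{k=2}^{n} \left( \frac{1}{k-1} - \frac{1}{k} \right)
   = 4N \left( 1 - \frac{1}{n} \right)
   \;\longrightarrow\; 4N \quad \text{for large } n .
\end{aligned}
```

With n = 2 this gives 2N, the two-tip expectation shown earlier, so half of the expected tree depth is spent waiting for the last two lineages to coalesce.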

2. Calculate the size of the population

1. We get THE genealogy from our oracle

2. We know that we can calculate p(Genealogy|N)

We remember the probability calculation

$$p(G \mid N) = p(u_1 \mid N, k)\,\frac{1}{2N} \times p(u_2 \mid N, k-1)\,\frac{1}{2N} \times \cdots$$

$$p(\mathrm{Genealogy} \mid N) = \prod_{j}^{T} e^{-u_j \frac{k_j(k_j-1)}{4N}}\,\frac{1}{2N}$$
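
Given one genealogy, the value of N that maximizes p(Genealogy|N) has a closed form: setting the derivative of the log probability to zero gives N = Σ u_j k_j (k_j − 1) / (4T). The sketch below is an illustration under that derivation; the interval values are made up and the function names are hypothetical.

```python
import math

def log_prob_genealogy(intervals, n_individuals):
    """log p(Genealogy|N) for intervals given as (u_j, k_j) pairs."""
    return sum(-u * k * (k - 1) / (4.0 * n_individuals)
               - math.log(2.0 * n_individuals)
               for u, k in intervals)

def ml_population_size(intervals):
    """Maximizer of log p(Genealogy|N): sum(u k (k-1)) / (4 T)."""
    total = sum(u * k * (k - 1) for u, k in intervals)
    return total / (4.0 * len(intervals))

# Made-up oracle genealogy with 4 tips: (interval length, lineages).
intervals = [(150.0, 4), (500.0, 3), (2500.0, 2)]
n_hat = ml_population_size(intervals)
print(n_hat)

# Scan a few values around the estimate to see the peak.
for n in (0.5 * n_hat, 0.9 * n_hat, n_hat, 1.1 * n_hat, 2.0 * n_hat):
    print(round(n), log_prob_genealogy(intervals, n))
```

The curve is quite flat around its maximum for a tree this small, which already suggests that a single genealogy carries little information about N.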

[Figures: a sequence of slides illustrating the calculation; the last one carries the labels N = 2270 and N = 12286]

Problems with these very naive approaches

We assume we know the TRUE genealogy: topology and branch length.

Variability of the coalescent

10 coalescent trees generated with the same population size, N = 10,000
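
A comparable experiment is easy to repeat. The sketch below draws the depth (time to the MRCA) of 10 trees with N = 10,000; the sample size of 5 tips, the exponential approximation of the waiting times, and the function name are assumptions for illustration, not details taken from the figure.

```python
import random

def simulate_tmrca(sample_size, n_individuals, rng):
    """Depth of one coalescent tree, in generations."""
    tmrca = 0.0
    for k in range(sample_size, 1, -1):
        # With k lineages the coalescence rate is k(k-1)/(4N) per generation.
        rate = k * (k - 1) / (4.0 * n_individuals)
        tmrca += rng.expovariate(rate)
    return tmrca

rng = random.Random(7)
N = 10_000
depths = [simulate_tmrca(5, N, rng) for _ in range(10)]
for d in depths:
    print(round(d))
```

The ten depths typically differ severalfold although every tree comes from the same population size.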

Variability of mutations

How many samples do we need?

Summary of the basic Coalescent

Mathematically tractable way to calculate probabilities of genealogies in a population.

The coalescent is a very noisy distribution of times on a tree.

Variability because of mutation increases the uncertainty of these times.

The population size is correlated with the depth of the tree.

Estimates of population size or the time of the MRCA from a single tree are very error-prone.

Variable population size

In a small population lineages coalesce quickly

In a large population lineages coalesce slowly

This leaves a signature in the data. We can exploit this and estimate the population growth rate g jointly with the population size Θ.

Exponential population size expansion or shrinkage
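
One common way to write such a model, assumed here for illustration, is N(t) = N0 · exp(−g·t) with t counted in generations before the present, so that g > 0 means the population was smaller in the past. Under that assumption the waiting time for the next coalescence follows from rescaling a constant-size waiting time. The function name and the per-generation scaling of g are choices made for this sketch and need not match the scaling used on the slides.

```python
import math

def waiting_time_growth(k, n0, g, tau, e):
    """Waiting time to the next coalescence among k lineages.

    n0  : present-day population size (assumed N(t) = n0 * exp(-g * t))
    g   : growth rate per generation (assumption of this sketch)
    tau : time, in generations before present, at which the interval starts
    e   : a draw from an Exponential(1) distribution
    """
    const_wait = 4.0 * n0 * e / (k * (k - 1))   # waiting time if g were 0
    if g == 0.0:
        return const_wait
    x = g * const_wait * math.exp(-g * tau)
    if x <= -1.0:
        # Shrinking population (g < 0): the lineages may never coalesce.
        return math.inf
    return math.log1p(x) / g

for g in (-0.0001, 0.0, 0.001):
    print(g, waiting_time_growth(k=2, n0=1000, g=g, tau=0.0, e=1.0))
```

With the same random draw, growth shortens the interval and shrinkage lengthens it, which is the signature in the data that the joint estimation of g and Θ relies on.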

Grow a frog

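[Figure: plot with the growth rate g on one axis (ticks -50, 0, 50, 100) and Θ on the other (ticks 0.01, 0.1, 1)]

Mutation rate    Population size 10,000 generations ago    Population size at present
10^-8            8,300,000                                 8,360,000
10^-7            780,000                                   836,000
10^-6            40,500                                    83,600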

Parameter estimation using maximum likelihood

Mutation model: Nucleotide mutation model, ...

Population genetics model: the Coalescent

$$\mathrm{Prob}(N, \mu \mid \mathrm{data}) = \frac{\mathrm{Prob}(\mathrm{data} \mid N, \mu)\,\mathrm{Prob}(N, \mu)}{\mathrm{Prob}(\mathrm{data})}$$

$$L(N, \mu) = \mathrm{Prob}(\mathrm{data} \mid N, \mu) = c\,\mathrm{Prob}(N, \mu \mid \mathrm{data})$$

where c does not depend on N and µ if the prior Prob(N, µ) is flat.

$$L(N, \mu) = \mathrm{Prob}(\mathrm{data} \mid N, \mu) = \int_G p(G \mid N, \mu)\,\mathrm{Prob}(\mathrm{data} \mid G, \mu)\,dG$$

We cannot observe the mutation events. Instead of estimating N and µ we estimate the product Θ = 4Nµ and scale G with µ

$$L(\Theta) = \int_{G^*} p(G^* \mid \Theta)\,\mathrm{Prob}(\mathrm{data} \mid G^*)\,dG^*$$

Problem: We need to integrate over all genealogies: all different labelled histories, all different branch lengths

Can we calculate this sum over all genealogies?

We need to integrate over all genealogies: all different topologies, all different branch lengths

Tips    Topologies
3       3
4       18
5       180
6       2700
7       56700
8       1587600
9       57153600
10      2571912000
15      6958057668962400000
20      564480989588730591336960000000
30      4368466613103069512464680198620763891440640000000000000
40      30273338299480073565463033645514572000429394320538625017078887219200000000000000000
50      3.28632 × 10^112
100     1.37416 × 10^284
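
The counts in the table agree with the number of labelled histories: going back in time, a tree with k lineages can choose its next coalescing pair in k(k − 1)/2 ways, and these choices multiply. The sketch below reproduces the numbers; the function name is hypothetical.

```python
from math import comb, prod

def labelled_histories(n_tips):
    """Number of labelled histories for n_tips sampled gene copies."""
    # Product over k = 2..n of the number of pairs among k lineages.
    return prod(comb(k, 2) for k in range(2, n_tips + 1))

for n in (3, 4, 5, 6, 10, 15):
    print(n, labelled_histories(n))
```

Python integers have arbitrary precision, so the same function also returns the exact values for 50 or 100 tips.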

A solution: Markov chain Monte Carlo

Metropolis recipe

0. first state

1. perturb old state and calculate probability of new state

2. test if new state is better than old state: accept if ratio of new and old is larger than a random number between 0 and 1.

3. move to new state if accepted otherwise stay at old state

4. go to 1
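
The recipe can be sketched in a few lines. To keep the example short, the state here is a single number and the target is a standard normal density rather than a genealogy; the function names, the target, and the random-walk proposal are all assumptions for illustration.

```python
import math
import random

def metropolis(log_prob, start, propose, n_steps, rng):
    """Metropolis sampler with a symmetric proposal."""
    state = start                                  # 0. first state
    samples = []
    for _ in range(n_steps):
        candidate = propose(state, rng)            # 1. perturb old state
        log_ratio = log_prob(candidate) - log_prob(state)
        # 2. accept if the ratio new/old exceeds a random number in [0, 1)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            state = candidate                      # 3. move to new state
        samples.append(state)                      #    (or stay at old state)
    return samples                                 # 4. loop back to step 1

def log_target(x):
    return -0.5 * x * x          # unnormalized log density of N(0, 1)

def random_walk(x, rng):
    return x + rng.uniform(-1.0, 1.0)

rng = random.Random(3)
samples = metropolis(log_target, 0.0, random_walk, 50_000, rng)
mean = sum(samples) / len(samples)
print(mean)                      # should be close to 0
```

Only the ratio of probabilities is needed, so the normalizing constant of the target never has to be computed. That is what makes the approach usable when the sum over all genealogies is out of reach.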

How do we change a genealogy?

[Figure: genealogy with tips A, B, C, D and labels z, j, k, 1, 2]

Markov chain Monte Carlo

[Figure: tree space with the first sampled genealogy, labelled 1]

L(G1|Θ)

Create a new tree.

[Figure: tree space with a second genealogy, labelled 2]

L(G2|Θ)

Evaluate, with r a random number between 0 and 1,

$$r < \frac{p(G_2 \mid \Theta)\,P(D \mid G_2)\,P(G_1 \mid G_2)}{p(G_1 \mid \Theta)\,P(D \mid G_1)\,P(G_2 \mid G_1)}$$

luckily reduces most often to:

$$r < \frac{P(D \mid G_2)}{P(D \mid G_1)}$$
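
One situation in which the full ratio collapses to the likelihood ratio is when the proposal draws the new genealogy from the coalescent prior itself, so that the proposal terms cancel the prior terms. The slide does not state the reason, so treat this as an assumption. The sketch below checks the cancellation on made-up log values; the function and argument names are hypothetical.

```python
import math

def log_acceptance_ratio(log_prior_new, log_lik_new, log_q_old_given_new,
                         log_prior_old, log_lik_old, log_q_new_given_old):
    """Logarithm of the full Metropolis-Hastings ratio.

    prior = p(G|Theta), lik = P(D|G), q = proposal probability P(G_to|G_from).
    """
    numerator = log_prior_new + log_lik_new + log_q_old_given_new
    denominator = log_prior_old + log_lik_old + log_q_new_given_old
    return numerator - denominator

# Made-up values for an old genealogy G1 and a proposed genealogy G2.
log_prior_old, log_prior_new = -12.0, -15.0
log_lik_old, log_lik_new = -230.0, -228.5

# Assumption: the proposal density of a genealogy equals its prior density.
full = log_acceptance_ratio(log_prior_new, log_lik_new, log_prior_old,
                            log_prior_old, log_lik_old, log_prior_new)
reduced = log_lik_new - log_lik_old
print(full, reduced)

r = 0.3                                   # a random number between 0 and 1
print(math.log(r) < full)                 # accept the move if True
```

Comparing log(r) with the log ratio is equivalent to comparing r with the ratio, and it avoids underflow when the likelihoods are very small.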

Store G1.

Make another change to the tree.

L(G3|Θ)

$$r < \frac{P(D \mid G_3)}{P(D \mid G_2)}$$

Store G2.

Make another change to the tree.

[Figure: the walk continues through tree space; the sampled genealogies are numbered 1 to 16]

MCMC walk I

MCMC walk result

[Figure: two panels plotting Probability (ticks 0.002, 0.005, 0.01) over Tree space]

Improving our MCMC walker: MCMCMC or MC³

Metropolis Coupled Markov chain Monte Carlo

Run several independent parallel chains: each has a different temperature

After some sampling of genealogies, swap the genealogies of a pair of chains if the ratio between probabilities in the cold and the hot chain is larger than a random number drawn between 0 and 1.
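
The swap step can be sketched as follows. The exact form of the ratio is not spelled out above; this example assumes the usual Metropolis-coupled rule in which a chain with inverse temperature β samples the target raised to the power β (β = 1 for the cold chain). The function names and the toy values are hypothetical.

```python
import math
import random

def log_swap_ratio(log_p_i, log_p_j, beta_i, beta_j):
    """Log of the swap ratio for chains i and j.

    log_p_* are log probabilities of the current states under the
    untempered target; beta_* are inverse temperatures (assumption).
    """
    return (beta_i - beta_j) * (log_p_j - log_p_i)

def maybe_swap(states, log_probs, betas, i, j, rng):
    """Swap the states of chains i and j if the ratio beats a random number."""
    log_ratio = log_swap_ratio(log_probs[i], log_probs[j], betas[i], betas[j])
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        states[i], states[j] = states[j], states[i]
        log_probs[i], log_probs[j] = log_probs[j], log_probs[i]
        return True
    return False

rng = random.Random(5)
states = ["G_cold", "G_hot"]        # placeholders for two genealogies
log_probs = [-10.0, -5.0]           # the hot chain has found a better tree
betas = [1.0, 0.5]
swapped = maybe_swap(states, log_probs, betas, 0, 1, rng)
print(swapped, states)
```

The hot chains see a flattened surface and cross valleys more easily; swaps hand their discoveries to the cold chain, which is the only one whose samples are kept.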

MCMC walk II

better MCMC walk result