
Genealogical trees, coalescent theory, and the analysis of genetic polymorphisms

Magnus Nordborg

University of Southern California


The importance of history

• Genetic polymorphism data represent the outcome of a single, highly complex, non-repeatable evolutionary history

• Traditional analysis methods cannot take this into account

• The stochastic process known as “the coalescent” presents a coherent statistical framework for analyzing genetic polymorphism data


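The importance of history: mutations are random

[Figure: a genealogy descending from the MRCA, with mutations between G and T on its branches and the sampled sequences carrying either G or T]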


The importance of history: trees are random


Modeling genetic polymorphism

At a minimum, models must include:

• coalescence (who begat whom, and when)

• mutation

• recombination


Recombination makes it possible for linked sites to have different genealogies

[Figure: a genealogy with recombination; labels: coalescence, recombination, breakpoint, induced trees]


What is the coalescent?

• The coalescent is a stochastic process that is well-suited for modeling polymorphism data

• It is a natural extension to classical population genetics models


Coalescence: picking parents

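[Figure: n = 3 lineages sampled from a population of size N = 10 are traced back into the past by picking parents; T(3) and T(2) mark the times spent with 3 and 2 distinct lineages]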


The rate of coalescence

The rate at which lineages find each other depends on:

• The population size: the per-generation probability of coalescence is ∝ 1/N

• The number of lineages: the rate of coalescence when there are k lineages is $\binom{k}{2} = k(k-1)/2$

• A number of other demographic factors, such as inbreeding, age structure, and the variance in reproductive success

Because the per-generation probability of coalescence is on the order of 1/N, we use a continuous-time approximation where time is measured in units of N generations
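The continuous-time approximation can be sketched in a few lines of Python. This is a minimal sketch, assuming the standard coalescent in which the waiting time T(k) while there are k lineages is exponential with rate k(k − 1)/2, in units of N generations; the function names are illustrative and not from the slides.

```python
import random

def simulate_coalescence_times(n, rng):
    """Return {k: T(k)} for k = n, ..., 2, in units of N generations.

    Assumption: standard coalescent, so T(k) ~ Exponential(rate = k choose 2).
    """
    times = {}
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2  # number of pairs of lineages that can coalesce
        times[k] = rng.expovariate(rate)
    return times

def expected_tmrca(n):
    """Expected time to the MRCA: sum over k of 1 / (k choose 2) = 2 * (1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n)

rng = random.Random(1)
n = 3
reps = 20000
# Average the simulated time to the MRCA, T(3) + T(2), over many replicates.
mean_tmrca = sum(
    sum(simulate_coalescence_times(n, rng).values()) for _ in range(reps)
) / reps
print(mean_tmrca, expected_tmrca(n))
```

The simulated mean should lie close to the analytical expectation, which never exceeds 2 however large the sample is.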


Mutation

• Selectively neutral mutations are added to the branches of the tree afterwards according to a rate that depends on the per-generation probability of mutation

• The expected number of mutations on a branch depends on its length; the expected number of mutations on the tree depends on the total branch length of the tree

• Any mutation model can be used
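As a worked example of the second point, the expected number of mutations on the tree can be written out under some stated assumptions: the standard coalescent with time in units of N generations, a per-pair coalescence probability of exactly 1/N per generation, and a per-generation mutation probability u per lineage. The symbols u and θ = 2Nu are notation introduced here, not taken from the slides.

```latex
\begin{align*}
E[T_{\text{total}}] &= \sum_{k=2}^{n} k \, E[T(k)]
  = \sum_{k=2}^{n} \frac{k}{\binom{k}{2}}
  = \sum_{k=2}^{n} \frac{2}{k-1}
  = 2 \sum_{i=1}^{n-1} \frac{1}{i}, \\
E[\text{number of mutations}] &= N u \, E[T_{\text{total}}]
  = \theta \sum_{i=1}^{n-1} \frac{1}{i},
  \qquad \theta = 2 N u .
\end{align*}
```

The total branch length, and hence the expected number of mutations, grows only logarithmically with the sample size n.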


Recombination

• Recombination breaks up lineages according to a rate that depends on the per-generation probability of recombination

• There will be more recombination in the genealogy of a longer chromosomal segment

• Any recombination model can be used

• The coalescent with recombination generates a random graph, or a forest of trees


A graph or a forest...

[Figure: a genealogy with recombination; labels: coalescence, recombination, breakpoint, induced trees]


A walk through tree space

[Figure: time to MRCA (y-axis, 1.5 to 4) plotted against chromosomal position (x-axis, 0 to 1)]


The trees are correlated

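[Figure: time to MRCA (y-axis, 1.5 to 4) plotted against chromosomal position (x-axis, 0 to 1), with genealogies of six sampled sequences drawn at two positions]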


The trees are correlated

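[Figure: time to MRCA (y-axis, 1.5 to 4) plotted against chromosomal position (x-axis, 0 to 1), with genealogies of six sampled sequences drawn at three positions]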


Recombination is common

[Figure: time to MRCA (y-axis, 1.5 to 4) plotted against chromosomal position (x-axis, 0 to 1). Annotations on the plot: "this may be 10 kb!", "these are junctions", "these are mutations"]


Recombination is as common as mutation

• If 1 cM ∼ 1 Mb, then the probability of recombination per bp per generation is ∼ 10⁻⁸

• The probability of mutation per bp per generation is estimated to be at most 10⁻⁸

• It follows that a sample of sequences will contain as many junctions as polymorphisms
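The arithmetic behind the first bullet can be written as a worked line, assuming the usual reading of 1 cM as a 1% (0.01) probability of recombination per generation. The symbols r (recombination) and u (mutation) are introduced here for illustration.

```latex
\[
r \approx \frac{0.01 \text{ per generation}}{10^{6}\ \text{bp}}
  = 10^{-8} \text{ per bp per generation},
\qquad
\frac{r}{u} \gtrsim 1 \quad \text{if } u \le 10^{-8} \text{ per bp per generation}.
\]
```

Since both junctions and mutations are placed along the same genealogy at rates proportional to r and u, comparable rates imply comparable numbers of each.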


Genealogical graphs cannot in general be reconstructed

• Even with infinitely many polymorphisms, a substantial fraction of all junctions would not be detected

• In reality, there are clearly too few polymorphisms per junction to estimate the graph

• Remember: a phylogenetic algorithm will always reconstruct a tree, regardless of whether there exists a tree to be reconstructed...


We do not in general wish to reconstruct genealogical graphs

• Population genetics is not phylogenetics!

• Gene genealogies are of no interest per se: they are random outcomes of an underlying evolutionary process, and are of interest only insofar as they contain information about this process


Gene trees and species trees

Phylogenetic methods estimate species trees by estimating gene trees; they are appropriate if and only if the latter are strongly correlated with the former


Phylogenetic methods are not applicable to within-species data

[Figure: three models of modern human origins, each drawn as lineages in Africa, Europe, and Asia descending from an African ancestor, on a time axis running from 1 to 0 million years ago: a) Out-of-Africa Model, b) Multiregional Model, c) Candelabra Model; one label reads "migration"]

[Figure: schematic version of the human mtDNA tree, with branches labeled Africans and non-Africans]

• We must consider the likelihood of the data under alternative models


A likelihood framework

Phylogenetics:

$$L = P(D \mid G, \mu)$$

Population genetics:

$$L = \sum_{G} P(D \mid G, \mu)\, P(G \mid \alpha)$$

Here D is the data, G the genealogy, µ the mutation model, and α the demographic model

Note that G is a nuisance parameter in population genetics
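Treating G as a nuisance parameter means summing (or averaging) over genealogies drawn from the demographic model. A minimal Monte Carlo sketch follows, under assumptions that are not in the slides: a sample of two sequences, data D reduced to the number s of differences between them, a Poisson (infinite-sites) mutation model with scaled rate theta, and a constant-size standard coalescent as the demographic model. All function names are illustrative.

```python
import math
import random

def poisson_pmf(s, lam):
    """Probability of s events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** s / math.factorial(s)

def mc_likelihood(s, theta, reps, rng):
    """Estimate L(theta) by averaging P(D | G, theta) over simulated genealogies G."""
    total = 0.0
    for _ in range(reps):
        t2 = rng.expovariate(1.0)   # G: coalescence time of the pair (units of N generations)
        branch_length = 2.0 * t2    # two branches, each of length t2
        # P(D | G, mu): mutations fall on the tree at rate theta / 2 per unit length
        total += poisson_pmf(s, theta / 2.0 * branch_length)
    return total / reps

def exact_likelihood(s, theta):
    """Closed form for n = 2: integrating out the exponential time gives a geometric law."""
    return (1.0 / (1.0 + theta)) * (theta / (1.0 + theta)) ** s

rng = random.Random(42)
estimate = mc_likelihood(2, 1.0, 50000, rng)
print(estimate, exact_likelihood(2, 1.0))
```

For two sequences the sum over genealogies has a closed form, which makes the estimate easy to check; for larger samples and richer data no such shortcut is available and the space of genealogies is vastly larger.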


Uses of the coalescent

• A mathematical modeling tool

• A simulation tool for hypothesis testing andexploratory data analysis

• The basis for full likelihood inference


The simplicity and elegance of the coalescent process make it a powerful modeling tool

At least for the standard coalescent, it is often possible to derive results analytically

• Estimators and tests, e.g., Tajima’s D statistic

• Illuminating theoretical results, e.g., the probability that a sample of size n contains the MRCA of the entire population is

$$\frac{n - 1}{n + 1}$$
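As an illustration of an analytically derived test, here is a minimal sketch of Tajima's D computed from summary statistics. It assumes the usual definition, which contrasts the mean number of pairwise differences with the number of segregating sites scaled by a harmonic sum; the function and argument names are illustrative.

```python
import math

def tajimas_d(n, num_segregating, mean_pairwise_diff):
    """Tajima's D for a sample of n sequences.

    num_segregating: number of segregating sites S (must be > 0)
    mean_pairwise_diff: average number of differences between pairs of sequences
    """
    s = num_segregating
    if n < 2 or s <= 0:
        raise ValueError("need n >= 2 and at least one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Numerator: difference between two estimators of theta.
    # Denominator: estimated standard deviation of that difference.
    return (mean_pairwise_diff - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

print(tajimas_d(10, 5, 1.0))
```

Under the standard neutral model both estimators have the same expectation, so D is expected to be near zero; an excess of rare variants pushes it negative and an excess of intermediate-frequency variants pushes it positive.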


Almost any scenario can be simulated using the coalescent

• Coalescent simulations are enormously more efficient than classical forward-in-time simulations, because only the lineages ancestral to the sample are tracked

• Simulated data can be compared with real data, or used to evaluate the feasibility of a study before it is carried out
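Comparing simulated with real data can be sketched as follows: simulate a summary statistic many times under a null model and ask how often the simulated value is at least as extreme as the observed one. This sketch assumes the standard constant-size coalescent without recombination, Poisson mutations with scaled rate theta, and the number of segregating sites as the statistic; the names and the observed value of 30 are hypothetical.

```python
import random

def sample_poisson(lam, rng):
    """Draw a Poisson(lam) count by counting unit-rate arrivals before time lam."""
    count = 0
    t = rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_segregating_sites(n, theta, rng):
    """Simulate the number of segregating sites in a sample of n sequences."""
    total_length = 0.0
    for k in range(n, 1, -1):
        # k branches persist for an Exponential(k choose 2) time
        total_length += k * rng.expovariate(k * (k - 1) / 2)
    return sample_poisson(theta / 2.0 * total_length, rng)

def empirical_p_value(observed, n, theta, reps, rng):
    """Fraction of simulated data sets with at least `observed` segregating sites."""
    hits = sum(
        simulate_segregating_sites(n, theta, rng) >= observed for _ in range(reps)
    )
    return hits / reps

rng = random.Random(7)
p = empirical_p_value(observed=30, n=10, theta=5.0, reps=5000, rng=rng)
print(p)
```

Each replicate costs only n − 1 exponential draws plus the mutations, which is why thousands of replicates are cheap. The same loop can be run before data collection to see whether a planned sample size could distinguish two scenarios.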


Example: ancient Neanderthal mtDNA

[Figure: mtDNA tree of 986 modern humans and a Neanderthal, with times labeled ts, Te, and Tr]

• Modern humans monophyletic

• Tr > 4Te

Does this prove that Neanderthals and modern humans did not interbreed?


Example: ancient Neanderthal mtDNA

[Figure: mtDNA tree of 986 modern humans and a Neanderthal, with times labeled ts, Te, and Tr, and a label reading "Mediterranean"]

Assuming that they did interbreed, what is the probability of getting a tree like the one observed just by chance?

Coalescent simulations showed that this probability is high even for large amounts of interbreeding


Full likelihood analysis

• In principle possible

• In practice difficult

• Unless major breakthroughs are made, not likely to be applicable to genomic polymorphism data


What is the main insight from coalescent theory?

That very large numbers of loci are required to answer most questions!


Population Genomics is upon us!

• Data sets containing hundreds and thousands of loci already exist

• Within 10 years, it seems likely that whole-genome comparisons between species will be common, and that we will have whole-genome sequences from thousands of humans


Fewer assumptions, more data

We will be able to use empirically estimated distributions of test statistics rather than theoretically predicted ones

[Figure: D (y-axis, −2 to 2) plotted against position in kb (x-axis, 200 to 600)]


Selective sweeps

• Fixation of new alleles leaves a footprint in the pattern of genomic variation

• Can we find the genes that “make us human”?

[Figure: selection acting on an advantageous variant]


How many genes?


Teosinte to corn: < 10,000 years; five genes?

[Figure: teosinte, maize, and maize with tb1 mutation]


What’s the use of polymorphism data?

• Whole-genome properties

– demographic (sensu lato) history

– molecular evolution

– genetic mechanisms

• The history of individual loci — selection

– divergence between human and other primates

– traces of selection within the last million years


The history and future of multi-locus methods
