Genealogical trees, coalescent theory, and the analysis of genetic polymorphisms
Magnus Nordborg
University of Southern California
1
The importance of history
• Genetic polymorphism data represent the outcome of a single, highly complex, non-repeatable evolutionary history
• Traditional analysis methods cannot take this into account
• The stochastic process known as “the coalescent” presents a coherent statistical framework for analyzing genetic polymorphism data
2
The importance of history: mutations are random
[Figure: genealogical tree descending from the MRCA, with sampled sequences carrying G or T alleles]
3
The importance of history: trees are random
4
Modeling genetic polymorphism
At a minimum, models must include:
• coalescence (who begat whom, and when)
• mutation
• recombination
5
Recombination makes it possible for linked sites to have different genealogies
[Figure labels: induced trees, breakpoint, recombination, coalescence]
6
What is the coalescent?
• The coalescent is a stochastic process that is well-suited for modeling polymorphism data
• It is a natural extension to classical population genetics models
7
Coalescence: picking parents
[Figure: lineages traced back into the past in a population of N = 10 for a sample of n = 3, with coalescence times T(3) and T(2)]
8
The rate of coalescence
The rate at which lineages find each other depends on:
• The population size: the per-generation probability of coalescence is ∝ 1/N
• The number of lineages: the rate of coalescence when there are k lineages is $\binom{k}{2}$
• A number of other demographic factors, such as inbreeding, age structure, and the variance in reproductive success
Because the per-generation probability of coalescence is on the order of 1/N, we use a continuous-time approximation where time is measured in units of N generations
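The continuous-time approximation above can be sketched in a few lines. This is a minimal illustration, not the speaker's code: it assumes the standard coalescent, in which the waiting time T(k) while there are k lineages is exponential with rate C(k, 2) when time is measured in units of N generations. The function names are hypothetical.

```python
import random
from math import comb

def simulate_coalescence_times(n, rng):
    """Return {k: T(k)} for k = n, ..., 2, with time in units of N generations."""
    times = {}
    for k in range(n, 1, -1):
        rate = comb(k, 2)  # k lineages give C(k, 2) pairs that can coalesce
        times[k] = rng.expovariate(rate)
    return times

def expected_tmrca(n):
    """Expected time to the MRCA: sum of E[T(k)] = 1 / C(k, 2)."""
    return sum(1 / comb(k, 2) for k in range(2, n + 1))

rng = random.Random(1)
n = 3          # sample size, as in the slide's n = 3 example
reps = 20000
mean_tmrca = sum(
    sum(simulate_coalescence_times(n, rng).values()) for _ in range(reps)
) / reps
print("simulated mean TMRCA:", mean_tmrca)
print("expected TMRCA:      ", expected_tmrca(n))
```

Note that N does not appear in the code: under this scaling the population size only sets the unit of time.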
9
Mutation
• Selectively neutral mutations are added to the branches of the tree afterwards according to a rate that depends on the per-generation probability of mutation
• The expected number of mutations on a branch depends on its length; the expected number of mutations on the tree depends on the total branch length of the tree
• Any mutation model can be used
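As a rough sketch of how mutations can be added after the tree has been generated: the code below assumes the standard coalescent and a Poisson mutation process with a hypothetical scaled rate `theta / 2` per unit of branch length (neither the symbol nor the scaling appears in the slides). Under these assumptions the number of mutations on the tree depends only on its total branch length.

```python
import random

def poisson(mean, rng):
    """Draw a Poisson variate by counting unit-rate exponential arrivals before `mean`."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def expected_total_branch_length(n):
    """While k lineages exist (expected time 2 / (k (k - 1))), k branches accumulate length."""
    return sum(k * 2 / (k * (k - 1)) for k in range(2, n + 1))

def simulate_mutation_count(n, theta, rng):
    """Simulate the total branch length, then drop mutations on it (assumed rate theta / 2)."""
    total_length = sum(k * rng.expovariate(k * (k - 1) / 2) for k in range(2, n + 1))
    return poisson(theta / 2 * total_length, rng)

rng = random.Random(7)
n, theta, reps = 5, 2.0, 20000   # illustrative values only
mean_s = sum(simulate_mutation_count(n, theta, rng) for _ in range(reps)) / reps
print("simulated mean number of mutations:", mean_s)
print("expected (theta / 2 * E[length]):  ", theta / 2 * expected_total_branch_length(n))
```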
10
Recombination
• Recombination breaks up lineages according to a rate that depends on the per-generation probability of recombination
• There will be more recombination in the genealogy of a longer chromosomal segment
• Any recombination model can be used
• The coalescent with recombination generates a random graph, or a forest of trees
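One common way to picture the random graph is as a race between two kinds of events: going back in time, a coalescence merges two lineages and a recombination splits one lineage in two. The sketch below only counts those events. It assumes each pair coalesces at rate 1 and each lineage recombines at a hypothetical scaled rate `rho / 2`, and it does not track which chromosomal segments each lineage carries, so it is an illustration of the graph's growth rather than a full simulator.

```python
import random

def simulate_graph_events(n, rho, rng):
    """Count coalescence and recombination events until one lineage remains."""
    k = n
    coalescences = 0
    recombinations = 0
    while k > 1:
        coal_rate = k * (k - 1) / 2   # any of the C(k, 2) pairs may coalesce
        rec_rate = k * rho / 2        # any of the k lineages may recombine (assumed rate)
        if rng.random() < coal_rate / (coal_rate + rec_rate):
            k -= 1                    # two lineages merge into one
            coalescences += 1
        else:
            k += 1                    # one lineage splits into two
            recombinations += 1
    return coalescences, recombinations

rng = random.Random(11)
coal, rec = simulate_graph_events(6, 1.0, rng)
print("coalescences:", coal, "recombinations:", rec)
```

Because every recombination adds a lineage that must later coalesce, the number of coalescences always equals n - 1 plus the number of recombinations; with `rho = 0` the graph reduces to a single tree.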
11
A graph or a forest...
[Figure labels: induced trees, breakpoint, recombination, coalescence]
12
A walk through tree space
[Figure: time to MRCA (y-axis, 1.5 to 4) against chromosomal position (x-axis, 0 to 1)]
13
The trees are correlated
[Figure: time to MRCA (y-axis, 1.5 to 4) against chromosomal position (x-axis, 0 to 1), with genealogies of sequences 1 to 6 drawn at two positions]
14
The trees are correlated
[Figure: time to MRCA (y-axis, 1.5 to 4) against chromosomal position (x-axis, 0 to 1), with genealogies of sequences 1 to 6 drawn at three positions]
15
Recombination is common
[Figure: time to MRCA (y-axis, 1.5 to 4) against chromosomal position (x-axis, 0 to 1), with annotations "this may be 10 kb!", "these are junctions", and "these are mutations"]
16
Recombination is as common as mutation
• If 1 cM ∼ 1 Mb, then the probability of recombination per bp per generation is ∼ $10^{-8}$
• The probability of mutation per bp per generation is estimated to be at most $10^{-8}$
• It follows that a sample of sequences will contain as many junctions as polymorphisms
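The arithmetic behind these figures is short enough to spell out. The sketch below uses only the numbers quoted above plus the definition of a centimorgan (a 1% chance of recombination per generation); the variable names are illustrative.

```python
# 1 cM corresponds to a 1% probability of recombination per generation.
RECOMB_PROB_PER_CM = 0.01
CM_PER_MB = 1.0            # the slide's premise: 1 cM ~ 1 Mb
BP_PER_MB = 1_000_000

recomb_per_bp = CM_PER_MB * RECOMB_PROB_PER_CM / BP_PER_MB
mutation_per_bp = 1e-8     # the slide's upper estimate

ratio = recomb_per_bp / mutation_per_bp
print("recombination per bp per generation:", recomb_per_bp)
print("recombination / mutation ratio:     ", ratio)
```

A ratio of about 1 (or more, if the mutation probability is below its upper estimate) is what motivates the conclusion that junctions are at least as numerous as polymorphisms.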
17
Genealogical graphs cannot in general be reconstructed
• Even with infinitely many polymorphisms, a substantial fraction of all junctions would not be detected
• In reality, there are clearly too few polymorphisms per junction to estimate the graph
• Remember: a phylogenetic algorithm will always reconstruct a tree, regardless of whether there exists a tree to be reconstructed...
18
We do not in general wish to reconstruct genealogical graphs
• Population genetics is not phylogenetics!
• Gene genealogies are of no interest per se: they are random outcomes of an underlying evolutionary process, and are of interest only insofar as they contain information about this process
19
Gene trees and species trees
Phylogenetic methods estimate species trees by estimating gene trees; they are appropriate if and only if the latter are strongly correlated with the former
20
Phylogenetic methods are not applicable to within-species data
[Figure: three models of human origins relating Africa, Europe, and Asia on a time axis from 1 million years ago to 0, with migration indicated: a) Out-of-Africa Model, b) Multiregional Model, c) Candelabra Model]
[Figure: schematic version of the human mtDNA tree, with Africans and non-Africans]
• We must consider the likelihood of the data under alternative models
21
A likelihood framework
Phylogenetics: $L = P(D \mid G, \mu)$
Population genetics: $L = \sum_G P(D \mid G, \mu)\, P(G, \alpha)$
Here D is the data, G the genealogy, µ the mutation model, and α the demographic model
Note that G is a nuisance parameter in population genetics
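To make the sum over genealogies concrete, here is a minimal sketch for the simplest possible case, chosen for illustration and not taken from the slides: two sequences, data D = the number of differences `s` between them, a Poisson mutation model with a hypothetical scaled rate `theta`, and the standard coalescent as the demographic model. The genealogy is then just the pairwise coalescence time, and the sum becomes an average of P(D | G, µ) over genealogies drawn from the demographic model. In this toy case the answer is also known in closed form, which allows a check.

```python
import math
import random

def prob_data_given_genealogy(s, t, theta):
    """P(D | G, mu): s mutations on two branches of length t, each at assumed rate theta / 2."""
    mean = theta * t
    return math.exp(-mean) * mean**s / math.factorial(s)

def monte_carlo_likelihood(s, theta, reps, rng):
    """Average P(D | G, mu) over genealogies sampled from the standard coalescent."""
    total = 0.0
    for _ in range(reps):
        t = rng.expovariate(1.0)   # pairwise coalescence time, in units of N generations
        total += prob_data_given_genealogy(s, t, theta)
    return total / reps

def exact_likelihood(s, theta):
    """Closed form for this toy case: integrate the Poisson term against exp(-t)."""
    return theta**s / (1 + theta) ** (s + 1)

rng = random.Random(3)
s, theta = 3, 2.0   # illustrative values only
estimate = monte_carlo_likelihood(s, theta, 50000, rng)
print("Monte Carlo estimate:", estimate)
print("closed form:         ", exact_likelihood(s, theta))
```

The genealogy never appears in the result, which is the sense in which G is a nuisance parameter: it is averaged out rather than estimated.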
22
Uses of the coalescent
• A mathematical modeling tool
• A simulation tool for hypothesis testing and exploratory data analysis
• The basis for full likelihood inference
23
The simplicity and elegance of the coalescent process make it a powerful modeling tool
At least for the standard coalescent, it is often possible to derive results analytically
• Estimators and tests, e.g., Tajima’s D statistic
• Illuminating theoretical results, e.g., the probability that a sample of size n contains the MRCA of the entire population is $\frac{n - 1}{n + 1}$
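Both examples above reduce to short formulas. The sketch below evaluates the MRCA probability and computes Tajima's D from summary statistics using the constants from Tajima's published definition; the function names and the input values are hypothetical, chosen only for illustration.

```python
import math

def prob_sample_contains_mrca(n):
    """Probability that a sample of size n contains the MRCA of the entire population."""
    return (n - 1) / (n + 1)

def tajimas_d(n, seg_sites, pi):
    """Tajima's D from sample size n, number of segregating sites, and mean pairwise differences pi."""
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Difference between two estimators of the scaled mutation rate, standardized.
    return (pi - seg_sites / a1) / math.sqrt(e1 * seg_sites + e2 * seg_sites * (seg_sites - 1))

for n in (2, 10, 100):
    print(n, prob_sample_contains_mrca(n))

# Hypothetical summary statistics for a sample of 10 sequences.
print(tajimas_d(n=10, seg_sites=16, pi=3.9))
```

D is zero when the two estimators agree, and negative when there are more segregating sites than the pairwise differences would suggest.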
24
Almost any scenario can be simulated using the coalescent
• Coalescent simulations are enormously more efficient than classical methods
• Simulated data can be compared with real data, or used to evaluate the feasibility of a study before it is carried out
25
Example: ancient Neanderthal mtDNA
[Figure: mtDNA tree of 986 modern humans and the Neanderthal, with time labels ts, Te, and Tr]
• Modern humans monophyletic
• Tr > 4Te
Does this prove that Neanderthals and modern humans did not interbreed?
26
Example: ancient Neanderthal mtDNA
[Figure: mtDNA tree of 986 modern humans and the Neanderthal, with time labels ts, Te, and Tr, and the Mediterranean marked]
Assuming that they did interbreed, what is the probability of getting a tree like the one observed just by chance?
Coalescent simulations showed that this probability is high even for large amounts of interbreeding
27
Full likelihood analysis
• In principle possible
• In practice difficult
• Unless major breakthroughs are made, not likely to be applicable to genomic polymorphism data
28
What is the main insight from coalescent theory?
That very large numbers of loci are required to answer most questions!
29
Population Genomics is upon us!
• Data sets containing 100s and 1000s of loci already exist
• Within 10 years, it seems likely that whole-genome comparisons between species will be common, and that we will have whole-genome sequences from 1000s of humans
30
Fewer assumptions, more data
We will be able to use empirically estimated distributions of test statistics rather than theoretically predicted ones
[Figure: D (y-axis, -2 to 2) against position in kb (x-axis, 200 to 600)]
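One simple way to use an empirically estimated distribution is to ask how extreme a locus is relative to the values observed across the genome, rather than relative to a theoretical prediction. The sketch below does this for the lower tail; the function name and the D values are hypothetical, made up purely for illustration.

```python
def empirical_p_value(observed, genome_wide_values):
    """Fraction of genome-wide values that are at least as low as the observed one."""
    at_least_as_low = sum(1 for value in genome_wide_values if value <= observed)
    return at_least_as_low / len(genome_wide_values)

# Hypothetical values of a test statistic (e.g. D) measured in windows along a chromosome.
genome_wide_d = [-2.0, -1.0, 0.0, 0.5, 1.0]
candidate_d = -1.9

print(empirical_p_value(candidate_d, genome_wide_d))
```

With real data the reference set would contain many more windows, and the choice of tail would depend on the kind of departure being looked for.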
31
Selective sweeps
• Fixation of new alleles leaves a footprint in the pattern of genomic variation
• Can we find the genes that “make us human”?
[Figure labels: selection, advantageous variant]
32
How many genes?
33
Teosinte to corn: < 10,000 years; five genes?
[Figure: teosinte, maize, and maize with tb1 mutation]
34
What’s the use of polymorphism data?
• Whole-genome properties
– demographic (sensu lato) history
– molecular evolution
– genetic mechanisms
• The history of individual loci — selection
– divergence between human and other primates
– traces of selection within the last million years
35
The history and future of multi-locus methods
36