Post on 19-Dec-2015
Phylogeny II : Parsimony, ML, SEMPHY
Phylogenetic Tree
Topology: bifurcating
Leaves: 1…N
Internal nodes: N+1…2N-2
[Figure: a tree with a leaf, a branch, and an internal node labelled.]
Character Based Methods
We start with a multiple alignment.
Assumptions:
• All sequences are homologous
• Each position in the alignment is homologous
• Positions evolve independently
• No gaps
We seek to explain the evolution of each position in the alignment
Parsimony
Character-based method
A way to score trees (but not to build trees!)
Assumptions:
• Independence of characters (no interactions)
• The best tree is the one where the fewest changes take place
A Simple Example
What is the parsimony score of the following alignment?
A (Aardvark): CAGGTA
B (Bison): CAGACA
C (Chimp): CGGGTA
D (Dog): TGCACT
E (Elephant): TGCGTA
A Simple Example
Each column is scored separately. Let’s look at the first column:
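Minimal tree has one evolutionary change: C ↔ T
[Figure: a tree over the five species with first-column leaves C, C, C, T, T; the internal nodes are labelled C or T so that only one branch carries a change.]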
Evaluating Parsimony Scores
How do we compute the Parsimony score for a given tree?
Traditional Parsimony: each base change has a cost of 1
Weighted Parsimony: each change from a to b is weighted by the score c(a,b)
Traditional Parsimony
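$$
\mathrm{Par}(x_1,\ldots,x_n;\,T) \;=\; \min_{\text{internal nodes}} \sum_{(u,v)\in E} \mathbf{1}\{x_u \neq x_v\}
$$
[Figure: a three-leaf example with leaves a, g, a; the internal nodes carry the candidate sets {a,g} and {a}, and both are finally labelled a.]
• Solved independently for each position
• Linear time solution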
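The set-based, bottom-up evaluation of the traditional parsimony score can be sketched as follows. This is a minimal illustration of Fitch's algorithm, not code from the lecture: the nested-tuple tree encoding, the function names, and the topology `((A,B),C),(D,E)` used for the five-species example are all assumptions made here.

```python
def fitch_column_score(tree, leaf_states):
    """Minimal number of changes for one alignment column (Fitch's algorithm).

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    leaf_states: dict mapping leaf name -> observed character.
    """
    def visit(node):
        if isinstance(node, str):            # leaf: the state set is the observed base
            return {leaf_states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:                            # children can agree: no extra change
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def parsimony_score(tree, sequences):
    """Sum the per-column scores; columns are treated independently."""
    length = len(next(iter(sequences.values())))
    return sum(
        fitch_column_score(tree, {name: seq[m] for name, seq in sequences.items()})
        for m in range(length)
    )


# The five-species alignment from the simple example.
sequences = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA", "D": "TGCACT", "E": "TGCGTA"}
# Assumed topology (the slides do not spell it out in text).
tree = ((("A", "B"), "C"), ("D", "E"))

first_column = {name: seq[0] for name, seq in sequences.items()}
first_column_score = fitch_column_score(tree, first_column)   # one C/T change
total_score = parsimony_score(tree, sequences)
```

Each node is visited once per column, which is where the linear-time claim comes from.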
Evaluating Weighted Parsimony
Dynamic programming on the tree
S(i,a) = cost of tree rooted at i if i is labeled by a
Initialization: for each leaf i, set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞
Iteration: if k is a node with children i and j, then
$$ S(k,a) = \min_b \left(S(i,b) + c(a,b)\right) + \min_b \left(S(j,b) + c(a,b)\right) $$
Termination: the cost of the tree is $\min_a S(r,a)$, where r is the root
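The dynamic program for weighted parsimony can be sketched as below. This is a hedged illustration: the nested-tuple tree encoding, the function names, and the transition/transversion cost function (1 and 2) are assumptions chosen for the example, not values from the lecture.

```python
import math

ALPHABET = "ACGT"
PURINES = {"A", "G"}


def unit_cost(a, b):
    """Traditional parsimony as a special case: every change costs 1."""
    return 0 if a == b else 1


def ts_tv_cost(a, b):
    """Hypothetical weights: transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    same_class = (a in PURINES) == (b in PURINES)
    return 1 if same_class else 2


def weighted_parsimony(tree, leaf_states, cost):
    """Cost of the best labelling of one column on a rooted binary tree.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    def visit(node):
        # Returns the table S(node, a) for every character a.
        if isinstance(node, str):
            return {a: 0 if leaf_states[node] == a else math.inf for a in ALPHABET}
        s_i, s_j = visit(node[0]), visit(node[1])
        return {
            a: min(s_i[b] + cost(a, b) for b in ALPHABET)
            + min(s_j[b] + cost(a, b) for b in ALPHABET)
            for a in ALPHABET
        }

    return min(visit(tree).values())


tree = ((("A", "B"), "C"), ("D", "E"))
column = {"A": "C", "B": "C", "C": "C", "D": "T", "E": "T"}
score_unit = weighted_parsimony(tree, column, unit_cost)
score_weighted = weighted_parsimony(tree, column, ts_tv_cost)
```

With `unit_cost` the recursion reduces to the traditional score; storing the minimizing `b` at each step would give the traceback needed to recover ancestral states.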
Cost of Evaluating Parsimony
The score is evaluated on each position independently. Scores are then summed over all positions.
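If there are n nodes, m characters, and k possible values for each character, then the complexity is O(nmk) for traditional parsimony. The weighted dynamic program takes a minimum over k child states for each of the k parent states, so it costs O(nmk²).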
By keeping traceback information, we can reconstruct most parsimonious values at each ancestor node
Maximum Parsimony
1 2 3 4 5 6 7 8 9 10
Species 1 - A G G G T A A C T G
Species 2 - A C G A T T A T T A
Species 3 - A T A A T T G T C T
Species 4 - A A T G T T G T C G
How many possible unrooted trees?
Maximum Parsimony
How many possible unrooted trees? For four species there are three.
[Figure: the three unrooted trees on species 1-4.]
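Maximum Parsimony
How many substitutions?
[Figure: a four-leaf tree on species 1-4 with leaf states A, A, G, G, shown with two different internal-node labellings: one needs 1 change, the other needs 5 changes. The parsimony score of a tree is the minimum over all labellings.]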
Maximum Parsimony
1 2 3 4 5 6 7 8 9 10
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
Score of column 1 on each of the three trees:
Tree 1: 0
Tree 2: 0
Tree 3: 0
Maximum Parsimony
1 2 3 4 5 6 7 8 9 10
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
Scores of columns 1-2 on each of the three trees:
Tree 1: 0 3
Tree 2: 0 3
Tree 3: 0 3
Maximum Parsimony
Column 2:
1 - G
2 - C
3 - T
4 - A
[Figure: the three trees with internal nodes labelled for column 2. All four species differ, so each tree needs 3 changes.]
Maximum Parsimony
1 2 3 4 5 6 7 8 9 10
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
Scores of columns 1-3 on each of the three trees:
Tree 1: 0 3 2
Tree 2: 0 3 2
Tree 3: 0 3 2
Maximum Parsimony
1 2 3 4 5 6 7 8 9 10
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
Scores of columns 1-4 on each of the three trees:
Tree 1: 0 3 2 2
Tree 2: 0 3 2 1
Tree 3: 0 3 2 2
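Maximum Parsimony
Column 4:
1 - G
2 - A
3 - A
4 - G
[Figure: the three trees with internal nodes labelled for column 4. Two trees need 2 changes; the tree that groups species 1 with 4 (both G) and 2 with 3 (both A) needs only 1 change.]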
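The per-column comparison of the three four-species trees can be reproduced with a short sketch. The function name and the `(a,b)|(c,d)` labels for the topologies are assumptions made here for illustration; the two columns are columns 2 and 4 of the alignment.

```python
def quartet_changes(pair1, pair2, column):
    """Minimum number of changes on the unrooted tree pair1 | pair2 for one column."""
    def join(x, y):
        # Fitch step: intersect if possible, otherwise take the union and pay 1.
        common = x & y
        return (common, 0) if common else (x | y, 1)

    left, cost_left = join({column[pair1[0]]}, {column[pair1[1]]})
    right, cost_right = join({column[pair2[0]]}, {column[pair2[1]]})
    _, cost_middle = join(left, right)
    return cost_left + cost_right + cost_middle


# The three unrooted topologies on species 1-4.
TOPOLOGIES = {
    "(1,2)|(3,4)": ((1, 2), (3, 4)),
    "(1,3)|(2,4)": ((1, 3), (2, 4)),
    "(1,4)|(2,3)": ((1, 4), (2, 3)),
}

column2 = {1: "G", 2: "C", 3: "T", 4: "A"}
column4 = {1: "G", 2: "A", 3: "A", 4: "G"}

scores_col2 = {name: quartet_changes(p, q, column2) for name, (p, q) in TOPOLOGIES.items()}
scores_col4 = {name: quartet_changes(p, q, column4) for name, (p, q) in TOPOLOGIES.items()}
best_for_col4 = min(scores_col4, key=scores_col4.get)
```

Summing such per-column scores over the whole alignment and taking the tree with the smallest total gives the maximum parsimony tree.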
Maximum Parsimony
Column   1 2 3 4 5 6 7 8 9 10   Total
Tree 1   0 3 2 2 0 1 1 1 1 3    14
Tree 2   0 3 2 1 0 1 2 1 2 3    15
Tree 3   0 3 2 2 0 1 2 1 2 3    16
[Figure: the three unrooted trees on species 1-4, one per row.]
Maximum Parsimony
1 2 3 4 5 6 7 8 9 10
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
Column scores of the maximum parsimony tree: 0 3 2 2 0 1 1 1 1 3, total 14
[Figure: the maximum parsimony tree on species 1-4.]
Searching for Trees
#Taxa   #Trees
3       1
4       3
5       15
10      2 x 10^6
50      3 x 10^74
100     2 x 10^182
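The growth in this table follows the standard count of unrooted bifurcating trees on n labelled taxa, (2n-5)!! = 3 · 5 · … · (2n-5). The formula is not stated on the slide; the sketch below, with a function name chosen here, assumes it and reproduces the small entries exactly.

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees on n_taxa labelled leaves: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for factor in range(3, 2 * n_taxa - 4, 2):   # 3, 5, ..., 2n - 5
        count *= factor
    return count


tree_counts = {n: count_unrooted_trees(n) for n in (3, 4, 5, 10, 50, 100)}
```

Python integers have arbitrary precision, so the counts for 50 and 100 taxa are computed exactly.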
Searching for the Optimal Tree
Exhaustive search: very intensive
Branch and bound: a compromise
Heuristic: fast; usually starts with NJ (neighbor joining)
Phylogenetic Tree Assumptions
Topology: bifurcating
Leaves: 1…N
Internal nodes: N+1…2N-2
Lengths: t = {t_i} for each branch
Phylogenetic tree = (Topology, Lengths) = (T,t)
[Figure: a tree with a leaf, a branch, and an internal node labelled.]
Probabilistic Methods
The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences.
Background probabilities: q(a)
Mutation probabilities: P(a|b, t)
Models for evolutionary mutations:
• Jukes Cantor
• Kimura 2-parameter model
Such models are used to derive the probabilities
Jukes Cantor model
A model for mutation rates
• Mutation occurs at a constant rate
• Each nucleotide is equally likely to mutate into any other nucleotide, with rate α.
Kimura 2-parameter model
Allows a different rate for transitions and transversions.
Mutation Probabilities
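The rate matrix R is used to derive the mutation probability matrix S. Over a short time interval Δt:
$$ S(\Delta t) \approx I + R\,\Delta t $$
S is obtained by integration. For Jukes Cantor, with α the rate of change to each other nucleotide:
$$ P(a \mid a, t) = S(t)_{aa} = \tfrac{1}{4}\left(1 + 3e^{-4\alpha t}\right) $$
$$ P(g \mid a, t) = S(t)_{ag} = \tfrac{1}{4}\left(1 - e^{-4\alpha t}\right) $$
q can be obtained by letting t go to infinity, which gives q(a) = 1/4 for every nucleotide.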
Mutation Probabilities
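Both models satisfy the following properties:
Lack of memory:
$$ P_{a\to c}(t+t') = \sum_b P_{a\to b}(t)\,P_{b\to c}(t') $$
Reversibility: there exist stationary probabilities {P_a} such that
$$ P_a\,P_{a\to b}(t) = P_b\,P_{b\to a}(t) $$
[Figure: the four nucleotides A, G, T, C with the possible mutations between them.]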
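The Jukes Cantor probabilities and the two properties can be checked numerically. The sketch below assumes the standard closed form with α as the rate of change to each other nucleotide and uniform stationary probabilities of 1/4; the function names are chosen here for illustration.

```python
import math

NUCLEOTIDES = "ACGT"


def jc_prob(b, a, t, alpha=1.0):
    """P(b | a, t) under Jukes Cantor."""
    decay = math.exp(-4.0 * alpha * t)
    if a == b:
        return 0.25 * (1.0 + 3.0 * decay)
    return 0.25 * (1.0 - decay)


def lack_of_memory_gap(a, c, t1, t2):
    """|sum_b P(a->b, t1) P(b->c, t2) - P(a->c, t1 + t2)|; should be about 0."""
    two_step = sum(jc_prob(b, a, t1) * jc_prob(c, b, t2) for b in NUCLEOTIDES)
    return abs(two_step - jc_prob(c, a, t1 + t2))


def reversibility_gap(a, b, t, q=0.25):
    """|q(a) P(a->b, t) - q(b) P(b->a, t)|; should be about 0."""
    return abs(q * jc_prob(b, a, t) - q * jc_prob(a, b, t))


row_sum = sum(jc_prob(b, "A", 0.3) for b in NUCLEOTIDES)   # probabilities out of A
long_time = jc_prob("G", "A", 50.0)                         # approaches q(G) = 1/4
```

At t = 0 the matrix is the identity, and for large t every entry tends to 1/4, which is the background distribution q.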
Probabilistic Approach
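Given P, q, the tree topology and the branch lengths, we can compute:
$$ P(x_1,x_2,x_3,x_4,x_5 \mid T,t) = q(x_5)\,p(x_4 \mid x_5,t_4)\,p(x_3 \mid x_5,t_3)\,p(x_1 \mid x_4,t_1)\,p(x_2 \mid x_4,t_2) $$
[Figure: a tree with root x5, whose children are x4 (branch t4) and x3 (branch t3); x4 has children x1 (branch t1) and x2 (branch t2).]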
Computing the Tree Likelihood
We are interested in the probability of the observed data given the tree and the branch “lengths”:
$$ P(x_1,x_2,x_3 \mid T,t) = \sum_{x_4,x_5} P(x_1,x_2,x_3,x_4,x_5 \mid T,t) $$
• Computed by summing over internal nodes
• This can be done efficiently using an upward traversal pass over the tree.
Tree Likelihood Computation
Define P(L_k|a) = probability of the leaves below node k given that x_k = a
Init: for leaves, P(L_k|a) = 1 if x_k = a; 0 otherwise
Iteration: if k is a node with children i and j, then
$$ P(L_k \mid a) = \sum_{b,c} P(b \mid a,t_i)\,P(L_i \mid b)\,P(c \mid a,t_j)\,P(L_j \mid c) $$
Termination: the likelihood is
$$ P(x_1,x_2,x_3 \mid T,t) = \sum_a P(L_{root} \mid a)\,q(a) $$
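The upward pass can be sketched for a single alignment column as below. The Jukes Cantor mutation model, the branch lengths, the leaf states, and the tree encoding (a leaf name, or a pair of `(child, branch_length)` entries) are all assumptions made for this illustration; the tree shape follows the five-node example with root x5.

```python
import math

ALPHABET = "ACGT"


def jc_prob(b, a, t, alpha=1.0):
    """P(b | a, t) under Jukes Cantor (assumed mutation model for this sketch)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay) if a == b else 0.25 * (1.0 - decay)


def conditional_likelihoods(node, leaf_states):
    """Return {a: P(L_k | a)} for the subtree rooted at node."""
    if isinstance(node, str):                      # leaf initialization
        return {a: 1.0 if leaf_states[node] == a else 0.0 for a in ALPHABET}
    (child_i, t_i), (child_j, t_j) = node
    below_i = conditional_likelihoods(child_i, leaf_states)
    below_j = conditional_likelihoods(child_j, leaf_states)
    # The double sum over (b, c) factorizes into a product of two single sums.
    return {
        a: sum(jc_prob(b, a, t_i) * below_i[b] for b in ALPHABET)
        * sum(jc_prob(c, a, t_j) * below_j[c] for c in ALPHABET)
        for a in ALPHABET
    }


def tree_likelihood(root, leaf_states, q=0.25):
    """Termination step: sum over root states weighted by the background q."""
    at_root = conditional_likelihoods(root, leaf_states)
    return sum(at_root[a] * q for a in ALPHABET)


# x5 is the root with children x4 and x3; x4 has children x1 and x2.
x4 = (("x1", 0.1), ("x2", 0.2))
x5 = ((x4, 0.3), ("x3", 0.4))
likelihood = tree_likelihood(x5, {"x1": "A", "x2": "A", "x3": "G"})
```

Each node does a constant amount of work per pair of states, so the cost is linear in the number of nodes rather than exponential in the number of internal nodes.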
Maximum Likelihood (ML)
Score each tree by
$$ P(X_1,\ldots,X_n \mid T,t) = \prod_m P(x_1[m],\ldots,x_n[m] \mid T,t) $$
• Assumption of independent positions
Branch lengths t can be optimized:
• Gradient ascent
• EM
We look for the highest scoring tree:
• Exhaustive search
• Sampling methods (Metropolis)
Optimal Tree Search
Perform search over possible topologies
[Figure: candidate topologies T1, T2, T3, T4, …, Tn; for each one, parametric optimization (EM) climbs in parameter space and may end in a local maximum.]
Computational Problem
Such procedures are computationally expensive!
• Computation of optimal parameters, per candidate, requires a non-trivial optimization step.
• We spend non-negligible computation on a candidate, even if it is a low scoring one.
• In practice, such learning procedures can only consider small sets of candidate structures.
Structural EM
Idea: Use parameters found for current topology to help evaluate new topologies.
Outline:
• Perform search in (T, t) space.
• Use EM-like iterations:
E-step: use the current solution to compute expected sufficient statistics for all topologies
M-step: select a new topology based on these expected sufficient statistics
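The Complete-Data Scenario
Suppose we observe H, the ancestral sequences.
$$
\begin{aligned}
l_{complete}(T,t : D,H) &= \sum_m \log P\left(x_{[1\ldots 2N-2]}[m] \mid T,t\right) \\
&= \sum_i \sum_m \log p_{x_i[m]} + \sum_{(i,j)\in T} \sum_m \log \frac{p_{x_i[m]\to x_j[m]}(t_{i,j})}{p_{x_j[m]}} \\
&= \text{const} + \sum_{(i,j)\in T} F(t_{i,j}, S_{i,j})
\end{aligned}
$$
S_{i,j} is a matrix of the number of co-occurrences of each pair (a,b) in taxa i and j.
F is a linear function of S_{i,j}.
Define:
$$ w_{i,j} = \max_t F(t, S_{i,j}) $$
Find: the topology T that maximizes
$$ \sum_{(i,j)\in T} w_{i,j} $$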
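The pairwise quantities can be illustrated for two fully observed sequences. This sketch assumes the Jukes Cantor model with uniform background 1/4, for which the maximizing branch length has a closed form; the function names are chosen here, and the two sequences are species 1 and 2 from the parsimony example, used only as sample data.

```python
import math
from collections import Counter


def cooccurrence_counts(seq_i, seq_j):
    """S_ij(a, b): number of positions where taxon i has a and taxon j has b."""
    return Counter(zip(seq_i, seq_j))


def jc_prob(b, a, t, alpha=1.0):
    """P(b | a, t) under Jukes Cantor (assumed model for this sketch)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay) if a == b else 0.25 * (1.0 - decay)


def local_score(t, counts, alpha=1.0):
    """F(t, S) = sum_{a,b} S(a,b) * [log p_{a->b}(t) - log p_b], with p_b = 1/4."""
    return sum(
        n * (math.log(jc_prob(b, a, t, alpha)) - math.log(0.25))
        for (a, b), n in counts.items()
    )


def best_branch_length(counts, alpha=1.0):
    """Closed-form maximizer of F under Jukes Cantor."""
    total = sum(counts.values())
    differing = sum(n for (a, b), n in counts.items() if a != b)
    p = differing / total
    if p >= 0.75:
        return math.inf          # saturated: no finite maximizer
    return -math.log(1.0 - 4.0 * p / 3.0) / (4.0 * alpha)


counts = cooccurrence_counts("AGGGTAACTG", "ACGATTATTA")
t_best = best_branch_length(counts)
weight = local_score(t_best, counts)     # w_ij = max_t F(t, S_ij)
```

Because F depends on the data only through the counts, the same code works unchanged when the counts are replaced by expected counts.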
Expected Likelihood
Start with a tree (T0,t0).
Compute:
$$ E[S_{i,j}(a,b)] = \sum_m P\left(X_i[m]=a,\, X_j[m]=b \mid x_{[1\ldots N]}[m],\, T_0, t_0\right) $$
Formal justification:
Define:
$$ Q(T,t) = E\left[\,l_{complete}(T,t : D,H) \mid D, T_0, t_0\right] = \sum_{(i,j)\in T} F(t_{i,j}, E[S_{i,j}]) + \text{const} $$
Theorem:
$$ l(T,t : D) - l(T_0,t_0 : D) \;\ge\; Q(T,t) - Q(T_0,t_0) $$
Consequence: improvement in expected score ⇒ improvement in likelihood
Algorithm Outline
Original tree (T0,t0)
Compute: $E[S_{i,j}(a,b) \mid D, T_0, t_0]$
• Unlike standard EM for trees, we compute all possible pairwise statistics
• Time: O(N²M)
Weights: $w_{i,j} = \max_t F(t, E[S_{i,j}])$
• Pairwise weights
• This stage also computes the branch length for each pair (i,j)
Algorithm Outline
Compute: $E[S_{i,j}(a,b) \mid D, T_0, t_0]$
Weights: $w_{i,j} = \max_t F(t, E[S_{i,j}])$
Find: $T' = \arg\max_T \sum_{(i,j)\in T} w_{i,j}$
• Max. spanning tree
• Fast greedy procedure to find the tree
By construction: Q(T’,t’) ≥ Q(T0,t0)
Thus, l(T’,t’) ≥ l(T0,t0)
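A greedy maximum spanning tree over pairwise weights can be sketched with Kruskal's algorithm and a union-find structure. The slide only says "fast greedy procedure", so the choice of Kruskal is an assumption, and the weights below are made-up numbers for illustration.

```python
def maximum_spanning_tree(n_nodes, weights):
    """Return the edges of a maximum-weight spanning tree.

    weights: dict mapping a pair (i, j) to its weight w_ij.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Follow parent pointers to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    # Consider the heaviest pairs first and keep those that join two components.
    for (i, j), _ in sorted(weights.items(), key=lambda item: item[1], reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            edges.append((i, j))
    return edges


# Hypothetical pairwise weights on four nodes.
example_weights = {
    (0, 1): 5.0, (0, 2): 1.0, (0, 3): 2.0,
    (1, 2): 4.0, (1, 3): 1.5, (2, 3): 3.0,
}
tree_edges = maximum_spanning_tree(4, example_weights)
tree_weight = sum(example_weights[e] for e in tree_edges)
```

The resulting tree may have nodes of arbitrary degree, which is why a separate step is needed afterwards to turn it into a bifurcating topology.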
Algorithm Outline
Compute: $E[S_{i,j}(a,b) \mid D, T_0, t_0]$
Weights: $w_{i,j} = \max_t F(t, E[S_{i,j}])$
Find: $T' = \arg\max_T \sum_{(i,j)\in T} w_{i,j}$
Construct bifurcation T1
Fix tree:
• Remove redundant nodes
• Add nodes to break large degree
This operation preserves likelihood: l(T1,t’) = l(T’,t’) ≥ l(T0,t0)
Algorithm Outline
Compute: $E[S_{i,j}(a,b) \mid D, T_0, t_0]$
Weights: $w_{i,j} = \max_t F(t, E[S_{i,j}])$
Find: $T' = \arg\max_T \sum_{(i,j)\in T} w_{i,j}$
Construct bifurcation T1
Assessing trees: the Bootstrap
Often we don’t trust the tree found as the “correct” one.
Bootstrapping:
• Sample (with replacement) n positions from the alignment
• Learn the best tree for each sample
• Look for tree features which are frequent in all trees
For some models this procedure approximates the tree posterior P(T | X1,…,Xn)
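The resampling step of the bootstrap can be sketched as follows. Only the column resampling is shown; learning a tree for each replicate and counting shared features is left out. The function name and the dictionary layout of the alignment are assumptions made here.

```python
import random


def bootstrap_alignment(sequences, rng):
    """Resample alignment columns with replacement, keeping rows aligned.

    sequences: dict mapping taxon name -> aligned sequence (equal lengths).
    rng: a random.Random instance, passed in so replicates are reproducible.
    """
    n_positions = len(next(iter(sequences.values())))
    picks = [rng.randrange(n_positions) for _ in range(n_positions)]
    # The same column indices are used for every taxon.
    return {name: "".join(seq[p] for p in picks) for name, seq in sequences.items()}


alignment = {
    "Species 1": "AGGGTAACTG",
    "Species 2": "ACGATTATTA",
    "Species 3": "ATAATTGTCT",
    "Species 4": "AATGTTGTCG",
}
rng = random.Random(0)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
```

Each replicate has the same number of positions as the original alignment, so any tree-building method can be applied to it unchanged.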
Algorithm Outline
Compute: $E[S_{i,j}(a,b) \mid D, T_0, t_0]$
Weights: $w_{i,j} = \max_t F(t, E[S_{i,j}])$
Find: $T' = \arg\max_T \sum_{(i,j)\in T} w_{i,j}$
Construct bifurcation T1
New tree (T1,t1)
Thm: l(T1,t1) ≥ l(T0,t0)
These steps are then repeated until convergence