marta casanellas rius (joint work with jesus fern andez-s...
TRANSCRIPT
Phylogenetic invariants versus classical phylogenetics
Marta Casanellas Rius(joint work with Jesus Fernandez-Sanchez)
Departament de Matematica Aplicada IUniversitat Politecnica de Catalunya
Algebraic Statistics 2014IIT, Chicago
May 2014
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 1 / 55
Outline
1 Evolutionary Markov models
2 Phylogenetic invariants
3 Eriksson’s method revisited
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 2 / 55
Outline
1 Evolutionary Markov models
2 Phylogenetic invariants
3 Eriksson’s method revisited
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 2 / 55
Outline
1 Evolutionary Markov models
2 Phylogenetic invariants
3 Eriksson’s method revisited
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 2 / 55
Outline
1 Evolutionary Markov models
2 Phylogenetic invariants
3 Eriksson’s method revisited
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 3 / 55
Phylogenetic trees
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 4 / 55
Phylogenetic reconstruction
Given n current species, e.g. n = 3, HUMAN, GORILLA and CHIMP, wewant to reconstruct their ancestral relationships (phylogeny):
Input: DNA sequences (part of their genome) and an alignment
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 5 / 55
Phylogenetic reconstruction
Given n current species, e.g. n = 3, HUMAN, GORILLA and CHIMP, wewant to reconstruct their ancestral relationships (phylogeny):
Input: DNA sequences (part of their genome) and an alignment
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 5 / 55
Phylogenetic reconstruction
Given n current species, e.g. n = 3, HUMAN, GORILLA and CHIMP, wewant to reconstruct their ancestral relationships (phylogeny):
Input: DNA sequences (part of their genome) and an alignment
AACTTCGAGGCTTACCGCTG...
AAGGTCGATGCTCACCGATG...
AACGTCTATGCTCACCGATG...
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 5 / 55
Phylogenetic reconstruction goals
Goals:
Reconstruct the correct tree topology (topology of the graph withspecies labels at the leaves). E.g. 3 different tree topologies ofunrooted trees on four leaves :
Estimate branch lengths: the amount of substitutions per site thathave occurred between the two species at the ends of the branch
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 6 / 55
Modelling evolution
Assume that all sites of the alignment evolve independently andaccording to the same process (i.i.d.).Model the evolution of z single site on a tree T .At each node of the tree T one puts a random variable taking valuesin {A, C, G, T}
Variables at the leaves are observed and variables at the interior nodesare hidden.At each branch one writes a substitution matrix with the conditionalprobabilities of a nucleotide at the parent node being substituted byanother at its child.
A C G T
Si =
ACGT
P(A|A, i) P(C |A, i) P(G |A, i) P(T |A, i)P(A|C , i) P(C |C , i) P(G |C , i) P(T |C , i)P(A|G , i) P(C |G , i) P(G |G , i) P(T |G , i)P(A|T , i) P(C |T , i) P(G |T , i) P(T |T , i)
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 7 / 55
Modelling evolution
Assume that all sites of the alignment evolve independently andaccording to the same process (i.i.d.).Model the evolution of z single site on a tree T .At each node of the tree T one puts a random variable taking valuesin {A, C, G, T}
Variables at the leaves are observed and variables at the interior nodesare hidden.At each branch one writes a substitution matrix with the conditionalprobabilities of a nucleotide at the parent node being substituted byanother at its child.
A C G T
Si =
ACGT
P(A|A, i) P(C |A, i) P(G |A, i) P(T |A, i)P(A|C , i) P(C |C , i) P(G |C , i) P(T |C , i)P(A|G , i) P(C |G , i) P(G |G , i) P(T |G , i)P(A|T , i) P(C |T , i) P(G |T , i) P(T |T , i)
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 7 / 55
Modelling evolution
Assume that all sites of the alignment evolve independently andaccording to the same process (i.i.d.).Model the evolution of z single site on a tree T .At each node of the tree T one puts a random variable taking valuesin {A, C, G, T}
Variables at the leaves are observed and variables at the interior nodesare hidden.At each branch one writes a substitution matrix with the conditionalprobabilities of a nucleotide at the parent node being substituted byanother at its child.
A C G T
Si =
ACGT
P(A|A, i) P(C |A, i) P(G |A, i) P(T |A, i)P(A|C , i) P(C |C , i) P(G |C , i) P(T |C , i)P(A|G , i) P(C |G , i) P(G |G , i) P(T |G , i)P(A|T , i) P(C |T , i) P(G |T , i) P(T |T , i)
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 7 / 55
Evolutionary Markov models
The entries of Si and the root distribution are unknown parameters.
According to the shape of (or group of symmetries in S4 acting on) Si
one has different evolutionary models: general Markov model, strandsymmetric model, Kimura 3-parameter, Kimura 2-parameters,Jukes-Cantor model.
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 8 / 55
Evolutionary Markov models
Jukes-Cantor model
Root node has uniform distribution: pA = ... = pT = 0.25
A C G T
Si =
ACGT
ai bi bi bi
bi ai bi bi
bi bi ai bi
bi bi bi ai
, ai + 3bi = 1
only 2 different parameters (one for mutation and one for non-mutation),only 1 free parameter.
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 9 / 55
Evolutionary Markov models
Kimura 3-parameter model
Root node has uniform distribution: pA = ... = pT = 0.25
A C G T
Si =ACGT
(ai bi ci dibi ai di cici di ai bidi ci bi ai
), ai + bi + ci + di = 1
Kimura 2-parameter: bi = di
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 9 / 55
Evolutionary Markov models
Strand Symmetric model
Root node distribution satisfies: pA = pT, pC = pG
A C G T
Si =
ACGT
ai bi ci di
ei fi gi hi
hi gi fi eidi ci bi ai
Strand symmetry in a DNA molecule:
A–TC–G
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 9 / 55
Evolutionary Markov models
General Markov model
Root node distribution: arbitrary
A C G T
Si =
ACGT
ai bi ci di
ei fi gi hi
ji ki li mi
ni oi pi qi
All these models are known as equivariant models: they are invariantby permutation of rows and columns of a certain subgroup of G = S4
(permutations of A,C,G,T).
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 9 / 55
Hidden Markov process
px1x2x3 = Prob(X1 = x1,X2 = x2,X3 = x3) = joint probability of theobserved variables X1,X2,X3. Then,
px1x2x3 =∑
y4,yr∈{A,C,G,T}
pyr S1(yr , x1)S4(yr , y4)S2(y4, x2)S3(y4, x3)
px1x2x3 is a homogeneous polynomial on the parameters (entries of thematrices Si and root distribution).
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 10 / 55
An evolutionary model M (with d free parameters) on a treetopology T on n leaves defines a map
ϕMT :∏
e∈E(T )4d −→ 44n−1
ϕMT :∏
e∈E(T ) Cd+1 −→ C4n
Phylogenetic variety VM(T ) = ImϕT , closure in Zariski topology.
Given an alignment, we can estimate the joint probability px1x2...xn bythe relative frequency of column x1x2 . . . xn in the alignment.
In the theoretical model, this would be a point p = (fx1x2...xn)x1x2...xn
on the variety VM(T ).
In the real world, it is only close to VM(T ) for some tree topology T .
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 11 / 55
An evolutionary model M (with d free parameters) on a treetopology T on n leaves defines a map
ϕMT :∏
e∈E(T )4d −→ 44n−1
ϕMT :∏
e∈E(T ) Cd+1 −→ C4n
Phylogenetic variety VM(T ) = ImϕT , closure in Zariski topology.
Given an alignment, we can estimate the joint probability px1x2...xn bythe relative frequency of column x1x2 . . . xn in the alignment.
In the theoretical model, this would be a point p = (fx1x2...xn)x1x2...xn
on the variety VM(T ).
In the real world, it is only close to VM(T ) for some tree topology T .
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 11 / 55
An evolutionary model M (with d free parameters) on a treetopology T on n leaves defines a map
ϕMT :∏
e∈E(T )4d −→ 44n−1
ϕMT :∏
e∈E(T ) Cd+1 −→ C4n
Phylogenetic variety VM(T ) = ImϕT , closure in Zariski topology.
Given an alignment, we can estimate the joint probability px1x2...xn bythe relative frequency of column x1x2 . . . xn in the alignment.
In the theoretical model, this would be a point p = (fx1x2...xn)x1x2...xn
on the variety VM(T ).
In the real world, it is only close to VM(T ) for some tree topology T .
Image(ϕMT )
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 11 / 55
Phylogenetic mixtures
Change the i.i.d hypothesis by “all positions evolve independently andfollowing the same evolutionary model”.
Same model M, same tree topology T = T1 but different substitutionparameters. Then,
p = (pAA...A, . . . , pTT...T) = λϕMT1(S1, . . . ,S4) + (1− λ)ϕMT1
(S ′1, . . . ,S′4)
That is, p ∈ SecVMT1
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 12 / 55
Phylogenetic mixtures
Change the i.i.d hypothesis by “all positions evolve independently andfollowing the same evolutionary model”.
Same model M, different tree topologies and different substitutionparameters. Then,
p = (pAA...A, . . . , pTT...T) = λϕMT1(S1, . . . ,S4) + (1− λ)ϕMT2
(S ′1, . . . ,S′4)
That is, p ∈ VMT1∗ VMT2
(join of two varieties).
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 13 / 55
Outline
1 Evolutionary Markov models
2 Phylogenetic invariants
3 Eriksson’s method revisited
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 14 / 55
Which generators of I (VM(T )) are useful for recovering the treetopology?
Some generators depend on the model chosen (not on the treetopology). E.g. for Jukes-Cantor, n = 3:
∑px1x2x3 = 1 and
pAAA − pCCC, pAAA − pGGG, pAAA − pTTT,
pAAC − pAAG, pAAC − pAAT, pCCA − pCCG, . . . ,
pACA − pAGA, pACA − pATA, pCAC − pCGC, . . .
pCAA − pGAA, pCAA − pTAA, pACC − pGCC, . . .For this model, the polynomials that detect the topology of the threespecies have degree 3.
Definition (Cavender–Felsenstein ’87)
The elements of I (VM(T )) are called invariants. They are calledphylogenetic invariants if they distinguish between different tree topologies.
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 15 / 55
Which generators of I (VM(T )) are useful for recovering the treetopology?
Some generators depend on the model chosen (not on the treetopology). E.g. for Jukes-Cantor, n = 3:
∑px1x2x3 = 1 and
pAAA − pCCC, pAAA − pGGG, pAAA − pTTT,
pAAC − pAAG, pAAC − pAAT, pCCA − pCCG, . . . ,
pACA − pAGA, pACA − pATA, pCAC − pCGC, . . .
pCAA − pGAA, pCAA − pTAA, pACC − pGCC, . . .For this model, the polynomials that detect the topology of the threespecies have degree 3.
Definition (Cavender–Felsenstein ’87)
The elements of I (VM(T )) are called invariants. They are calledphylogenetic invariants if they distinguish between different tree topologies.
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 15 / 55
Which generators of I (VM(T )) are useful for recovering the treetopology?
Some generators depend on the model chosen (not on the treetopology). E.g. for Jukes-Cantor, n = 3:
∑px1x2x3 = 1 and
pAAA − pCCC, pAAA − pGGG, pAAA − pTTT,
pAAC − pAAG, pAAC − pAAT, pCCA − pCCG, . . . ,
pACA − pAGA, pACA − pATA, pCAC − pCGC, . . .
pCAA − pGAA, pCAA − pTAA, pACC − pGCC, . . .For this model, the polynomials that detect the topology of the threespecies have degree 3.
Definition (Cavender–Felsenstein ’87)
The elements of I (VM(T )) are called invariants. They are calledphylogenetic invariants if they distinguish between different tree topologies.
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 15 / 55
Description of the ideal
Computational algebra software: fails for ≥ 5 leaves.
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 16 / 55
Description of the ideal
Computational algebra software: fails for ≥ 5 leaves.Model M fixed.
Theorem (Allman-Rhodes, Draisma-Kuttler, Sturmfels, Sullivant, C)
There is an algorithm for obtaining the generators of the ideal of ann-leaved tree T from those of an unrooted tree of 3 leaves and the edgesof the tree.
I (V (T )) =∑
v∈Int(T )
Iv +∑
e∈E(T )
Ie
Iv
Ie
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 16 / 55
Invariants from edges
Example: M = general Markov model.
16× 16 matrix obtained by flattening p = (pAAAA, pAAAC, . . . , pTTTT) in4⊗ C4 ∼= (
2⊗ C4)⊗ (
2⊗ C4) :
states in leaves 3,4
flatep =states
leaves
1, 2
pAAAA pAAAC pAAAG . . . pAATTpACAA pACAC pACAG . . . pACTTpAGAA pAGAC pAGAG . . . pAGTT. . . . . . . . . . . . . . .
pTTAA pTTAC pTTAG . . . pTTTT
Ie := ideal generated by all 5× 5 minors.
(Allman-Rhodes, Eriksson) Ie are phylogenetic invariants for thegeneral Markov model.
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 17 / 55
Invariants from edges
Example: M = general Markov model.
16× 16 matrix obtained by flattening p = (pAAAA, pAAAC, . . . , pTTTT) in4⊗ C4 ∼= (
2⊗ C4)⊗ (
2⊗ C4) :
states in leaves 3,4
flatep =states
leaves
1, 2
pAAAA pAAAC pAAAG . . . pAATTpACAA pACAC pACAG . . . pACTTpAGAA pAGAC pAGAG . . . pAGTT. . . . . . . . . . . . . . .
pTTAA pTTAC pTTAG . . . pTTTT
Ie := ideal generated by all 5× 5 minors.
(Allman-Rhodes, Eriksson) Ie are phylogenetic invariants for thegeneral Markov model.
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 17 / 55
Phylogenetic invariants
We are just interested in phylogenetic invariants, i.e, generators thatvanish in some tree topologies but not all.
Ingredients:1 n given species2 evolutionary model M is fixed3 want tree topology relating them
We call
T = {trivalent tree topologies on n labelled leaves}
Assuming that our data p comes from a distribution on some treeT ∈ T under the model M, find the correct tree T0.
In algebraic geometry this is: assume p ∈ ∪T∈T VM(T ) and find T0
such that p ∈ VM(T0).
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 18 / 55
Phylogenetic invariants
We are just interested in phylogenetic invariants, i.e, generators thatvanish in some tree topologies but not all.
Ingredients:1 n given species2 evolutionary model M is fixed3 want tree topology relating them
We call
T = {trivalent tree topologies on n labelled leaves}
Assuming that our data p comes from a distribution on some treeT ∈ T under the model M, find the correct tree T0.
In algebraic geometry this is: assume p ∈ ∪T∈T VM(T ) and find T0
such that p ∈ VM(T0).
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 18 / 55
Phylogenetic invariants
We are just interested in phylogenetic invariants, i.e, generators thatvanish in some tree topologies but not all.
Ingredients:1 n given species2 evolutionary model M is fixed3 want tree topology relating them
We call
T = {trivalent tree topologies on n labelled leaves}
Assuming that our data p comes from a distribution on some treeT ∈ T under the model M, find the correct tree T0.
In algebraic geometry this is: assume p ∈ ∪T∈T VM(T ) and find T0
such that p ∈ VM(T0).
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 18 / 55
Phylogenetic invariants
We are just interested in phylogenetic invariants, i.e, generators thatvanish in some tree topologies but not all.
Ingredients:1 n given species2 evolutionary model M is fixed3 want tree topology relating them
We call
T = {trivalent tree topologies on n labelled leaves}
Assuming that our data p comes from a distribution on some treeT ∈ T under the model M, find the correct tree T0.
In algebraic geometry this is: assume p ∈ ∪T∈T VM(T ) and find T0
such that p ∈ VM(T0).
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 18 / 55
Phylogenetic invariants
Goal: for any particular tree T0 ∈ T , how is the variety VM(T0)defined inside ∪T∈T VM(T )? (at least in a biologically meaningfulopen subset).
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 19 / 55
Main Theorem
Theorem
Let M = Jukes-Cantor, Kimura (2 or 3 parameters), Strand Symmetric,General Markov model (or any equivariant model). For topologyreconstruction purposes, it is enough to consider invariants from edges.
More precisely, for each tree topology T ∈ T there exists a dense opensubset UT ⊆ V (T ) such that if a point p belongs to
⋃T∈T
UT , then
p ∈ V (T0) p ∈ Z (∑
e∈E(T0) Ie) .
Theorem (Buneman’s Splits equivalence Theorem)
A tree can be recovered by knowing all bipartitions induced from its edges.
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 20 / 55
Key: Representation theory to include all models
Draisma-Kuttler framework: representation theory
Example: Jukes-Cantor model. Substitution matrices:A C G T
S =
A
C
G
T
a b b bb a b bb b a bb b b a
Any permutation of rows and columns leaves the matrix invariant.Consider G = S4, permutations of A,C,G,T.Consider the representation of G , ρ : G −→ GL(W ),W =< A, C, G, T >C∼= C4, corresponding to the action of G on4× 4-matrices. For instance,
(A, C) 7→(
0 1 0 01 0 0 00 0 1 00 0 0 1
)M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 21 / 55
New description of the parametrization
We associate the vector space W = 〈A, C, G, T〉C to each node in thetree and then each substitution matrix belongs to HomG (W ,W ).
Therefore the substitution matrices lie inparG (T ) :=
∏e∈E(T ) HomG (W ,W ) and the map ϕT parameterizing
VM(T ) can be seen as
ϕT : parG (T ) −→ (n⊗W )G
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 22 / 55
Other models: subgroups of S4
G = 〈(A, C)(G, T), (A, G)(C, T)〉 ∼= Z2 × Z2 for Kimura 3-parametermodel.
G = 〈(AT)(CG)〉 for strand symmetric model.
G = id for general Markov model.
and in all cases W = 〈A, C, G, T〉C and ρ : G −→ GL(W ) is thepermutation representation.
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 23 / 55
Jukes-Cantor
If G = S4 (Jukes-Cantor), then G has 5 irreducible representationsM0, . . . ,M4.
(Maschke): Any G -representation E decomposes into isotypiccomponents, E ∼= ⊕4
i=0Mi ⊗ Cmi , and we associate to it a 5-tuple ofmultiplicities m(E ) = (m0, . . . ,m4).
Our particular representation ρ : G −→ GL(W ) for Jukes-Cantormodel decomposes as W ∼= M0 ⊕M3 and it hasm(W ) = (1, 0, 0, 1, 0).
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 24 / 55
Jukes-Cantor
If G = S4 (Jukes-Cantor), then G has 5 irreducible representationsM0, . . . ,M4.
(Maschke): Any G -representation E decomposes into isotypiccomponents, E ∼= ⊕4
i=0Mi ⊗ Cmi , and we associate to it a 5-tuple ofmultiplicities m(E ) = (m0, . . . ,m4).
Our particular representation ρ : G −→ GL(W ) for Jukes-Cantormodel decomposes as W ∼= M0 ⊕M3 and it hasm(W ) = (1, 0, 0, 1, 0).
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 24 / 55
Thin flattening
Let L1 | L2 be a bipartition of the set of leaves {1, . . . , n} , anddenote l1 = |L1|, l2 = |L2|
Let p ∈ (n⊗W )G . The matrix of the thin flattening of p along L1 | L2,
TflatL1|L2p = (P0, . . . ,P4), is obtained from p via the isomorphism
(l1⊗W ⊗
l2⊗W )G ∼= HomG (
l1⊗W ,
l2⊗W ) ∼=
4⊕i=0HomC(Cm(
l1⊗W )i ,Cm(
l2⊗W )i )
(last isomorphism is due to Schur’s Lemma).
Write rkTflatL1,L2p := (rk P0, . . . , rk P4).
Rank conditions:
rkTflatL1,L2p ≤ m(W ) = (1, 0, 0, 1, 0)
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 25 / 55
Example (Jukes-Cantor)
n = 4, L1 = {1, 2}, L2 = {3, 4}.
HomG (2⊗W ,
2⊗W ) ∼=
4⊕i=0
HomC(Cm(2⊗W )i ,Cm(
2⊗W )i )
W ⊗W ∼= M20 ⊕M2 ⊕M3
3 ⊕M4, hence m(2⊗W ) = (2, 0, 1, 3, 1) and
TflatL1,L2p can be written as a block matrix in a certain basis:
TflatL1,L2p =
∗ ∗∗ ∗ 0 0 0
0 ∗ 0 0
0 0∗ ∗ ∗∗ ∗ ∗∗ ∗ ∗
0
0 0 0 ∗
rkTflatL1,L2p ≤ m(W ) = (1, 0, 0, 1, 0).
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 26 / 55
Invariants for mixtures on the same tree
If p is a k-mixture distribution on a tree topology T ,p =
∑ki=1 λiϕ
MT (θi ), and A|B is a bipartition induced by and edge of
T , thenrk(flatA|B(p)) ≤ 4k
(for the general Markov model, analogous for other equivariantmodels)
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 27 / 55
Some results on simulations
Phylogenetic invariants have been tested on simulated data with notso bad results
Have been also tested in quartet-based methods (Rusinko-Hipp,Sumner et al.)
But... there is no canonical way of using them.
Model invariants have been used for model selection with successfulresults (+Kedzierska, Guigo)
Invariants have been also useful for proving identifiability of certainmodels (Allman-Rhodes et al.)
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 28 / 55
Some results on simulations
Phylogenetic invariants have been tested on simulated data with notso bad results
Have been also tested in quartet-based methods (Rusinko-Hipp,Sumner et al.)
But... there is no canonical way of using them.
Model invariants have been used for model selection with successfulresults (+Kedzierska, Guigo)
Invariants have been also useful for proving identifiability of certainmodels (Allman-Rhodes et al.)
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 28 / 55
Some results on simulations
Phylogenetic invariants have been tested on simulated data with notso bad results
Have been also tested in quartet-based methods (Rusinko-Hipp,Sumner et al.)
But... there is no canonical way of using them.
Model invariants have been used for model selection with successfulresults (+Kedzierska, Guigo)
Invariants have been also useful for proving identifiability of certainmodels (Allman-Rhodes et al.)
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 28 / 55
Some results on simulations
Phylogenetic invariants have been tested on simulated data with notso bad results
Have been also tested in quartet-based methods (Rusinko-Hipp,Sumner et al.)
But... there is no canonical way of using them.
Model invariants have been used for model selection with successfulresults (+Kedzierska, Guigo)
Invariants have been also useful for proving identifiability of certainmodels (Allman-Rhodes et al.)
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 28 / 55
Some results on simulations
Phylogenetic invariants have been tested on simulated data with notso bad results
Have been also tested in quartet-based methods (Rusinko-Hipp,Sumner et al.)
But... there is no canonical way of using them.
Model invariants have been used for model selection with successfulresults (+Kedzierska, Guigo)
Invariants have been also useful for proving identifiability of certainmodels (Allman-Rhodes et al.)
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 28 / 55
Are we ready for application to real data???
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 29 / 55
Outline
1 Evolutionary Markov models
2 Phylogenetic invariants
3 Eriksson’s method revisited
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 30 / 55
Singular value decomposition (SVD)
Distance (in Frobenius or 2-norm) of an m× n matrix M to R4 = { m× nmatrices of rank ≤ 4} can be computed easily:
SVD: M = UDV t with U ∈Mm×m, D ∈Mm×n, V ∈Mn×n, U, Vorthogonal, D “almost” diagonal
D =
σ1
σ2
. . .
σm
0 · · · 0
σ1 ≥ σ2 ≥ · · · ≥ σm ≥ 0
Then minX∈R4 ||M − X || = ||M − X4|| where
X4 = U
σ1
σ2
σ3
σ4
0
. . .
0 · · · 0
V t
and d2(M,R4) = σ5, dF (M,R4) =√∑
i≥5 σ2i .
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 31 / 55
Singular value decomposition (SVD)
Distance (in Frobenius or 2-norm) of an m× n matrix M to R4 = { m× nmatrices of rank ≤ 4} can be computed easily:
SVD: M = UDV t with U ∈Mm×m, D ∈Mm×n, V ∈Mn×n, U, Vorthogonal, D “almost” diagonal
D =
σ1
σ2
. . .
σm
0 · · · 0
σ1 ≥ σ2 ≥ · · · ≥ σm ≥ 0
Then minX∈R4 ||M − X || = ||M − X4|| where
X4 = U
σ1
σ2
σ3
σ4
0
. . .
0 · · · 0
V t
and d2(M,R4) = σ5, dF (M,R4) =√∑
i≥5 σ2i .
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 31 / 55
Eriksson’s SVD method
Use SVD to evaluate how close a bipartition A|B is from being anedge split: sc(A|B) = dF (flattA|B(p),R4).Find the best cherry in the tree by looking at all bipartitions of 2 taxaagainst others.Trait this cherry as an indivisible pair of taxa, and proceed recursively.Example:
12|34513|12514|235∗15|23423|14524|13525|13434|12535|12445|123
⇒142|35143|25145|23∗
⇒ (1, 4), 5, (2, 3)
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 32 / 55
Problems of Eriksson’s method:
1 Low performance in the presence of long branches
2 Is it fair to compare bipartitions of different size with this score?
3 What happens when we have small sample size?
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 33 / 55
Long branch attraction
correct tree usually inferred tree
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 34 / 55
Flattening matrix for the wrong topology
13|24 AA AC · · · CA CC CG · · · GC GG GT · · · TG TT +
AA 0.128 0.001 · · · 0.001 0 0.001 · · · 0.001 0.021 0.001 · · · 0 0.011 0.190AC 0.043 0.001 · · · 0 0.002 0 · · · 0.002 0.027 0.001 · · · 0 0.013 0.100AG 0.008 0 · · · 0 0 0 · · · 0 0.009 0 · · · 0 0.002 0.023AT 0.011 0 · · · 0 0 0 · · · 0.001 0.001 0.002 · · · 0 0.039 0.058CA 0.023 0 · · · 0.001 0.025 0.004 · · · 0.003 0.028 0.002 · · · 0.001 0.019 0.114CC 0.005 0 · · · 0.003 0.140 0.011 · · · 0.001 0.017 0 · · · 0.001 0.029 0.213CG 0 0 · · · 0.001 0.026 0.008 · · · 0.002 0.013 0 · · · 0 0.006 0.059CT 0 0 · · · 0 0.012 0.001 · · · 0 0.001 0.004 · · · 0 0.055 0.084GA 0.001 0 · · · 0 0 0 · · · 0 0.009 0 · · · 0 0.001 0.011GC 0.001 0 · · · 0 0 0 · · · 0.001 0.005 0 · · · 0 0.003 0.011GG 0.001 0 · · · 0 0 0 · · · 0 0.003 0 · · · 0 0 0.004GT 0 0 · · · 0 0 0 · · · 0 0.001 0.001 · · · 0 0.004 0.006TA 0.010 0 · · · 0 0 0 · · · 0.004 0.006 0 · · · 0 0.010 0.035TC 0.001 0 · · · 0 0.005 0 · · · 0.001 0.010 0 · · · 0 0.020 0.039TG 0.001 0 · · · 0.001 0 0 · · · 0 0.003 0.0020 · · · 0 0.001 0.009TT 0 0 · · · 0 0 0 · · · 0 0.001 0.003 · · · 0 0.036 0.043+ 0.233 0.002 · · · 0.007 0.210 0.025 · · · 0.016 0.155 0.016 · · · 0.002 0.249 1
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 35 / 55
Eriksson’s revisited: Erik+2
Given a bipartition A|B, consider the transition matrices MA→B andMB→A instead of the flattening matrix:
M12→34(p) =
pAAAApAA++
pAAACpAA++
. . . pAATTpAA++
pACAApAC++
pACACpAC++
. . . pACTTpAC++
pAGAApAG++
pAGACpAG++
. . . pAGTTpAG++
. . . . . . . . .
M34→12(p) =
(columnsums =1
)
These matrices have the same rank as the flattening. We redefine thescore as:
Erik+2: sc(A|B) = (dF (MA→B ,R4) + dF (MB→A,R4))/2.
The lower the score is, the more it is likely that the bipartition comes froman edge of T .
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 36 / 55
Assessing performance
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 37 / 55
Assessing performance
Huelsenbeck ’95: assesses the performance of different phylogeneticreconstruction methods on the following tree space.
For each (a, b), simulate 100 alignments under topology 12|34. For eachmethod, represent the percentage of success in a gray scale: white=0% ,black=100% success.
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 37 / 55
Performance of Eriksson and Erik+2
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 38 / 55
Performance against other methods
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 39 / 55
old invariants
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 40 / 55
Erik+2 needs more data but... works under the most general model (underi.i.d. assumptions).Models difference:
Continuous-time models:
Si = exp(Qi ti ),
Parameters: Qi rate matrix, ti branch length. They assume localhomogeneity:S ′i (t) = QiSi (t). Usually, also global-homogeneity isassumed: same rate-matrix Q throughout the tree, different ti foreach edge i .
(Algebraic) Markov models: the entries of the matrices Si are theparameters of the model. Therefore they allow for non-homogeneousdata, both local and global (different mutation rates at differentlineages of the tree and on a single edge).The results we’ve seen so far were on globally homogeneous data.Now...
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 41 / 55
Erik+2 needs more data but... works under the most general model (underi.i.d. assumptions).Models difference:
Continuous-time models:
Si = exp(Qi ti ),
Parameters: Qi rate matrix, ti branch length. They assume localhomogeneity:S ′i (t) = QiSi (t). Usually, also global-homogeneity isassumed: same rate-matrix Q throughout the tree, different ti foreach edge i .
(Algebraic) Markov models: the entries of the matrices Si are theparameters of the model. Therefore they allow for non-homogeneousdata, both local and global (different mutation rates at differentlineages of the tree and on a single edge).The results we’ve seen so far were on globally homogeneous data.Now...
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 41 / 55
Erik+2 Length=1000
pa
ram
ete
r b
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
33%
33%
95%
95%
95%
95%
95%
95%
95%
95%
mean= 0.80343 ; sd= 0.16955
Erik+2 Length=10000
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
95%
95%
95%
95%
95
%
95%
mean= 0.97102 ; sd= 0.04373
NJ Length=1000
pa
ram
ete
r b
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
33%
9
5%
9
5%
95%
95%
95%
95%
95%
mean= 0.79745 ; sd= 0.18404
NJ Length=10000
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
95%
9
5%
95%
95%
95%
95%
95%
95
% 9
5%
9
5%
mean= 0.94313 ; sd= 0.08771
ML Length=1000
parameter a
pa
ram
ete
r b
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
95%
95%
95%
95%
mean= 0.73608 ; sd= 0.17409
ML Length=10000
parameter a
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
95%
95%
95%
95%
mean= 0.754 ; sd= 0.16967
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 42 / 55
Erik+2 is more robust: rates across sites
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 43 / 55
Erik+2 on 2- mixtures
Kolaczkowski, Thornton. Nature 2004
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 44 / 55
How about larger trees?
Scores are not normalized, we cannot compare bipartitions of differentsize
Eriksson’s method involved comparison of different bipartitions
Also, some penalization should be given: n = 6, requiring a 42 × 44
matrix to have rank ≤ 4 or requiring a 43 × 43 matrix having rank≤ 4 is not the same.
Indeed, codimension of R4,42×44 : (42 − 4)(44 − 4) = 3024,codimension of R4,43×43 : 3600
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 45 / 55
How about larger trees?
Scores are not normalized, we cannot compare bipartitions of differentsize
Eriksson’s method involved comparison of different bipartitions
Also, some penalization should be given: n = 6, requiring a 42 × 44
matrix to have rank ≤ 4 or requiring a 43 × 43 matrix having rank≤ 4 is not the same.
Indeed, codimension of R4,42×44 : (42 − 4)(44 − 4) = 3024,codimension of R4,43×43 : 3600
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 45 / 55
Normalizing the score
But in the meanwhile...
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 46 / 55
Normalizing the score
But in the meanwhile...
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 46 / 55
Weighted quartet-based methods
Weight each quartet tree T1 = 12|34, T2 = 13|24, T3 = 14|23 accordingto the reconstruction method (weight ≡ reliability):
Erik+2: w(Ti ) = sc(Ti )−1/(sc(T1)−1 + sc(T2)−1 + sc(T3)−1)
ML: w(Ti ) = logL(Ti )−1/(logL(T1)−1 + logL(T2)−1 + logL(T3)−1)
Barycentric coordinates: (w(T1),w(T2),w(T3))
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 47 / 55
Barycentric plots
(w(T1),w(T2),w(T3))
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 48 / 55
Barycentric plots
(w(T1),w(T2),w(T3))
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 49 / 55
Quartet-based methods (Weight optimization)
Where do we put leaf 5?
Gascuel-Ranwez 2001M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 50 / 55
Topology evaluated; b: internal branch length, leaf edges proportional to it
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 51 / 55
Method Quartets b = 0.005 b = 0.015 b = 0.05 b = 0.1 Average
QP Erik+2 0.05 (100) 0.07 (100) 1.69 (100) 4.57 (100) 1.6ML 9 (100) 9.08 (93) 9.65 (93) - (0) -NJ 9.7 (100) 9.65 (100) 9.91 (100) 9.91 (100) 9.79
WIL Erik+2 0.02 (100) 0.07 (100) 0.33 (100) 1.52 (100) 0.48ML 9.97 (100) 9.28 (93) 9.27 (93) - (0) -NJ 7.66 (100) 7.41 (100) 8.25 (100) 9.61 (100) 8.23
WO Erik+2 0.02 (100) 0.06 (100) 0.2 (100) 1.66 (100) 0.48ML 11.03 (100) 11.18 (93) 11.83 (93) - (0) -NJ 9.46 (100) 9.22 (100) 9.2 (100) 9.73 (100) 9.4
Global NJ 0 0 1.54 10.19
Table : Average Robinson-Foulds distance to the correct tree, alignment length5000, GMM data.
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 52 / 55
Method Quartets b = 0.005 b = 0.015 b = 0.05 b = 0.1 Average
QP Erik+2 0 (100) 0 (100) 0.44 (100) 2.92 (100) 0.84ML 9 (100) 9.11 (96) 9.6 (91) 9 (1) 9.18NJ 9.69 (100) 9.97 (100) 9.92 (100) 9.95 (100) 9.88
WIL Erik+2 0 (100) 0.02 (100) 0.09 (100) 0.44 (100) 0.14ML 9.33 (100) 9 (96) 8.98 (91) 9 (1) 9.08NJ 7.46 (100) 7.44 (100) 8.03 (100) 9.48 (100) 8.1
WO Erik+2 0 (100) 0 (100) 0.08 (100) 0.51 (100) 0.15ML 10.97 (100) 11.01 (96) 11.69 (91) 13 (1) 11.67NJ 9.41 (100) 9.18 (100) 9.33 (100) 9.62 (100) 9.38
Global NJ 0 0 1.16 9.53
Table : Average Robinson-Foulds distance to the correct tree, alignment length10000, GMM data
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 53 / 55
Open problems
Algorithm for larger trees
Incorporate Allman-Rhodes-Taylor semialgebraic conditions
Look for closest rank ≤ 4 nonnegative matrix
Statistical analysis of the method
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 54 / 55
Open problems
Algorithm for larger trees
Incorporate Allman-Rhodes-Taylor semialgebraic conditions
Look for closest rank ≤ 4 nonnegative matrix
Statistical analysis of the method
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 54 / 55
Thanks!
M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 55 / 55