marta casanellas rius (joint work with jesus fern andez-s...

Post on 25-Jul-2021

2 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Phylogenetic invariants versus classical phylogenetics

Marta Casanellas Rius(joint work with Jesus Fernandez-Sanchez)

Departament de Matematica Aplicada IUniversitat Politecnica de Catalunya

Algebraic Statistics 2014IIT, Chicago

May 2014

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 1 / 55

Outline

1 Evolutionary Markov models

2 Phylogenetic invariants

3 Eriksson’s method revisited

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 2 / 55

Outline

1 Evolutionary Markov models

2 Phylogenetic invariants

3 Eriksson’s method revisited

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 2 / 55

Outline

1 Evolutionary Markov models

2 Phylogenetic invariants

3 Eriksson’s method revisited

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 2 / 55

Outline

1 Evolutionary Markov models

2 Phylogenetic invariants

3 Eriksson’s method revisited

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 3 / 55

Phylogenetic trees

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 4 / 55

Phylogenetic reconstruction

Given n current species, e.g. n = 3, HUMAN, GORILLA and CHIMP, wewant to reconstruct their ancestral relationships (phylogeny):

Input: DNA sequences (part of their genome) and an alignment

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 5 / 55

Phylogenetic reconstruction

Given n current species, e.g. n = 3, HUMAN, GORILLA and CHIMP, wewant to reconstruct their ancestral relationships (phylogeny):

Input: DNA sequences (part of their genome) and an alignment

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 5 / 55

Phylogenetic reconstruction

Given n current species, e.g. n = 3, HUMAN, GORILLA and CHIMP, wewant to reconstruct their ancestral relationships (phylogeny):

Input: DNA sequences (part of their genome) and an alignment

AACTTCGAGGCTTACCGCTG...

AAGGTCGATGCTCACCGATG...

AACGTCTATGCTCACCGATG...

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 5 / 55

Phylogenetic reconstruction goals

Goals:

Reconstruct the correct tree topology (topology of the graph withspecies labels at the leaves). E.g. 3 different tree topologies ofunrooted trees on four leaves :

Estimate branch lengths: the amount of substitutions per site thathave occurred between the two species at the ends of the branch

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 6 / 55

Modelling evolution

Assume that all sites of the alignment evolve independently andaccording to the same process (i.i.d.).Model the evolution of z single site on a tree T .At each node of the tree T one puts a random variable taking valuesin {A, C, G, T}

Variables at the leaves are observed and variables at the interior nodesare hidden.At each branch one writes a substitution matrix with the conditionalprobabilities of a nucleotide at the parent node being substituted byanother at its child.

A C G T

Si =

ACGT

P(A|A, i) P(C |A, i) P(G |A, i) P(T |A, i)P(A|C , i) P(C |C , i) P(G |C , i) P(T |C , i)P(A|G , i) P(C |G , i) P(G |G , i) P(T |G , i)P(A|T , i) P(C |T , i) P(G |T , i) P(T |T , i)

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 7 / 55

Modelling evolution

Assume that all sites of the alignment evolve independently andaccording to the same process (i.i.d.).Model the evolution of z single site on a tree T .At each node of the tree T one puts a random variable taking valuesin {A, C, G, T}

Variables at the leaves are observed and variables at the interior nodesare hidden.At each branch one writes a substitution matrix with the conditionalprobabilities of a nucleotide at the parent node being substituted byanother at its child.

A C G T

Si =

ACGT

P(A|A, i) P(C |A, i) P(G |A, i) P(T |A, i)P(A|C , i) P(C |C , i) P(G |C , i) P(T |C , i)P(A|G , i) P(C |G , i) P(G |G , i) P(T |G , i)P(A|T , i) P(C |T , i) P(G |T , i) P(T |T , i)

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 7 / 55

Modelling evolution

Assume that all sites of the alignment evolve independently andaccording to the same process (i.i.d.).Model the evolution of z single site on a tree T .At each node of the tree T one puts a random variable taking valuesin {A, C, G, T}

Variables at the leaves are observed and variables at the interior nodesare hidden.At each branch one writes a substitution matrix with the conditionalprobabilities of a nucleotide at the parent node being substituted byanother at its child.

A C G T

Si =

ACGT

P(A|A, i) P(C |A, i) P(G |A, i) P(T |A, i)P(A|C , i) P(C |C , i) P(G |C , i) P(T |C , i)P(A|G , i) P(C |G , i) P(G |G , i) P(T |G , i)P(A|T , i) P(C |T , i) P(G |T , i) P(T |T , i)

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 7 / 55

Evolutionary Markov models

The entries of Si and the root distribution are unknown parameters.

According to the shape of (or group of symmetries in S4 acting on) Si

one has different evolutionary models: general Markov model, strandsymmetric model, Kimura 3-parameter, Kimura 2-parameters,Jukes-Cantor model.

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 8 / 55

Evolutionary Markov models

Jukes-Cantor model

Root node has uniform distribution: pA = ... = pT = 0.25

A C G T

Si =

ACGT

ai bi bi bi

bi ai bi bi

bi bi ai bi

bi bi bi ai

, ai + 3bi = 1

only 2 different parameters (one for mutation and one for non-mutation),only 1 free parameter.

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 9 / 55

Evolutionary Markov models

Kimura 3-parameter model

Root node has uniform distribution: pA = ... = pT = 0.25

A C G T

Si =ACGT

(ai bi ci dibi ai di cici di ai bidi ci bi ai

), ai + bi + ci + di = 1

Kimura 2-parameter: bi = di

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 9 / 55

Evolutionary Markov models

Strand Symmetric model

Root node distribution satisfies: pA = pT, pC = pG

A C G T

Si =

ACGT

ai bi ci di

ei fi gi hi

hi gi fi eidi ci bi ai

Strand symmetry in a DNA molecule:

A–TC–G

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 9 / 55

Evolutionary Markov models

General Markov model

Root node distribution: arbitrary

A C G T

Si =

ACGT

ai bi ci di

ei fi gi hi

ji ki li mi

ni oi pi qi

All these models are known as equivariant models: they are invariantby permutation of rows and columns of a certain subgroup of G = S4

(permutations of A,C,G,T).

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 9 / 55

Hidden Markov process

px1x2x3 = Prob(X1 = x1,X2 = x2,X3 = x3) = joint probability of theobserved variables X1,X2,X3. Then,

px1x2x3 =∑

y4,yr∈{A,C,G,T}

pyr S1(yr , x1)S4(yr , y4)S2(y4, x2)S3(y4, x3)

px1x2x3 is a homogeneous polynomial on the parameters (entries of thematrices Si and root distribution).

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 10 / 55

An evolutionary model M (with d free parameters) on a treetopology T on n leaves defines a map

ϕMT :∏

e∈E(T )4d −→ 44n−1

ϕMT :∏

e∈E(T ) Cd+1 −→ C4n

Phylogenetic variety VM(T ) = ImϕT , closure in Zariski topology.

Given an alignment, we can estimate the joint probability px1x2...xn bythe relative frequency of column x1x2 . . . xn in the alignment.

In the theoretical model, this would be a point p = (fx1x2...xn)x1x2...xn

on the variety VM(T ).

In the real world, it is only close to VM(T ) for some tree topology T .

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 11 / 55

An evolutionary model M (with d free parameters) on a treetopology T on n leaves defines a map

ϕMT :∏

e∈E(T )4d −→ 44n−1

ϕMT :∏

e∈E(T ) Cd+1 −→ C4n

Phylogenetic variety VM(T ) = ImϕT , closure in Zariski topology.

Given an alignment, we can estimate the joint probability px1x2...xn bythe relative frequency of column x1x2 . . . xn in the alignment.

In the theoretical model, this would be a point p = (fx1x2...xn)x1x2...xn

on the variety VM(T ).

In the real world, it is only close to VM(T ) for some tree topology T .

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 11 / 55

An evolutionary model M (with d free parameters) on a treetopology T on n leaves defines a map

ϕMT :∏

e∈E(T )4d −→ 44n−1

ϕMT :∏

e∈E(T ) Cd+1 −→ C4n

Phylogenetic variety VM(T ) = ImϕT , closure in Zariski topology.

Given an alignment, we can estimate the joint probability px1x2...xn bythe relative frequency of column x1x2 . . . xn in the alignment.

In the theoretical model, this would be a point p = (fx1x2...xn)x1x2...xn

on the variety VM(T ).

In the real world, it is only close to VM(T ) for some tree topology T .

Image(ϕMT )

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 11 / 55

Phylogenetic mixtures

Change the i.i.d hypothesis by “all positions evolve independently andfollowing the same evolutionary model”.

Same model M, same tree topology T = T1 but different substitutionparameters. Then,

p = (pAA...A, . . . , pTT...T) = λϕMT1(S1, . . . ,S4) + (1− λ)ϕMT1

(S ′1, . . . ,S′4)

That is, p ∈ SecVMT1

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 12 / 55

Phylogenetic mixtures

Change the i.i.d hypothesis by “all positions evolve independently andfollowing the same evolutionary model”.

Same model M, different tree topologies and different substitutionparameters. Then,

p = (pAA...A, . . . , pTT...T) = λϕMT1(S1, . . . ,S4) + (1− λ)ϕMT2

(S ′1, . . . ,S′4)

That is, p ∈ VMT1∗ VMT2

(join of two varieties).

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 13 / 55

Outline

1 Evolutionary Markov models

2 Phylogenetic invariants

3 Eriksson’s method revisited

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 14 / 55

Which generators of I (VM(T )) are useful for recovering the treetopology?

Some generators depend on the model chosen (not on the treetopology). E.g. for Jukes-Cantor, n = 3:

∑px1x2x3 = 1 and

pAAA − pCCC, pAAA − pGGG, pAAA − pTTT,

pAAC − pAAG, pAAC − pAAT, pCCA − pCCG, . . . ,

pACA − pAGA, pACA − pATA, pCAC − pCGC, . . .

pCAA − pGAA, pCAA − pTAA, pACC − pGCC, . . .For this model, the polynomials that detect the topology of the threespecies have degree 3.

Definition (Cavender–Felsenstein ’87)

The elements of I (VM(T )) are called invariants. They are calledphylogenetic invariants if they distinguish between different tree topologies.

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 15 / 55

Which generators of I (VM(T )) are useful for recovering the treetopology?

Some generators depend on the model chosen (not on the treetopology). E.g. for Jukes-Cantor, n = 3:

∑px1x2x3 = 1 and

pAAA − pCCC, pAAA − pGGG, pAAA − pTTT,

pAAC − pAAG, pAAC − pAAT, pCCA − pCCG, . . . ,

pACA − pAGA, pACA − pATA, pCAC − pCGC, . . .

pCAA − pGAA, pCAA − pTAA, pACC − pGCC, . . .For this model, the polynomials that detect the topology of the threespecies have degree 3.

Definition (Cavender–Felsenstein ’87)

The elements of I (VM(T )) are called invariants. They are calledphylogenetic invariants if they distinguish between different tree topologies.

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 15 / 55

Which generators of I (VM(T )) are useful for recovering the treetopology?

Some generators depend on the model chosen (not on the treetopology). E.g. for Jukes-Cantor, n = 3:

∑px1x2x3 = 1 and

pAAA − pCCC, pAAA − pGGG, pAAA − pTTT,

pAAC − pAAG, pAAC − pAAT, pCCA − pCCG, . . . ,

pACA − pAGA, pACA − pATA, pCAC − pCGC, . . .

pCAA − pGAA, pCAA − pTAA, pACC − pGCC, . . .For this model, the polynomials that detect the topology of the threespecies have degree 3.

Definition (Cavender–Felsenstein ’87)

The elements of I (VM(T )) are called invariants. They are calledphylogenetic invariants if they distinguish between different tree topologies.

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 15 / 55

Description of the ideal

Computational algebra software: fails for ≥ 5 leaves.

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 16 / 55

Description of the ideal

Computational algebra software: fails for ≥ 5 leaves.Model M fixed.

Theorem (Allman-Rhodes, Draisma-Kuttler, Sturmfels, Sullivant, C)

There is an algorithm for obtaining the generators of the ideal of ann-leaved tree T from those of an unrooted tree of 3 leaves and the edgesof the tree.

I (V (T )) =∑

v∈Int(T )

Iv +∑

e∈E(T )

Ie

Iv

Ie

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 16 / 55

Invariants from edges

Example: M = general Markov model.

16× 16 matrix obtained by flattening p = (pAAAA, pAAAC, . . . , pTTTT) in4⊗ C4 ∼= (

2⊗ C4)⊗ (

2⊗ C4) :

states in leaves 3,4

flatep =states

leaves

1, 2

pAAAA pAAAC pAAAG . . . pAATTpACAA pACAC pACAG . . . pACTTpAGAA pAGAC pAGAG . . . pAGTT. . . . . . . . . . . . . . .

pTTAA pTTAC pTTAG . . . pTTTT

Ie := ideal generated by all 5× 5 minors.

(Allman-Rhodes, Eriksson) Ie are phylogenetic invariants for thegeneral Markov model.

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 17 / 55

Invariants from edges

Example: M = general Markov model.

16× 16 matrix obtained by flattening p = (pAAAA, pAAAC, . . . , pTTTT) in4⊗ C4 ∼= (

2⊗ C4)⊗ (

2⊗ C4) :

states in leaves 3,4

flatep =states

leaves

1, 2

pAAAA pAAAC pAAAG . . . pAATTpACAA pACAC pACAG . . . pACTTpAGAA pAGAC pAGAG . . . pAGTT. . . . . . . . . . . . . . .

pTTAA pTTAC pTTAG . . . pTTTT

Ie := ideal generated by all 5× 5 minors.

(Allman-Rhodes, Eriksson) Ie are phylogenetic invariants for thegeneral Markov model.

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 17 / 55

Phylogenetic invariants

We are just interested in phylogenetic invariants, i.e, generators thatvanish in some tree topologies but not all.

Ingredients:1 n given species2 evolutionary model M is fixed3 want tree topology relating them

We call

T = {trivalent tree topologies on n labelled leaves}

Assuming that our data p comes from a distribution on some treeT ∈ T under the model M, find the correct tree T0.

In algebraic geometry this is: assume p ∈ ∪T∈T VM(T ) and find T0

such that p ∈ VM(T0).

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 18 / 55

Phylogenetic invariants

We are just interested in phylogenetic invariants, i.e, generators thatvanish in some tree topologies but not all.

Ingredients:1 n given species2 evolutionary model M is fixed3 want tree topology relating them

We call

T = {trivalent tree topologies on n labelled leaves}

Assuming that our data p comes from a distribution on some treeT ∈ T under the model M, find the correct tree T0.

In algebraic geometry this is: assume p ∈ ∪T∈T VM(T ) and find T0

such that p ∈ VM(T0).

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 18 / 55

Phylogenetic invariants

We are just interested in phylogenetic invariants, i.e, generators thatvanish in some tree topologies but not all.

Ingredients:1 n given species2 evolutionary model M is fixed3 want tree topology relating them

We call

T = {trivalent tree topologies on n labelled leaves}

Assuming that our data p comes from a distribution on some treeT ∈ T under the model M, find the correct tree T0.

In algebraic geometry this is: assume p ∈ ∪T∈T VM(T ) and find T0

such that p ∈ VM(T0).

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 18 / 55

Phylogenetic invariants

We are just interested in phylogenetic invariants, i.e, generators thatvanish in some tree topologies but not all.

Ingredients:1 n given species2 evolutionary model M is fixed3 want tree topology relating them

We call

T = {trivalent tree topologies on n labelled leaves}

Assuming that our data p comes from a distribution on some treeT ∈ T under the model M, find the correct tree T0.

In algebraic geometry this is: assume p ∈ ∪T∈T VM(T ) and find T0

such that p ∈ VM(T0).

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 18 / 55

Phylogenetic invariants

Goal: for any particular tree T0 ∈ T , how is the variety VM(T0)defined inside ∪T∈T VM(T )? (at least in a biologically meaningfulopen subset).

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 19 / 55

Main Theorem

Theorem

Let M = Jukes-Cantor, Kimura (2 or 3 parameters), Strand Symmetric,General Markov model (or any equivariant model). For topologyreconstruction purposes, it is enough to consider invariants from edges.

More precisely, for each tree topology T ∈ T there exists a dense opensubset UT ⊆ V (T ) such that if a point p belongs to

⋃T∈T

UT , then

p ∈ V (T0) p ∈ Z (∑

e∈E(T0) Ie) .

Theorem (Buneman’s Splits equivalence Theorem)

A tree can be recovered by knowing all bipartitions induced from its edges.

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 20 / 55

Key: Representation theory to include all models

Draisma-Kuttler framework: representation theory

Example: Jukes-Cantor model. Substitution matrices:A C G T

S =

A

C

G

T

a b b bb a b bb b a bb b b a

Any permutation of rows and columns leaves the matrix invariant.Consider G = S4, permutations of A,C,G,T.Consider the representation of G , ρ : G −→ GL(W ),W =< A, C, G, T >C∼= C4, corresponding to the action of G on4× 4-matrices. For instance,

(A, C) 7→(

0 1 0 01 0 0 00 0 1 00 0 0 1

)M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 21 / 55

New description of the parametrization

We associate the vector space W = 〈A, C, G, T〉C to each node in thetree and then each substitution matrix belongs to HomG (W ,W ).

Therefore the substitution matrices lie inparG (T ) :=

∏e∈E(T ) HomG (W ,W ) and the map ϕT parameterizing

VM(T ) can be seen as

ϕT : parG (T ) −→ (n⊗W )G

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 22 / 55

Other models: subgroups of S4

G = 〈(A, C)(G, T), (A, G)(C, T)〉 ∼= Z2 × Z2 for Kimura 3-parametermodel.

G = 〈(AT)(CG)〉 for strand symmetric model.

G = id for general Markov model.

and in all cases W = 〈A, C, G, T〉C and ρ : G −→ GL(W ) is thepermutation representation.

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 23 / 55

Jukes-Cantor

If G = S4 (Jukes-Cantor), then G has 5 irreducible representationsM0, . . . ,M4.

(Maschke): Any G -representation E decomposes into isotypiccomponents, E ∼= ⊕4

i=0Mi ⊗ Cmi , and we associate to it a 5-tuple ofmultiplicities m(E ) = (m0, . . . ,m4).

Our particular representation ρ : G −→ GL(W ) for Jukes-Cantormodel decomposes as W ∼= M0 ⊕M3 and it hasm(W ) = (1, 0, 0, 1, 0).

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 24 / 55

Jukes-Cantor

If G = S4 (Jukes-Cantor), then G has 5 irreducible representationsM0, . . . ,M4.

(Maschke): Any G -representation E decomposes into isotypiccomponents, E ∼= ⊕4

i=0Mi ⊗ Cmi , and we associate to it a 5-tuple ofmultiplicities m(E ) = (m0, . . . ,m4).

Our particular representation ρ : G −→ GL(W ) for Jukes-Cantormodel decomposes as W ∼= M0 ⊕M3 and it hasm(W ) = (1, 0, 0, 1, 0).

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 24 / 55

Thin flattening

Let L1 | L2 be a bipartition of the set of leaves {1, . . . , n} , anddenote l1 = |L1|, l2 = |L2|

Let p ∈ (n⊗W )G . The matrix of the thin flattening of p along L1 | L2,

TflatL1|L2p = (P0, . . . ,P4), is obtained from p via the isomorphism

(l1⊗W ⊗

l2⊗W )G ∼= HomG (

l1⊗W ,

l2⊗W ) ∼=

4⊕i=0HomC(Cm(

l1⊗W )i ,Cm(

l2⊗W )i )

(last isomorphism is due to Schur’s Lemma).

Write rkTflatL1,L2p := (rk P0, . . . , rk P4).

Rank conditions:

rkTflatL1,L2p ≤ m(W ) = (1, 0, 0, 1, 0)

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 25 / 55

Example (Jukes-Cantor)

n = 4, L1 = {1, 2}, L2 = {3, 4}.

HomG (2⊗W ,

2⊗W ) ∼=

4⊕i=0

HomC(Cm(2⊗W )i ,Cm(

2⊗W )i )

W ⊗W ∼= M20 ⊕M2 ⊕M3

3 ⊕M4, hence m(2⊗W ) = (2, 0, 1, 3, 1) and

TflatL1,L2p can be written as a block matrix in a certain basis:

TflatL1,L2p =

∗ ∗∗ ∗ 0 0 0

0 ∗ 0 0

0 0∗ ∗ ∗∗ ∗ ∗∗ ∗ ∗

0

0 0 0 ∗

rkTflatL1,L2p ≤ m(W ) = (1, 0, 0, 1, 0).

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 26 / 55

Invariants for mixtures on the same tree

If p is a k-mixture distribution on a tree topology T ,p =

∑ki=1 λiϕ

MT (θi ), and A|B is a bipartition induced by and edge of

T , thenrk(flatA|B(p)) ≤ 4k

(for the general Markov model, analogous for other equivariantmodels)

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 27 / 55

Some results on simulations

Phylogenetic invariants have been tested on simulated data with notso bad results

Have been also tested in quartet-based methods (Rusinko-Hipp,Sumner et al.)

But... there is no canonical way of using them.

Model invariants have been used for model selection with successfulresults (+Kedzierska, Guigo)

Invariants have been also useful for proving identifiability of certainmodels (Allman-Rhodes et al.)

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 28 / 55

Some results on simulations

Phylogenetic invariants have been tested on simulated data with notso bad results

Have been also tested in quartet-based methods (Rusinko-Hipp,Sumner et al.)

But... there is no canonical way of using them.

Model invariants have been used for model selection with successfulresults (+Kedzierska, Guigo)

Invariants have been also useful for proving identifiability of certainmodels (Allman-Rhodes et al.)

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 28 / 55

Some results on simulations

Phylogenetic invariants have been tested on simulated data with notso bad results

Have been also tested in quartet-based methods (Rusinko-Hipp,Sumner et al.)

But... there is no canonical way of using them.

Model invariants have been used for model selection with successfulresults (+Kedzierska, Guigo)

Invariants have been also useful for proving identifiability of certainmodels (Allman-Rhodes et al.)

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 28 / 55

Some results on simulations

Phylogenetic invariants have been tested on simulated data with notso bad results

Have been also tested in quartet-based methods (Rusinko-Hipp,Sumner et al.)

But... there is no canonical way of using them.

Model invariants have been used for model selection with successfulresults (+Kedzierska, Guigo)

Invariants have been also useful for proving identifiability of certainmodels (Allman-Rhodes et al.)

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 28 / 55

Some results on simulations

Phylogenetic invariants have been tested on simulated data with notso bad results

Have been also tested in quartet-based methods (Rusinko-Hipp,Sumner et al.)

But... there is no canonical way of using them.

Model invariants have been used for model selection with successfulresults (+Kedzierska, Guigo)

Invariants have been also useful for proving identifiability of certainmodels (Allman-Rhodes et al.)

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 28 / 55

Are we ready for application to real data???

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 29 / 55

Outline

1 Evolutionary Markov models

2 Phylogenetic invariants

3 Eriksson’s method revisited

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 30 / 55

Singular value decomposition (SVD)

Distance (in Frobenius or 2-norm) of an m× n matrix M to R4 = { m× nmatrices of rank ≤ 4} can be computed easily:

SVD: M = UDV t with U ∈Mm×m, D ∈Mm×n, V ∈Mn×n, U, Vorthogonal, D “almost” diagonal

D =

σ1

σ2

. . .

σm

0 · · · 0

σ1 ≥ σ2 ≥ · · · ≥ σm ≥ 0

Then minX∈R4 ||M − X || = ||M − X4|| where

X4 = U

σ1

σ2

σ3

σ4

0

. . .

0 · · · 0

V t

and d2(M,R4) = σ5, dF (M,R4) =√∑

i≥5 σ2i .

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 31 / 55

Singular value decomposition (SVD)

Distance (in Frobenius or 2-norm) of an m× n matrix M to R4 = { m× nmatrices of rank ≤ 4} can be computed easily:

SVD: M = UDV t with U ∈Mm×m, D ∈Mm×n, V ∈Mn×n, U, Vorthogonal, D “almost” diagonal

D =

σ1

σ2

. . .

σm

0 · · · 0

σ1 ≥ σ2 ≥ · · · ≥ σm ≥ 0

Then minX∈R4 ||M − X || = ||M − X4|| where

X4 = U

σ1

σ2

σ3

σ4

0

. . .

0 · · · 0

V t

and d2(M,R4) = σ5, dF (M,R4) =√∑

i≥5 σ2i .

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 31 / 55

Eriksson’s SVD method

Use SVD to evaluate how close a bipartition A|B is from being anedge split: sc(A|B) = dF (flattA|B(p),R4).Find the best cherry in the tree by looking at all bipartitions of 2 taxaagainst others.Trait this cherry as an indivisible pair of taxa, and proceed recursively.Example:

12|34513|12514|235∗15|23423|14524|13525|13434|12535|12445|123

⇒142|35143|25145|23∗

⇒ (1, 4), 5, (2, 3)

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 32 / 55

Problems of Eriksson’s method:

1 Low performance in the presence of long branches

2 Is it fair to compare bipartitions of different size with this score?

3 What happens when we have small sample size?

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 33 / 55

Long branch attraction

correct tree usually inferred tree

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 34 / 55

Flattening matrix for the wrong topology

13|24 AA AC · · · CA CC CG · · · GC GG GT · · · TG TT +

AA 0.128 0.001 · · · 0.001 0 0.001 · · · 0.001 0.021 0.001 · · · 0 0.011 0.190AC 0.043 0.001 · · · 0 0.002 0 · · · 0.002 0.027 0.001 · · · 0 0.013 0.100AG 0.008 0 · · · 0 0 0 · · · 0 0.009 0 · · · 0 0.002 0.023AT 0.011 0 · · · 0 0 0 · · · 0.001 0.001 0.002 · · · 0 0.039 0.058CA 0.023 0 · · · 0.001 0.025 0.004 · · · 0.003 0.028 0.002 · · · 0.001 0.019 0.114CC 0.005 0 · · · 0.003 0.140 0.011 · · · 0.001 0.017 0 · · · 0.001 0.029 0.213CG 0 0 · · · 0.001 0.026 0.008 · · · 0.002 0.013 0 · · · 0 0.006 0.059CT 0 0 · · · 0 0.012 0.001 · · · 0 0.001 0.004 · · · 0 0.055 0.084GA 0.001 0 · · · 0 0 0 · · · 0 0.009 0 · · · 0 0.001 0.011GC 0.001 0 · · · 0 0 0 · · · 0.001 0.005 0 · · · 0 0.003 0.011GG 0.001 0 · · · 0 0 0 · · · 0 0.003 0 · · · 0 0 0.004GT 0 0 · · · 0 0 0 · · · 0 0.001 0.001 · · · 0 0.004 0.006TA 0.010 0 · · · 0 0 0 · · · 0.004 0.006 0 · · · 0 0.010 0.035TC 0.001 0 · · · 0 0.005 0 · · · 0.001 0.010 0 · · · 0 0.020 0.039TG 0.001 0 · · · 0.001 0 0 · · · 0 0.003 0.0020 · · · 0 0.001 0.009TT 0 0 · · · 0 0 0 · · · 0 0.001 0.003 · · · 0 0.036 0.043+ 0.233 0.002 · · · 0.007 0.210 0.025 · · · 0.016 0.155 0.016 · · · 0.002 0.249 1

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 35 / 55

Eriksson’s revisited: Erik+2

Given a bipartition A|B, consider the transition matrices MA→B andMB→A instead of the flattening matrix:

M12→34(p) =

pAAAApAA++

pAAACpAA++

. . . pAATTpAA++

pACAApAC++

pACACpAC++

. . . pACTTpAC++

pAGAApAG++

pAGACpAG++

. . . pAGTTpAG++

. . . . . . . . .

M34→12(p) =

(columnsums =1

)

These matrices have the same rank as the flattening. We redefine thescore as:

Erik+2: sc(A|B) = (dF (MA→B ,R4) + dF (MB→A,R4))/2.

The lower the score is, the more it is likely that the bipartition comes froman edge of T .

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 36 / 55

Assessing performance

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 37 / 55

Assessing performance

Huelsenbeck ’95: assesses the performance of different phylogeneticreconstruction methods on the following tree space.

For each (a, b), simulate 100 alignments under topology 12|34. For eachmethod, represent the percentage of success in a gray scale: white=0% ,black=100% success.

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 37 / 55

Performance of Eriksson and Erik+2

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 38 / 55

Performance against other methods

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 39 / 55

old invariants

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 40 / 55

Erik+2 needs more data but... works under the most general model (underi.i.d. assumptions).Models difference:

Continuous-time models:

Si = exp(Qi ti ),

Parameters: Qi rate matrix, ti branch length. They assume localhomogeneity:S ′i (t) = QiSi (t). Usually, also global-homogeneity isassumed: same rate-matrix Q throughout the tree, different ti foreach edge i .

(Algebraic) Markov models: the entries of the matrices Si are theparameters of the model. Therefore they allow for non-homogeneousdata, both local and global (different mutation rates at differentlineages of the tree and on a single edge).The results we’ve seen so far were on globally homogeneous data.Now...

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 41 / 55

Erik+2 needs more data but... works under the most general model (underi.i.d. assumptions).Models difference:

Continuous-time models:

Si = exp(Qi ti ),

Parameters: Qi rate matrix, ti branch length. They assume localhomogeneity:S ′i (t) = QiSi (t). Usually, also global-homogeneity isassumed: same rate-matrix Q throughout the tree, different ti foreach edge i .

(Algebraic) Markov models: the entries of the matrices Si are theparameters of the model. Therefore they allow for non-homogeneousdata, both local and global (different mutation rates at differentlineages of the tree and on a single edge).The results we’ve seen so far were on globally homogeneous data.Now...

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 41 / 55

Erik+2 Length=1000

pa

ram

ete

r b

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

33%

33%

95%

95%

95%

95%

95%

95%

95%

95%

mean= 0.80343 ; sd= 0.16955

Erik+2 Length=10000

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

95%

95%

95%

95%

95

%

95%

mean= 0.97102 ; sd= 0.04373

NJ Length=1000

pa

ram

ete

r b

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

33%

9

5%

9

5%

95%

95%

95%

95%

95%

mean= 0.79745 ; sd= 0.18404

NJ Length=10000

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

95%

9

5%

95%

95%

95%

95%

95%

95

% 9

5%

9

5%

mean= 0.94313 ; sd= 0.08771

ML Length=1000

parameter a

pa

ram

ete

r b

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

95%

95%

95%

95%

mean= 0.73608 ; sd= 0.17409

ML Length=10000

parameter a

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

95%

95%

95%

95%

mean= 0.754 ; sd= 0.16967

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 42 / 55

Erik+2 is more robust: rates across sites

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 43 / 55

Erik+2 on 2- mixtures

Kolaczkowski, Thornton. Nature 2004

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 44 / 55

How about larger trees?

Scores are not normalized, we cannot compare bipartitions of differentsize

Eriksson’s method involved comparison of different bipartitions

Also, some penalization should be given: n = 6, requiring a 42 × 44

matrix to have rank ≤ 4 or requiring a 43 × 43 matrix having rank≤ 4 is not the same.

Indeed, codimension of R4,42×44 : (42 − 4)(44 − 4) = 3024,codimension of R4,43×43 : 3600

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 45 / 55

How about larger trees?

Scores are not normalized, we cannot compare bipartitions of differentsize

Eriksson’s method involved comparison of different bipartitions

Also, some penalization should be given: n = 6, requiring a 42 × 44

matrix to have rank ≤ 4 or requiring a 43 × 43 matrix having rank≤ 4 is not the same.

Indeed, codimension of R4,42×44 : (42 − 4)(44 − 4) = 3024,codimension of R4,43×43 : 3600

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 45 / 55

Normalizing the score

But in the meanwhile...

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 46 / 55

Normalizing the score

But in the meanwhile...

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 46 / 55

Weighted quartet-based methods

Weight each quartet tree T1 = 12|34, T2 = 13|24, T3 = 14|23 accordingto the reconstruction method (weight ≡ reliability):

Erik+2: w(Ti ) = sc(Ti )−1/(sc(T1)−1 + sc(T2)−1 + sc(T3)−1)

ML: w(Ti ) = logL(Ti )−1/(logL(T1)−1 + logL(T2)−1 + logL(T3)−1)

Barycentric coordinates: (w(T1),w(T2),w(T3))

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 47 / 55

Barycentric plots

(w(T1),w(T2),w(T3))

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 48 / 55

Barycentric plots

(w(T1),w(T2),w(T3))

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 49 / 55

Quartet-based methods (Weight optimization)

Where do we put leaf 5?

Gascuel-Ranwez 2001M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 50 / 55

Topology evaluated; b: internal branch length, leaf edges proportional to it

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 51 / 55

Method Quartets b = 0.005 b = 0.015 b = 0.05 b = 0.1 Average

QP Erik+2 0.05 (100) 0.07 (100) 1.69 (100) 4.57 (100) 1.6ML 9 (100) 9.08 (93) 9.65 (93) - (0) -NJ 9.7 (100) 9.65 (100) 9.91 (100) 9.91 (100) 9.79

WIL Erik+2 0.02 (100) 0.07 (100) 0.33 (100) 1.52 (100) 0.48ML 9.97 (100) 9.28 (93) 9.27 (93) - (0) -NJ 7.66 (100) 7.41 (100) 8.25 (100) 9.61 (100) 8.23

WO Erik+2 0.02 (100) 0.06 (100) 0.2 (100) 1.66 (100) 0.48ML 11.03 (100) 11.18 (93) 11.83 (93) - (0) -NJ 9.46 (100) 9.22 (100) 9.2 (100) 9.73 (100) 9.4

Global NJ 0 0 1.54 10.19

Table : Average Robinson-Foulds distance to the correct tree, alignment length5000, GMM data.

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 52 / 55

Method Quartets b = 0.005 b = 0.015 b = 0.05 b = 0.1 Average

QP Erik+2 0 (100) 0 (100) 0.44 (100) 2.92 (100) 0.84ML 9 (100) 9.11 (96) 9.6 (91) 9 (1) 9.18NJ 9.69 (100) 9.97 (100) 9.92 (100) 9.95 (100) 9.88

WIL Erik+2 0 (100) 0.02 (100) 0.09 (100) 0.44 (100) 0.14ML 9.33 (100) 9 (96) 8.98 (91) 9 (1) 9.08NJ 7.46 (100) 7.44 (100) 8.03 (100) 9.48 (100) 8.1

WO Erik+2 0 (100) 0 (100) 0.08 (100) 0.51 (100) 0.15ML 10.97 (100) 11.01 (96) 11.69 (91) 13 (1) 11.67NJ 9.41 (100) 9.18 (100) 9.33 (100) 9.62 (100) 9.38

Global NJ 0 0 1.16 9.53

Table : Average Robinson-Foulds distance to the correct tree, alignment length10000, GMM data

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 53 / 55

Open problems

Algorithm for larger trees

Incorporate Allman-Rhodes-Taylor semialgebraic conditions

Look for closest rank ≤ 4 nonnegative matrix

Statistical analysis of the method

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 54 / 55

Open problems

Algorithm for larger trees

Incorporate Allman-Rhodes-Taylor semialgebraic conditions

Look for closest rank ≤ 4 nonnegative matrix

Statistical analysis of the method

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 54 / 55

Thanks!

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 55 / 55

top related