marta casanellas rius (joint work with jesus fern andez-s...

Phylogenetic invariants versus classical phylogenetics

Marta Casanellas Rius(joint work with Jesus Fernandez-Sanchez)

Departament de Matematica Aplicada IUniversitat Politecnica de Catalunya

Algebraic Statistics 2014IIT, Chicago

May 2014

M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 1 / 55

Outline

1 Evolutionary Markov models

2 Phylogenetic invariants

3 Eriksson’s method revisited


Outline





Phylogenetic trees


Phylogenetic reconstruction

Given n current species, e.g. n = 3, HUMAN, GORILLA and CHIMP, wewant to reconstruct their ancestral relationships (phylogeny):

Input: DNA sequences (part of their genome) and an alignment


Phylogenetic reconstruction

Given n current species, e.g. n = 3, HUMAN, GORILLA and CHIMP, wewant to reconstruct their ancestral relationships (phylogeny):

Input: DNA sequences (part of their genome) and an alignment

AACTTCGAGGCTTACCGCTG...

AAGGTCGATGCTCACCGATG...

AACGTCTATGCTCACCGATG...


Phylogenetic reconstruction goals

Goals:

Reconstruct the correct tree topology (topology of the graph withspecies labels at the leaves). E.g. 3 different tree topologies ofunrooted trees on four leaves :

Estimate branch lengths: the amount of substitutions per site thathave occurred between the two species at the ends of the branch


Modelling evolution

Assume that all sites of the alignment evolve independently andaccording to the same process (i.i.d.).Model the evolution of z single site on a tree T .At each node of the tree T one puts a random variable taking valuesin {A, C, G, T}

Variables at the leaves are observed and variables at the interior nodesare hidden.At each branch one writes a substitution matrix with the conditionalprobabilities of a nucleotide at the parent node being substituted byanother at its child.

A C G T

Si =

ACGT

P(A|A, i) P(C |A, i) P(G |A, i) P(T |A, i)P(A|C , i) P(C |C , i) P(G |C , i) P(T |C , i)P(A|G , i) P(C |G , i) P(G |G , i) P(T |G , i)P(A|T , i) P(C |T , i) P(G |T , i) P(T |T , i)


Evolutionary Markov models

The entries of Si and the root distribution are unknown parameters.

According to the shape of (or group of symmetries in S4 acting on) Si

one has different evolutionary models: general Markov model, strandsymmetric model, Kimura 3-parameter, Kimura 2-parameters,Jukes-Cantor model.



Jukes-Cantor model

Root node has uniform distribution: pA = ... = pT = 0.25

A C G T

Si =

ACGT

ai bi bi bi

bi ai bi bi

bi bi ai bi

bi bi bi ai

, ai + 3bi = 1

only 2 different parameters (one for mutation and one for non-mutation),only 1 free parameter.



Kimura 3-parameter model

Root node has uniform distribution: pA = ... = pT = 0.25

A C G T

Si =ACGT

(ai bi ci dibi ai di cici di ai bidi ci bi ai

), ai + bi + ci + di = 1

Kimura 2-parameter: bi = di



Strand Symmetric model

Root node distribution satisfies: pA = pT, pC = pG

A C G T

Si =

ACGT

ai bi ci di

ei fi gi hi

hi gi fi eidi ci bi ai

Strand symmetry in a DNA molecule:

A–TC–G



General Markov model

Root node distribution: arbitrary

A C G T

Si =

ACGT

ai bi ci di

ei fi gi hi

ji ki li mi

ni oi pi qi

All these models are known as equivariant models: they are invariantby permutation of rows and columns of a certain subgroup of G = S4

(permutations of A,C,G,T).


Hidden Markov process

px1x2x3 = Prob(X1 = x1,X2 = x2,X3 = x3) = joint probability of theobserved variables X1,X2,X3. Then,

px1x2x3 =∑

y4,yr∈{A,C,G,T}

pyr S1(yr , x1)S4(yr , y4)S2(y4, x2)S3(y4, x3)

px1x2x3 is a homogeneous polynomial on the parameters (entries of thematrices Si and root distribution).


An evolutionary model M (with d free parameters) on a treetopology T on n leaves defines a map

ϕMT :∏

e∈E(T )4d −→ 44n−1

ϕMT :∏

e∈E(T ) Cd+1 −→ C4n

Phylogenetic variety VM(T ) = ImϕT , closure in Zariski topology.

Given an alignment, we can estimate the joint probability px1x2...xn bythe relative frequency of column x1x2 . . . xn in the alignment.

In the theoretical model, this would be a point p = (fx1x2...xn)x1x2...xn

on the variety VM(T ).

In the real world, it is only close to VM(T ) for some tree topology T .


An evolutionary model M (with d free parameters) on a treetopology T on n leaves defines a map

ϕMT :∏

e∈E(T )4d −→ 44n−1

ϕMT :∏

e∈E(T ) Cd+1 −→ C4n

Phylogenetic variety VM(T ) = ImϕT , closure in Zariski topology.

Given an alignment, we can estimate the joint probability px1x2...xn bythe relative frequency of column x1x2 . . . xn in the alignment.

In the theoretical model, this would be a point p = (fx1x2...xn)x1x2...xn

on the variety VM(T ).

In the real world, it is only close to VM(T ) for some tree topology T .

Image(ϕMT )


Phylogenetic mixtures

Change the i.i.d hypothesis by “all positions evolve independently andfollowing the same evolutionary model”.

Same model M, same tree topology T = T1 but different substitutionparameters. Then,

p = (pAA...A, . . . , pTT...T) = λϕMT1(S1, . . . ,S4) + (1− λ)ϕMT1

(S ′1, . . . ,S′4)

That is, p ∈ SecVMT1


Phylogenetic mixtures

Change the i.i.d hypothesis by “all positions evolve independently andfollowing the same evolutionary model”.

Same model M, different tree topologies and different substitutionparameters. Then,

p = (pAA...A, . . . , pTT...T) = λϕMT1(S1, . . . ,S4) + (1− λ)ϕMT2

(S ′1, . . . ,S′4)

That is, p ∈ VMT1∗ VMT2

(join of two varieties).


Outline





Which generators of I (VM(T )) are useful for recovering the treetopology?

Some generators depend on the model chosen (not on the treetopology). E.g. for Jukes-Cantor, n = 3:

∑px1x2x3 = 1 and

pAAA − pCCC, pAAA − pGGG, pAAA − pTTT,

pAAC − pAAG, pAAC − pAAT, pCCA − pCCG, . . . ,

pACA − pAGA, pACA − pATA, pCAC − pCGC, . . .

pCAA − pGAA, pCAA − pTAA, pACC − pGCC, . . .For this model, the polynomials that detect the topology of the threespecies have degree 3.

Definition (Cavender–Felsenstein ’87)

The elements of I (VM(T )) are called invariants. They are calledphylogenetic invariants if they distinguish between different tree topologies.


Description of the ideal

Computational algebra software: fails for ≥ 5 leaves.


Description of the ideal

Computational algebra software: fails for ≥ 5 leaves.Model M fixed.

Theorem (Allman-Rhodes, Draisma-Kuttler, Sturmfels, Sullivant, C)

There is an algorithm for obtaining the generators of the ideal of ann-leaved tree T from those of an unrooted tree of 3 leaves and the edgesof the tree.

I (V (T )) =∑

v∈Int(T )

Iv +∑

e∈E(T )

Ie

Iv

Ie


Invariants from edges

Example: M = general Markov model.

16× 16 matrix obtained by flattening p = (pAAAA, pAAAC, . . . , pTTTT) in4⊗ C4 ∼= (

2⊗ C4)⊗ (

2⊗ C4) :

states in leaves 3,4

flatep =states

leaves

1, 2

pAAAA pAAAC pAAAG . . . pAATTpACAA pACAC pACAG . . . pACTTpAGAA pAGAC pAGAG . . . pAGTT. . . . . . . . . . . . . . .

pTTAA pTTAC pTTAG . . . pTTTT

Ie := ideal generated by all 5× 5 minors.

(Allman-Rhodes, Eriksson) Ie are phylogenetic invariants for thegeneral Markov model.


Phylogenetic invariants

We are just interested in phylogenetic invariants, i.e, generators thatvanish in some tree topologies but not all.

Ingredients:1 n given species2 evolutionary model M is fixed3 want tree topology relating them

We call

T = {trivalent tree topologies on n labelled leaves}

Assuming that our data p comes from a distribution on some treeT ∈ T under the model M, find the correct tree T0.

In algebraic geometry this is: assume p ∈ ∪T∈T VM(T ) and find T0

such that p ∈ VM(T0).


Phylogenetic invariants

Goal: for any particular tree T0 ∈ T , how is the variety VM(T0)defined inside ∪T∈T VM(T )? (at least in a biologically meaningfulopen subset).


Main Theorem

Theorem

Let M = Jukes-Cantor, Kimura (2 or 3 parameters), Strand Symmetric,General Markov model (or any equivariant model). For topologyreconstruction purposes, it is enough to consider invariants from edges.

More precisely, for each tree topology T ∈ T there exists a dense opensubset UT ⊆ V (T ) such that if a point p belongs to

⋃T∈T

UT , then

p ∈ V (T0) p ∈ Z (∑

e∈E(T0) Ie) .

Theorem (Buneman’s Splits equivalence Theorem)

A tree can be recovered by knowing all bipartitions induced from its edges.


Key: Representation theory to include all models

Draisma-Kuttler framework: representation theory

Example: Jukes-Cantor model. Substitution matrices:A C G T

S =

A

C

G

T

a b b bb a b bb b a bb b b a

Any permutation of rows and columns leaves the matrix invariant.Consider G = S4, permutations of A,C,G,T.Consider the representation of G , ρ : G −→ GL(W ),W =< A, C, G, T >C∼= C4, corresponding to the action of G on4× 4-matrices. For instance,

(A, C) 7→(

0 1 0 01 0 0 00 0 1 00 0 0 1

)M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 21 / 55

New description of the parametrization

We associate the vector space W = 〈A, C, G, T〉C to each node in thetree and then each substitution matrix belongs to HomG (W ,W ).

Therefore the substitution matrices lie inparG (T ) :=

∏e∈E(T ) HomG (W ,W ) and the map ϕT parameterizing

VM(T ) can be seen as

ϕT : parG (T ) −→ (n⊗W )G


Other models: subgroups of S4

G = 〈(A, C)(G, T), (A, G)(C, T)〉 ∼= Z2 × Z2 for Kimura 3-parametermodel.

G = 〈(AT)(CG)〉 for strand symmetric model.

G = id for general Markov model.

and in all cases W = 〈A, C, G, T〉C and ρ : G −→ GL(W ) is thepermutation representation.


Jukes-Cantor

If G = S4 (Jukes-Cantor), then G has 5 irreducible representationsM0, . . . ,M4.

(Maschke): Any G -representation E decomposes into isotypiccomponents, E ∼= ⊕4

i=0Mi ⊗ Cmi , and we associate to it a 5-tuple ofmultiplicities m(E ) = (m0, . . . ,m4).

Our particular representation ρ : G −→ GL(W ) for Jukes-Cantormodel decomposes as W ∼= M0 ⊕M3 and it hasm(W ) = (1, 0, 0, 1, 0).


Thin flattening

Let L1 | L2 be a bipartition of the set of leaves {1, . . . , n} , anddenote l1 = |L1|, l2 = |L2|

Let p ∈ (n⊗W )G . The matrix of the thin flattening of p along L1 | L2,

TflatL1|L2p = (P0, . . . ,P4), is obtained from p via the isomorphism

(l1⊗W ⊗

l2⊗W )G ∼= HomG (

l1⊗W ,

l2⊗W ) ∼=

4⊕i=0HomC(Cm(

l1⊗W )i ,Cm(

l2⊗W )i )

(last isomorphism is due to Schur’s Lemma).

Write rkTflatL1,L2p := (rk P0, . . . , rk P4).

Rank conditions:

rkTflatL1,L2p ≤ m(W ) = (1, 0, 0, 1, 0)


Example (Jukes-Cantor)

n = 4, L1 = {1, 2}, L2 = {3, 4}.

HomG (2⊗W ,

2⊗W ) ∼=

4⊕i=0

HomC(Cm(2⊗W )i ,Cm(

2⊗W )i )

W ⊗W ∼= M20 ⊕M2 ⊕M3

3 ⊕M4, hence m(2⊗W ) = (2, 0, 1, 3, 1) and

TflatL1,L2p can be written as a block matrix in a certain basis:

TflatL1,L2p =

∗ ∗∗ ∗ 0 0 0

0 ∗ 0 0

0 0∗ ∗ ∗∗ ∗ ∗∗ ∗ ∗

0

0 0 0 ∗

rkTflatL1,L2p ≤ m(W ) = (1, 0, 0, 1, 0).


Invariants for mixtures on the same tree

If p is a k-mixture distribution on a tree topology T ,p =

∑ki=1 λiϕ

MT (θi ), and A|B is a bipartition induced by and edge of

T , thenrk(flatA|B(p)) ≤ 4k

(for the general Markov model, analogous for other equivariantmodels)


Some results on simulations

Phylogenetic invariants have been tested on simulated data with notso bad results

Have been also tested in quartet-based methods (Rusinko-Hipp,Sumner et al.)

But... there is no canonical way of using them.

Model invariants have been used for model selection with successfulresults (+Kedzierska, Guigo)

Invariants have been also useful for proving identifiability of certainmodels (Allman-Rhodes et al.)


Are we ready for application to real data???


Outline





Singular value decomposition (SVD)

Distance (in Frobenius or 2-norm) of an m× n matrix M to R4 = { m× nmatrices of rank ≤ 4} can be computed easily:

SVD: M = UDV t with U ∈Mm×m, D ∈Mm×n, V ∈Mn×n, U, Vorthogonal, D “almost” diagonal

D =

σ1

σ2

. . .

σm

0 · · · 0

σ1 ≥ σ2 ≥ · · · ≥ σm ≥ 0

Then minX∈R4 ||M − X || = ||M − X4|| where

X4 = U

σ1

σ2

σ3

σ4

0

. . .

0 · · · 0

V t

and d2(M,R4) = σ5, dF (M,R4) =√∑

i≥5 σ2i .


Eriksson’s SVD method

Use SVD to evaluate how close a bipartition A|B is from being anedge split: sc(A|B) = dF (flattA|B(p),R4).Find the best cherry in the tree by looking at all bipartitions of 2 taxaagainst others.Trait this cherry as an indivisible pair of taxa, and proceed recursively.Example:

12|34513|12514|235∗15|23423|14524|13525|13434|12535|12445|123

⇒142|35143|25145|23∗

⇒ (1, 4), 5, (2, 3)


Problems of Eriksson’s method:

1 Low performance in the presence of long branches

2 Is it fair to compare bipartitions of different size with this score?

3 What happens when we have small sample size?


Long branch attraction

correct tree usually inferred tree


Flattening matrix for the wrong topology

13|24 AA AC · · · CA CC CG · · · GC GG GT · · · TG TT +

AA 0.128 0.001 · · · 0.001 0 0.001 · · · 0.001 0.021 0.001 · · · 0 0.011 0.190AC 0.043 0.001 · · · 0 0.002 0 · · · 0.002 0.027 0.001 · · · 0 0.013 0.100AG 0.008 0 · · · 0 0 0 · · · 0 0.009 0 · · · 0 0.002 0.023AT 0.011 0 · · · 0 0 0 · · · 0.001 0.001 0.002 · · · 0 0.039 0.058CA 0.023 0 · · · 0.001 0.025 0.004 · · · 0.003 0.028 0.002 · · · 0.001 0.019 0.114CC 0.005 0 · · · 0.003 0.140 0.011 · · · 0.001 0.017 0 · · · 0.001 0.029 0.213CG 0 0 · · · 0.001 0.026 0.008 · · · 0.002 0.013 0 · · · 0 0.006 0.059CT 0 0 · · · 0 0.012 0.001 · · · 0 0.001 0.004 · · · 0 0.055 0.084GA 0.001 0 · · · 0 0 0 · · · 0 0.009 0 · · · 0 0.001 0.011GC 0.001 0 · · · 0 0 0 · · · 0.001 0.005 0 · · · 0 0.003 0.011GG 0.001 0 · · · 0 0 0 · · · 0 0.003 0 · · · 0 0 0.004GT 0 0 · · · 0 0 0 · · · 0 0.001 0.001 · · · 0 0.004 0.006TA 0.010 0 · · · 0 0 0 · · · 0.004 0.006 0 · · · 0 0.010 0.035TC 0.001 0 · · · 0 0.005 0 · · · 0.001 0.010 0 · · · 0 0.020 0.039TG 0.001 0 · · · 0.001 0 0 · · · 0 0.003 0.0020 · · · 0 0.001 0.009TT 0 0 · · · 0 0 0 · · · 0 0.001 0.003 · · · 0 0.036 0.043+ 0.233 0.002 · · · 0.007 0.210 0.025 · · · 0.016 0.155 0.016 · · · 0.002 0.249 1


Eriksson’s revisited: Erik+2

Given a bipartition A|B, consider the transition matrices MA→B andMB→A instead of the flattening matrix:

M12→34(p) =

pAAAApAA++

pAAACpAA++

. . . pAATTpAA++

pACAApAC++

pACACpAC++

. . . pACTTpAC++

pAGAApAG++

pAGACpAG++

. . . pAGTTpAG++

. . . . . . . . .

M34→12(p) =

(columnsums =1

)

These matrices have the same rank as the flattening. We redefine thescore as:

Erik+2: sc(A|B) = (dF (MA→B ,R4) + dF (MB→A,R4))/2.

The lower the score is, the more it is likely that the bipartition comes froman edge of T .


Assessing performance


Assessing performance

Huelsenbeck ’95: assesses the performance of different phylogeneticreconstruction methods on the following tree space.

For each (a, b), simulate 100 alignments under topology 12|34. For eachmethod, represent the percentage of success in a gray scale: white=0% ,black=100% success.


Performance of Eriksson and Erik+2


Performance against other methods


old invariants


Erik+2 needs more data but... works under the most general model (underi.i.d. assumptions).Models difference:

Continuous-time models:

Si = exp(Qi ti ),

Parameters: Qi rate matrix, ti branch length. They assume localhomogeneity:S ′i (t) = QiSi (t). Usually, also global-homogeneity isassumed: same rate-matrix Q throughout the tree, different ti foreach edge i .

(Algebraic) Markov models: the entries of the matrices Si are theparameters of the model. Therefore they allow for non-homogeneousdata, both local and global (different mutation rates at differentlineages of the tree and on a single edge).The results we’ve seen so far were on globally homogeneous data.Now...


Erik+2 Length=1000

pa

ram

ete

r b

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

33%

33%

95%

95%

95%

95%

95%

95%

95%

95%

mean= 0.80343 ; sd= 0.16955

Erik+2 Length=10000

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

95%

95%

95%

95%

95

%

95%

mean= 0.97102 ; sd= 0.04373

NJ Length=1000

pa

ram

ete

r b

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

33%

9

5%

9

5%

95%

95%

95%

95%

95%

mean= 0.79745 ; sd= 0.18404

NJ Length=10000

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

95%

9

5%

95%

95%

95%

95%

95%

95

% 9

5%

9

5%

mean= 0.94313 ; sd= 0.08771

ML Length=1000

parameter a

pa

ram

ete

r b

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

95%

95%

95%

95%

mean= 0.73608 ; sd= 0.17409

ML Length=10000

parameter a

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

95%

95%

95%

95%

mean= 0.754 ; sd= 0.16967


Erik+2 is more robust: rates across sites


Erik+2 on 2- mixtures

Kolaczkowski, Thornton. Nature 2004


How about larger trees?

Scores are not normalized, we cannot compare bipartitions of differentsize

Eriksson’s method involved comparison of different bipartitions

Also, some penalization should be given: n = 6, requiring a 42 × 44

matrix to have rank ≤ 4 or requiring a 43 × 43 matrix having rank≤ 4 is not the same.

Indeed, codimension of R4,42×44 : (42 − 4)(44 − 4) = 3024,codimension of R4,43×43 : 3600


Normalizing the score

But in the meanwhile...


Weighted quartet-based methods

Weight each quartet tree T1 = 12|34, T2 = 13|24, T3 = 14|23 accordingto the reconstruction method (weight ≡ reliability):

Erik+2: w(Ti ) = sc(Ti )−1/(sc(T1)−1 + sc(T2)−1 + sc(T3)−1)

ML: w(Ti ) = logL(Ti )−1/(logL(T1)−1 + logL(T2)−1 + logL(T3)−1)

Barycentric coordinates: (w(T1),w(T2),w(T3))


Barycentric plots

(w(T1),w(T2),w(T3))


Quartet-based methods (Weight optimization)

Where do we put leaf 5?

Gascuel-Ranwez 2001M. Casanellas (UPC) www.ma1.upc.edu/∼casanellas Chicago 2014 50 / 55

Topology evaluated; b: internal branch length, leaf edges proportional to it


Method Quartets b = 0.005 b = 0.015 b = 0.05 b = 0.1 Average

QP Erik+2 0.05 (100) 0.07 (100) 1.69 (100) 4.57 (100) 1.6ML 9 (100) 9.08 (93) 9.65 (93) - (0) -NJ 9.7 (100) 9.65 (100) 9.91 (100) 9.91 (100) 9.79

WIL Erik+2 0.02 (100) 0.07 (100) 0.33 (100) 1.52 (100) 0.48ML 9.97 (100) 9.28 (93) 9.27 (93) - (0) -NJ 7.66 (100) 7.41 (100) 8.25 (100) 9.61 (100) 8.23

WO Erik+2 0.02 (100) 0.06 (100) 0.2 (100) 1.66 (100) 0.48ML 11.03 (100) 11.18 (93) 11.83 (93) - (0) -NJ 9.46 (100) 9.22 (100) 9.2 (100) 9.73 (100) 9.4

Global NJ 0 0 1.54 10.19

Table : Average Robinson-Foulds distance to the correct tree, alignment length5000, GMM data.


Method Quartets b = 0.005 b = 0.015 b = 0.05 b = 0.1 Average

QP Erik+2 0 (100) 0 (100) 0.44 (100) 2.92 (100) 0.84ML 9 (100) 9.11 (96) 9.6 (91) 9 (1) 9.18NJ 9.69 (100) 9.97 (100) 9.92 (100) 9.95 (100) 9.88

WIL Erik+2 0 (100) 0.02 (100) 0.09 (100) 0.44 (100) 0.14ML 9.33 (100) 9 (96) 8.98 (91) 9 (1) 9.08NJ 7.46 (100) 7.44 (100) 8.03 (100) 9.48 (100) 8.1

WO Erik+2 0 (100) 0 (100) 0.08 (100) 0.51 (100) 0.15ML 10.97 (100) 11.01 (96) 11.69 (91) 13 (1) 11.67NJ 9.41 (100) 9.18 (100) 9.33 (100) 9.62 (100) 9.38

Global NJ 0 0 1.16 9.53

Table : Average Robinson-Foulds distance to the correct tree, alignment length10000, GMM data


Open problems

Algorithm for larger trees

Incorporate Allman-Rhodes-Taylor semialgebraic conditions

Look for closest rank ≤ 4 nonnegative matrix

Statistical analysis of the method


Thanks!


marta casanellas rius (joint work with jesus fern andez-s...

Documents