Page 1: Using algebraic geometry for phylogenetic reconstruction...Title Using algebraic geometry for phylogenetic reconstruction Author Marta Casanellas i Rius (joint work with Jesús Fernández-Sánchez)

Using algebraic geometry for phylogenetic reconstruction

Marta Casanellas i Rius (joint work with Jesús Fernández-Sánchez)

Departament de Matemàtica Aplicada I, Universitat Politècnica de Catalunya

IMA Workshop: Applications in Biology, Dynamics, and Statistics

March 8, 2007

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 1 / 36

Outline

1 Algebraic evolutionary models

2 Phylogenetic inference using algebraic geometry

3 The geometry of the Kimura variety

4 New results on simulated data

Phylogenetic reconstruction

Given n current species (e.g. HUMAN, GORILLA, CHIMP), and given part of their genome and an alignment.

Goal: to reconstruct their ancestral relationships (phylogeny):

Alignment

Mutations, deletions and insertions of nucleotides occur along the speciation process.

Given two DNA sequences, an alignment is a correspondence between them that accounts for their differences. The optimal alignment is the one that minimizes the number of mutations, deletions and insertions.

seq1: ACGTAGCTAAGTTA...
seq2: ACCGAGACCCAGTA...

A possible alignment is:

seq1  A C − G − T A − G C T A A G T T A
seq2  A C C G A G A C − C C A − G T − A
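To make the quantity being minimized concrete, here is a minimal sketch that counts substitutions and indels in a gapped pairwise alignment. The function name `alignment_cost` and the unit cost per event are assumptions made for illustration; the talk does not specify a scoring scheme. The two rows are the alignment shown above, written with ASCII `-` for gaps.

```python
def alignment_cost(row1, row2, gap="-"):
    """Count (substitutions, indels) in a gapped pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    substitutions = indels = 0
    for x, y in zip(row1, row2):
        if x == gap and y == gap:
            raise ValueError("a column cannot contain two gaps")
        if x == gap or y == gap:
            indels += 1          # insertion or deletion
        elif x != y:
            substitutions += 1   # mutation
    return substitutions, indels


# The alignment from the slide, with ASCII '-' for gaps.
seq1 = "AC-G-TA-GCTAAGTTA"
seq2 = "ACCGAGAC-CCA-GT-A"
substitutions, indels = alignment_cost(seq1, seq2)
```

An optimal alignment would minimize the sum of these counts over all ways of inserting gaps; this sketch only evaluates one given alignment.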

Phylogenetic reconstruction

Given n current species (e.g. HUMAN, GORILLA, CHIMP), and given part of their genome and an alignment:

AACTTCGAGGCTTACCGCTG

AAGGTCGATGCTCACCGATG

AACGTCTATGCTCACCGATG

Goal: to reconstruct their ancestral relationships (phylogeny):

Algebraic evolutionary models

Assume that all sites of the alignment evolve equally and independently.
At each node of the tree we put a random variable taking values in {A, C, G, T}.

$t_i$ = branch length (represents the number of mutations per site along that branch)

Variables at the leaves are observed and variables at the interior nodes are hidden.
At each branch we write a matrix (substitution matrix) with the probabilities of a nucleotide at the parent node being substituted by another at its child. With rows and columns indexed by A, C, G, T:

\[
S = \begin{pmatrix}
P(A|A) & P(C|A) & P(G|A) & P(T|A) \\
P(A|C) & P(C|C) & P(G|C) & P(T|C) \\
P(A|G) & P(C|G) & P(G|G) & P(T|G) \\
P(A|T) & P(C|T) & P(G|T) & P(T|T)
\end{pmatrix}
\]

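The entries of $S_i$ are unknown parameters.

Example (Group-based models. Kimura 3-parameters)
Root node has uniform distribution: $\pi_A = \dots = \pi_T = 0.25$

With rows and columns indexed by A, C, G, T:

\[
S_i = \begin{pmatrix}
a_i & b_i & c_i & d_i \\
b_i & a_i & d_i & c_i \\
c_i & d_i & a_i & b_i \\
d_i & c_i & b_i & a_i
\end{pmatrix},
\qquad a_i + b_i + c_i + d_i = 1
\]

Kimura 2-parameters: $b_i = d_i$.

Jukes-Cantor: $b_i = c_i = d_i$.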
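The three models can be written as one constructor with two specializations. This is a small sketch; the function names and the numeric parameter values are illustrative assumptions, not taken from the talk.

```python
def kimura3(b, c, d):
    """Kimura 3-parameter matrix (rows/columns A, C, G, T); a = 1 - b - c - d."""
    a = 1.0 - b - c - d
    return [[a, b, c, d],
            [b, a, d, c],
            [c, d, a, b],
            [d, c, b, a]]


def kimura2(b, c):
    """Kimura 2-parameter model: the constraint b = d."""
    return kimura3(b, c, b)


def jukes_cantor(b):
    """Jukes-Cantor model: the constraint b = c = d."""
    return kimura3(b, b, b)


# Made-up parameter values, for illustration only.
S = kimura3(0.05, 0.10, 0.02)
row_sums = [sum(row) for row in S]
```

Every such matrix is symmetric with rows summing to 1, so its columns sum to 1 as well.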

Hidden Markov process

We denote the joint distribution of the observed variables $X_1, X_2, X_3$ as $p_{x_1 x_2 x_3} = \mathrm{Prob}(X_1 = x_1, X_2 = x_2, X_3 = x_3)$.

\[
p_{x_1 x_2 x_3} = \sum_{y_4,\, y_r \in \{A,C,G,T\}} \pi_{y_r}\, S_1(x_1, y_r)\, S_4(y_4, y_r)\, S_2(x_2, y_4)\, S_3(x_3, y_4)
\]

$p_{x_1 x_2 x_3}$ is a homogeneous polynomial in the parameters whose degree is the number of edges.
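The sum over hidden states can be evaluated directly. The sketch below follows the index order of the formula as written, $S_1(x_1, y_r)$ and so on, and assumes Kimura 3-parameter matrices with made-up parameter values on the four edges. The names `kimura3` and `joint_distribution` are illustrative.

```python
from itertools import product

STATES = "ACGT"


def kimura3(b, c, d):
    """Kimura 3-parameter matrix; a is fixed by a + b + c + d = 1."""
    a = 1.0 - b - c - d
    return [[a, b, c, d], [b, a, d, c], [c, d, a, b], [d, c, b, a]]


def joint_distribution(S1, S2, S3, S4, pi=(0.25, 0.25, 0.25, 0.25)):
    """Return {pattern: p_pattern}, summing over the hidden states y_r and y_4."""
    p = {}
    for x1, x2, x3 in product(range(4), repeat=3):
        total = 0.0
        for yr, y4 in product(range(4), repeat=2):
            total += (pi[yr] * S1[x1][yr] * S4[y4][yr]
                      * S2[x2][y4] * S3[x3][y4])
        p[STATES[x1] + STATES[x2] + STATES[x3]] = total
    return p


# Illustrative (made-up) parameter values, one matrix per edge.
p = joint_distribution(kimura3(0.05, 0.10, 0.02), kimura3(0.03, 0.08, 0.01),
                       kimura3(0.04, 0.06, 0.03), kimura3(0.02, 0.05, 0.02))
total_mass = sum(p.values())
```

Each of the 64 entries is a sum of products of one root probability and four matrix entries, which is why the polynomial has degree 4 in the matrix entries, one factor per edge.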

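The evolutionary model defines a polynomial map

\[
\varphi : \mathbb{R}^d \longrightarrow \mathbb{R}^{4^n}, \qquad
\theta = (\theta_1, \dots, \theta_d) \mapsto (p_{AA\dots A}, p_{AA\dots C}, p_{AA\dots G}, \dots, p_{TT\dots T})
\]

\[
\varphi : \Delta^{d-1} \longrightarrow \Delta^{4^n - 1}
\]

\[
\varphi : \mathbb{C}^d \longrightarrow \mathbb{C}^{4^n}
\]

Algebraic variety $V = \overline{\operatorname{im}\varphi}$, closure in Zariski topology.

Given an alignment, we can estimate the joint probability $p_{x_1 x_2 x_3}$ as the relative frequency of column $x_1 x_2 x_3$ in the alignment.

In the theoretical model, this would be a point on the variety $V$.

Goal: use the ideal of $V$, $I(V) \subset \mathbb{R}[p_{AA\dots A}, p_{AA\dots C}, \dots, p_{TT\dots T}]$, to infer the topology of the phylogenetic tree.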
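The estimation step is a simple column count. As a minimal sketch, the function below (the name `column_frequencies` is an assumption) computes the relative frequency of each column pattern, applied to the three 20-site sequences shown earlier in the talk.

```python
from collections import Counter


def column_frequencies(rows):
    """Relative frequency of each column pattern in an ungapped alignment."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same length")
    counts = Counter("".join(column) for column in zip(*rows))
    return {pattern: n / length for pattern, n in counts.items()}


alignment = [
    "AACTTCGAGGCTTACCGCTG",
    "AAGGTCGATGCTCACCGATG",
    "AACGTCTATGCTCACCGATG",
]
p_hat = column_frequencies(alignment)
```

Patterns that never occur in the alignment are absent from the dictionary and should be read as frequency 0. With real data the resulting point lies only near the variety, not on it.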

Computing the ideal (Eriksson–Ranestad–Sturmfels–Sullivant ’05)

Some generators of $I(V)$ depend only on the model chosen (not on the topology). E.g. for Jukes-Cantor they are:

$\sum p_{x_1 x_2 x_3} = 1$

$p_{AAA} = p_{CCC} = p_{GGG} = p_{TTT}$ (4 terms)

$p_{AAC} = p_{AAG} = p_{AAT} = \dots = p_{TTG}$ (12 terms)

$p_{ACA} = p_{AGA} = p_{ATA} = \dots = p_{TGT}$ (12 terms)

$p_{CAA} = p_{GAA} = p_{TAA} = \dots = p_{GTT}$ (12 terms)

$p_{ACG} = p_{ACT} = p_{AGT} = \dots = p_{CGT}$ (24 terms)

For this model, the unique polynomial that detects the phylogeny of the three species has degree 3.

Definition (Cavender–Felsenstein ’87)
The generators of $I(V)$ are called phylogenetic invariants.
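The linear Jukes-Cantor relations can be checked numerically. The sketch below is a sanity check under stated assumptions, not the computation from the cited paper. It builds the joint distribution on the 3-leaf tree of the hidden Markov slide with made-up Jukes-Cantor parameters, groups the 64 patterns by which positions agree, and measures how much the probabilities vary inside each group.

```python
from collections import defaultdict
from itertools import product

STATES = "ACGT"


def jukes_cantor(b):
    """Jukes-Cantor matrix: off-diagonal entries b, diagonal 1 - 3b."""
    return [[1.0 - 3 * b if i == j else b for j in range(4)] for i in range(4)]


def joint_distribution(S1, S2, S3, S4):
    """Joint distribution on the 3-leaf tree with a uniform root."""
    p = {}
    for x1, x2, x3 in product(range(4), repeat=3):
        p[STATES[x1] + STATES[x2] + STATES[x3]] = sum(
            0.25 * S1[x1][yr] * S4[y4][yr] * S2[x2][y4] * S3[x3][y4]
            for yr, y4 in product(range(4), repeat=2))
    return p


def equality_pattern(x):
    """Which pairs of positions carry the same nucleotide."""
    return (x[0] == x[1], x[0] == x[2], x[1] == x[2])


# Made-up branch parameters, for illustration only.
p = joint_distribution(jukes_cantor(0.05), jukes_cantor(0.10),
                       jukes_cantor(0.02), jukes_cantor(0.07))
classes = defaultdict(list)
for pattern, value in p.items():
    classes[equality_pattern(pattern)].append(value)
class_sizes = sorted(len(values) for values in classes.values())
spread = max(max(values) - min(values) for values in classes.values())
```

The five groups have sizes 4, 12, 12, 12 and 24, matching the term counts listed above, and the spread inside each group is zero up to rounding.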

Problem: computation of invariants

Computational algebra software fails to compute the ideal for ≥ 4 species! Kimura 3-parameter, 4 species: 8002 generators.

[Small trees webpage]

Group-based models. Discrete Fourier transform

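Group-based models (Kimura and Jukes-Cantor): $G = \mathbb{Z}_2 \times \mathbb{Z}_2$, A = (0,0), C = (0,1), G = (1,0), T = (1,1). Substitution matrices are functions on the group:

\[
S_i(g_p, g_c) = f^i(g_p - g_c), \qquad g_c, g_p \in G.
\]

\[
p_{g_1, \dots, g_m} = p(g_1, \dots, g_m) = \frac{1}{4} \sum \prod_{e \in \text{edges}} f^e\bigl(g_{p(e)} - g_{c(e)}\bigr)
\]

(sum over all possible values at interior nodes).

Discrete Fourier transform of $f : G \longrightarrow \mathbb{C}$:

\[
\hat{f}(\chi) = \sum_{g \in G} \chi(g) f(g), \qquad \chi \in \operatorname{Hom}(G, \mathbb{C}^*) \cong G
\]

Convolution:

\[
(f_1 * f_2)(g) = \sum_{h \in G} f_1(h) f_2(g - h) \;\Longrightarrow\; \widehat{f_1 * f_2} = \hat{f}_1 \cdot \hat{f}_2
\]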
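The convolution identity is easy to confirm for $\mathbb{Z}_2 \times \mathbb{Z}_2$. This sketch assumes the characters are indexed by group elements via $\chi_h(g) = (-1)^{h \cdot g}$, a standard choice that the slide leaves implicit. The two functions `f1` and `f2` carry made-up values.

```python
from itertools import product

GROUP = list(product((0, 1), repeat=2))  # A=(0,0), C=(0,1), G=(1,0), T=(1,1)


def character(h, g):
    """Character chi_h(g) = (-1)^(h . g) of Z2 x Z2."""
    return (-1) ** ((h[0] * g[0] + h[1] * g[1]) % 2)


def fourier(f):
    """Discrete Fourier transform of f: G -> C, indexed by characters chi_h."""
    return {h: sum(character(h, g) * f[g] for g in GROUP) for h in GROUP}


def subtract(g, h):
    return ((g[0] - h[0]) % 2, (g[1] - h[1]) % 2)


def convolve(f1, f2):
    """(f1 * f2)(g) = sum over h of f1(h) f2(g - h)."""
    return {g: sum(f1[h] * f2[subtract(g, h)] for h in GROUP) for g in GROUP}


# Made-up functions on the group, for illustration only.
f1 = dict(zip(GROUP, (0.83, 0.05, 0.10, 0.02)))
f2 = dict(zip(GROUP, (0.88, 0.03, 0.08, 0.01)))
lhs = fourier(convolve(f1, f2))
rhs = {h: fourier(f1)[h] * fourier(f2)[h] for h in GROUP}
```

Here `lhs` and `rhs` agree up to rounding, which is the statement that the transform turns convolution into pointwise multiplication.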

Group based models.

Theorem (Evans–Speed)
For a group-based model (i.e. Kimura 3, Kimura 2 or Jukes-Cantor) on a tree T, the discrete Fourier transform of the joint distribution $p(g_1, \dots, g_n)$ has the following form

\[
q(\chi_1, \dots, \chi_n) = \prod_{e \in \text{edges}} \hat{f}^e\Bigl(\prod_{l \in \text{leaves below } e} \chi_l\Bigr)
\]

Using a linear change of coordinates, one gets a monomial parameterization of the variety associated to the model

\[
\varphi : \mathbb{C}^d \longrightarrow \mathbb{C}^{4^n}, \qquad
\theta = (\theta_1, \dots, \theta_d) \mapsto (q_{AA\dots A}, q_{AA\dots C}, q_{AA\dots G}, \dots, q_{TT\dots T})
\]

so that it is a toric variety.

The ideal is generated by binomials.
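The product formula can be checked numerically on the smallest case. The sketch below assumes an unrooted star tree with three leaves and a uniform distribution at the interior node. Group elements are encoded as integers 0..3 (A, C, G, T), so that addition in $\mathbb{Z}_2 \times \mathbb{Z}_2$ is bitwise XOR. In this setting the Fourier coordinate equals the product of the three edge transforms when $\chi_1 \chi_2 \chi_3$ is trivial and vanishes otherwise. All names and numeric values are illustrative.

```python
from itertools import product


def character(h, g):
    """chi_h(g) for h, g in {0,1,2,3}, encoding (g1, g2) as 2*g1 + g2."""
    return -1 if bin(h & g).count("1") % 2 else 1


def fourier(f):
    """Fourier transform of f given as a list indexed by group element."""
    return [sum(character(h, g) * f[g] for g in range(4)) for h in range(4)]


def star_tree_distribution(f1, f2, f3):
    """p(g1,g2,g3) = 1/4 * sum_y f1(y-g1) f2(y-g2) f3(y-g3); '-' is XOR."""
    return {(g1, g2, g3): 0.25 * sum(f1[y ^ g1] * f2[y ^ g2] * f3[y ^ g3]
                                     for y in range(4))
            for g1, g2, g3 in product(range(4), repeat=3)}


def fourier_of_distribution(p):
    """q(h1,h2,h3) = sum over g of chi_h1(g1) chi_h2(g2) chi_h3(g3) p(g)."""
    return {h: sum(character(h[0], g[0]) * character(h[1], g[1])
                   * character(h[2], g[2]) * value for g, value in p.items())
            for h in product(range(4), repeat=3)}


# Made-up edge functions (a, b, c, d), each summing to 1.
f1, f2, f3 = [0.83, 0.05, 0.10, 0.02], [0.88, 0.03, 0.08, 0.01], [0.87, 0.04, 0.06, 0.03]
q = fourier_of_distribution(star_tree_distribution(f1, f2, f3))
F1, F2, F3 = fourier(f1), fourier(f2), fourier(f3)
max_error = max(
    abs(q[h] - (F1[h[0]] * F2[h[1]] * F3[h[2]] if h[0] ^ h[1] ^ h[2] == 0 else 0.0))
    for h in q)
```

The 64 probability coordinates collapse to 16 nonzero Fourier coordinates, each a single monomial in the edge parameters.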

Fourier transform for Kimura 3-parameter model

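Parameters (left) and Fourier parameters (right):

\[
S^e = \begin{pmatrix}
a^e & b^e & c^e & d^e \\
b^e & a^e & d^e & c^e \\
c^e & d^e & a^e & b^e \\
d^e & c^e & b^e & a^e
\end{pmatrix}
\qquad
P^e = \begin{pmatrix}
P^e_A & 0 & 0 & 0 \\
0 & P^e_C & 0 & 0 \\
0 & 0 & P^e_G & 0 \\
0 & 0 & 0 & P^e_T
\end{pmatrix}
\]

$P^e_A = a^e + b^e + c^e + d^e$, $P^e_C = a^e - b^e + c^e - d^e$, ...

              Parameters                          Fourier parameters
simplex:      $\Delta^3$                          $\Delta^3$
              $a^e + b^e + c^e + d^e = 1$         $P^e_A = 1$
coordinates:  $p_{x_1 \dots x_n}$                 $q_{g_1 \dots g_n}$, $g_1 + \dots + g_n = 0$
simplex:      $\Delta^{4^n - 1}$                  $\Delta^{4^{n-1} - 1}$
              $\sum p_{x_1 \dots x_n} = 1$        $q_{A \dots A} = 1$

Linear invariants: $q_{g_1 \dots g_n} = 0$ if $g_1 + \dots + g_n \neq 0$ in $\mathbb{Z}_2 \times \mathbb{Z}_2$.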
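The passage from $S^e$ to the diagonal $P^e$ is a conjugation by the 4×4 character table of $\mathbb{Z}_2 \times \mathbb{Z}_2$ (a Hadamard matrix). The sketch below checks this numerically. The formulas used for $P^e_G$ and $P^e_T$ are the ones implied by the characters $(-1)^{h \cdot g}$ with the labeling A = (0,0), C = (0,1), G = (1,0), T = (1,1), which is an assumption about the part the slide elides. Parameter values are made up.

```python
HADAMARD = [[1, 1, 1, 1],
            [1, -1, 1, -1],
            [1, 1, -1, -1],
            [1, -1, -1, 1]]


def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def kimura3(b, c, d):
    a = 1.0 - b - c - d
    return [[a, b, c, d], [b, a, d, c], [c, d, a, b], [d, c, b, a]]


def fourier_matrix(S):
    """Return H S H^{-1}, using H^{-1} = H / 4 for this Hadamard matrix."""
    product_matrix = mat_mul(mat_mul(HADAMARD, S), HADAMARD)
    return [[value / 4.0 for value in row] for row in product_matrix]


b, c, d = 0.05, 0.10, 0.02  # illustrative values
a = 1.0 - b - c - d
P = fourier_matrix(kimura3(b, c, d))
expected_diagonal = [a + b + c + d, a - b + c - d, a + b - c - d, a - b - c + d]
```

The off-diagonal entries of `P` vanish up to rounding, and the first diagonal entry is 1 because the row sums of $S^e$ are 1.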

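For the Kimura 3-parameter model, we are interested in $V := \overline{\operatorname{im}(\varphi)}$ where

\[
\varphi : \prod_{e \in E(T)} \mathbb{C}^4 \longrightarrow \mathbb{C}^{4^{n-1}}, \qquad
(P^e_A, P^e_C, P^e_G, P^e_T)_e \mapsto (q_{x_1 \dots x_n})_{\{x_1, \dots, x_n \,\mid\, x_1 + \dots + x_n = 0\}}
\]

But $\Delta^{4^{n-1} - 1} \subset \{q_{A \dots A} = 1\}$.

Definition
The Kimura variety of the phylogenetic tree T is $W := V \cap \{q_{A \dots A} = 1\}$.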

Recursive construction of invariants

Even in the Fourier parameterization, computational algebra software fails for ≥ 5 leaves.

Theorem (Sturmfels–Sullivant ’05)
For any group-based model, they give an explicit algorithm for obtaining the generators of the ideal of phylogenetic invariants of an n-leaved tree from those of an unrooted tree of 3 leaves. The generators are binomials of degree ≤ 4.

Phylogenetic inference using algebraic geometry

1990-1995: Biologists claim that phylogenetic invariants are not useful for phylogenetic reconstruction.
Lake only used linear invariants!

In particular, a work of Huelsenbeck ’95 reveals the inefficiency of Lake’s method of invariants.

(Eriksson ’05) for General Markov Model.

(C–Garcia–Sullivant ’05) for Kimura 3-parameter.

Phylogenetic inference using algebraic geometry

Model: Unrooted tree of 4 leaves, Kimura 3-parameters.
Naive algorithm: Take the $\|\cdot\|_1$ norm of the evaluation of all invariants at the point given by the relative frequencies of the columns of the alignment. Obtain a score for each possible topology and choose the one with the smallest score.
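The scoring rule itself fits in a few lines. In the sketch below the three 4-leaf topologies are labeled by their splits, and the "invariants" are toy stand-ins (simple binomials in four made-up coordinates), not the actual Kimura invariants, which are far too numerous to list here. Only the selection logic is meant to be illustrative, and the function names are assumptions.

```python
def l1_score(invariants, p_hat):
    """Sum of absolute values of the invariants evaluated at p_hat."""
    return sum(abs(f(p_hat)) for f in invariants)


def choose_topology(invariants_by_topology, p_hat):
    """Return the topology with the smallest l1 score, plus all scores."""
    scores = {topology: l1_score(invariants, p_hat)
              for topology, invariants in invariants_by_topology.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Toy stand-ins, NOT real phylogenetic invariants: one binomial per topology.
toy_invariants = {
    "12|34": [lambda p: p["x"] * p["y"] - p["z"] * p["w"]],
    "13|24": [lambda p: p["x"] * p["z"] - p["y"] * p["w"]],
    "14|23": [lambda p: p["x"] * p["w"] - p["y"] * p["z"]],
}
p_hat = {"x": 0.4, "y": 0.1, "z": 0.2, "w": 0.2}  # made-up point
best, scores = choose_topology(toy_invariants, p_hat)
```

With real data no score is exactly zero, since the empirical point only lies near the variety of the true topology; the method relies on that score being smaller than the other two.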

Huelsenbeck ’95: assesses the performance of different phylogenetic reconstruction methods on the following tree space.

Results on simulated data (Huelsenbeck)

Huelsenbeck’s results:

[Figure: panels for Lake invariants, NJ and ML (rows) at alignment lengths 100, 500 and 1000 (columns)]

Huelsenbeck, Syst. Biol., 1995

C–Fernández-Sánchez studies on the C–Garcia–Sullivant method

[Figure: panels for alignment lengths 100, 500 and 1000]

Why use phylogenetic invariants

1 We do not need to estimate parameters of the model.
2 The algebraic model allows species within the tree to evolve at different mutation rates.

In the non-algebraic version, one has $S_i = \exp(Q \cdot t_i)$, where $Q$ is a fixed matrix that represents the instantaneous mutation rate of all the species in the tree.
In the algebraic model, one is allowing different rates at each branch of the tree.
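For comparison, the non-algebraic construction $S_i = \exp(Q \cdot t_i)$ can be sketched with a truncated Taylor series. The Jukes-Cantor rate matrix used here (off-diagonal rate `alpha`, diagonal `-3 * alpha`) and all numeric values are assumptions chosen because the exponential then has a well-known closed form to compare against. The talk itself leaves $Q$ generic.

```python
import math


def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(M, terms=30):
    """Truncated Taylor series exp(M) = sum_k M^k / k!; fine for small entries."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[value / k for value in row] for row in mat_mul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def jukes_cantor_rate_matrix(alpha):
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


def substitution_matrix(Q, t):
    """S = exp(Q * t) for a fixed rate matrix Q and branch length t."""
    return mat_exp([[q * t for q in row] for row in Q])


alpha, t = 0.1, 0.5  # illustrative values
S = substitution_matrix(jukes_cantor_rate_matrix(alpha), t)
closed_form_diagonal = 0.25 + 0.75 * math.exp(-4 * alpha * t)
closed_form_off_diagonal = 0.25 - 0.25 * math.exp(-4 * alpha * t)
```

In this version every branch shares the same `Q` and differs only in `t`, whereas in the algebraic model each branch gets its own freely chosen matrix.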

For instance, using algebraic geometry one should be able to reconstruct the biologically correct tree:

[Figure: tree with leaves dog, human, chimp, rat, mouse, chicken]

(Al-Aidroos, Snir ’05) chapter 21 in ASCB book.

Simulations on non-homogeneous trees (different rate matrices)

The global geometry of the Kimura variety

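Recall that $V := \overline{\operatorname{im}(\varphi)}$ where

\[
\varphi : \prod_{e \in E(T)} \mathbb{C}^4 \longrightarrow \mathbb{C}^{4^{n-1}}, \qquad
(P^e_A, P^e_C, P^e_G, P^e_T)_e \mapsto (q_{x_1 \dots x_n})_{\{x_1, \dots, x_n \,\mid\, x_1 + \dots + x_n = 0\}}
\]

But $\Delta^{4^{n-1} - 1} \subset \{q_{A \dots A} = 1\}$.

Definition
The Kimura variety of the phylogenetic tree T is $W := V \cap \{q_{A \dots A} = 1\}$.

\[
W = \varphi\Bigl(\prod_{e \in E(T)} \bigl(\mathbb{C}^4 \cap \{P^e_A = 1\}\bigr)\Bigr).
\]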

Local complete intersection

$V \subset \mathbb{A}^N$ algebraic variety, $\mu$ minimum number of generators of $I(V)$, then

\[
\mu \geq \operatorname{codim}(V) = N - \dim(V).
\]

$V$ is called a complete intersection if $\mu = \operatorname{codim}(V)$.

Example: the Kimura variety $W \subset \mathbb{C}^{4^{n-1} - 1}$ has codimension $4^{n-1} - 6n + 8$, n = number of leaves.
For n = 4, the minimum number of generators of the Kimura variety is 8002 whereas its codimension is 48!
On a neighborhood of a smooth point, any variety is a local complete intersection (i.e. it can be defined by $\operatorname{codim}(V)$ equations).

Goal:

1 Find the singular points.
2 Provide a set of local generators at non-singular points.
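The codimension formula can be recovered by a dimension count. The derivation below is a sketch that assumes an unrooted trivalent tree with $2n - 3$ edges, three free Fourier parameters $P^e_C, P^e_G, P^e_T$ per edge once $P^e_A = 1$ is imposed, and finite fibres of the parameterization, so that the dimension of $W$ equals the number of free parameters.

```latex
\begin{align*}
\dim W &= 3\,(2n - 3) = 6n - 9, \\
\operatorname{codim} W &= \bigl(4^{\,n-1} - 1\bigr) - (6n - 9) = 4^{\,n-1} - 6n + 8, \\
n = 3:&\quad 16 - 18 + 8 = 6 \qquad (\text{ambient } \mathbb{C}^{15},\ \dim W = 9), \\
n = 4:&\quad 64 - 24 + 8 = 48 \qquad (\text{ambient } \mathbb{C}^{63},\ \dim W = 15).
\end{align*}
```

For n = 4 this reproduces the codimension 48 quoted above, against 8002 minimal generators.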

Case n = 3

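The parameterization $\varphi$ is:

\[
\varphi : \mathbb{C}^9 \longrightarrow \mathbb{C}^{15}, \qquad
\bigl((1, P^1_C, P^1_G, P^1_T), (1, P^2_C, P^2_G, P^2_T), (1, P^3_C, P^3_G, P^3_T)\bigr) \mapsto \bigl(q_{xyz} = P^1_x P^2_y P^3_z\bigr)_{\{x + y + z = 0\}}
\]

Let $H = (\mathbb{Z}_2 \times \mathbb{Z}_2, *)$ and let $(\varepsilon, \delta) \in H$ act on $\mathbb{C}^9$ sending $(P^1, P^2, P^3)$ to

\[
\bigl((1, \varepsilon P^1_C, \delta P^1_G, \varepsilon\delta P^1_T), (1, \varepsilon P^2_C, \delta P^2_G, \varepsilon\delta P^2_T), (1, \varepsilon P^3_C, \delta P^3_G, \varepsilon\delta P^3_T)\bigr).
\]

E.g. for $(\varepsilon, \delta) = (1, -1)$, in probability parameters the group action permutes A ↔ C and G ↔ T.

Proposition
The Kimura variety $W = \operatorname{im}(\varphi)$ on $T_3$ is the affine GIT quotient $\mathbb{C}^9 /\!/ H$.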
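That $\varphi$ is constant on $H$-orbits, which is what lets it factor through the quotient, can be checked directly. In this sketch nucleotides are encoded as 0..3 (A, C, G, T) so that $x + y + z = 0$ becomes a bitwise XOR condition, and $\varepsilon, \delta$ are taken in $\{1, -1\}$. The parameter values are made up, and the function names are assumptions.

```python
from itertools import product


def phi(P1, P2, P3):
    """q_xyz = P1[x] * P2[y] * P3[z] for x + y + z = 0, omitting q_AAA = 1."""
    return {(x, y, z): P1[x] * P2[y] * P3[z]
            for x, y, z in product(range(4), repeat=3)
            if x ^ y ^ z == 0 and (x, y, z) != (0, 0, 0)}


def act(eps, delta, P):
    """Action of (eps, delta) on one edge: (1, P_C, P_G, P_T) -> signed copy."""
    return (P[0], eps * P[1], delta * P[2], eps * delta * P[3])


# Made-up Fourier parameters (1, P_C, P_G, P_T) for the three edges.
P1 = (1.0, 0.7, 0.6, 0.8)
P2 = (1.0, 0.9, 0.5, 0.4)
P3 = (1.0, 0.3, 0.2, 0.95)
q = phi(P1, P2, P3)
max_change = max(
    abs(phi(act(e, d, P1), act(e, d, P2), act(e, d, P3))[key] - q[key])
    for e, d in product((1, -1), repeat=2)
    for key in q)
```

The invariance holds because the sign attached to each index is a character of $\mathbb{Z}_2 \times \mathbb{Z}_2$, so the three signs in any coordinate with $x + y + z = 0$ multiply to 1.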

Arbitrary n

T unrooted n-leaved tree (2n − 3 edges, n − 2 interior nodes)

Extending the action of H to the n − 2 interior nodes of T, we have

Theorem
The Kimura variety W is isomorphic to the affine GIT quotient

\[
(\mathbb{C}^3)^{2n-3} /\!/ H^{n-2}
\]

Corollary
$W = \operatorname{im}(\varphi)$, no closure needed.

$|\varphi^{-1}(q)| \leq 4^{n-2}$ and there is just one preimage with biological meaning, $\forall q \in W$.

We are able to determine the singular points. In particular, biologically meaningful points $q \in W_+ := \varphi\bigl(\prod \Delta^3_+\bigr)$ are not singular.


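In Fourier parameters:

$\Delta^3_+ =$ [figure]

In probability parameters this is transformed into:

$$S = \begin{pmatrix} a & b & c & d \\ b & a & d & c \\ c & d & a & b \\ d & c & b & a \end{pmatrix}$$

(rows and columns indexed by A, C, G, T), with

$$a + b + c + d = 1, \qquad a + b - c - d > 0, \qquad a - b + c - d > 0, \qquad a - b - c + d > 0.$$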


Since $a + b + c + d = 1$, the three inequalities are equivalent to

$$a + b > 1/2, \qquad a + c > 1/2, \qquad a + d > 1/2.$$
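The link between the two parameter sets can be made concrete. The sketch below assumes that the Fourier parameters of $S$ are the four linear forms appearing in the conditions above (labelled $P_A, P_C, P_G, P_T$ here, a labelling chosen for this illustration), checks that they are eigenvalues of $S$ for the sign vectors of $\mathbb{Z}_2 \times \mathbb{Z}_2$, and confirms on a rational grid that the two forms of the inequalities agree.

```python
from fractions import Fraction
from itertools import product


def kimura_matrix(a, b, c, d):
    """Kimura 3-parameter substitution matrix, rows/columns ordered A, C, G, T."""
    return [[a, b, c, d], [b, a, d, c], [c, d, a, b], [d, c, b, a]]


def fourier(a, b, c, d):
    """Assumed Fourier parameters (P_A, P_C, P_G, P_T) of the matrix."""
    return (a + b + c + d, a + b - c - d, a - b + c - d, a - b - c + d)


# Sign vectors (characters) of Z2 x Z2, in the same order as fourier().
CHARACTERS = [(1, 1, 1, 1), (1, 1, -1, -1), (1, -1, 1, -1), (1, -1, -1, 1)]


def is_eigenpair(S, v, lam):
    """True if S v = lam v."""
    return all(sum(S[i][j] * v[j] for j in range(4)) == lam * v[i] for i in range(4))


S = kimura_matrix(Fraction(7, 10), Fraction(1, 10), Fraction(3, 20), Fraction(1, 20))
P = fourier(Fraction(7, 10), Fraction(1, 10), Fraction(3, 20), Fraction(1, 20))
eigen_ok = all(is_eigenpair(S, v, lam) for v, lam in zip(CHARACTERS, P))

# On a grid of stochastic rows, positivity of P_C, P_G, P_T is the same
# as a + b, a + c, a + d each exceeding 1/2.
half = Fraction(1, 2)
grid = [Fraction(k, 20) for k in range(21)]
equivalent = True
for x, y, z in product(grid, repeat=3):
    w = 1 - x - y - z
    if w < 0:
        continue
    _, pc, pg, pt = fourier(x, y, z, w)
    equivalent &= (pc > 0) == (x + y > half)
    equivalent &= (pg > 0) == (x + z > half)
    equivalent &= (pt > 0) == (x + w > half)

print(P, eigen_ok, equivalent)
```

Exact rational arithmetic is used so that the comparisons with 0 and 1/2 are not affected by rounding.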


Local complete intersection

Case n = 3

(Sturmfels–Sullivant ’05): $I(V)$ is minimally generated by 16 cubics and 18 quartics.

Lemma
The following six quartics

$$q_{AAA}q_{ATT}q_{TCG}q_{TGC} - q_{ACC}q_{AGG}q_{TAT}q_{TTA},$$
$$q_{CCA}q_{CTG}q_{TAT}q_{TGC} - q_{CAC}q_{CGT}q_{TCG}q_{TTA},$$
$$q_{AGG}q_{ATT}q_{CAC}q_{CCA} - q_{AAA}q_{ACC}q_{CGT}q_{CTG},$$
$$q_{ACC}q_{ATT}q_{GAG}q_{GGA} - q_{AAA}q_{AGG}q_{GCT}q_{GTC},$$
$$q_{CAC}q_{CTG}q_{GCT}q_{GGA} - q_{CCA}q_{CGT}q_{GAG}q_{GTC},$$
$$q_{GGA}q_{GTC}q_{TAT}q_{TCG} - q_{GAG}q_{GCT}q_{TGC}q_{TTA}$$

generate a local complete intersection that defines $W$ at each point $q \in W_+$.

This set of generators does not depend on the point $q$.
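That the six quartics vanish on the image of $\varphi$ can be confirmed by substitution. The sketch below is only a numerical check at one arbitrary integer parameter point, not a proof; the encoding of the binomials as pairs of strings and the helper names are choices made for this illustration.

```python
IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

# The six quartics of the lemma, each as (plus monomial, minus monomial).
QUARTICS = [
    ("AAA ATT TCG TGC", "ACC AGG TAT TTA"),
    ("CCA CTG TAT TGC", "CAC CGT TCG TTA"),
    ("AGG ATT CAC CCA", "AAA ACC CGT CTG"),
    ("ACC ATT GAG GGA", "AAA AGG GCT GTC"),
    ("CAC CTG GCT GGA", "CCA CGT GAG GTC"),
    ("GGA GTC TAT TCG", "GAG GCT TGC TTA"),
]


def q_coord(word, params):
    """q_xyz = P1_x * P2_y * P3_z for a three-letter index such as 'TCG'."""
    value = 1
    for leaf, letter in enumerate(word):
        value *= params[leaf][IDX[letter]]
    return value


def monomial(words, params):
    """Product of the q coordinates named in a space-separated string."""
    value = 1
    for word in words.split():
        value *= q_coord(word, params)
    return value


def evaluate(binomial, params):
    """Value of (plus monomial) - (minus monomial) at phi(params)."""
    plus, minus = binomial
    return monomial(plus, params) - monomial(minus, params)


# Arbitrary parameter vectors (1, P_C, P_G, P_T), one per leaf.
params = [(1, 2, 3, 5), (1, 7, 11, 13), (1, 17, 19, 23)]
values = [evaluate(f, params) for f in QUARTICS]
print(values)
```

Each binomial vanishes because, leaf by leaf, the two monomials use the same multiset of letters, so both sides expand to the same product of parameters.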


Proof.

1. $J = (f_1, \dots, f_6)$ defines a variety $X$ containing $W$, and both have the same dimension.
2. The points $q \in W_+$ are smooth points of $X$ and $W$.
3. Both varieties coincide locally.
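One way to make step 2 tangible is the Jacobian criterion: if the $6 \times 15$ Jacobian matrix of the six quartics has rank 6 at $q$, then $X$ is smooth of dimension $15 - 6 = 9$ there. The sketch below checks this at a single sample point. It is not the authors' argument, and it assumes that points of $W_+$ have all Fourier parameters positive; the sample parameter values are hypothetical.

```python
from fractions import Fraction

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}
QUARTICS = [
    ("AAA ATT TCG TGC", "ACC AGG TAT TTA"),
    ("CCA CTG TAT TGC", "CAC CGT TCG TTA"),
    ("AGG ATT CAC CCA", "AAA ACC CGT CTG"),
    ("ACC ATT GAG GGA", "AAA AGG GCT GTC"),
    ("CAC CTG GCT GGA", "CCA CGT GAG GTC"),
    ("GGA GTC TAT TCG", "GAG GCT TGC TTA"),
]


def q_coord(word, params):
    """q_xyz = P1_x * P2_y * P3_z for a three-letter index."""
    value = Fraction(1)
    for leaf, letter in enumerate(word):
        value *= params[leaf][IDX[letter]]
    return value


def jacobian_row(binomial, variables, params):
    """Partial derivatives of (plus - minus) with respect to each variable.
    Each variable occurs with exponent at most 1 in these monomials."""
    row = []
    for var in variables:
        entry = Fraction(0)
        for sign, words in zip((1, -1), binomial):
            factors = words.split()
            if var in factors:
                rest = Fraction(1)
                for w in factors:
                    if w != var:
                        rest *= q_coord(w, params)
                entry += sign * rest
        row.append(entry)
    return row


def rank(matrix):
    """Rank by exact Gaussian elimination over the rationals."""
    rows = [list(r) for r in matrix]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            factor = rows[i][col] / rows[rk][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk


# Affine chart q_AAA = 1: the other 15 coordinates are the variables.
variables = sorted({w for f in QUARTICS for m in f for w in m.split()} - {"AAA"})

# Hypothetical sample point with positive Fourier parameters on every leaf.
params = [
    (Fraction(1), Fraction(3, 5), Fraction(7, 10), Fraction(1, 2)),
    (Fraction(1), Fraction(4, 5), Fraction(2, 3), Fraction(3, 4)),
    (Fraction(1), Fraction(1, 2), Fraction(4, 5), Fraction(2, 5)),
]
jac = [jacobian_row(f, variables, params) for f in QUARTICS]
jac_rank = rank(jac)
print(len(variables), jac_rank)
```

At a point where every coordinate is nonzero, each Jacobian row is the exponent-difference vector of its binomial rescaled column by column, so the rank equals the rank of the $6 \times 15$ exponent matrix and does not change from one such point to another.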


Local complete intersection

Arbitrary n

quartics: $J_3 \cup J_{n-1}$

quadrics: $Q$, non-redundant $2 \times 2$ minors from flattening

Theorem
$J_3 \cup J_{n-1} \cup Q$ generates a local complete intersection that defines $W$ at the biologically meaningful points $q \in W_+$.


New results on simulated data

4 leaves

Instead of all generators, we use 48 polynomials that generate the ideal locally.

[Figure panels: Length 100, Length 500, Length 1000]


(Eriksson–Yao ’07) Machine learning approach: 52 invariants that perform better on this parameter space.
