copyright n. friedman, m. ninio. i. pe’er, and t. pupko. 2001recomb, april 2001 structural em for...

24
Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001 Structural EM for Phylogentic Inference Nir Friedman Computer Science & Engineering Hebrew University Matan Ninio Computer Science & Engineering Hebrew University Itsik Pe’er Computer Science Tel-Aviv University Tal Pupko Inst. of Statistical Mathematics Tokyo

Post on 21-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Structural EM for Phylogentic Inference

Nir Friedman Computer Science & Engineering

Hebrew University

Matan Ninio Computer Science & Engineering

Hebrew University

Itsik Pe’er Computer Science

Tel-Aviv University

Tal Pupko Inst. of Statistical Mathematics

Tokyo

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Introduction Phylogentic inference:

“reconstruction of the tree of evolution based on DNA/Protein sequences of current day species”

Maximum Likelihood inference Model evolution as a stochastic process Use likelihood of observed sequences to

evaluate different trees Computational task

Construct the maximum likelihood tree We describe a new procedure that use a variant of

EM to efficiently learn better trees

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Evolution of a Character over Time

Probability of change: Pab(t)

Assumptions: Lack of memory:

Reversibility: Exist stationary probabilities

{Pa} s.t.

A

G T

C

b

cbbaca tPtPttP )'()()'(

)()( tPPtPP abbbaa

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Phylogenetic Tree

Topology: bifurcating Leaves - 1…N Internal nodes N+1…2N-2

Lengths t = {ti,j} for each branch (i,j) Phylogenetic tree = (Topology, Lengths)

leaf

branch internal node

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Joint Distribution

Random variables X1... X2N-2 for all nodes. x[1...N] - Observed values of X[1...N].

Joint distribution:

Marginal distribution:

Computation of marginal distribution:by dynamic programming

Tji x

jixx

ixN

j

ji

i p

tpptTxP

),(

,

]22,1[

)(),|(

]22,...,1[ ),(

,

],,1[

)(),|(

NN j

ji

ix Tji x

jixx

ixN p

tpptTxP

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Maximum Likelihood ReconstructionObserved data: (D ) N sequences of length M Each position: an independent sample from the

marginal distribution over N current day taxa

Likelihood: Given a tree (T,t) :

Goal: Find a tree (T,t) that maximizes l(T,t:D) .

M

mN tTxP

tTDPDtTl

),|(log

),|(log):,(

],,1[

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Current Approaches

Perform search over possible topologiesT1 T3

T4

T2

Tn

Parametric optimization

(EM)

Parameter space

Local Maxima

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Computational Problem

Such procedures are computationally expansive! Computation of optimal parameters, per candidate,

requires non-trivial optimization step. Spend non-negligible computation on a candidate,

even if it is a low scoring one. In practice, such learning procedures can only

consider small sets of candidate structures

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Structural EM

Idea: Use parameters found for current topology to help evaluate new topologies.

Outline: Perform search in (T, t) space. Use EM-like iterations:

E-step: use current solution to compute expected sufficient statistics for all topologies

M-step: select new topology based on these expected sufficient statistics

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

The Complete-Data ScenarioSuppose we observe H, the ancestral sequences.

Tjijiji

Tji m mx

jimxmx

i mmx

mN

complete

StFconst

p

tpp

tTmxPHDtTl

j

ji

i

),(,,

),(

,

22...1

),(

)(loglog

),|(log,:,

),(max ,,, , jijitji StFwji

Tji

jiw),(

,

Define:

Find: topology T that maximizes

Si,j is a matrix of # of co-occurrences for each pair (a,b) in the taxa i,jF is a linear function of Si,j

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Expected Likelihood

Start with a tree (T0,t0) Compute

Formal justification: Define:

Theorem:

Consequence: improvement in expected score improvement in likelihood

m

mN

mj

miji tTxbXaXPbaSE ),,|,()],([ 00

],,1[),(

Tjijiji

complete

constSEtF

tTtTHDlEtTQ

),(,,

00

])[,(

],|),:,([),(

),:(),:(),(),( 0000 tTDltTDltTQtTQ

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Algorithm Outline

Original Tree (T0,t0)

Unlike standard EM for trees, we compute all possible pairwise statistics

Time: O(N2M)

Compute: ],,|),([ 00),( tTDbaSE ji

])[,(max ,, jitji SEtFw Weights:

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Pairwise weights

This stage also computes the branch length for each pair (i,j)

Algorithm Outline

Compute: ],,|),([ 00),( tTDbaSE ji

])[,(max ,, jitji SEtFw Weights:

Tji

jiT wT),(

,maxarg'Find:

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Max. Spanning Tree

Fast greedy procedure to find tree

By construction:Q(T’,t’) Q(T0,t0)

Thus, l(T’,t’) l(T0,t0)

Algorithm Outline

Compute: ],,|),([ 00),( tTDbaSE ji

])[,(max ,, jitji SEtFw Weights:

Tji

jiT wT),(

,maxarg'Find:

Construct bifurcation T1

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Fix Tree

Remove redundant nodesAdd nodes to break large degree

This operation preserves likelihood l(T1,t’) =l(T’,t’) l(T0,t0)

Algorithm Outline

Compute: ],,|),([ 00),( tTDbaSE ji

Tji

jiT wT),(

,maxarg'Find:

])[,(max ,, jitji SEtFw Weights:

Construct bifurcation T1

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

New TreeThm: l(T1,t1) l(T0,t0)

Algorithm Outline

Compute: ],,|),([ 00),( tTDbaSE ji

Construct bifurcation T1

Tji

jiT wT),(

,maxarg'Find:

])[,(max ,, jitji SEtFw Weights:

These steps are then repeated until convergence

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Evaluation

Comparison to MOLPHY (PROTML):

Evaluation on Synthetic data sets

Sampled from a tree we generated Allows us to control # taxa and #positions Can compare to “true” generating tree

Real-life data

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Number of Positions (48 taxa)

Number of Positions

10 100 1000-0.5

0

0.5

1

1.5

2SEMPHYMOLPHYOriginalOriginal(no training)

Log-

prob

abili

ty (

per

posi

tion)

re

lativ

e to

orig

inal

-12

-10

-8

-6

-4

-2

0

10 100 1000

Performance on test datarelative to original model

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

0 20 40 60 80 100

SEMPHYMOLPHYOriginalOriginal(no training)

Number of taxa (100 pos)

Number of Taxa

Log-

prob

abili

ty (

per

posi

tion)

re

lativ

e to

orig

inal

-1.8

-1.6

-1.4

-1.2

-1

-0.8

-0.6

-0.4

-0.2

0

0 20 40 60 80 100

Performance on test datarelative to original model

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Run times

0

200

400

600

800

1000

1200

1400

1600

1800

2000

0 10 20 30 40 50 60 70 80 90 100

SEMPHYMOLPHY

Tim

e in

sec

onds

Number of taxa

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Real life data

Lysozyme Mitochondrial

# taxa 43 34

# pos 122 3,578

MOLPHY likelihood

-2,916.2 -74,227.9

SEMPHY likelihood

-2,892.1 -70,533.5

Diff per position

0.19 1.03

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

Discussion New algorithmic approach for optimizing the

likelihood of models SEMPHY: an implementation for protein sequences

Incorporates standard models for Pab(t) Early results shows that it outperforms current

programs for ML reconstruction In terms of running time & solution quality

Work in progress Escaping “local” maxima More elaborate models of evolution

Variable rate Co-evolution

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

-0.12

-0.1

-0.08

-0.06

-0.04

-0.02

0

0.02

10 20 30 40 50 60 70 80 90 100

log-

likel

ihoo

d re

lativ

e to

optim

ized

orig

inal

Number of Taxa

SEMPHYAnneal SEMPHY

Preliminary Results: Annealed Structural EM

Original

Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001 RECOMB, April 2001

THE END