Advanced programming 236512
Algorithms for reconstructing phylogenetic trees, spring 2006

Lecturer: Shlomo Moran, Taub 639, tel 4363
TA: Ilan Gronau, Taub 700, tel 4894
Website: http://webcourse.cs.technion.ac.il/236512/

Post on 20-Dec-2015



2

Evolution

Evolution of new organisms is driven by:

• Diversity: different individuals carry different variants of the same basic blueprint.

• Mutations: the DNA sequence can be changed by single base changes, deletion/insertion of DNA segments, etc.

• Selection bias

3

The Tree of Life

Source: Alberts et al.

4

Primate evolution

A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of current day species; it is also called a phylogenetic tree.

5

Theory of Evolution

Basic idea:

• Speciation events lead to the creation of different species.

• Speciation is caused by physical separation into groups where different genetic variants become dominant.

• Any two species share a (possibly distant) common ancestor.

6

Phylogenetic trees

• Leaves: current day species (or taxa, plural of taxon)
• Internal vertices: hypothetical common ancestors
• Edge lengths: “time” from one speciation to the next

[Figure: example tree with leaves Aardvark, Bison, Chimp, Dog, Elephant.]

7

Types of Trees

A natural model to consider is that of rooted trees

[Figure: a rooted tree; the root is the common ancestor.]

8

Types of trees

An unrooted tree represents the same phylogeny without the root node.

Usually, data from current day species does not distinguish between different placements of the root.

9

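Rooted versus unrooted trees

[Figure: an unrooted tree with edges labeled a, b, c, shown next to the rooted trees Tree a, Tree b, and Tree c. The unrooted tree represents the three rooted trees.]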

10

Positioning Roots in Unrooted Trees

We can estimate the position of the root by introducing an outgroup: a set of species that are definitely distant from all the species of interest.

[Figure: a tree over Aardvark, Bison, Chimp, Dog, Elephant, with Falcon as the outgroup and the proposed root marked.]

11

Morphological vs. Molecular

Classical phylogenetic analysis uses morphological features: number of legs, lengths of legs, etc.

Modern biological methods allow the use of molecular features:

• Gene sequences
• Protein sequences

Analysis is based on homologous sequences (e.g., globins) in different species.

12

Rat QEPGGLVVPPTDA

Rabbit QEPGGMVVPPTDA

Gorilla QEPGGLVVPPTDA

Cat REPGGLVVPPTEG

From sequences to a phylogenetic tree

There are many possible types of sequences to use (e.g. Mitochondrial vs Nuclear proteins).

13

Type of Data

Distance-based (the project focuses on this method):
• Input is a matrix of distances between species.
• A distance can be the fraction of residues on which two species disagree, or the alignment score between them, or …

Character-based:
• Examine each character (e.g., residue) separately.
• Not covered in this project.
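As a small illustration of the "fraction of residues they disagree on" distance, the sketch below builds a distance matrix from the four protein fragments shown on the earlier slide. The function names `mismatch_fraction` and `distance_matrix` are hypothetical, and the sketch assumes the sequences are already aligned to equal length.

```python
from itertools import combinations

# Sequences copied from the "From sequences to a phylogenetic tree" slide.
sequences = {
    "Rat": "QEPGGLVVPPTDA",
    "Rabbit": "QEPGGMVVPPTDA",
    "Gorilla": "QEPGGLVVPPTDA",
    "Cat": "REPGGLVVPPTEG",
}

def mismatch_fraction(s, t):
    """Fraction of aligned positions at which s and t differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t)) / len(s)

def distance_matrix(seqs):
    """Symmetric dict-of-dicts matrix of pairwise mismatch fractions."""
    names = list(seqs)
    d = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d[a][b] = d[b][a] = mismatch_fraction(seqs[a], seqs[b])
    return d

D = distance_matrix(sequences)
for name, row in D.items():
    print(name, [round(row[other], 3) for other in D])
```

On these short fragments Rat and Gorilla are identical, so their distance is 0; real analyses would use longer sequences and usually a corrected distance rather than the raw mismatch fraction.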

14

Constructing trees from distances:

Transform differences between species to numerical distances

Find a weighted tree that realizes/approximates the distances between the species.

The task is: Given a set of species (leaves in a supposed tree), and distances between them, construct a phylogeny which best “fits” the distances.

Note (Shlomo, 12.3.03): Before the construction, the four points theorem should be inserted (from a separate file), replacing its previous proof in lecture 12. It may also be worth dropping UPGMA. This note of course also affects lecture 12.

15

Exact solution: Additive sets

Given a set S of n objects with an n×n distance matrix:
• d(i,i) = 0, and for i ≠ j, d(i,j) > 0.
• d(i,j) = d(j,i).
• For all i, j, k it holds that d(i,k) ≤ d(i,j) + d(j,k).

Can we construct a weighted tree which realizes these distances?

16

There is always a tree for 3 objects

For n=3: There is always a (unique) tree with one internal node.

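$$d(i,j) = a + b, \qquad d(i,k) = a + c, \qquad d(j,k) = b + c$$

[Figure: a tree with leaves i, j, k joined to a single internal node v by edges of weights a, b, c respectively.]

      i     j     k
i     0    a+b   a+c
j           0    b+c
k                 0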

Distance metrics on 4 objects may not have a tree.
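For the three-object case, the edge weights a, b, c follow by solving the three equations d(i,j) = a+b, d(i,k) = a+c, d(j,k) = b+c. A minimal sketch (the function name `three_leaf_tree` is hypothetical):

```python
def three_leaf_tree(d_ij, d_ik, d_jk):
    """Edge weights (a, b, c) of the star tree on leaves i, j, k.

    Solves d_ij = a + b, d_ik = a + c, d_jk = b + c.
    The triangle inequality guarantees that all three weights are nonnegative.
    """
    a = (d_ij + d_ik - d_jk) / 2  # weight of the edge from i to v
    b = (d_ij + d_jk - d_ik) / 2  # weight of the edge from j to v
    c = (d_ik + d_jk - d_ij) / 2  # weight of the edge from k to v
    return a, b, c

print(three_leaf_tree(5, 6, 7))
```

With distances 5, 6, 7 the weights are 2, 3, 4, and indeed 2+3 = 5, 2+4 = 6, 3+4 = 7.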

17

The Four Points Condition

Definition: A distance metric on n objects satisfies the four points condition iff any subset of four objects can be labeled i,j,k,l so that:

d(i,k) + d(j,l) = d(i,l) +d(k,j) ≥ d(i,j) + d(k,l)

[Figure: the four objects i, j, k, l.]

Theorem: A distance metric is additive ⇔ it satisfies the four points condition.

Note: The four points condition implies an O(n^4) algorithm, which is not very efficient.
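The O(n^4) test mentioned in the note can be sketched directly: for every four objects, compute the three pairwise sums and check that the two largest are equal. The function name `satisfies_four_points` and the tolerance parameter are assumptions made for this illustration.

```python
from itertools import combinations

def satisfies_four_points(d, tol=1e-9):
    """Check the four points condition on a symmetric matrix d (list of lists).

    For each quadruple, the two largest of the three sums
    d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k) must be equal.
    """
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Distances realized by a tree: leaves 0,1 on one side, 2,3 on the other,
# all leaf edges of weight 1 and an internal edge of weight 2.
additive = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]

# Shortest-path distances on a 4-cycle: a metric, but not realizable by a tree.
cycle = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]

print(satisfies_four_points(additive), satisfies_four_points(cycle))
```

The second matrix is an example of a distance metric on 4 objects that has no tree.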

18

Constructing additive trees: The neighbor joining problem

Let i, j be neighboring leaves in a tree, let v be their parent, and let k be any other vertex.

The formula

$$d(k,v) = \frac{1}{2}\left[d(k,i) + d(k,j) - d(i,j)\right]$$

shows that we can compute the distances of v to all other leaves.

[Figure: leaves i and j attached to their parent v; the distance d(k,v) from another vertex k to v is marked.]

19

Constructing additive trees: The neighbor joining problem

This suggests the following method to construct a tree from a distance matrix:

1. Find neighboring leaves i, j in the tree.

2. Replace i, j by their parent v and recursively construct a tree T for the smaller set.

3. Add i, j as children of v in T.

20

Neighbor Finding

How can we find from distances alone a pair of neighboring leaves (called also cherries)?

Closest vertices aren’t necessarily neighboring leaves.

AB

CD

21

Neighbor Finding: Saitou&Nei method

Definitions:

For a leaf i, let $r_i = \sum_{u \text{ is a leaf}} d(i,u)$.

For leaves i, j: $Q(i,j) = (n-2)\,d(i,j) - r_i - r_j$.

Theorem (Saitou&Nei): Assume all internal edge weights are positive. If Q(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.

22

S&N Neighbor Joining Algorithm

• If n = 3, return the tree of three vertices.
• Compute Q(i,j) for all i, j.
• Choose i, j such that Q(i,j) is minimal.
• Create a new vertex v, and set
  $d(i,v) = \frac{1}{2}\left[d(i,j) + d(i,r) - d(j,r)\right]$ (for some r ≠ i, j)
  $d(j,v) = d(i,j) - d(i,v)$ // d(i,v) or d(j,v) could be 0
  for each vertex k ≠ i, j: $d(v,k) = \frac{1}{2}\left[d(i,k) + d(j,k) - d(i,j)\right]$
• Remove i, j, and add v to the set of objects.
• Recursively construct a tree on the smaller set, then add i, j as children of v, at distances d(i,v) and d(j,v).

[Figure: leaves i and j joined to the new vertex v; the distance d(k,v) to another vertex k is marked.]
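The steps of the S&N algorithm can be sketched as follows. This is an illustrative, unoptimized version under stated assumptions: it uses a loop instead of recursion, names internal vertices `v0`, `v1`, ... (a hypothetical convention), takes the first remaining vertex as the reference r, and returns the tree as a list of (parent, child, length) edges. The example matrix is built by hand from a small tree and is not data from the lecture.

```python
def neighbor_joining(d, labels):
    """Return (parent, child, length) edges of the tree built from matrix d.

    d is a symmetric list-of-lists matrix with zero diagonal;
    labels are the leaf names (they must not look like 'v0', 'v1', ...).
    """
    dist = {a: {b: d[i][j] for j, b in enumerate(labels)}
            for i, a in enumerate(labels)}
    edges, counter = [], 0
    while len(dist) > 3:
        nodes = list(dist)
        n = len(nodes)
        r = {a: sum(dist[a].values()) for a in nodes}  # r_i = sum_u d(i,u)
        # Choose the pair minimizing Q(i,j) = (n-2) d(i,j) - r_i - r_j.
        i, j = min(
            ((a, b) for idx, a in enumerate(nodes) for b in nodes[idx + 1:]),
            key=lambda p: (n - 2) * dist[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        v = f"v{counter}"
        counter += 1
        ref = next(k for k in nodes if k not in (i, j))  # "some r" of the slide
        d_iv = 0.5 * (dist[i][j] + dist[i][ref] - dist[j][ref])
        d_jv = dist[i][j] - d_iv
        edges += [(v, i, d_iv), (v, j, d_jv)]
        # Distances from the new vertex v to every remaining vertex k.
        new_row = {k: 0.5 * (dist[i][k] + dist[j][k] - dist[i][j])
                   for k in nodes if k not in (i, j)}
        del dist[i], dist[j]
        for row in dist.values():
            del row[i], row[j]
        for k, val in new_row.items():
            dist[k][v] = val
        new_row[v] = 0.0
        dist[v] = new_row
    # Base case: three vertices joined to one internal vertex.
    a, b, c = list(dist)
    center = f"v{counter}"
    edges.append((center, a, 0.5 * (dist[a][b] + dist[a][c] - dist[b][c])))
    edges.append((center, b, 0.5 * (dist[a][b] + dist[b][c] - dist[a][c])))
    edges.append((center, c, 0.5 * (dist[a][c] + dist[b][c] - dist[a][b])))
    return edges

# Additive example: leaf edges A:2, B:3, C:4, D:1, E:5; internal edges 1 and 2.
labels = ["A", "B", "C", "D", "E"]
d = [
    [0, 5, 7, 6, 10],
    [5, 0, 8, 7, 11],
    [7, 8, 0, 7, 11],
    [6, 7, 7, 0, 6],
    [10, 11, 11, 6, 0],
]
nj_edges = neighbor_joining(d, labels)
for edge in nj_edges:
    print(edge)
```

On this additive input the sketch recovers the original leaf edge weights. Widely used NJ implementations often compute d(i,v) from the row sums instead of a single reference vertex; the two choices agree on additive data.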

23

Complexity of S&N Neighbor Joining Algorithm

Initialization: Θ(n^2) to compute r_i and Q(i,j) for all i, j ∈ L.

Each iteration:
• O(n^2) to find the minimal Q(i,j).
• O(n) to compute {D(v,k) : k ∈ L} for the new node v, and to update the matrix.
• O(n^2) to update the values Q(i,j).

Total of O(n^3).

[Figure: leaves i and j, another leaf k, and the distance D(v,k).]

24

Some remarks on S&N Neighbor Joining Algorithm

Applicable to matrices which are not additive

Known to work well in practice.

The algorithm and its variants are the most widely used

distance-based algorithms today.

Next we present a more efficient Neighbor Joining

algorithm, which is based on LCA distances.

25

Least Common Ancestor distances

Definition: Given a weighted tree T and a specific vertex r in it:

d_T(r; i,j) = distance in T from r to path(i,j).

d_T(r; i,i) = distance in T from r to i.

[Figure: a weighted tree T with root r and leaves A, B, C, D, E, annotated with its edge weights and LCA distances; for example d_T(r; A,D) = 3 and d_T(r; A,A) = 7.]

26

Least Common Ancestor distances

The distances dT(r;i,j) can be presented by a matrix:

      A   B   C   D   E
A     7   0   0   3   5
B         8   5   0   0
C             7   0   0
D                 5   3
E                     6

[Figure: the tree T from the previous slide, with root r and leaves A, B, C, D, E.]
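An LCA matrix like the one above can also be computed from ordinary pairwise distances: in a tree, the distance from r to path(i,j) equals ½[d(r,i) + d(r,j) − d(i,j)]. The sketch below applies this identity. The function name `lca_matrix` is hypothetical, and the pairwise distances are an assumption: they were chosen by hand to be consistent with the LCA matrix above and are not data from the lecture.

```python
def lca_matrix(d, r):
    """LCA distances relative to root r, from a dict-of-dicts distance matrix d.

    L[i][j] = (d[r][i] + d[r][j] - d[i][j]) / 2 for all i, j other than r.
    """
    others = [x for x in d if x != r]
    return {i: {j: (d[r][i] + d[r][j] - d[i][j]) / 2 for j in others}
            for i in others}

# Hand-made tree distances (assumed), including the distances from the root r.
pairs = {
    ("r", "A"): 7, ("r", "B"): 8, ("r", "C"): 7, ("r", "D"): 5, ("r", "E"): 6,
    ("A", "B"): 15, ("A", "C"): 14, ("A", "D"): 6, ("A", "E"): 3,
    ("B", "C"): 5, ("B", "D"): 13, ("B", "E"): 14,
    ("C", "D"): 12, ("C", "E"): 13, ("D", "E"): 5,
}
names = ["r", "A", "B", "C", "D", "E"]
dist = {a: {b: 0 for b in names} for a in names}
for (a, b), w in pairs.items():
    dist[a][b] = dist[b][a] = w

L = lca_matrix(dist, "r")
for name, row in L.items():
    print(name, [row[other] for other in L])
```

The diagonal entry L[i][i] reduces to d(r,i), matching the definition d_T(r; i,i).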

27

LCA Matrices

Definition: A symmetric nonnegative matrix L is an LCA matrix iff

1. For each i: L(i,i) = max_j {L(i,j)}

2. It satisfies the “3 points condition”:

for each 3 distinct indices i, j, k ,

L(i,j) ≥ min {L(i,k), L(j,k)}

“the smallest value appears twice”

j k

i 11 9 6

j 8 6
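The two conditions of the definition can be checked mechanically. The sketch below is a straightforward O(n^3) test; the function name `is_lca_matrix` is hypothetical, and the example is the 5×5 matrix over A to E from the earlier slide, written out as a full symmetric matrix.

```python
from itertools import permutations

def is_lca_matrix(L):
    """Check symmetry, nonnegativity, the diagonal-maximum rule and the 3 points condition."""
    n = len(L)
    for i in range(n):
        if L[i][i] != max(L[i]):  # condition 1: diagonal is the row maximum
            return False
        for j in range(n):
            if L[i][j] != L[j][i] or L[i][j] < 0:
                return False
    # Condition 2: L(i,j) >= min(L(i,k), L(j,k)) for all distinct i, j, k.
    return all(L[i][j] >= min(L[i][k], L[j][k])
               for i, j, k in permutations(range(n), 3))

lca_example = [
    [7, 0, 0, 3, 5],
    [0, 8, 5, 0, 0],
    [0, 5, 7, 0, 0],
    [3, 0, 0, 5, 3],
    [5, 0, 0, 3, 6],
]

# Raising L(A,D) to 4 breaks the 3 points condition on {A, D, E}.
broken = [row[:] for row in lca_example]
broken[0][3] = broken[3][0] = 4

print(is_lca_matrix(lca_example), is_lca_matrix(broken))
```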

28

LCA Matrices

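Example (off-diagonal entries for indices i, j, k): L(i,j) = 9, L(i,k) = 6, L(j,k) = 6; the smallest value, 6, appears twice.

Theorem: The following conditions are equivalent for an (n-1)×(n-1) matrix L:

1. L is an LCA matrix.

2. There is a weighted tree T with n leaves and a leaf r in T such that for each pair of leaves i, j ≠ r:

L(i,j) = d_T(r; i,j)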

29

LCA distances ⇒ LCA matrix

There is a weighted tree T s.t. L(i,j) = d_T(r; i,j)

⇒ L is an LCA matrix: by properties of least common ancestors in trees.

[Figure: a tree with root r and leaves i, j, k, illustrating L(k,i) = L(j,i) ≤ L(k,j).]

30

LCA matrix ⇒ LCA distances

Now we are given an LCA matrix L and need to construct a tree. The construction uses “maximal off-diagonal” entries:

L(i,j) is a “maximal off-diagonal” entry in row i if L(i,j) = max_k {L(i,k) : k ≠ i}

1 2 k

1 18 9 8 3 7

Example: L(1,2) is maximal off diagonal entry in row 1

31

Maximal off diagonal entries

Lemma: If L(i,j) is the maximal “off-diagonal” entry in both rows i and j in L, then for all k ≠ i, j: L(i,k) = L(j,k).

Proof: By the 3 points condition on {i,j,k}.

i j k

i 18 9 8 3 7

j 9 14 8 3 7

Example for i=1, j=2

32

LCA matrix ⇒ LCA distances: Proof by induction

We now prove by induction on n: L is an (n-1)×(n-1) LCA matrix

⇒ There is a weighted tree T with a root r as in the theorem.

Basis: n = 2. L = [w]. T is a tree with a single edge of weight w.

[Figure: example with L = [4]: a single edge of weight 4 between r and i.]

34

Induction step: n ≥ 3. Let L be an LCA matrix of dimension n-1. We describe an algorithm for constructing the corresponding tree:

1. Find i,j s.t. L(i,j) is the maximal off-diagonal entry in L.

i j k

i 11 9

j 9 14

L

(In the example i=1 and j=2)

35

Induction step

2. Let L` be the matrix obtained by removing rows/columns i and j, and inserting row/column v s.t. L`(v,v) = L(i,j), and for k ≠ i, j,

L`(v,k) = L(i,k) (= L(j,k))

v k

v 9 8 3 7

L`

1 2 k

1 11 9 8 3 7

2 9 14 8 3 7

L

36

Induction Step

To show that L` is an LCA matrix we need a definition and a simple observation:

Definition: Let L be an n×n matrix, and let S ⊆ {1,...,n}. L[S] is the submatrix of L consisting of the rows and columns with indices from S.

Observation 1: If L is an LCA matrix then for every S ⊆ {1,...,n}, L[S] is also an LCA matrix.

37

Induction step

Claim: L` is an LCA matrix of dimension n-2.

Proof: Let S be all leaves except j. Then L` is obtained from L[S] as follows:
1. Change the index i to v.
2. Set L`(v,v) ← L(i,j).

By Observation 1 and the maximality of L(i,j), L` is also an LCA matrix.

v k

v 9 8 3 7

L`

1 2 k

1 11 9 8 3 7

2 9 14 8 3 7

L

38

Induction step

3. Construct a tree T` for L` (with n-1 leaves)

v k

v 9 8 3 7

v

T`L`

39

Induction step

4. Add two children to v, i and j, with appropriate edge lengths.

v

T`i j k

i 11 9 8 3 7

j 9 14 8 3 7

2 5

ij

40

Deepest LCA neighbor joining

• If n ≤ 3, return the tree of n vertices.
• Prepare a list MAX of size n, s.t. MAX(i) = maximal off-diagonal element in row i.

Recursion:
• Find i, j s.t. L(i,j) is the maximal off-diagonal entry of L.
• Make the reduction to L` as described.
• Update the list MAX (only MAX(v) needs an update!).
• Construct T` for L`.
• Add i and j as children of v.

[Figure: the tree T` with i and j attached as children of v.]
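The reduction described in the induction steps can be sketched as below. This is a simplified illustration, not the O(n^2) version: it rescans the whole matrix for the maximal off-diagonal entry instead of maintaining the list MAX, it loops instead of recursing, and it reduces all the way to the 1×1 basis of the proof. The function name `dlca_tree`, the internal vertex names `v0`, `v1`, ..., and the edge-list output are assumptions; the edge lengths L(i,i) − L(i,j) and L(j,j) − L(i,j) follow the example in step 4.

```python
def dlca_tree(L, labels, root="r"):
    """Build (parent, child, length) edges from an LCA matrix L (list of lists)."""
    M = {a: {b: L[i][j] for j, b in enumerate(labels)}
         for i, a in enumerate(labels)}
    edges, counter = [], 0
    while len(M) > 1:
        nodes = list(M)
        # Step 1: the maximal off-diagonal entry of the whole matrix.
        i, j = max(
            ((a, b) for idx, a in enumerate(nodes) for b in nodes[idx + 1:]),
            key=lambda p: M[p[0]][p[1]],
        )
        v = f"v{counter}"
        counter += 1
        depth = M[i][j]  # distance from the root to the new parent v
        # Step 4: i and j hang below v at their depth minus the depth of v.
        edges.append((v, i, M[i][i] - depth))
        edges.append((v, j, M[j][j] - depth))
        # Step 2: replace rows/columns i, j by v, with L`(v,k) = L(i,k).
        new_row = {k: M[i][k] for k in nodes if k not in (i, j)}
        del M[i], M[j]
        for row in M.values():
            del row[i], row[j]
        for k, val in new_row.items():
            M[k][v] = val
        new_row[v] = depth  # L`(v,v) = L(i,j)
        M[v] = new_row
    (last,) = M  # basis: a 1x1 matrix [w] is a single edge of weight w
    edges.append((root, last, M[last][last]))
    return edges

# The 5x5 LCA matrix over A..E from the earlier slide, as a full symmetric matrix.
lca_input = [
    [7, 0, 0, 3, 5],
    [0, 8, 5, 0, 0],
    [0, 5, 7, 0, 0],
    [3, 0, 0, 5, 3],
    [5, 0, 0, 3, 6],
]
dlca_edges = dlca_tree(lca_input, ["A", "B", "C", "D", "E"])
for edge in dlca_edges:
    print(edge)
```

On this input the last edge has length 0, which reflects that r itself is the branching point of the two top-level subtrees.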

41

Complexity Analysis

Initialization: Constructing MAX - O(n^2).

Let Time(n) be the complexity of the algorithm, given the input matrix L and the list MAX. Time(n) is given by:

• Reducing L to L`: O(n).
• Updating MAX: O(n).
• Constructing T` from L`: Time(n-1).
• Constructing T from T`: O(1).

Time(n) ≤ Time(n-1) + O(n)

Hence Time(n) = O(n^2)
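The step from the recurrence to the quadratic bound can be spelled out by unrolling it. Here c is an assumed constant bounding the O(n) work of one reduction step, and the base case n ≤ 3 takes constant time:

```latex
\begin{aligned}
\mathrm{Time}(n) &\le \mathrm{Time}(n-1) + c\,n \\
                 &\le \mathrm{Time}(n-2) + c\,(n-1) + c\,n \\
                 &\le \dots \\
                 &\le \mathrm{Time}(3) + c \sum_{m=4}^{n} m
                  \;\le\; \mathrm{Time}(3) + c\,\frac{n(n+1)}{2} = O(n^2).
\end{aligned}
```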

42

Saitou&Nei vs. DLCA methods

DLCA, like S&N, can be implemented on noisy data (in many ways).

On exact data, the DLCA and S&N methods have the same (correct) output. They differ on noisy data (which occurs in practice).

One basic difference: Unlike the S&N method, the DLCA algorithm depends on selecting a root. Hence DLCA may produce many different trees on the same input.

Some of the projects will concentrate on this difference.

43

Incremental Reconstruction via Local Queries

Incrementally reconstructing the tree:

[Figure: two drawings of a topology over taxa a-h with internal vertices 1-6.]

When inserting a new taxon x to a given topology T, we need to find out to which

edge x should be attached.

We are allowed to ask the ‘oracle’ local queries LQ(x,v).

(x – taxon, v – internal vertex)

44

Local Queries - Motivation

Asking LQ(x,v) is equivalent to asking the topology of {x, a, b, c},

where v is the center-point of a,b,c in T.

[Figure: a topology over taxa a-h with internal vertices 1-6, next to a smaller topology over taxa a-f with internal vertices 1-4.]

Such questions can be asked directly (using likelihood) or through a pairwise

distance matrix (which will be discussed later)

45

Balancing Vertices

We’d like to minimize the number of queries required for inserting a

new taxon.

Lower bound: log_3(|E_T|) (simple adversarial argument).

Upper bound: log_2(|E_T|).

The algorithm which achieves the upper bound uses ‘balancing vertices’:

A balancing vertex in T is an internal vertex which splits T into 3 subtrees of size at most ceil(|T|/2).

Using balancing vertices in the local queries, the edge to which a new taxon should be attached can be found in ~log_2(|E_T|) queries.

46

Balancing Vertices

Every tree contains either a single balancing vertex or two adjacent

balancing vertices.

Finding a balancing vertex:

Start at some arbitrary vertex v. If v is balancing, stop.

Otherwise, continue to the vertex u, adjacent to v in the ‘heaviest’ subtree.

The algorithm traverses each edge at most once

Time complexity – O(|T|).
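The search for a balancing vertex can be sketched as below. Two assumptions are made for the illustration: the tree is given as an adjacency dict, and the size of a subtree is measured in edges (including the edge that connects it to the current vertex), with the threshold ceil(|E|/2). The function name `find_balancing_vertex` is hypothetical.

```python
from math import ceil

def find_balancing_vertex(adj, start):
    """Walk from the internal vertex `start` towards the heaviest subtree.

    adj maps each vertex to the list of its neighbours in an unrooted tree.
    """
    total_edges = sum(len(ns) for ns in adj.values()) // 2
    limit = ceil(total_edges / 2)
    # Root the tree at `start`: record parents in DFS order.
    parent, order, stack = {start: None}, [], [start]
    while stack:
        v = stack.pop()
        order.append(v)
        for u in adj[v]:
            if u != parent[v]:
                parent[u] = v
                stack.append(u)
    # below[v] = number of edges strictly below v in the rooted tree.
    below = {v: 0 for v in adj}
    for v in reversed(order):  # children are processed before their parents
        if parent[v] is not None:
            below[parent[v]] += below[v] + 1
    v = start
    while True:
        children = [u for u in adj[v] if u != parent[v]]
        heaviest = max(children, key=lambda u: below[u] + 1)
        if below[heaviest] + 1 <= limit:
            return v  # the side towards the parent is already within the limit
        v = heaviest  # each step moves down one edge, so each edge is used once

# Caterpillar with leaves a-e and internal vertices 1, 2, 3 (7 edges).
adj = {
    "a": [1], "b": [1], "c": [2], "d": [3], "e": [3],
    1: ["a", "b", 2], 2: [1, "c", 3], 3: [2, "d", "e"],
}
print(find_balancing_vertex(adj, 1), find_balancing_vertex(adj, 3))
```

Because the walk only moves into a subtree that holds more than half of the edges, the part left behind is always within the limit, so the search never turns back.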

[Figure: a topology over taxa a-h with 13 edges in T; along the search, subtrees of 11, 9, and 7 edges are marked.]

47

A Simple and Efficient Algorithm

Iteratively add taxa 1,2,…,n to the topology

When adding taxon x to topology T:

If T is trivial (consists of a single edge), attach x to that edge.

Otherwise: Find a balancing vertex v of T.

Ask query LQ(x,v).

Continue recursively on T’, the subtree corresponding to the answer of the query.

Complexity:

Adding taxon 1≤x≤n to T takes O(log(x)) queries and O(x) time.

Total query complexity: O(n·log(n))

Total time complexity: O(n^2)

48

Interesting Issues

Two major issues are raised in this area:

Queries do not always have reliable answers:
- Use confidence level for answers
- Verify the answers

Reduce running time to O(n·log(n)):
- Finding balancing vertices leads to high overhead
- Maybe we don’t have to re-compute the balancing vertices in every stage

49

Robustness to Noise in Data

Answering local queries using a distance matrix D:

We wish to assess the topology spanned by four taxa: x, a, b, c.

Observe the 4×4 submatrix of D over x, a, b, c.

[Figure: the quartet x, a, b, c with unknown topology, and the matrix D with the rows and columns of b, x, a, c highlighted.]

If D is additive then there is a labeling of the taxa by i, j, k, l s.t.:

D(i,j) + D(k,l) ≤ D(i,k) + D(j,l) = D(i,l) + D(j,k)

The configuration of the quartet is (ij ; kl), and the path separating them is of length ½(D(i,k) + D(j,l) - D(i,j) - D(k,l)).

If D is not additive we set the configuration of the quartet to (ij ; kl), where D(i,j) + D(k,l) is the minimal of the three sums.

Confidence of prediction can be estimated by the difference between the maximal and minimal sums.
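The quartet rule above can be sketched in a few lines: compute the three sums, take the pairing with the minimal sum as the configuration, and report the gap between the maximal and minimal sums as a confidence value. The function name `quartet_split` and the dict-of-dicts matrix layout are assumptions for this illustration.

```python
def quartet_split(D, x, a, b, c):
    """Return ((pair, pair), confidence) for the quartet x, a, b, c.

    D is a symmetric dict-of-dicts distance matrix.
    """
    sums = {
        ((x, a), (b, c)): D[x][a] + D[b][c],
        ((x, b), (a, c)): D[x][b] + D[a][c],
        ((x, c), (a, b)): D[x][c] + D[a][b],
    }
    split = min(sums, key=sums.get)  # the pairing with the minimal sum
    confidence = max(sums.values()) - min(sums.values())
    return split, confidence

# Additive example: x and a on one side, b and c on the other,
# leaf edges of weight 1 and a separating path of length 2.
taxa = ["x", "a", "b", "c"]
D = {s: {t: 0 for t in taxa} for s in taxa}
for s, t, w in [("x", "a", 2), ("b", "c", 2), ("x", "b", 4),
                ("x", "c", 4), ("a", "b", 4), ("a", "c", 4)]:
    D[s][t] = D[t][s] = w

print(quartet_split(D, "x", "a", "b", "c"))
```

For additive data the confidence value is twice the length of the separating path (here 4, for a path of length 2); on noisy data a small value signals an unreliable answer.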

50

Robustness to Noise in Data

Answering local queries using a distance matrix D:

We can check several quartets of type x, a, b, c to answer a single local query.

Example: To answer LQ(g,1) we can check all quartets in

{g} × {a} × {c,f} × {b,d,e}

We can choose a representative set of quartets, and answer the local query according to (weighted) majority.

If the answer is still inconclusive, we can choose to ask another local query.

[Figure: a topology over taxa a-f with internal vertices 1-4, and the new taxon g whose position is queried.]

51

Improving Running Time

Separator Trees:

A deterministic algorithm which inserts a new taxon x into a given topology T can be viewed as a rooted decision tree.

• Each internal node represents a local query (internal vertex in T).

• Each internal node has three outgoing edges corresponding to the three possible answers to the query.

• Each leaf corresponds to an edge in T.

A special case of decision trees are separator trees.

The time complexity of the algorithm is the depth of the separator tree.

[Figure: a topology T over taxa a-m with internal vertices 1-6, and a corresponding separator tree S.]

52


Improving Running Time

Balanced Separator Trees: A balanced separator tree uses balancing vertices (of the appropriate subtrees of T)

Can be constructed in O(n·log(n)) time

Inserting a taxon does not drastically harm the balance

If we allow some imbalance, we can guarantee that the costly balancing

procedure is executed few times during construction of the whole topology.

Amortized analysis of total time complexity: O(n·log2(n))

[Figure: the topology T over taxa a-m with internal vertices 1-6, and a balanced separator tree S for it.]

53

Improving Running Time

Bottom-up approach: (simple separator trees)

• Start with the edge-set of T.
• Choose disjoint edge triplets, s.t. each triplet contains at least one leaf.
• Contract each triplet to a single edge.
• Recursively continue on the reduced topology.

[Figure: a topology T over taxa a-m with internal vertices 1-6, its successive contractions, and the resulting simple separator tree S.]

54

Improving Running Time

Bottom-up approach: (simple separator trees)

• By a simple linear traversal of T you can find Θ(|T|) edge-triplets.
• Topology size is reduced by a constant factor each stage:
  • Depth of simple separator tree is O(log(n)).
  • Time complexity is O(n).
• Insertion of a taxon induces modifications propagating bottom-up through the layers of the separator tree.

[Figure: the topology T, its successive contractions, and the separator tree S, with the sets chosen at each layer marked IS: {1,2,4,6}, IS: {3}, IS: {5}.]

55

Testing Reconstruction Methods on Noisy Data

We’d like to test reconstruction algorithms on actual phylogenetic data.

Problem: Confirmed phylogenetic trees are scarce and small.

Solution: Simulate the data.

1. Generate an edge-weighted tree T under some probabilistic model (Yule-Harding).

2. Choose a random DNA string for the root and simulate evolution on the tree to obtain sequences for all leaves (SeqGen), e.g. ATTCG…, ATACG…, ACTGG…

3. Obtain pairwise distances D from the sequences (DNAdist from PHYLIP).

4. Run the reconstruction algorithm on D to obtain a tree T’.

5. Compare the topologies of T and T’.

56

The Projects

Project I: The DLCA algorithm

Implement algorithms:
• Saitou&Nei's neighbor joining
• DLCA neighbor-joining:
  • mid-point reduction
  • maximal-value reduction

Simulate data:
• Use pre-generated trees to simulate the process of evolution (using the SeqGen program).
• For each tree generate several sequence-sets.

Experiments: Test the various algorithms on the generated data:
• Use the DNADIST program (part of the Phylip package) to get a distance matrix corresponding to the sequence-set of the leaves.
• Execute the algorithms on the distance matrix.
• Check topological accuracy using the RF-score.

57

The Projects

Project II: Fast Algorithms Using Local Queries

Implement algorithms: Implement advanced data structures which support the various algorithms:
• Algorithm using semi-balanced separator trees
• Algorithm using simple separator trees

Simulate data: Use pre-generated trees and/or uniform random model

Experiments: Test the various algorithms on the generated trees:

o Use the generated trees to answer the local queries asked by the algorithms.

o Compare the performance of the different algorithms on this data.

58

The Projects

Project III: Robust Algorithms Using Local Queries

Implement algorithms: Implement the O(n^2) algorithm using O(n·log(n)) queries

Simulate data: Use pre-generated trees and distance matrices

Experiments: Test various approaches on the generated data:

o Use the distance-matrices to answer the local queries asked by the algorithms.

o Suggest some method of estimating the confidence level of an answer to a query.

o Check for errors in the reconstructed topology. Compare several approaches

59

Grading Scheme

• 10% - work plan
• 60% - final report + submitted code. Rough distribution of this part:
  • 40% - meeting project requirements
  • 10% - code organization and documentation
  • 10% - innovation and creativity
• 30% - final presentation

60

Schedule

21/3 – Introductory meeting

28/3 – Deadline for choosing a project

26-30/3 – Individual 30 minute meetings with each team to discuss the specification of the project.

23-27/4 – Individual 60 minute meetings with each team to discuss work plan and design of project

2/5 – Deadline for submitting work plan

21-25/5 – Individual progress meetings

18-22/6 – Concluding 60 minute meetings with each team

27/6, 4/7 – Project presentations and submission of final draft

Final submission deadline – To be announced

61

Homework

Team up in pairs

Choose project

Send me e-mail containing:

The names, id numbers, e-mails of all students in the group

Preferred project + 2nd priority project

Two optional dates for first project meeting (next week)

Go over references of your chosen project

Good Luck !