CSCI6904 Genomics and Biological Computing Phylogenetics

Upload: brianna-austin

Post on 03-Jan-2016


Page 1: CSCI6904 Genomics and Biological Computing Phylogenetics

CSCI6904

Genomics and Biological Computing

Phylogenetics

Page 2: CSCI6904 Genomics and Biological Computing Phylogenetics

Phylogeny

A non-biological example

Howe, CJ., Barbrook, AC., Spencer, M., Robinson, P., Bordalejo, B. and Mooney, LR., 2001, Manuscript Evolution, Trends in Genetics, 17(3), 147-152

Page 3: CSCI6904 Genomics and Biological Computing Phylogenetics

Phylogeny

An analogy that works well…

Genome                  Manuscript
DNA Polymerase          Scribes
Mutations               Transcription errors/alterations
DNA sequences           Extant manuscripts
Deletion/Insertion      Deletion/Insertion
Rearrangements          Rearrangements
Lateral Gene transfer   Using part of a second template to enhance a copy of the manuscript
Selective pressure      Politics, Esthetics, linguistics
Gene history (tree)     Manuscript history (tree)

Page 4: CSCI6904 Genomics and Biological Computing Phylogenetics

Phylogeny

An analogy that works well…

Thanks to Gutenberg and his invention of the printing press, the rate at which manuscripts evolve has decreased by many orders of magnitude (to nearly 0, actually).

Raw data

The encoding of the data has to be done in a slightly different manner as it is preferable to treat words as characters. Consequently, the alphabet is of an un-manageable size.

Page 5: CSCI6904 Genomics and Biological Computing Phylogenetics

Phylogeny

The data is collected from extant manuscripts:

Page 6: CSCI6904 Genomics and Biological Computing Phylogenetics

Phylogeny

… and aligned so all characters are homologous:

Page 7: CSCI6904 Genomics and Biological Computing Phylogenetics

Phylogeny

What can be discovered:

Which manuscript is the closest to the original draft?

Are all known manuscripts found in, say, Belgium descendants of a single copy of the manuscript?

What would be the most likely text in the (long lost) original version?

Page 8: CSCI6904 Genomics and Biological Computing Phylogenetics

What can be discovered:

Whatever happened to the first chapter?

In the case to the left, there is evidence from “phylogenetic” analysis that the first half of the prologue of the manuscript El was taken from a different source than the rest of the text.

In genomics, if a gene gets misplaced in a tree, it may indicate that the gene was acquired by transfer rather than heredity.

Page 9: CSCI6904 Genomics and Biological Computing Phylogenetics

There is evidence that the transfer of a single gene transformed a benign bacterium, Yersinia pestis, into the agent of the “Black Death”.

“The study, published in the April 26 issue of Science, shows that an enzyme called phospholipase D (PLD), previously known as Yersinia murine toxin, allows Y pestis to survive in the midgut of the rat flea. By acquiring the gene that encodes PLD, "the bacterium gradually changed from a germ that causes a mild human stomach illness acquired via contaminated food or water to the flea-borne agent of the 'Black Death,' which in the 14th century killed one-fourth of Europe's population," the NIH said in a news release.”

Hinnebusch BJ, Rudolph AE, Cherepanov P, et al. Role of Yersinia murine toxin in survival of Yersinia pestis in the midgut of the flea vector. Science 2002;296(5568):733-5

Page 10: CSCI6904 Genomics and Biological Computing Phylogenetics

Phylogeny

Strategies

Discrete character approaches

Parsimonious criterion

Model likelihood criterion

Hypothesis likelihood criterion

Distance-based clustering

Least-square

Neighbor-Joining / UPGMA (Implicit topology)

Minimum Evolution

Page 11: CSCI6904 Genomics and Biological Computing Phylogenetics

UPGMA’s shortcoming

“Molecular Clock” assumption can be rejected in most cases.

In this example, un-equally evolving sequences are clustering according to their rate of evolution rather than according to the history of the genes.

Page 12: CSCI6904 Genomics and Biological Computing Phylogenetics

Neighbor-Joining algorithm

Guaranteed to recover the true tree if the distance matrix is an exact reflection of the tree.

How realistic is it to assume that these distances are behaving as such?

Page 13: CSCI6904 Genomics and Biological Computing Phylogenetics

Triangle inequality

Rarely respected, especially if either D(A,B) or D(B,C) is large.

The reason: Saturation.

Distance metrics between sequences

$$D_{AC} \le D_{AB} + D_{BC}$$

Page 14: CSCI6904 Genomics and Biological Computing Phylogenetics

What is saturation?

Time        1  2  3  4  5  6  7
Lineage A   A  -  -  -  -  -  P
Lineage B   A  F  A  H  K  H  P

In both cases, if only time steps 1 and 7 are known, the most likely distance will be the same.

Page 15: CSCI6904 Genomics and Biological Computing Phylogenetics

Saturation is theoretically expected
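A minimal sketch of why saturation is expected, assuming the Jukes-Cantor model introduced a few slides later (every substitution equally likely). The function name `expected_p_distance` is illustrative and not part of the original slides.

```python
import math

def expected_p_distance(d):
    """Expected fraction of differing sites after d substitutions per site,
    assuming Jukes-Cantor (four states, all substitutions equally likely)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# The observed difference flattens out near 0.75 while the true distance keeps growing.
for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"true distance {d:4.1f} -> expected observed difference {expected_p_distance(d):.3f}")
```

Under this assumption the observed difference can never exceed 3/4, so two very different true distances can produce almost the same observation.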

Page 16: CSCI6904 Genomics and Biological Computing Phylogenetics

Maximum likelihood distances

The following slides describe the evaluation of distances using the maximum likelihood criterion.

This is generally regarded as the most reliable way to evaluate distances between biological sequences.

Page 17: CSCI6904 Genomics and Biological Computing Phylogenetics

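Jukes-Cantor Model

For nucleotides (A, G, C, T), there are a limited number of possible substitutions.

Matrix with 1 expected substitution per 100 sites. The matrix is symmetric, so only the lower triangle is shown; each off-diagonal entry is 0.01/3 ≈ 0.0033 so that every row sums to 1.

      A        G        T        C
A   0.99
G   0.0033   0.99
T   0.0033   0.0033   0.99
C   0.0033   0.0033   0.0033   0.99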

Page 18: CSCI6904 Genomics and Biological Computing Phylogenetics

Jukes-Cantor Model

For nucleotides there are a limited number of substitutions

Given two (short) sequences

C C A T

C C G T

P1 =

      A        G        T        C
A   0.99
G   0.0033   0.99
T   0.0033   0.0033   0.99
C   0.0033   0.0033   0.0033   0.99

$$\pi = \{\pi_A, \pi_G, \pi_C, \pi_T\}$$

The likelihood that these two sequences are related is then:

$$L(1) = \pi_c P_{cc} \cdot \pi_c P_{cc} \cdot \pi_a P_{ag} \cdot \pi_t P_{tt}$$
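The product above can be sketched in a few lines. This is an illustration under stated assumptions: uniform base frequencies (0.25 each) and a one-step matrix with 0.99 on the diagonal and the remaining 0.01 split evenly over the three substitutions. The names `PI`, `P1` and `pair_likelihood` are hypothetical.

```python
BASES = "AGTC"

# Assumption: uniform base frequencies.
PI = {b: 0.25 for b in BASES}

# Assumption: symmetric one-step matrix whose rows sum to 1.
P1 = {a: {b: (0.99 if a == b else 0.01 / 3) for b in BASES} for a in BASES}

def pair_likelihood(seq1, seq2, P, pi=PI):
    """Product over aligned sites of pi[x] * P[x][y]."""
    likelihood = 1.0
    for x, y in zip(seq1, seq2):
        likelihood *= pi[x] * P[x][y]
    return likelihood

print(pair_likelihood("CCAT", "CCGT", P1))
```

Three sites contribute a match term and one site (A to G) contributes a substitution term, which mirrors the four factors of L(1).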

Page 19: CSCI6904 Genomics and Biological Computing Phylogenetics

Jukes-Cantor Model

For nucleotides there are a limited number of substitutions

Given two (short) sequences

C C A T

C C G T

P1 =

      A        G        T        C
A   0.99
G   0.0033   0.99
T   0.0033   0.0033   0.99
C   0.0033   0.0033   0.0033   0.99

$$\pi = \{\pi_A, \pi_G, \pi_C, \pi_T\}$$

$$L(1) = \pi_c P_{cc} \cdot \pi_c P_{cc} \cdot \pi_a P_{ag} \cdot \pi_t P_{tt}$$

What if the distance implied by P1 is not realistic/representative?

Page 20: CSCI6904 Genomics and Biological Computing Phylogenetics

Extrapolation of probability matrices.

As we have seen for the PAM matrix a few weeks back, we can obtain a $p_{i,j}$ for any multiple of PAM1 by doing a matrix multiplication.

$$M^2_{i,j} = \sum_{k=1}^{4} p_{i,k}\, p_{k,j}$$
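A short sketch of this extrapolation, assuming a symmetric Jukes-Cantor style one-step matrix (0.99 on the diagonal, 0.01/3 elsewhere). The helper names `mat_mul` and `mat_pow` are illustrative.

```python
def mat_mul(A, B):
    """Plain matrix product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, l):
    """M multiplied by itself l times (l >= 1)."""
    result = M
    for _ in range(l - 1):
        result = mat_mul(result, M)
    return result

OFF = 0.01 / 3  # assumed off-diagonal entry so that rows sum to 1
P1 = [[0.99 if i == j else OFF for j in range(4)] for i in range(4)]

P2 = mat_pow(P1, 2)    # probabilities after two steps
P50 = mat_pow(P1, 50)  # probabilities after fifty steps
print(P2[0][0], P2[0][1])
print(P50[0][0], P50[0][1])
```

Each power is still a probability matrix, and the diagonal shrinks toward 0.25 as the number of steps grows.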

Page 21: CSCI6904 Genomics and Biological Computing Phylogenetics

Extrapolation of probability matrices.

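There will then be a different probability associated with each possible distance:

$$L(1) = \pi_c P^1_{cc} \cdot \pi_c P^1_{cc} \cdot \pi_a P^1_{ag} \cdot \pi_t P^1_{tt}$$

$$L(2) = \pi_c P^2_{cc} \cdot \pi_c P^2_{cc} \cdot \pi_a P^2_{ag} \cdot \pi_t P^2_{tt}$$

$$L(l) = \pi_c P^l_{cc} \cdot \pi_c P^l_{cc} \cdot \pi_a P^l_{ag} \cdot \pi_t P^l_{tt}$$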

Page 22: CSCI6904 Genomics and Biological Computing Phylogenetics

Extrapolation of probability matrices.

The probability is a function of the distance between two sequences.

There is thus a value of the distance (l) that maximizes the probability of observing two related sequences.

In other words, there is a value of l that maximizes the likelihood that two sequences are related.

For branch length l over k sites:

$$L(l) = \pi_c P^l_{cc} \cdot \pi_c P^l_{cc} \cdot \pi_a P^l_{ag} \cdot \pi_t P^l_{tt} \cdots \pi_x P^l_{xx'}$$
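A minimal sketch of the search for the best l, assuming uniform base frequencies and a one-step matrix with 0.99 on the diagonal and 0.01/3 elsewhere. The two 20-site sequences are made up for illustration, and a simple scan over integer l stands in for a proper optimizer.

```python
import math

BASES = "AGTC"
IDX = {b: i for i, b in enumerate(BASES)}
OFF = 0.01 / 3  # assumed off-diagonal entry of the one-step matrix
P1 = [[0.99 if i == j else OFF for j in range(4)] for i in range(4)]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def log_likelihood(seq1, seq2, P, pi=0.25):
    """Sum over sites of log(pi * P[x][y]); logs avoid underflow on long alignments."""
    return sum(math.log(pi * P[IDX[x]][IDX[y]]) for x, y in zip(seq1, seq2))

def ml_distance(seq1, seq2, max_l=300):
    """Scan l = 1..max_l and keep the l whose matrix P1^l gives the best log-likelihood."""
    best_l, best_ll = None, -math.inf
    P = P1
    for l in range(1, max_l + 1):
        ll = log_likelihood(seq1, seq2, P)
        if ll > best_ll:
            best_l, best_ll = l, ll
        P = mat_mul(P, P1)  # P now holds P1^(l+1)
    return best_l, best_ll

# Hypothetical sequences differing at 4 of 20 sites.
seq_a = "CCATGGATCCATGGATCCAT"
seq_b = "CCGTGGATCAATGGTTCCAC"
best_l, best_ll = ml_distance(seq_a, seq_b)
print(best_l, best_ll)
```

The estimated number of steps is larger than the 20 steps a naive count of 4 differences in 20 sites would suggest, because the model accounts for multiple substitutions at the same site.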

Page 23: CSCI6904 Genomics and Biological Computing Phylogenetics

Arbitrary P matrices from

$$5 = e^{\log(5)}$$

$$5^4 = e^{4 \log(5)}$$

$$P^t = e^{t \log(P)}$$

$$Q = \log(P)$$

$$P^t = e^{Qt}$$

Q is the log(P) matrix for an arbitrary unit of distance.
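A sketch of how a P matrix for an arbitrary distance t can be obtained from Q, assuming a Jukes-Cantor rate matrix scaled to one expected substitution per site per unit of t. The matrix exponential is approximated here with a truncated Taylor series; `mat_exp` and `transition_matrix` are illustrative names, and production code would normally rely on a numerical library.

```python
import math

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(M, terms=40):
    """Truncated Taylor series: I + M + M^2/2! + ... (adequate for small matrices and moderate t)."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, M)]  # M^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

# Assumed Jukes-Cantor rate matrix: each row sums to 0, total leaving rate is 1.
Q = [[-1.0 if i == j else 1.0 / 3 for j in range(4)] for i in range(4)]

def transition_matrix(t):
    """P(t) = exp(Q t)."""
    return mat_exp([[q * t for q in row] for row in Q])

P_half = transition_matrix(0.5)
P_one = transition_matrix(1.0)
print(P_half[0][0], P_half[0][1])
```

Because P(t) is an exponential, P(0.5) multiplied by P(0.5) gives P(1.0), which is the continuous version of the matrix multiplication used for PAM matrices.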

Page 24: CSCI6904 Genomics and Biological Computing Phylogenetics

In Practice, the model can be custom built for an input dataset

$$Q = R\,\Pi$$

$\Pi$: vector of frequencies for each character (can be estimated from the input dataset)

$R$: a matrix of relative rates of substitution (from a large amount of empirical data (PAM, JTT) or optimized (WAG))

Page 25: CSCI6904 Genomics and Biological Computing Phylogenetics

Extrapolation of probability matrices.

Now, imagine that two sequences are un-related.

– The real Branch Length (t) is equal to +∞
– The BL estimate will converge to a necessarily smaller value, because some sites will be identical by coincidence.

Page 26: CSCI6904 Genomics and Biological Computing Phylogenetics

Even random sequences are going to have “matches”

Although Likelihood distance should tend to large values in this case.

Page 27: CSCI6904 Genomics and Biological Computing Phylogenetics

Even random sequences are going to have “matches”

Saturation should be compensated for in ML distances.

However, because of:

• Non-homogeneous frequencies
• Rate heterogeneity
• Change in the P matrix over time
• Non-independence of characters in a sequence

long distances are still a bit contentious to evaluate.

Page 28: CSCI6904 Genomics and Biological Computing Phylogenetics

Time reversibility is also assumed

Time        1  2  3  4  5  6  7
Lineage A   A  A  A  P  P  P  P
Lineage B   A  A  A  H  K  H  P

Without time reversibility assumed, it would be impossible to measure a distance between two sequences without involving an undefined bifurcation.

Page 29: CSCI6904 Genomics and Biological Computing Phylogenetics

Time reversibility is also assumed

In practice, this means that the entries in our matrices of substitution have to be symmetrical, such that:

$$P(a \mid b, l) = P(b \mid a, l)$$

This is also practical from a bioinformatics perspective since it cuts the number of parameters in the model in half.

Page 30: CSCI6904 Genomics and Biological Computing Phylogenetics
Page 31: CSCI6904 Genomics and Biological Computing Phylogenetics

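Another distance-based method that intuitively makes sense

Least Square method

$$U = \sum_{i}^{n} \sum_{j}^{n} w_{ij} \left( D_{ij} - d_{ij} \right)^2$$

$D_{ij}$: D matrix entry

$d_{ij}$: sum of all t along the path from i to j

$w_{ij}$: weight, with $w_{ij} = 1, \frac{1}{D_{ij}}, \frac{1}{D_{ij}^2}, \ldots$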

Page 32: CSCI6904 Genomics and Biological Computing Phylogenetics

One last distance-based method that we would intuitively use

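Once abstracted: we are looking for an acyclic, binary graph with n terminal vertices that conforms the best to a set of $n^2$ constraints.

[Figure: tree with terminal nodes i, o, f, j, x; terminal branches t1 to t5 and internal branches t12, t45, t345]

$$U = \sum_{i}^{n} \sum_{j}^{n} w_{ij} \left( D_{ij} - d_{ij} \right)^2$$

$$d_{ij} = t_1 + t_{12} + t_{345} + t_{45} + t_4$$

There is a danger of “time traveling”, with some $t_k < 0$.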

Page 33: CSCI6904 Genomics and Biological Computing Phylogenetics

One last distance-based method that we would intuitively use

Once abstracted: although there are n terminal nodes, there will be 2n-1 total nodes in the tree/graph (rooted tree).

[Figure: tree with terminal nodes i, o, f, j, x; terminal branches t1 to t5 and internal branches t12, t45, t345. Over successive frames, each branch is marked as in the path or not in the path from i to j.]

$$U = \sum_{i}^{n} \sum_{j}^{n} w_{ij} \left( D_{ij} - \sum_{k} x_{ij,k}\, t_k \right)^2$$

$$x_{ij,k} = \begin{cases} 1 & \text{branch } k \text{ is in the path} \\ 0 & \text{branch } k \text{ is not in the path} \end{cases}$$

Page 34: CSCI6904 Genomics and Biological Computing Phylogenetics

One last distance-based method that we would intuitively use

There is a straightforward solution to this linear algebra problem.

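[Figure: tree with terminal nodes i, o, f, j, x; terminal branches t1 to t5 and internal branches t12, t45, t345]

$$U = \sum_{i}^{n} \sum_{j}^{n} w_{ij} \left( D_{ij} - \sum_{k} x_{ij,k}\, t_k \right)^2$$

$$x_{ij,k} = \begin{cases} 1 & \text{branch } k \text{ is in the path} \\ 0 & \text{branch } k \text{ is not in the path} \end{cases}$$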
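One way to sketch the linear algebra is through the normal equations of the weighted least-squares problem. The example below is an assumption-laden illustration: it uses a small hypothetical unrooted four-taxon tree ((A,B),(C,D)) rather than the five-leaf tree of the slide, unit weights, and a hand-written Gaussian elimination (`solve`) in place of a numerical library.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def least_squares_branches(X, D, w):
    """Minimise U = sum_p w[p] * (D[p] - sum_k X[p][k] * t[k])**2 over t,
    where X[p][k] is 1 if branch k lies on the path of pair p, else 0."""
    pairs, m = len(X), len(X[0])
    A = [[sum(w[p] * X[p][a] * X[p][b] for p in range(pairs)) for b in range(m)]
         for a in range(m)]
    rhs = [sum(w[p] * X[p][a] * D[p] for p in range(pairs)) for a in range(m)]
    return solve(A, rhs)

# Columns: tA, tB, tC, tD, t_internal (hypothetical tree ((A,B),(C,D))).
X = [
    [1, 1, 0, 0, 0],  # A-B
    [1, 0, 1, 0, 1],  # A-C
    [1, 0, 0, 1, 1],  # A-D
    [0, 1, 1, 0, 1],  # B-C
    [0, 1, 0, 1, 1],  # B-D
    [0, 0, 1, 1, 0],  # C-D
]
D = [0.3, 0.9, 1.0, 1.0, 1.1, 0.7]  # made-up pairwise distances
w = [1.0] * len(D)
t = least_squares_branches(X, D, w)
print([round(v, 3) for v in t])
```

Nothing in this solution constrains the branch lengths to be positive, which is where negative values of t can appear with noisy distances.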

Page 35: CSCI6904 Genomics and Biological Computing Phylogenetics

One last distance-based method that we would intuitively use

Minimum Evolution

Can be used as a selection criterion between Least-Square tree topologies.

This is done by selecting, from a collection of suitable topologies, the topology that minimizes:

$$\sum_{k \in \text{all edges}} t_k$$

[Figure: tree with terminal nodes i, o, f, j, x; terminal branches t1 to t5 and internal branches t12, t45, t345]

Page 36: CSCI6904 Genomics and Biological Computing Phylogenetics

Tree space

Unlike UPGMA and NJ, the problem with this previous method is that you have to provide a topology prior to the calculation.

Page 37: CSCI6904 Genomics and Biological Computing Phylogenetics

Phylogeny

Strategies

Discrete character approaches

Parsimonious criterion

Model likelihood criterion

Bayesian statistics

Distance-based clustering

Least-square

Neighbor-Joining / UPGMA (Implicit topology)

Minimum Evolution

Page 38: CSCI6904 Genomics and Biological Computing Phylogenetics

Phylogeny

Discrete-character signal versus distance

Distance: uses the characters and a function to evaluate distance metrics. These are used to determine the length of the branches/edges between nodes/vertices. The internal nodes/edges are simply there to maximally reconcile the distance data into a binary tree.

Character: uses discrete characters, implicitly or explicitly, to define the state of each node.

Page 39: CSCI6904 Genomics and Biological Computing Phylogenetics

Parsimony

Intuitive method that can be run manually

Assumes that everything observed in the data is connected by the most straightforward relationships.

Page 40: CSCI6904 Genomics and Biological Computing Phylogenetics

Parsimony

Algorithm

Postorder tree traversal: from terminal nodes toward the “center”.

At each node:

1. Create the intersection of the sets of observations in the immediate descendant nodes.

2. If the intersection set is empty, create a set that is the union of the two descendants and add one to the count of changes recorded.
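The two steps above can be sketched for a single site as follows. The nested-tuple tree encoding, the leaf names and the function name `fitch` are assumptions made for this illustration.

```python
def fitch(tree, states):
    """Postorder pass over a binary tree given as nested 2-tuples of leaf names.
    states maps each leaf name to its observed character at one site.
    Returns (set of candidate states at this node, number of changes below it)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        # Step 1: a non-empty intersection costs nothing.
        return common, left_changes + right_changes
    # Step 2: empty intersection, take the union and record one change.
    return left_set | right_set, left_changes + right_changes + 1

# Hypothetical topology and one alignment column.
tree = (("s1", "s2"), ("s3", ("s4", "s5")))
site = {"s1": "A", "s2": "C", "s3": "C", "s4": "C", "s5": "G"}
root_set, changes = fitch(tree, site)
print(root_set, changes)  # {'C'} 2
```

Summing the change counts over all columns of an alignment gives the parsimony score of one topology; comparing topologies means repeating this for each candidate tree.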

Page 41: CSCI6904 Genomics and Biological Computing Phylogenetics

Parsimony

Algorithm

The most parsimonious tree will be the topology which will minimize the number of changes to explain the data over all sites (columns).

Statistics

Consistency:

$$CI = \frac{C_{min}}{C}$$

Retention:

$$RI = \frac{C_{max} - C}{C_{max} - C_{min}}$$

Page 42: CSCI6904 Genomics and Biological Computing Phylogenetics

Parsimony

Side-effects

The reconstruction is assuming that the most parsimonious explanation is the correct one.

It also assumes that all changes have a similar “cost”.

Therefore, the parsimony method does not seem to be designed to deal with saturation.

Page 43: CSCI6904 Genomics and Biological Computing Phylogenetics

Maximum likelihood criterion

Abstraction

We have a collection of items (sequences). We know that all the instances in the collection are stochastically derived from a unique parent in the hierarchy. We also have a model for this stochastic process, represented as a Markov process.

We are thus looking for a tree (topology+distances) that will maximize the likelihood of the data, given the Markov process.

Page 44: CSCI6904 Genomics and Biological Computing Phylogenetics

Jukes-Cantor Model

For nucleotides there are a limited number of substitutions

Given two (short) sequences

C C A T

C C G T

P1 =

      A        G        T        C
A   0.99
G   0.0033   0.99
T   0.0033   0.0033   0.99
C   0.0033   0.0033   0.0033   0.99

$$\pi = \{\pi_A, \pi_G, \pi_C, \pi_T\}$$

$$L(1) = \pi_c P_{cc} \cdot \pi_c P_{cc} \cdot \pi_a P_{ag} \cdot \pi_t P_{tt}$$

What if the distance implied by P1 is not realistic/representative?

Page 45: CSCI6904 Genomics and Biological Computing Phylogenetics

Extrapolation of probability matrices.

There will be an optimal distance between two sequences.

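$$L(1) = \pi_c P^1_{cc} \cdot \pi_c P^1_{cc} \cdot \pi_a P^1_{ag} \cdot \pi_t P^1_{tt}$$

$$L(2) = \pi_c P^2_{cc} \cdot \pi_c P^2_{cc} \cdot \pi_a P^2_{ag} \cdot \pi_t P^2_{tt}$$

$$L(l) = \pi_c P^l_{cc} \cdot \pi_c P^l_{cc} \cdot \pi_a P^l_{ag} \cdot \pi_t P^l_{tt}$$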

Page 46: CSCI6904 Genomics and Biological Computing Phylogenetics

Distance to an internal node

There will be an optimal distance between two sequences:

$$L(t) = \pi_c P^t_{cc} \cdot \pi_c P^t_{cc} \cdot \pi_a P^t_{ag} \cdot \pi_t P^t_{tt}$$

If the sequence of only one of the nodes is known, the other end could be any possible character:

$$L(t) = \sum_x \pi_c P^t_{cx} \cdot \sum_x \pi_c P^t_{cx} \cdot \sum_x \pi_a P^t_{ax} \cdot \sum_x \pi_t P^t_{tx}$$

Page 47: CSCI6904 Genomics and Biological Computing Phylogenetics

Model based phylogeny

It is possible to compute the likelihood of internal nodes by summing over all possibilities.

$$\sum_x \sum_y \sum_z \sum_w P(x)\, P(y \mid x, t_6)\, P(A \mid y, t_1)\, P(C \mid y, t_2)\, P(z \mid x, t_8)\, P(C \mid z, t_3)\, P(w \mid z, t_7)\, P(C \mid w, t_4)\, P(G \mid w, t_5)$$

[Figure: tree ((A,C),(C,(C,G))) with root x and internal nodes y, z, w; branch lengths t1 to t8]

Page 48: CSCI6904 Genomics and Biological Computing Phylogenetics

Model based phylogeny

The structure of the equation, once the summations are pushed as far right as possible, is the same as the structure of the tree.

$$\sum_x P(x) \left[ \sum_y P(y \mid x, t_6)\, P(A \mid y, t_1)\, P(C \mid y, t_2) \right] \left[ \sum_z P(z \mid x, t_8)\, P(C \mid z, t_3) \left[ \sum_w P(w \mid z, t_7)\, P(C \mid w, t_4)\, P(G \mid w, t_5) \right] \right]$$

$$((A, C), (C, (C, G)))$$

[Figure: tree ((A,C),(C,(C,G))) with root x and internal nodes y, z, w; branch lengths t1 to t8]

Page 49: CSCI6904 Genomics and Biological Computing Phylogenetics

Model based phylogeny

The calculation at one node thus depends on the conditional likelihood of each possible character s in the children nodes.

$$L^{(i)}_w(s) = P(C \mid s, t_4)\, P(G \mid s, t_5)$$

$$L^{(i)}_w = \{ L^{(i)}_w(s_1), L^{(i)}_w(s_2), \ldots, L^{(i)}_w(s_n) \}$$

[Figure: tree ((A,C),(C,(C,G))) with root x and internal nodes y, z, w; branch lengths t1 to t8]

Page 50: CSCI6904 Genomics and Biological Computing Phylogenetics

Model based phylogeny

For terminal nodes:

$$L^{(i)} = \{0, 1, 0, 0, 0, 0, 0, 0, \ldots, 0\}$$

For internal nodes:

$$L^{(i)}_y(s) = \prod_{child} \left( \sum_a P(a \mid s, t_{child})\, L^{(i)}_{child}(a) \right)$$

This is done for each site i.

The log(L) are stored rather than L.

[Figure: tree ((A,C),(C,(C,G))) with root x and internal nodes y, z, w; branch lengths t1 to t8]

Page 51: CSCI6904 Genomics and Biological Computing Phylogenetics

Model based phylogeny

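For terminal nodes:

$$L^{(i)} = \{0, 1, 0, 0, 0, 0, 0, 0, \ldots, 0\}$$

For internal nodes:

$$L^{(i)}_y(s) = \prod_{child} \left( \sum_a P(a \mid s, t_{child})\, L^{(i)}_{child}(a) \right)$$

For innermost nodes:

$$L^{(i)} = \sum_a \pi_a\, L^{(i)}_x(a)$$

For a tree:

$$L = \prod_i L^{(i)}$$

[Figure: tree ((A,C),(C,(C,G))) with root x and internal nodes y, z, w; branch lengths t1 to t8]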
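The recursion for terminal, internal and innermost nodes can be sketched for one site as below. This is an illustration under assumptions: Jukes-Cantor transition probabilities, uniform frequencies, and made-up branch lengths of 0.1 for the tree ((A,C),(C,(C,G))). The list-of-pairs tree encoding and the function names are hypothetical.

```python
import math

BASES = "ACGT"

def jc_prob(a, s, t):
    """P(a | s, t) under Jukes-Cantor, with t in expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == s else 0.25 - 0.25 * e

def conditional(node):
    """Conditional likelihood vector of a node for one site.
    A terminal node is a base; an internal node is a list of (child, branch_length)."""
    if isinstance(node, str):
        return {s: 1.0 if s == node else 0.0 for s in BASES}
    L = {s: 1.0 for s in BASES}
    for child, t in node:
        L_child = conditional(child)
        for s in BASES:
            L[s] *= sum(jc_prob(a, s, t) * L_child[a] for a in BASES)
    return L

def site_likelihood(root, pi=0.25):
    """Weight the innermost node's vector by the (assumed uniform) frequencies."""
    L_root = conditional(root)
    return sum(pi * L_root[a] for a in BASES)

# Tree ((A,C),(C,(C,G))) with hypothetical branch lengths.
w = [("C", 0.1), ("G", 0.1)]
z = [("C", 0.1), (w, 0.1)]
y = [("A", 0.1), ("C", 0.1)]
x = [(y, 0.1), (z, 0.1)]
print(site_likelihood(x), math.log(site_likelihood(x)))
```

For a full alignment the per-site values are combined by adding their logarithms, which is equivalent to the product over sites and avoids numerical underflow.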

Page 52: CSCI6904 Genomics and Biological Computing Phylogenetics

Tree Space

… and an epic complexity in search space

$$T(n) = 3 \times 5 \times 7 \times 9 \times 11 \times \cdots \times (2n-3)$$

$$T(n) = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$

Where n is of interest in the 20-100 range.
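A quick sketch of how fast this count grows, assuming rooted binary topologies with n labelled terminal nodes (the product of odd numbers up to 2n-3). The function names are illustrative.

```python
import math

def count_rooted_trees(n):
    """3 * 5 * 7 * ... * (2n - 3) for n >= 2 labelled terminal nodes."""
    total = 1
    for k in range(3, 2 * n - 2, 2):
        total *= k
    return total

def count_rooted_trees_closed(n):
    """Same count through the factorial form (2n-3)! / (2^(n-2) * (n-2)!)."""
    return math.factorial(2 * n - 3) // (2 ** (n - 2) * math.factorial(n - 2))

for n in (4, 10, 20, 50, 100):
    print(n, count_rooted_trees(n))
```

Already at n = 20 the number of topologies has more than twenty digits, which is why exhaustive evaluation is not an option in the range of interest.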

Page 53: CSCI6904 Genomics and Biological Computing Phylogenetics

Tree data structure

What if we don’t constrain the binary connectivity?

Multifurcation is not necessary because any multifurcated tree is a case of a binary tree with at least one internal branch length set to zero.

Page 54: CSCI6904 Genomics and Biological Computing Phylogenetics

Exploring the topology space

Incremental method – Stepwise addition

Introduce one sequence at a time, on the branch which gives the best likelihood.

$$T(n) = 3 \times 5 \times 7 \times 9 \times 11 \times \cdots \times (2n-3)$$

This method of building a topology is intrinsically greedy.

Page 55: CSCI6904 Genomics and Biological Computing Phylogenetics

Exploring the topology space

Local rearrangement – Nearest Neighbor Interchange

For a tree with four lineages, there are only three possible topologies, one of which is already computed.

Page 56: CSCI6904 Genomics and Biological Computing Phylogenetics

Exploring the topology space

Greedy Methods – Nearest Neighbor Interchange

Sensitive to the initial tree, may not recover from some types of error early in the optimization.

Page 57: CSCI6904 Genomics and Biological Computing Phylogenetics


Page 58: CSCI6904 Genomics and Biological Computing Phylogenetics

Exploring the topology space

Global Methods

Methods that have the potential to sample the entire breadth of the search space in only a few consecutive iterations.

Subtree Pruning Regrafting (SPR)
Tree Bisection and Reconnection (TBR)

Page 59: CSCI6904 Genomics and Biological Computing Phylogenetics

Exploring the topology space

Subtree Pruning Regrafting

Search space per step is narrowed down to:

$$O(n^2)$$

Page 60: CSCI6904 Genomics and Biological Computing Phylogenetics

Exploring the topology space

Tree bisection and reconnection

This algorithm randomizes the site of reconnection in both subtrees.

Search space is narrowed down to:

$$O(n^3)$$

Page 61: CSCI6904 Genomics and Biological Computing Phylogenetics

Meaning of all this

Trees

Biologists love them. As we have seen, they attempt to go beyond the logical clustering of data items. Instead, they are used to reconstruct the process under which the data was generated.

For example: it is because of phylogenetic trees that we know that we (modern eukaryotes) originate from the symbiosis between a bacterium (an alpha-proteobacterium, the ancestor of mitochondria) and an elusive ancestral cell.

The likelihood of a tree is a reflection of the goodness of fit of the alignment to a tree and a model of substitution.

Page 62: CSCI6904 Genomics and Biological Computing Phylogenetics

How good is a tree?

Problem: not only do the clusters matter, but each individual internal node contains usable information. How can we ascertain that a node is any good?

There is no reference set that can be used and for which we know the true answer.

Cause of error:

Model misspecification, mixed history within a gene, sampling error, …

Page 63: CSCI6904 Genomics and Biological Computing Phylogenetics

Significance of difference in likelihood values.

If the likelihood evaluation depends on a single parameter:

$$2 \left( \ln \hat{L} - \ln L_0 \right) \sim \chi^2_1$$

This relationship is however impractical, since the likelihood calculation rarely depends on a single parameter.

An (imperfect) example of this would be if we were interested in evaluating the certainty on a single branch length at a time.

Real trees have a lot more parameters: 2n-3 branches (for an unrooted binary tree), the rate distribution shape parameter, etc.

Page 64: CSCI6904 Genomics and Biological Computing Phylogenetics

Re-sampling using the bootstrap

Getting around the sampling error

Assumption : All the significant signal is present in the data, but the signal’s blend is affected by the size of the sample.

Given a dataset of n sequences that are k character long:

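Create a new dataset by randomly and uniformly choosing site indices “i” (with replacement) until the resampled dataset has a size of n × k.

$$\begin{pmatrix} a_{11} & \cdots & a_{1i} & \cdots & a_{1k} \\ \vdots & & \vdots & & \vdots \\ a_{n1} & \cdots & \cdots & \cdots & a_{nk} \end{pmatrix}$$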
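The resampling step can be sketched as follows, assuming the alignment is stored as a list of n equal-length strings. The function name `bootstrap_alignment` and the toy alignment are illustrative.

```python
import random

def bootstrap_alignment(alignment, rng=random):
    """Draw k column indices uniformly with replacement and rebuild an n x k alignment.
    Whole columns are resampled, so the characters of one site stay together."""
    k = len(alignment[0])
    columns = [rng.randrange(k) for _ in range(k)]
    return ["".join(seq[c] for c in columns) for seq in alignment]

alignment = ["CCATGA", "CCGTGA", "CTATGC"]  # toy data, 3 sequences x 6 sites
replicate = bootstrap_alignment(alignment, random.Random(1))
print(replicate)
```

A tree is then inferred from each replicate, and the fraction of replicates in which a given internal node appears is reported as its bootstrap value.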

Page 65: CSCI6904 Genomics and Biological Computing Phylogenetics

Re-sampling using the bootstrap

In practice, this is used to generate a large number of replicates and count the frequency of observing a given internal node.

High “bootstrap value”: the node is stable to sampling error.

It does not mean that a given internal node is “real”.

Page 66: CSCI6904 Genomics and Biological Computing Phylogenetics

Re-sampling using the Jackknife

Getting around the sampling error in the sequence axis

Principle : Randomly delete a small fraction of the data

The term “Jackknife” is also used in cases where trees are reconstructed by randomly deleting whole sequences from the dataset.

Page 67: CSCI6904 Genomics and Biological Computing Phylogenetics

Using simulation

This is known as the parametric bootstrap

Principle : Create a distribution of likelihood values generated from simulated datasets

The test is done by evaluating the probability that the “real” likelihood value is part of the distribution of tree likelihood from simulated datasets.

Page 68: CSCI6904 Genomics and Biological Computing Phylogenetics

Using simulation

Simulating a multiple sequence alignment from a tree is the reverse problem of inferring a tree from an alignment.

[Figure: tree with root x, internal nodes y, z, w and five terminal nodes of unknown state (?); branch lengths t1 to t8]

x = {…}, random, drawn from π

On a per-site basis, the probability vector of each site in the node y can be calculated with:

$$P^{(y)} = e^{Q t_6}$$

$$P_{y_i \mid x_i} = \{ P^{(y)}_{x_i, A}, \ldots, P^{(y)}_{x_i, Y} \}$$
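A minimal sketch of one simulation step, assuming nucleotides under Jukes-Cantor instead of the amino acid alphabet of the slide, uniform frequencies at the root, and an arbitrary branch length. The function names are illustrative.

```python
import math
import random

BASES = "ACGT"

def jc_row(s, t):
    """Row of P = exp(Qt) for parent state s under Jukes-Cantor."""
    e = math.exp(-4.0 * t / 3.0)
    return [0.25 + 0.75 * e if a == s else 0.25 - 0.25 * e for a in BASES]

def simulate_root(k, rng):
    """Root sequence x: each site drawn from the (assumed uniform) frequencies."""
    return "".join(rng.choice(BASES) for _ in range(k))

def evolve(seq, t, rng):
    """Child sequence: each site drawn from the probability vector given its parent state."""
    return "".join(rng.choices(BASES, weights=jc_row(s, t))[0] for s in seq)

rng = random.Random(42)
x = simulate_root(50, rng)
y = evolve(x, 0.1, rng)  # node y, reached through a branch of hypothetical length 0.1
print(x)
print(y)
```

Repeating `evolve` down every branch of the tree, from the root to the terminal nodes, yields one simulated alignment.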

Page 69: CSCI6904 Genomics and Biological Computing Phylogenetics

Using simulation

This is known as the parametric bootstrap

This test can be used to evaluate whether the data can be simulated from a given combination of tree and model.

If the test tree is wrong, the simulated dataset should not include the “real” data.

If the test tree is the true tree and the model is relevant to what really happened during the evolution of the gene, the likelihood of the “real” data and that of the simulated series should not be statistically different.

Page 70: CSCI6904 Genomics and Biological Computing Phylogenetics

Using simulation

This is known as the parametric bootstrap

This test can be used to evaluate whether the data can be simulated from a given combination of tree and model.

The test is expected to be conservative because the simulated datasets are generated and recovered using the same parameters, while the “real” data comes from a true process.

Page 71: CSCI6904 Genomics and Biological Computing Phylogenetics

Time consideration

Bootstrapping requires building distributions

This usually means that the long calculation has to be re-run on resampled datasets about 100-1000 times over. All this, just to harvest a few numbers.

Page 72: CSCI6904 Genomics and Biological Computing Phylogenetics

Paired-site tests

Can be used to compare two trees

There are a number of techniques that compare two trees on the basis of their site likelihoods.

Winning site test, z test, t-test, Wilcoxon signed rank test, …

These tests are more appropriate to estimate error bars in the topology dimension of a solution.

Page 73: CSCI6904 Genomics and Biological Computing Phylogenetics

Paired-site tests

In our research group, we are using such tests in our optimization strategy.

$$\delta_i = \ln L^{(i)}_{ref} - \ln L^{(i)}_{test}$$

$\delta_i > 0$: ref is better

It is possible to eliminate statistically worse trees rapidly, without re-sampling, and to treat the solution not as a data point but rather as an area in topology space.
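As an illustration of a paired-site comparison, here is a sketch of a z statistic built from per-site log-likelihood differences. It is a generic textbook form and not necessarily the test used by the authors; the function name and the per-site values are made up.

```python
import math

def paired_site_z(lnL_ref, lnL_test):
    """z statistic for the summed per-site differences ln L_ref - ln L_test.
    Assumes sites are independent, so the variance of the sum is k times the
    sample variance of the per-site differences."""
    deltas = [r - t for r, t in zip(lnL_ref, lnL_test)]
    k = len(deltas)
    mean = sum(deltas) / k
    variance = sum((d - mean) ** 2 for d in deltas) / (k - 1)
    return sum(deltas) / math.sqrt(k * variance)

# Hypothetical per-site log-likelihoods for a reference tree and a test tree.
lnL_ref = [-2.0, -1.5, -3.0, -2.2]
lnL_test = [-2.1, -1.7, -3.3, -2.4]
z = paired_site_z(lnL_ref, lnL_test)
print(z)
```

A large positive z suggests the reference tree explains the sites better than the test tree by more than site-to-site noise, so the test tree can be set aside without any resampling.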

Page 74: CSCI6904 Genomics and Biological Computing Phylogenetics

Our research is showing that

Sums of likelihoods are sensitive to the variance of poorly modeled sites.

Single-thread searches are not very robust to local minima.

Page 75: CSCI6904 Genomics and Biological Computing Phylogenetics

Meaning of all this

Site likelihood

Bioinformaticians love them. They provide information that is not contained in individual sequences (i.e., no matter how hard one scans a single genome). Further, they contain information on properties that may be impossible to physically observe.

Site likelihoods are a reflection of the goodness of fit of one position in a protein given a solution optimized with all the available data.

Page 76: CSCI6904 Genomics and Biological Computing Phylogenetics

Phylogeny allows sequences to be assembled into an informative, time-dependent structure

For the next few slides we will look at how phylogenetic information can be used to detect new signal in sequence information.

Site-wise rate of evolution.
Rate dependent functional shifts.
Rate independent functional shifts.

This framework offers a new source of information for pattern detection and recognition.

Page 77: CSCI6904 Genomics and Biological Computing Phylogenetics

Estimating rates amongst sites

The basic calculation assumed a constant rate. Variable rates can be approximated on a site-per-site basis:

[Figure: tree ((A,C),(C,(C,G))) with root x and internal nodes y, z, w; branch lengths t1 to t8]

$$L^{(i)} = \int_0^{\infty} f(r_i)\, L^{(i)}(r_i)\, dr_i$$

$$L^{(i)} \approx \sum_{k} w_k\, L^{(i)}(r_k)$$

This will be true as long as the mean rate is 1:

$$\sum_k w_k r_k = 1$$
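The weighted sum over rate categories can be sketched on the simplest possible case, one site shared by two sequences under Jukes-Cantor. The four rate categories and their equal weights are made up for illustration and are not taken from the slides.

```python
import math

def jc_pair_site(x, y, t, pi=0.25):
    """Likelihood of one aligned site for two sequences separated by distance t
    (Jukes-Cantor, uniform frequencies assumed)."""
    e = math.exp(-4.0 * t / 3.0)
    p = 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e
    return pi * p

def site_likelihood_with_rates(x, y, t, rates, weights):
    """Sum over categories of w_k * L(r_k): each category rescales the branch length."""
    return sum(w * jc_pair_site(x, y, r * t) for r, w in zip(rates, weights))

# Hypothetical categories; weights sum to 1 and the mean rate is 1.
rates = [0.2, 0.6, 1.2, 2.0]
weights = [0.25, 0.25, 0.25, 0.25]

print(site_likelihood_with_rates("A", "G", 0.3, rates, weights))
print(jc_pair_site("A", "G", 0.3))  # constant-rate value, for comparison
```

Keeping the mean rate at 1 means the branch lengths keep their usual interpretation as expected substitutions per site averaged over categories.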

Page 78: CSCI6904 Genomics and Biological Computing Phylogenetics

Extracting information from rates estimates

Sequence alignments were first used to identify which positions were “conserved”.

The rationale was that if the same character was conserved across all sequences, it was constrained and played an important role.

We can refer to this as “eyeball bioinformatics”.

This method of predicting function is very sensitive to the source dataset:

Sampling homogeneity
Character similarity

Page 79: CSCI6904 Genomics and Biological Computing Phylogenetics

Extracting information from rates estimates

Case 1 Sampling homogeneity

Two alignments for protein sequences of gene X:

Alignment 1: 35 spider sequences
Alignment 2: 5 spider, 5 mammal, 5 bacterial, 5 fungal, 5 nematode, 5 primate and 5 rice plant sequences

The conclusion will necessarily be that there are more conserved sites using the spider data, while in fact some of the same sequences are present in the second dataset.

Page 80: CSCI6904 Genomics and Biological Computing Phylogenetics

Fast Slow

Maximum-Likelihood Site-Rates are Biologically Relevant

Rhodopsin-like G-protein receptors

Pfam (dataset 1Tml_7) 69 taxa

Page 81: CSCI6904 Genomics and Biological Computing Phylogenetics

Maximum-Likelihood Site-Rates are Biologically Relevant

Tubulin

34 taxa 33 taxa

The constraints imposed by co-evolution far outweigh the structural constraints.

Fast Slow

Page 82: CSCI6904 Genomics and Biological Computing Phylogenetics

What can be done with rate of evolution

Predict functionally important regions in proteins

Example: we know that the product of gene G binds a drug, but the mechanism is unknown. Using site rate estimates, a patch of slowly evolving sites is detected at the surface of the protein’s structure. This is potentially a good place to investigate further.

Why bother with computational methods?

Time to gather data:

Sequence << Structure << Biological activity

Computational methods are best used trying to fill the gap between genomic data and the real world.

Page 83: CSCI6904 Genomics and Biological Computing Phylogenetics

What can be done with rate of evolution

The technique gains in power if used in a comparative strategy

Often, an uncharacterized gene will have a related protein that is already well known. It is possible to compare the two datasets of sequences in a 3D context to predict the presence or absence of function.

The computational techniques to do this are usually based on site likelihood methods.

Comparing scalar rate values has no statistical meaning. There are many different schemes; many exploit a variant of the site likelihood ratio statistic for two aligned datasets of sequences a and b:

$$L^{(i)}_{r_a \neq r_b} = \frac{L^{(i)}_{r_a}\, L^{(i)}_{r_b}}{L^{(i)}_{r_{ab}}}$$

Page 84: CSCI6904 Genomics and Biological Computing Phylogenetics

Inferring Function in Homologs of eF1

Evolutionary Patterns in Elongation Factors and paralogs

eF1: 34 taxa
(a)eF1: 33 taxa
HBS1: 10 taxa
eRF3: 20 taxa

Pairwise comparison using bivariate site rate estimation

[Figure: panels labelled eRF3, eF1, (a)eF1 and HBS1]

Page 85: CSCI6904 Genomics and Biological Computing Phylogenetics

Inferring Function in Homologs of eF1

eRF3: Eukaryotic Release Factor

Loss of eF1–analog interface
Loss of constraint in 1’ loop

Interacts with eRF1 (a tRNA mimic)

[Figure: two eF1 structures; legend: Slow in eRF3, Slow in Eukaryotes, Differently Evolving Sites]


Page 86: CSCI6904 Genomics and Biological Computing Phylogenetics

Inferring Function in Homologs of eF1

HBS1: Unknown Function

Loss of eF1–analog interface

Most likely no tRNA binding

[Figure: two eF1 structures; legend: Slow in HBS1, Slow in Eukaryotes, Differently Evolving Sites]

Page 87: CSCI6904 Genomics and Biological Computing Phylogenetics

Phylogenetics and bioinformatics

Phylogenetics existed long before bioinformatics

It comes from mathematics, statistics and genetics circles.

Phylogenetics is very relevant to bioinformatics

It captures a dimension of the data that is not visible from a collection of sequences.

Phylogenetics is becoming an increasingly central theme in sequence analyses.