week 36: trees, alignment & database search

Week 36: Trees, Alignment & Database Search

Trees

Alignment

Database Search

Trees

Tree Concepts

Reconstruction Methods

Parsimony

Likelihood

Distance

The Molecular Clock

Topologies

4 sequences with root ignored:

s1      s3        s1      s2        s1      s2
  \    /            \    /            \    /
   ----              ----              ----
  /    \            /    \            /    \
s2      s4        s3      s4        s4      s3

For unrooted trees with k leaves in which every internal node has degree 3 (three incident edges), the following holds:

Edges = 2k-3; Internal nodes = k-2; T(k) = (2k-5)T(k-1); T(3)=1;

k                 3    4    5     6     7    20
T(k)              1    3   15   105   945    approx. 2.2*10^20
Edges             3    5    7     9    11    37
Internal nodes    1    2    3     4     5    18
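The recursion T(k) = (2k-5)T(k-1) with T(3) = 1 can be checked with a few lines of Python. This is only an illustrative sketch; the function name `count_unrooted_trees` is made up for this example.

```python
def count_unrooted_trees(k: int) -> int:
    """Number of unrooted binary topologies with k labelled leaves."""
    if k < 3:
        raise ValueError("k must be at least 3")
    count = 1  # T(3) = 1
    for leaves in range(4, k + 1):
        count *= 2 * leaves - 5  # T(k) = (2k-5) * T(k-1)
    return count


# Print k, T(k), number of edges and number of internal nodes.
for k in (3, 4, 5, 6, 7, 20):
    print(k, count_unrooted_trees(k), 2 * k - 3, k - 2)
```

Python integers have arbitrary precision, so the k = 20 value is computed exactly.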

Central Principles of Phylogeny Reconstruction

Parsimony

Distance

Likelihood

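[Figure: the four sequences TTCAGT, TCCAGT, GCCAAT and GCCAAT (labelled s1-s4) analysed on the same four-leaf tree under each principle. Parsimony panel: substitutions counted per branch, total weight 4. Distance panel: pairwise distances and fitted branch lengths. Likelihood panel: branch-length parameter estimates, L = 3.1*10^-7.]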

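The Small Parsimony Problem (Fitch-Hartigan-Sankoff)

        ?
       / \
      ?   ?
     / \ / \
    C   G T   C

At each node, L'_N is the minimal cost of the subtree below the node when nucleotide N is placed at the node. It is computed from the cost vectors (L_A, L_C, L_G, L_T) of the left child and (R_A, R_C, R_G, R_T) of the right child.

Recursion: $L'_N = \min_{N_L}\{L_{N_L} + d(N,N_L)\} + \min_{N_R}\{R_{N_R} + d(N,N_R)\}$

Initialisation: at a leaf, $L_N = 0$ if N is the nucleotide observed at the leaf, else $L_N = \infty$.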

Fitch-Hartigan-Sankoff Algorithm

Costs: transition 2, transversion 5, indel 10. (* = infinity)

                  (A, C, G, T)
                  (9, 7, 7, 7)
                   /        \
        (A,  C, G,  T)       \
        (10, 2, 10, 2)        \
          /        \           \
   (A,C,G,T)    (A,C,G,T)    (A,C,G,T)
   (*,0,*,*)    (*,*,*,0)    (*,*,0,*)
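The cost vectors in this example can be reproduced with a short sketch of the recursion. The nested-tuple tree encoding and the names `sub_cost` and `sankoff` are choices made for this illustration, not part of the original slides; indels are not needed here because all leaves are single nucleotides.

```python
import math

NUCS = "ACGT"
PURINES = {"A", "G"}


def sub_cost(x: str, y: str) -> int:
    """Transition (A<->G, C<->T) costs 2, transversion costs 5."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return 2 if same_class else 5


def sankoff(node):
    """Return the cost vector (A, C, G, T) for a leaf or a (left, right) pair."""
    if isinstance(node, str):  # leaf: 0 for the observed nucleotide, else infinity
        return [0 if n == node else math.inf for n in NUCS]
    left, right = (sankoff(child) for child in node)
    return [
        min(cost + sub_cost(n, m) for m, cost in zip(NUCS, left))
        + min(cost + sub_cost(n, m) for m, cost in zip(NUCS, right))
        for n in NUCS
    ]


tree = (("C", "T"), "G")   # leaves C and T share an internal node; G joins at the root
print(sankoff(("C", "T")))  # internal node
print(sankoff(tree))        # root
```

The minimum of the root vector (7) is the parsimony cost of the tree.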

Distance Concepts on Trees I

A: Metric, d( , ):

i: d(a,b) = 0 <=> a = b
ii: d(a,b) = d(b,a)
iii: d(a,b) <= d(a,c) + d(c,b)

[Figure: triangle with corners a, b and c illustrating the triangle inequality.]

Tree Metric (the distance function originates from a tree):

d(x,y) + d(z,w) = d(x,z) + d(y,w) > d(x,w) + d(y,z), where x,y,z,w is a permutation of a,b,c,d.

(> implies that no branch has length 0)
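A minimal sketch of how the four-point condition can be tested on a set of six pairwise distances. The function name `four_point_split`, the dictionary layout and the example branch lengths are assumptions made for this illustration.

```python
def four_point_split(d, taxa, tol=1e-9):
    """Return the pairing with the smallest sum if the four-point condition
    holds strictly (two largest sums equal, third strictly smaller), else None."""
    a, b, c, e = taxa

    def dist(x, y):
        return d[frozenset((x, y))]

    pairings = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]
    sums = sorted((dist(*p) + dist(*q), (p, q)) for p, q in pairings)
    smallest, middle, largest = sums
    if abs(largest[0] - middle[0]) <= tol and smallest[0] < middle[0] - tol:
        return smallest[1]  # the two cherries of the quartet tree
    return None


# Distances from a tree ((a,b),(c,d)) with leaf branches 1, 2, 3, 4
# and an internal branch of length 5 (hypothetical numbers).
d = {
    frozenset("ab"): 3, frozenset("cd"): 7,
    frozenset("ac"): 9, frozenset("ad"): 10,
    frozenset("bc"): 10, frozenset("bd"): 11,
}
print(four_point_split(d, "abcd"))
```

The pairing with the smallest sum identifies which leaves are neighbours, which is the basis of distance-based reconstruction.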

Distance Concepts on Trees II

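[Figure: unrooted tree with leaves s1, s2, s3 and s4; below it, the three leaves s1, s2 and s3 joined at an internal node i.]

Reconstruction Principle: d(s1,i) = (d(s1,s2) + d(s1,s3) - d(s2,s3))/2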

Distance Concepts on Trees III

Ultrametric (the distance function originates from a tree with a molecular clock):

d(x,z) = d(y,z) > d(x,y), where x,y,z is a permutation of a,b,c. (> implies that no branch has length 0)

[Figure: rooted tree with leaves s1, s2 and s3; i is the internal node joining s1 and s2.]

Reconstruction Principle: d(s1,i) = d(s1,s2)/2

Evolutionary Substitution Process

[Figure: the nucleotides C, C and A connected by branches of duration t1 and t2.]

$P_{i,j}(t)$ = probability of going from i to j in time t.

Probability of a pattern - summing over internal states

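[Figure: tree with observed nucleotides at the leaves and unknown states ("?") at the internal nodes. Each internal node can be A, C, G or T; the probability of the observed pattern is the sum over all assignments of internal states.]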

Felsenstein's recursion

Conditional likelihood: CL(v,N) is the probability of observing the nucleotides at the leaves of the subtree hanging from v, given that nucleotide N is found at v.

P(v1,v2,N1,N2) = probability of N2 at v2, given N1 at its parent v1.

I) Initial condition: CL(leaf,N) = 1{N is at leaf}

II) Recursion:

$CL(v,N) = \Big[\sum_{N_l} P(v,v_l,N,N_l)\,CL(v_l,N_l)\Big] \cdot \Big[\sum_{N_r} P(v,v_r,N,N_r)\,CL(v_r,N_r)\Big]$

where r refers to the right son, l to the left son.

III) Likelihood of one position under parameters p: $L(p) = \sum_{N} \pi_N\,CL(\mathrm{root},N,p)$

IV) Total likelihood, product over all positions: $L = \prod_i L_i(p)$
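The recursion can be sketched in Python for a single alignment column. The slides do not fix a substitution model here, so this example assumes the Jukes-Cantor model (equal base frequencies, branch lengths in expected substitutions per site); the tree encoding `(left, t_left, right, t_right)` and all function names are hypothetical.

```python
import math

NUCS = "ACGT"
PI = {n: 0.25 for n in NUCS}  # assumed stationary distribution (Jukes-Cantor)


def jc_prob(x: str, y: str, t: float) -> float:
    """Jukes-Cantor probability of going from x to y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def cond_lik(node):
    """CL(v, N) for every nucleotide N at node v."""
    if isinstance(node, str):  # leaf: indicator of the observed nucleotide
        return {n: 1.0 if n == node else 0.0 for n in NUCS}
    left, t_left, right, t_right = node
    cl_left, cl_right = cond_lik(left), cond_lik(right)
    return {
        n: sum(jc_prob(n, m, t_left) * cl_left[m] for m in NUCS)
        * sum(jc_prob(n, m, t_right) * cl_right[m] for m in NUCS)
        for n in NUCS
    }


def site_likelihood(tree) -> float:
    """L = sum over N of pi_N * CL(root, N)."""
    cl_root = cond_lik(tree)
    return sum(PI[n] * cl_root[n] for n in NUCS)


def total_likelihood(site_trees) -> float:
    """Product of the site likelihoods (same topology, one leaf pattern per site)."""
    return math.prod(site_likelihood(t) for t in site_trees)


tree = (("A", 0.1, "C", 0.2), 0.3, "G", 0.4)
print(site_likelihood(tree))
```

Summing `site_likelihood` over all possible leaf patterns gives 1, which is a convenient sanity check on any implementation of the recursion.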

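Output from Likelihood Method

[Figure: maximum likelihood trees for s1-s5 estimated with a clock (rooted tree) and without a clock (unrooted tree), with branch-length estimates and their standard deviations (sd).]

Likelihood with clock: 7.9*10^-14 (parameter estimates 0.31 +/- 0.1 and 0.18 +/- 0.1).

Likelihood without clock: 6.2*10^-12 (parameter estimates 0.34 +/- 0.1 and 0.16 +/- 0.1).

Under the clock hypothesis, 2*[ln(6.2*10^-12) - ln(7.9*10^-14)] (approx. 8.7) is approximately chi-square distributed with (n-2) degrees of freedom, where n is the number of sequences.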

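The Felsenstein Zone: Felsenstein-Cavender (1979)

[Figure: true tree on s1, s2, s3 and s4 with two long and two short leaf branches, next to the tree reconstructed by parsimony, which groups the sequences differently.]

Patterns (16, only 8 shown; one row per sequence, one column per pattern):

0 1 0 0 0 0 0 0
0 0 1 0 0 1 0 1
0 0 0 1 0 1 1 0
0 0 0 0 1 0 1 1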

The Molecular Clock

First noted by Zuckerkandl & Pauling (1964) as an empirical fact.

How can one detect it?

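[Figure, two panels. Known ancestor time: s1 and s2 descend from an ancestor a at time T. Unknown ancestor time: s1 and s2 share an ancestor of unknown age ("?"), and a third sequence s3 is added as an outgroup.]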

History of Phylogenetic Methods

1958 Sokal and Michener publish the UPGMA method for making distance trees with a clock.

1964 Parsimony principle defined, but not advocated, by Edwards and Cavalli-Sforza.

1962-65 Zuckerkandl and Pauling introduce the notion of a Molecular Clock.

1967 First large molecular phylogenies by Fitch and Margoliash.

1969 Heuristic method used by Dayhoff to make trees and reconstruct ancestral sequences.

1970 Neyman analyzes a three-sequence stochastic model with Jukes-Cantor substitution.

1971-73 Fitch, Hartigan & Sankoff independently come up with the same algorithm for reconstructing parsimony ancestral sequences.

1973 Sankoff treats alignment and phylogenies as one general problem: phylogenetic alignment.

1979 Cavender and Felsenstein independently come up with the same evolutionary model in which parsimony is inconsistent. Later called the "Felsenstein Zone".

1981 Felsenstein: Maximum Likelihood Model & Program DNAML (in the PHYLIP program package).

1981 Parsimony tree problem is shown to be NP-Complete.

1985 Felsenstein introduces bootstrapping as a confidence interval on phylogenies.

1986 Bandelt and Dress introduce split decomposition as a generalization of trees.

1985- Many authors (Sawyer, Hein, Stephens, M. Smith) try to address the problem of recombination in phylogenies.

1997-9 Thorne et al., Sanderson & Huelsenbeck introduce the Almost Clock.

2000 Rambaut (and others) make methods that can find trees with non-contemporaneous leaves.

Alignment

Pairwise Alignment Again

Triple – Quadruple - Many

Similarity-Distance Conversion

Local Alignment

Statistical alignment

Conclusion

Parsimony Alignment of two strings

Sequences: s1=CTAGG s2=TTGT.

Basic operations: transitions 2 (C-T & A-G), transversions 5, indels (g) 10.

An alignment of {CTAG,TTG} ends in one of three ways:

                {CTA,TT}   + G/G
{CTAG,TTG}  =   {CTA,TTG}  + G/-
                {CTAG,TT}  + -/G

Initial condition: $D_{0,0} = 0$, where $D_{i,j} = D(s1[1:i], s2[1:j])$.

$D_{i,j} = \min\{D_{i-1,j-1} + d(s1[i],s2[j]),\ D_{i,j-1} + g,\ D_{i-1,j} + g\}$

Example:

D(CTAG,TTG) = min{ D(CTA,TT) + w(G/G) = 12 + 0 = 12,
                   D(CTA,TTG) + w(G/-) = 4 + 10 = 14,
                   D(CTAG,TT) + w(-/G) = 22 + 10 = 32 } = 12

T   40  32  22  14   9  17
G   30  22  12   4  12  22
T   20  12   2  12  22  32
T   10   2  10  20  30  40
     0  10  20  30  40  50
         C   T   A   G   G

Optimal path through the matrix: 0, 2, 2, 12, 12, 17.

Alignment (i = transition, v = transversion), cost 17:

CTAGG
i   v
TT-GT
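The recursion above translates directly into a small dynamic programming routine. The sketch below uses the cost scheme of the example (transition 2, transversion 5, indel 10); the function names are chosen for this illustration only.

```python
PURINES = {"A", "G"}


def sub_cost(x: str, y: str) -> int:
    """0 for identity, 2 for a transition, 5 for a transversion."""
    if x == y:
        return 0
    return 2 if (x in PURINES) == (y in PURINES) else 5


def align_cost(s1: str, s2: str, gap: int = 10) -> int:
    """Minimal parsimony cost of aligning s1 with s2 (linear gap cost)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap  # s1[1:i] against the empty string
    for j in range(1, m + 1):
        D[0][j] = j * gap  # empty string against s2[1:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
                D[i][j - 1] + gap,
                D[i - 1][j] + gap,
            )
    return D[n][m]


print(align_cost("CTAGG", "TTGT"))
```

Running time and memory are both proportional to the product of the two sequence lengths.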

Alignment of three sequences

s1=ATCG s2=ATGCC s3=CTCC

Alignment:

AT-CG
ATGCC
CT-CC

Consensus sequence: ATCC

Configurations in an alignment column (n = nucleotide, - = gap; the all-gap column is excluded):

- - n n n - n
- n - n - n n
n - - - n n n

Recursion: $D_{i,j,k} = \min\{D_{i-i',j-j',k-k'} + d(i,i',j,j',k,k')\}$, minimised over $(i',j',k') \in \{0,1\}^3 \setminus \{(0,0,0)\}$

Initial condition: $D_{0,0,0} = 0$.

Running time: l1*l2*l3*(2^3-1). Memory requirement: l1*l2*l3.

New phenomenon: ancestral sequence.

Parsimony Alignment of four sequences

s1=ATCG s2=ATGCC s3=CTCC s4=ACGCG

Alignment:

AT-CG
ATGCC
CT-CC
ACGCG

Configurations in alignment columns (the all-gap column is excluded):

- - - n - - - n n n - n n n n
- - n - n n - n - - n - n n n
- n - - n - n - n - n n - n n
n - - - - n n - - n n n n - n

Recursion: $D_i = \min_{\Delta}\{D_{i-\Delta} + d(i,\Delta)\}$, $\Delta \in \{0,1\}^4 \setminus \{0\}^4$

Initial condition: $D_0 = 0$.

Computation time: l1*l2*l3*l4*15. Memory: l1*l2*l3*l4.

Alignment of many sequences

s1=ATCG, s2=ATGCC, ......., sn=ACGCG

Alignment:

AT-CG
ATGCC
.....
.....
ACGCG

[Figure: unrooted tree relating s1, s3 and s4 (top) and s2 and s5 (bottom).]

Configurations in an alignment column: 2^n - 1

Recursion: $D_i = \min_{\Delta}\{D_{i-\Delta} + d(i,\Delta)\}$, $\Delta \in \{0,1\}^n \setminus \{0\}^n$

Initial condition: $D_{0,0,..,0} = 0$.

Computation time: l^n * (2^n - 1) * n (l: sequence length, n: number of sequences)

Memory requirement: l^n
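To make the growth concrete, the column configurations can be enumerated and the operation count evaluated. This is an illustrative sketch; the function names are invented for the example.

```python
from itertools import product


def column_configurations(n: int):
    """All columns over n sequences with 'n' (nucleotide) or '-' (gap),
    excluding the column that consists only of gaps."""
    return [c for c in product("-n", repeat=n) if "n" in c]


def dp_operations(length: int, n: int) -> int:
    """Operation count l^n * (2^n - 1) * n for n sequences of equal length."""
    return length ** n * (2 ** n - 1) * n


for n in (2, 3, 4):
    print(n, len(column_configurations(n)), dp_operations(100, n))
```

Already for a handful of sequences of length 100 the exact recursion is out of reach, which motivates the heuristic methods later in this section.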

Close-to-Optimal Alignments (Waterman & Byers, 1983)

A. Alignments within a tolerance e of the optimal cost. Example: e = 2.

T   40  32  22  14   9  17
G   30  22  12   4  12  22
T   20  12   2  12  22  32
T   10   2  10  20  30  40
     0  10  20  30  40  50
         C   T   A   G   G

Alignment (i = transition, v = transversion), cost 19:

CTAGG
i iv
TTGT-

Caveat: There are enormous numbers of suboptimal alignments.

B. Sets of positions that are on some suboptimal alignment.

Longer Indels

TCATGGTACCGTTAGCGT
GCA-----------GCAT

$g_k$: cost of an indel of length k.

Initial condition: $D_{0,0} = 0$

$D_{i,j} = \min\{D_{i-1,j-1} + d(s1[i],s2[j]),\ D_{i,j-1} + g_1,\ D_{i,j-2} + g_2,\ D_{i,j-3} + g_3,\ \ldots,\ D_{i-1,j} + g_1,\ D_{i-2,j} + g_2,\ D_{i-3,j} + g_3,\ \ldots\}$

Cubic running time. Quadratic memory.

If $g_k = a + b \cdot k$, then quadratic running time.

Gotoh (1982): $D_{i,j}$ is split into 3 types:

1. $D0_{i,j}$ as $D_{i,j}$, except s1[i] must match s2[j].

2. $D1_{i,j}$ as $D_{i,j}$, except s2[j] is matched with "-".

3. $D2_{i,j}$ as $D_{i,j}$, except s1[i] is matched with "-".

Then:

$D0_{i,j} = \min(D0_{i-1,j-1},\ D1_{i-1,j-1},\ D2_{i-1,j-1}) + d(s1[i],s2[j])$

$D1_{i,j} = \min(D1_{i,j-1} + b,\ D0_{i,j-1} + a + b)$

$D2_{i,j} = \min(D2_{i-1,j} + b,\ D0_{i-1,j} + a + b)$

Comment:

1. Evolutionary Consistency Condition: $g_i + g_j > g_{i+j}$

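Gotoh Alignment, 1982

Let all substitutions cost 2 and let $g_k = 3 + k$, then align ACGT with AT.

The alignment must be the following, with a cost of 5:

ACGT
A--T

All alignments (minimum over the three types):

T   5   4   2   6   5
A   4   0   4   5   6
    0   4   5   6   7
        A   C   G   T

Alignments ending in the column -/n:

T   5   4  10  11  12
A   4   8   9  10  11
    -   -   -   -   -
        A   C   G   T

Alignments ending in the column n/n:

T   -   6   2   6   5
A   -   0   6   7   8
    0   -   -   -   -
        A   C   G   T

Alignments ending in the column n/-:

T   -   8   9   6   7
A   -   8   4   5   6
    -   4   5   6   7
        A   C   G   T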
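The three-matrix recursion can be sketched as follows for the gap cost g_k = a + b*k. The code follows the recursion as stated on the slide, where a gap is only opened after a matched column; production implementations often also allow a gap in one sequence to follow a gap in the other. The function name and default parameters are choices for this example.

```python
import math


def gotoh_cost(s1: str, s2: str, sub: int = 2, a: int = 3, b: int = 1):
    """Minimal alignment cost with affine gap cost g_k = a + b*k."""
    INF = math.inf
    n, m = len(s1), len(s2)
    D0 = [[INF] * (m + 1) for _ in range(n + 1)]  # s1[i] aligned with s2[j]
    D1 = [[INF] * (m + 1) for _ in range(n + 1)]  # s2[j] aligned with "-"
    D2 = [[INF] * (m + 1) for _ in range(n + 1)]  # s1[i] aligned with "-"
    D0[0][0] = 0
    for j in range(1, m + 1):
        D1[0][j] = a + b * j  # one leading gap of length j
    for i in range(1, n + 1):
        D2[i][0] = a + b * i  # one leading gap of length i
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s1[i - 1] == s2[j - 1] else sub
            D0[i][j] = min(D0[i - 1][j - 1], D1[i - 1][j - 1], D2[i - 1][j - 1]) + d
            D1[i][j] = min(D1[i][j - 1] + b, D0[i][j - 1] + a + b)
            D2[i][j] = min(D2[i - 1][j] + b, D0[i - 1][j] + a + b)
    return min(D0[n][m], D1[n][m], D2[n][m])


print(gotoh_cost("ACGT", "AT"))
```

Each cell is filled in constant time from three neighbouring cells, which gives the quadratic running time.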

Distance-Similarity (Smith-Waterman-Fitch, 1982)

$D_{i,j} = \min\{D_{i-1,j-1} + d(s1[i],s2[j]),\ D_{i,j-1} + g,\ D_{i-1,j} + g\}$

$S_{i,j} = \max\{S_{i-1,j-1} + s(s1[i],s2[j]),\ S_{i,j-1} - w,\ S_{i-1,j} - w\}$

Distance: Transitions: 2, Transversions: 5, Indels: 10

M: largest distance between two nucleotides (5).

Conversion from distance to similarity:

s(n1,n2) = M - d(n1,n2)
w_k = k/(2*M) + g_k
w = 1/(2*M) + g

Similarity Parameters: Transversions: 0, Transitions: 3, Identity: 5, Indels: 10 + 1/10

Each cell shows distance/similarity:

T   40/-40.4   32/-27.3   22/-12.2   14/0.9      9/11.0    17/2.9
G   30/-30.3   22/-17.2   12/-2.1     4/11.0    12/2.9     22/-7.2
T   20/-20.2   12/-7.1     2/8.0     12/-2.1    22/-12.2   32/-22.3
T   10/-10.1    2/3.0     10/-7.1    20/-17.2   30/-27.3   40/-37.4
     0/0       10/-10.1   20/-20.2   30/-30.3   40/-40.4   50/-50.5
                C          T          A          G          G

Comments:

1. The switch from Dist to Sim is highly analogous to maximizing {-f(x)} instead of minimizing {f(x)}.

2. Dist is based on a metric: i. d(x,x) = 0, ii. d(x,y) >= 0, iii. d(x,y) = d(y,x) & iv. d(x,z) + d(z,y) >= d(x,y).

There are no analogous restrictions on Sim, giving it a larger parameter space.

Local alignment

Global alignment: $S_{i,j} = \max\{S_{i-1,j-1} + s(s1[i],s2[j]),\ S_{i,j-1} - w,\ S_{i-1,j} - w\}$

Local alignment: $S_{i,j} = \max\{S_{i-1,j-1} + s(s1[i],s2[j]),\ S_{i,j-1} - w,\ S_{i-1,j} - w,\ 0\}$

Score parameters: Match: 1, Mismatch: -1/3, Gap of length k: 1 + k/3

[Figure: local alignment score matrix for two RNA sequences (column sequence CAGCCUCGCUU...), with the traceback from the maximal cell, score 3.3.]

Best local alignment:

GCC-UCG
GCCAUUG
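A sketch of the local recursion with the score parameters of this example (match 1, mismatch -1/3, gap of length k costing 1 + k/3). Because the gap cost is not simply proportional to k, the sketch searches over all gap lengths at each cell, which gives cubic running time; the function name is hypothetical, and only the two aligned segments are used as input.

```python
def local_score(s1, s2, match=1.0, mismatch=-1.0 / 3.0, gap=lambda k: 1.0 + k / 3.0):
    """Best local alignment score with a general gap cost function."""
    n, m = len(s1), len(s2)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            candidates = [0.0, H[i - 1][j - 1] + s]  # restart, or extend diagonally
            candidates += [H[i - k][j] - gap(k) for k in range(1, i + 1)]  # gap in s2
            candidates += [H[i][j - k] - gap(k) for k in range(1, j + 1)]  # gap in s1
            H[i][j] = max(candidates)
            best = max(best, H[i][j])
    return best


print(round(local_score("GCCAUUG", "GCCUCG"), 1))
```

The zero in the maximisation is what allows an alignment to start anywhere; the best local alignment ends at the cell with the largest value.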

Progressive Alignment (Feng-Doolittle 1987, J. Mol. Evol.)

Can align alignments and, given a tree, make a multiple alignment.

[Figure: guide tree relating the sequences Sodh, Sodb, Sodl, Sddm, Sdmz, Sods and Sdpb.]

Aligning two alignments; the score for a pair of columns (marked *) is the average over all residue pairs:

    *                   *
alkmny-trwq         acdeqrt
akkmdyftrwq         acdehrt
kkkmemftrwq

[P(n,q) + P(n,h) + P(d,q) + P(d,h) + P(e,q) + P(e,h)]/6

Resulting multiple alignment (conserved columns were marked with * in the original figure):

Sodh atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg---ndtagct sagphfnp lsrk
Sodb atkavcvlkgdgpqvqgtinfeak-gdtvkvwgsikglte--glhgfhvhqfg----ndtagct sagphfnp lsrk
Sodl atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg---ndtagct sagphfnp lsrk
Sddm atkavcvlkgdgpqvq---infeak-gdtvkvwgsikglte--glhgfhvhqfg----ndtagct sagphfnp lsrk
Sdmz atkavcvlkgdgpqvq--infeqkesdgpvkvwgsikglte--glhgfhvhqfg----ndtagct sagphfnp lsrk
Sods vatkavcvlkgdgpqvq--infeak-gdtvkvwgsikgltepnglhgfhvhqfg----ndtagct sagphfnp lsrk
Sdpb datkavcvlkgdgpqvq--infeqkesdgpv---wgsikgltglhgfhvhqfgscasndtagctvlggssagphfnpehtnk

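Thorne-Kishino-Felsenstein (1991) Process

[Figure: a sequence of links (#) preceded by an immortal link (*), shown at T = 0 and at T = t; links die at rate $\mu$ and give birth to new links at rate $\lambda$.]

i. $P(s) = (1 - \lambda/\mu)(\lambda/\mu)^{l}\,\pi_A^{\#A} \cdots \pi_T^{\#T}$, where l = length(s)

ii. Time reversible

$\lambda$ & $\mu$ into Alignment Blocks

A. Amino Acids Ignored:

# - - -        # - - - -        * - - - -
# # # #        - # # # #        * # # # #
   k               k                k

$p_k(t) = e^{-\mu t}[1 - \lambda\beta(t)][\lambda\beta(t)]^{k-1}$

$p'_k(t) = [1 - e^{-\mu t} - \mu\beta(t)][1 - \lambda\beta(t)][\lambda\beta(t)]^{k-1}$

$p''_k(t) = [1 - \lambda\beta(t)][\lambda\beta(t)]^{k}$

$p'_0(t) = \mu\beta(t)$, where $\beta(t) = [1 - e^{(\lambda-\mu)t}]/[\mu - \lambda e^{(\lambda-\mu)t}]$

B. AA Considered:

T - - -
R Q S W        $P_t(T \to R) \cdot \pi_Q \cdots \pi_W \cdot p_4(t)$

T - - - -
- R Q S W      $\pi_R \cdot \pi_Q \cdots \pi_W \cdot p'_4(t)$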

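Basic Pairwise Recursion (O(length^3))

[Figure: the cases of the recursion for cell (i,j). Residue i of s1 either survives or dies and leaves 0, 1, 2, ... descendants in s2, so cell (i,j) is computed from the cells (i-1,j), (i-1,j-1), (i-1,j-2), ...]

Example term: $P(s1_{i-1} \to s2_{j-2}) \cdot p_2 \cdot f(s1[i], s2[j-1])$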

$\alpha$-globin (141) and $\beta$-globin (146)

430.108 : -log($\alpha$-globin)
327.320 : -log($\alpha$-globin -> $\beta$-globin)
730.428 : -log(l(sumalign))

$\lambda$*t: 0.0371805 +/- 0.0135899
$\mu$*t: 0.0374396 +/- 0.0136846
s*t: 0.91701 +/- 0.119556

E(Length)    E(Insertions,Deletions)    E(Substitutions)
143.499      5.37255                    131.59

Maximum contributing alignment:

V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADAL
VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF

TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

Ratio l(maxalign)/l(sumalign) = 0.00565064

1966: Levenshtein formulates a distance measure between sequences and introduces a dynamic programming algorithm for finding the distance.

1970: Needleman and Wunsch compare proteins by maximising a similarity score.

1972: Sankoff & Sellers reinvent the basic algorithm. Sankoff can also align subject to the constraint that there must be exactly k indels.

1973: Sankoff makes multiple alignment and phylogeny, both exact & heuristic.

1975: Hirschberg gives a linear memory algorithm.

1976: Waterman gives a cubic algorithm allowing for indels of arbitrary length.

1976: Waterman introduces alignment without reference to phylogeny.

1981: Waterman, Smith and Fitch show the duality of similarity and distance.

1982: Gotoh gives a quadratic algorithm if the gap penalty function is g_k = a + b*k (for an indel of length k). Uses 3 matrices instead of 1.

1983: Waterman and Byers introduce close-to-optimal alignments.

1984-5: Ukkonen, Myers and Fickett accelerate the algorithm considerably.

1984: Hogeweg and Hesper introduce heuristic multiple phylogenetic alignment.

1984: Fredman introduces a triple-alignment generalisation of Needleman-Wunsch.

1985: Lipman & Wilbur use hashing.

1989: Myers introduces alignment with a concave gap penalty function.

1991: Thorne, Kishino & Felsenstein make a good model for statistical alignment, partially introduced in 1986 by Thompson & Bishop.

1991: States & Botstein compare a DNA string with a protein in search of frameshift mutations.

1993-4: Gusfield, Lander, Waterman and others introduce parametric alignment.

1987: Feng-Doolittle introduce phylogenetic alignment: "Once a gap always a gap".

1989: Kececioglu makes a strong acceleration of Sankoff's exact algorithm.

1994: Krogh et al. & Baldi et al. introduce Hidden Markov Models for multiple alignment.

1995: Mitchison & Durbin introduce Tree-HMMs, allowing sequences generated by an HMM to have a "tree correlation" structure, but not based on an explicit evolutionary process.

1999-: Resurgence of interest in statistical alignment.

Database Search

What is the probability model for database search?

What is a PAM matrix?

Statistical Alignment & Homology Testing.

Illustration of database search

Query sequence: q=YQPVNPAL

Database: s1=CVDAEGKYL and s2=TTEQRPKNPATYCG

i. No mismatches, no gaps: longest common segment (length = 3)

q  =    YQPVNPAL
s2 = TTEQRPKNPATYCG
            ***

ii. Mismatches allowed, no gaps

q  =    YQPVNPAL
s2 = TTEQRPKNPATYCG
          * ***

Similarity function s( , ).

Total score, S = s(P,P) + s(V,K) + .. + s(A,A)

iii. Both mismatches & gaps: Local alignment (LA).

q  =   YQ-PVNPAL
s2 = TTEQRPKNPATYCG
        * * ***

Here I1 = QPVNPA and J1 = QRPKNPA.

S = s(Q,Q) - g + s(P,P) + s(V,K) + .. + s(A,A)

Special cases:

i. If g and the mismatch cost are both infinite, LA reduces to the longest common segment.

ii. If only g is infinite, LA reduces to the best segment.

Distributions of Scores.

Model: q and the database are series of iid (independent, identically distributed) random variables, Xi's, where P(Xi=j) = pj (j any of the 20 amino acids).

Scoring scheme:

i. $E(s(X_i,X_j)) = \sum_{i,j} p_i p_j\, s(i,j) < 0$

ii. max s(i,j) > 0 and g < 0.

iii. s(i,j) = s(j,i)

Results:

i. Longest Common Segments. The mean grows proportionally to log(nm), where n and m are the lengths of the 2 sequences.

ii. Best Segment with score S will follow an extreme value distribution.

$P(S > x) = 1 - \exp(-\exp(-\lambda (x - u)))$,

u is a positioning parameter, $\lambda$ a parameter that determines how fast the distribution tails off.

$P(S > x) = 1 - \exp(-K\,m\,n\,e^{-\lambda x})$

$\lambda$ is the positive x that solves $\sum_{i,j} p_i p_j \exp(s(i,j)\,x) = 1$

K is also a known function of the $p_i$'s and the $s(i,j)$'s.

iii. Local Alignment: Distribution unknown. Looks like an extreme value distribution.
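The defining equation for lambda has no closed form in general, but it is easy to solve numerically. The sketch below uses bisection and then evaluates the tail probability; K is taken as an input because its computation is not given here, and the four-letter alphabet with match +1 and mismatch -1 is a toy example (for which lambda = ln 3), not a real protein scoring scheme.

```python
import math


def solve_lambda(p, s, tol=1e-12):
    """Positive root x of sum_ij p_i p_j exp(s(i,j) x) = 1 (needs E[s] < 0, max s > 0)."""
    def f(x):
        return sum(p[i] * p[j] * math.exp(s(i, j) * x) for i in p for j in p) - 1.0

    hi = 1.0
    while f(hi) <= 0.0:  # f is negative between 0 and the root, positive beyond it
        hi *= 2.0
    lo = hi
    while f(lo) >= 0.0:  # step back to a point left of the root
        lo /= 2.0
    while hi - lo > tol:  # plain bisection
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def tail_probability(x, K, lam, m, n):
    """P(S > x) = 1 - exp(-K m n exp(-lam x))."""
    return -math.expm1(-K * m * n * math.exp(-lam * x))


p = {letter: 0.25 for letter in "ACGT"}  # toy uniform composition


def score(i, j):
    return 1.0 if i == j else -1.0


lam = solve_lambda(p, score)
print(lam, tail_probability(20.0, 0.1, lam, 100, 1000))
```

The value K = 0.1 in the last line is an arbitrary placeholder used only to show the call.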

From http://www.vuse.vanderbilt.edu/~mahas1/ce207/type1extremevalue.html

The PAM matrix - Point Accepted Mutations

$W_{i,j} = -\ln\big(\pi_i P^{2.5}_{i,j} / (\pi_i \pi_j)\big)$

Real:                    Random:
s1 = ATWYFCAK-AC         s1  = ATWYFC-AKAC
s2 = ETWYKCALLAD         s2* = LTAYKADCWLE

Z = [score(s1,s2) - E{score(s1,s2*)}] / s.d.{score(s1,s2*)}

s2* is a random permutation of s2
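The Z-score can be estimated by scoring s1 against many random permutations of s2. The sketch below is only illustrative: it uses the number of identical positions in an ungapped comparison as a stand-in for the alignment score (an assumption; in practice any alignment scoring function can be passed in), and the function names are hypothetical.

```python
import random
import statistics


def identity_score(a: str, b: str) -> int:
    """Stand-in score: number of identical positions, no gaps."""
    return sum(x == y for x, y in zip(a, b))


def z_score(s1, s2, score=identity_score, n_perm=1000, seed=0):
    """Z = (score(s1,s2) - mean) / s.d., with mean and s.d. over permutations of s2."""
    rng = random.Random(seed)
    observed = score(s1, s2)
    letters = list(s2)
    null_scores = []
    for _ in range(n_perm):
        rng.shuffle(letters)  # drawing without replacement: a permutation of s2
        null_scores.append(score(s1, "".join(letters)))
    sd = statistics.pstdev(null_scores)
    if sd == 0:
        raise ValueError("permuted scores have zero variance")
    return (observed - statistics.mean(null_scores)) / sd


print(z_score("ATWYFCAKAC", "ETWYKCALLAD"))
```

Permuting s2 keeps its amino acid composition fixed, so the test asks whether the order of the residues, not just their frequencies, explains the score.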

From W. Pearson

The tactics of BLAST I

The most widely used program for database searches in biological sequence databases is BLAST (Altschul et al., 1990) and variants of BLAST.

i. It defines a neighbourhood of segments around the segments composing the query sequence.

ii. It very quickly finds segments in the database that match these neighbourhood segments.

Step I, the segments of the query:

q=YQPVNPAL

{YQPVN, QPVNP, PVNPA, VNPAL}

Step II, the neighbourhood of one segment:

YQPVN

{AQPVN, BQPVN, .., YAPVN, ..}
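The two steps can be sketched as follows. This is a deliberately simplified illustration: the neighbourhood here is every word that differs from the query word at exactly one position, whereas BLAST itself selects neighbours by a score threshold under a substitution matrix. The function names and the 20-letter alphabet constant are assumptions of this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def query_words(q: str, w: int = 5):
    """Step I: all overlapping segments of length w in the query."""
    return [q[i:i + w] for i in range(len(q) - w + 1)]


def one_substitution_neighbours(word: str, alphabet: str = AMINO_ACIDS):
    """Step II (simplified): all words differing from `word` at exactly one position."""
    neighbours = []
    for pos, old in enumerate(word):
        for new in alphabet:
            if new != old:
                neighbours.append(word[:pos] + new + word[pos + 1:])
    return neighbours


def find_hits(database_seq: str, words):
    """Positions in the database sequence where any of the words occurs exactly."""
    wanted = set(words)
    hits = []
    for w in sorted({len(x) for x in wanted}):
        for i in range(len(database_seq) - w + 1):
            if database_seq[i:i + w] in wanted:
                hits.append((i, database_seq[i:i + w]))
    return hits


print(query_words("YQPVNPAL"))
print(len(one_substitution_neighbours("YQPVN")))
print(find_hits("TTEQRPKNPATYCG", query_words("YQPVNPAL", 3)))
```

Looking words up in a set (a hash table) is what makes the scan of the database fast: each database position is tested in constant time.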

The tactics of BLAST II

iii. It heuristically finds large segments giving a good score.

iv. If the score of this good segment is statistically significant, then it is extended 5'-ward and 3'-ward by a local alignment algorithm, giving a proposed local alignment.

Homology Test

$W_{i,j} = -\ln\big(\pi_i P^{2.5}_{i,j} / (\pi_i \pi_j)\big)$

D(s1,s2) is evaluated against the distribution of D(s1,s2*).

Real:                    Random:
s1 = ATWYFCAK-AC         s1  = ATWYFC-AKAC
s2 = ETWYKCALLAD         s2* = LTAYKADCWLE
      *** **  *                 * *

This test:

1. Tests the competing hypotheses that the 2 sequences are 2.5 events apart versus infinitely far apart.

2. It only handles substitutions "correctly". The rationale for indel costs is more arbitrary.

3. It samples from $(\pi_i \pi_j)$ by permuting the order of amino acids in the second sequence, i.e. it draws without replacement (a hypergeometric distribution).

Questions & Summary

Summary:

i. Database search is a local alignment problem.

ii. Scores are evaluated in an extreme value distribution.

iii. Databases can have problems with internal similarities.

Comments & Questions:

1. Fuzzy problem: in principle, all sequences are homologous.

2. A combined model of shift in functionality with sequence evolution would be optimal.

3. If the query is a set of homologous sequences, it is possible to weight important positions.
