Week 36: Trees, Alignment & Database Search
Trees
Alignment
Database Search
Trees
Tree Concepts
Reconstruction Methods
Parsimony
Likelihood
Distance
The Molecular Clock
Topologies
4 sequences with root ignored:
s1        s3      s1        s2      s1        s2
  \      /          \      /          \      /
   ------            ------            ------
  /      \          /      \          /      \
s2        s4      s3        s4      s4        s3
For unrooted trees with k leaves in which every internal node has degree 3, the following holds:
Edges = 2k-3; Internal nodes = k-2; T(k) = (2k-5)T(k-1); T(3) = 1, where T(k) is the number of topologies.
k                3    4    5     6     7     20
T(k)             1    3    15    105   945   ~2.2*10^20
Edges            3    5    7     9     11    37
Internal nodes   1    2    3     4     5     18
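The recursion for T(k) can be checked with a few lines of Python. This is only a sketch; the function names are illustrative and not part of the lecture.

```python
def count_unrooted_trees(k):
    """Number of unrooted binary topologies on k labelled leaves, k >= 3."""
    if k < 3:
        raise ValueError("k must be at least 3")
    count = 1  # T(3) = 1
    for leaves in range(4, k + 1):
        count *= 2 * leaves - 5  # T(k) = (2k-5) * T(k-1)
    return count


def edge_count(k):
    return 2 * k - 3


def internal_node_count(k):
    return k - 2


for k in (3, 4, 5, 6, 7, 20):
    print(k, count_unrooted_trees(k), edge_count(k), internal_node_count(k))
```

The rapid growth of T(k) is the reason exhaustive search over topologies is only feasible for a handful of sequences.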
Central Principles of Phylogeny Reconstruction
Parsimony
Distance
Likelihood
TTCAGT
TCCAGT
GCCAAT
GCCAAT
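[Figure: the four-leaf tree on s1, s2, s3, s4 shown once per principle. Parsimony panel: substitutions counted per branch, total weight 4. Distance panel: branch lengths fitted to the pairwise distances. Likelihood panel: L = 3.1*10^-7, together with the parameter estimates.]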
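The Small Parsimony Problem (Fitch-Hartigan-Sankoff)
        ?
       / \
      ?   ?
     / \ / \
    C  G T  C
[Figure: node N with cost vector L'_N; its left child carries the costs (L_A, L_C, L_G, L_T) and its right child the costs (R_A, R_G, ..).]
Recursion: $L'_N(n) = \min_{n_L}\{L_{N_L}(n_L) + d(n, n_L)\} + \min_{n_R}\{L_{N_R}(n_R) + d(n, n_R)\}$, where $N_L$ and $N_R$ are the left and right children of N.
Initialisation: $L_{leaf}(n) = 0$ if nucleotide n is observed at the leaf, else infinity.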
Fitch-Hartigan-Sankoff Algorithm
Costs: Transition 2, Transversion 5, indel 10. (* = infinity)

               (A, C, G, T)
               (9, 7, 7, 7)
                /        \
      (A, C, G, T)        \
      (10,2,10, 2)         \
        /      \            \
 (A,C,G,T)  (A,C,G,T)   (A,C,G,T)
 (*,0,*,*)  (*,*,*,0)   (*,*,0,*)
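The worked example can be reproduced with a short recursive sketch of the Sankoff cost recursion. The tree encoding (a leaf is a one-letter string, an internal node a pair of subtrees) and the function names are assumptions made for illustration; indels are left out because each leaf carries a single nucleotide.

```python
INF = float("inf")
BASES = "ACGT"
PURINES = {"A", "G"}


def sub_cost(x, y):
    """Transition (within purines or within pyrimidines) 2, transversion 5."""
    if x == y:
        return 0
    return 2 if (x in PURINES) == (y in PURINES) else 5


def sankoff(tree):
    """Return the cost vector, ordered as BASES, for the root of `tree`."""
    if isinstance(tree, str):  # leaf: cost 0 for the observed base
        return [0 if base == tree else INF for base in BASES]
    left, right = (sankoff(child) for child in tree)
    costs = []
    for n in BASES:
        best_left = min(left[i] + sub_cost(n, m) for i, m in enumerate(BASES))
        best_right = min(right[i] + sub_cost(n, m) for i, m in enumerate(BASES))
        costs.append(best_left + best_right)
    return costs


print(sankoff(("C", "T")))         # [10, 2, 10, 2]
print(sankoff((("C", "T"), "G")))  # [9, 7, 7, 7]
```

The minimum of the root vector (7) is the parsimony cost of the tree; tracing back which child states achieved each minimum gives the ancestral assignments.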
Distance Concepts on Trees I
A: Metric d( , ):
i: d(a,b) = 0 <=> a = b
ii: d(a,b) = d(b,a)
iii: d(a,b) <= d(a,c) + d(c,b)
[Figure: triangle with corners a, b, c illustrating the triangle inequality.]
Tree Metric: (distance function originates from tree)
d(x,y) + d(z,w) = d(x,z) + d(y,w) > d(x,w) + d(y,z), where x,y,z,w is some permutation of a,b,c,d.
(> implies that no branch has length 0)
Distance Concepts on Trees II
[Figure: unrooted tree with leaves s1, s2, s3, s4.]
Reconstruction Principle: d(s1,i) = (d(s1,s2) + d(s1,s3) - d(s2,s3))/2
[Figure: leaves s1, s2, s3 joined at the internal node i.]
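A small numerical check of the reconstruction principle, assuming hypothetical branch lengths 1, 2 and 3 from the internal node i to s1, s2 and s3:

```python
def distance_to_junction(d12, d13, d23):
    """d(s1, i) for the node i where the paths between s1, s2, s3 meet."""
    return (d12 + d13 - d23) / 2


# Hypothetical branch lengths from i to each leaf.
b1, b2, b3 = 1.0, 2.0, 3.0
d12, d13, d23 = b1 + b2, b1 + b3, b2 + b3  # path lengths on the tree

print(distance_to_junction(d12, d13, d23))  # 1.0, the branch length of s1
```

The same formula with the roles of the leaves exchanged recovers the other two branch lengths.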
Ultra Metric: (distance function originates from tree)
d(x,y) = d(x,z) > d(y,z), where x,y,z is some permutation of a,b,c. (> implies that no branch has length 0)
Distance Concepts on Trees III
[Figure: rooted tree in which s1 and s2 join at the internal node i, with s3 as the third leaf.]
Reconstruction Principle: d(s1,i) = d(s1,s2)/2
Evolutionary Substitution Process
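[Figure: substitution process along a lineage over the time intervals t1 and t2, labelled with the nucleotides C, C, A.]
$P_{i,j}(t)$ = probability of going from i to j in time t.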
Probability of a pattern - summing over internal states
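[Figure: tree with observed nucleotides at the leaves and unknown states (?) at the internal nodes; the probability of the pattern is obtained by summing over A, C, G, T at every internal node.]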
Felsenstein's recursion
Conditional likelihood: CL(v,N) is the probability of observing the nucleotides at the leaves of the subtree hanging from v, given that nucleotide N is found at v.
P(v1,v2,N1,N2) = probability of N2 at v2 given N1 at v1, where v2 is a child of v1.
I) Initial condition: CL(leaf,N) = 1{N is at leaf}
II) Recursion:
$CL(v,N) = \left[\sum_{N_l} P(v,v_l,N,N_l)\,CL(v_l,N_l)\right] \cdot \left[\sum_{N_r} P(v,v_r,N,N_r)\,CL(v_r,N_r)\right]$
where r refers to right son, l to left son.
III) Likelihood at position p: $L(p) = \sum_N \pi_N\,CL_p(root,N)$.
IV) Total likelihood, product over all positions: $L = \prod_p L(p)$.
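Steps I to IV can be sketched in Python as follows. The Jukes-Cantor transition probabilities, the uniform root distribution, the nested-tuple tree encoding and the branch lengths in `three_leaf_tree` are all assumptions chosen for illustration; the slides do not fix a substitution model here.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(i, j, t):
    """Jukes-Cantor P_ij(t), with t in expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def conditional(node):
    """Return {N: CL(v, N)}. A leaf is a base; an internal node is
    ((left_subtree, t_left), (right_subtree, t_right))."""
    if isinstance(node, str):  # step I
        return {n: 1.0 if n == node else 0.0 for n in BASES}
    (left, t_l), (right, t_r) = node
    cl_l, cl_r = conditional(left), conditional(right)
    return {  # step II
        n: sum(jc_prob(n, m, t_l) * cl_l[m] for m in BASES)
        * sum(jc_prob(n, m, t_r) * cl_r[m] for m in BASES)
        for n in BASES
    }


def site_likelihood(root):
    """Step III with a uniform root distribution, pi_N = 1/4."""
    cl = conditional(root)
    return sum(0.25 * cl[n] for n in BASES)


def three_leaf_tree(a, b, c):
    """Hypothetical topology ((a,b),c) with made-up branch lengths."""
    inner = ((a, 0.1), (b, 0.2))
    return ((inner, 0.05), (c, 0.3))


def total_likelihood(columns):
    """Step IV: product over alignment columns."""
    return math.prod(site_likelihood(three_leaf_tree(*col)) for col in columns)


print(total_likelihood(["AAA", "CCT", "GGG"]))
```

A useful sanity check is that the site likelihoods of all 64 possible columns add up to 1, since they form a probability distribution over patterns.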
Output from Likelihood Method
[Figure: maximum-likelihood trees for s1-s5 estimated with a clock and without a clock, annotated with branch lengths and their standard deviations; the values are not recoverable from the transcript.]
Likelihood with clock: 7.9*10^-14 (accompanying estimates as extracted: 0.31.1, 0.18.1). Likelihood without clock: 6.2*10^-12 (accompanying estimates as extracted: 0.34.1, 0.16.1).
2[ln(6.2*10^-12) - ln(7.9*10^-14)] is χ2-distributed with (n-2) degrees of freedom.
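The clock test can be worked through numerically. The sketch below assumes the usual likelihood-ratio statistic (twice the log-likelihood difference), n = 5 sequences as in the figure, and the tabulated 5% critical value of the χ2 distribution with 3 degrees of freedom (7.815).

```python
import math

CHI2_95_DF3 = 7.815  # tabulated 95% quantile of chi-square with 3 d.f.


def clock_lrt(lik_clock, lik_no_clock, n_sequences):
    """Likelihood-ratio statistic and degrees of freedom for the clock test."""
    stat = 2.0 * (math.log(lik_no_clock) - math.log(lik_clock))
    df = n_sequences - 2
    return stat, df


stat, df = clock_lrt(7.9e-14, 6.2e-12, n_sequences=5)
print(round(stat, 2), df)       # statistic of about 8.73 on 3 d.f.
print(stat > CHI2_95_DF3)       # True: the clock is rejected at the 5% level
```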
The Felsenstein Zone
Felsenstein-Cavender (1979)
[Figure: four-leaf tree with leaves s1, s2, s3, s4.]
Patterns (16, only 8 shown):
0 1 0 0 0 0 0 0
0 0 1 0 0 1 0 1
0 0 0 1 0 1 1 0
0 0 0 0 1 0 1 1
[Figure: True Tree and Reconstructed Tree, each on the leaves s1, s2, s3, s4.]
The Molecular Clock
First noted by Zuckerkandl & Pauling (1964) as an empirical fact.
How can one detect it?
Known ancestor time            Unknown ancestor time

      /\   a at time T                /\
     /  \                            /  \
    /    \                          ?    \
   /      \                        /\     \
  /        \                      /  \     \
 s1        s2                   s1    s2    s3
History of Phylogenetic Methods
1958: Sokal and Michener publish the UPGMA method for making distance trees with a clock.
1964: Parsimony principle defined, but not advocated, by Edwards and Cavalli-Sforza.
1962-65: Zuckerkandl and Pauling introduce the notion of a Molecular Clock.
1967: First large molecular phylogenies by Fitch and Margoliash.
1969: Heuristic method used by Dayhoff to make trees and reconstruct ancestral sequences.
1970: Neyman analyzes a three-sequence stochastic model with Jukes-Cantor substitution.
1971-73: Fitch, Hartigan & Sankoff independently come up with the same algorithm for reconstructing parsimony ancestral sequences.
1973: Sankoff treats alignment and phylogenies as one general problem: phylogenetic alignment.
1979: Cavender and Felsenstein independently come up with the same evolutionary model in which parsimony is inconsistent. Later called the "Felsenstein Zone".
1981: Felsenstein Maximum Likelihood Model & Program DNAML (in the PHYLIP program package).
1981: Parsimony tree problem is shown to be NP-Complete.
1985: Felsenstein introduces bootstrapping as a way of putting confidence intervals on phylogenies.
1986: Bandelt and Dress introduce split decomposition as a generalization of trees.
1985-: Many authors (Sawyer, Hein, Stephens, M. Smith) try to address the problem of recombinations in phylogenies.
1997-9: Thorne et al., Sanderson & Huelsenbeck introduce the Almost Clock.
2000: Rambaut (and others) make methods that can find trees with non-contemporaneous leaves.
Alignment
Pairwise Alignment Again
Triple – Quadruple - Many
Similarity-Distance Conversion
Local Alignment
Statistical alignment
Conclusion
Parsimony Alignment of two strings
Sequences: s1=CTAGG, s2=TTGT.
Basic operations: transitions 2 (C-T & A-G), transversions 5, indels (g) 10.
An alignment of CTAG with TTG ends in one of three ways:
{CTAG,TTG} = {CTA,TT} + GG
             {CTA,TTG} + G-
             {CTAG,TT} + -G
Initial condition: $D_{0,0} = 0$, where $D_{i,j} = D(s1[1:i], s2[1:j])$.
$D_{i,j} = \min\{D_{i-1,j-1} + d(s1[i],s2[j]),\; D_{i,j-1} + g,\; D_{i-1,j} + g\}$
$D_{4,3} = D_{CTAG,TTG}$ = min of:
  $D_{CTA,TT}$ + w(GG) = 12 + 0 = 12
  $D_{CTA,TTG}$ + w(G-) = 4 + 10 = 14
  $D_{CTAG,TT}$ + w(-G) = 22 + 10 = 32

T   40  32  22  14   9  17
G   30  22  12   4  12  22
T   20  12   2  12  22  32
T   10   2  10  20  30  40
     0  10  20  30  40  50
         C   T   A   G   G

Alignment (i = transition, v = transversion), Cost 17:
CTAGG
i   v
TT-GT
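The matrix above can be recomputed with a direct implementation of the recursion. This is a minimal sketch using the slide's costs; the function names are illustrative, and only the cost matrix is computed (no traceback).

```python
PURINES = {"A", "G"}


def sub_cost(x, y):
    """Identity 0, transition 2, transversion 5."""
    if x == y:
        return 0
    return 2 if (x in PURINES) == (y in PURINES) else 5


def alignment_matrix(s1, s2, gap=10):
    """D[i][j] = minimal cost of aligning s1[:i] with s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
                D[i][j - 1] + gap,
                D[i - 1][j] + gap,
            )
    return D


D = alignment_matrix("CTAGG", "TTGT")
print(D[4][3])  # 12, the cell worked out above
print(D[5][4])  # 17, the cost of the optimal alignment
```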
Alignment of three sequences.
s1=ATCG s2=ATGCC s3=CTCC
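Alignment:
AT-CG
ATGCC
CT-CC
Consensus sequence: ATCC
Configurations in an alignment column (n = nucleotide, - = gap; the last, all-gap column is not allowed):
- - n n n - n -
- n - n - n n -
n - - - n n n -
Recursion: $D_{i,j,k} = \min\{D_{i-i',j-j',k-k'} + d(i,i',j,j',k,k')\}$, minimised over $(i',j',k') \in \{0,1\}^3 \setminus \{(0,0,0)\}$
Initial condition: $D_{0,0,0} = 0$.
Running time: $l_1 l_2 l_3 (2^3-1)$. Memory requirement: $l_1 l_2 l_3$.
New phenomenon: ancestral sequence.
[Figure: tree fragment with the nucleotides G, G, C, C at its nodes.]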
Parsimony Alignment of four sequences
s1=ATCG s2=ATGCC s3=CTCC s4=ACGCG
Alignment:
AT-CG
ATGCC
CT-CC
ACGCG
Configurations in alignment columns (the last, all-gap column is not allowed):
- - - n - - - n n n - n n n n -
- - n - n n - n - - n - n n n -
- n - - n - n - n - n n - n n -
n - - - - n n - - n n n n - n -
Recursion: $D_i = \min\{D_{i-\Delta} + d(i,\Delta)\}$, $\Delta \in \{0,1\}^4 \setminus \{0\}^4$
Initial condition: $D_0 = 0$.
Computation time: $l_1 l_2 l_3 l_4 \cdot 15$. Memory: $l_1 l_2 l_3 l_4$.
Alignment of many sequences
s1=ATCG, s2=ATGCC, ......., sn=ACGCG
Alignment:  AT-CG       s1  s3  s4
            ATGCC         \  |  /
            .....        ----------
            .....         /      \
            ACGCG        s2      s5
Configurations in an alignment column: $2^n - 1$
Recursion: $D_i = \min\{D_{i-\Delta} + d(i,\Delta)\}$, $\Delta \in \{0,1\}^n \setminus \{0\}^n$
Initial condition: $D_{0,0,..,0} = 0$.
Computation time: $l^n (2^n - 1) n$ (l: sequence length, n: number of sequences)
Memory requirement: $l^n$
Close-to-Optimal Alignments (Waterman & Byers, 1983)
A. Alignments within ε of optimal. Ex. ε = 2

T   40  32  22  14   9  17
G   30  22  12   4  12  22
T   20  12   2  12  22  32
T   10   2  10  20  30  40
     0  10  20  30  40  50
         C   T   A   G   G
(In the original, the cells on the suboptimal path are starred.)

Alignment (i = transition, v = transversion), Cost 19:
CTAGG
i iv
TTGT-
Caveat: There are enormous numbers of suboptimal alignments.
B. Sets of positions that are on some suboptimal alignment.
Longer Indels
TCATGGTACCGTTAGCGT
GCA-----------GCAT
$g_k$: cost of indel of length k.
Initial condition: $D_{0,0} = 0$
$D_{i,j} = \min\{D_{i-1,j-1} + d(s1[i],s2[j]),\; D_{i,j-1} + g_1,\, D_{i,j-2} + g_2,\, D_{i,j-3} + g_3, \ldots,\; D_{i-1,j} + g_1,\, D_{i-2,j} + g_2,\, D_{i-3,j} + g_3, \ldots\}$
Cubic running time. Quadratic memory.
If $g_k = a + b k$, then quadratic running time.
Gotoh (1982): $D_{i,j}$ is split into 3 types:
1. $D^0_{i,j}$: as $D_{i,j}$, except s1[i] must match s2[j].
2. $D^1_{i,j}$: as $D_{i,j}$, except s2[j] is matched with "-".
3. $D^2_{i,j}$: as $D_{i,j}$, except s1[i] is matched with "-".
Then: $D^0_{i,j} = \min(D^0_{i-1,j-1}, D^1_{i-1,j-1}, D^2_{i-1,j-1}) + d(s1[i],s2[j])$
$D^1_{i,j} = \min(D^1_{i,j-1} + b,\; D^0_{i,j-1} + a + b)$
$D^2_{i,j} = \min(D^2_{i-1,j} + b,\; D^0_{i-1,j} + a + b)$
Comment: 1. Evolutionary Consistency Condition: $g_i + g_j > g_{i+j}$
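The three-matrix recursion can be sketched directly in Python. The code follows the recursions as written above (a gap state is entered only from the match state or extended from itself); the function name, the use of infinity for unreachable cells and the substitution cost passed in as a function are illustrative choices.

```python
INF = float("inf")


def gotoh_cost(s1, s2, sub_cost, a, b):
    """Minimal alignment cost with gap cost g_k = a + b*k."""
    n, m = len(s1), len(s2)
    D0 = [[INF] * (m + 1) for _ in range(n + 1)]  # last column: s1[i] over s2[j]
    D1 = [[INF] * (m + 1) for _ in range(n + 1)]  # last column: "-" over s2[j]
    D2 = [[INF] * (m + 1) for _ in range(n + 1)]  # last column: s1[i] over "-"
    D0[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                D0[i][j] = min(
                    D0[i - 1][j - 1], D1[i - 1][j - 1], D2[i - 1][j - 1]
                ) + sub_cost(s1[i - 1], s2[j - 1])
            if j > 0:  # extend a gap (cost b) or open one (cost a + b)
                D1[i][j] = min(D1[i][j - 1] + b, D0[i][j - 1] + a + b)
            if i > 0:
                D2[i][j] = min(D2[i - 1][j] + b, D0[i - 1][j] + a + b)
    return min(D0[n][m], D1[n][m], D2[n][m])


def flat_cost(x, y):
    """All substitutions cost 2, identities cost 0."""
    return 0 if x == y else 2


print(gotoh_cost("ACGT", "AT", flat_cost, a=3, b=1))  # 5
```

With all substitutions costing 2 and g_k = 3 + k, aligning ACGT with AT gives cost 5, a single indel of length 2 between two identities.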
Gotoh Alignment, 1981
Let all substitutions cost 2 and let $g_k = 3 + k$, then align ACGT with AT.
The alignment must be as follows, with a cost of 5:
ACGT
A--T

D                          - n
T   5   6   2   6   5      T   5   4  10  11  12
A   4   0   4   5   6      A   4   8   9  10  11
    0   4   5   6   7          -   -   -   -   -
        A   C   G   T              A   C   G   T

n n                        n -
T   -   6   2   6   5      T   -   8   9   6   7
A   -   0   6   7   8      A   -   8   4   5   6
    0   -   -   -   -          -   4   5   6   7
        A   C   G   T              A   C   G   T
Distance-Similarity (Smith-Waterman-Fitch, 1982)
$D_{i,j} = \min\{D_{i-1,j-1} + d(s1[i],s2[j]),\; D_{i,j-1} + g,\; D_{i-1,j} + g\}$
$S_{i,j} = \max\{S_{i-1,j-1} + s(s1[i],s2[j]),\; S_{i,j-1} - w,\; S_{i-1,j} - w\}$
Distance: Transitions: 2, Transversions: 5, Indels: 10
M: largest distance between two nucleotides (5).
Similarity:
s(n1,n2) = M - d(n1,n2)
$w_k = k/(2M) + g_k$
w = 1/(2M) + g
Similarity Parameters: Transversions: 0, Transitions: 3, Identity: 5, Indels: 10 + 1/10
Each cell shows distance/similarity:
T   40/-40.4   32/-27.3   22/-12.2   14/0.9     9/11.0     17/2.9
G   30/-30.3   22/-17.2   12/-2.1    4/11.0     12/2.9     22/-7.2
T   20/-20.2   12/-7.1    2/8.0      12/-2.1    22/-12.2   32/-22.3
T   10/-10.1   2/3.0      10/-7.1    20/-17.2   30/-27.3   40/-37.4
    0/0        10/-10.1   20/-20.2   30/-30.3   40/-40.4   50/-50.5
               C          T          A          G          G
Comments:
1. The switch from Dist to Sim is highly analogous to maximizing {-f(x)} instead of minimizing {f(x)}.
2. Dist will be based on a metric: i. d(x,x) = 0, ii. d(x,y) >= 0, iii. d(x,y) = d(y,x) & iv. d(x,z) + d(z,y) >= d(x,y).
There are no analogous restrictions on Sim, giving it a larger parameter space.
Local alignment
Global Alignment: $S_{i,j} = \max\{S_{i-1,j-1} + s(s1[i],s2[j]),\; S_{i,j-1} - w,\; S_{i-1,j} - w\}$
Local: $S_{i,j} = \max\{S_{i-1,j-1} + s(s1[i],s2[j]),\; S_{i,j-1} - w,\; S_{i-1,j} - w,\; 0\}$
Score Parameters: Match: 1, Mismatch: -1/3, Gap: 1 + k/3 (for a gap of length k)
[Figure: local alignment score matrix with CAGCCUCGCUU along the horizontal axis; the largest entry is 3.3, and the traceback from it gives the alignment below.]
GCC-UCG
GCCAUUG
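A compact sketch of local alignment with the score parameters above. It uses the straightforward cubic recursion that tries every gap length, rather than an optimised affine-gap version, and it is run on the two aligned segments only (an assumption made to keep the example small); the function name is illustrative.

```python
def local_alignment_score(a, b, match=1.0, mismatch=-1.0 / 3.0,
                          gap=lambda k: 1.0 + k / 3.0):
    """Best local alignment score; gap(k) is the penalty for a gap of length k."""
    n, m = len(a), len(b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = max(H[i - k][j] - gap(k) for k in range(1, i + 1))
            left = max(H[i][j - k] - gap(k) for k in range(1, j + 1))
            H[i][j] = max(0.0, diag, up, left)  # the 0 lets an alignment restart
            best = max(best, H[i][j])
    return best


print(round(local_alignment_score("GCCAUUG", "GCCUCG"), 2))  # 3.33
```

The score 3.33 corresponds to five matches, one mismatch and one gap of length 1: 5 - 1/3 - 4/3.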
[Figure: guide tree relating the sequences Sodh, Sodb, Sodl, Sddm, Sdmz, Sods and Sdpb.]
Progressive Alignment (Feng-Doolittle 1987, J. Mol. Evol.)
Can align alignments and, given a tree, make a multiple alignment.
    *                *
             alkmny-trwq
acdeqrt      akkmdyftrwq
acdehrt      kkkmemftrwq
Score for aligning the two starred columns: [P(n,q) + P(n,h) + P(d,q) + P(d,h) + P(e,q) + P(e,h)]/6
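The column-against-column score is an average over all residue pairs, one residue from each column. A minimal sketch, with a hypothetical identity score standing in for the pair score P (in practice a substitution matrix would be used, and gaps would need their own treatment):

```python
from itertools import product


def column_score(col_a, col_b, pair_score):
    """Average of pair_score over all pairs (x from col_a, y from col_b)."""
    pairs = list(product(col_a, col_b))
    return sum(pair_score(x, y) for x, y in pairs) / len(pairs)


def identity_score(x, y):
    """Hypothetical stand-in for P(x, y): 1 for identical residues, else 0."""
    return 1.0 if x == y else 0.0


print(column_score("nde", "qh", identity_score))  # 0.0, six pairs, none identical
print(column_score("aa", "ab", identity_score))   # 0.5
```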
(Conserved columns are marked with * in the original.)
Sodh atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg---ndtagct sagphfnp lsrk
Sodb atkavcvlkgdgpqvqgtinfeak-gdtvkvwgsikglte--glhgfhvhqfg----ndtagct sagphfnp lsrk
Sodl atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg---ndtagct sagphfnp lsrk
Sddm atkavcvlkgdgpqvq---infeak-gdtvkvwgsikglte--glhgfhvhqfg----ndtagct sagphfnp lsrk
Sdmz atkavcvlkgdgpqvq--infeqkesdgpvkvwgsikglte--glhgfhvhqfg----ndtagct sagphfnp lsrk
Sods vatkavcvlkgdgpqvq--infeak-gdtvkvwgsikgltepnglhgfhvhqfg----ndtagct sagphfnp lsrk
Sdpb datkavcvlkgdgpqvq--infeqkesdgpv---wgsikgltglhgfhvhqfgscasndtagctvlggssagphfnpehtnk
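[Figure: a sequence at T = 0 and at T = t, drawn as an immortal link (*) followed by mortal links (#) carrying the nucleotides A, C, G.]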
Thorne-Kishino-Felsenstein (1991) Process
i. $P(s) = (1 - \lambda/\mu)(\lambda/\mu)^l \, \pi_A^{\#A} \cdots \pi_T^{\#T}$, where l = length(s)
ii. Time reversible
& into Alignment Blocks
A. Amino Acids Ignored:
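Ancestral link survives     Ancestral link dies        Immortal link
# - - -                     # - - - -                  * - - - -
# # # #                     - # # # #                  * # # # #
   k                           k                          k
$p_k(t) = e^{-\mu t}[1 - \lambda\beta(t)](\lambda\beta(t))^{k-1}$
$p'_k(t) = [1 - e^{-\mu t} - \mu\beta(t)][1 - \lambda\beta(t)](\lambda\beta(t))^{k-1}$
$p''_k(t) = [1 - \lambda\beta(t)](\lambda\beta(t))^{k}$
$p'_0(t) = \mu\beta(t)$, where $\beta(t) = [1 - e^{(\lambda-\mu)t}] / [\mu - \lambda e^{(\lambda-\mu)t}]$
B. AA Considered:
T - - -
R Q S W        $P_t(T \to R) \cdot \pi_Q \cdots \pi_W \cdot p_4(t)$

T - - - -
- R Q S W      $\pi_R \cdot \pi_Q \cdots \pi_W \cdot p'_4(t)$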
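The block probabilities of the Thorne-Kishino-Felsenstein model can be coded directly from the standard TKF91 formulas, with insertion rate λ (`lam`) and deletion rate μ (`mu`), assuming λ < μ. The function names and the parameter values in the example are illustrative only.

```python
import math


def beta(t, lam, mu):
    e = math.exp((lam - mu) * t)
    return (1.0 - e) / (mu - lam * e)


def p_survive(k, t, lam, mu):
    """p_k(t): ancestral link survives and leaves k links in total, k >= 1."""
    b = beta(t, lam, mu)
    return math.exp(-mu * t) * (1.0 - lam * b) * (lam * b) ** (k - 1)


def p_die(k, t, lam, mu):
    """p'_k(t): ancestral link dies and leaves k inserted links, k >= 0."""
    b = beta(t, lam, mu)
    if k == 0:
        return mu * b
    return (1.0 - math.exp(-mu * t) - mu * b) * (1.0 - lam * b) * (lam * b) ** (k - 1)


def p_immortal(k, t, lam, mu):
    """p''_k(t): the immortal link leaves k inserted links, k >= 0."""
    b = beta(t, lam, mu)
    return (1.0 - lam * b) * (lam * b) ** k


t, lam, mu = 1.0, 0.03, 0.04  # hypothetical values with lam < mu
mortal_total = sum(p_survive(k, t, lam, mu) for k in range(1, 200)) + sum(
    p_die(k, t, lam, mu) for k in range(0, 200)
)
immortal_total = sum(p_immortal(k, t, lam, mu) for k in range(0, 200))
print(mortal_total, immortal_total)  # both very close to 1
```

That each family of probabilities sums to 1 is a quick consistency check: a mortal link either survives or dies, and in both cases leaves some number of descendants.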
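Basic Pairwise Recursion (O(length^3))
[Figure: the cases of the recursion for the prefixes s1[1:i] and s2[1:j], according to whether s1[i] survives or dies; the cells (i-1,j), (i-1,j-1), (i-1,j-2), ... contribute to cell (i,j).]
Example term: $P(s1_{i-1} \to s2_{j-2}) \cdot p_2 \cdot f(s1[i], s2[j-1])$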
α-globin (141) and β-globin (146)
430.108 : -log(-globin)
327.320 : -log(-globin -> -globin)
730.428 : -log(l(sumalign))
λ*t: 0.0371805 +/- 0.0135899
μ*t: 0.0374396 +/- 0.0136846
s*t: 0.91701 +/- 0.119556
E(Length): 143.499    E(Insertions, Deletions): 5.37255    E(Substitutions): 131.59
Maximum contributing alignment:
V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADAL
VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF

TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
Ratio l(maxalign)/l(sumalign) = 0.00565064
1966: Levenshtein formulates a distance measure between sequences and introduces a dynamic programming algorithm for finding the distance.
1970: Needleman and Wunsch compare proteins by maximising a similarity score.
1972: Sankoff & Sellers reinvent the basic algorithm. Sankoff can also align subject to the constraint that there must be exactly k indels.
1973: Sankoff makes multiple alignment and phylogeny, both exact & heuristic.
1975: Hirschberg gives a linear memory algorithm.
1976: Waterman gives a cubic algorithm allowing for indels of arbitrary length.
1976: Waterman introduces alignment without reference to phylogeny.
1981: Waterman, Smith and Fitch show the duality of similarity and distance.
1982: Gotoh gives a quadratic algorithm if the gap penalty function is gk = a + b*k (for an indel of length k). Uses 3 matrices instead of 1.
1983: Waterman and Byers introduce close-to-optimal alignments.
1984-5: Ukkonen, Myers, Fickett accelerate the algorithm considerably.
1984: Hogeweg and Hesper introduce heuristic multiple phylogenetic alignment.
1984: Fredman introduces a triple alignment generalisation of Needleman-Wunsch.
1985: Lipman & Wilbur use hashing.
1989: Myers introduces alignment with a concave gap penalty function.
1991: Thorne, Kishino & Felsenstein make a good model for statistical alignment, partially introduced in 1986 by Thompson & Bishop.
1991: States & Botstein compare a DNA string with a protein in search of frameshift mutations.
1993-4: Gusfield, Lander, Waterman and others introduce parametric alignment.
1987: Feng-Doolittle introduce phylogenetic alignment: "Once a gap always a gap".
1989: Kececioglu makes a strong acceleration of Sankoff's exact algorithm.
1994: Krogh et al. & Baldi et al. introduce Hidden Markov Models for multiple alignment.
1995: Mitchison & Durbin introduce Tree-HMMs, allowing sequences generated by an HMM to have a "tree correlation" structure, but not based on an explicit evolutionary process.
1999-: Resurgence of interest in statistical alignment.
Database Search
What is the probability model for database search?
What is a PAM matrix?
Statistical Alignment & Homology Testing.
Illustration of database search
Query sequence: q=YQPVNPAL
Database: s1=CVDAEGKYL and s2=TTEQRPKNPATYCG
i. No mismatches - no gaps: longest common segment
q  =    YQPVNPAL
s2 = TTEQRPKNPATYCG
            ***          length = 3
ii. Mismatches allowed - no gaps
q  =    YQPVNPAL
s2 = TTEQRPKNPATYCG
          * ***
Similarity function s( , ).
Total score, S = s(P,P) + s(V,K) + .. + s(A,A)
iii. Both mismatches & gaps: Local alignment (LA).
q  =   YQ-PVNPAL
s2 = TTEQRPKNPATYCG
        * * ***
Here I1 = QPVNPA and J1 = QRPKNPA.
S = s(Q,Q) - g + s(P,P) + s(V,K) + .. + s(A,A)
i. If g & mismatch cost = infinity, LA reduces to longest common segment.
ii. If g = infinity, LA reduces to best segment.
Distributions of Scores.
Model: q and the database are series of iid (independent, identically distributed) random variables, Xi's, where P(Xi=j) = pj (j any of the 20 amino acids).
Scoring scheme:
i. $E(s(X_i,X_j)) = \sum_{i,j} p_i p_j s(i,j) < 0$
ii. max s(i,j) > 0 and g < 0.
iii. s(i,j) = s(j,i)
i. Longest Common Segments. The mean length grows proportionally to log(nm), where n and m are the lengths of the 2 sequences.
ii. Best Segment: the score S will follow an extreme value distribution,
$P(S > x) = 1 - \exp(-\exp(-\lambda (x - u)))$,
where u is a positioning parameter and λ a parameter that determines how fast the distribution tails off. In terms of the sequence lengths:
$P(S > x) = 1 - \exp(-K m n \exp(-\lambda x))$
λ is the positive x that solves $\sum_{i,j} p_i p_j \exp(s(i,j)\,x) = 1$.
K is also a known function of the $p_i$'s and the $s_{ij}$'s.
iii. Local Alignment: Distribution unknown. Looks like an Extreme Value Distribution.
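The parameter λ has no closed form in general, but it is easy to find numerically. The sketch below uses bisection on the defining equation and assumes the expected score is negative and at least one score is positive (the scoring-scheme conditions above). The function names are illustrative, and K is treated as a given number because its formula is not stated in the slides.

```python
import math


def extreme_value_lambda(freqs, score):
    """Positive root of sum_ij p_i p_j exp(lambda * s(i,j)) = 1, by bisection."""
    letters = list(freqs)

    def f(lam):
        return sum(
            freqs[x] * freqs[y] * math.exp(lam * score(x, y))
            for x in letters
            for y in letters
        ) - 1.0

    hi = 1.0
    while f(hi) <= 0.0:  # f is negative between 0 and the root
        hi *= 2.0
    lo = hi
    while f(lo) > 0.0:
        lo /= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


def p_value(x, m, n, lam, K):
    """P(S > x) for sequences of lengths m and n, given lambda and K."""
    return 1.0 - math.exp(-K * m * n * math.exp(-lam * x))


# Toy scheme: uniform nucleotides, match +1, mismatch -1 (expected score -1/2).
uniform = {base: 0.25 for base in "ACGT"}
lam = extreme_value_lambda(uniform, lambda x, y: 1 if x == y else -1)
print(lam, math.log(3))  # for this toy scheme the root is ln 3
```

For the toy scheme the equation reduces to e^λ/4 + 3e^(-λ)/4 = 1, whose positive root is ln 3, which gives an independent check on the solver.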
From http://www.vuse.vanderbilt.edu/~mahas1/ce207/type1extremevalue.html
The PAM matrix - Point Accepted Mutations
$W_{i,j} = -\ln(\pi_i P^{2.5}_{i,j} / (\pi_i \pi_j))$
Real:   s1 = ATWYFCAK-AC      Random: s1  = ATWYFC-AKAC
        s2 = ETWYKCALLAD              s2* = LTAYKADCWLE
Z = [score(s1,s2) - E{score(s1,s2*)}] / s.d.{score(s1,s2*)}
s2* is a random permutation of s2.
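The Z-score can be estimated by shuffling s2 repeatedly. In this sketch the scoring function is a hypothetical stand-in (number of identical residues at ungapped positions); in the lecture's setting it would be the alignment score under the chosen matrix. The number of shuffles and the seed are arbitrary.

```python
import random
import statistics


def identity_score(a, b):
    """Hypothetical stand-in score: identical residues, position by position."""
    return sum(x == y for x, y in zip(a, b))


def z_score(s1, s2, score, n_shuffles=200, seed=0):
    """Z = [score(s1,s2) - mean score(s1,s2*)] / s.d. score(s1,s2*)."""
    rng = random.Random(seed)
    observed = score(s1, s2)
    letters = list(s2)
    shuffled = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # s2*: a random permutation of s2
        shuffled.append(score(s1, "".join(letters)))
    sd = statistics.pstdev(shuffled)
    if sd == 0:
        raise ValueError("all shuffled scores are equal; Z is undefined")
    return (observed - statistics.mean(shuffled)) / sd


print(z_score("ATWYFCAKAC", "ETWYKCALLAD", identity_score))
```

Shuffling keeps the amino acid composition of s2 fixed, which is the "drawing without replacement" property of this test.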
From W. Pearson
The tactics of BLAST I
The most widely used program for database searches in biological sequence databases is BLAST (Altschul et al., 1990) and variants of BLAST.
i. It defines a neighbourhood of segments around each of the segments composing the query sequence.
ii. It very quickly finds segments in the database that match these neighbourhood segments.
q=YQPVNPAL
{YQPVN, QPVNP, PVNPA, VNPAL}
YQPVN
{AQPVN, BQPVN,.., YAPVN,..}
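Steps i and ii can be illustrated with a few lines of Python. The neighbourhood here is simply every word differing from the query word by one substitution, which mirrors the example sets above; BLAST itself selects neighbours by a score threshold, so this is a simplification, and the function names and the 20-letter alphabet constant are illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def words(query, w=5):
    """All overlapping segments of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def one_substitution_neighbours(word):
    """Every word that differs from `word` at exactly one position."""
    neighbours = set()
    for pos, old in enumerate(word):
        for aa in AMINO_ACIDS:
            if aa != old:
                neighbours.add(word[:pos] + aa + word[pos + 1:])
    return neighbours


query_words = words("YQPVNPAL")
print(query_words)  # ['YQPVN', 'QPVNP', 'PVNPA', 'VNPAL']
print(len(one_substitution_neighbours("YQPVN")))  # 95 = 5 positions * 19 letters
```

Storing all neighbourhood words in a hash table is what lets the database scan look up matching segments quickly.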
The tactics of BLAST II
iii. It heuristically finds large segments giving a good score.
iv. If the score of this good segment is statistically significant, the segment is extended 5'-ward and 3'-ward by a local alignment algorithm, giving a proposed local alignment.
Homology Test
$W_{i,j} = -\ln(\pi_i P^{2.5}_{i,j} / (\pi_i \pi_j))$
D(s1,s2) is evaluated against the distribution of D(s1,s2*).
Real:   s1 = ATWYFCAK-AC      Random: s1  = ATWYFC-AKAC
        s2 = ETWYKCALLAD              s2* = LTAYKADCWLE
This test:
1. Tests the competing hypotheses that the 2 sequences are 2.5 events apart versus infinitely far apart.
2. Only handles substitutions "correctly". The rationale for indel costs is more arbitrary.
3. Samples from $(\pi_i \pi_j)$ by permuting the order of amino acids in the second sequence, i.e. it uses drawing without replacement (a hypergeometric distribution).
Questions & Summary
Summary:
i. Database search is a local alignment problem.
ii. Scores are evaluated in an extreme value distribution.
iii. Databases can have problems with internal similarities.
Comments & Questions:
1. Fuzzy problem: in principle all sequences are homologous.
2. A combined model of shift in functionality with sequence evolution would be optimal.
3. If the query is a set of homologous sequences, it is possible to weight important positions.