multiple alignment
![Page 1: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/1.jpg)
Multiple sequence alignment, Bioinformatics, Spring 2005
Multiple alignment
• One of the most essential tools in molecular biology
– Finding highly conserved subregions or embedded patterns of a set of biological sequences
• Conserved regions usually are key functional regions, prime targets for drug developments
– Estimation of evolutionary distance between sequences
– Prediction of protein secondary/tertiary structure
• Practically useful methods only since 1987 (D. Sankoff)
– Before 1987, multiple alignments were constructed by hand
– Dynamic programming is expensive
![Page 2: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/2.jpg)
Alignment between globins (human beta globin, horse beta globin, human alpha globin, horse alpha globin, cyanohaemoglobin, whale myoglobin, leghaemoglobin) produced by Clustal. Boxes mark the seven alpha helices composing each globin.
![Page 3: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/3.jpg)
![Page 4: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/4.jpg)
Definition
• Given strings x1, x2, …, xk, a multiple (global) alignment maps them to strings x’1, x’2, …, x’k that may contain spaces, where
– |x’1| = |x’2| = … = |x’k|
– The removal of all spaces from x’i leaves xi, for 1 ≤ i ≤ k
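The two conditions of the definition can be checked mechanically. The sketch below is only an illustration: the function name `is_multiple_alignment` is hypothetical, and it assumes the space character is written as `-`, as in the alignments shown on later slides.

```python
def is_multiple_alignment(originals, rows, gap="-"):
    """Check the two conditions of the definition.

    originals: the input strings x1..xk
    rows:      the padded strings x'1..x'k (gap symbol assumed to be '-')
    """
    if len(rows) != len(originals) or not rows:
        return False
    # Condition 1: all padded strings have the same length.
    same_length = len({len(r) for r in rows}) == 1
    # Condition 2: removing the gaps from x'i gives back xi.
    restores = all(r.replace(gap, "") == x for r, x in zip(rows, originals))
    return same_length and restores


print(is_multiple_alignment(["MPE", "MKE", "MSKE"], ["M-PE", "M-KE", "MSKE"]))
```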
![Page 5: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/5.jpg)
Definitions
• Multiple Alignment
– A rectangular arrangement, where each row consists of one protein sequence padded by gaps, such that the columns highlight similarity/conservation between positions
• Motif
– A conserved element of a protein sequence alignment that usually correlates with a particular function
– Motifs are generated from a local multiple protein sequence alignment corresponding to a region whose function or structure is known
![Page 6: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/6.jpg)
Example of motif
• Motifs are conserved and hence predictive of any subsequent occurrence of such a structural/functional region in any other novel protein sequence
NAYCDEECK
NAYCDKLC-
-GYCN-ECT
NDYC-RECR
![Page 7: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/7.jpg)
Scoring multiple alignments
• Ideally, a scoring scheme should
– Penalize variations in conserved positions higher
– Relate sequences by a phylogenetic tree
• Tree alignment
• Usually assume
– Independence of columns
– Quality computation
• Entropy-based scoring
– Compute the Shannon entropy of each column
– Minimize the total entropy
• Steiner string
• Sum-of-pairs (SP) score
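As an illustration of entropy-based scoring, the sketch below computes the Shannon entropy of each column and sums over columns, relying on the column-independence assumption. Treating the gap symbol `-` as an ordinary symbol and using base-2 logarithms are assumptions of this sketch, not choices stated on the slide.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbol frequencies in one column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def alignment_entropy(rows):
    """Total entropy of an alignment: sum of the column entropies.

    Lower is better: a perfectly conserved column contributes 0.
    """
    return sum(column_entropy(col) for col in zip(*rows))


print(alignment_entropy(["AC", "AG"]))  # first column conserved, second is not
```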
![Page 8: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/8.jpg)
Tree alignment
• Ideally:
– Find alignment that maximizes probability that sequences evolved from common ancestor
[Figure: tree relating the sequences x, y, z, w and v, with an unknown ancestral sequence marked "?"]
![Page 9: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/9.jpg)
Tree alignment
• Model the k sequences with a tree having k leaves (1 to 1 correspondence)
• Compute a weight for each edge, which is the similarity score
• Sum of all the weights is the score of the tree
• Assign sequences to internal nodes so that score is maximized
![Page 10: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/10.jpg)
Tree alignment example
• Match +1, gap -1, mismatch 0
• If x = CT and y = CG, the tree has a score of 6
[Figure: tree with leaves CAT, GT, CTG and CG and internal nodes x and y]
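The score of 6 can be reproduced with a short sketch. The tree topology is an assumption here, since the figure is not recoverable from the transcript: CAT and GT are taken to hang off x, CTG and CG off y, with one edge between x and y. Each edge weight is the optimal global alignment score under the stated scheme. The names `similarity`, `tree_score` and the leaf labels `l1`..`l4` are hypothetical.

```python
def similarity(a, b, match=1, mismatch=0, gap=-1):
    """Optimal global alignment score of a and b (Needleman-Wunsch)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]
        for j, cb in enumerate(b, 1):
            sub = match if ca == cb else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def tree_score(edges, labels):
    """Score of a labelled tree: sum of the edge similarity scores."""
    return sum(similarity(labels[u], labels[v]) for u, v in edges)


# Assumed topology: (CAT, GT) attached to x, (CTG, CG) attached to y.
labels = {"l1": "CAT", "l2": "GT", "l3": "CTG", "l4": "CG", "x": "CT", "y": "CG"}
edges = [("l1", "x"), ("l2", "x"), ("x", "y"), ("l3", "y"), ("l4", "y")]
print(tree_score(edges, labels))  # 1 + 1 + 1 + 1 + 2 = 6
```

Under this topology the five edges score 1, 1, 1, 1 and 2. Tree alignment itself is the harder problem of choosing the labels of x and y so that this sum is maximized.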
![Page 11: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/11.jpg)
Analysis
• The tree alignment problem is NP-complete
– Holds even for the special case of star alignment
– “Lifting alignment” gives a 2-approximation algorithm
• The generalized tree alignment problem (find the best tree) is also NP-complete
• Special cases for different kinds of scoring metrics
– Size of alphabet
– Triangle inequality
![Page 12: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/12.jpg)
Consensus representations
• Relative frequencies of symbols in each column
– Adds up to 1 in each column
• Steiner string
– Minimize the consensus error
– May not belong to the set of input strings
• Consensus string for a given multiple alignment
– Choose optimal character in every column
– Consensus string is the concatenation of these characters
– Alignment error of a column is the distance-sum to the optimal character of all symbols in the column
– Alignment error of a consensus string is the sum of all column errors
• Optimal consensus string: optimize over all multiple alignments
• Signature representation
– Regular expression
– Helicase protein: [&H][&A]D[DE]xn[TSN]x4[QK]Gx7[&A]
• & is any amino acid in {I,L,V,M,F,Y,W}
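The first and third representations above can be sketched in a few lines. This is an illustration under a stated assumption: the distance between two symbols is 0 if they are equal and 1 otherwise, so the "optimal character" of a column is simply its most frequent symbol. The function names and the small example alignment are hypothetical.

```python
from collections import Counter


def profile(rows):
    """Relative frequency of each symbol in each column (sums to 1 per column)."""
    k = len(rows)
    return [{sym: cnt / k for sym, cnt in Counter(col).items()} for col in zip(*rows)]


def consensus(rows):
    """Consensus string: the most frequent symbol of every column, concatenated."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def alignment_error(rows):
    """Sum over columns of the number of symbols that differ from the consensus."""
    cons = consensus(rows)
    return sum(sym != c for col, c in zip(zip(*rows), cons) for sym in col)


rows = ["ACG-", "ACGT", "AGGT"]
print(consensus(rows), alignment_error(rows))
```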
![Page 13: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/13.jpg)
Steiner string and consensus error metric
• Minimize $\sum_i D(s, x_i)$ over all possible strings s
• The minimizing string $s_{min}$ is called the Steiner string
– May not belong to the set of inputs
– NP-complete
• Consensus error metric based on similarity to the Steiner string
– The center string provides an approximation factor of 2
![Page 14: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/14.jpg)
Relating alignment error and consensus error
• Let s be the Steiner string for a string set X = {xi} and c be the optimal consensus string
• For any multiple alignment M of X,
– Let xM be the consensus string
– Alignment error of xM = consensus error using xM ≥ consensus error using s
• Consider the star multiple alignment N using s
– Alignment error of N using s = consensus error using s
– Alignment error of N using s ≤ alignment error of any multiple alignment
– N is the optimal multiple alignment and s (after removing gaps) is the consensus string
• Steiner string provides the optimal consensus string
![Page 15: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/15.jpg)
Aligning to family representations
• Profile
– Apply dynamic programming
– Score depends on the profile
• Consensus string
– Apply dynamic programming
• Signature representations
– Align to regular expressions / CFG / …
![Page 16: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/16.jpg)
Scoring Function: Sum of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment: keep the two rows and drop every column in which both rows have a gap
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C
y: ACGC-GAG

x: AC-GCGG-C
z: GCCGC-GAG

y: AC-GCGAG
z: GCCGCGAG
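The induced alignments in the example can be derived mechanically: take two rows of the multiple alignment and drop the columns where both rows hold a gap. A minimal sketch, with the hypothetical helper name `induced_pairwise`:

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Pairwise alignment induced by two rows of a multiple alignment."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
for first, second in ((x, y), (x, z), (y, z)):
    print(induced_pairwise(first, second))
```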
![Page 17: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/17.jpg)
Sum of Pairs (cont’d)
• The sum-of-pairs (SP) score of a multiple alignment A is the sum of the scores of all induced pairwise alignments
$S(A) = \sum_{i<j} S(A_{ij})$
$A_{ij}$ is the induced alignment of $x_i$, $x_j$
• Drawback: no evolutionary characterization
– Every sequence derived from all others
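A small sketch of the SP score follows. The scoring scheme (match +1, mismatch 0, gap -1, gap against gap 0) is an assumption borrowed from the tree alignment example; the slide itself does not fix one. Scoring a gap-gap column as 0 is equivalent to dropping it from the induced pairwise alignment.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Score of one aligned pair of symbols; a gap-gap pair contributes nothing."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_score(rows):
    """Sum over all row pairs i < j of the induced pairwise alignment score."""
    return sum(
        pair_score(a, b)
        for row_a, row_b in combinations(rows, 2)
        for a, b in zip(row_a, row_b)
    )


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(sp_score(rows))  # (x,y) = 3, (x,z) = 1, (y,z) = 5, so 9 in total
```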
![Page 18: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/18.jpg)
Optimal solution for SP scores
• Multidimensional dynamic programming
• Generalization of pairwise alignment
• For simplicity, assume k sequences of length n
• The dynamic programming array is a k-dimensional hyperlattice of length n+1 (including initial gaps)
• The entry F(i1, …, ik) represents the score of the optimal alignment for s1[1..i1], …, sk[1..ik]
• Initialize values on the faces of the hyperlattice
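The recurrence can be sketched for any k, although it is only usable for very small inputs. In this sketch F is stored in a dictionary keyed by the index tuple (i1, …, ik), the faces are filled by the same recurrence rather than initialized separately, and the scoring scheme (match +1, mismatch 0, gap -1) is an assumption. The name `optimal_sp_score` is hypothetical.

```python
from itertools import combinations, product


def pair_score(a, b, match=1, mismatch=0, gap=-1):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def optimal_sp_score(seqs):
    """Optimal SP score by dynamic programming over the k-dimensional lattice."""
    k = len(seqs)
    # Each move says which sequences contribute a character to the next column;
    # the all-zero move is excluded, leaving 2^k - 1 predecessors per cell.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]
    F = {(0,) * k: 0}
    # Lexicographic order guarantees every predecessor is filled before its cell.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if not any(cell):
            continue
        best = None
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue
            column = [s[c - 1] if m else "-" for s, c, m in zip(seqs, cell, move)]
            score = F[prev] + sum(pair_score(a, b) for a, b in combinations(column, 2))
            if best is None or score > best:
                best = score
        F[cell] = best
    return F[tuple(len(s) for s in seqs)]


print(optimal_sp_score(["AC", "AC", "AC"]))
```

For k = 2 this reduces to ordinary pairwise global alignment, which gives a quick sanity check.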
![Page 19: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/19.jpg)
[Figure: the DP recurrence for k = 3 sequences. Each cell takes the max over its 2^k − 1 = 7 predecessor cells, one for each choice of residue or gap in the three sequences.]
![Page 20: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/20.jpg)
Complexity
• Space complexity: $O(n^k)$ for k sequences, each of length n
• Computing at a cell: $O(2^k)$ times the cost of computing δ
• Time complexity: $O(2^k n^k)$ times the cost of computing δ
• Finding the optimal solution is exponential in k
• Proven to be NP-complete for a number of cost functions
![Page 21: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/21.jpg)
Algorithms
• Faster dynamic programming
– Carrillo and Lipman, 1988 (CL)
– Pruning of hyperlattice in DP
– Practical for about 6 sequences of length about 200
• Star alignment
• Progressive methods
– CLUSTALW
– PILEUP
• Iterative algorithms
• Hidden Markov Model (HMM) based methods
![Page 22: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/22.jpg)
CL algorithm
• Find pairwise alignments
• Trial multiple alignment produced by a tree, cost = d
• This provides a limit to the volume within which optimal alignments are found
• Specifics
– Sequences x1, …, xr
– Alignment A, score = s(A)
– Optimal alignment A*
– Aij = induced alignment on xi, xj on account of A
– D(xi,xj) = score of optimal pairwise alignment of xi, xj ≥ s(Aij)
![Page 23: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/23.jpg)
CL algorithm
• $d \le s(A^*) = s(A^*_{uv}) + \sum_{i<j,\ (i,j) \ne (u,v)} s(A^*_{ij}) \le s(A^*_{uv}) + \sum_{i<j,\ (i,j) \ne (u,v)} D(x_i,x_j)$
• $s(A^*_{uv}) \ge d - \sum_{i<j,\ (i,j) \ne (u,v)} D(x_i,x_j) = B(u,v)$
• Compute B(u,v) for each (u,v) pair
• Consider any cell f with projection (s,t) on the (u,v) plane
• If A* passes through f, then $A^*_{uv}$ passes through (s,t)
– $best^{st}_{uv}$ = score of the best pairwise alignment of $x_u$, $x_v$ that passes through (s,t)
– $best^{st}_{uv}$ = score of the prefixes up to (s,t) + cost$(x_u^s, x_v^t)$ + score of the suffixes after (s,t)
![Page 24: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/24.jpg)
CL algorithm
• If $best^{st}_{uv} < B(u,v)$, then
– A* cannot pass through cell f
– Discard such cells from computation of DP
– Can prune for all (u,v) pairs
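The pruning test can be sketched as follows. This is a simplified, node-based variant and not the exact formulation on the slides: here the best score through lattice node (s,t) is the optimal prefix score plus the optimal suffix score, obtained from a forward matrix and a matrix computed on the reversed strings. The scoring scheme (match +1, mismatch 0, gap -1) and all function names are assumptions; d is the SP score of any trial multiple alignment.

```python
from itertools import combinations


def forward_matrix(a, b, match=1, mismatch=0, gap=-1):
    """F[i][j] = optimal global alignment score of a[:i] and b[:j]."""
    F = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        F[i][0] = i * gap
    for j in range(1, len(b) + 1):
        F[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F


def best_through(a, b):
    """best[s][t] = best score of an alignment of a, b whose path visits node (s, t)."""
    F = forward_matrix(a, b)
    R = forward_matrix(a[::-1], b[::-1])  # suffix scores via the reversed strings
    n, m = len(a), len(b)
    return [[F[s][t] + R[n - s][m - t] for t in range(m + 1)] for s in range(n + 1)]


def carrillo_lipman_masks(seqs, d):
    """For every pair (u, v), mark the nodes (s, t) that survive the pruning test."""
    pairs = list(combinations(range(len(seqs)), 2))
    best = {p: best_through(seqs[p[0]], seqs[p[1]]) for p in pairs}
    D = {p: best[p][0][0] for p in pairs}  # every path visits (0, 0): optimal score
    total = sum(D.values())
    masks = {}
    for p in pairs:
        bound = d - (total - D[p])  # B(u, v)
        masks[p] = [[value >= bound for value in row] for row in best[p]]
    return masks


masks = carrillo_lipman_masks(["AC", "AC", "AC"], d=6)
print(masks[(0, 1)])
```

With three copies of AC and a trial alignment of SP score 6, only the diagonal nodes survive for each pair, so the k-dimensional DP would only need to visit cells whose projections all lie on those diagonals.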
![Page 25: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/25.jpg)
Star alignment
• Heuristic method for multiple sequence alignments
• Select a sequence c as the center of the star
• For each sequence x1, …, xk such that index i ≠ c, perform a Needleman-Wunsch global alignment
• Aggregate alignments with the principle “once a gap, always a gap.”
• Consider the case of distance (not scores)
– Find multiple alignment with minimum distance
![Page 26: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/26.jpg)
Star alignment example
S1: MPE
S2: MKE
S3: MSKE
S4: SKE
[Figure: star joining s1, s2, s3 and s4]
Pairwise alignments with MKE:
MPE
MKE

MSKE
M-KE

SKE
MKE
Aggregating the pairwise alignments one at a time:
MPE
MKE

M-PE
M-KE
MSKE

M-PE
M-KE
MSKE
S-KE
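The aggregation step can be sketched as follows. This is one possible implementation of "once a gap, always a gap", with an assumed scoring scheme (match +1, mismatch 0, gap -1) and hypothetical function names; the center is passed in by index rather than chosen automatically.

```python
def global_align(a, b, match=1, mismatch=0, gap=-1):
    """Needleman-Wunsch: return one optimal alignment (a_row, b_row)."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)
    row_a, row_b, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace back from the bottom-right corner
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


def star_align(seqs, center):
    """Merge pairwise alignments with seqs[center]; existing gaps are never removed."""
    center_row, rows = seqs[center], {}
    for idx, seq in enumerate(seqs):
        if idx == center:
            continue
        c_al, x_al = global_align(seqs[center], seq)
        new_center, new_x, new_rows = [], [], {r: [] for r in rows}
        i = j = 0
        while i < len(center_row) or j < len(c_al):
            old_gap = i < len(center_row) and center_row[i] == "-"
            new_gap = j < len(c_al) and c_al[j] == "-"
            if old_gap and not new_gap:      # keep an existing gap column
                new_center.append("-"); new_x.append("-")
                for r in rows:
                    new_rows[r].append(rows[r][i])
                i += 1
            elif new_gap and not old_gap:    # open a new gap column in all old rows
                new_center.append("-"); new_x.append(x_al[j])
                for r in rows:
                    new_rows[r].append("-")
                j += 1
            else:                            # same center position in both
                new_center.append(center_row[i]); new_x.append(x_al[j])
                for r in rows:
                    new_rows[r].append(rows[r][i])
                i += 1; j += 1
        center_row = "".join(new_center)
        rows = {r: "".join(v) for r, v in new_rows.items()}
        rows[idx] = "".join(new_x)
    rows[center] = center_row
    return [rows[i] for i in range(len(seqs))]


print(star_align(["MPE", "MKE", "MSKE", "SKE"], center=1))
```

With MKE as the center, this reproduces the final alignment of the example: M-PE, M-KE, MSKE, S-KE.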
![Page 27: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/27.jpg)
Choosing a center
• Try them all and pick the one with the least distance
• Let D(xi,xj) be the optimal distance between sequences xi and xj
• Given a multiple alignment A, let c(Aij) be the distance between xi and xj that is induced on account of A
• Calculate all $O(k^2)$ alignments, and pick as center xc the sequence xi that minimizes $\sum_{j \ne i} D(x_i, x_j)$
• The resulting multiple alignment A has the property that c(Aci) = D(xc,xi)
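Choosing the center is a direct computation once a distance is fixed. The sketch below uses unit-cost edit distance as D, which is an assumption (the slides do not name a specific distance); `edit_distance` and `choose_center` are hypothetical names.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def choose_center(seqs):
    """Index of the sequence with the smallest total distance to all others."""
    totals = [sum(edit_distance(x, y) for y in seqs) for x in seqs]
    return min(range(len(seqs)), key=totals.__getitem__)


seqs = ["MPE", "MKE", "MSKE", "SKE"]
print(choose_center(seqs))  # distance sums are 5, 3, 4, 4, so index 1 (MKE)
```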
![Page 28: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/28.jpg)
Analysis
• Assuming all sequences have length n
• $O(k^2 n^2)$ to calculate the center
• Step i of iterative pairwise alignment takes $O((i \cdot n) \cdot n)$ time
– two strings of length n and $i \cdot n$
• $O(k^2 n^2)$ overall cost
• Produces multiple sequence alignments whose SP values are at most twice that of the optimal solutions, provided the triangle inequality holds
![Page 29: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/29.jpg)
Bound analysis
• Let $M = \sum_{i=2}^{k} c(A_{1i}) = \sum_{i=2}^{k} D(x_1,x_i)$, assuming $x_1$ is the center
• $2\,c(A) = \sum_{i} \sum_{j \ne i} c(A_{ij}) \le \sum_{i} \sum_{j \ne i} [c(A_{1i}) + c(A_{1j})] = 2(k-1) \sum_{i=2}^{k} c(A_{1i}) = 2(k-1)M$
• $2\,c(A^*) = \sum_{i} \sum_{j \ne i} c(A^*_{ij}) \ge \sum_{i} \sum_{j \ne i} D(x_i,x_j) \ge k \sum_{i=2}^{k} c(A_{1i}) = kM$
• $c(A)/c(A^*) \le 2(k-1)/k < 2$
![Page 30: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/30.jpg)
Consensus error
• Center string c also provides an approximation factor of 2 under the consensus error (score) metric
• Assume triangle inequality
• Let E(x) denote the consensus error with respect to string x
• Let z be the Steiner string
• $E(z) = \sum_i D(z,x_i)$
![Page 31: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/31.jpg)
Consensus error
• For any string y in the input set,
– $E(y) = \sum_i D(y,x_i) \le \sum_{x_i \ne y} [D(y,z) + D(z,x_i)] = (k-2)\,D(y,z) + D(y,z) + \sum_{x_i \ne y} D(z,x_i) = (k-2)\,D(y,z) + E(z)$
• Pick y* from the input set that is closest to z
– $E(z) = \sum_i D(z,x_i) \ge k\,D(y^*,z)$
• $E(y^*)/E(z) \le [(k-2)\,D(y^*,z) + E(z)]/E(z) \le (k-2)\,D(y^*,z) / [k\,D(y^*,z)] + 1 = 2 - 2/k < 2$
• $E(c) \le E(y^*)$
![Page 32: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/32.jpg)
ClustalW
• Progressive alignment
• 3 steps:
– All pairs of sequences are aligned to produce a distance matrix (or a similarity matrix)
– A rooted guide tree is calculated from this matrix by the neighbor-joining (NJ) method
• Neighbor Joining – Saitou, 1987
– The sequences are aligned progressively according to the branching order in the guide tree
![Page 33: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/33.jpg)
![Page 34: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/34.jpg)
![Page 35: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/35.jpg)
![Page 36: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/36.jpg)
![Page 37: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/37.jpg)
ClustalW example
Input sequences:
S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD
All pairwise alignments give the distance matrix:
      S1  S2  S3  S4
S1     0   9   4   7
S2         0   8   3
S3             0   7
S4                 0
Neighbor joining gives the rooted tree, with leaves in the order S1, S3, S2, S4.
Multiple alignment steps:
1. Align S1 with S3
2. Align S2 with S4
3. Align (S1, S3) with (S2, S4)
Step 1:
-ALSK
NA-SK
Step 2:
-TNSD
NT-SD
Step 3, the multiple alignment:
-ALSK
-TNSD
NA-SK
NT-SD
![Page 38: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/38.jpg)
Other progressive approaches
• PILEUP
– Similar to CLUSTALW
– Uses UPGMA to produce tree
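To make the UPGMA step concrete, the sketch below builds a guide tree from the distance matrix of the ClustalW example (S1..S4). It is a generic textbook-style UPGMA, not PILEUP's actual code: the function name, the nested-tuple tree representation and the omission of branch lengths are all choices of this sketch.

```python
def upgma(names, matrix):
    """Build a rooted guide tree (nested tuples) by average-linkage clustering.

    names:  leaf names
    matrix: full symmetric distance matrix, matrix[i][j] = distance(names[i], names[j])
    """
    trees = {i: names[i] for i in range(len(names))}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                      # cluster id -> number of leaves
    d = {(i, j): matrix[i][j] for i in trees for j in trees if i < j}
    next_id = len(names)
    while len(trees) > 1:
        a, b = min(d, key=d.get)                       # closest pair of clusters
        new = next_id
        next_id += 1
        for c in trees:
            if c in (a, b):
                continue
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            # Distance to the merged cluster: size-weighted average.
            d[(c, new)] = (sizes[a] * dac + sizes[b] * dbc) / (sizes[a] + sizes[b])
        d = {p: v for p, v in d.items() if a not in p and b not in p}
        trees[new] = (trees.pop(a), trees.pop(b))
        sizes[new] = sizes.pop(a) + sizes.pop(b)
    return next(iter(trees.values()))


names = ["S1", "S2", "S3", "S4"]
matrix = [
    [0, 9, 4, 7],
    [9, 0, 8, 3],
    [4, 8, 0, 7],
    [7, 3, 7, 0],
]
print(upgma(names, matrix))
```

On this matrix UPGMA first joins S2 with S4 (distance 3), then S1 with S3 (distance 4), and finally the two pairs, which is the same branching order used in the ClustalW example.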
![Page 39: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/39.jpg)
Problems with progressive alignments
• Depend on pairwise alignments
• If sequences are very distantly related, much higher likelihood of errors
• Care must be taken in choosing scoring matrices and penalties
![Page 40: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/40.jpg)
Iterative refinement in progressive alignment
One problem of progressive alignment:
• Initial alignments are “frozen” even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT (frozen!)
z: GAACTG
w: GTACTG
Now it is clear that the correct alignment is y = GA-CTT
![Page 41: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/41.jpg)
Multiple alignment tools
• Clustal W (Thompson, 1994)
– Most popular
• PRRP (Gotoh, 1993)
• HMMT (Eddy, 1995)
• DIALIGN (Morgenstern, 1998)
• T-Coffee (Notredame, 2000)
• MUSCLE (Edgar, 2004)
• Align-m (Walle, 2004)
• PROBCONS (Do, 2004)
![Page 42: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/42.jpg)
Evaluating multiple alignments
• Balibase benchmark (Thompson, 1999)
• De-facto standard for assessing the quality of a multiple alignment tool
• Manually refined multiple sequence alignments
• Quality measured by how well the alignment matches the core blocks
• Clustal W performs the best
– Problems of Clustal W
• Once a gap, always a gap
• Order dependent
![Page 43: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/43.jpg)
Computationally challenging problems
• Scalable multiple alignment
– Dynamic programming is exponential in number of sequences
– Practical for about 6 sequences of length about 200
![Page 44: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/44.jpg)
Quick Primer on NP completeness
• Polynomial-time Reductions
– If we could solve X in polynomial time, then we could also solve Y in polynomial time
– $Y \le_P X$
• Class NP
– Set of all problems for which there exists an efficient certifier
• P = NP?
– General transformation of checking a solution to finding a solution
![Page 45: Multiple alignment](https://reader036.vdocument.in/reader036/viewer/2022081515/5681443d550346895db0d974/html5/thumbnails/45.jpg)
• NP-completeness
– X is NP-complete if
• $X \in NP$
• For all $Y \in NP$, $Y \le_P X$
– If X is NP-complete, X is solvable in polynomial time iff P = NP
– Satisfiability is NP-complete
– If Y is NP-complete and X is in NP with the property that $Y \le_P X$, then X is NP-complete
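The notion of an efficient certifier can be illustrated with satisfiability. In this sketch a formula in conjunctive normal form is a list of clauses, each clause a list of nonzero integers (a positive number i stands for variable i, a negative number for its negation); this encoding and the function names are assumptions of the example. Checking a proposed assignment takes time linear in the formula size, while the naive way of finding one tries all 2^n assignments.

```python
from itertools import product


def certifies(cnf, assignment):
    """Efficient certifier: does the assignment satisfy every clause?"""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )


def brute_force_sat(cnf):
    """Exhaustive search over all 2^n assignments; returns one that works or None."""
    variables = sorted({abs(lit) for clause in cnf for lit in clause})
    for values in product((False, True), repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if certifies(cnf, assignment):
            return assignment
    return None


# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
cnf = [[1, -2], [2, 3], [-1, -3]]
print(certifies(cnf, {1: True, 2: True, 3: False}))
print(brute_force_sat(cnf))
```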