04/20/23 1
Multiple sequence alignment
Copyright notice
• Many of the images in this power point presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.
• Many slides in this power point presentation are from slides by Dr. Jonathan Pevsner and other people. The copyright belongs to the original authors. Thanks!
Multiple sequence alignment: definition
• a collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned
• Homologous residues are aligned in columns across the length of the sequences
• residues are homologous in an evolutionary sense
• residues are homologous in a structural sense
Multiple sequence alignment: properties
• not necessarily one “correct” alignment of a protein family
• protein sequences evolve...
• ...the corresponding three-dimensional structures of proteins also evolve
• may be impossible to identify amino acid residues that align properly (structurally) throughout a multiple sequence alignment
• for two proteins sharing 30% amino acid identity, about 50% of the individual amino acids are superposable in the two structures
Multiple sequence alignment: features
• some aligned residues, such as Cysteines that form disulfide bridges, may be highly conserved
• there may be conserved motifs such as a transmembrane domain
• there may be conserved secondary structure features
• there may be regions with consistent patterns of insertions or deletions (indels)
Multiple sequence alignment: uses
• MSA is more sensitive than pairwise alignment at detecting homologs
• BLAST output can take the form of an MSA, and can reveal conserved residues or motifs
• Population data can be analyzed in an MSA (PopSet)
• A single query can be searched against a database of MSAs
• Regulatory regions of genes may have consensus sequences identifiable by MSA
Multiple Sequence Alignment: Approaches
• Optimal Global Alignments - Dynamic programming
• Global Progressive Alignments - Match closely-related sequences first using a guide tree. (Feng & Doolittle)
• Global Iterative Alignments - Multiple re-building attempts to find best alignment
• Local alignments - Profiles, Blocks, Patterns
Dynamic Programming
• Generalization of Needleman-Wunsch
– Find alignment that maximizes a score function
• Computationally expensive: time grows as the product of the sequence lengths
– 2 sequences: O(n^2)
– 3 sequences: O(n^3)
– 4 sequences: O(n^4)
– N sequences: O(n^N)
• Can align about 7 relatively short (200-300) protein sequences in a reasonable amount of time; not much beyond that
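To make the growth concrete, here is a minimal Python sketch that counts the cells in the dynamic programming lattice. It assumes every sequence has the same length n, and the helper name `dp_cells` is invented for illustration. It counts cells only; the work per cell also grows, since each cell has up to 2^N - 1 predecessor cells.

```python
def dp_cells(n: int, num_seqs: int) -> int:
    """Cells in the DP lattice for num_seqs sequences, each of length n.

    Each axis has n + 1 positions (the extra one is the empty prefix).
    """
    return (n + 1) ** num_seqs


# Lattice sizes for sequences of 250 residues.
for k in (2, 3, 7):
    print(f"{k} sequences: {dp_cells(250, k):.3e} cells")
```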
Progressive Alignment
• Find succession of pairwise alignments
• Heuristic – cannot separate scoring and optimization
• Works well for closely related sequences
• Very sensitive to initial alignments
Progressive Alignment
• Use a series of pairwise alignments to align larger and larger groups of sequences, following the branching order in the guide tree
– Align the most closely related sequences first, then add the next most closely related sequence, iteratively
– The full DP algorithm is used to align two existing alignments or sequences
– Gaps in present/older alignments remain fixed
Progressive Alignment Examples
• Feng-Doolittle (1987)
• ClustalW
• T-Coffee
Feng-Doolittle MSA occurs in 3 stages
• [1] Do a set of global pairwise alignments (Needleman and Wunsch)
• [2] Create a guide tree
• [3] Progressively align the sequences
Progressive MSA stage 1 of 3: generate global pairwise alignments
five distantly related lipocalins
best score
Progressive MSA stage 1 of 3: generate global pairwise alignments
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96
five closely related lipocalins
best score
Number of pairwise alignments needed
For N sequences, (N-1)(N)/2
For 5 sequences, (4)(5)/2 = 10
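The count can be checked by enumerating the pairs directly. This is a small Python sketch; the function name `pairwise_jobs` is invented for illustration, and sequences are numbered from 1 as in the ClustalW output.

```python
from itertools import combinations


def pairwise_jobs(n: int) -> list[tuple[int, int]]:
    """All unordered pairs (i, j) with i < j among sequences 1..n."""
    return list(combinations(range(1, n + 1), 2))


pairs = pairwise_jobs(5)
print(len(pairs))  # equals 5 * 4 / 2
print(pairs)
```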
Feng-Doolittle stage 2: guide tree
• Convert similarity scores to distance scores
• A tree shows the distance between objects
• ClustalW provides a syntax to describe the tree
• A guide tree is not a phylogenetic tree
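As a rough illustration of the first bullet, the sketch below converts the pairwise scores reported for the five closely related lipocalins into distances. The conversion `1 - score/100` is an assumption made here (it treats the score as a percent identity); the slides do not state which formula is used, and Feng and Doolittle's original method uses a different, logarithmic transform.

```python
# Pairwise scores for the five closely related lipocalins (from the slide).
scores = {
    (1, 2): 84, (1, 3): 84, (1, 4): 91, (1, 5): 92, (2, 3): 99,
    (2, 4): 86, (2, 5): 85, (3, 4): 85, (3, 5): 84, (4, 5): 96,
}


def to_distance(score_percent: float) -> float:
    """Assumed conversion: treat the score as percent identity."""
    return 1.0 - score_percent / 100.0


distances = {pair: to_distance(s) for pair, s in scores.items()}

# The pair with the smallest distance would be joined first in the guide tree.
closest = min(distances, key=distances.get)
print(closest, round(distances[closest], 2))
```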
Guide Tree
• UPGMA – Unweighted Pair Group Method by Arithmetic Mean
– Simplest method of tree construction
– Assumes equal rates of mutation along the branches
• UPGMA Algorithm
– Definition: Node in a tree is called an Operational Taxonomic Unit (OTU)
– From distance matrix, cluster pair of OTUs with smallest distance, and calculate new distance
– Repeat previous step until clusters converge
Guide Tree - UPGMA
• Cluster pair with smallest distance
• Recalculate distance matrix
A B C D E
B 2
C 4 4
D 6 6 6
E 6 6 6 4
F 8 8 8 8 8
Guide Tree - UPGMA
• Calculate new distance using composite OTU(A,B):
– Distance between a simple OTU and a composite OTU is the average of the distances between the simple OTU and the constituent simple OTUs of the composite OTU
dist (A,B),C = (dist A,C + dist B,C) / 2 = (4 + 4) / 2 = 4
dist (A,B),D = (dist A,D + dist B,D) / 2 = (6 + 6) / 2 = 6
dist (A,B),E = (dist A,E + dist B,E) / 2 = (6 + 6) / 2 = 6
dist (A,B),F = (dist A,F + dist B,F) / 2 = (8 + 8) / 2 = 8
Guide Tree - UPGMA
• Calculate new distance using composite OTU(A,B):
– Distance between a simple OTU and a composite OTU is the average of the distances between the simple OTU and the constituent simple OTUs of the composite OTU
A,B C D E
C 4
D 6 6
E 6 6 4
F 8 8 8 8
Guide Tree - UPGMA
• Second Iteration
A,B C D E
C 4
D 6 6
E 6 6 4
F 8 8 8 8
Guide Tree - UPGMA
• Third Iteration
A,B C D,E
C 4
D,E 6 6
F 8 8 8
Guide Tree - UPGMA
• Fourth Iteration
AB,C D,E
D,E 6
F 8 8
Guide Tree - UPGMA
• Fifth Iteration
ABC,DE
F 8
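The iterations above can be reproduced with a short Python sketch of UPGMA on the same six-OTU distance matrix. The function name `upgma` and the data layout are choices made for this illustration. Note that after the first merge the pairs (AB, C) and (D, E) are tied at distance 4, so the order in which this sketch merges them may differ from the slides; the merge distances are the same either way.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster by UPGMA.

    names: list of OTU labels.
    dist:  dict mapping frozenset({a, b}) -> distance between OTUs a and b.
    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = {name: (name,) for name in names}  # label -> member OTUs
    d = dict(dist)                                # distances between clusters
    merges = []
    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merges.append((a, b, d[frozenset((a, b))]))
        new, members = a + b, clusters[a] + clusters[b]
        for c in clusters:
            if c in (a, b):
                continue
            # Average over all pairs of original OTUs in the two clusters.
            total = sum(dist[frozenset((m, x))]
                        for m in members for x in clusters[c])
            d[frozenset((new, c))] = total / (len(members) * len(clusters[c]))
        del clusters[a], clusters[b]
        clusters[new] = members
    return merges


names = ["A", "B", "C", "D", "E", "F"]
lower = {"B": [2], "C": [4, 4], "D": [6, 6, 6],
         "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
dist = {frozenset((row, names[j])): v
        for row, vals in lower.items() for j, v in enumerate(vals)}

merges = upgma(names, dist)
for a, b, height in merges:
    print(f"merge {a} + {b} at distance {height}")
```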
Guide Tree
• ClustalW uses Neighbor-Joining
• Neighbor Joining corrects the UPGMA method for its (frequently invalid) assumption that the same rate of evolution applies to each branch of a tree.
• Neighbor Joining has given the best results in simulation studies and it is the most computationally efficient of the distance algorithms (N. Saitou and T. Imanishi, Mol. Biol. Evol. 6:514 (1989))
• Neighbor-Joining Algorithm
– Assumes unequal rates of mutation along each branch
– Find pairs of OTUs that minimize total branch length at each stage of clustering, starting with a starlike tree (Minimum-Evolution Tree). The distance matrix is adjusted for differences in the rate of evolution of each taxon (branch).
NJ Algorithm
Neighbor Joining to Calculate the Guide Tree Phase:
– does not require a uniform molecular clock
– the raw data are provided as a distance matrix
– the initial tree is a star tree
– distance matrix is modified
  • distance between node pairs is adjusted on the basis of their average divergence from all other nodes
– the least-distant pair of nodes are linked
NJ Algorithm
Neighbor Joining to Calculate the Guide Tree Phase:
– When two nodes are linked:
  • add their common ancestral node to the tree
  • delete the terminal nodes with their branches
  • the common ancestor is now a terminal node on a smaller tree
– At each step, two terminal nodes are replaced by one new node
– The process is complete when there are only two nodes separated by a single branch
NJ Algorithm
• Advantages of Neighbor Joining
– Fast
  • Can be used on large datasets
  • Can support bootstrap analysis
– Can handle lineages with largely different branch lengths (different molecular evolutionary rates)
– Can be used with methods that use correction for multiple substitutions
NJ Algorithm
• Disadvantages of Neighbor Joining
– sequence information is reduced
  • Sequences are boiled down to distances
  • No secondary or tertiary features used
– gives only one possible tree
– strongly dependent on the model of evolution used
NJ Algorithm
• NJ example from: http://www.icp.ucl.ac.be/~opperd/private/neighbor.html
• Consider the following tree:
• Notice that the branches for D and B are longer.
• This expresses the idea that they have a faster molecular clock than the other OTUs.
NJ Algorithm
The distance matrix for the tree is:
A B C D E
B 5
C 4 7
D 7 10 7
E 6 9 6 5
F 8 11 8 9 8
Normally, we create the tree from the distances.
In this example, we use the tree to derive the distances.
NJ Algorithm
• We start with a star tree.
• Notice that we have 6 operational taxonomic units (OTUs)
• The star tree has a leaf for each OTU
[Figure: star tree with leaves A, B, C, D, E and F joined at a single central node]
NJ Algorithm
Step 1: Calculate the net divergence r(i) for each OTU. The net divergence is the sum of distances from i to all other OTUs:
$r(i) = \sum_{j \neq i} d(ij)$
A B C D E
B 5
C 4 7
D 7 10 7
E 6 9 6 5
F 8 11 8 9 8
r(A) = 5+4+7+6+8 = 30
r(B) = 42
r(C) = 32
r(D) = 38
r(E) = 34
r(F) = 44
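Step 1 can be checked with a few lines of Python. The helper `full_matrix` and the variable names are invented for this sketch; the numbers are the distance matrix from the slide.

```python
NAMES = "ABCDEF"
# Lower triangle of the distance matrix, one row per OTU.
LOWER = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
         "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}


def full_matrix(names, lower):
    """Expand a lower-triangular table into a symmetric dict of dicts."""
    d = {a: {b: 0 for b in names} for a in names}
    for row, vals in lower.items():
        for col, v in zip(names, vals):
            d[row][col] = d[col][row] = v
    return d


d = full_matrix(NAMES, LOWER)

# Net divergence: sum of distances from each OTU to all the others
# (the diagonal entry is 0, so it does not affect the sum).
r = {i: sum(d[i].values()) for i in NAMES}
print(r)
```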
NJ Algorithm
Step 2: Calculate a new distance matrix based on average divergence:
M(ij) = d(ij) - [r(i) + r(j)]/(N-2)
Example: A,B
M(AB) = d(AB) - [r(A) + r(B)]/(N-2) = 5 - (30 + 42)/4 = -13
A B C D E
B 5
C 4 7
D 7 10 7
E 6 9 6 5
F 8 11 8 9 8
Recall:
r(A) = 30
r(B) = 42
NJ Algorithm
Step 2 (continued): M(ij) = d(ij) - [r(i) + r(j)]/(N-2)
Distance matrix:
A B C D E
B 5
C 4 7
D 7 10 7
E 6 9 6 5
F 8 11 8 9 8
Average divergence matrix:
A B C D E
B -13.0
C -11.5 -11.5
D -10.0 -10.0 -10.5
E -10.0 -10.0 -10.5 -13.0
F -10.5 -10.5 -11.0 -11.5 -11.5
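The average divergence matrix can be recomputed from the distance matrix with the formula M(ij) = d(ij) - [r(i) + r(j)]/(N-2). The Python sketch below does this; the helper and variable names are invented for illustration.

```python
from itertools import combinations

NAMES = "ABCDEF"
LOWER = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
         "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}

# Symmetric distance lookup built from the lower triangle.
d = {a: {b: 0 for b in NAMES} for a in NAMES}
for row, vals in LOWER.items():
    for col, v in zip(NAMES, vals):
        d[row][col] = d[col][row] = v

N = len(NAMES)
r = {i: sum(d[i].values()) for i in NAMES}  # net divergences from Step 1

# M(ij) = d(ij) - [r(i) + r(j)] / (N - 2)
M = {(i, j): d[i][j] - (r[i] + r[j]) / (N - 2)
     for i, j in combinations(NAMES, 2)}

smallest = min(M.values())
candidates = sorted(pair for pair, v in M.items() if v == smallest)
print(smallest, candidates)
```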
NJ Algorithm
Step 3: Choose the two OTUs for which M(ij) is the smallest.
– the possible choices are A,B and D,E
– arbitrarily choose A and B
– form a new node called U, the parent of A and B
– calculate the branch lengths from U to A and B:
S(AU) = d(AB) / 2 + [r(A) - r(B)] / [2(N-2)] = 1
S(BU) = d(AB) - S(AU) = 4
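The branch-length arithmetic can be verified directly. This is a small Python sketch; the function name `branch_lengths` is invented here, and the inputs are d(AB) = 5, r(A) = 30, r(B) = 42 and N = 6 from the earlier steps.

```python
def branch_lengths(d_ij: float, r_i: float, r_j: float, n: int):
    """Branch lengths from the new node U to its children i and j."""
    s_i = d_ij / 2 + (r_i - r_j) / (2 * (n - 2))
    s_j = d_ij - s_i
    return s_i, s_j


s_au, s_bu = branch_lengths(d_ij=5, r_i=30, r_j=42, n=6)
print(s_au, s_bu)
```

B gets the longer branch because its net divergence is larger, which matches the observation that B has a faster molecular clock in this example.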
NJ Algorithm
• The tree after U is added.
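[Figure: the tree after U is added. U joins A (branch length 1) and B (branch length 4); C, D, E and F remain attached to the central node.]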
NJ Algorithm
Step 4: Define distances from U to the other terminal nodes:
– d(CU) = [d(AC) + d(BC) - d(AB)] / 2 = 3
– d(DU) = [d(AD) + d(BD) - d(AB)] / 2 = 6
– d(EU) = [d(AE) + d(BE) - d(AB)] / 2 = 5
– d(FU) = [d(AF) + d(BF) - d(AB)] / 2 = 7
– Note: no change in paired distances among {C,D,E,F}
U C D E
C 3
D 6 7
E 5 6 5
F 7 8 9 8
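The four new distances follow from one formula, with the sum d(ik) + d(jk) - d(ij) halved as a whole. The Python sketch below checks them; the function name `dist_to_new_node` is invented for illustration.

```python
def dist_to_new_node(d_ik: float, d_jk: float, d_ij: float) -> float:
    """Distance from node k to the new node U that replaces i and j."""
    return (d_ik + d_jk - d_ij) / 2


d_ab = 5
# Distances from A and from B to each remaining OTU (from the matrix).
d_a = {"C": 4, "D": 7, "E": 6, "F": 8}
d_b = {"C": 7, "D": 10, "E": 9, "F": 11}

d_u = {k: dist_to_new_node(d_a[k], d_b[k], d_ab) for k in d_a}
print(d_u)
```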
NJ Algorithm
• Now N = N-1 = 5
• Repeat steps 1 through 4
• Stop when N = 2
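Putting steps 1 through 4 in a loop gives a compact neighbor-joining sketch in Python. This is an illustrative implementation, not ClustalW's code: the names `neighbor_joining`, `U0`, `U1` and so on are invented here, and ties are broken by taking the first minimal pair. Because the example matrix was derived from a tree, the recovered branch lengths reproduce the input distances exactly.

```python
from itertools import combinations

NAMES = "ABCDEF"
LOWER = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
         "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}


def neighbor_joining(names, dist):
    """Return the tree as a list of (child, parent, branch_length) edges."""
    d = {a: dict(row) for a, row in dist.items()}  # work on a copy
    nodes, edges, next_id = list(names), [], 0
    while len(nodes) > 2:
        n = len(nodes)
        # Step 1: net divergence of every current terminal node.
        r = {i: sum(d[i][k] for k in nodes if k != i) for i in nodes}
        # Steps 2 and 3: pick the pair with the smallest M(ij).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[p[0]][p[1]] - (r[p[0]] + r[p[1]]) / (n - 2))
        u = f"U{next_id}"
        next_id += 1
        s_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges += [(i, u, s_i), (j, u, d[i][j] - s_i)]
        # Step 4: distances from the new node to the remaining nodes.
        d[u] = {}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = (d[i][k] + d[j][k] - d[i][j]) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))  # the last two nodes share one branch
    return edges


dist = {a: {b: 0 for b in NAMES} for a in NAMES}
for row, vals in LOWER.items():
    for col, v in zip(NAMES, vals):
        dist[row][col] = dist[col][row] = v

edges = neighbor_joining(NAMES, dist)
for child, parent, length in edges:
    print(f"{child} - {parent}: {length}")
```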
Progressive MSA stage 2 of 3: generate a guide tree calculated from the distance matrix
Progressive MSA stage 2 of 3: generate guide tree
((gi|5803139|ref|NP_006735.1|:0.04284,(gi|6174963|sp|Q00724|RETB_MOUS:0.00075,gi|132407|sp|P04916|RETB_RAT:0.00423):0.10542):0.01900,gi|89271|pir||A39486:0.01924,gi|132403|sp|P18902|RETB_BOVIN:0.01902);
five closely related lipocalins
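The guide tree above is written in Newick (New Hampshire) format, where each `name:length` pair gives a leaf and the length of the branch leading to it. As a minimal sketch, the Python snippet below pulls out the leaves with a regular expression. It is not a full Newick parser: it ignores the nesting and the internal branch lengths, and the pattern is an assumption that holds for this string.

```python
import re

NEWICK = ("((gi|5803139|ref|NP_006735.1|:0.04284,"
          "(gi|6174963|sp|Q00724|RETB_MOUS:0.00075,"
          "gi|132407|sp|P04916|RETB_RAT:0.00423):0.10542):0.01900,"
          "gi|89271|pir||A39486:0.01924,"
          "gi|132403|sp|P18902|RETB_BOVIN:0.01902);")

# A leaf is a run of characters that are not Newick punctuation,
# followed by ':' and a number. Internal nodes have no name, so
# '):0.10542' does not match.
leaves = {name: float(length)
          for name, length in re.findall(r"([^(),:;]+):([0-9.]+)", NEWICK)}

for name, length in sorted(leaves.items(), key=lambda kv: kv[1]):
    print(f"{length:.5f}  {name}")
```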
Feng-Doolittle stage 3: progressive alignment
• Make an MSA based on the order in the guide tree
• Start with the two most closely related sequences
• Then add the next closest sequence
• Continue until all sequences are added to the MSA
• Rule: “once a gap, always a gap.”
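One way to picture the rule: when a later sequence needs a new gap, a gap column is inserted into every row of the existing alignment, and the gaps already present are never removed. The Python helper below is a toy illustration of that bookkeeping only; the name `add_gap_column` is invented, and real aligners decide where to put the column by profile alignment.

```python
def add_gap_column(alignment: list[str], col: int) -> list[str]:
    """Insert a gap at column index col in every row of an alignment."""
    return [row[:col] + "-" + row[col:] for row in alignment]


existing = ["AC-T", "ACGT"]           # the earlier gap in row 1 is kept
widened = add_gap_column(existing, 1)
print(widened)
```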
Use Clustal W to do a progressive MSA
http://www2.ebi.ac.uk/clustalw/
Progressive MSA stage 3 of 3: progressively align the sequences following the branch order of the tree
Clustal W alignment of 5 closely related lipocalins
CLUSTAL W (1.82) multiple sequence alignment

gi|89271|pir||A39486            MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN  ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|     MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS  MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT    MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                  ********************:* ***:*****

gi|89271|pir||A39486            EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN  EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|     EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS  EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT    EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                                *********:*******.*:************.**:**************

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                                ****************:*******:****:*:* ****** *********
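The `*` symbols under each block mark columns in which all five sequences carry the same residue. The Python sketch below recomputes that part of the annotation for the last 14 columns of the first block. It is a simplification: the `:` and `.` symbols for conservative substitutions depend on residue groupings that are not modeled here, so those columns come out blank. The function name `conservation_line` is invented for illustration.

```python
def conservation_line(rows: list[str]) -> str:
    """'*' where a column holds one identical residue in every row."""
    return "".join(
        "*" if len(set(col)) == 1 and col[0] != "-" else " "
        for col in zip(*rows)
    )


# Columns 37-50 of the first alignment block, in the order shown above.
rows = [
    "RFSGTWYAMAKKDP",  # A39486
    "RFAGTWYAMAKKDP",  # RETB_BOVIN
    "RFSGTWYAMAKKDP",  # NP_006735.1
    "RFSGLWYAIAKKDP",  # RETB_MOUS
    "RFSGLWYAIAKKDP",  # RETB_RAT
]

line = conservation_line(rows)
print(rows[0])
print(line)
```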
Why “once a gap, always a gap”?
• There are many possible ways to make an MSA
• Where gaps are added is a critical question
• Gaps are often added to the first two (closest) sequences
• To change the initial gap choices later on would be to give more weight to distantly related sequences
• To maintain the initial gap choices is to trust that those gaps are most believable
Progressive Alignment: Discussion
• Strengths:
– Speed
– Progression biologically sensible (aligns using a tree)
• Weaknesses:
– No objective function
– No way of quantifying whether or not the alignment is good
– Local minimum problem
– Any errors in the initial alignment are carried through; there is no way to correct an early mistake
– More efficient for closely related sequences than for divergent sequences
Iterative Methods for Multiple Sequence Alignment
• Seeks to increase MSA score by randomly altering the alignment.
• Usually used to refine alignment
• Attempt to correct initial alignment problems by repeatedly aligning subgroups of the sequences and then by aligning these subgroups into a global alignment of all the sequences
– Starts with a multiple sequence alignment.
– Refine it.
– Repeat until one MSA doesn’t change significantly from the next.
MultAlign
• Pairwise scores recalculated during progressive alignment
• Tree is recalculated
• Alignment is refined
PRRP
• Initial pairwise alignment predicts tree
• Tree produces weights
• Locally aligned regions considered to produce new alignment and tree
• Continue until alignments converge
DIALIGN
• Pairs of sequences aligned to locate ungapped aligned regions
• Diagonals of various lengths identified
• Collection of weighted diagonals provide alignment
SAGA: Genetic Algorithms
• Generate many different MSAs by rearrangements simulating gaps and recombination events
• SAGA (Sequence Alignment by Genetic Algorithm) is one approach
Simulated Annealing
• Obtain a higher-scoring multiple alignment
• Rearranges current alignment using probabilistic approach to identify changes that increase alignment score
• MSASA: Multiple Sequence Alignment by Simulated Annealing
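The "probabilistic approach" in simulated annealing is usually an acceptance rule of the Metropolis type: improvements are always accepted, and a change that lowers the score is accepted with a probability that shrinks as the temperature falls. The Python sketch below shows that generic rule. Whether MSASA uses exactly this form is not stated in the slides, so treat the function `accept` and its parameters as assumptions.

```python
import math
import random


def accept(delta_score: float, temperature: float, rng=random) -> bool:
    """Metropolis-style acceptance for a proposed alignment change.

    delta_score: new score minus current score (higher is better).
    temperature: positive value that is lowered as the search proceeds.
    """
    if delta_score >= 0:
        return True  # never reject an improvement
    # Worse alignments are accepted with probability exp(delta / T).
    return rng.random() < math.exp(delta_score / temperature)


print(accept(2.0, temperature=1.0))        # improvement
print(accept(-1000.0, temperature=0.001))  # large loss at low temperature
```

Accepting some score-lowering changes early on is what lets the search escape the local optima that trap a purely greedy refinement.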
MUSCLE: next-generation progressive MSA
[1] Build a draft progressive alignment
Determine pairwise similarity through k-mer counting (not by alignment)
Compute distance (triangular distance) matrix
Construct tree using UPGMA
Construct draft progressive alignment following tree
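The point of k-mer counting is that similarity can be estimated without aligning anything. The Python sketch below shows one simple version: the fraction of k-mers two sequences share. It is a simplified stand-in, not MUSCLE's actual measure; the function names, the choice k = 3 and the normalization are assumptions made for this illustration.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a: str, b: str, k: int = 3) -> float:
    """Shared k-mers divided by the k-mer count of the shorter sequence."""
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return shared / (min(len(a), len(b)) - k + 1)


same = kmer_similarity("ERDCRVSSF", "ERDCRVSSF")
one_change = kmer_similarity("ERDCRVSSF", "ERDCKVSSF")
print(same, one_change)
```

A distance for the triangular matrix could then be taken as one minus this similarity.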
MUSCLE: next-generation progressive MSA
[2] Improve the progressive alignment
Compute pairwise identity through current MSA
Construct new tree with Kimura distance measures
Compare new and old trees: if improved, repeat this step, if not improved, then we’re done
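The Kimura distance corrects the observed fraction of differing residues for multiple substitutions at the same site. A common form of Kimura's protein formula is d = -ln(1 - D - D^2/5), where D is one minus the fractional identity. The Python sketch below applies it; the function name is invented, and the slides do not say which variant MUSCLE implements.

```python
import math


def kimura_distance(identity: float) -> float:
    """Kimura protein distance from fractional identity (0..1).

    Only defined while 1 - D - D*D/5 stays positive, that is for
    D below roughly 0.85.
    """
    D = 1.0 - identity
    return -math.log(1.0 - D - D * D / 5.0)


for identity in (1.0, 0.9, 0.5):
    print(identity, round(kimura_distance(identity), 4))
```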
MUSCLE: next-generation progressive MSA
[3] Refinement of the MSA
Split tree in half by deleting one edge
Make profiles of each half of the tree
Re-align the profiles
Accept/reject the new alignment
MUSCLE output (formatted with SeaView)
SeaView is a graphical multiple sequence alignment editor available at http://pbil.univ-lyon1.fr/software/seaview.html
![Page 61: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/61.jpg)
Scoring Multiple Alignments
• Because we can’t see the ancestral sequences, it is often impossible to know what the “correct” multiple alignment is. (Since some residues may not be structurally superposable, there may not be a correct alignment.)
• The best we can do is to define a “scoring function” for evaluating the “goodness” of a multiple alignment.
• We then try to find the multiple alignment that maximizes this function.
• This is entirely analogous to the scoring function used in pairwise alignment algorithms.
![Page 62: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/62.jpg)
Scoring Function Features
• The key difference between multiple alignments and pairwise alignments is the fact that different pairs of sequences are separated by different evolutionary distances.
• Any set of sequences we wish to align is related by a phylogenetic tree.
• Ideally, our scoring system should model molecular sequence evolution.
![Page 63: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/63.jpg)
Ideal Scoring Function
• Sequences are related by an evolutionary tree.
• Assume a probabilistic model of molecular evolution.
• Multiple alignment score, S, is
$$S = \sum_{X} \Pr(\text{Tree} \mid \text{Root} = X)\,\Pr(X)$$
[Figure: evolutionary tree with nodes labeled A, B, C, D, and Root]
![Page 64: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/64.jpg)
Ideal Function is Too Complex
• In most cases, we don’t have nearly enough information to model evolution accurately.
• The probability depends on knowing the length of each branch in the tree accurately.
• The rate of evolution is not constant across the columns of the alignment, since selective pressure is stronger on critical residues.
![Page 65: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/65.jpg)
Scoring Function Features cont’d
• As with pairwise alignments, the scoring function should take the chemical/physical properties of residues into account.
![Page 66: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/66.jpg)
Simple Score Functions
• If we assume that the columns of the alignment are independent, the scoring function can be written as a sum of column scores plus a gap score:
$$S(m) = G + \sum_{i} S(m_i)$$
where $m_i$ is column $i$ of the alignment and $G$ is a function for scoring gaps.
![Page 67: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/67.jpg)
Sum of Pairs: SP Scores
• Using BLOSUM62 matrix, gap penalty -8
• k(k-1)/2 pairs per column
Alignment:
- I K
S I K
S S E
• In column 1, we have the pairs (-,S), (-,S), (S,S)
Column 1 score: -8 - 8 + 4 = -12
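The sum-of-pairs calculation above can be sketched in a few lines. The dictionary holds only the BLOSUM62 entries needed for this toy alignment; scoring a gap-gap pair as zero is an assumption of the sketch, since the slide does not address that case.

```python
from itertools import combinations

# Only the BLOSUM62 entries needed for this toy alignment (keys sorted alphabetically).
BLOSUM62_SUBSET = {
    ("S", "S"): 4, ("I", "I"): 4, ("K", "K"): 5, ("E", "E"): 5,
    ("I", "S"): -2, ("E", "K"): 1,
}
GAP_PENALTY = -8


def pair_score(a, b):
    """Score one residue pair; a residue against a gap costs GAP_PENALTY."""
    if a == "-" and b == "-":
        return 0                      # assumption: gap-gap pairs score zero
    if a == "-" or b == "-":
        return GAP_PENALTY
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def column_sp(column):
    """Sum of scores over all k(k-1)/2 pairs in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def alignment_sp(rows):
    """SP score of an alignment: the sum of its column scores."""
    return sum(column_sp(col) for col in zip(*rows))


rows = ["-IK", "SIK", "SSE"]
print([column_sp(col) for col in zip(*rows)], alignment_sp(rows))
```

For column 1 this reproduces the slide's value of -12.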
![Page 68: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/68.jpg)
Problems with Sum of Pairs Scores
SP scores are very commonly used, but they have problems:
• They have no probabilistic justification.
• The relative difference in score between the correct and incorrect alignment decreases as the evidence increases—this is counter-intuitive.
![Page 69: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/69.jpg)
Minimum Entropy Scores
• This is a probabilistic (well, information theoretic) way of saying how “pure” or “good” an alignment column is.
• Intuition: good alignment columns will contain very few different letters
• Method: We convert the alignment column into a probability vector and compute the entropy of the vector.
![Page 70: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/70.jpg)
Entropy
• Entropy is a very useful concept from Information Theory.
• If X is a random variable that can have values X1, X2, …, Xk, the entropy of X is defined as:
$$H(X) = -\sum_{j=1}^{k} \Pr(X_j) \log \Pr(X_j)$$
• The maximum entropy is log k, when the distribution is uniform, e.g., Pr(X) = (¼, ¼, ¼, ¼).
• The minimum entropy is 0, when the distribution puts all its weight on one letter, e.g., Pr(X) = (0, 0, 1, 0).
![Page 71: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/71.jpg)
Entropy
• Define frequencies for the occurrence of each letter in each column of the multiple alignment
– pA = 1, pT = pG = pC = 0 (1st column)
– pA = 0.75, pT = 0.25, pG = pC = 0 (2nd column)
– pA = 0.50, pT = 0.25, pC = 0.25, pG = 0 (3rd column)
• Compute entropy of each column:
$$-\sum_{X = A,T,G,C} p_X \log p_X$$
Alignment:
A A A
A A A
A A T
A T C
![Page 72: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/72.jpg)
Entropy: Example
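Best case: a column of four identical letters (A, A, A, A)
$$\text{entropy} = 0$$
Worst case: a column of four different letters (A, T, G, C)
$$\text{entropy} = -\sum \tfrac{1}{4} \log_2 \tfrac{1}{4} = -4 \left( \tfrac{1}{4} \times (-2) \right) = 2$$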
![Page 73: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/73.jpg)
Multiple Alignment: Entropy Score
Entropy for a multiple alignment is the sum of entropies of its columns:
$$\text{Entropy} = -\sum_{\text{over all columns}} \; \sum_{X = A,T,G,C} p_X \log p_X$$
![Page 74: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/74.jpg)
Entropy of an Alignment: Example
Column entropy: -(pA log pA + pC log pC + pG log pG + pT log pT), with logarithms to base 2 and 0*log 0 taken as 0
• Column 1 = -[1*log(1) + 0*log 0 + 0*log 0 + 0*log 0] = 0
• Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log 0 + 0*log 0] = -[(1/4)*(-2) + (3/4)*(-0.415)] = +0.811
• Column 3 = -[(1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4)] = 4 * -[(1/4)*(-2)] = +2.0
• Alignment entropy = 0 + 0.811 + 2.0 = +2.811
A A A
A C C
A C G
A C T
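The worked example can be checked with a short sketch. It assumes base-2 logarithms (consistent with log(1/4) = -2 above) and counts only letters that actually occur in a column, which is equivalent to treating 0*log 0 as 0. Function names are illustrative.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (base 2) of the letter frequencies in one column."""
    n = len(column)
    # -p*log2(p) is written as p*log2(1/p); absent letters contribute nothing.
    return sum((c / n) * math.log2(n / c) for c in Counter(column).values())


def alignment_entropy(rows):
    """Sum of the column entropies of an alignment given as equal-length rows."""
    return sum(column_entropy(col) for col in zip(*rows))


rows = ["AAA", "ACC", "ACG", "ACT"]
for i, col in enumerate(zip(*rows), start=1):
    print("column", i, round(column_entropy(col), 3))
print("alignment", round(alignment_entropy(rows), 3))
```

Lower totals indicate more conserved columns, so an alignment search under this score would minimize the sum rather than maximize it.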
![Page 75: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/75.jpg)
Pros and Cons of Entropy Scores
• The entropy scores are probabilistic.
• They don’t take into account the fact that the sequences are related by a phylogenetic tree. This can be “fixed” by weighting the sequences so that sequences from close species are downweighted relative to sequences from distant species.
![Page 76: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/76.jpg)
Multiple sequence alignment to profile: HMMs
• Hidden Markov models (HMMs) consist of “states” that describe the probability of having a particular amino acid residue in each column of a multiple sequence alignment
• HMMs are probabilistic models
• An HMM gives more sensitive alignments than traditional techniques such as progressive alignments
![Page 77: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/77.jpg)
Simple Hidden Markov Model
Observation: YNNNYYNNNYN
(Y=goes out, N=doesn’t go out)
What is underlying reality (the hidden state chain)?
[Figure: two hidden states, R (rain) and S (sun), joined by transition arrows labeled 0.15, 0.85, 0.2, and 0.8]
P(dog goes out in rain) = 0.1
P(dog goes out in sun) = 0.85
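One standard way to recover the hidden state chain is Viterbi decoding, sketched below for this toy model. The emission probabilities come from the slide. The assignment of the four transition values to specific arrows and the uniform start probabilities are assumptions, because the transcript does not preserve the diagram.

```python
import math

STATES = ("R", "S")                      # R = rain, S = sun
# Emission probabilities taken from the slide.
EMIT = {"R": {"Y": 0.10, "N": 0.90},
        "S": {"Y": 0.85, "N": 0.15}}
# ASSUMPTION: which arrow carries which value is not recoverable from the transcript.
TRANS = {"R": {"R": 0.80, "S": 0.20},
         "S": {"S": 0.85, "R": 0.15}}
# ASSUMPTION: both states are equally likely at the start.
START = {"R": 0.5, "S": 0.5}


def viterbi(observations):
    """Return the most probable hidden-state string for a non-empty string of Y/N."""
    # score[s] = best log-probability of any state path ending in s
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}
    back = []                            # one dict of best predecessors per step
    for obs in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("YNNNYYNNNYN"))
```

Working in log space avoids numerical underflow on long observation strings. A different assignment of the transition values may change the decoded path.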
![Page 78: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/78.jpg)
An HMM is constructed from a MSA
Example: five lipocalins
GTWYA (hs RBP)
GLWYA (mus RBP)
GRWYE (apoD)
GTWYE (E. coli)
GEWFS (MUP4)
![Page 79: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/79.jpg)
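GTWYA
GLWYA
GRWYE
GTWYE
GEWFS

| Prob. | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| p(G) | 1.0 | | | | |
| p(T) | | 0.4 | | | |
| p(L) | | 0.2 | | | |
| p(R) | | 0.2 | | | |
| p(E) | | 0.2 | | | 0.4 |
| p(W) | | | 1.0 | | |
| p(Y) | | | | 0.8 | |
| p(F) | | | | 0.2 | |
| p(A) | | | | | 0.4 |
| p(S) | | | | | 0.2 |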
![Page 80: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/80.jpg)
P(GEWYE) = (1.0)(0.2)(1.0)(0.8)(0.4) = 0.064
log odds score = ln(1.0) + ln(0.2) + ln(1.0) + ln(0.8) + ln(0.4) = -2.75
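The calculation for GEWYE can be reproduced with a small sketch that builds the per-column frequencies from the five lipocalin fragments. It uses raw observed frequencies with no pseudocounts, which is a simplification; function names are illustrative.

```python
import math
from collections import Counter

MSA = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]


def column_profile(rows):
    """One dict per column mapping residue -> observed frequency."""
    n = len(rows)
    return [{res: count / n for res, count in Counter(col).items()}
            for col in zip(*rows)]


def sequence_probability(profile, seq):
    """Product of per-column frequencies; 0.0 if a residue was never observed."""
    p = 1.0
    for column, residue in zip(profile, seq):
        p *= column.get(residue, 0.0)
    return p


profile = column_profile(MSA)
p = sequence_probability(profile, "GEWYE")
print(round(p, 3), round(math.log(p), 2))
```

Without pseudocounts, any sequence containing a residue not seen in its column gets probability zero, which is one reason practical profile models smooth these frequencies.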
![Page 81: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/81.jpg)
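[Figure: chain of five states with emission probabilities. State 1: G 1.0. State 2: T 0.4, L 0.2, R 0.2, E 0.2. State 3: W 1.0. State 4: Y 0.8, F 0.2. State 5: E 0.4, A 0.4, S 0.2]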
![Page 82: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/82.jpg)
Structure of a hidden Markov model (HMM)
main state
insert state
delete state
![Page 83: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/83.jpg)
From MSA to Profile
• Profile HMMs are important because they provide a powerful way to search databases for distantly related homologs.
• HMMs can be created using the HMMER program.
![Page 84: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/84.jpg)
HMMER: search an HMM against GenBank
Scores for complete sequences (score includes all domains):
Sequence Description Score E-value N
-------- ----------- ----- ------- ---
gi|20888903|ref|XP_129259.1| (XM_129259) ret 461.1 1.9e-133 1
gi|132407|sp|P04916|RETB_RAT Plasma retinol- 458.0 1.7e-132 1
gi|20548126|ref|XP_005907.5| (XM_005907) sim 454.9 1.4e-131 1
gi|5803139|ref|NP_006735.1| (NM_006744) ret 454.6 1.7e-131 1
gi|20141667|sp|P02753|RETB_HUMAN Plasma retinol- 451.1 1.9e-130 1
..
gi|16767588|ref|NP_463203.1| (NC_003197) out 318.2 1.9e-90 1

gi|5803139|ref|NP_006735.1|: domain 1 of 1, from 1 to 195: score 454.6, E = 1.7e-131
*->mkwVMkLLLLaALagvfgaAErdAfsvgkCrvpsPPRGfrVkeNFDv
mkwV++LLLLaA + +aAErd Crv+s frVkeNFD+
gi|5803139 1 MKWVWALLLLAA--W--AAAERD------CRVSS----FRVKENFDK 33

erylGtWYeIaKkDprFErGLllqdkItAeySleEhGsMsataeGrirVL
+r++GtWY++aKkDp E GL+lqd+I+Ae+S++E+G+Msata+Gr+r+L
gi|5803139 34 ARFSGTWYAMAKKDP--E-GLFLQDNIVAEFSVDETGQMSATAKGRVRLL 80

eNkelcADkvGTvtqiEGeasevfLtadPaklklKyaGvaSflqpGfddy
+N+++cAD+vGT+t++E dPak+k+Ky+GvaSflq+G+dd+
gi|5803139 81 NNWDVCADMVGTFTDTE----------DPAKFKMKYWGVASFLQKGNDDH 120
![Page 85: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/85.jpg)
Two kinds of multiple sequence alignment resources
[1] Databases of multiple sequence alignments
Text-based or query-based searches: CDD, Pfam (profile HMMs), PROSITE
[2] Multiple sequence alignment programs
Muscle, ClustalW, ClustalX
Page 329
![Page 86: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/86.jpg)
Databases of multiple sequence alignments
• BLOCKS
• CDD, Pfam, SMART (these use HMMs)
• DOMO (Gapped MSA)
• INTERPRO
• iProClass
• MetaFAM
• PRINTS
• PRODOM (PSI-BLAST)
• PROSITE
![Page 87: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/87.jpg)
Multiple sequence alignment programs
• AMAS
• CINEMA
• ClustalW
• ClustalX
• DIALIGN
• HMMT
• Match-Box
• MultAlin
• MSA
• Musca
• PileUp
• SAGA
• T-COFFEE
![Page 88: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/88.jpg)
Multiple sequence alignment algorithms

| | Local | Global |
|---|---|---|
| Progressive | PIMA | CLUSTAL, PileUp, other |
| Iterative | DIALIGN | SAGA |
![Page 89: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/89.jpg)
Performance of alignment programs depends on (McClure et al., 1994):
• the number of sequences
• the degree of similarity between sequences
• the number of insertions in the alignment
• the length of the sequences
• the existence of large insertions and N/C-terminal extensions
• over-representation of some members of the protein family
![Page 90: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/90.jpg)
Strategy for assessment of alternative multiple sequence alignment algorithms
• [1] Create or obtain a database of protein sequences for which the 3D structure is known. Thus we can define “true” homologs using structural criteria.
• [2] Try making multiple sequence alignments with many different sets of proteins (very related, very distant, few gaps, many gaps, insertions, outliers).
• [3] Compare the answers.
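One possible way to "compare the answers" is to measure how many residue pairs aligned in a structure-based reference are also aligned by the program under test. The sketch below is an assumption about how such a comparison could be scored, not a procedure given in the slides; the function names and the toy alignments are hypothetical.

```python
from itertools import combinations


def aligned_pairs(rows):
    """Set of residue pairs placed in the same column.

    Each residue is identified as (sequence index, position in the ungapped sequence).
    """
    positions = [0] * len(rows)          # next ungapped position for each sequence
    pairs = set()
    for column in zip(*rows):
        present = []
        for i, ch in enumerate(column):
            if ch != "-":
                present.append((i, positions[i]))
                positions[i] += 1
        pairs.update(combinations(present, 2))
    return pairs


def pair_recovery(test_rows, reference_rows):
    """Fraction of reference residue pairs that the test alignment also aligns."""
    ref = aligned_pairs(reference_rows)
    if not ref:
        return 0.0
    return len(ref & aligned_pairs(test_rows)) / len(ref)


reference = ["AC-G", "ACTG"]             # hypothetical structure-based alignment
candidate = ["ACG-", "ACTG"]             # hypothetical program output
print(pair_recovery(candidate, reference))
```

Identifying residues by their ungapped positions makes the score independent of how many gap columns each alignment contains.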
![Page 91: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/91.jpg)
BAliBASE: A benchmark alignments database for the evaluation of multiple sequence alignment programs
• BAliBASE is a database of manually-refined multiple sequence alignments specifically designed for the evaluation and comparison of multiple sequence alignment programs. The alignments are categorised by sequence length, similarity, and presence of insertions and N/C-terminal extensions. Core blocks are identified excluding non-superposable regions.
• http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/
![Page 92: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/92.jpg)
BAliBASE
• (Thompson et al., 1999, Nuc. Acids. Res. 27, 2682-2690)
• DIALIGN was found to be the best method for local multiple alignment.
• CLUSTAL W, PRRP and SAGA were superior on globally related sequence sets.
![Page 93: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/93.jpg)
Conclusions: assessment of alternative multiple sequence alignment algorithms
• [1] As percent identity among proteins drops, performance (accuracy) declines also. This is especially severe for proteins < 25% identity.
– Proteins <25% identity: 65% of residues align well
– Proteins <40% identity: 80% of residues align well
![Page 94: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/94.jpg)
Conclusions: assessment of alternative multiple sequence alignment algorithms
• [2] “Orphan” sequences are highly divergent members of a family. Surprisingly, orphans do not disrupt alignments. Also surprisingly, global alignment algorithms outperform local.
![Page 95: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/95.jpg)
Conclusions: assessment of alternative multiple sequence alignment algorithms
• [3] Separate multiple sequence alignments can be combined (e.g. RBPs and lactoglobulins).
– Iterative algorithms (PRRP, SAGA) outperform progressive alignments (ClustalX)
![Page 96: 10/18/20151 Multiple sequence alignment. 10/18/20152 Copyright notice Many of the images in this power point presentation are from Bioinformatics and](https://reader031.vdocument.in/reader031/viewer/2022012918/56649ece5503460f94bdb929/html5/thumbnails/96.jpg)
Conclusions: assessment of alternative multiple sequence alignment algorithms
• [4] When proteins have large N-terminal or C-terminal extensions, local alignment algorithms are superior. PileUp (global) is an exception.