![Page 1: Multiple sequence comparison (MSC) Reading: Setubal/Meidanis, 3.4 Gusfield, Algorithms on Strings, Trees and Sequences, chapter 14](https://reader036.vdocument.in/reader036/viewer/2022083005/56649f215503460f94c39d16/html5/thumbnails/1.jpg)
Multiple sequence comparison (MSC)
Reading: Setubal/Meidanis, 3.4
Gusfield, Algorithms on Strings, Trees and Sequences, chapter 14
Why care about similarity?
• Similar sequences have similar structure
Similar structure -> similar sequence?
• No, the converse is not true!
• Convergent evolution. Outwardly similar solutions to similar problems may be internally different.
• Tiger and ‘Tasmanian tiger’. Fish and dolphin. Bat and bird.
• Same is true of molecular ‘species’ and ‘anatomies’!
Sequence --> function
• Similar sequences have similar function
• ‘[T]he same genes that work in flies are the ones that work in humans.’ -- Eric Wieschaus, 1995 Nobel Prize for Drosophila work
Common origins
• Similar sequences have common origins
• ‘Descent with modification’ is Nature’s design mechanism
• Strong similarity may imply recent common origin (what do we mean by ‘strong’ and ‘recent’?)
• Strong similarity may imply strong conservation of sequence or motif
Is multiple sequence comparison a generalization?
• From a CS point of view, we’re going from two strings to many strings, a generalization
• Yes, in that it helps detect faint similarities
• No, in that we go from known biological similarity to suspected sequence similarity
‘Big’ uses for MSC
• Represent protein families
• Identify conserved sequence features
• Deduce evolutionary history
Profile representation
• Definition: given a multiple alignment of a set of strings, a profile specifies, for each column, the frequency of each character
Profile example
Alignment
a b c - a
a b a b a
a c c b -
c b - b c
Profile
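
|   | C1  | C2  | C3  | C4  | C5  |
|---|-----|-----|-----|-----|-----|
| a | .75 |     | .25 |     | .50 |
| b |     | .75 |     | .75 |     |
| c | .25 | .25 | .50 |     | .25 |
| - |     |     | .25 | .25 | .25 |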
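
The profile for the example alignment can be recomputed mechanically. The following is a minimal sketch, assuming the alignment rows are given as equal-length strings with `-` for a space; the function name `build_profile` is illustrative and not from the slides.

```python
from collections import Counter

def build_profile(alignment):
    """Return one dict per column mapping character -> relative frequency."""
    n_rows = len(alignment)
    profile = []
    for column in zip(*alignment):          # iterate over alignment columns
        counts = Counter(column)
        profile.append({ch: cnt / n_rows for ch, cnt in counts.items()})
    return profile

# The four aligned rows from the example, written without spaces.
alignment = ["abc-a", "ababa", "accb-", "cb-bc"]
profile = build_profile(alignment)

for j, column in enumerate(profile, start=1):
    print(f"C{j}", column)
```

Characters that never occur in a column are simply absent from that column's dict, which corresponds to the empty cells of the profile table.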
Fit string S to profile P
• Given a profile P and a string S, what is the best alignment (fit) of S to P?
• Example:
S: a a b - b c
P: 1 - 2 3 4 5
Two key issues
• How to score an alignment of a string to a profile
• How to compute an optimal alignment, given a scoring system
Scoring and alignment of profile
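• Scoring: assuming letter-to-letter scores are given, score a character against a column as the frequency-weighted sum of its scores against the characters in that column
• Optimal alignment: by DP, similar to optimal string-to-string (S-S) alignment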
• Q: How would you do profile-to-profile scoring and alignment?
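
The weighted-sum scoring and the DP described above might be sketched as follows. The letter-to-letter scores in `pair_score` (match +2, mismatch -1, anything against a space -1, space against space 0) are assumptions chosen only for illustration; the slides do not fix a scoring scheme. The function returns only the optimal score, not the alignment itself.

```python
def pair_score(x, y):
    """Hypothetical letter-to-letter scores; '-' denotes a space."""
    if x == "-" and y == "-":
        return 0.0
    if x == "-" or y == "-":
        return -1.0
    return 2.0 if x == y else -1.0

def column_score(ch, column):
    """Frequency-weighted sum of ch's scores against a profile column."""
    return sum(freq * pair_score(ch, y) for y, freq in column.items())

def fit_string_to_profile(s, profile):
    """Best global score of string s against profile (list of column dicts)."""
    n, m = len(s), len(profile)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):               # characters of s against no column
        V[i][0] = V[i - 1][0] + pair_score(s[i - 1], "-")
    for j in range(1, m + 1):               # profile columns against spaces in s
        V[0][j] = V[0][j - 1] + column_score("-", profile[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + column_score(s[i - 1], profile[j - 1]),
                V[i - 1][j] + pair_score(s[i - 1], "-"),      # space in profile
                V[i][j - 1] + column_score("-", profile[j - 1]),  # space in s
            )
    return V[n][m]

# Profile of the example alignment, one dict per column.
profile = [
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, "-": 0.25},
    {"b": 0.75, "-": 0.25},
    {"a": 0.50, "c": 0.25, "-": 0.25},
]
print(fit_string_to_profile("aabbc", profile))
```

The recurrence has the same three cases as pairwise global alignment; the only change is that a column of the profile plays the role of a single character, scored through `column_score`.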
Signature (motif) representation
• A motif is a regular expression (re)
• Example: a helicase motif
[&H][&A]D[DE]xn[TSN][x4][QK]Gx7[&A], where
– [abc] = any of a, b, c
– & = [ILVMFYW]
– x = any amino acid
– a3 = up to 3 a’s
– an = any number of a’s
• Find a motif by grep-ing
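
As an illustration of "grep-ing" for a motif, the helicase signature can be translated into a Python regular expression. This sketch assumes the second group of the signature reads `[&A]` followed by a literal `D`, that `x4` and `x7` mean "up to 4" and "up to 7" arbitrary residues (following the "a3 = up to 3 a's" convention), and that `.` is an acceptable stand-in for "any amino acid". The test string is made up and is not a real helicase sequence.

```python
import re

HYDROPHOBIC = "ILVMFYW"   # the '&' class from the motif notation

HELICASE = re.compile(
    f"[{HYDROPHOBIC}H]"   # [&H]
    f"[{HYDROPHOBIC}A]"   # [&A]
    "D"                   # D
    "[DE]"                # [DE]
    ".*"                  # xn: any number of arbitrary residues
    "[TSN]"               # [TSN]
    ".{0,4}"              # x4: up to 4 arbitrary residues
    "[QK]"                # [QK]
    "G"                   # G
    ".{0,7}"              # x7: up to 7 arbitrary residues
    f"[{HYDROPHOBIC}A]"   # [&A]
)

def find_motif(protein):
    """Return the (start, end) span of the first match, or None."""
    match = HELICASE.search(protein)
    return match.span() if match else None

print(find_motif("HADEGGTQGA"))   # toy sequence, for illustration only
print(find_motif("GGGG"))
```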
Finding optimal MS alignment
• Need a scoring system
• Given a scoring system, an (efficient) method of calculation
• If no efficient method of getting the right answer, an efficient way of getting a plausible answer
Need MSC measure
• Desirable characteristics:
– variable number of sequences
– column-wise calculation
– order independence
MQPILLL
MLR-LL-
MK-ILLL
MPPVLIL
Sum-of-pairs (SP) measure
• Column score = sum of the pairwise scores of the characters in the column
• k sequences give (k choose 2) = k(k-1)/2 pairs per column
• Reduces to pairwise alignment when k = 2
• Need to assign (-,-) value
• May compute in either row or column order
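
The SP measure can be sketched in a few lines. The pairwise scores used here (match +1, mismatch -1, character against space -2, space against space 0) are assumptions for illustration only. The two functions show that summing column by column and summing over pairs of rows give the same total.

```python
from itertools import combinations

def pair_score(x, y):
    """Hypothetical pairwise scores; the (-,-) pair is assigned 0."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1

def sp_column(column):
    """Sum of pairwise scores over all (k choose 2) pairs in one column."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))

def sp_score_by_columns(alignment):
    return sum(sp_column(col) for col in zip(*alignment))

def sp_score_by_rows(alignment):
    """Same total, computed as the sum of induced pairwise alignment scores."""
    return sum(
        sum(pair_score(x, y) for x, y in zip(row_a, row_b))
        for row_a, row_b in combinations(alignment, 2)
    )

alignment = ["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
print(sp_score_by_columns(alignment), sp_score_by_rows(alignment))
```

With two rows there is a single pair per column, so the SP score is exactly the ordinary pairwise alignment score.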
DP approach
• Generalization of two-sequence comparison
• k-dimensional array
• Space complexity is O(n^k) for k sequences of length n
• MSC with SP measure is NP-complete
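
A minimal sketch of the k-dimensional DP with the SP measure follows, written as a memoized recursion over index tuples rather than an explicit k-dimensional array. The pairwise scores are the same illustrative assumptions as before (match +1, mismatch -1, space -2, space against space 0). Each cell considers the 2^k - 1 ways of choosing which sequences contribute a character to the last column, so this is practical only for a handful of short sequences.

```python
from functools import lru_cache
from itertools import combinations, product

def pair_score(x, y):
    """Hypothetical pairwise scores; '-' denotes a space."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1

def sp_column(column):
    return sum(pair_score(x, y) for x, y in combinations(column, 2))

def msa_sp_score(seqs):
    """Optimal sum-of-pairs score of a global multiple alignment of seqs."""
    k = len(seqs)
    # Every non-zero 0/1 vector: which sequences advance in the last column.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    @lru_cache(maxsize=None)
    def best(pos):
        if not any(pos):                    # all prefixes empty
            return 0
        candidates = []
        for move in moves:
            prev = tuple(p - d for p, d in zip(pos, move))
            if min(prev) < 0:               # would step before a string start
                continue
            column = [seqs[i][pos[i] - 1] if move[i] else "-" for i in range(k)]
            candidates.append(best(prev) + sp_column(column))
        return max(candidates)

    return best(tuple(len(s) for s in seqs))

print(msa_sp_score(["AC", "AGC"]))
print(msa_sp_score(["AC", "AC", "AC"]))
```

The memo table holds one entry per index tuple, which is where the O(n^k) space requirement comes from.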
MSA speedup heuristic
• This ‘heuristic’ guarantees the right answer!
• But ... it doesn’t guarantee the speedup
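• General idea:
– find a bound L on the value of the optimal alignment (for example, the value of any known alignment)
– if the best alignment through a cell must be worse than L, that cell cannot enter into the optimal solution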
Common method -- iterative
• Simplest implementation
• Begin with Si and Sj which are pairwise closest
• Iteratively merge in the additional string with the smallest edit distance from any string already in the multiple alignment
• Equivalent to building a minimum spanning tree (MST) on the graph of pairwise edit distances
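
The order in which strings are merged can be sketched as below. Picking the closest outside string at each step is the same greedy rule as Prim's minimum-spanning-tree algorithm. This sketch computes only the merge order under unit-cost edit distance; the step that actually merges a string into the growing alignment is left out, and the function names and input strings are illustrative assumptions.

```python
def edit_distance(a, b):
    """Unit-cost edit (Levenshtein) distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]

def merge_order(seqs):
    """Indices of seqs in the order they would join the alignment (needs >= 2)."""
    n = len(seqs)
    dist = {(i, j): edit_distance(seqs[i], seqs[j])
            for i in range(n) for j in range(n) if i != j}
    first = min(dist, key=dist.get)         # the pairwise-closest pair
    order = list(first)
    remaining = set(range(n)) - set(first)
    while remaining:
        # Closest remaining string to any string already merged.
        _, nxt = min((dist[(r, m)], r) for r in remaining for m in order)
        order.append(nxt)
        remaining.remove(nxt)
    return order

seqs = ["GGGG", "AAAA", "AATT", "AAAT"]
print(merge_order(seqs))
```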
Clustering method
• Almost any clustering algorithm can be adapted to MSC
• Usually start with small clusters and build big ones
• Also possible to start with one big cluster and divide and conquer
• Not clear which method is best