Multiple sequence alignment Arthur W. Chou Fall, 2005

Page 1: Multiple sequence alignment Arthur W. Chou Fall, 2005

Multiple sequence alignment

Arthur W. Chou

Fall, 2005


Multiple sequence alignment: definition

Given:
• Set of sequences
• Similarity score matrix
• Gap penalties

Find: Alignment of sequences such that the optimal score is achieved.

Result: a collection of three or more protein or nucleic acid sequences that are partially or completely aligned, such that homologous residues are aligned in columns across the length of the sequences.


Why do we care about protein MA?

1. Useful way to summarize the sequences of related proteins.

What do globin sequences look like?

4mbn . ----------VLSEGEWQLVLHVWAKVE--ADVAGH
1myt . --------------ADFDAVLKCWGPVE--ADYTTM
2hhb A ----------VLSPADKTNVKAAWGKVG--AHAGEY
2mhb A ----------VLSAADKTNVKAAWSKVG--GHAGEY
1pbx A ----------SLSDKDKAAVRALWSKIG--KSADAI
2hhb B ---------VHLTPEEKSAVTALWGKV----NVDEV
2mhb B ---------VQLSGEEKAAVLALWDKV----NEEEV
2lhb . -PIVDTGSVAPLSAAEKTKIRSAWAPVY--STYETS
1mba . ----------SLSAAEADLAGKSWAPVFA--NKNAN
1sdh A --PSVYDAAAQLTADVKKDLRDSWKVIGS--DKKGN
1lh1 . ---------GALTESQAALVKSSWEEFN--ANIPKH
1hlb . GGTLAIQAQGDLTLAQKKIVRKTWHQLMRN--KTSF
1ith A ----------GLTAAQIKAIQDHWFLNI-KGCLQAA
1ecd . -----------LSADQISTVQASFDKVK------GD
2hbg . ----------GLSAAQRQVIAATWKDIAGADNGAGV


Why do we care about protein MA?

2. Useful way to find important functional amino acids by assessing conservation over many sequences.

What is conserved?

DRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKG
PKFAGI-AQADIAGNAAISAHGATVLKKLGELLKAKG
PHF-DLSH-----GSAQVKGHGKKVADALTNAVAHVD
PHF-DLSH-----GSAQVKAHGKKVGDALTLAVGHLD
SHWPDVTP-----GSPHIKAHGKKVMGGIALAVSKID
ESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLD
DSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLD
PKFKGLTTADELKKSADVRWHAERIINAVDDAVASMD
ADFKGKSVAD-IKASPKLRDVSSRIFTRLNEFVNNAA
KRLGNVS---QGMANDKLRGHSITLMYALQNFIDQLD
SFLKGT--SEVPQNNPELQAHAGKVFKLVYEAAIQLE
PQMAGM-SASQLRSSRQMQAHAIRVSSIMSEYVEELD
HKFS-SVPLYGLRSNPAYKAQTLTVINYLDKVVDALG
TQFAG-KDLESIKGTAPFETHANRIVGFFSKIIGELP
GFSGA--------SDPGVAALGAKVLAQIGVAVSHLG


Why do we care about protein MA?

3. Establish evolutionary relationships between sequences.

What was sequence of events leading to current species?

4mbn . EAIIHVLHSRHPGDFGADAQGAMNKA
1myt . EVLVKVMHEKAGLD--AGGQTALRNV
2hhb A HCLLVTLAAHLPAEFTPAVHASLDKF
2mhb A HCLLSTLAVHLPNDFTPAVHASLDKF
1pbx A HCILVVISTMFPKEFTPEAHVSLDKF
2hhb B NVLVCVLAHHFGKEFTPPVQAAYQKV
2mhb B NVLVVVLARHFGKDFTPELQASYQKV
2lhb . AVIADTVAAG---------DAGFEKL
1mba . SMFPGFVASVAA--PPAGADAAWTKL
1sdh A GPIKKVLASK---NFGDKYANAWAKL
1lh1 . EAILKTIKEVVGAKWSEELNSAWTIA
1hlb . MEALQAELGSD---FNEKTRDAWAKA
1ith A KLVGGVFQEE--FSADPTTVAAWGDA
1ecd . AGFVSYMKAHT--DF-AGAEAAWGAT
2hbg . ASLLSAMEHRIGGKMNAAAKDAWAAA


Why do we care about protein MA?

4. More precisely understand how to model 3D structures.

What other amino acids are acceptable in this structure?

4mbn . EAIIHVLHSRHPGDFGADAQGAMNKA
1myt . EVLVKVMHEKAGLD--AGGQTALRNV
2hhb A HCLLVTLAAHLPAEFTPAVHASLDKF
2mhb A HCLLSTLAVHLPNDFTPAVHASLDKF
1pbx A HCILVVISTMFPKEFTPEAHVSLDKF
2hhb B NVLVCVLAHHFGKEFTPPVQAAYQKV
2mhb B NVLVVVLARHFGKDFTPELQASYQKV
2lhb . AVIADTVAAG---------DAGFEKL
1mba . SMFPGFVASVAA--PPAGADAAWTKL
1sdh A GPIKKVLASK---NFGDKYANAWAKL
1lh1 . EAILKTIKEVVGAKWSEELNSAWTIA
1hlb . MEALQAELGSD---FNEKTRDAWAKA
1ith A KLVGGVFQEE--FSADPTTVAAWGDA
1ecd . AGFVSYMKAHT--DF-AGAEAAWGAT
2hbg . ASLLSAMEHRIGGKMNAAAKDAWAAA


What is the protein MA Gold Standard?

Structural Alignment

If sequences can be aligned, the alignment should reflect structural similarities.

Thus, the alignment should lead to a “match” of common structural and functional elements.


Aligning non-coding DNA sequences

• Conserved signals in DNA for control of expression

• Can infer evolutionary relationships

• Can identify important functional regions

• A much harder problem!


Methods for Multiple Alignment

1. Exhaustive search: extension of DP to multiple dimensions. E.g. MSA algorithm

2. Progressive alignment: compute tree of sequences, based on hierarchical clustering, and then merge closest first, greedily. E.g. ClustalW

3. Anchor on locally conserved blocks: find highly conserved regions and then grow alignment around these regions. E.g. BLAST

4. Iterative search: based on genetic algorithm search

5. Probabilistic/statistical: E.g. Gibbs Sampling, HMM


How to score a Multiple Alignment?

Sum of Pairs = SP

For each aligned column of the sequences, compute the pairwise scores of all pairs of characters and sum them up:

SP-score(I, -, I, V) = s(I, -) + s(I, I) + s(I, V) + s(-, I) + s(-, V) + s(I, V)

Note that s( - , - ) = 0

Gap penalty: can be constant or linear

MSA algorithm uses constant
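The sum-of-pairs score above can be sketched in a few lines of Python. This is only an illustration: the function names and the match, mismatch, and gap values in `pair_score` are placeholder assumptions standing in for a real similarity matrix and gap penalty, not values from the slides.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical s(a, b); the numeric values are placeholders."""
    if a == "-" and b == "-":
        return 0                      # s(-, -) = 0, as noted on the slide
    if a == "-" or b == "-":
        return gap                    # residue aligned against a gap
    return match if a == b else mismatch

def sp_column(column):
    """Sum of pairwise scores over all pairs of characters in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def sp_alignment(rows):
    """SP score of a whole alignment: sum of the column scores."""
    return sum(sp_column(col) for col in zip(*rows))

# The column (I, -, I, V) expands to the six pairs listed on the slide.
column_score = sp_column("I-IV")
```

With N sequences each column contributes N(N-1)/2 pair terms, so the four-character column in the example contributes six.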


Multidimensional Dynamic Programming

Why not just use the same technique as for pairwise alignment?

Instead of a 2-dimensional matrix, use an N-dimensional one, where N = the number of sequences.

Complexity grows exponentially with the number of sequences (about L^N matrix cells for N sequences of length L), so only N < 10 and lengths ~ 200 can be accommodated.
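To get a feel for how quickly the N-dimensional matrix grows, here is a rough size estimate in Python. It assumes every sequence has the same length L and that each cell is filled from its neighbours one column at a time. Both are simplifying assumptions, and the function names are made up for this sketch.

```python
def dp_cells(n_seqs, length):
    """Number of cells in the full N-dimensional DP matrix, (L + 1)^N."""
    return (length + 1) ** n_seqs

def predecessors_per_cell(n_seqs):
    """Each cell can be reached from 2^N - 1 neighbours: every sequence
    either contributes a residue or a gap, excluding the all-gap column."""
    return 2 ** n_seqs - 1

# Table of matrix sizes for sequences of length 200.
sizes = {n: dp_cells(n, 200) for n in (2, 3, 4, 5)}
```

Going from two sequences to three multiplies the number of cells by 201 for length-200 sequences, which is why exhaustive search is limited to a handful of sequences.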


Dynamic Programming with scores and penalties

S[i, j, k] = best score from i, j, k onward, i.e. from the i-th position in A, the j-th position in B, and the k-th position in C onward

S[i, j, k] = max of:

SP-score(A[i], B[j], C[k]) + S[i+1, j+1, k+1]

max over x { S[i+x, j, k] - w(x) }

max over y { S[i, j+y, k] - w(y) }

max over z { S[i, j, k+z] - w(z) }

max over x, y { S[i+x, j+y, k] - w(x) - w(y) }

. . . . . . . . . . . . .
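A minimal Python sketch of this kind of three-sequence dynamic program follows. It is a simplification, not the recurrence exactly as written: instead of gap-length penalties w(x), it moves one column at a time and charges a fixed per-pair gap score inside the SP-score. All scoring values and function names are assumptions made for illustration.

```python
from functools import lru_cache
from itertools import combinations, product

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical s(a, b) with placeholder values; s(-, -) = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_column(column):
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def align3_score(A, B, C):
    """Best SP score for aligning three sequences, computed as
    S[i, j, k] = best score from positions i, j, k onward."""
    # Each move consumes a residue from a non-empty subset of the sequences.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]

    @lru_cache(maxsize=None)
    def S(i, j, k):
        if i == len(A) and j == len(B) and k == len(C):
            return 0
        best = float("-inf")
        for di, dj, dk in moves:
            if i + di > len(A) or j + dj > len(B) or k + dk > len(C):
                continue
            column = (A[i] if di else "-",
                      B[j] if dj else "-",
                      C[k] if dk else "-")
            best = max(best, sp_column(column) + S(i + di, j + dj, k + dk))
        return best

    return S(0, 0, 0)

score = align3_score("AGT", "AT", "GT")
```

The recursion is only practical for short sequences, since the number of (i, j, k) states grows with the product of the three lengths.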


MSA Algorithm

Based on the dynamic programming concept, using some bounds:

1. Compute optimal pairwise alignments to get an upper bound on the score of any pair of sequences within the multiple alignment. MSA can’t do any better than the sum of the optimal pairwise alignment scores.

2. Create a heuristic multiple alignment in an ad hoc fashion (using a guide tree) to get a lower bound on the MA score.

3. Search the N-dimensional scoring matrix for the best score including the i-th element of sequence 1, the j-th of sequence 2, the k-th of sequence 3, etc.


AGT
A-T
-GT

A-T
-GT


Problem of Sequence Weights

The available sequences are not randomly sampled, but reflect biases in how we collect sequences.

If everything is weighted equally, then closely related sequences will be allowed to dominate the multiple alignment. As a result, conclusions about

1) conservation
2) evolutionary distance
3) reliability of predictions

will be wrong.


Sequence Weighting Example

CYEGNGHF Human-1
CYEGNGDF Human-2
CYHGNGDS Mouse
CYHGNGQS Rat
CFNGNGHS Fruitfly

Solution: don’t weight the two human sequences equally with the others. Use a measure of similarity to down-weight their influence on the multiple alignment.
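One way to turn this idea into numbers is position-based weighting in the style of Henikoff and Henikoff. The slides do not name a scheme, so this is a swapped-in illustration, not the method used in the lecture. In each column, a residue that is shared by many sequences earns each of them a small share, while a rare residue earns its sequence a large one.

```python
from collections import Counter

def position_based_weights(seqs):
    """Henikoff-style weights: in each column a sequence gets 1 / (k * n),
    where k is the number of distinct residues in the column and n is the
    number of sequences sharing this sequence's residue."""
    weights = dict.fromkeys(seqs, 0.0)
    for column in zip(*seqs.values()):
        counts = Counter(column)
        k = len(counts)
        for name, residue in zip(seqs, column):
            weights[name] += 1.0 / (k * counts[residue])
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}  # sum to 1

seqs = {
    "Human-1": "CYEGNGHF",
    "Human-2": "CYEGNGDF",
    "Mouse": "CYHGNGDS",
    "Rat": "CYHGNGQS",
    "Fruitfly": "CFNGNGHS",
}
weights = position_based_weights(seqs)
```

On this small example the fruit fly sequence, the most divergent of the five, receives the largest weight, and the two human sequences receive identical weights.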


Feng-Doolittle Progressive MSA

1. Do global pairwise alignments (Needleman and Wunsch) for every pair of sequences

2. Create a guide tree based on them (e.g., neighbor joining)

3. Progressively align the sequences with weights from the guide tree


Progressive MSA stage 1 of 3: generate global pairwise alignments

five distantly related lipocalins

best score
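Stage 1 can be sketched as a Needleman-Wunsch score computed for every pair of sequences. The version below uses a linear gap penalty and placeholder match/mismatch values instead of a real substitution matrix. These are assumptions for illustration, and the toy sequences are not the lipocalins from the slide.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty.
    Only two rows of the DP matrix are kept, since just the score is needed."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def all_pair_scores(seqs):
    """Score every unordered pair of named sequences."""
    return {(x, y): nw_score(seqs[x], seqs[y]) for x, y in combinations(seqs, 2)}

toy = {"s1": "AGT", "s2": "AT", "s3": "GT", "s4": "AGGT", "s5": "ACT"}
scores = all_pair_scores(toy)
```

The resulting table of pairwise scores is the input to the guide-tree stage.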


Number of pairwise alignments needed

For N sequences, (N-1)(N)/2

For 5 sequences, (4)(5)/2 = 10

~ N^2 / 2
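The count is just the number of unordered pairs, which is quick to check in Python (the helper name is made up for this sketch):

```python
from itertools import combinations
from math import comb

def n_pairwise_alignments(n):
    """Number of pairwise alignments needed for n sequences: n(n-1)/2."""
    return n * (n - 1) // 2

# Enumerate the pairs explicitly for five sequences labelled 1..5.
pairs = list(combinations(range(1, 6), 2))
```

For five sequences the explicit enumeration, the formula, and the binomial coefficient `comb(5, 2)` all agree.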


Feng-Doolittle stage 2: guide tree

• Convert similarity scores to distance scores

• Use some clustering algorithm to construct the guide tree (UPGMA)

• A tree shows the distance between objects

• A guide tree is not a phylogenetic tree
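The guide-tree stage can be sketched with a small UPGMA implementation. As a stand-in for converting similarity scores to distances, this example simply counts mismatched positions between equal-length sequences. That choice, the function names, and the reuse of the five short sequences from the weighting example are all assumptions for illustration.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def upgma(dist):
    """Build a guide tree as nested tuples.
    dist maps frozenset({a, b}) -> distance between leaf labels a and b."""
    size = {leaf: 1 for pair in dist for leaf in pair}   # cluster -> leaf count
    d = dict(dist)
    while len(size) > 1:
        pair = min(d, key=d.get)               # closest pair of clusters
        a, b = sorted(pair, key=str)           # fixed order inside the tuple
        merged = (a, b)
        for c in size:
            if c not in pair:
                # size-weighted average of the distances to the two children
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))

seqs = {"Human-1": "CYEGNGHF", "Human-2": "CYEGNGDF", "Mouse": "CYHGNGDS",
        "Rat": "CYHGNGQS", "Fruitfly": "CFNGNGHS"}
dist = {frozenset((x, y)): hamming(seqs[x], seqs[y])
        for x, y in combinations(seqs, 2)}
tree = upgma(dist)
```

Here the two human sequences are joined first, then mouse and rat, and the fruit fly sequence is added last. The nesting only fixes the order in which sequences get merged; as the slide says, it should not be read as a phylogeny.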


Progressive MSA stage 2 of 3: generate a guide tree calculated from the distance matrix

[Figure: guide tree with leaves in the order 1, 2, 3, 5, 4]


Feng-Doolittle stage 3: progressive alignment

• Make successive alignments based on the order in the guide tree

• Start with the two most closely related sequences

• Then add the next closest sequence (or cluster)

• Continue until all sequences are added

• Rule: “once a gap, always a gap.”


Progressive MSA stage 3 of 3: progressively align the sequences


Why “once a gap, always a gap”?

• Where gaps are added is a critical question

• Gaps are often added to the first two (closest) sequences

• To change the initial gap choices later on would be to give more weight to distantly related sequences

• To maintain the initial gap choices is to trust that those gaps are most believable
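The “once a gap, always a gap” rule can be illustrated by adding one sequence to an existing alignment while treating each existing column as a fixed unit. The sketch below scores a residue against a column by summing placeholder pair scores, which is an assumption and not the Feng-Doolittle scoring. New gaps may be inserted as whole columns, but existing rows are never re-aligned.

```python
def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical s(a, b) with placeholder values; s(-, -) = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def add_sequence(rows, seq):
    """Align seq to an existing alignment without changing its columns."""
    cols = list(zip(*rows))
    gap_col = ("-",) * len(rows)
    n, m = len(cols), len(seq)

    def col_score(col, ch):
        return sum(pair_score(c, ch) for c in col)

    # S[i][j]: best score for the first i columns and the first j residues.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + col_score(cols[i - 1], "-")
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + col_score(gap_col, seq[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + col_score(cols[i - 1], seq[j - 1]),
                S[i - 1][j] + col_score(cols[i - 1], "-"),
                S[i][j - 1] + col_score(gap_col, seq[j - 1]),
            )

    # Trace back, emitting whole columns so existing gaps stay in place.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                S[i][j] == S[i - 1][j - 1] + col_score(cols[i - 1], seq[j - 1])):
            out.append(cols[i - 1] + (seq[j - 1],))
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + col_score(cols[i - 1], "-"):
            out.append(cols[i - 1] + ("-",))
            i -= 1
        else:
            out.append(gap_col + (seq[j - 1],))   # new all-gap column
            j -= 1
    out.reverse()
    return ["".join(row) for row in zip(*out)]

result = add_sequence(["AGT", "A-T"], "GT")
```

In the example, the gap already present in the second row survives unchanged after the third sequence is added.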


Problem with Progressive algorithms

1. Dependence of the ultimate MSA on the initial pairwise sequence alignment with the highest score

2. Errors in initial alignments are propagated

3. Gaps can proliferate, if not careful

4. Gaps can be amino-acid specific, so that you penalize introduction of gaps into segments that are less likely to have gaps (e.g. hydrophobic core)


Multiple sequence alignment to profile HMMs

• Hidden Markov models (HMMs) consist of “states” that describe the probability of having a particular amino acid residue in each column of a multiple sequence alignment

• HMMs are probabilistic models

• Just as a hammer is more refined than a blast, an HMM gives more sensitive alignments than traditional techniques such as progressive alignment
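The per-column probabilities mentioned above can be estimated directly from an alignment. This sketch computes only match-state emission probabilities with a simple add-one pseudocount. A full profile HMM also has insert and delete states and transition probabilities, which are left out here. The function name, the pseudocount choice, and the reuse of the five short sequences from the weighting example are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def match_emissions(rows, pseudocount=1.0):
    """For each alignment column, estimate P(residue | column).
    Gaps are ignored; the pseudocount keeps unseen residues above zero."""
    profile = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total
                        for aa in AMINO_ACIDS})
    return profile

rows = ["CYEGNGHF", "CYEGNGDF", "CYHGNGDS", "CYHGNGQS", "CFNGNGHS"]
profile = match_emissions(rows)
```

In a fully conserved column such as the first one (all C), the probability mass concentrates on that residue, while variable columns spread it over several residues.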