global sequence alignment by dynamic programming

83
Global Sequence Alignment by Dynamic Programming

Upload: peyton

Post on 11-Jan-2016

34 views

Category:

Documents


0 download

DESCRIPTION

Global Sequence Alignment by Dynamic Programming. Needleman-Wunsch Algorithm. General Algorithm for sequence comparison Maximize a similarity score to give the maximum match - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Global Sequence Alignment by Dynamic Programming

Global Sequence Alignment by Dynamic

Programming

Page 2: Global Sequence Alignment by Dynamic Programming

Needleman-Wunsch Algorithm

• General Algorithm for sequence comparison

• Maximize a similarity score to give the maximum match

• Maximum match=largest number of amino acids or nucleotides of one sequence that can be matched with another, allowing for all possible deletions.

Page 3: Global Sequence Alignment by Dynamic Programming

Needleman-Wunsch Algorithm

• Finds the best GLOBAL alignment of any two sequences.

• N-M involves an iterative matrix method of calculation

• All possible pairs (nucleotides or amino acids) are represented in a two-dimensional array.

• All possible alignments are represented by pathways through this array.

Page 4: Global Sequence Alignment by Dynamic Programming

Needleman-Wunsch Algorithm

• Sequence alignment methods predate dot-matrix searches, and all of the alignment methods in use today are related to the original method of Needleman and Wunsch (1970).

• Needleman and Wunsch wanted to quantify the similarity between two sequences.

Page 5: Global Sequence Alignment by Dynamic Programming

Needleman-Wunsch Algorithm

• Over the course of evolution, some positions undergo base or amino acid substitutions, and bases or amino acids can be inserted or deleted.

• Any measurement of similarity must therefore be done with respect to the best possible alignment between two sequences.

• Because insertion/deletion events are rare compared to base substitutions, it makes sense to penalize gaps more heavily than mismatches when calculating a similarity score.

Page 6: Global Sequence Alignment by Dynamic Programming

Dynamic Programming

• Finding the best alignment of 2 sequences is a hard problem solved by a computational method called dynamic programming

• Multiple sequence alignment of 3 or more sequences can be solved by dynamic programming and statistical methods

Page 7: Global Sequence Alignment by Dynamic Programming

How Do We Generate the Correct Alignment ?

• We can't. • We can never guarantee that a

particular alignment is correct except for the simplest, unambiguous alignments!

• Such an alignment would require aligning each sequence in turn to the ancestral sequence first.

• Since there is no possibility to know the ancestral sequence and the evolutionary steps, the evolutionary correctness of any alignment cannot be determined.

Page 8: Global Sequence Alignment by Dynamic Programming

The Optimal Alignment.

• If the optimal alignment does not support homology, then the correct alignment (which has a smaller or equal score) will not support homology either.

• But again: there is no guarantee that the optimal alignment is the correct alignment, even though it may be the best guess.

Page 9: Global Sequence Alignment by Dynamic Programming

Dynamic Programming

• Dynamic programming is a term from operations research, where it was first used to describe a class of algorithms for the optimization of dynamic systems.

Page 10: Global Sequence Alignment by Dynamic Programming

Dynamic Programming

• In dynamic programming the principle of divide-and-conquer is used extensively: subdivide a problem that is to large to be computed, into smaller problems that may be efficiently computed.

• Then assemble the answers to give a solution for the large problem.

• When you do not know which smaller problem to solve, simply solve all smaller problems, store the answers and assembled them later to a solution for the large problem.

Page 11: Global Sequence Alignment by Dynamic Programming

Dynamic Programming

• Global optimal alignment is a difficult problem.

• The major difficulty comes from the fact, that one cannot simply slide one sequence along another and sum over the similarity scores looked up in the appropriate mutation data matrix.

• This will not work, because biological sequences may have gaps or insertions of sequences relative to each other.

Page 12: Global Sequence Alignment by Dynamic Programming

Three steps in Dynamic Programming

1. Initialization

2 Matrix fill or scoring

3. Traceback and alignment

Page 13: Global Sequence Alignment by Dynamic Programming

Sample Matrix

• To align with a cell in the diagonal means an alignment in the next position.

• An increasing diagonal line means a stretch of sequence identity.

• To align with an off-diagonal cell requires the insertion of a corresponding number of gaps.

Page 14: Global Sequence Alignment by Dynamic Programming

Needleman-Wunsch Algorithm

• We can compute for every cell the highest possible score that can be obtained for a path originating from that cell.

• That is done by looking in all the elements permissible for extending the path, and adding the highest value found to the contents of the cell.

Page 15: Global Sequence Alignment by Dynamic Programming

Needleman-Wunsch Algorithm

• If this is done in an orderly way, the highest score found is the global maximum alignment score.

• Then the optimal path consists of all those cells that contributed to the global maximum alignment score.

• There can be several equivalent optimal paths.

Page 16: Global Sequence Alignment by Dynamic Programming

Sample Matrix

• Create a matrix with M + 1 columns and N + 1 rows where M and N correspond to the size of the sequences to be aligned.

• Since this example assumes there is no gap opening or gap extension penalty, the first row and first column of the matrix can be initially filled with 0.

Page 17: Global Sequence Alignment by Dynamic Programming
Page 18: Global Sequence Alignment by Dynamic Programming
Page 19: Global Sequence Alignment by Dynamic Programming
Page 20: Global Sequence Alignment by Dynamic Programming

Matrix Fill Step

• One possible solution of the matrix fill step finds the maximum global alignment score by starting in the upper left hand corner in the matrix and finding the maximal score Mi,j for each position in the matrix.

• In order to find Mi,j for any i,j it is minimal to know the score for the matrix positions to the left, above and diagonal to i, j.

• In terms of matrix positions, it is necessary to know Mi-1,j, Mi,j-1 and Mi-1, j-1.

Page 21: Global Sequence Alignment by Dynamic Programming

Matrix Fill Step

• In the example, Mi-1,j-1 will be red, Mi,j-1 will be green and Mi-1,j will be blue.

• Since the gap penalty (w) is 0, the rest of row 1 and column 1 can be filled in with the value 1.

Page 22: Global Sequence Alignment by Dynamic Programming
Page 23: Global Sequence Alignment by Dynamic Programming

Matrix Fill Step

• Now let's look at column 2. The location at row 2 will be assigned the value of the maximum of 1(mismatch), 1(horizontal gap) or 1 (vertical gap). So its value is 1.

• At the position column 2 row 3, there is an A in both sequences. Thus, its value will be the maximum of 2(match), 1 (horizontal gap), 1 (vertical gap) so its value is 2.

Page 24: Global Sequence Alignment by Dynamic Programming
Page 25: Global Sequence Alignment by Dynamic Programming
Page 26: Global Sequence Alignment by Dynamic Programming
Page 27: Global Sequence Alignment by Dynamic Programming

Traceback Step• After the matrix fill step, the maximum

alignment score for the two test sequences is 6. • The traceback step determines the actual

alignment(s) that result in the maximum score. • With a simple scoring algorithm like this one,

there are likely to be multiple maximal alignments.

• The traceback step begins in the M,J position in the matrix- the position that leads to the maximal score.

• In this case, there is a 6 in that location.

Page 28: Global Sequence Alignment by Dynamic Programming

Traceback Step• Traceback takes the current cell and looks

to the neighbor cells that could be direct predecessors.

• This means it looks to the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1).

• The algorithm for traceback chooses as the next cell in the sequence one of the possible predecessors. The neighbors are marked in red and are also equal to 5.

Page 29: Global Sequence Alignment by Dynamic Programming
Page 30: Global Sequence Alignment by Dynamic Programming
Page 31: Global Sequence Alignment by Dynamic Programming
Page 32: Global Sequence Alignment by Dynamic Programming

Traceback Step

• Since the current cell has a value of 6 and the scores are 1 for a match and 0 for anything else, the only possible predecessor is the diagonal match/mismatch neighbor.

• If more than one possible predecessor exists, any can be chosen.

Page 33: Global Sequence Alignment by Dynamic Programming

Traceback Step

• This gives us a current alignment of (Seq #1) A | (Seq #2) A• So now we look at the current cell

and determine which cell is its direct predecessor.

• In this case, it is the cell with the red 5.

Page 34: Global Sequence Alignment by Dynamic Programming
Page 35: Global Sequence Alignment by Dynamic Programming

• This is One Traceback Giving an alignment of :

G A A T T C A G T T A | | | | | | G G A _ T C _ G _ _ A

Page 36: Global Sequence Alignment by Dynamic Programming
Page 37: Global Sequence Alignment by Dynamic Programming

• This is an alternative Traceback giving an alignment of :

G _ A A T T C A G T T A | | | | | | G G _ A _ T C _ G _ _ A

Page 38: Global Sequence Alignment by Dynamic Programming

What Is the Best Alignment Between These Two Amino

Acid Sequences?

• A zinc-finger core sequence:CKHVFCRVCI

• A sequence fragment from a viral protein:

CKKCFCKCV

Page 39: Global Sequence Alignment by Dynamic Programming

Practice Matrix

C K H V F C R V C I +--------------------C | K | K | C | F | C | K | C | V |

Page 40: Global Sequence Alignment by Dynamic Programming

Place a 1 Where There Is a Match in the Matrix

C K H V F C R V C I +--------------------C | 1 1 1 K | 1 K | 1 C | 1 1 1 F | 1 C | 1 1 1 K | 1 C | 1 1 1 V | 1 1

Page 41: Global Sequence Alignment by Dynamic Programming

Place Zeroes on the Ends Since They Can’t Align

C K H V F C R V C I +--------------------C | 1 1 1 0K | 1 0K | 1 0C | 1 1 1 0F | 1 0C | 1 1 1 0K | 1 0C | 1 1 1 0V | 0 0 0 1 0 0 0 1 0 0

Page 42: Global Sequence Alignment by Dynamic Programming

Loading the Matrix

• Then proceed to the next row and column, adding to each matrix cell the maximal value of any other cell that could be the next step on a path to the matrix.

• For instance the value 1 at (C6,C8) now becomes a 2, since it could be extended from the 1 at (V8,V9) through a 1.

Page 43: Global Sequence Alignment by Dynamic Programming
Page 44: Global Sequence Alignment by Dynamic Programming

C K H V F C R V C I +--------------------C | 1 1 1 0K | 1 0 0K | 1 0 0C | 1 1 1 0F | 1 0 0C | 1 1 1 0K | 1 0 0C | 2 1 1 1 1 2 1 0 1 0V | 0 0 0 1 0 0 0 1 0 0

Filling the Matrix

Page 45: Global Sequence Alignment by Dynamic Programming

C K H V F C R V C I +--------------------C | 1 1 1 1 0K | 1 1 0 0K | 1 1 0 0C | 1 1 1 1 0F | 1 1 0 0C | 1 1 1 1 0K | 2 3 2 2 2 1 1 1 0 0C | 2 1 1 1 1 2 1 0 1 0V | 0 0 0 1 0 0 0 1 0 0

Repeat This Procedure for the Next Row and Column:

Page 46: Global Sequence Alignment by Dynamic Programming

And the Next …

C K H V F C R V C I +--------------------C | 1 1 1 1 1 0K | 1 1 1 0 0K | 1 1 1 0 0C | 1 1 1 1 1 0F | 1 1 1 0 0C | 4 2 2 2 2 2 1 1 1 0K | 2 3 2 2 2 1 1 1 0 0C | 2 1 1 1 1 2 1 0 1 0V | 0 0 0 1 0 0 0 1 0 0

Page 47: Global Sequence Alignment by Dynamic Programming

And the Next Row…

C K H V F C R V C I +--------------------C | 1 2 1 1 1 0K | 1 1 1 1 0 0K | 1 1 1 1 0 0C | 1 2 1 1 1 0F | 3 2 2 2 3 1 1 1 0 0C | 4 2 2 2 2 2 1 1 1 0K | 2 3 2 2 2 1 1 1 0 0C | 2 1 1 1 1 2 1 0 1 0V | 0 0 0 1 0 0 0 1 0 0

Page 48: Global Sequence Alignment by Dynamic Programming

And the Next …

C K H V F C R V C I +--------------------C | 1 2 2 1 1 1 0K | 1 2 1 1 1 0 0K | 1 2 1 1 1 0 0C | 4 3 3 3 2 2 1 1 1 0F | 3 2 2 2 3 1 1 1 0 0C | 4 2 2 2 2 2 1 1 1 0K | 2 3 2 2 2 1 1 1 0 0C | 2 1 1 1 1 2 1 0 1 0V | 0 0 0 1 0 0 0 1 0 0

Page 49: Global Sequence Alignment by Dynamic Programming

And the Next …

C K H V F C R V C I +--------------------C | 1 3 2 2 1 1 1 0K | 1 3 2 1 1 1 0 0K | 3 4 3 3 2 1 1 1 0 0C | 4 3 3 3 2 2 1 1 1 0F | 3 2 2 2 3 1 1 1 0 0C | 4 2 2 2 2 2 1 1 1 0K | 2 3 2 2 2 1 1 1 0 0C | 2 1 1 1 1 2 1 0 1 0V | 0 0 0 1 0 0 0 1 0 0

Page 50: Global Sequence Alignment by Dynamic Programming

And the Next …

C K H V F C R V C I +--------------------C | 1 3 3 2 2 1 1 1 0K | 4 4 3 3 2 1 1 1 0 0K | 3 4 3 3 2 1 1 1 0 0C | 4 3 3 3 2 2 1 1 1 0F | 3 2 2 2 3 1 1 1 0 0C | 4 2 2 2 2 2 1 1 1 0K | 2 3 2 2 2 1 1 1 0 0C | 2 1 1 1 1 2 1 0 1 0V | 0 0 0 1 0 0 0 1 0 0

Page 51: Global Sequence Alignment by Dynamic Programming

Another Path Marked

C K H V F C R V C I +--------------------C | 5 3 3 3 2 2 1 1 1 0K | 4 4 3 3 2 1 1 1 0 0K | 3 4 3 3 2 1 1 1 0 0C | 4 3 3 3 2 2 1 1 1 0F | 3 2 2 2 3 1 1 1 0 0C | 4 2 2 2 2 2 1 1 1 0K | 2 3 2 2 2 1 1 1 0 0C | 2 1 1 1 1 2 1 0 1 0V | 0 0 0 1 0 0 0 1 0 0

Page 52: Global Sequence Alignment by Dynamic Programming

Filling the Matrix

• The globally optimal score is the highest score in the first row or column - for the N-terminus of either protein sequence.

• In our example it is 5. Therefore, the globally optimal alignment will have 5 matches. Let us now back trace our path ending at the cell with value 5 to see how we arrived at the value.

• Strip away all the cells that could not have contributed to the sum.

Page 53: Global Sequence Alignment by Dynamic Programming

C K H V F C R V C I +--------------------C | 5 K | 4 K | 4 3 3 C | 3 3 F | 3 C | 2 K | 2 1 1 C | 2 1 1 V | 1

There are several equally good paths to a score of 5

Page 54: Global Sequence Alignment by Dynamic Programming

A Number of Different Alignments Are Possible:

• C K H V F C R V C IC K K C F C - K C V

• C K H V F C R V C IC K K C F C K - C V

• C - K H V F C R V C IC K K C - F C - K C V

• C K H - V F C R V C IC K K C - F C - K C V

Page 55: Global Sequence Alignment by Dynamic Programming

Advanced Dynamic Programming

• An advanced scoring scheme is assumed where – Si,j = 2 if the residue at position i of

sequence #1 is the same as the residue at position j of sequence #2 (match score); otherwise

– Si,j = -1 (mismatch score) – w = -2 (gap penalty)

Page 56: Global Sequence Alignment by Dynamic Programming

Adding Gaps and Mismatches

• The second A-G pair is a mismatch. G A A T T C A G T T A | | | | | | G G A T _ C _ G _ _ A

• The fifth T has no pair so this is a gap.

• Or it could be an insertion or gain of sequence.

Page 57: Global Sequence Alignment by Dynamic Programming

Creating the Matrix

• The first step in the global alignment dynamic programming approach is to create a matrix with M + 1 columns and N + 1 rows where M and N correspond to the size of the sequences to be aligned.

• The first row and first column of the matrix can be initially filled with 0.

Page 58: Global Sequence Alignment by Dynamic Programming

Creating the Matrix

• For each position, Mi,j is defined to be the maximum score at position i,j; i.e.

• Mi,j = MAXIMUM[• Mi-1, j-1 + Si,j (match/mismatch in the

diagonal),• Mi,j-1 + w (gap in sequence #1),• Mi-1,j + w (gap in sequence #2)]• Note that in the example, Mi-1,j-1 will be

red, Mi,j-1 will be green and Mi-1,j will be blue

Page 59: Global Sequence Alignment by Dynamic Programming
Page 60: Global Sequence Alignment by Dynamic Programming

• Moving down the first column to row 2, we can see that there is once again a match in both sequences. Thus, S1,2 = 2. So M1,2 = MAX[M0,1 + 2, M1,1 - 2, M0,2 -2] = MAX[0 + 2, 2 - 2, 0 - 2] = MAX[2, 0, -2].

• A value of 2 is then placed in position 1,2 of the scoring matrix and an arrow is placed to point back to M[0,1] which led to the maximum score.

Page 61: Global Sequence Alignment by Dynamic Programming
Page 62: Global Sequence Alignment by Dynamic Programming

Filling the Matrix

• Looking at column 1 row 3, there is not a match in the sequences, so S 1,3 = -1. M1,3 = MAX[M0,2 - 1, M1,2 - 2, M0,3 - 2] = MAX[0 - 1, 2 - 2, 0 - 2] = MAX[-1, 0, -2].

• A value of 0 is then placed in position 1,3 of the scoring matrix and an arrow is placed to point back to M[1,2] which led to the maximum score.

Page 63: Global Sequence Alignment by Dynamic Programming
Page 64: Global Sequence Alignment by Dynamic Programming

• Eventually, we get to column 3 row 2. Since there is not a match in the sequences at this position, S3,2 = -1. M3,2 = MAX[ M2,1 - 1, M3,1 - 2, M2,2 - 2] = MAX[0 - 1, -1 - 2, 1 -2] = MAX[-1, -3, -1].

• So you begin to get loss of scores.

Page 65: Global Sequence Alignment by Dynamic Programming
Page 66: Global Sequence Alignment by Dynamic Programming

Filling the Matrix and Using Pointers

• Note that in the above case, there are two different ways to get the maximum score. In such a case, pointers are placed back to all of the cells that can produce the maximum score.

Page 67: Global Sequence Alignment by Dynamic Programming
Page 68: Global Sequence Alignment by Dynamic Programming
Page 69: Global Sequence Alignment by Dynamic Programming

Traceback

• This gives an alignment of G A A T T C A G T T A | | | | | | G G A _ T C _ G _ _ A

Page 70: Global Sequence Alignment by Dynamic Programming
Page 71: Global Sequence Alignment by Dynamic Programming

Traceback

• This gives another alignment of G A A T T C A G T T A | | | | | | G G A T _ C _ G _ _ A

Page 72: Global Sequence Alignment by Dynamic Programming

Gap Penalty

• The gap penalty is used to help decide whether on not to accept a gap or insertion in an alignment when it is possible to achieve a good alignment residue-to-residue at some other neighboring point in the sequence.

• You need a penalty, because an unreasonable 'gappy' alignment would result.

• Biologically, it should in general be easier for a protein to accept a different residue in a position, rather than having parts of the sequence chopped away or inserted.

Page 73: Global Sequence Alignment by Dynamic Programming

Gap Penalty

• Gaps/insertions should therefore be more rare than point mutations (substitutions).

• Some different possibilities: – A single gap-open penalty. This will tend to

stop gaps from occurring, but once they have been introduced, they can grow unhindered.

– A gap penalty proportional to the gap length. This will work against larger gaps.

– A gap penalty that combines a gap-open value with a gap-length value.

Page 74: Global Sequence Alignment by Dynamic Programming

Substitution Matrix

• A substitution matrix describes the likelihood that two residue types would mutate to each other in evolutionary time.

• This is used to estimate how well two residues of given types would match if they were aligned in a sequence alignment.

Page 75: Global Sequence Alignment by Dynamic Programming

Substitution Matrix

• An amino acid substitution matrix is a symmetrical 20*20 matrix, where each element contains the score for substituting a residue of type i with a residue of type j in a protein, where i and j are one of the 20 amino-acid residue types.

• Same residues should obviously have high scores, but if we have different residues in a position, how should that be scored?

Page 76: Global Sequence Alignment by Dynamic Programming

Substitution Matrix Scoring

• The same residues in a position give the score value 1, and different residues give 0.

• The same residues give a score 1, similar residues (for example: Tyr/Phe, or Ile/Leu) give 0.5, and all others 0.

• One may calculate, using well established sequence alignments, the frequencies (probabilities) that a particular residue in a position is exchanged for another.

Page 77: Global Sequence Alignment by Dynamic Programming

Substitution Matrix Scoring

• This was done originally be Margaret Dayhoff, and her matrices are called the PAM (Point Accepted Mutation) matrices, which describe the exchange frequencies after having accepted a given number of point mutations over the sequence.

• Typical values are PAM 120 (120 mutations per 100 residues in a protein) and PAM 250.

• There are many other substitution matrices: BLOSUM, Gonnet, etc.

Page 78: Global Sequence Alignment by Dynamic Programming

Scoring Similarity1) Can only score aligned sequences2) DNA is usually scored as identical or not3) Modified scoring for gaps - single vs.

multiple base gaps (gap extension)4) AAs have varying degrees of similarity

– a. # of mutations to convert one to another– b. chemical similarity– c. observed mutation frequencies

5) PAM matrix calculated from observed mutations in protein families

Page 79: Global Sequence Alignment by Dynamic Programming

The PAM 250 Scoring Matrix

Page 80: Global Sequence Alignment by Dynamic Programming

GCG Wisconsin Package GAP

• GAP is the implementation of the Needleman-Wunsch algorithm in the GCG program package.

• The NW algorithm will present you with a single globally optimal alignment, not all possible optimal alignments - different alignments may exist that give the same score.

• GAP presents you with one member of the family of best alignments that align the full length of one sequence to the full length of a second sequence.

• There may be many members of this family, but no other member has a higher score.

Page 81: Global Sequence Alignment by Dynamic Programming

GCG Wisconsin Package GAP

• The primary use of a global alignment algorithm is when you really want the whole of two sequences to be aligned, without truncation.

• GAP could completely bypass a region of high local homology, if a better (or even just as good) path can be found in a different way.

• This is problematic if one short sequence is aligned against a longer one with internal repeats.

• If there is weak or unknown similarity between two sequences, a local alignment algorithm (BESTFIT) is the better choice.

• Use GAP only when you believe the similarity is over the whole length.

Page 82: Global Sequence Alignment by Dynamic Programming

Limitations to Needleman-Wunsch

• The problem with Needleman-Wunsch is the amount of processor memory resources it requires.

• Because of this, it is not favored for practical use, despite the guarantee of an optimal alignment.

• The other difficulty is that the concept of global alignment is not used in pairwise sequence comparison searches.

Page 83: Global Sequence Alignment by Dynamic Programming