Bioinformatics I, Fall 2002

Dynamic programming algorithm: pairwise comparisons


• Need a method that is both reliable and efficient to compare two sequences

• Exhaustive comparison of every possible alignment will give good answers but takes too much time



Dynamic programming: strategy

• Break alignment problem into small pieces

• Optimize first piece

• Then extend into second piece; since first piece is optimized already, program only needs to optimize extension

• Continue until end of comparison


Gaps

• Remember we said we need to penalize gaps (mimicking evolution)

• Simplest gap scoring: assign the same penalty (d) to every gap space (linear scoring); this is not very realistic

• More advanced gap scoring: assign a larger penalty (d) to the first space of a gap, a smaller penalty (e) to the following spaces of the same gap: affine scoring
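The two gap-scoring schemes can be compared with a minimal sketch. The function names and the affine values d = 12 and e = 2 are hypothetical choices for illustration; only the linear penalty of 8 is used later in these slides.

```python
def linear_gap_cost(length, d=8):
    """Score of a gap when every space costs the same penalty d."""
    return -d * length


def affine_gap_cost(length, d=12, e=2):
    """Score of a gap: d for the first space, e for each following space."""
    if length == 0:
        return 0
    return -d - (length - 1) * e


# Compare the two schemes for gaps of increasing length.
for g in (1, 2, 5):
    print(g, linear_gap_cost(g), affine_gap_cost(g))
```

With these assumed values a long gap costs much less under affine scoring than under linear scoring, which is the point of separating initiation from extension.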


Global alignment: Needleman-Wunsch

• What you need to start:

• Matrix of sequences to be aligned; example: sequence example from text

• Substitution matrix (choose one that makes sense); example: BLOSUM50

• Gap penalty; example: -8

• Start at 0 (top left); this allows a “gap” in the beginning of the alignment


Dynamic programming process

• Fill in the matrix starting from the top left; each time you move away from a diagonal you add a gap penalty to the score in the position you started in; each time you move on a diagonal you add the score from the substitution matrix


Fill in the values for “gaps” at the beginning (start with 0)

           H     E     A     G
     0    -8   -16   -24   -32
P
A
W
H


• For example, if you aligned the H with an empty space, you would get a score of -8 for that space in this example

HEAG
-PAWH

• Arrow indicates adding score from 0


• If you aligned both H and E with an empty space, you would get a score of -16 in the E space, because you add the gap penalty onto the score in the preceding space (didn’t move diagonally); arrow from -8.

HEAG
--PAWH


• Similar reasoning allows you to fill in the first column

           H     E     A     G
     0    -8   -16   -24   -32
P   -8
A  -16
W  -24
H  -32
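Filling the first row and first column can be sketched as below. This is an illustration, not the course's code: the function name `init_matrix` and the list-of-lists layout are assumptions, with F[j][i] holding the slide's F(i,j) (column i follows the top sequence, row j the side sequence).

```python
def init_matrix(x, y, d=8):
    """Return a (len(y)+1) x (len(x)+1) grid with only the borders filled."""
    F = [[None] * (len(x) + 1) for _ in range(len(y) + 1)]
    F[0][0] = 0
    for i in range(1, len(x) + 1):   # first row: x aligned with leading gaps
        F[0][i] = F[0][i - 1] - d
    for j in range(1, len(y) + 1):   # first column: y aligned with leading gaps
        F[j][0] = F[j - 1][0] - d
    return F


F = init_matrix("HEAG", "PAWH")
print(F[0])                    # first row
print([row[0] for row in F])   # first column
```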


Now, there are 3 possibilities to fill each remaining matrix element. So, if you aligned P with H, you move from 0 along the diagonal, so you add the substitution matrix value of -2.

           H     E     A     G
     0    -8   -16   -24   -32
P   -8    -2
A  -16
W  -24
H  -32


• Or, you could start with H aligned with a gap, and then align P with a gap; this moves down from the -8 above, so the score is -8 + -8 = -16

H-
-P

           H     E     A     G
     0    -8   -16   -24   -32
P   -8   -16
A  -16
W  -24
H  -32


• Or, you could start with P aligned with a gap, and then align H with a gap; this moves across from the -8 on the left, so the score is -8 + -8 = -16

-H
P-

           H     E     A     G
     0    -8   -16   -24   -32
P   -8   -16
A  -16
W  -24
H  -32


We choose the highest value, and preserve it and the information about where we started to get there (arrow). Here the three candidates are -2 (diagonal), -16 and -16, so we keep -2 with an arrow back to the 0.

           H     E     A     G
     0    -8   -16   -24   -32
P   -8    -2
A  -16
W  -24
H  -32
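Choosing among the three parent cells can be written as a small helper. This is a sketch only; the name `best_move` and the arrow labels are hypothetical, and ties are broken in favor of the first candidate listed.

```python
def best_move(diag, left, up, s, d=8):
    """Score one cell from its three parents; return (score, arrow)."""
    candidates = [
        (diag + s, "diagonal"),  # align the two residues
        (left - d, "left"),      # residue of the top sequence against a gap
        (up - d, "up"),          # residue of the side sequence against a gap
    ]
    # max() keeps the first candidate when scores tie.
    return max(candidates, key=lambda c: c[0])


# P/H element: parents are 0 (diagonal), -8 (left), -8 (up); s(P,H) = -2.
print(best_move(diag=0, left=-8, up=-8, s=-2))
```

The returned arrow is what the traceback later follows.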


• Now we get to the P/E matrix element. There are 3 ways we could get to this position:

HE..
-P..

HE..
P-..

HE-..
--P..


• Note that only one of these possibilities actually aligns P with E; that is the one that moves diagonally

• These possibilities have different scores; we enter the highest score, and draw an arrow to the matrix element from which we moved to get this score



HE..
-P..
Score = -8 + -1 = -9

HE..
P-..
Score = -2 + -8 = -10

HE-..
--P..
Score = -16 + -8 = -24


           H     E     A     G
     0    -8   -16   -24   -32
P   -8    -2    -9
A  -16
W  -24
H  -32

In this case, the highest score from the three parent matrix elements was along the diagonal


• Using the same logic, you can fill in all the other cells in the matrix

• We can also express this process using matrix notation

• X and Y are sequences; x1…i is the first i characters of X and y1…j is the first j characters of Y

• Matrix F: F(i,j) is the score of the best alignment between the initial part of x (to xi) and the initial part of y (to yj)


• Remember, the strategy is to optimize the first bits and then extend; so we are looking for the best score for F(i,j), which can come from extending from F(i-1,j-1) (diagonal), F(i-1,j) (across), or F(i,j-1) (down)

• Since we started at the beginning of the sequences, this process takes account of all possible alignments, giving us the best one


• We can express this by:

$$
F(i,j) = \max \begin{cases}
F(i-1,j-1) + s(x_i, y_j) \\
F(i-1,j) - d \\
F(i,j-1) - d
\end{cases}
$$

where s = score from substitution matrix and d = linear gap penalty
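The recurrence can be applied to the whole HEAG / PAWH example with a short sketch. The function name and data layout are assumptions. Only s(P,H) = -2 and s(P,E) = -1 are given in these slides; the remaining substitution values are recalled from BLOSUM50 and should be checked against the published matrix.

```python
# Subset of BLOSUM50 needed for HEAG vs PAWH, indexed as score[y_char][x_char].
# Assumed values except s(P,H) and s(P,E), which appear in the slides.
BLOSUM50_SUBSET = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2},
}


def needleman_wunsch_fill(x, y, score, d=8):
    """Fill F, where F[j][i] is the slide's F(i,j): best score of x[:i] vs y[:j]."""
    F = [[0] * (len(x) + 1) for _ in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        F[0][i] = -d * i
    for j in range(1, len(y) + 1):
        F[j][0] = -d * j
    for j in range(1, len(y) + 1):
        for i in range(1, len(x) + 1):
            F[j][i] = max(
                F[j - 1][i - 1] + score[y[j - 1]][x[i - 1]],  # F(i-1,j-1) + s
                F[j][i - 1] - d,                              # F(i-1,j) - d
                F[j - 1][i] - d,                              # F(i,j-1) - d
            )
    return F


F = needleman_wunsch_fill("HEAG", "PAWH", BLOSUM50_SUBSET)
for row in F:
    print(row)
```

The P row of the printed matrix starts -8, -2, -9, matching the values worked out by hand above.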


So now what?

• So now, we look for the path through the matrix that gives the final score. In this kind of global alignment, the last cell of the matrix is by definition the best score for the alignment. Looking for the path is called traceback: you follow the pointers that got you to the end (like Hansel and Gretel …)


• By following the arrows, you can arrive at the alignment

• Only one alignment is found in this treatment, but the algorithm can be modified to recover more than one

• see example from the text
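Storing an arrow in each cell and following the arrows back from the last cell can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, and a simple match/mismatch score (+2 / -1) with d = 2 stands in for a real substitution matrix. Ties are resolved in a fixed order, so only one of possibly several optimal alignments is returned.

```python
def global_align(x, y, match=2, mismatch=-1, d=2):
    """Needleman-Wunsch with pointers; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, n + 1):
        F[0][i], ptr[0][i] = -d * i, "left"
    for j in range(1, m + 1):
        F[j][0], ptr[j][0] = -d * j, "up"
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[j][i], ptr[j][i] = max(
                (F[j - 1][i - 1] + s, "diag"),
                (F[j][i - 1] - d, "left"),
                (F[j - 1][i] - d, "up"),
                key=lambda c: c[0],
            )
    # Traceback: start in the last cell and follow the arrows to (0, 0).
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        move = ptr[j][i]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "left":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = global_align("GAATTC", "GATTA")
print(score)
print(top)
print(bottom)
```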


In-class exercise II

• Using identity scoring and a gap penalty d = 1 (consider spaces before and after ends of sequences to be gaps), complete the matrix on the following slide

• Do a traceback to find the optimal alignment


In-class exercise II: complete the matrix

           G     A     A     C     T     T     A
     0
A
C
C
T
T
T


In-class exercise III

• Use Gap program to align sequences in nosalign file

• Vary the gap initiation penalty and the gap extension penalty; compare alignments

• Change the substitution matrix keeping all other variables same; compare alignments

• Use Gap program to align two unrelated sequences


Instructions for Gap exercise

• In seqlab, bioinfI.list, select nosalign; get into Editor

• Select 2 sequences; use info button if necessary to find out what these sequences are; select Edit > Remove gaps > All gaps

• Select Functions > Pairwise Comparison > Gap

• Select Options; select penalize end gaps like other gaps, then Close, then Run


• Note the quality score of this alignment

• Now systematically vary the gap penalties and the substitution matrices and run the program (always penalizing end gaps) on the same pair of sequences; note the quality scores for each variation

• See what happens if you don’t penalize end gaps

• Don’t save this as anything, just go to main list when you are done


• Go back to main list; select unrelated; use info button to find out what these sequences are

• Run the Gap program (penalizing end gaps)

• Is this alignment meaningful? Check by using the Generate statistics from randomized alignments feature in options; choose preserving nucleotide or amino acid composition and take other defaults; note the average score and standard deviation from randomizations and compare to the score of the alignment

• See what happens when you don’t penalize end gaps
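The idea behind the randomization check can be sketched in a few lines: shuffle one sequence (which keeps its composition), re-align, and compare the real score with the mean and standard deviation of the shuffled scores. This is not how the Gap program is implemented. The function names, the match/mismatch scoring, the choice to shuffle the second sequence, and the example sequences are all assumptions for illustration.

```python
import random
import statistics


def global_score(x, y, match=1, mismatch=-1, d=2):
    """Needleman-Wunsch score only, keeping just the previous row."""
    prev = [-d * i for i in range(len(x) + 1)]
    for j in range(1, len(y) + 1):
        cur = [-d * j]
        for i in range(1, len(x) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[i - 1] + s, cur[i - 1] - d, prev[i] - d))
        prev = cur
    return prev[-1]


def randomization_stats(x, y, n=100, seed=0):
    """Score x against n shuffles of y (same composition, random order)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        letters = list(y)
        rng.shuffle(letters)
        scores.append(global_score(x, "".join(letters)))
    return statistics.mean(scores), statistics.stdev(scores)


x, y = "GAATTCAGTTA", "GGATCGA"
observed = global_score(x, y)
mean, sd = randomization_stats(x, y)
print(observed, mean, sd)
```

An observed score that is not clearly above the randomized mean (relative to the standard deviation) suggests the alignment is no better than chance.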


Local alignment: Smith-Waterman

• This is very similar to Needleman-Wunsch, with two major differences:

• Must allow for starting a new alignment rather than extending one

• Must allow for alignment to end before the end of the sequences


• Allowing for starting a new alignment is done by allowing F(i,j) to take the value 0 if all other options are < 0

$$
F(i,j) = \max \begin{cases}
0 \\
F(i-1,j-1) + s(x_i, y_j) \\
F(i-1,j) - d \\
F(i,j-1) - d
\end{cases}
$$


• Allowing for the alignment to end before the end of the sequence is taken care of by looking for the highest score in the matrix, and starting the traceback from there until a 0 is reached.
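Both differences from Needleman-Wunsch show up in a minimal sketch: the extra 0 in the max, and a traceback that starts at the highest cell and stops at a 0. The function name and the match/mismatch scoring (+2 / -1, d = 2) are assumptions for illustration; a real run would use a substitution matrix.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Local alignment; returns (best score, aligned part of x, aligned part of y)."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # borders stay 0
    best, best_pos = 0, (0, 0)
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[j][i] = max(0, F[j - 1][i - 1] + s, F[j][i - 1] - d, F[j - 1][i] - d)
            if F[j][i] > best:
                best, best_pos = F[j][i], (j, i)
    # Traceback from the highest-scoring cell until a 0 is reached.
    ax, ay = [], []
    j, i = best_pos
    while F[j][i] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[j][i] == F[j - 1][i - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[j][i] == F[j][i - 1] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

Here the moves are re-derived from the scores during traceback instead of being stored as arrows; either approach works.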


In-class exercise IV

• Use Bestfit to find local alignments for the same sequences (from nosalign and from unrelated) you used in the previous exercises; note that you do not have an option about penalizing end gaps as you did in Gap

• Vary the same parameters you did before

• Use randomizations to evaluate alignments


Affine gap penalties

• To distinguish between a gap initiation (d) and a gap extension (e) penalty, we have to distinguish between a sequence element aligned with another sequence element, and one aligned to a gap

• There are two ways to be aligned to a gap: xi aligned to a gap in y, or yj aligned to a gap in x


• In building our F matrix, remember we want to maximize the score of the extension, so we have to take into account the three parent possibilities. We simply define 3 cases and how to extend them; so the algorithm extends according to which case it starts from.

• M(i,j) = best score given that xi is aligned with yj

• Ix(i,j) = best score given that xi is aligned with a gap in y

• Iy(i,j) = best score given that yj is aligned with a gap in x


$$
M(i,j) = \max \begin{cases}
M(i-1,j-1) + s(x_i, y_j) \\
I_x(i-1,j-1) + s(x_i, y_j) \\
I_y(i-1,j-1) + s(x_i, y_j)
\end{cases}
$$

$$
I_x(i,j) = \max \begin{cases}
M(i-1,j) - d \\
I_x(i-1,j) - e
\end{cases}
$$

$$
I_y(i,j) = \max \begin{cases}
M(i,j-1) - d \\
I_y(i,j-1) - e
\end{cases}
$$
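The three recurrences can be sketched as three matrices filled together. This is an illustration only: the function name, the match/mismatch scoring (+2 / -1), and the values d = 4 and e = 1 are hypothetical, and leading gaps are charged the same affine penalty as internal ones. Array index [j][i] corresponds to (i,j) in the formulas.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=2, mismatch=-1, d=4, e=1):
    """Best global alignment score with gap open penalty d and extension e."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]    # x_i aligned with y_j
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # x_i aligned with a gap
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # y_j aligned with a gap
    M[0][0] = 0
    for i in range(1, n + 1):          # leading gap in y
        Ix[0][i] = -d - (i - 1) * e
    for j in range(1, m + 1):          # leading gap in x
        Iy[j][0] = -d - (j - 1) * e
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[j][i] = s + max(M[j - 1][i - 1], Ix[j - 1][i - 1], Iy[j - 1][i - 1])
            Ix[j][i] = max(M[j][i - 1] - d, Ix[j][i - 1] - e)
            Iy[j][i] = max(M[j - 1][i] - d, Iy[j - 1][i] - e)
    return max(M[m][n], Ix[m][n], Iy[m][n])


print(affine_global_score("ACGT", "AT"))
```

For ACGT against AT the best alignment uses one gap of two spaces (cost 4 + 1) between two matches, which is cheaper than two separate gaps would be.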
