Bioinformatics I, Fall 2002. Dynamic programming algorithm: pairwise comparisons

Upload: blanche-melton

Post on 11-Jan-2016


• We need a method for comparing two sequences that is both reliable and efficient

• Exhaustively scoring every possible alignment would find the best one, but the number of alignments grows so quickly with sequence length that this takes too much time
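To get a feel for why exhaustive comparison is impractical, the sketch below counts the possible global alignments of two sequences. It assumes that every distinct path of diagonal, across, and down moves through the comparison matrix counts as a separate alignment (so H-/-P and -H/P- are counted separately, as in the worked example later in these slides). The function name `count_alignments` is an illustrative choice, not something from the course material.

```python
def count_alignments(n, m):
    """Count paths from the top-left to the bottom-right of an
    (n+1) x (m+1) matrix using diagonal, across, and down moves."""
    # With one sequence empty there is exactly one alignment (all gaps).
    D = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # A path can arrive from the left, from above, or diagonally.
            D[i][j] = D[i - 1][j] + D[i][j - 1] + D[i - 1][j - 1]
    return D[n][m]


if __name__ == "__main__":
    print(count_alignments(1, 1))  # 3
    print(count_alignments(4, 4))  # 321
    print(count_alignments(100, 100))  # far too many to score one by one
```

Even two four-letter sequences already have 321 such alignments, which is why the rest of the lecture builds the best alignment piece by piece instead of enumerating them.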

Dynamic programming: strategy

• Break the alignment problem into small pieces

• Optimize the first piece

• Then extend into the second piece; since the first piece is already optimized, the program only needs to optimize the extension

• Continue until the end of the comparison

Gaps

• Remember we said we need to penalize gaps (mimicking evolution)

• Simplest gap scoring: assign the same penalty (d) to every gap space; this is not very realistic

• More advanced gap scoring: assign a larger penalty (d) to the first space of a gap and a smaller penalty (e) to each following space of the same gap, so a gap of g spaces costs d + (g - 1)e; this is affine scoring
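The two gap-scoring schemes can be compared with a few lines of code. This is a minimal sketch: the function names are hypothetical, d = 8 is the linear penalty used in the worked example later in these slides, and the affine values d = 12, e = 2 are illustrative assumptions only.

```python
def linear_gap_penalty(g, d):
    """Penalty for a gap of g spaces when every space costs d."""
    return g * d


def affine_gap_penalty(g, d, e):
    """Penalty for a gap of g spaces: d for the first space,
    e for each following space of the same gap."""
    if g == 0:
        return 0
    return d + (g - 1) * e


if __name__ == "__main__":
    for g in (1, 2, 3, 10):
        print(g, linear_gap_penalty(g, 8), affine_gap_penalty(g, 12, 2))
```

With these assumed values a single-space gap is more expensive under affine scoring (12 versus 8), but one long gap quickly becomes cheaper than under linear scoring, which favors a few long gaps over many short ones.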

Global alignment: Needleman-Wunsch

• What you need to start:

  • Matrix of sequences to be aligned; example: sequence example from text

  • Substitution matrix (choose one that makes sense); example: BLOSUM50

  • Gap penalty; example: -8 for each gap space

  • Start at 0 (top left); this allows a “gap” at the beginning of the alignment

Dynamic programming process

• Fill in the matrix starting from the top left; each time you move away from a diagonal (across or down) you add a gap penalty to the score in the position you started from; each time you move along a diagonal you add the score from the substitution matrix

Fill in the values for “gaps” at the beginning (start with 0)

          H    E    A    G
     0   -8  -16  -24  -32
P
A
W
H

• For example, if you aligned the H with an empty space, you would get a score of -8 for that space in this example

HEAG
-PAWH

• Arrow indicates adding score from 0

• If you aligned both H and E with an empty space, you would get a score of -16 in the E space, because you add the gap penalty onto the score in the preceding space (you didn’t move diagonally); arrow from -8.

HEAG
--PAWH

• Similar reasoning allows you to fill in the first column

          H    E    A    G
     0   -8  -16  -24  -32
P   -8
A  -16
W  -24
H  -32
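The first row and first column can be filled in mechanically, since each cell is just the previous one minus the gap penalty. A minimal sketch, assuming the matrix is stored as `F[i][j]` with `i` running along the top sequence (HEAG) and `j` down the side sequence (PAWH); the function name `init_matrix` is hypothetical.

```python
def init_matrix(x, y, d):
    """Create the score matrix and fill in the leading-gap cells.

    F[i][j] uses i for positions in x (across the top) and j for
    positions in y (down the side); unfilled cells are None.
    """
    F = [[None] * (len(y) + 1) for _ in range(len(x) + 1)]
    F[0][0] = 0  # start at 0 in the top-left corner
    for i in range(1, len(x) + 1):
        F[i][0] = F[i - 1][0] - d  # x residues aligned with leading gaps
    for j in range(1, len(y) + 1):
        F[0][j] = F[0][j - 1] - d  # y residues aligned with leading gaps
    return F


if __name__ == "__main__":
    F = init_matrix("HEAG", "PAWH", 8)
    print([F[i][0] for i in range(5)])  # first row:    [0, -8, -16, -24, -32]
    print([F[0][j] for j in range(5)])  # first column: [0, -8, -16, -24, -32]
```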

Now there are 3 possibilities for filling each remaining matrix element. First, if you aligned P with H, you move from 0 along the diagonal, so you add the substitution matrix value of -2.

          H    E    A    G
     0   -8  -16  -24  -32
P   -8   -2
A  -16
W  -24
H  -32

• Or, you could start with H aligned with a gap, and then align P with a gap

H-
-P

          H    E    A    G
     0   -8  -16  -24  -32
P   -8  -16
A  -16
W  -24
H  -32

• Or, you could start with P aligned with a gap, and then align H with a gap

-H
P-

          H    E    A    G
     0   -8  -16  -24  -32
P   -8  -16
A  -16
W  -24
H  -32

We choose the highest value, and preserve it and the information about where we started to get there (arrow)

Candidates for the P/H element: -2 (diagonal), -16 and -16 (the two gap moves); the highest, -2, is entered

          H    E    A    G
     0   -8  -16  -24  -32
P   -8   -2
A  -16
W  -24
H  -32

• Now we get to the P/E matrix element. There are 3 ways we could get to this position:

HE..
-P..

HE..
P-..

HE-..
--P..

• Note that only one of these possibilities actually aligns P with E; that is the one that moves diagonally

• These possibilities have different scores; we enter the highest score, and draw an arrow to the matrix element from which we moved to get this score

HE..
-P..
Score = -8 + -1 = -9

HE..
P-..
Score = -2 + -8 = -10

HE-..
--P..
Score = -16 + -8 = -24

          H    E    A    G
     0   -8  -16  -24  -32
P   -8   -2   -9
A  -16
W  -24
H  -32

In this case, the highest score from the three parent matrix elements was along the diagonal
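The choice for the P/E element can be written out as a tiny calculation. The sketch below uses only numbers that appear in the slides (the three parent scores, s(P,E) = -1 and the gap penalty of 8); the variable names and move labels are illustrative.

```python
d = 8       # gap penalty per space
s_PE = -1   # substitution score for aligning P with E

# Scores already stored in the three parent elements of the P/E element.
parent_diagonal = -8   # row 0, column H
parent_left = -2       # row P, column H
parent_above = -16     # row 0, column E

candidates = {
    "diagonal": parent_diagonal + s_PE,  # P aligned with E
    "from left": parent_left - d,        # E aligned with a gap
    "from above": parent_above - d,      # P aligned with a gap
}

# Keep the highest score and remember which parent it came from (the arrow).
arrow = max(candidates, key=candidates.get)
score = candidates[arrow]

if __name__ == "__main__":
    print(candidates)    # {'diagonal': -9, 'from left': -10, 'from above': -24}
    print(arrow, score)  # diagonal -9
```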

• Using the same logic, you can fill in all the other cells in the matrix

• We can also express this process using matrix notation

• x and y are the sequences; x1…i and y1…j denote their first i and first j elements

• F is the matrix; F(i,j) is the score of the best alignment between the initial part of x (up to xi) and the initial part of y (up to yj)

• Remember, the strategy is to optimize the first bits and then extend; so we are looking for the best score for F(i,j), which can come from extending from F(i-1,j-1) (diagonal), from F(i-1,j) (across) or from F(i,j-1) (down)

• Since we started at the beginning of the sequences, this process takes account of all possible alignments, giving us the best one

• We can express this by:

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$

where s = score from substitution matrix and d = linear gap penalty
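The recurrence translates almost line for line into code. The sketch below fills the whole matrix for the HEAG/PAWH example with d = 8. Only s(P,H) = -2 and s(P,E) = -1 are given in the slides; the other substitution scores in `BLOSUM50_SUBSET` are values assumed here to be the corresponding BLOSUM50 entries and should be checked against a real copy of the matrix. The names `fill_matrix` and `s` are hypothetical.

```python
# Assumed BLOSUM50 entries, indexed as BLOSUM50_SUBSET[y residue][x residue].
BLOSUM50_SUBSET = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2},
}


def s(a, b):
    """Substitution score for x residue a aligned with y residue b."""
    return BLOSUM50_SUBSET[b][a]


def fill_matrix(x, y, score, d):
    """Fill F(i,j) for global alignment with a linear gap penalty d."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # diagonal
                F[i - 1][j] - d,                              # across
                F[i][j - 1] - d,                              # down
            )
    return F


if __name__ == "__main__":
    F = fill_matrix("HEAG", "PAWH", s, 8)
    # Print one row per residue of y, matching the layout on the slides.
    for j in range(5):
        print([F[i][j] for i in range(5)])
```

With this indexing `F[1][1]` is the P/H element (-2) and `F[2][1]` is the P/E element (-9), in agreement with the values worked out by hand above.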

So now what?

• So now, we look for the path through the matrix that gives the final score. In this kind of global alignment, the last (bottom right) cell of the matrix holds, by definition, the best score for the alignment. Looking for the path is called traceback: you follow the pointers that got you to the end (like Hansel and Gretel …)

• By following the arrows, you can arrive at the alignment

• Only one alignment is found in this treatment, but the algorithm can be modified to recover more than one

• See the example from the text
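Traceback can be sketched by storing, for every cell, which move produced its score, and then walking back from the last cell. This is one possible implementation, not the one used by any particular program; the function name, the move labels, and the +1/-1 match/mismatch scores in the usage example are assumptions made for illustration. On ties it keeps the diagonal move, so it recovers only one of the optimal alignments.

```python
def needleman_wunsch(x, y, score, d):
    """Global alignment with linear gap penalty d.

    Returns (best score, aligned x, aligned y).
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "across"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "down"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "across"),
                (F[i][j - 1] - d, "down"),
            ]
            # max() keeps the first of equal scores, so diagonal wins ties.
            F[i][j], ptr[i][j] = max(options, key=lambda o: o[0])

    # Traceback: follow the stored pointers from the last cell to the start.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":      # x_i aligned with y_j
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "across":  # x_i aligned with a gap in y
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:                   # y_j aligned with a gap in x
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    match_mismatch = lambda a, b: 1 if a == b else -1
    print(needleman_wunsch("ACGT", "AGT", match_mismatch, 1))
    # (2, 'ACGT', 'A-GT')
```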

In-class exercise II

• Using identity scoring and a gap penalty d = 1 (consider spaces before and after the ends of the sequences to be gaps), complete the matrix on the following slide

• Do a traceback to find the optimal alignment

In-class exercise II: complete the matrix

          G    A    A    C    T    T    A
     0
A
C
C
T
T
T

In-class exercise III

• Use the Gap program to align the sequences in the nosalign file

• Vary the gap initiation penalty and the gap extension penalty; compare alignments

• Change the substitution matrix, keeping all other variables the same; compare alignments

• Use the Gap program to align two unrelated sequences

Instructions for Gap exercise

• In SeqLab, in bioinfI.list, select nosalign; go into the Editor

• Select 2 sequences; use the info button if necessary to find out what these sequences are; select Edit > Remove gaps > All gaps

• Select Functions > Pairwise Comparison > Gap

• Select Options; select “penalize end gaps like other gaps”, then Close, then Run

• Note the quality score of this alignment

• Now systematically vary the gap penalties and the substitution matrices and run the program (always penalizing end gaps) on the same pair of sequences; note the quality scores for each variation

• See what happens if you don’t penalize end gaps

• Don’t save this as anything, just go to the main list when you are done

• Go back to the main list; select unrelated; use the info button to find out what these sequences are

• Run the Gap program (penalizing end gaps)

• Is this alignment meaningful? Check by using the “Generate statistics from randomized alignments” feature in Options; choose preserving nucleotide or amino acid composition and take the other defaults; note the average score and standard deviation from the randomizations and compare them to the score of the alignment

• See what happens when you don’t penalize end gaps
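The idea behind the randomization check can be sketched independently of the Gap program: shuffle one sequence (which preserves its composition), realign, and see how the real score compares with the shuffled scores. The code below is an illustration of that idea only, not a description of how Gap computes its statistics. The function names, the default of 100 shuffles, and the use of identity scoring with d = 1 are assumptions.

```python
import random
import statistics


def nw_score(x, y, score, d):
    """Global alignment score with linear gap penalty d (score only)."""
    prev = [-d * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-d * i]
        for j in range(1, len(y) + 1):
            cur.append(max(prev[j - 1] + score(x[i - 1], y[j - 1]),
                           prev[j] - d,
                           cur[j - 1] - d))
        prev = cur
    return prev[-1]


def randomization_stats(x, y, n_shuffles=100, d=1, seed=0):
    """Return (real score, mean shuffled score, standard deviation)."""
    identity = lambda a, b: 1 if a == b else 0
    rng = random.Random(seed)
    real = nw_score(x, y, identity, d)
    shuffled_scores = []
    for _ in range(n_shuffles):
        letters = list(y)
        rng.shuffle(letters)  # same composition, random order
        shuffled_scores.append(nw_score(x, "".join(letters), identity, d))
    return real, statistics.mean(shuffled_scores), statistics.stdev(shuffled_scores)


if __name__ == "__main__":
    real, mean, sd = randomization_stats("ACGTACGTAC", "ACGTACGTAC")
    print(real, mean, sd)
```

If the real score is not clearly higher than the average shuffled score, measured in standard deviations, the alignment is probably no better than what unrelated sequences of the same composition would give.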

Local alignment: Smith-Waterman

• This is very similar to Needleman-Wunsch, with two major differences:

• Must allow for starting a new alignment rather than extending one

• Must allow for the alignment to end before the end of the sequences

• Allowing for starting a new alignment is done by allowing F(i,j) to take the value 0 if all other options are < 0

$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$

• Allowing for the alignment to end before the end of the sequences is taken care of by looking for the highest score anywhere in the matrix, and running the traceback from that cell until a 0 is reached.
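Both differences from Needleman-Wunsch show up as small changes in code: a 0 option in the recurrence, and a traceback that starts at the highest-scoring cell and stops at the first 0. The sketch below is one way to write it; the function name, the move labels, and the scores in the usage example (+2 match, -1 mismatch, d = 2) are illustrative assumptions.

```python
def smith_waterman(x, y, score, d):
    """Local alignment with linear gap penalty d.

    Returns (best score, aligned part of x, aligned part of y).
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (0, None),  # start a new alignment here
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "across"),
                (F[i][j - 1] - d, "down"),
            ]
            F[i][j], ptr[i][j] = max(options, key=lambda o: o[0])
            if F[i][j] > best:  # remember the highest score in the matrix
                best, best_pos = F[i][j], (i, j)

    # Traceback from the highest-scoring cell until a 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while F[i][j] > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "across":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    simple = lambda a, b: 2 if a == b else -1
    print(smith_waterman("TTACGTAA", "GGACGTCC", simple, 2))
    # (8, 'ACGT', 'ACGT')
```

The unrelated flanking residues are simply left out of the result, which is the practical difference from the global alignment.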

In-class exercise IV

• Use Bestfit to find local alignments for the same sequences (from nosalign and from unrelated) you used in the previous exercises; note that you do not have an option about penalizing end gaps as you did in Gap

• Vary the same parameters you did before

• Use randomizations to evaluate alignments

Affine gap penalties

• To distinguish between a gap initiation (d) and a gap extension (e) penalty, we have to distinguish between a sequence element aligned with another sequence element, and one aligned to a gap

• There are two ways to be aligned to a gap: xi aligned to a gap in y, or yj aligned to a gap in x

• In building our F matrix, remember we want to maximize the score of the extension, so we have to take into account the three parent possibilities. We simply define 3 cases and how to extend them, so the algorithm extends according to which case it starts from.

• M(i,j) = best score up to (i,j) given that xi is aligned with yj

• Ix(i,j) = best score given that xi is aligned with a gap in y

• Iy(i,j) = best score given that yj is aligned with a gap in x

$$M(i,j) = \max \begin{cases} M(i-1,j-1) + s(x_i,y_j) \\ I_x(i-1,j-1) + s(x_i,y_j) \\ I_y(i-1,j-1) + s(x_i,y_j) \end{cases}$$

$$I_x(i,j) = \max \begin{cases} M(i-1,j) - d \\ I_x(i-1,j) - e \end{cases}$$

$$I_y(i,j) = \max \begin{cases} M(i,j-1) - d \\ I_y(i,j-1) - e \end{cases}$$
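The three recurrences can be sketched with three matrices filled side by side. The slides do not spell out the boundary conditions, so the ones used below are an assumption: M(0,0) = 0, a leading gap of length g costs d + (g - 1)e, and every other boundary cell is set to minus infinity so it can never be chosen. The function name and the scores in the usage example (+2 match, -1 mismatch, d = 3, e = 1) are likewise illustrative.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, score, d, e):
    """Best global alignment score with gap-open d and gap-extend e."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # x_i aligned with y_j
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x_i aligned with a gap in y
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y_j aligned with a gap in x
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e  # assumed boundary: leading gap in y
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e  # assumed boundary: leading gap in x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = score(x[i - 1], y[j - 1])
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + sub
            Ix[i][j] = max(M[i - 1][j] - d,    # open a new gap
                           Ix[i - 1][j] - e)   # extend an existing gap
            Iy[i][j] = max(M[i][j - 1] - d,
                           Iy[i][j - 1] - e)
    # The alignment may end in any of the three cases.
    return max(M[n][m], Ix[n][m], Iy[n][m])


if __name__ == "__main__":
    simple = lambda a, b: 2 if a == b else -1
    print(affine_global_score("AAAA", "AA", simple, 3, 1))  # 0
    print(affine_global_score("AAAA", "AA", simple, 3, 3))  # -2
```

In the usage example the two unmatched A's form a single gap: with e = 1 it costs 3 + 1 = 4, cancelling the two matches, while with e = d = 3 the scoring reduces to the linear case and the same gap costs 6.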