brandon andrews. longest common subsequences global sequence alignment scoring alignments local...

Dynamic Programming 6.5-6.9

Brandon Andrews

Topics

• Longest Common Subsequences
• Global Sequence Alignment
• Scoring Alignments
• Local Sequence Alignment
• Alignment with Gap Penalties
• Questions

Longest Common Subsequences (LCS)

Goal: find the similarity between two sequences

• The two sequences can differ in length
• The sequences are denoted v and w and are viewed as strings of characters, e.g. v = ATTGCTA

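Subsequences

A subsequence is an ordered sequence of characters taken from v or w. The characters keep their original order but do not have to be adjacent.

For example, if v = ATTGCTA, then AGCA and ATTA are subsequences (selected characters in upper case):
• AGCA: AttGCtA
• ATTA: ATTgctA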
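The definition above can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function name `is_subsequence` is chosen here for illustration only.

```python
def is_subsequence(candidate: str, sequence: str) -> bool:
    """Return True if `candidate` can be read off `sequence` from left to
    right, skipping characters but never reordering them."""
    remaining = iter(sequence)
    # `ch in remaining` advances the iterator until it finds ch, so each
    # match has to occur after the previous one.
    return all(ch in remaining for ch in candidate)


v = "ATTGCTA"
print(is_subsequence("AGCA", v))  # True
print(is_subsequence("ATTA", v))  # True
print(is_subsequence("GTT", v))   # False: only one T follows the G
```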

Operations

The only operations we can perform are insertion and deletion.
• Insertion: ATCTGAT -> A-TCTGAT (the hyphen marks a position where any character can be inserted)
• Deletion: an insertion into the other sequence

Hyphens offset the characters so that the longest common subsequence lines up:
• v = AT-C-TGAT
• w = -TGCAT-A-
• How do we find TCTA using dynamic programming?
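To see where TCTA comes from, the two gapped rows above can be read column by column: every column in which both rows show the same character contributes one character to the common subsequence. A small sketch of that reading (the helper name `common_subsequence` is an assumption of this example, not from the slides):

```python
def common_subsequence(v_aligned: str, w_aligned: str) -> str:
    """Collect the characters of the columns where both aligned rows agree."""
    if len(v_aligned) != len(w_aligned):
        raise ValueError("aligned rows must have the same length")
    # A column counts only if both rows hold the same real character;
    # a hyphen is a gap, never a match.
    return "".join(a for a, b in zip(v_aligned, w_aligned)
                   if a == b and a != "-")


print(common_subsequence("AT-C-TGAT", "-TGCAT-A-"))  # TCTA
```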

Review: Edit Distance

Edit distance: turning one sequence into another with the least number of operations.
• Allowed operations: insertion, deletion, and substitution

The longest common subsequence problem is essentially the same problem with only insertion and deletion allowed. In the grid, a diagonal edge for a match has weight 1 and every insertion or deletion edge has weight 0, and we look for the longest path (basically the Manhattan Tourist problem with fixed weights).
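The grid formulation above can be sketched as a short dynamic program. This is an illustrative sketch rather than the slides' own code: `s[i][j]` holds the length of a longest common subsequence of the first `i` characters of v and the first `j` characters of w, and a traceback recovers one such subsequence. When several longest common subsequences exist, which one is returned depends on the tie-breaking order in the traceback.

```python
def lcs(v: str, w: str) -> str:
    """Return one longest common subsequence of v and w."""
    n, m = len(v), len(w)
    # s[i][j] = LCS length of v[:i] and w[:j]; row 0 and column 0 stay 0.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Insertion / deletion edges have weight 0.
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            # A diagonal edge exists only for a match and has weight 1.
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)

    # Walk back from the sink, collecting the matched characters.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1] and s[i][j] == s[i - 1][j - 1] + 1:
            out.append(v[i - 1])
            i -= 1
            j -= 1
        elif s[i][j] == s[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(len(lcs("ATCTGAT", "TGCATA")))  # 4, the length of TCTA
```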

Example

See the other slides:
• Chapter 6: Edit Distance, Slides 54-58

Global Sequence Alignment

Chapter 6: Alignment

Scoring Alignments

Scoring matrices are based on biological evidence.
• Certain amino acid mutations are more common than others.
• For instance, Asn, Asp, Glu, and Ser are the most mutable amino acids.
• Ser is approximately three times as likely to mutate into Phe as Trp is to mutate into Phe.
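As a rough illustration of how a scoring matrix is used, the sketch below sums one score per alignment column. Everything specific in it is an assumption made for this example: the alphabet X, Y, Z is a placeholder (not real amino acids), the numbers in `pair_scores` are made up (not PAM or BLOSUM values), and the gap score of -2 is arbitrary. The only point is that different substitutions can carry different scores.

```python
def score_alignment(v_aligned, w_aligned, delta, gap_score):
    """Sum the per-column scores of a gapped alignment."""
    if len(v_aligned) != len(w_aligned):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(v_aligned, w_aligned):
        if a == "-" or b == "-":
            total += gap_score      # column with a gap
        else:
            total += delta[(a, b)]  # match or substitution
    return total


# Hypothetical, made-up scores: X<->Y is a "common" substitution (mild
# penalty), X<->Z a "rare" one (strong penalty).
pair_scores = {("X", "X"): 2, ("Y", "Y"): 2, ("Z", "Z"): 2,
               ("X", "Y"): -1, ("X", "Z"): -3, ("Y", "Z"): -2}
delta = {}
for (a, b), score in pair_scores.items():
    delta[(a, b)] = score
    delta[(b, a)] = score  # keep the matrix symmetric

print(score_alignment("XXY-Z", "XYYZZ", delta, gap_score=-2))  # 3
```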

PAM

1 PAM unit: 1 mutation for every 100 amino acids.
• Required condition: the proteins being analyzed must be closely related.
• The scoring matrix uses probabilities that can change if the proteins are not closely related.
• The probability that one amino acid mutates into another differs from pair to pair.
• Essentially, 1 PAM is the average time for the "average" protein to mutate 1%.
• You end up with a family of scoring matrices: PAM 1, PAM 2, and so on.
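The slides do not spell out how PAM 2 relates to PAM 1. In the standard construction, which is assumed here, the mutation probability matrix for n PAM units is the PAM 1 probability matrix multiplied by itself n times. The sketch below shows this on a made-up two-letter alphabet; `toy_pam1` is a hypothetical matrix, not real amino acid data, and real scoring matrices additionally convert these probabilities into scores.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Mutation probabilities after n PAM units: pam1 applied n times."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Hypothetical two-letter alphabet: each letter has a 1% chance of mutating
# into the other in one PAM unit.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam2 = pam_n(toy_pam1, 2)
print(toy_pam2)  # diagonal about 0.9802, off-diagonal about 0.0198
```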

Local Sequence Alignment

• Global alignment looks at two entire strings
• Local alignment looks only for locally similar regions
• That is, it looks for short substrings that are similar within larger sequences

Smith-Waterman Local Alignment Algorithm

Add an edge of weight 0 from the source to every other vertex, so that an alignment can start anywhere in the grid at no cost.
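In the recurrence, the zero-weight edges from the source show up as an extra option of 0 in the maximum, and the answer is the best score anywhere in the table rather than the score at the sink. A minimal sketch follows; the match, mismatch, and indel scores (+1, -1, -1) are assumptions of this example, not values given in the slides.

```python
def local_alignment_score(v, w, match=1, mismatch=-1, indel=-1):
    """Best local alignment score of v and w (Smith-Waterman style)."""
    n, m = len(v), len(w)
    # Row 0 and column 0 stay 0: an alignment may start anywhere for free.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            s[i][j] = max(
                0,                    # zero-weight edge from the source
                diag,                 # match or mismatch
                s[i - 1][j] + indel,  # gap in w
                s[i][j - 1] + indel,  # gap in v
            )
            best = max(best, s[i][j])  # the alignment may also end anywhere
    return best


# The shared region ACGT is found even though the flanks are unrelated.
print(local_alignment_score("XXXACGTYYY", "ZZACGTZZ"))  # 4
```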

Alignment with Gap Penalties

• Gaps are expected in the sequences.
• However, very small gaps could indicate dissimilarity, so a penalty is applied to any gap that meets a given criterion.
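One common way to make such a penalty concrete is an affine gap penalty, where opening a gap costs a fixed amount and each gapped position costs a little more. The slides do not give a formula, so the form below and the values `gap_open=5` and `gap_extend=1` are assumptions for illustration. Under this scheme several very short gaps cost more than one long gap covering the same number of positions.

```python
def affine_gap_penalty(length, gap_open=5, gap_extend=1):
    """Penalty for one contiguous gap of `length` positions."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    # Fixed cost for opening the gap plus a per-position extension cost.
    return gap_open + gap_extend * length


one_long_gap = affine_gap_penalty(3)          # 5 + 3 = 8
three_short_gaps = 3 * affine_gap_penalty(1)  # 3 * (5 + 1) = 18
print(one_long_gap, three_short_gaps)  # 8 18
```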

References

• An Introduction to Bioinformatics Algorithms
• Related Slides
