lecture 6ll

7/27/2019 Lecture 6ll

1/46

Bioinformatics 1 -- lecture 6

Affine gap penalty

Substitution Matrices

PAM

BLOSUM

Matrix bias in local alignment


2/46

Reminder: scoring an alignment

A ~ T S F M ~

A G L S T F M

The score of the alignment is the sum of the scores of each

column (match, deletion or insertion) in the alignment.

Match: look up match score from substitution matrix

New gap: use gap initiation penalty

Additional gap: use gap extension penalty

End gap: Optional, may be zero.
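The column-by-column scoring rule can be sketched in Python. This is a minimal sketch, not course code: the names `SCORES`, `pair_score` and `score_alignment` are made up for illustration, the table holds only the few BLOSUM62 entries the example needs, and the penalties (-2 for a new gap, -1 for each additional gap position, 0 for end gaps) are assumed values.

```python
# A few BLOSUM62 entries, just enough for the example (keys are sorted pairs).
SCORES = {("A", "A"): 4, ("L", "T"): -1, ("S", "S"): 4,
          ("F", "T"): -2, ("F", "M"): 0}

def pair_score(x, y):
    """Look up a substitution score; the matrix is symmetric."""
    return SCORES[tuple(sorted((x, y)))]

def residue_span(seq):
    """Indices of the first and last non-gap character in an aligned row."""
    idx = [k for k, c in enumerate(seq) if c != "~"]
    return idx[0], idx[-1]

def score_alignment(top, bottom, gap_open=-2, gap_extend=-1, end_gap=0):
    """Sum the column scores of two aligned rows of equal length ('~' = gap)."""
    assert len(top) == len(bottom)
    spans = {"top": residue_span(top), "bottom": residue_span(bottom)}
    total, prev_gap = 0, None
    for k, (x, y) in enumerate(zip(top, bottom)):
        if x != "~" and y != "~":
            total += pair_score(x, y)      # match column
            prev_gap = None
            continue
        which = "top" if x == "~" else "bottom"
        first, last = spans[which]
        if k < first or k > last:
            total += end_gap               # gap outside the sequence: end gap
        elif prev_gap == which:
            total += gap_extend            # additional gap position
        else:
            total += gap_open              # new gap
        prev_gap = which
    return total

print(score_alignment("A~TSFM~", "AGLSTFM"))  # prints 3
```

A gap in one row that directly follows a gap in the other row is charged as a new gap here, which is one possible convention.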


3/46

Affine gap exercise

~ ~ A T S F M

A G L S T F M

A ~ ~ T S F M

A G L S T F M

A ~ T S F M ~

A G L S T F M

Which alignment scores the highest?

Given:

No end gap penalty.

BLOSUM score.

Affine gap penalty (gap opening -2, gap extension -1).


4/46

Worksheet for affine gap local dynamic programming

[Worksheet: blank grids for the M, I, and D matrices, with sequences Q S I M P R P and L T V K P.]

Fill in the scores and the traceback letters using the BLOSUM62 matrix and gap opening = -2, gap extension = -1. Start at 0. End at maximum. No end gaps (meaning no starting gaps).


5/46

Rules for affine gap penalty DP

Do not penalize end-gaps.

I can follow D, and vice versa, but it is a gap opening.

For each box, write the score and the traceback letter (M, I, or D).

Affine gap penalty worksheet

M = match matrix: scores for alignments currently in a match state. Fill in M as the max over three possible diagonal arrows.

I = insertion matrix: scores for alignments with gap in first sequence. Fill in I as the max over three possible down arrows.

D = deletion matrix: scores for alignments with gap in second sequence. Fill in D as the max over three possible right arrows.

[Worksheet: three blank grids (M, I, D), each with A D P Q F G across the top and A K L K L D Q F G P down the side.]

Note: I wrote the sequences in the gap rows!!

M[i,j] = MAX of:
    M[i-1,j-1] + match score
    I[i-1,j-1] + match score
    D[i-1,j-1] + match score

I[i,j] = MAX of:
    M[i,j-1] - 2
    I[i,j-1] - 1
    D[i,j-1] - 2

D[i,j] = MAX of:
    M[i-1,j] - 2
    I[i-1,j] - 2
    D[i-1,j] - 1
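The three recurrences can be tried out with a short Python sketch. This is an illustration under stated assumptions, not the worksheet solution: the function names are hypothetical, the match score is a simple identity score (+2 / -1) instead of BLOSUM62, and the extra 0 term in M is an assumed way to implement "start at 0, end at maximum" for local alignment.

```python
NEG_INF = float("-inf")

def identity_score(x, y):
    """Assumed toy match score: +2 for identical residues, -1 otherwise."""
    return 2 if x == y else -1

def affine_local_score(a, b, score, gap_open=-2, gap_extend=-1):
    """Best local alignment score using the three matrices M, I and D."""
    rows, cols = len(a) + 1, len(b) + 1
    M = [[0] * cols for _ in range(rows)]        # paths ending in a match column
    I = [[NEG_INF] * cols for _ in range(rows)]  # paths ending in a gap, from (i, j-1)
    D = [[NEG_INF] * cols for _ in range(rows)]  # paths ending in a gap, from (i-1, j)
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = score(a[i - 1], b[j - 1])
            # The 0 lets a local alignment start fresh at this cell.
            M[i][j] = max(0, M[i-1][j-1], I[i-1][j-1], D[i-1][j-1]) + s
            I[i][j] = max(M[i][j-1] + gap_open,
                          I[i][j-1] + gap_extend,
                          D[i][j-1] + gap_open)
            D[i][j] = max(M[i-1][j] + gap_open,
                          I[i-1][j] + gap_open,
                          D[i-1][j] + gap_extend)
            best = max(best, M[i][j])            # end at the maximum
    return best

print(affine_local_score("AAATTT", "AAACTTT", identity_score))  # prints 10
```

In the example, six matches (+12) and one gap opening (-2) beat the best ungapped alignment, so the gapped path wins. Traceback letters are left out to keep the sketch short.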


6/46

BLOSUM matrix for match scores

Two 20x20 substitution matrices are used: BLOSUM & PAM.

BLOSUM62 (upper triangle)

    A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A   4  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -2
C      9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2
D         6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -3
E            5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -2
F               6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1  3
G                  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -3
H                     8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2  2
I                        4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1
K                           5 -2 -1  0 -1  1  2  0 -1 -2 -3 -2
L                              4  2 -3 -3 -2 -2 -2 -1  1 -2 -1
M                                 5 -2 -2  0 -1 -1 -1  1 -1 -1
N                                    6 -2  0  0  1  0 -3 -4 -2
P                                       7 -1 -2 -1 -1 -2 -4 -3
Q                                          5  1  0 -1 -2 -2 -1
R                                             5 -1 -1 -3 -3 -2
S                                                4  1 -2 -3 -2
T                                                   5  0 -2 -2
V                                                      4 -3 -1
W                                                        11  2
Y                                                            7

Each number is the score for aligning a single pair of amino acids.

What is the score for this alignment?

ACEPGAA

ASDDGTV


7/46

Substitution matrices

Used to score aligned positions, usually of amino acids.

Expressed as the log-likelihood ratio of mutation (or log-odds ratio)

Derived from multiple sequence alignments

Two commonly used matrices: PAM and BLOSUM

PAM = percent accepted mutations (Dayhoff)

BLOSUM = Blocks substitution matrix (Henikoff)

Read: Mount pp94-113


8/46

PAM

Evolutionary time is measured in Percent Accepted

Mutations, or PAMs

One PAM of evolution means 1% of the residues/bases have

changed, averaged over all 20 amino acids.

To get the relative frequency of each type of mutation, we

count the times it was observed in a database of multiple

sequence alignments.

Based on global alignments

Assumes a Markov model for evolution.

M Dayhoff, 1978


9/46

BLOSUM

Based on database of ungapped local alignments

(BLOCKS)

Alignments have lower similarity than PAM alignments.

BLOSUM number indicates the percent identity level of

sequences in the alignment. For example, for BLOSUM62

sequences with approximately 62% identity were counted.

Some BLOCKS represent functional units, providing

validation of the alignment.

Henikoff & Henikoff, 1992


10/46

A multiple sequence alignment is made using many pairwise sequence alignments

Multiple Sequence Alignment


11/46

Columns in a MSA have a common evolutionary history

By aligning the sequences, we assert that the aligned

residues in each column had a common ancestor.


12/46

How do you count the mutations?

Assume any of the sequences could be the ancestral one.

L K F R L S K K P
L K F R L S K K P
L K F R L T K K P
L K F R L S K K P
L K F R L S R K P
L K F R L T R K P
L K F R L ~ K K P

One column of a multiple sequence alignment (top to bottom): G G W W N G G

If the first sequence was the ancestor,

then it mutated to a W twice, to N

once, and conserved G three times.


13/46

Or, we could have picked...

L K F R L S K K P
L K F R L S K K P
L K F R L T K K P
L K F R L S K K P
L K F R L S R K P
L K F R L T R K P
L K F R L ~ K K P

One column of a multiple sequence alignment (top to bottom): G G W W N G G

W was the ancestor, then it mutated

to a G four times, to N once, and was

conserved once.


14/46

Substitution matrices are symmetrical

Since we don't know which sequence came first, we don't know whether G-->W or W-->G is correct. So we count this as one mutation of each type: G-->W and W-->G. In the end the 20x20 matrix will have the same number for elements (i,j) and (j,i).

(That's why we only show the upper triangle.)

15/46

Summing the substitution counts

One column of a MSA: G G W W N G G

Counts with the first residue (G) as the ancestor, entered in the symmetrical matrix: G-G 3, G-W 2, G-N 1.

We assume the ancestor is one of the observed amino acids, but we don't know which, so we try them all.


16/46

Next possible ancestor...

Column: G G W W N G G

Counts for the second residue (G) against the remaining residues: G-G 2, G-W 2, G-N 1.

We already counted this residue against all others, so we blank it out.


17/46

Next...

Column: G G W W N G G

Counts for the third residue (W) against the remaining residues: W-W 1, G-W 2, W-N 1.


18/46

Next...

Column: G G W W N G G

Counts for the fourth residue (W) against the remaining residues: W-W 0, G-W 2, W-N 1.


19/46

Next...

Column: G G W W N G G

Counts for the fifth residue (N) against the remaining residues: N-N 0, G-N 2, W-N 0.


20/46

Next...

Column: G G W W N G G

Counts for the sixth residue (G) against the remaining residue: G-G 1, G-W 0, G-N 0.


21/46

Last...

Column: G G W W N G G

Counts for the last residue (G): 0, 0, 0 (no counts for last seq.)


22/46

Summing the substitution counts

Column: G G W W N G G

Summed counts: G-G 6, G-W 8, G-N 4, W-W 1, W-N 2, N-N 0. TOTAL=21

Now we do this for every column in every multiple sequence alignment...
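Trying every residue as the ancestor and blanking it out afterwards amounts to counting each unordered pair of residues in the column once. A minimal Python sketch of that counting (the function name `count_pairs` is made up for illustration):

```python
from collections import Counter
from itertools import combinations

def count_pairs(column):
    """Count every unordered residue pair in one alignment column."""
    counts = Counter()
    for x, y in combinations(column, 2):
        counts[tuple(sorted((x, y)))] += 1   # (G, W) and (W, G) share one key
    return counts

counts = count_pairs("GGWWNGG")
print(counts[("G", "G")], counts[("G", "W")], counts[("G", "N")])  # prints 6 8 4
print(sum(counts.values()))                                        # prints 21
```

Sorting each pair before counting is what makes the resulting matrix symmetrical. For a whole database, the counters of all columns would simply be added together.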


23/46

log odds

log odds ratio = log2(observed/expected)

Substitutions (and many other things in bioinformatics) are expressed as a "likelihood ratio", or "odds ratio", of the observed data over the expected value. Here, likelihood and odds are used as synonyms for probability. So log odds is the log (usually base 2) of the odds ratio.


24/46

Getting log-odds from counts

P(G) = 4/7 = 0.57

Observed probability of G->G: qGG = P(G->G) = 6/21 = 0.29

Expected probability of G->G: eGG = 0.57*0.57 = 0.33

odds ratio = qGG/eGG = 0.29/0.33

log odds ratio = log2(qGG/eGG) = -0.19

If the lod is < 0, then the mutation is less likely than expected by chance. If it is > 0, it is more likely.
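The same arithmetic as a small Python sketch. The function name `log_odds` is hypothetical, and the expected value follows the slide's convention (product of the two background frequencies), which is an assumption about how the expectation is defined.

```python
from itertools import combinations
from math import log2

def log_odds(columns, x, y):
    """log2(observed/expected) for the residue pair (x, y) over alignment columns."""
    observed = total_pairs = 0
    for col in columns:
        for a, b in combinations(col, 2):
            total_pairs += 1
            if {a, b} == {x, y}:
                observed += 1
    residues = "".join(columns)
    q = observed / total_pairs                      # observed pair frequency
    # Expected frequency: product of the background frequencies of x and y.
    e = (residues.count(x) / len(residues)) * (residues.count(y) / len(residues))
    return log2(q / e)                              # fails if the pair is never observed

print(round(log_odds(["GGWWNGG"], "G", "G"), 2))  # prints -0.19
```

With more columns in the list, the pair counts are pooled before the ratio is taken, so the same function works for a database of several alignment columns.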


25/46

Different observations, same expectation

Gs spread over many columns:

G G
G A
W G
W A
N G
G A
G A

P(G)=0.50

eGG = 0.25

qGG = 9/42 = 0.21

lod = log2(0.21/0.25) = -0.2

Gs concentrated:

G W
G A
G W
G A
G W
G A
G A

P(G)=0.50

eGG = 0.25

qGG = 21/42 = 0.5

lod = log2(0.50/0.25) = 1


26/46

Different observations, same expectation

G and W seen together more often than expected:

G G
G A
W G
A W
N G
G A
G A

P(G)=0.50, P(W)=0.14

eGW = 0.07

qGW = 7/42 = 0.17

lod = log2(0.17/0.07) = 1.3

Gs and Ws not seen together:

G W
G A
G W
G A
G W
G A
A G

P(G)=0.50, P(W)=0.14

eGW = 0.07

qGW = 3/42 = 0.07

lod = log2(0.07/0.07) = 0

In class exercise:


27/46

Get the substitution value for P->Q

...given a very small database.

PQPPQQQPQQP

PQPPPQQQP

P(P)=_____, P(Q)=_____

ePQ = _____

qPQ = ___/___ =_____

lod = log2(qPQ/ePQ) = ____

[Blank count matrix with rows and columns P, Q.]


28/46

Markovian evolution and PAM

A Markov process is one where the likelihood of the next "state" depends only on the current state.

Markovian evolution assumes that base changes (or amino acid changes) occur at a constant rate and depend only on the identity of the current base (or amino acid).

[Figure: one position in a protein over millions of years (G, G, A, V, V, G), with transition probabilities .9946, .0002, .0021, .0001, .9932.]


29/46

Markovian evolution is an extrapolation

Start with all G's. Wait 1 million years. Where

do they go?

Using PAM1, we expect them to mutate to about 0.0002 A, 0.0007 P, 0.9946 G, etc.

Wait another million years.

The new A's mutate according to PAM1 for A's, P's mutate according to PAM1 for P's, etc.

Wait another million, etc., etc., etc.

What is the final distribution of amino acids at the positions that were once G's?


30/46

Matrix multiplication

To start we have 100% G, 0% everything else:

(0, 0, 0, 0, 1, 0, 0, 0, ..., 0)

After 1MY we have each amino acid according to the PAM probabilities:

PAM1 x (0, 0, 0, 0, 1, 0, ..., 0) = (P(G->A), P(G->C), P(G->D), P(G->E), P(G->G), P(G->F), P(G->H), P(G->I), P(G->K), P(G->L), P(G->M), P(G->N), P(G->P), P(G->Q), P(G->R), P(G->S), P(G->T), ...)

Values shown for the resulting column: 0.0001, 0.0001, 0.00015, 0.00005, 0.99943, 0.00002, 0.00005, 0.00001, 0.0002, 0.00015, 0.00002, 0.00003, 0.0006, 0.0006, 0.00002


31/46

Matrix multiplication

After 2MY each amino acid has mutated again according to the PAM1 probabilities:

PAM1 x (PAM1 x starting column) = etc.

Column after the first multiplication: 0.0001, 0.0001, 0.00015, 0.00005, 0.99943, 0.00002, 0.00005, 0.00001, 0.0002, 0.00015, 0.00002, 0.00003, 0.0006, 0.0006, 0.00002


32/46

250 PAMs

PAM250 = PAM1 x PAM1 x ... x PAM1 = (PAM1)^250
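Repeated multiplication by the one-step matrix can be sketched in plain Python. Everything here is a toy: `toy_pam1` is an invented 3-state matrix, not the real 20x20 PAM1, and the sketch uses the row convention (row = current residue, column = next residue, rows sum to 1), which is an assumption and the transpose of the column-vector layout drawn on the slides.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(power):
        result = mat_mul(result, A)
    return result

# Invented 3-state one-step matrix; each row sums to 1.
toy_pam1 = [[0.990, 0.005, 0.005],
            [0.010, 0.980, 0.010],
            [0.002, 0.008, 0.990]]

toy_pam250 = mat_pow(toy_pam1, 250)
# Row 0: where positions that started in state 0 end up after 250 steps.
print([round(p, 3) for p in toy_pam250[0]])
```

Each row of the result is still a probability distribution, and the chance of remaining in the starting state drops well below its one-step value, which is the extrapolation the PAM series relies on.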


33/46

Differences between PAM and BLOSUM

PAM

PAM matrices are based on global alignments of closely related proteins.

The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.

Other PAM matrices are extrapolated from PAM1 using an assumed Markov chain.

BLOSUM

BLOSUM matrices are based on local alignments.

BLOSUM 62 is a matrix calculated from comparisons of sequences with approx 62% identity.

All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.

BLOSUM 62 is the default matrix in BLAST (the database search program). It is tailored for comparisons of moderately distant proteins. Alignment of distant relatives may be more accurate with a different matrix.


34/46

Increasing sophistication in match scoring

1. Identity score.

2. Genetic code changes (mutations of one base more likely than 2 or 3). (1966)

3. Matrices based on chemical similarity of amino acids. (1985)

4. Matrices based on multiple sequence alignments (PAM (1978), BLOSUM (1992))

5. Dipeptide substitution matrices (i.e. AG --> DG, etc.) (1994)

6. Class specific substitution matrices (D. Jones' transmembrane protein matrix) (1994)

7. Structure-based substitution matrices (2000)

8. Position-specific, structure-based substitution matrices (2006)


35/46

PAM250


36/46

BLOSUM62


37/46

Which substitution matrix favors...

conservation of polar residues

conservation of non-polar residues

conservation of C, Y, or W

polar-to-nonpolar mutations

polar-to-polar mutations

PAM250 BLOSUM62


38/46

Local alignment, revisited

Starts at zero (the score of a non-alignment)

Ends at the maximum score anywhere in the matrix.

Global Local

Advantages:

Does not care if the aligned region has long "tails".

Can align pieces of one sequence to pieces of another.

Multi-domain sequences are OK


39/46

Local alignment, revisited

Global Local

Disadvantages:

Fails on multidomain alignments if large gaps are present.

Success depends on an additional parameter: Matrix Bias


40/46

Local Alignment with matrix bias

Matrix bias = a constant added to the substitution score.

Has the same effect as starting the alignment at a number other than zero.

[Grid: sequence P G T S F E P down the side, A T S F M across the top.]

A(i,j) = MAX of:
    A(i-1,j-1) + match score + bias
    A(i,j-1) + gap*
    A(i-1,j) + gap
    0 + match score

*linear gap penalty.
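A small Python sketch of this recurrence shows what the bias does. It is a sketch under assumptions, not the lecture's implementation: the function name is hypothetical, the match score is a toy identity score (+1 / -1) rather than BLOSUM62, the gap penalty is -2, and the path length is tracked only so that the effect of the bias is visible.

```python
def local_align_with_bias(a, b, match=1, mismatch=-1, gap=-2, bias=0):
    """Return (best score, number of columns) of the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    length = [[0] * cols for _ in range(rows)]
    best = (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (score[i-1][j-1] + s + bias, length[i-1][j-1] + 1),  # extend diagonally
                (score[i][j-1] + gap, length[i][j-1] + 1),           # gap, from the left
                (score[i-1][j] + gap, length[i-1][j] + 1),           # gap, from above
                (0 + s, 1),                                          # start a new alignment
            ]
            score[i][j], length[i][j] = max(candidates)
            best = max(best, (score[i][j], length[i][j]))
    return best

print(local_align_with_bias("PGTSFEP", "ATSFM", bias=0))  # prints (3, 3)
print(local_align_with_bias("PGTSFEP", "ATSFM", bias=2))  # prints (11, 5)
```

With no bias the best local alignment is just the three matching columns T S F. With a bias of 2 even a mismatch column adds to the score, so the alignment grows to five columns.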


41/46

Effect of matrix bias

Higher matrix bias favors matches over gaps. RESULT: more matches, longer local alignments.

[Figure: more matrix bias]


42/46

How do we know when the alignment is

correct?

2DRC:A 1/2 MISLIAALAVDRVIGMENAM-PFNLPADLAWFKRNTL-------DKPVIMGRHTWESIG-

1DRF:_ 3/4 SLNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQNLVIMGKKTWFSIPE

2DRC:A 52/53 --RPLPGRKNIILSSQP--GTDDRVTWVKSVDEAIAACG------DVPEIMVIGGGRVYE

1DRF:_ 63/64 KNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYK

2DRC:A 102/103 QFLPK--AQKLYLTHIDAEVEGDTHFPDYEPDDWESVF------SEFHDADAQNSHSYCF

1DRF:_ 123/124 EAMNHPGHLKLFVTRIMQDFESDTFFPEIDLEKYKLLPEYPGVLSDVQEE---KGIKYKF

2DRC:A 154/155 EILERR

1DRF:_ 180/181 EVYEKN

Compare to known structure-based alignments


43/46

Aligning fibronectin

Fibronectin is a long multidomain protein involved in adhesion/migration of cells, blood clotting, signaling, and interactions with the extracellular matrix (ECM). Interacts with collagen, fibrin, heparin and integrins.

It is made up of many copies of at least 3 "modules". Small differences within modules cause important biological effects.

How do you align fibronectins?


44/46

Multiple local alignments?

II

III

II II II III II III

Fragment-based alignment methods find all local

alignments. (BLAST, FASTA)

One way would be to select the maximum score, then the next highest score, and so on to get all of the possible alignments.


45/46

You have seen....

Dynamic programming:

Global alignment

Global/local alignment (no end gaps. 3 ways to do it.)

Local alignment

Linear gap penalty

Affine gap penalty

How many ways are there to do DP?


46/46

In class exercise: local alignment using BestFit

Start SeqLab.

Go to Editor mode, with no sequences.

Download two protein sequences from the PIR database (File/Add sequences/Databases):

PIR2:A46444

PIR2:S58653

Select both. Run BestFit (Functions/Pairwise comparison/BestFit).

Go to Options: set gap creation (range 1-20), gap extension (range 0-10). Run. Look at the results. (Compare to what you see on
