pairwise alignments biology 224 instructor: tom peavy sept 14 &16

53
Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16 <PowerPoint slides based on Bioinformatics and Functional Genomics by Jonathan Pevsner>

Upload: delilah-byrd

Post on 04-Jan-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Pairwise Alignments

Biology 224Instructor: Tom Peavy

Sept 14 &16

<PowerPoint slides based on Bioinformatics and Functional Genomics by Jonathan Pevsner>

Page 2: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

General approach to pairwise alignment

• Choose two sequences• Select an algorithm that generates a score• Allow gaps (insertions, deletions)• Score reflects degree of similarity• Alignments can be global or local• Estimate probability that the alignment occurred by chance

Page 3: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

“Scoring matrices” let us assign scores for each aligned amino acid in a pairwise alignment.

What should the score be when a serine matches a serine, or a threonine, or a valine?

Can we devise “lenient” scoring systems to help us align distantly related proteins, and more conservative scoring systems to align closely related proteins?

Page 4: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Calculation of an alignment score

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Alignment_Scores2.html

Page 5: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Scoring Matrices take into account:

Conservation – acceptable substitutions while notchanging function of protein(charge, size, hydrophobicity)

Frequency – reflect how often particular residues occur among entire collection of proteins (rare residues given more weight)

Evolution – different scoring matrices are designed to either detect closely related or more distantly related proteins

Page 6: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

global alignment algorithm--Needleman and Wunsch (1970)

local alignment algorithm--Smith and Waterman (1981)

Two kinds of sequence alignment: global and local

Page 7: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Global alignment (Needleman-Wunsch) extendsfrom one end of each sequence to the other

Local alignment finds optimally matching regions within two sequences (“subsequences”)

Local alignment is almost always used for databasesearches such as BLAST. It is useful to find domains(or limited regions of homology) within sequences

Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Othermethods (BLAST, FASTA) are faster but less thorough.

Global alignment versus local alignment

Page 8: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

• Two sequences can be compared in a matrix along x- and y-axes.

• If they are identical, a path along a diagonal can be drawn

• Find the optimal subpaths, and add them up to achieve the best score. This involves

--adding gaps when needed--allowing for conservative substitutions--choosing a scoring system (simple or complicated)

• N-W is guaranteed to find optimal alignment(s)

Global alignment with the algorithmof Needleman and Wunsch (1970)

Page 9: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

[1] set up a matrix

[2] score the matrix

[3] identify the optimal alignment(s)

Three steps to global alignment with the Needleman-Wunsch algorithm

Page 10: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

[1] identity (stay along a diagonal)[2] mismatch (stay along a diagonal)[3] gap in one sequence (move vertically!)[4] gap in the other sequence (move horizontally!)

Four possible outcomes in aligning two sequences

1

2

Page 11: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Try using needle to implement a Needleman-Wunsch global alignment algorithm to find the optimum alignment (including gaps):

http://www.ebi.ac.uk/emboss/align/

Page 12: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Queries:beta globin (NP_000509)alpha globin (NP_000549)

Page 13: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Local AlignmentHow the Smith-Waterman algorithm works

Set up a matrix between two proteins (size m+1, n+1)

No values in the scoring matrix can be negative! S > 0

The score in each cell is the maximum of four values:[1] s(i-1, j-1) + the new score at [i,j] (a match or mismatch)

[2] s(i,j-1) – gap penalty[3] s(i-1,j) – gap penalty[4] zero

Page 14: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Smith-Waterman local alignment algorithm

Page 15: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Rapid, heuristic versions of Smith-Waterman:FASTA and BLAST

Smith-Waterman is very rigorous and it is guaranteedto find an optimal alignment.

But Smith-Waterman is slow. It requires computerspace and time proportional to the product of the twosequences being aligned (or the product of a query against an entire database).

Gotoh (1982) and Myers and Miller (1988) improved thealgorithms so both global and local alignment requireless time and space.

FASTA and BLAST provide rapid alternatives to S-W

Page 16: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Try using SSEARCH to perform a rigorous Smith-Waterman local alignment: http://fasta.bioch.virginia.edu/

Page 17: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Queries:beta globin (NP_000509)alpha globin (NP_000549)

Page 18: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

How FASTA works

[1] A “lookup table” is created. It consists of short stretches of amino acids (e.g. k=3 for a protein search). The length of a stretch is called a k-tuple. The FASTA algorithm finds the ten highest scoring segments that align to the query.

[2] These ten aligned regions are re-scored with a PAM or BLOSUM matrix.

[3] High-scoring segments are joined.

[4] Needleman-Wunsch or Smith-Waterman is then performed.

Page 19: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

• Go to http://www.ncbi.nlm.nih.gov/BLAST• Choose BLAST 2 sequences• In the program,

[1] choose blastp or blastn[2] paste in your accession numbers (or use FASTA format)[3] select optional parameters

--3 BLOSUM and 3 PAM matrices--gap creation and extension penalties--filtering--word size

[4] click “align”

Pairwise alignment: BLAST 2 sequences

Page 20: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Page 73

Page 21: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Page 74

Page 22: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Statistical significance of pairwise alignment

We will discuss the statistical significance of alignment scores in the BLAST lectures. A basic question is how to determine whether a particular alignment score is likely to have occurred by chance. According to the null hypothesis, two aligned sequences are not homologous (evolutionarily related). Can we reject the null hypothesis at a particular significance level alpha?

Page 23: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

True positives False positives

False negatives

Sequences reportedas related

Sequences reportedas unrelated

True negatives

homologoussequences

non-homologoussequences

Sensitivity:ability to findtrue positives

Specificity:ability to minimize

false positives

Page 24: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids.

Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments(or multiple sequence alignments) of amino acids. Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution.

The two major types of substitution matrices arePAM and BLOSUM.

Substitution Matrix

Page 25: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

fly GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA human GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA plant GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA yeast GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA archaeon GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST human KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST plant KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST yeast KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST archaeon KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK human GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV plant GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA yeast GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV archaeon GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

Dayhoff et al. examined multiple sequence alignments(e.g. glyceraldehyde 3-phosphate dehydrogenases)

to generate tables of accepted point mutations

Examined 1572 changes in 71 groups of closely related proteins

Page 26: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Page 80

Emile Zuckerkandl and Linus Pauling (1965) considered substitution frequencies in 18 globins(myoglobins and hemoglobins from human to lamprey).

Black: identityGray: very conservative substitutions (>40% occurrence)White: fairly conservative substitutions (>21% occurrence)Red: no substitutions observed

lys found at 58% of arg sites

Page 27: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Dayhoff’s 34 protein superfamilies

Protein PAMs per 100 million years

Ig kappa chain 37Kappa casein 33luteinizing hormone b 30lactalbumin 27complement component 3 27epidermal growth factor 26proopiomelanocortin 21pancreatic ribonuclease 21haptoglobin alpha 20serum albumin 19phospholipase A2, group IB 19prolactin 17carbonic anhydrase C 16Hemoglobin 12Hemoglobin 12

Page 28: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Dayhoff’s 34 protein superfamilies

Protein PAMs per 100 million years

Ig kappa chain 37Kappa casein 33luteinizing hormone b 30lactalbumin 27complement component 3 27epidermal growth factor 26proopiomelanocortin 21pancreatic ribonuclease 21haptoglobin alpha 20serum albumin 19phospholipase A2, group IB 19prolactin 17carbonic anhydrase C 16Hemoglobin 12Hemoglobin 12

human (NP_005203) versus mouse (NP_031812)

Page 29: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Dayhoff’s 34 protein superfamilies

Protein PAMs per 100 million years

apolipoprotein A-II 10lysozyme 9.8gastrin 9.8myoglobin 8.9nerve growth factor 8.5myelin basic protein 7.4thyroid stimulating hormone b 7.4parathyroid hormone 7.3parvalbumin 7.0trypsin 5.9insulin 4.4calcitonin 4.3arginine vasopressin 3.6adenylate kinase 1 3.2

Page 30: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Dayhoff’s 34 protein superfamilies

Protein PAMs per 100 million years

triosephosphate isomerase 1 2.8vasoactive intestinal peptide 2.6glyceraldehyde phosph. dehydrogease 2.2cytochrome c 2.2collagen 1.7troponin C, skeletal muscle 1.5alpha crystallin B chain 1.5glucagon 1.2glutamate dehydrogenase 0.9histone H2B, member Q 0.9ubiquitin 0

Page 50

Page 31: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Pairwise alignment of human (NP_005203) versus mouse (NP_031812) ubiquitin

Page 32: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

• All the PAM data come from alignments of closely related proteins (>85% amino acid identity)

• PAM matrices are based on global sequence alignments.

• The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence (all other PAM matrices are extrapolated from PAM1).

• For the PAM1 matrix, that interval is 1% amino acidDivergence; note that the interval is not in units of time.

Dayhoff’s PAM1 mutation probability matrix(Point-Accepted Mutations)

Page 33: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

The relative mutability of amino acids

Page 53

Dayhoff et al. described the “relative mutability” of each amino acid as the probability that amino acid will change over a small evolutionary time period. The total number of changes are counted (on all branches of all protein trees considered), and the total number of occurrences of each amino acid is also considered. A ratio is determined.

Relative mutability [changes] / [occurrences]

Example:

sequence 1 ala his val alasequence 2 ala arg ser val

For ala, relative mutability = [1] / [3] = 0.33For val, relative mutability = [2] / [2] = 1.0

Page 34: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

The relative mutability of amino acids

Asn 134 His 66Ser 120 Arg 65Asp 106 Lys 56Glu 102 Pro 56Ala 100 Gly 49Thr 97 Tyr 41Ile 96 Phe 41Met 94 Leu 40Gln 93 Cys 20Val 74 Trp 18

Page 53

Note that alanine is normalized to a value of 100.Trp and cys are least mutable.Asn and ser are most mutable.

Page 35: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Normalized frequencies of amino acids:variations in frequency of occurrence

Gly 8.9% Arg 4.1%Ala 8.7% Asn 4.0%Leu 8.5% Phe 4.0%Lys 8.1% Gln 3.8%Ser 7.0% Ile 3.7%Val 6.5% His 3.4%Thr 5.8% Cys 3.3%Pro 5.1% Tyr 3.0%Glu 5.0% Met 1.5%Asp 4.7% Trp 1.0%

blue=6 codons; red=1 codon;note: should be 5% for each if equally distributed

Page 36: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Page 54

Page 37: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Dayhoff’s mutation probability matrixfor the evolutionary distance of 1 PAM

Considered three kinds of information:• a table of number of accepted point mutations (PAMs)• relative mutabilities of the amino acids• normalized frequencies of the amino acids in PAM data

This information can be combined into a “mutation probability matrix” in which each element Mij gives the probability that the amino acid in column j will be replaced by the amino acid in row i after a given evolutionary interval (e.g. 1 PAM).

Page 50

Page 38: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Dayhoff’s PAM1 mutation probability matrixOriginal amino acid

Each element of the matrix shows the probability that anamino acid (top) will be replaced by another residue (side) (n=10,000)

Page 39: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Consider a PAM0 matrix. No amino acids have changed,so the values on the diagonal are 100%.

Consider a PAM2000 (nearly infinite) matrix. The valuesapproach the background frequencies of the amino acids(given in Table 3-2).

PAM0 and PAM2000 mutation probability matrices

Page 40: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

The PAM250 matrix is of particular interest becauseit corresponds to an evolutionary distance of about20% amino acid identity (the approximate limit ofdetection for the comparison of most proteins).

Note the loss of information content along the maindiagonal, relative to the PAM1 matrix.

The PAM250 mutation probability matrix

Page 41: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

PAM250 mutation probability matrix A R N D C Q E G H I L K M F P S T W Y V A 13 6 9 9 5 8 9 12 6 8 6 7 7 4 11 11 11 2 4 9

R 3 17 4 3 2 5 3 2 6 3 2 9 4 1 4 4 3 7 2 2

N 4 4 6 7 2 5 6 4 6 3 2 5 3 2 4 5 4 2 3 3

D 5 4 8 11 1 7 10 5 6 3 2 5 3 1 4 5 5 1 2 3

C 2 1 1 1 52 1 1 2 2 2 1 1 1 1 2 3 2 1 4 2

Q 3 5 5 6 1 10 7 3 7 2 3 5 3 1 4 3 3 1 2 3

E 5 4 7 11 1 9 12 5 6 3 2 5 3 1 4 5 5 1 2 3

G 12 5 10 10 4 7 9 27 5 5 4 6 5 3 8 11 9 2 3 7

H 2 5 5 4 2 7 4 2 15 2 2 3 2 2 3 3 2 2 3 2

I 3 2 2 2 2 2 2 2 2 10 6 2 6 5 2 3 4 1 3 9

L 6 4 4 3 2 6 4 3 5 15 34 4 20 13 5 4 6 6 7 13

K 6 18 10 8 2 10 8 5 8 5 4 24 9 2 6 8 8 4 3 5

M 1 1 1 1 0 1 1 1 1 2 3 2 6 2 1 1 1 1 1 2

F 2 1 2 1 1 1 1 1 3 5 6 1 4 32 1 2 2 4 20 3

P 7 5 5 4 3 5 4 5 5 3 3 4 3 2 20 6 5 1 2 4

S 9 6 8 7 7 6 7 9 6 5 4 7 5 3 9 10 9 4 4 6

T 8 5 6 6 4 5 5 6 4 6 4 6 5 3 6 8 11 2 3 6

W 0 2 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 55 1 0

Y 1 1 2 1 3 1 1 1 3 2 2 1 2 15 1 2 2 3 31 2

V 7 4 4 4 4 4 4 5 4 15 10 4 10 5 5 5 7 2 4 17

Top: original amino acidSide: replacement amino acid

Page 42: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Comparing two proteins with a PAM1 matrixgives completely different results than PAM250!

Consider two distantly related proteins. A PAM40 matrixis not forgiving of mismatches, and penalizes themseverely. Using this matrix you can find almost no match.

A PAM250 matrix is very tolerant of mismatches.

hsrbp, 136 CRLLNLDGTC btlact, 3 CLLLALALTC * ** * **

24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7% rbp4 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV btlact 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN * **** * * * * ** *

rbp4 86 --CADMVGTFTDTEDPAKFKM btlact 80 GECAQKKIIAEKTKIPAVFKI ** * ** **

Page 43: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Why do we go from a mutation probabilitymatrix to a log odds matrix?

• We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues.

• Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them).

Page 44: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

How do we go from a mutation probabilitymatrix to a log odds matrix?

• The cells in a log odds matrix consist of an “odds ratio”:

the probability that an alignment is authenticthe probability that the alignment was random

The score S for an alignment of residues a,b is given by:

S(a,b) = 10 log10 (Mab/pb)

As an example, for tryptophan,

S(a,tryptophan) = 10 log10 (0.55/0.010) = 17.4

Page 45: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

A 2 R -2 6 N 0 0 2 D 0 -1 2 4 C -2 -4 -4 -5 12 Q 0 1 1 2 -5 4 E 0 -1 1 3 -5 2 4 G 1 -3 0 1 -3 -1 0 5 H -1 2 2 1 -3 3 1 -2 6 I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 L -2 -3 -3 -4 -6 -2 -3 -4 -2 -2 6 K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 A R N D C Q E G H I L K M F P S T W Y V

PAM250 log oddsscoring matrix

Page 46: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

What do the numbers meanin a log odds matrix?

S(a,tryptophan) = 10 log10 (0.55/0.010) = 17.4

A score of +17 for tryptophan means that this alignmentis 55 times more likely than a chance alignment of twoTrp residues.

S(a,b) = 17Probability of replacement (Mab/pb) = xThen17.4 = 10 log10 x1.74 = log10 x101.74 = x = 55

Page 47: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

PAM10 log oddsscoring matrixNote that penalties formismatches are far moresevere than for PAM250;e.g. WT –19 vs. –5.

A 7

R -10 9

N -7 -9 9

D -6 -17 -1 8

C -10 -11 -17 -21 10

Q -7 -4 -7 -6 -20 9

E -5 -15 -5 0 -20 -1 8

G -4 -13 -6 -6 -13 -10 -7 7

H -11 -4 -2 -7 -10 -2 -9 -13 10

I -8 -8 -8 -11 -9 -11 -8 -17 -13 9

L -9 -12 -10 -19 -21 -8 -13 -14 -9 -4 7

K -10 -2 -4 -8 -20 -6 -7 -10 -10 -9 -11 7

M -8 -7 -15 -17 -20 -7 -10 -12 -17 -3 -2 -4 12

F -12 -12 -12 -21 -19 -19 -20 -12 -9 -5 -5 -20 -7 9

P -4 -7 -9 -12 -11 -6 -9 -10 -7 -12 -10 -10 -11 -13 8

S -3 -6 -2 -7 -6 -8 -7 -4 -9 -10 -12 -7 -8 -9 -4 7

T -3 -10 -5 -8 -11 -9 -9 -10 -11 -5 -10 -6 -7 -12 -7 -2 8

W -20 -5 -11 -21 -22 -19 -23 -21 -10 -20 -9 -18 -19 -7 -20 -8 -19 13

Y -11 -14 -7 -17 -7 -18 -11 -20 -6 -9 -10 -12 -17 -1 -20 -10 -9 -8 10

V -5 -11 -12 -11 -9 -10 -10 -9 -9 -1 -5 -13 -4 -12 -9 -10 -6 -22 -10 8

A R N D C Q E G H I L K M F P S T W Y V

Page 48: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

two nearly identical proteins

two distantly related proteins

Page 49: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

BLOSUM matrices are based on local alignments

Created in 1992 (rather than 1978 for PAM; global aln)

BLOSUM stands for blocks substitution matrix.

Used 2000 blocks of conserved regions representing more than 500 groups of proteins examined

BLOSUM62 is a matrix calculated from comparisons of sequences with no less than 62% identity

BLOSUM Matrices

Page 50: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.

BLOSUM62 is the default matrix in BLASTThough it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

BLOSUM Matrices

Page 51: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Pe

rce

nt

ide

nti

ty

Differences per 100 residues

At PAM1, two proteins are 99% identicalAt PAM10.7, there are 10 differences per 100 residuesAt PAM80, there are 50 differences per 100 residues

At PAM250, there are 80 differences per 100 residuesWHY?

“twilight zone”

Page 52: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Ancestral sequence

Sequence 1ACCGATC

Sequence 2AATAATC

A no change AC single substitution C --> AC multiple substitutions C --> A --> TC --> G coincidental substitutions C --> AT --> A parallel substitutions T --> AA --> C --> T convergent substitutions A --> TC back substitution C --> T --> C

ACCCTAC

Li (1997) p.70

Page 53: Pairwise Alignments Biology 224 Instructor: Tom Peavy Sept 14 &16

Summary of alignment scoring system

• global versus local alignment

• positive and negative values assigned

• gap creation and extension penalties

• positive score for identities

• some partial positive score for conservative substitutions

• use of a substitution matrix