Biological Sequence Comparison: Dynamic Programming Algorithms
Similarity Score Matrices
William R. Pearson
Algorithms for Biological Sequence Comparison

algorithm | value calculated | scoring matrix | gap penalty | time required | reference
Needleman-Wunsch | global similarity | arbitrary | penalty/gap, q | O(n^2) | Needleman and Wunsch, 1970
Sellers | (global) distance | unity | penalty/residue, r k | O(n^2) | Sellers, 1974
Smith-Waterman | local similarity | Sij < 0.0 | affine, q + r k | O(n^2), optimal | Smith and Waterman, 1981; Gotoh, 1982
SRCHN | approx. local similarity | Sij < 0.0 | penalty/gap | O(n)-O(n^2), lookup-diagonal | Wilbur and Lipman, 1983
FASTA | approx. local similarity | Sij < 0.0 | limited size, q + r k | O(n^2)/K, lookup-rescan | Lipman and Pearson, 1985; Pearson and Lipman, 1988
BLASTP | maximum segment score | Sij < 0.0 | multiple segments | O(n^2)/K, DFA-extend | Altschul et al., 1990
BLAST2.0 | approx. local similarity | Sij < 0.0 | q + r k | O(n^2)/K, lookup-extend | Altschul et al., 1997
The sequence alignment problem

[Figure: three candidate alignments of PMILGYWNVRGL with PPYTIVYFPVRG, the third with gaps (PM-ILGYWNVRGL / PPYTIV-YFPVRG); shown once marking identities only and once marking identities and similar residues. Below them, a dot matrix of PMILGYWNVRGL (across) against PPYTIVYFPVRG (down).]

Global:
-PMILGYWNVRGL
PPYTIVYFPVRG-

Local:
AAAAAAAPMILGYWNVRGLBBBBB
XXXXXXPPYTIVYFPVRGYYYYYY
Algorithms for Biological Sequence Comparison

query / library sequence | Global | Local | Distance

HBHU vs
HBHU Hemoglobin beta-chain - human | 725 | 725 | 0
HAHU Hemoglobin alpha-chain - human | 314 | 322 | 152
MYHU Myoglobin - Human | 121 | 166 | 212
GPYL Leghemoglobin - Yellow lupin | 8 | 43 | 239
LZCH Lysozyme precursor - Chicken | -107 | 32 | 220
NRBO Pancreatic ribonuclease - Bovine | -124 | 31 | 280
CCHU Cytochrome c - Human | -160 | 26 | 321

MCHU vs
MCHU Calmodulin - Human | 671 | 671 | 0
TPHUCS Troponin C, skeletal muscle | 395 | 438 | 161
PVPK2 Parvalbumin beta - Pike | -57 | 115 | 313
CIHUH Calpain heavy chain - Human | -2085 | 100 | 2463
AQJFNV Aequorin precursor - Jelly fish | -65 | 76 | 391
KLSWM Calcium binding protein - Scallop | -89 | 52 | 323

QRHULD vs
EGMSMG EGF precursor | -591 | 655 | 2549
Genomic Alignments
• nw/sw/lalign - dynamic programming - O(n^2)
• searchn - lookup on diagonals - O(n)-O(n^2)
• fasta - lookup on diagonals/rescan - O(n^2)/K
• blast - DFA, extend - O(n^2)/K
• blastz - lookup/extend
• ssaha, blat, waba - lookup
• mummer, avid - Suffix tree alignment
• dialign, glass, lagan
Algorithms for Global and Local Similarity Scores
Global:
Local:
Global and Local Alignment Paths

Scores: +1 : match, -1 : mismatch, -2 : gap

Global score matrix (top sequence across, side sequence down; traceback arrows not shown):

      A   B   D   D   E   F   G   H   I
A     1  -1  -1  -1  -1  -1  -1  -1  -1
B    -1   2   0  -2  -2  -2  -2  -2  -2
D    -1   0   3   1  -1  -3  -3  -3  -3
E    -1  -2   1   2   2   0  -2  -4  -4
G    -1  -2  -1   0   1   1   1  -1  -3
K    -1  -2  -3  -2  -1   0   0   0  -2
H    -1  -2  -3  -4  -3  -2  -1   1  -1
I    -1  -2  -3  -4  -5  -4  -3  -1   2

Optimum global alignment (score: 2):
A B D D E F G H I (top)
A B D - E G K H I (side)
or
A B - D E G K H I

Local score matrix (traceback arrows not shown):

      A   B   D   D   E   F   G   H   I
A     1   0   0   0   0   0   0   0   0
B     0   2   0   0   0   0   0   0   0
D     0   0   3   1   0   0   0   0   0
E     0   0   1   2   2   0   0   0   0
G     0   0   0   0   1   1   1   0   0
K     0   0   0   0   0   0   0   0   0
H     0   0   0   0   0   0   0   1   0
I     0   0   0   0   0   0   0   0   2

Optimal local alignment (score 3):
A B D (top)
A B D (side)
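The two scores quoted for this example can be reproduced with a short dynamic-programming sketch. The function name `align_scores` is hypothetical, and the global variant below penalizes end gaps, which is one common convention and an assumption here; with the +1/-1/-2 scores from the slide it gives 2 for the global and 3 for the local comparison of ABDDEFGHI with ABDEGKHI.

```python
def align_scores(top, side, match=1, mismatch=-1, gap=-2):
    """Return (global_score, local_score) using a linear gap penalty.

    Illustrative sketch only: scores, no traceback.
    """
    n, m = len(side), len(top)
    # Global (Needleman-Wunsch style): borders accumulate gap penalties (assumption).
    glob = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        glob[0][j] = j * gap
    for i in range(1, n + 1):
        glob[i][0] = i * gap
    # Local (Smith-Waterman style): borders are 0 and every cell is floored at 0.
    loc = [[0] * (m + 1) for _ in range(n + 1)]
    best_local = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if side[i - 1] == top[j - 1] else mismatch
            glob[i][j] = max(glob[i - 1][j - 1] + s,
                             glob[i - 1][j] + gap,
                             glob[i][j - 1] + gap)
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + s,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])
    return glob[n][m], best_local


global_score, local_score = align_scores("ABDDEFGHI", "ABDEGKHI")
print(global_score, local_score)
```

The only difference between the two recurrences is the floor at zero and where the answer is read: the bottom-right cell for the global score, the maximum over all cells for the local score.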
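Global / Local

[Figure: two dynamic programming grids, one global and one local, with A C G T across the top and A C G G T down the side. Scores: +1 : match, -1 : mismatch, -2 : gap]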
Smith-Waterman Space, Time Requirements
score - space: O(n); time: O(n^2)
alignment - space: O(n); time: O(n^2)
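The O(n) space bound for the score follows from the fact that each row of the matrix depends only on the row above it. A minimal sketch, assuming a linear gap penalty and simple match/mismatch scores (the function name `sw_score_linear_space` is hypothetical):

```python
def sw_score_linear_space(a, b, match=1, mismatch=-1, gap=-2):
    """Best local score, keeping only two rows of the matrix in memory."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]  # column 0 of a local matrix is always 0
        for j, y in enumerate(b, start=1):
            s = match if x == y else mismatch
            curr.append(max(0,
                            prev[j - 1] + s,    # diagonal: align x with y
                            prev[j] + gap,      # gap in b
                            curr[j - 1] + gap)) # gap in a
        best = max(best, max(curr))
        prev = curr
    return best


print(sw_score_linear_space("ABDDEFGHI", "ABDEGKHI"))
```

Recovering the alignment itself in linear space needs more than this (for example a divide-and-conquer traceback), which the sketch does not attempt.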
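FAST alignment by lookup

         1       9
GT8.7    KITQSNATQ
          .::. :::
XURT8C   LLTQTRATQ
         1       9

1. Scan query, build 2 tables:

Table 1: last position of each dipeptide in the query (400 entries; -1 = not present)
AT 7
IT 2
KI 1
LL -1
LT -1
NA 6
QS 4
QT -1
SN 5
TR -1
TQ 8

Table 2: for each query position, the previous position of the same dipeptide (n-1 entries; -1 = none)
1 -1
2 -1
3 -1
4 -1
5 -1
6 -1
7 -1
8 3

[Figure: dot plot of the query K I T Q S N A T Q (across) against the library dipeptides LL LT TQ QT TR RA AT TQ, with hits tallied by diagonal offset from -8 through 0 to +8]

O(n) space; O(n+m) time (if few repeat hits)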
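The lookup step can be sketched as follows for the two sequences on this slide. This is a simplified illustration, not the FASTA implementation: it stores the query positions of each dipeptide in a dictionary of lists rather than in the two linked arrays shown above, and it uses 0-based positions. The function name `diagonal_hits` is hypothetical.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, library, ktup=2):
    """Count identical ktup-word hits on each diagonal (query pos - library pos)."""
    # Scan the query once: word -> list of start positions.
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)
    # Scan the library once: every shared word adds one hit to its diagonal.
    hits = Counter()
    for j in range(len(library) - ktup + 1):
        for i in positions.get(library[j:j + ktup], ()):
            hits[i - j] += 1
    return hits


hits = diagonal_hits("KITQSNATQ", "LLTQTRATQ")
print(sorted(hits.items()))
```

For this pair, the main diagonal (offset 0) collects the hits from TQ, AT and TQ, while the repeated TQ also produces one hit each at offsets +5 and -5. Those repeat hits are why the time bound is stated with the proviso "if few repeat hits".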
FASTA

1. Identify identical matches (length = ktup)
2. Extend along diagonal (local maximum)
3. Join diagonal segments (DP) (maintain linearity) (optimal sum score)
4. Banded Smith-Waterman

NLPYL-I
..: . :
QVPLVEI

Outcome: one continuous, near-optimal gapped alignment

[Figure: dot plot of N L P Y L I (across) against Q V P L V E I (down), showing the matches, the extended diagonals, the joined segments, and the band]
BLAST

1. neighborhood word hits (word length)
2. extend from diagonal ends (X-drop threshold)
3. report HSP linkages (maintain linearity) (probability)

NLP
..:
QVP

L
.
E

NL I
.: :
PL I

Outcome: multiple HSPs, multiple linkages; only partially aligned

[Figure: dot plot of N L P Y L I (across) against Q V P L V E I (down), showing the word hits and the extended HSPs]
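Step 2, extension with an X-drop threshold, can be illustrated with a small ungapped sketch. This is not BLAST's code: the function name `xdrop_extend_right`, the +1/-1 scoring, and the example sequences are all assumptions chosen to keep the example short. Extension stops once the running score has fallen more than `x_drop` below the best score seen so far.

```python
def xdrop_extend_right(a, b, i, j, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment rightwards from a[i], b[j].

    Returns (best_score, length_of_best_extension).
    """
    score = best = 0
    best_len = 0
    k = 0
    while i + k < len(a) and j + k < len(b):
        score += match if a[i + k] == b[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break  # fallen too far below the best score: give up
    return best, best_len


# Seven identities followed by mismatches; the extension is trimmed back to the peak.
print(xdrop_extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0, x_drop=2))
```

A larger `x_drop` lets the extension cross longer poorly matching stretches at the cost of more work per hit.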
fasta.bioch.virginia.edu/noptalign
Similarity Scoring Matrices - Summary
• Similarity scoring matrices are “log-odds” matrices,reporting the “odds” that an alignment reflectshomology rather than chance
• One can predict evolutionary changes using a simplerandom model, which can generate mutationfrequencies at any evolutionary distance
• The optimal scoring matrix has an evolutionarydistance that matches that of the alignment
• Shallower scoring matrices have more informationcontent, or “bits/residue”, and thus can be used to findshorter domains
• Scoring matrices set evolutionary look back times
Scoring Matrices – Concepts

• Where do scoring matrices come from?
  – Transition probabilities and PAMs
  – Scoring matrices as log-odds values (log(p[related]/p[chance]))
• The PAM 250 matrix
• Scoring matrices and information content
• The BLOSUM matrices
• Effective matrices and gap penalties
DNA transition probabilities – 1 PAM

      a      c      g      t
a   0.99   0.001  0.008  0.001   = 1.0
c   0.001  0.99   0.001  0.008   = 1.0
g   0.008  0.001  0.99   0.001   = 1.0
t   0.001  0.008  0.001  0.99    = 1.0

[Figure: state diagram of the four bases a, c, g, t with arrows labeled 0.99, 0.008, 0.001, 0.001]
Matrix multiples

M^2 (PAM 2) = {
{0.980, 0.002, 0.016, 0.002},
{0.002, 0.980, 0.002, 0.016},
{0.016, 0.002, 0.980, 0.002},
{0.002, 0.016, 0.002, 0.980}}

M^5 (PAM 5) = {
{0.952, 0.005, 0.038, 0.005},
{0.005, 0.952, 0.005, 0.038},
{0.038, 0.005, 0.952, 0.005},
{0.005, 0.038, 0.005, 0.952}}

M^10 (PAM 10) = {
{0.907, 0.010, 0.073, 0.010},
{0.010, 0.907, 0.010, 0.073},
{0.073, 0.010, 0.907, 0.010},
{0.010, 0.073, 0.010, 0.907}}

M^100 (PAM 100) = {
{0.499, 0.083, 0.336, 0.083},
{0.083, 0.499, 0.083, 0.336},
{0.336, 0.083, 0.499, 0.083},
{0.083, 0.336, 0.083, 0.499}}

M^1000 (PAM 1000) = {
{0.255, 0.245, 0.255, 0.245},
{0.245, 0.255, 0.245, 0.255},
{0.255, 0.245, 0.255, 0.245},
{0.245, 0.255, 0.245, 0.255}}

qij = M^20 (PAM 20) = {
{0.828, 0.019, 0.133, 0.019},
{0.019, 0.828, 0.019, 0.133},
{0.133, 0.019, 0.828, 0.019},
{0.019, 0.133, 0.019, 0.828}}
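These powers can be checked by repeated multiplication of the 1 PAM DNA matrix. The sketch below uses plain Python lists so that it has no dependencies; the helper names `mat_mul` and `mat_pow` are hypothetical.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = M
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# 1 PAM DNA transition probabilities, rows and columns in the order a, c, g, t.
PAM1_DNA = [
    [0.99, 0.001, 0.008, 0.001],
    [0.001, 0.99, 0.001, 0.008],
    [0.008, 0.001, 0.99, 0.001],
    [0.001, 0.008, 0.001, 0.99],
]

pam20 = mat_pow(PAM1_DNA, 20)
for row in pam20:
    print([round(x, 3) for x in row])
```

Each row of every power still sums to 1, and as the exponent grows all entries drift toward the background frequency of 0.25, which is the pattern visible at PAM 1000.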
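Where do scoring matrices come from?

$p_i(a,c,g,t) = p_j = 0.25$

$\lambda S_{ij} = \log\left(\frac{q_{ij}}{p_j}\right)$

$q_{ij}$: probability of mutation; $p_j$: probability of alignment by chance

$S_{a,a} = 10 \log_{10}\left(\frac{q_{a,a}}{p_a}\right) = 10 \log_{10}\left(\frac{0.828}{0.25}\right) = 5.2$

$S_{a,c} = 10 \log_{10}\left(\frac{q_{a,c}}{p_c}\right) = 10 \log_{10}\left(\frac{0.019}{0.25}\right) = -11.2$

$\lambda_2 = \frac{1}{10 \log_{10} 2} = 0.33$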
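The two worked values can be reproduced directly. A minimal sketch, assuming scores are computed as 10 times the base-10 logarithm of the odds ratio; the function name `log_odds` and its `scale` parameter are hypothetical.

```python
import math


def log_odds(q, p, scale=10.0):
    """Log-odds score scale * log10(q / p).

    q: probability of the replacement in related sequences
    p: probability of the residue appearing by chance
    """
    return scale * math.log10(q / p)


s_aa = log_odds(0.828, 0.25)  # a aligned with a at PAM20
s_ac = log_odds(0.019, 0.25)  # a aligned with c at PAM20
print(round(s_aa, 1), round(s_ac, 1))

# Converting a 10*log10 score to bits: multiply by 1 / (10 * log10(2)).
bits_per_unit = 1.0 / (10.0 * math.log10(2))
print(round(bits_per_unit, 2))
```

A score of 5.2 in these units therefore corresponds to about 1.7 bits, which matches the diagonal of the PAM20 DNA matrix expressed in bits.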
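Two expressions for Sij

Transition frequency (probability) - Durbin et al.:

$\lambda S = \log\left(\frac{q^{t}_{ij}}{p_j}\right)$

Alignment frequency (probability) - Altschul:

$\lambda S = \log\left(\frac{q^{a}_{ij}}{p_i p_j}\right)$

The two agree because $q^{a}_{ij} = p_i q^{t}_{ij}$:

$\lambda S = \log\left(\frac{q^{a}_{ij}}{p_i p_j}\right) = \log\left(\frac{p_i q^{t}_{ij}}{p_i p_j}\right)$

Altschul $q^{a}_{ij} = p_i \times$ Durbin $q^{t}_{ij}$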
Scoring matrices at DNA PAMs - ratios

PAM1 (ratio = 1/3.13 = +1/-3, H = 1.90) = {
{ 1.99, -6.23, -6.23, -6.23},
{-6.23,  1.99, -6.23, -6.23},
{-6.23, -6.23,  1.99, -6.23},
{-6.23, -6.23, -6.23,  1.99}}

PAM2 (ratio = 1/2.65 = +2/-5, H = 1.82) = {
{ 1.97, -5.24, -5.24, -5.24},
{-5.24,  1.98, -5.24, -5.24},
{-5.24, -5.24,  1.98, -5.24},
{-5.24, -5.24, -5.24,  1.98}}

PAM10 (ratio = 1/1.61 = +2/-3, H = 1.40) = {
{ 1.86, -3.00, -3.00, -3.00},
{-3.00,  1.86, -3.00, -3.00},
{-3.00, -3.00,  1.86, -3.00},
{-3.00, -3.00, -3.00,  1.86}}

PAM20 (ratio = 1/1.21 = +4/-5, H = 1.05) = {
{ 1.72, -2.09, -2.09, -2.09},
{-2.09,  1.72, -2.09, -2.09},
{-2.09, -2.09,  1.72, -2.09},
{-2.09, -2.09, -2.09,  1.72}}

PAM30 (ratio = 1/1 = +1/-1, H = 0.80) = {
{ 1.59, -1.59, -1.59, -1.59},
{-1.59,  1.59, -1.59, -1.59},
{-1.59, -1.59,  1.59, -1.59},
{-1.59, -1.59, -1.59,  1.59}}
blastn (DNA)
PAM45 (ratio = 1.23/1 = +5/-4, H = 0.54) = {
{ 1.40, -1.14, -1.14, -1.14},
{-1.14,  1.40, -1.14, -1.14},
{-1.14, -1.14,  1.40, -1.14},
{-1.14, -1.14, -1.14,  1.40}}
fasta (DNA)
Normalized frequencies of the amino-acids

Trp  0.010    Asp  0.047
Met  0.015    Glu  0.050
Tyr  0.030    Pro  0.051
Cys  0.033    Thr  0.058
His  0.034    Val  0.065
Ile  0.037    Ser  0.070
Gln  0.038    Lys  0.081
Phe  0.040    Leu  0.085
Asn  0.040    Ala  0.087
Arg  0.041    Gly  0.089

Relative mutabilities of the amino-acids

Trp   18    Val   74
Cys   20    Gln   93
Leu   40    Met   94
Phe   41    Ile   96
Tyr   41    Thr   97
Gly   49    Ala  100
Pro   56    Glu  102
Lys   56    Asp  106
Arg   65    Ser  130
His   66    Asn  134
Numbers of accepted mutations

      L    I    H    G    E    Q    C    D    N    R    A
L        253   40   17   15   75    0    0   37   17   95
I               3    0   35    8   17   13   36   30   66
H                   10   23  243   10   43  226  103   21
G                       112   30   10  162  156   10  579
E                            422    0  831   94    0  266
Q                                   0   76   50  120   93
C                                        0    0   10   33
D                                           532    0  154
N                                                 17  109
R                                                      30
A

Numbers of accepted mutations (x10) from closely related sequences. 1572 changes (20x20) tabulated.
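Mutation probability matrix for 1 PAM (entries x 10,000)

       L     I     H     G     E     Q     C     D     N     R     A
L   9947    22     4     1     1     6     0     0     3     1     3
I      9  9872     0     0     2     1     2     1     3     2     2
H      1     0  9912     0     1    20     1     3    18     8     1
G      1     0     1  9935     7     3     1    11    12     1    21
E      1     3     2     4  9865    35     0    56     7     0    10
Q      3     1    23     1    27  9876     0     5     4     9     3
C      0     1     1     0     0     0  9973     0     0     1     1
D      0     1     4     6    53     6     0  9859    42     0     6
N      1     3    21     6     6     4     0    36  9822     1     4
R      1     3    10     0     0    10     1     0     1  9913     1
A      4     6     2    21    17     8     3    10     9     2  9867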
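Mutation probability matrix for 250 PAMs

       L     I     H     G     E     Q     C     D     N     R     A
L   9947    22     4     1     1     6     0     0     3     1     3
I      9  9872     0     0     2     1     2     1     3     2     2
H      1     0  9912     0     1    20     1     3    18     8     1
G      1     0     1  9935     7     3     1    11    12     1    21
E      2     3     6     5    12     9     1    11     7     4     5
Q      5     2     3     6     5    10     1     6     5     5     3
C      1     2     2     2     1     1    52     1     1     1     2
D      2     3     6     5    10     7     1    11     8     4     5
N      2     3     6     4     6     5     2     7     6     4     4
R      2     3     6     2     3     5     2     3     4    17     3
A      6     8     6    12     9     8     5     9     9     6    13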
The PAM250 matrix

Cys  12
Ser   0   2
Thr  -2   1   3
Pro  -1   1   0   6
Ala  -2   1   1   1   2
Gly  -3   1   0  -1   1   5
Asn  -4   1   0  -1   0   0   2
Asp  -5   0   0  -1   0   1   2   4
Glu  -5   0   0  -1   0   0   1   3   4
Gln  -5  -1  -1   0   0  -1   1   2   2   4
His  -3  -1  -1   0  -1  -2   2   1   1   3   6
Arg  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
Lys  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
Met  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
Ile  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
Leu  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
Val  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
Phe  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Tyr   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
Trp  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
      C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
Pam40

      A    R    N    D    E    I    L
A     8
R    -9   12
N    -4   -7   11
D    -4  -13    3   11
E    -3  -11   -2    4   11
I    -6   -7   -7  -10   -7   12
L    -8  -11   -9  -16  -12   -1   10

Pam250

      A    R    N    D    E    I    L
A     2
R    -2    6
N     0    0    2
D     0   -1    2    4
E     0   -1    1    3    4
I    -1   -2   -2   -2   -2    5
L    -2   -3   -3   -4   -3    2    6
Where do scoring matrices come from?

• Scoring matrices can be designed for different evolutionary distances (less = shallow; more = deep)
• Deep matrices allow more substitution

$\lambda S_{ij} = \log\left(\frac{q_{ij}}{p_i p_j}\right)$

$q_{ij}$: frequency of replacement in homologs; $p_i p_j$: frequency of alignment by chance
Where do scoring matrices come from?
qij : replacement frequency at PAM40, 250
qR:N(40) = 0.000435    pR = 0.051
qR:N(250) = 0.002193   pN = 0.043
pR pN = 0.002193

λ2 Sij = lg2(qij/pipj)
λe Sij = ln(qij/pipj)

λ2 SR:N(40) = lg2(0.000435/0.00219) = -2.333
λ2 = 1/3; SR:N(40) = -2.333/λ2 = -7
λ2 SR:N(250) = lg2(0.002193/0.002193) = 0
PAM and % identity
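Scoring matrix scale – revisited

$\lambda s_{i,j} = \log\left(\frac{q_{ij}}{p_i p_j}\right)$, so $s_{i,j} = \log\left(\frac{q_{ij}}{p_i p_j}\right) / \lambda$

$s'_{ij} = s_{ij} \times 10 \Rightarrow \lambda' = \lambda / 10$

$\lambda_e$ is the unique positive solution of $\sum_{i,j} p_i p_j e^{\lambda_e s_{i,j}} = 1$; alternatively, $\sum_{i,j} p_i p_j 2^{\lambda_2 s_{i,j}} = 1$ gives $\lambda_2$.

$S_{R,N} = 10 \times \log_{10}\left(\frac{q_{R,N}}{p_R p_N}\right) = 10 \times \log_{10}\left(\frac{0.000435}{0.051 \times 0.043}\right) = -7.03$

$\lambda_2 = \frac{1}{10 \log_{10} 2} = 0.33$

PAM250 scores are in 1/3 bit units (scaled with λ2 = 0.33). PAM120 and BLOSUM62 scores are scaled in 1/2 bit units (λ2 = 0.5).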
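The scale factor can be recovered numerically from a score matrix and its background frequencies. The sketch below uses bisection on the defining sum; the function name `solve_lambda` and the search bounds are assumptions. As a check it uses a +1/-1 DNA matrix with uniform base frequencies, for which the sum reduces to 0.25 e^λ + 0.75 e^(-λ) = 1 and the positive root is ln 3.

```python
import math


def solve_lambda(p, s, lo=1e-6, hi=10.0, tol=1e-12):
    """Positive lam with sum_ij p[i] * p[j] * exp(lam * s[i][j]) = 1, by bisection.

    Assumes the expected score is negative and at least one score is positive,
    so the sum is below 1 just above lam = 0 and above 1 for large lam.
    """
    n = len(p)

    def f(lam):
        return sum(p[i] * p[j] * math.exp(lam * s[i][j])
                   for i in range(n) for j in range(n)) - 1.0

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


p = [0.25] * 4
s = [[1 if i == j else -1 for j in range(4)] for i in range(4)]
lam_e = solve_lambda(p, s)
lam_2 = lam_e / math.log(2)  # same scale expressed in bits
print(round(lam_e, 4), round(lam_2, 3))
```

The result in bits, about 1.585, agrees with the 1.59 entries of the PAM30 DNA matrix, whose match/mismatch ratio is +1/-1.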
Local alignment scores as measures of information

Information content: number of bits required to represent all possibilities with optimal encoding

$H = -\sum_i p_i \log_2 p_i$

Binary (0,1): $H = -\sum_{0,1} 0.5 \log_2 0.5 = -\left(0.5(-1) + 0.5(-1)\right) = 1$ bit

DNA: $H = -\sum_{a,c,g,t} 0.25 \log_2 \frac{1}{4} = -\sum_{a,c,g,t} 0.25(-2) = 2$ bits

Protein: $H = -\sum_{20aa} p_i \log_2 p_i = 4.19$ bits

Expected score:

$E = \sum_{20aa} p_i p_j s_{ij}$
= -0.84 (PAM250)
= -1.64 (PAM120)
= -0.52 (BLOSUM62)

Average information content / position:

$H = \sum_{20aa} q^{a}_{ij} s_{ij} = \sum_{20aa} p_i \sum_{20aa} q^{t}_{ij} s_{ij}$
= 0.35 (PAM250)
= 0.98 (PAM120)
= 0.70 (BLOSUM62)
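Both quantities are easy to compute for the DNA case. The sketch below is illustrative: the function names are hypothetical, and the 1 PAM model in which each base changes to each of the other three with probability 0.01/3 is an assumption, chosen because it reproduces the 1.99 / -6.23 scores of the PAM1 DNA matrix.

```python
import math


def entropy_bits(probs):
    """Shannon entropy H = -sum p_i log2 p_i, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def relative_entropy_bits(p, q_t):
    """H = sum_i p_i sum_j q_t[i][j] * s_ij, with s_ij = log2(q_t[i][j] / p[j])."""
    n = len(p)
    return sum(p[i] * q_t[i][j] * math.log2(q_t[i][j] / p[j])
               for i in range(n) for j in range(n) if q_t[i][j] > 0)


coin = entropy_bits([0.5, 0.5])   # two equally likely outcomes
dna = entropy_bits([0.25] * 4)    # four equally likely bases

# Assumed uniform 1 PAM DNA model: 0.99 unchanged, 0.01/3 to each other base.
p = [0.25] * 4
uniform_pam1 = [[0.99 if i == j else 0.01 / 3 for j in range(4)] for i in range(4)]
h_pam1 = relative_entropy_bits(p, uniform_pam1)
print(coin, dna, round(h_pam1, 2))
```

Under that assumption the relative entropy comes out close to the H = 1.90 listed for the PAM1 DNA matrix, just under the 2-bit maximum for four equally likely bases.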
Relative entropy H of PAM matrices

More about scoring matrices ...

PAM series:
• Evolutionary model - extrapolated from PAM1
• PAM20: 20% change (mammals)
• PAM250: 250% change (<20% identity)
• Gap penalties should vary
• shallow matrices (PAM10-40) for short sequences and short distances

BLOSUM series:
• Empirically determined, no extrapolation (no model)
• BLOSUM45-50 - distant (1/3 bits)
• BLOSUM80 - very highly conserved (not small change), high info/position
• BLOSUM62 - 1/2 bits
Scoring Matrices and Gap-penalties - BLAST vs FASTA

BLAST
• default scoring matrix: BLOSUM62 (1/2 bit)
• default gap penalty: -11 (open)/-1 (extend) (lowest -9/-1, -8/-2)

FASTA
• default matrix: BLOSUM50 (1/3 bit)
• default gap penalty: old: -12 (first residue)/-2 = new: -10 (open)/-2 (ext)
• BLOSUM62 -7/-1
• PAM120 -16/-4
• PAM20 -24/-4
Similarity Scoring Matrices - Summary
• Similarity scoring matrices are “log-odds” matrices,reporting the “odds” that an alignment reflectshomology rather than chance
• One can predict evolutionary changes using a simplerandom model, which can generate mutationfrequencies at any evolutionary distance
• The optimal scoring matrix has an evolutionarydistance that matches that of the alignment
• Shallower scoring matrices have more informationcontent, or “bits/residue”, and thus can be used to findshorter domains
• Scoring matrices set evolutionary look back times