Dot Plots, Path Matrices, Score Matrices
[Figure: dot plot of Sequence A against Sequence B for the sequences VILSTRIVHVNSILPSTN and VILSTRIVILPEFST]
diagonal lines give equivalent residues
Dot Plots, Path Matrices, Score Matrices
[Figure: matrix of Sequence A against Sequence B for the sequences VILSTRIVHVNSILPSTN and VILSTRIVILPEFST]
identical residues score 1
highest scoring path across the matrix gives best alignment
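A minimal sketch of the idea on this slide: score identical residues 1, everything else 0, and look for diagonal runs. Which of the two slide sequences is Sequence A and which is Sequence B is an assumption here, and the helper names `dot_matrix` and `diagonal_runs` are illustrative only.

```python
def dot_matrix(seq_a, seq_b):
    """Rows follow seq_b, columns follow seq_a; a cell is 1 where residues are identical."""
    return [[1 if a == b else 0 for a in seq_a] for b in seq_b]


def diagonal_runs(matrix, min_len=3):
    """Return (row, col, length) for every diagonal run of 1s of at least min_len cells."""
    runs = []
    n_rows, n_cols = len(matrix), len(matrix[0])
    for r in range(n_rows):
        for c in range(n_cols):
            if not matrix[r][c]:
                continue
            if r > 0 and c > 0 and matrix[r - 1][c - 1]:
                continue  # this cell continues a run that started earlier
            length = 0
            while (r + length < n_rows and c + length < n_cols
                   and matrix[r + length][c + length]):
                length += 1
            if length >= min_len:
                runs.append((r, c, length))
    return runs


# Sequences read off the slide; the axis assignment is assumed.
seq_a = "VILSTRIVHVNSILPSTN"
seq_b = "VILSTRIVILPEFST"
matrix = dot_matrix(seq_a, seq_b)
runs = diagonal_runs(matrix)
for row, col, length in runs:
    print(f"run of {length} starting at B[{row}], A[{col}]: {seq_b[row:row + length]}")
```

The longest run sits on the main diagonal (the shared VILSTRIV prefix); shorter off-diagonal runs mark segments that line up only after a gap is introduced.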
[Figure: dot plot of Sequence A (VILSLVILPQRSLVVILSLVILALTV) against Sequence B (STVILSLVRNVILPQRILSLVISLAL) showing runs (tuples) of 3 residues, with the diagonal runs labelled by their lengths]
gap penalty = 3
SCORE = 20 - 9 = 11
Alignment from Dot Plot
VILSLV ILPQRSLVVILSLVI LALTV
STVILSLVNVILPQR ILSLVISLAL
score = 20
sequence identity = 20/26 ≈ 77%
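The arithmetic on these two slides (20 identical residues, 3 gaps at a penalty of 3 each, identity over 26 residues) can be checked with a short sketch. The gapped rows below are a reconstruction: the slide text lost its column spacing, so the gap lengths are an assumption chosen to reproduce the 20 matches and 3 gaps quoted. End overhangs are assumed to be unpenalised.

```python
def score_alignment(row_a, row_b, match=1, gap_penalty=3):
    """Return (matches, internal gap runs, score) for two equal-length gapped rows."""
    assert len(row_a) == len(row_b)
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    gaps = 0
    for row in (row_a, row_b):
        in_gap = False
        for ch in row.strip("-"):      # strip("-") drops the end overhangs
            if ch == "-" and not in_gap:
                gaps += 1              # count each run of '-' once
            in_gap = ch == "-"
    return matches, gaps, match * matches - gap_penalty * gaps


# Reconstructed rows (assumed gap placement).
row_a = "--VILSLV--ILPQRSLVVILSLVI-LALTV"
row_b = "STVILSLVNVILPQR----ILSLVISLAL--"

matches, gaps, score = score_alignment(row_a, row_b)
identity = matches / len(row_a.replace("-", ""))
print(matches, gaps, score, f"{identity:.0%}")
```

Each gap is charged once regardless of its length, which is what "20 - 9" implies for three gaps at penalty 3.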
Path or Score Matrix (Sequence A across the top, Sequence B down the side; identical residues score 1)
    A H C N I R Q C L C R P M
A   1 0 0 0 0 0 0 0 0 0 0 0 0
I   0 0 0 0 1 0 0 0 0 0 0 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
I   0 0 0 0 1 0 0 0 0 0 0 0 0
N   0 0 0 1 0 0 0 0 0 0 0 0 0
R   0 0 0 0 0 1 0 0 0 0 1 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
K   0 0 0 0 0 0 0 0 0 0 0 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
R   0 0 0 0 0 1 0 0 0 0 1 0 0
H   0 1 0 0 0 0 0 0 0 0 0 0 0
P   0 0 0 0 0 0 0 0 0 0 0 1 0
[Figure: residue substitution matrix over residues such as A, L, V, K, R, H, with 1 for identical residues and 0 elsewhere, used to fill the path or score matrix]
Needleman & Wunsch
[Figure: score matrix for Sequence A (AHCNIRQCLCRPM, across the top) against Sequence B (AICINRCKCRHP, down the side); cells with identical residues hold 1, all other cells hold 0]
Needleman & Wunsch Algorithm
• Accumulate the matrix by adding to each cell the highest score in the column or row to the right and below it
• find the highest scoring path in the matrix by:
• starting in the top left corner
• moving down across the matrix from cell to cell
• choosing the highest scoring cell at each move
• the path cannot go back on itself or cross the same row or column twice
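The accumulation step can be sketched as follows. This is a minimal reading of the bullet above, assuming identity scoring (1 or 0) and no gap penalty; "the column or row to the right and below" is taken to mean the next row from the next column onwards, plus the next column from the next row downwards. The function name `accumulate` is illustrative.

```python
def accumulate(seq_a, seq_b):
    """Identity score matrix (rows: seq_b, columns: seq_a), accumulated in place.

    Working from the bottom-right corner, each cell gains the highest value
    found in the next row (to its right) or the next column (below it).
    The last row and last column keep their raw scores.
    """
    acc = [[1 if a == b else 0 for a in seq_a] for b in seq_b]
    n_rows, n_cols = len(acc), len(acc[0])
    for i in range(n_rows - 2, -1, -1):
        for j in range(n_cols - 2, -1, -1):
            right = acc[i + 1][j + 1:]                          # next row, columns j+1 onwards
            below = [acc[k][j + 1] for k in range(i + 2, n_rows)]  # next column, rows i+2 onwards
            acc[i][j] += max(right + below)
    return acc


seq_a = "AHCNIRQCLCRPM"   # across the top on the slide
seq_b = "AICINRCKCRHP"    # down the side on the slide
acc = accumulate(seq_a, seq_b)
print(acc[0][0])          # value in the top-left cell
```

With the slide's two sequences the top-left cell comes out as 8, the number of identical residue pairs on the best path.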
Accumulating the Matrix
• Add to the score in the cell the highest score from a cell in the row or column to right and below
[Figure: cell i,j and the neighbouring cells it draws on, labelled i-1,j-1; i-n,j-1; i-1,j-m]
Accumulated matrix (Sequence A across the top, Sequence B down the side):
    A H C N I R Q C L C R P M
A   8 7 6 6 5 4 4 3 3 2 1 0 0
I   7 7 6 6 6 4 4 3 3 2 1 0 0
C   6 6 7 6 5 4 4 4 3 3 1 0 0
I   6 6 6 5 6 4 4 3 3 2 1 0 0
N   5 5 5 6 5 5 4 3 3 3 1 0 0
R   4 4 4 4 4 5 4 3 3 2 2 0 0
C   3 3 4 3 3 3 3 4 3 3 1 0 0
K   3 3 3 3 3 3 3 3 3 2 1 0 0
C   2 2 3 2 2 2 2 3 2 3 1 0 0
R   2 1 1 1 1 2 1 1 1 1 2 0 0
H   1 2 1 1 1 1 1 1 1 1 1 0 0
P   0 0 0 0 0 0 0 0 0 0 0 1 0
Possible Moves in Finding a Path across the Matrix
• start in the leftmost or topmost row
• move to the highest scoring cell in row or column to right and below
[Figure: accumulated matrix for Sequence A (AHCNIRQCLCRPM) against Sequence B (AICINRCKCRHP), with the possible moves from cell i,j marked: i-1,j-1; i-n,j-1; i-1,j-m]
[Figure: accumulated matrix for Sequence A (AHCNIRQCLCRPM) against Sequence B (AICINRCKCRHP) with the highest scoring path traced across it]
A H C N I - R Q C L C R - P M
A I C - I N R - C K C R H P M
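Tracing the path and writing out the gapped alignment can be sketched as below. This assumes identity scoring with no gap penalty, takes Sequence B as the 12 residues shown on the matrix axis, and breaks ties in favour of the diagonal move, so the gap placement may differ from the slide's alignment while the number of identical pairs stays the same. All function names are illustrative.

```python
def accumulate(seq_a, seq_b):
    """Identity scores (rows: seq_b, columns: seq_a), accumulated from the bottom right."""
    acc = [[1 if a == b else 0 for a in seq_a] for b in seq_b]
    n_rows, n_cols = len(acc), len(acc[0])
    for i in range(n_rows - 2, -1, -1):
        for j in range(n_cols - 2, -1, -1):
            right = acc[i + 1][j + 1:]
            below = [acc[k][j + 1] for k in range(i + 2, n_rows)]
            acc[i][j] += max(right + below)
    return acc


def trace_path(acc):
    """Start at the best cell of the top row or left column, then keep moving
    to the best cell in the next row (to the right) or next column (below)."""
    n_rows, n_cols = len(acc), len(acc[0])
    starts = [(0, j) for j in range(n_cols)] + [(i, 0) for i in range(1, n_rows)]
    i, j = max(starts, key=lambda p: acc[p[0]][p[1]])
    path = [(i, j)]
    while i + 1 < n_rows and j + 1 < n_cols:
        moves = [(i + 1, k) for k in range(j + 1, n_cols)]
        moves += [(k, j + 1) for k in range(i + 2, n_rows)]
        i, j = max(moves, key=lambda p: acc[p[0]][p[1]])  # first maximum wins: diagonal preferred
        path.append((i, j))
    return path


def render(path, seq_a, seq_b):
    """Turn a path of (row, col) cells into two gapped rows."""
    out_a, out_b = [], []
    prev_i, prev_j = -1, -1
    for i, j in path + [(len(seq_b), len(seq_a))]:   # sentinel flushes trailing residues
        for k in range(prev_j + 1, j):               # residues of A skipped by the move
            out_a.append(seq_a[k]); out_b.append("-")
        for k in range(prev_i + 1, i):               # residues of B skipped by the move
            out_a.append("-"); out_b.append(seq_b[k])
        if i < len(seq_b) and j < len(seq_a):
            out_a.append(seq_a[j]); out_b.append(seq_b[i])
        prev_i, prev_j = i, j
    return "".join(out_a), "".join(out_b)


seq_a = "AHCNIRQCLCRPM"
seq_b = "AICINRCKCRHP"
row_a, row_b = render(trace_path(accumulate(seq_a, seq_b)), seq_a, seq_b)
identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
print(row_a)
print(row_b)
print("identical pairs:", identical)
```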
Searching Sequence Databases
Can you inherit functional information?
Do fast scans using approximate methods e.g. BLAST or PSI-BLAST
Align proteins carefully using a dynamic programming method: Needleman & Wunsch, Smith & Waterman
Scan against sequence profiles (or HMMs) in secondary databases e.g. Pfam, Gene3D, InterPro
Align query sequence against family relatives using: ClustalW, Jalview, MUSCLE, MAFFT
Profile Based Sequence Search Methods
by comparing related sequences within a protein family can identify patterns of conserved residues
even the most distant members of the family should have these patterns of conserved residues
can make a profile which encapsulates these patterns and use it to detect more distantly related sequences
highly conserved positions usually correspond to the buried core or functional residues within the active site
Iterated Application of BLAST
PSI-BLAST, Altschul et al. (1997)
• first constructs a multiple alignment of all the related sequences identified by BLAST
• then estimates the residue frequencies at each position to construct a score matrix: Position Specific Score Matrices (PSSMs), also known as weight matrices or profiles
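How residue frequencies at each position become a position specific score matrix can be sketched as follows. This is an illustration of the general idea only, not PSI-BLAST's own procedure: the uniform background frequency, the pseudocount, the log-odds form and the toy alignment are all assumptions made for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """One dict of log-odds scores per alignment column.

    alignment: equal-length rows, '-' for gaps (gaps are ignored when counting).
    Background frequencies are assumed uniform (1/20).
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_sequence(pssm, seq):
    """Sum the column scores for an ungapped sequence of the same length as the profile."""
    return sum(column[res] for column, res in zip(pssm, seq))


# Hypothetical family alignment, for illustration only.
family = ["VILSL", "VILAL", "VIMSL", "VLLSL"]
pssm = build_pssm(family)
print(round(score_sequence(pssm, "VILSL"), 2))
print(round(score_sequence(pssm, "WWWWW"), 2))
```

A sequence that follows the conserved pattern scores above zero, while one made of residues never seen in the family scores below zero.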
PSI-BLAST
[Figure: a query sequence is scanned against the UniProt Database; PSI-BLAST aligns matched sequences and builds a profile; further iterations pull out more distant sequence relatives]
Altschul et al. (1997)
Use the Multiple Alignment to Calculate Residue Frequencies
PSI-BLAST
the residue frequencies at each position are used to calculate the scores for aligning a query sequence against the pattern
[Figure: multiple alignment of the query and its relatives over positions P1 ... P5, P6 ... Pn, with a putative relative aligned against it]
three times more powerful than BLAST!
[Figure: position specific substitution matrix (example entries include 10, 20, 70 and 90) used to fill the path matrix or score matrix for the sequence AICINRCKCRHP]
Multiple Alignment
• direct extensions of the standard DP approach for the alignment of 2 sequences are computationally impractical for more than 3 sequences
• practical heuristic solutions are based on the idea that sequences are evolutionarily related and can be aligned using an underlying phylogenetic tree
this is known as progressive alignment
(1) Pairwise Alignment
4 sequences A, B, C, D
6 pairwise comparisons, then cluster analysis
[Figure: tree from the cluster analysis, grouping B with D and A with C]
(2) Multiple Alignment following the tree from 1
Align most similar pair: gaps to optimise alignment
Align next most similar pair
Align alignments, preserve gaps: new gap to optimise alignment of BD with AC
Multiple Alignment
• start by aligning the most closely related pairs using DP and gradually align these groups together, keeping the gaps that appear in earlier alignments fixed
• alternatively can add sequences one at a time to a growing multiple alignment
the heuristic approach is not guaranteed to find the optimum alignment, but it is soundly based biologically
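The order in which a progressive method joins sequences can be sketched with a simple clustering of pairwise similarities. The similarity values below are hypothetical, chosen so that B and D are the closest pair as in the four-sequence example, and average linkage is an assumption rather than the method any particular program uses.

```python
from itertools import combinations


def merge_order(names, similarity):
    """Greedy average-linkage clustering.

    similarity maps frozenset({x, y}) to a similarity score for each pair.
    Returns the list of cluster pairs in the order they would be aligned.
    """
    def linkage(c1, c2):
        scores = [similarity[frozenset((x, y))] for x in c1 for y in c2]
        return sum(scores) / len(scores)

    clusters = [(name,) for name in names]
    order = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        order.append((c1, c2))
    return order


names = ["A", "B", "C", "D"]
pairs = list(combinations(names, 2))   # 4 sequences give 6 pairwise comparisons
# Hypothetical pairwise similarities (fraction identical), for illustration only.
similarity = {
    frozenset("BD"): 0.90, frozenset("AC"): 0.80,
    frozenset("AB"): 0.40, frozenset("AD"): 0.35,
    frozenset("BC"): 0.30, frozenset("CD"): 0.30,
}
order = merge_order(names, similarity)
for left, right in order:
    print("align", "".join(left), "with", "".join(right))
```

Each step in the returned order corresponds to one DP alignment: first two sequences, then two sequences, then the two resulting alignments.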
ClustalW
• since the choice of parameters used can have a significant effect on the alignment for very distant sequences, ClustalW addresses this problem by:
position specific gap opening and extension penalties
using different amino acid substitution matrices - one for close relatives, one for distant
Higgins, 1997
More recent resources:
MAFFT
MUSCLE
JALVIEW
ClustalW
Gap penalties
• where structure is known, one would want to increase the gap penalty within helices and strands and decrease it between them, forcing gaps to occur more frequently in loops
• if no structure is known, can use simple rules which depend on the residues occurring and the frequencies of gaps
e.g. use lower gap penalties where gaps already occur
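A minimal sketch of position specific gap penalties following the two rules above. The base penalty and both scaling factors are illustrative values, not ClustalW's parameters, and the secondary structure string is a hypothetical annotation ('H' helix, 'E' strand, '-' loop).

```python
def column_gap_penalties(alignment, base_penalty=3.0, gap_factor=0.5,
                         structure=None, structure_factor=2.0):
    """Return one gap penalty per alignment column.

    alignment: equal-length gapped rows ('-' marks a gap).
    structure: optional per-column string of 'H', 'E' or '-'.
    """
    penalties = []
    for idx, column in enumerate(zip(*alignment)):
        penalty = base_penalty
        if "-" in column:
            penalty *= gap_factor        # gaps already occur here, so new ones are cheaper
        if structure is not None and structure[idx] in ("H", "E"):
            penalty *= structure_factor  # discourage gaps inside helices and strands
        penalties.append(penalty)
    return penalties


# Hypothetical alignment and structure annotation, for illustration only.
rows = ["VILS-LV", "VIL--LV", "VILSALV"]
penalties = column_gap_penalties(rows, structure="HHH--EE")
print(penalties)
```

Columns in the loop that already contain gaps end up with the lowest penalty, so further gaps are steered there rather than into the helix or strand.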
Searching Protein Family Databases
Secondary databases (as opposed to primary sequence databases) group proteins into related families
Families are usually represented by a sequence profile or sequence model (Hidden Markov Model, HMM) derived from a multiple sequence alignment of the relatives
HMMs for Protein Domain Family Recognition
Pfam, SUPERFAMILY, Gene3D: Hidden Markov Models (HMMs)
• sequence is aligned using a probabilistic model of interconnecting match, delete or insert states
• contains statistical information on observed and expected positional variation - "fingerprint of a protein family"
[Figure: profile HMM running from a begin state (B) to an end state (E) through match (Mi), delete (Di) and insert (Ii) states]
Pfam:
Pfam-A: 10,340 curated families with annotation
Pfam-B: 224,303 families derived from ADDA (50% clearly related to a Pfam-A)
UniProt coverage: 74% of sequences, 51% of residues
PDB coverage: 94% of sequences, 76% of residues
[Figure: chart of coverage split into Pfam-A, Pfam-B and Other]
[Figure: Pfam pipeline. A manually curated SEED alignment of representative members is used to build a profile-HMM (HMMer-2.0), which is searched against UniProt to give the automatically made FULL alignment]
Pfam classification
[Figure: classification levels built up over three slides; the final slide shows Protein, Clan, Family and Protein fold, etc.]