![Page 1: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/1.jpg)
Biological Sequence Analysis
Chapter 3
Claus Lundegaard
![Page 2: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/2.jpg)
Objectives
• Review sequence alignment
• Scoring matrices
• Insertions/deletions
• Dynamic programming
• Multiple alignments
• How is it done?
![Page 3: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/3.jpg)
Protein Families
Organism 1
Organism 2
Enzyme 1
Enzyme 2
Closely related Same Function
MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLR
::::::.::::::::::::::::::::.:::::::::::::::::::::::::::::
MSEKKQTVDLGLLEEDDEFEEFPAEDWTGLDEDEDAHVWEDNWDDDNVEDDFSNQLR
Related Sequences
Protein Family
![Page 4: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/4.jpg)
Homology modeling and the human genome
![Page 5: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/5.jpg)
Alignments
ACDEFGHIKLMN
ACEDFGHIPLMN
75% ID

ACDEFGHIKLMN
ACACFGKIKLMN
75% ID
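The two 75% figures can be checked with a minimal sketch. The function name `percent_identity` is an illustrative choice, not something from the slides, and the sketch assumes ungapped alignments of equal length.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions carrying the same residue."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


# The two example alignments from the slide.
first = percent_identity("ACDEFGHIKLMN", "ACEDFGHIPLMN")
second = percent_identity("ACDEFGHIKLMN", "ACACFGKIKLMN")
```

Both alignments have the same percent identity even though the substituted residues differ, which is the motivation for the substitution scores that follow.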
![Page 6: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/6.jpg)
Substitutions
D Aspartic acid
E Glutamic acid
![Page 7: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/7.jpg)
Substitutions
T Threonine
S Serine
![Page 8: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/8.jpg)
Substitutions
T Threonine
W Tryptophan
![Page 9: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/9.jpg)
Deriving Substitution Scores
BLOSUM (Henikoff & Henikoff, 1992)
Protein Family
Block A Block B
![Page 10: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/10.jpg)
BLOSUM Matrices (Henikoff & Henikoff, 1992)
One column of a block with s = 10 sequences and width w = 1:
...A...
...A...
...A...
...S...
...A...
...A...
...A...
...A...
...A...
...A...
Observed pair counts, f:
AA pairs: 8 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = 36
AS pairs: 9
Total number of pairs: ws(s-1)/2 = 1x10x9/2 = 45
![Page 11: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/11.jpg)
BLOSUM Matrices (Henikoff & Henikoff, 1992)
The probability of occurrence of the i’th amino acid in an i, j pair is:
$p_i = q_{ii} + \sum_{j \neq i} q_{ij}/2$
where $q_{ij}$ is the observed frequency of the pair i, j.
45 pairs = 90 participants in pairs
A’s in pairs: 36x2 (AA) + 9x1 (AS) = 81
S’s in pairs: 0x2 (SS) + 9x1 (AS) = 9
Probability pA for encountering an A: 81/90 = 0.9
Probability pS for encountering an S: 9/90 = 0.1
OR
qAA = 36/45 = 0.8
qAS = 9/45 = 0.2
pA = 0.8 + 0.2/2 = 0.9
![Page 12: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/12.jpg)
BLOSUM Matrices (Henikoff & Henikoff, 1992)
Expected probability, e, of occurrence of pairs:
eAA = pApA = 0.9x0.9 = 0.81
eAS = pApS + pSpA = 0.9x0.1 + 0.1x0.9 = 2x(0.9x0.1) = 0.18
eSS = pSpS = 0.1x0.1 = 0.01
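The counting above can be reproduced with a short sketch. The function name `pair_statistics` and the dictionary layout are assumptions made for illustration; the arithmetic follows the slides (pair counts, observed pair frequencies q, residue probabilities p, expected pair frequencies e).

```python
from collections import Counter
from itertools import combinations, combinations_with_replacement


def pair_statistics(column: str):
    """Return (pair counts, q, p, e) for one alignment column."""
    # Count every unordered residue pair in the column, e.g. "AA" or "AS".
    pair_counts = Counter("".join(sorted(pair)) for pair in combinations(column, 2))
    total_pairs = sum(pair_counts.values())
    q = {pair: n / total_pairs for pair, n in pair_counts.items()}

    # p_i = q_ii + sum over j != i of q_ij / 2
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # e_ii = p_i * p_i, e_ij = 2 * p_i * p_j for i != j
    e = {}
    for a, b in combinations_with_replacement(sorted(p), 2):
        e[a + b] = p[a] * p[b] if a == b else 2 * p[a] * p[b]
    return pair_counts, q, dict(p), e


# The example column: nine A's and one S.
counts, q, p, e = pair_statistics("AAASAAAAAA")
```

For the example column this gives 36 AA pairs and 9 AS pairs, pA = 0.9, pS = 0.1, and expected frequencies of 0.81, 0.18 and 0.01, up to floating-point rounding.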
![Page 13: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/13.jpg)
BLOSUM Matrices (Henikoff & Henikoff, 1992)
Odds and log-odds:
Odds ratio: $q_{ij}/e_{ij}$
Log-odds score, s: $s_{ij} = \log_2(q_{ij}/e_{ij})$
$s_{ij} = 0$ means that the observed frequencies are as expected
$s_{ij} < 0$ means that the observed frequencies are lower than expected
$s_{ij} > 0$ means that the observed frequencies are higher than expected
In the final BLOSUM matrices, values are presented in half-bits, i.e., log-odds are multiplied by 2 and rounded to the nearest integer.
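As a rough sketch of the half-bit conversion, assuming a base-2 logarithm (implied by the "half-bits" unit) and Python's built-in rounding; the function name `half_bit_score` is hypothetical.

```python
import math


def half_bit_score(q_observed: float, e_expected: float) -> int:
    """Log-odds score in half-bits: 2 * log2(q / e), rounded to an integer."""
    return round(2 * math.log2(q_observed / e_expected))


# Values from the single-column example: q_AA = 0.8, e_AA = 0.81,
# q_AS = 0.2, e_AS = 0.18. Both are close to what is expected by chance.
score_aa = half_bit_score(0.8, 0.81)
score_as = half_bit_score(0.2, 0.18)
```

A pair observed four times more often than expected would score 4 half-bits. A pair that is never observed (q = 0) has no finite log-odds score, so real matrices need many blocks before every pair has a count.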
![Page 14: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/14.jpg)
BLOSUM Matrices (Henikoff & Henikoff, 1992)
• Segment clustering
• Sequences with more than X% ID are represented as one average sequence (cluster)
• A sequence is added to a cluster if it has more than X% ID to any of the sequences already in the cluster
• If the clustering level is 50% ID, the final matrix is BLOSUM50; clustering at 62% ID gives the BLOSUM62 matrix, etc.
• The higher the clustering %ID, the more conserved the sequences that contribute to the matrix
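The clustering rule in the bullets above can be sketched as single-linkage clustering. This is an illustration only: the function name, the toy segments and the use of a strict "more than" comparison are assumptions, and the averaging of each cluster into one sequence is not shown.

```python
def cluster_segments(segments, threshold):
    """Group equal-length segments; a segment joins a cluster when it is more
    than `threshold` percent identical to ANY member of that cluster."""

    def identity(a, b):
        return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

    clusters = []
    for seg in segments:
        # Every existing cluster this segment is linked to gets merged with it.
        linked = [c for c in clusters if any(identity(seg, m) > threshold for m in c)]
        merged = [seg]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters


# Hypothetical toy segments, clustered at the 62% level.
toy_clusters = cluster_segments(["AAAA", "AAAT", "AATT", "GGGG"], 62)
```

Note that AATT is only 50% identical to AAAA but still ends up in the same cluster, because it is 75% identical to AAAT.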
![Page 15: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/15.jpg)
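BLOSUM Matrices (Henikoff & Henikoff, 1992)

|   | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V | B | Z | X | * |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 4 | -1 | -2 | -2 | 0 | -1 | -1 | 0 | -2 | -1 | -1 | -1 | -1 | -2 | -1 | 1 | 0 | -3 | -2 | 0 | -2 | -1 | 0 | -4 |
| R | -1 | 5 | 0 | -2 | -3 | 1 | 0 | -2 | 0 | -3 | -2 | 2 | -1 | -3 | -2 | -1 | -1 | -3 | -2 | -3 | -1 | 0 | -1 | -4 |
| N | -2 | 0 | 6 | 1 | -3 | 0 | 0 | 0 | 1 | -3 | -3 | 0 | -2 | -3 | -2 | 1 | 0 | -4 | -2 | -3 | 3 | 0 | -1 | -4 |
| D | -2 | -2 | 1 | 6 | -3 | 0 | 2 | -1 | -1 | -3 | -4 | -1 | -3 | -3 | -1 | 0 | -1 | -4 | -3 | -3 | 4 | 1 | -1 | -4 |
| C | 0 | -3 | -3 | -3 | 9 | -3 | -4 | -3 | -3 | -1 | -1 | -3 | -1 | -2 | -3 | -1 | -1 | -2 | -2 | -1 | -3 | -3 | -2 | -4 |
| Q | -1 | 1 | 0 | 0 | -3 | 5 | 2 | -2 | 0 | -3 | -2 | 1 | 0 | -3 | -1 | 0 | -1 | -2 | -1 | -2 | 0 | 3 | -1 | -4 |
| E | -1 | 0 | 0 | 2 | -4 | 2 | 5 | -2 | 0 | -3 | -3 | 1 | -2 | -3 | -1 | 0 | -1 | -3 | -2 | -2 | 1 | 4 | -1 | -4 |
| G | 0 | -2 | 0 | -1 | -3 | -2 | -2 | 6 | -2 | -4 | -4 | -2 | -3 | -3 | -2 | 0 | -2 | -2 | -3 | -3 | -1 | -2 | -1 | -4 |
| H | -2 | 0 | 1 | -1 | -3 | 0 | 0 | -2 | 8 | -3 | -3 | -1 | -2 | -1 | -2 | -1 | -2 | -2 | 2 | -3 | 0 | 0 | -1 | -4 |
| I | -1 | -3 | -3 | -3 | -1 | -3 | -3 | -4 | -3 | 4 | 2 | -3 | 1 | 0 | -3 | -2 | -1 | -3 | -1 | 3 | -3 | -3 | -1 | -4 |
| L | -1 | -2 | -3 | -4 | -1 | -2 | -3 | -4 | -3 | 2 | 4 | -2 | 2 | 0 | -3 | -2 | -1 | -2 | -1 | 1 | -4 | -3 | -1 | -4 |
| K | -1 | 2 | 0 | -1 | -3 | 1 | 1 | -2 | -1 | -3 | -2 | 5 | -1 | -3 | -1 | 0 | -1 | -3 | -2 | -2 | 0 | 1 | -1 | -4 |
| M | -1 | -1 | -2 | -3 | -1 | 0 | -2 | -3 | -2 | 1 | 2 | -1 | 5 | 0 | -2 | -1 | -1 | -1 | -1 | 1 | -3 | -1 | -1 | -4 |
| F | -2 | -3 | -3 | -3 | -2 | -3 | -3 | -3 | -1 | 0 | 0 | -3 | 0 | 6 | -4 | -2 | -2 | 1 | 3 | -1 | -3 | -3 | -1 | -4 |
| P | -1 | -2 | -2 | -1 | -3 | -1 | -1 | -2 | -2 | -3 | -3 | -1 | -2 | -4 | 7 | -1 | -1 | -4 | -3 | -2 | -2 | -1 | -2 | -4 |
| S | 1 | -1 | 1 | 0 | -1 | 0 | 0 | 0 | -1 | -2 | -2 | 0 | -1 | -2 | -1 | 4 | 1 | -3 | -2 | -2 | 0 | 0 | 0 | -4 |
| T | 0 | -1 | 0 | -1 | -1 | -1 | -1 | -2 | -2 | -1 | -1 | -1 | -1 | -2 | -1 | 1 | 5 | -2 | -2 | 0 | -1 | -1 | 0 | -4 |
| W | -3 | -3 | -4 | -4 | -2 | -2 | -3 | -2 | -2 | -3 | -2 | -3 | -1 | 1 | -4 | -3 | -2 | 11 | 2 | -3 | -4 | -3 | -2 | -4 |
| Y | -2 | -2 | -2 | -3 | -2 | -1 | -2 | -3 | 2 | -1 | -1 | -2 | -1 | 3 | -3 | -2 | -2 | 2 | 7 | -1 | -3 | -2 | -1 | -4 |
| V | 0 | -3 | -3 | -3 | -1 | -2 | -2 | -3 | -3 | 3 | 1 | -2 | 1 | -1 | -2 | -2 | 0 | -3 | -1 | 4 | -3 | -2 | -1 | -4 |
| B | -2 | -1 | 3 | 4 | -3 | 0 | 1 | -1 | 0 | -3 | -4 | 0 | -3 | -3 | -2 | 0 | -1 | -4 | -3 | -3 | 4 | 1 | -1 | -4 |
| Z | -1 | 0 | 0 | 1 | -3 | 3 | 4 | -2 | 0 | -3 | -3 | 1 | -1 | -3 | -1 | 0 | -1 | -3 | -2 | -2 | 1 | 4 | -1 | -4 |
| X | 0 | -1 | -1 | -1 | -2 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -2 | 0 | 0 | -2 | -1 | -1 | -1 | -1 | -1 | -4 |
| * | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | -4 | 1 |

ACDEFGHIKLMN
ACEDFGHIPLMN
4+9+2+2+6+6+8+4-1+4+5+6 = 55

ACDEFGHIKLMN
ACACFGKIKLMN
4+9-2-4+6+6-1+4+5+4+5+6 = 42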
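The two sums can be reproduced with a small sketch. To keep it short, the `SCORES` dictionary holds only the matrix entries these two alignments need, copied from the matrix on this slide; the helper names are illustrative.

```python
# Subset of the substitution matrix, limited to the pairs used below.
SCORES = {
    ("A", "A"): 4, ("C", "C"): 9, ("F", "F"): 6, ("G", "G"): 6,
    ("H", "H"): 8, ("I", "I"): 4, ("K", "K"): 5, ("L", "L"): 4,
    ("M", "M"): 5, ("N", "N"): 6,
    ("D", "E"): 2, ("D", "A"): -2, ("E", "C"): -4,
    ("H", "K"): -1, ("K", "P"): -1,
}


def pair_score(a: str, b: str) -> int:
    """Look up a symmetric substitution score."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def alignment_score(seq_a: str, seq_b: str) -> int:
    """Sum of substitution scores over an ungapped alignment."""
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))


conservative = alignment_score("ACDEFGHIKLMN", "ACEDFGHIPLMN")
radical = alignment_score("ACDEFGHIKLMN", "ACACFGKIKLMN")
```

Both alignments are 75% identical, yet the one with chemically similar substitutions (D/E) scores 55 while the other scores 42.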
![Page 16: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/16.jpg)
(A)
humanD -----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS
..: . : :.:. : . .. ...:. :::::::::::::::..::::.::::
Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK------
(B)
humanD ----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS
....:...:::::::::::::::::::::..:::..........::....:..::..........
Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK-----
Figure 3.3: (A) The human proteasomal subunit aligned to the mosquito homolog using the BLOSUM50 matrix. (B) The human proteasomal subunit aligned to the mosquito homolog using identity scores.
Gaps
humanD ----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAH-VWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS
....:...:::::::::::::::::::::..:::... :::::::::::::::..::::.::::
Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK------
![Page 17: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/17.jpg)
Gap Penalties
• A gap is a kind of mismatch, but...
• Often the gap score (gap penalty) has an even lower value than the lowest mismatch score
• Having only one type of gap penalty is called a linear gap cost
• Biologically, a stretch of one or more residues is often inserted or deleted in a single event
• Most alignment algorithms therefore use two gap penalties:
• One for making the first gap
• Another (a higher score, i.e., a smaller penalty) for making an additional gap
• This is called an affine gap penalty
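The difference between the two schemes can be sketched as follows. The numeric defaults (-2, -10, -1) are hypothetical values chosen for illustration, not values from the slides, and the convention assumed here is that the first gap position costs the opening score and each additional position costs the extension score.

```python
def linear_gap_score(length: int, gap: int = -2) -> int:
    """Linear gap cost: every gap position costs the same."""
    return gap * length


def affine_gap_score(length: int, gap_open: int = -10, gap_extend: int = -1) -> int:
    """Affine gap cost: one score for the first gap position, a higher
    (less negative) score for every additional position."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# One gap of length 3 versus three separate gaps of length 1.
one_long_gap = affine_gap_score(3)
three_short_gaps = 3 * affine_gap_score(1)
```

With the affine scheme, one long gap is much cheaper than several short ones, which matches the idea that an insertion or deletion is usually a single event.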
![Page 18: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/18.jpg)
Dynamic Programming
The rest of the slides are stolen from Anders Gorm Pedersen
Anders G. Pedersen
![Page 19: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/19.jpg)
Alignment depicted as path in matrix
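[Two matrices with T C G C A along the top and T C C A down the side, each showing one path. The two paths correspond to these alignments:]

TCGCA
TC-CA

TCGCA
T-CCA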
![Page 20: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/20.jpg)
Alignment depicted as path in matrix
[Matrix with T C G C A along the top and T C C A down the side; one position is labeled “x”.]
Position labeled “x”: TC aligned with TC. Possible alignments ending at this point:

--TC
TC--

-TC
T-C

TC
TC

Meaning of point in matrix: all residues up to this point have been aligned (but there are many different possible paths).
![Page 21: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/21.jpg)
Dynamic programming: computation of scores
[Matrix with T C G C A along the top and T C C A down the side; the position of interest is labeled “x”.]
Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”).
=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
score(x,y) = max {
score(x,y-1) - gap-penalty,
score(x-1,y-1) + substitution-score(x,y),
score(x-1,y) - gap-penalty
}
Each new score is found by choosing the maximum of three possibilities. For each square in matrix: keep track of where best score came from.
Fill in scores one row at a time, starting in upper left corner of matrix, ending in lower right corner.
![Page 26: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/26.jpg)
Dynamic programming: example
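Substitution scores:

|   | A | C | G | T |
|---|---|---|---|---|
| A | 1 | -1 | -1 | -1 |
| C | -1 | 1 | -1 | -1 |
| G | -1 | -1 | 1 | -1 |
| T | -1 | -1 | -1 | 1 |

Gaps: -2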
![Page 27: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/27.jpg)
Dynamic programming: example
![Page 28: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/28.jpg)
Dynamic programming: example
![Page 29: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/29.jpg)
Dynamic programming: example
![Page 30: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/30.jpg)
Dynamic programming: example
T C G C A
: :   : :
T C - C A
1+1-2+1+1 = 2
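The worked example can be reproduced with a minimal global alignment sketch, using the example scores (match 1, mismatch -1, gap -2). The function name and the tie-breaking order in the trace-back (diagonal first, then a gap in the second sequence) are assumptions; other tie-breaking rules can return a different alignment with the same score. The gap value is treated as a score that is added, so it is negative.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, aligned1, aligned2)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: alignments that start with gaps only.
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        score[i][0] = i * gap

    def sub(i, j):
        return match if seq1[j - 1] == seq2[i - 1] else mismatch

    # Fill one row at a time, taking the best of the three possible predecessors.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in seq1
                score[i][j - 1] + gap,            # gap in seq2
            )

    # Trace back from the lower right corner to the upper left corner.
    aligned1, aligned2 = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned1.append(seq1[j - 1])
            aligned2.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and score[i][j] == score[i][j - 1] + gap:
            aligned1.append(seq1[j - 1])
            aligned2.append("-")
            j -= 1
        else:
            aligned1.append("-")
            aligned2.append(seq2[i - 1])
            i -= 1
    return score[-1][-1], "".join(reversed(aligned1)), "".join(reversed(aligned2))


result = needleman_wunsch("TCGCA", "TCCA")
```

For TCGCA and TCCA this sketch recovers the alignment shown on the slide, with score 2.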
![Page 31: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/31.jpg)
Global versus local alignments
Global alignment: align full length of both sequences. (The “Needleman-Wunsch” algorithm).
Local alignment: find best partial alignment of two sequences (the “Smith-Waterman” algorithm).
Global alignment
Seq 1
Seq 2
Local alignment
![Page 32: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/32.jpg)
Local alignment overview
• The recursive formula is changed by adding a fourth possibility: zero. This means local alignment scores are never negative.
• Trace-back is started at the highest value rather than in lower right corner
• Trace-back is stopped as soon as a zero is encountered
score(x,y) = max {
score(x,y-1) - gap-penalty,
score(x-1,y-1) + substitution-score(x,y),
score(x-1,y) - gap-penalty,
0
}
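The three changes listed above (zero as a fourth option, trace-back from the highest value, stop at zero) can be sketched as follows. The function name, the default scores and the test sequences are illustrative assumptions; the gap value is a negative score that is added.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Local alignment by dynamic programming; returns (score, aligned1, aligned2)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if seq1[j - 1] == seq2[i - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,                                # fourth possibility: start afresh
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest value and stop at the first zero.
    aligned1, aligned2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned1.append(seq1[j - 1])
            aligned2.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i][j - 1] + gap:
            aligned1.append(seq1[j - 1])
            aligned2.append("-")
            j -= 1
        else:
            aligned1.append("-")
            aligned2.append(seq2[i - 1])
            i -= 1
    return best, "".join(reversed(aligned1)), "".join(reversed(aligned2))


# Two hypothetical sequences that only share the segment TCGA.
local_result = smith_waterman("GGTCGATT", "CCTCGACC")
```

Only the shared segment is reported; the unrelated flanking residues are left out of the alignment.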
![Page 33: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/33.jpg)
Local alignment: example
![Page 34: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/34.jpg)
Substitution matrices and sequence similarity
• Substitution matrices come as series of matrices calculated for different degrees of sequence similarity (different evolutionary distances).
• “Hard” matrices are designed for similar sequences
– Hard matrices are designated by high numbers in the BLOSUM series (e.g., BLOSUM80)
– Hard matrices yield short, highly conserved alignments
• “Soft” matrices are designed for less similar sequences
– Soft matrices have low BLOSUM numbers (e.g., BLOSUM45)
– Soft matrices yield longer, less well conserved alignments
![Page 35: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/35.jpg)
Alignments: things to keep in mind
“Optimal alignment” means “having the highest possible score, given substitution matrix and set of gap penalties”.
This is NOT necessarily the biologically most meaningful alignment.
Specifically, the underlying assumptions are often wrong: substitutions are not equally frequent at all positions, affine gap penalties do not model insertion/deletion well, etc.
Pairwise alignment programs always produce an alignment - even when it does not make sense to align sequences.
![Page 36: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/36.jpg)
Database searching
Using pairwise alignments to search
databases for similar sequences
Database
Query sequence
![Page 37: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/37.jpg)
Database searching
Most common use of pairwise sequence alignments is to search databases for related sequences. For instance: find probable function of newly isolated protein by identifying similar proteins with known function.
Most often, local alignment (“Smith-Waterman”) is used for database searching: you are interested in finding out if ANY domain in your protein looks like something that is known.
Often, full Smith-Waterman is too time-consuming for searching large databases, so heuristic methods are used (FASTA, BLAST).
![Page 38: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/38.jpg)
Database searching: heuristic search algorithms
• FASTA (Pearson 1995)
– Uses heuristics to avoid calculating the full dynamic programming matrix
– Speeds up searches by an order of magnitude compared to full Smith-Waterman
– The statistical side of FASTA is still stronger than that of BLAST
• BLAST (Altschul 1990, 1997)
– Uses rapid word lookup methods to completely skip most of the database entries
– Extremely fast
– One order of magnitude faster than FASTA
– Two orders of magnitude faster than Smith-Waterman
– Almost as sensitive as FASTA
![Page 39: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/39.jpg)
BLAST flavors
• BLASTN
– Nucleotide query sequence
– Nucleotide database
• BLASTP
– Protein query sequence
– Protein database
• BLASTX
– Nucleotide query sequence
– Protein database
– Compares all six reading frames with the database
• TBLASTN
– Protein query sequence
– Nucleotide database
– “On the fly” six frame translation of database
• TBLASTX
– Nucleotide query sequence
– Nucleotide database
– Compares all reading frames of query with all reading frames of the database
![Page 40: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/40.jpg)
Searching on the web: BLAST at NCBI
Very fast computer dedicated to running BLAST searches
Many databases that are always up to date
Nice simple web interface
But you still need knowledge about BLAST to use it properly
![Page 41: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/41.jpg)
When is a database hit significant?
• Problem:
– Even unrelated sequences can be aligned (yielding a low score)
– How do we know if a database hit is meaningful?
– When is an alignment score sufficiently high?
• Solution:
– Determine the range of alignment scores you would expect to get for random reasons (i.e., when aligning unrelated sequences).
– Compare actual scores to the distribution of random scores.
– Is the real score much higher than you’d expect by chance?
![Page 42: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/42.jpg)
Random alignment scores follow extreme value distributions
The exact shape and location of the distribution depend on the exact nature of the database and the query sequence
Searching a database of unrelated sequences results in scores following an extreme value distribution
![Page 43: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/43.jpg)
Significance of a hit: one possible solution
(1) Align query sequence to all sequences in database, note scores
(2) Fit actual scores to a mixture of two sub-distributions: (a) an extreme value distribution and (b) a normal distribution
(3) Use fitted extreme-value distribution to predict how many random hits to expect for any given score (the “E-value”)
![Page 44: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/44.jpg)
Significance of a hit: example
Search against a database of 10,000 sequences.
An extreme-value distribution (blue) is fitted to the distribution of all scores.
It is found that 99.9% of the blue distribution has a score below 112.
This means that when searching a database of 10,000 sequences you’d expect to get 0.1% * 10,000 = 10 hits with a score of 112 or better for random reasons
10 is the E-value of a hit with score 112. You want E-values well below 1!
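The arithmetic of this example can be sketched as follows. The Gumbel form is one standard extreme value distribution; the function names and the location and scale parameters are hypothetical, since the slides do not give the fitted values.

```python
import math


def gumbel_tail_probability(score, location, scale):
    """P(random score >= score) under an extreme value (Gumbel) distribution.
    `location` and `scale` would come from fitting the observed score distribution."""
    return 1.0 - math.exp(-math.exp(-(score - location) / scale))


def e_value(tail_probability, database_size):
    """Expected number of random hits at or above a score in a database search."""
    return tail_probability * database_size


# The slide's example: 0.1% of random scores reach 112, database of 10,000 sequences.
example_e = e_value(0.001, 10_000)
```

In this example the result is 10, so a hit with score 112 is not convincing on its own.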
![Page 45: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/45.jpg)
Database searching: E-values in BLAST
BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores
For this reason BLAST only allows certain combinations of substitution matrices and gap penalties
This also means that the fit is based on a different data set than the one you are working on
A word of caution: BLAST tends to overestimate the significance of its matches
E-values from BLAST are fine for identifying sure hits
One should be careful using BLAST’s E-values to judge if a marginal hit can be trusted (e.g., you may want to use E-values of 10^-4 to 10^-5).
![Page 46: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/46.jpg)
Refresher: pairwise alignments
• The most used substitution matrices are themselves derived empirically from simple multiple alignments
Multiple alignment → Calculate substitution frequencies → Convert to scores
Substitution frequencies: A/A 2.15%, A/C 0.03%, A/D 0.07%, ...
Score(A/C) = log( Freq(A/C),observed / Freq(A/C),expected )
![Page 47: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/47.jpg)
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS
Multiple alignment
![Page 48: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/48.jpg)
Multiple alignments: what use are they?
• Starting point for studies of molecular evolution
![Page 49: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/49.jpg)
Multiple alignments: what use are they?
• Characterization of protein families (sequence profiles):
– Identification of conserved (functionally important) sequence regions
– Prediction of structural features (disulfide bonds, amphipathic alpha-helices, surface loops, etc.)
![Page 50: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/50.jpg)
Scoring a multiple alignment: the “sum of pairs” score
One column from alignment:
...A...
...A...
...S...
...T...
Pairs: AA: 4, AS: 1, AT: 0, AS: 1, AT: 0, ST: 1
Score: 4+1+0+1+0+1 = 7
⇒ In theory, it is possible to define an alignment score for multiple alignments (there are many alternative scoring systems)
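A minimal sketch of the sum-of-pairs score for one column. The `SCORES` dictionary holds only the four substitution scores quoted on this slide, and the function name is an illustrative choice.

```python
from itertools import combinations

# Substitution scores quoted on the slide for this column.
SCORES = {("A", "A"): 4, ("A", "S"): 1, ("A", "T"): 0, ("S", "T"): 1}


def sum_of_pairs(column, scores):
    """Add up the substitution score of every pair of residues in one column."""
    total = 0
    for a, b in combinations(column, 2):
        total += scores[(a, b)] if (a, b) in scores else scores[(b, a)]
    return total


column_score = sum_of_pairs("AAST", SCORES)
```

The score of a whole multiple alignment would then be the sum over all columns, with some additional rule for gaps that is not shown here.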
![Page 51: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/51.jpg)
Multiple alignment: dynamic programming is only feasible for very small data sets
• In theory, optimal multiple alignment can be found by dynamic programming using a matrix with more dimensions (one dimension per sequence)
• BUT even with dynamic programming finding the optimal alignment very quickly becomes impossible due to the astronomical number of computations
• Full dynamic programming only possible for up to about 4-5 protein sequences of average length
• Even with heuristics, not feasible for more than 7-8 protein sequences
• Never used in practice
Dynamic programming matrix for 3 sequences
For 3 sequences, optimal path must come from one of 7 previous points
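The growth in work can be made concrete with a small sketch. At each step every sequence either contributes a residue or a gap, and the all-gap combination is excluded, which gives 2^n - 1 predecessor cells for n sequences. The function names and the sequence length of 300 are assumptions for illustration.

```python
def predecessor_count(n_sequences: int) -> int:
    """Number of cells a path can arrive from: every non-empty subset of
    sequences may advance by one residue."""
    return 2 ** n_sequences - 1


def matrix_cells(n_sequences: int, length: int) -> int:
    """Number of cells in the dynamic programming matrix for sequences of equal length."""
    return (length + 1) ** n_sequences


# Rough amount of work (cells x predecessors) for sequences of length 300.
work = {n: matrix_cells(n, 300) * predecessor_count(n) for n in (2, 3, 4, 5)}
```

Each added sequence multiplies the work by several hundred, which is why full dynamic programming stops being practical after a handful of sequences.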
![Page 52: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/52.jpg)
Multiple alignment: an approximate solution
• Progressive alignment (ClustalX and other programs):
1. Perform all pairwise alignments; keep track of sequence similarities between all pairs of sequences (construct “distance matrix”)
2. Align the most similar pair of sequences
3. Progressively add sequences to the (constantly growing) multiple alignment in order of decreasing similarity.
![Page 53: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/53.jpg)
Progressive alignment: details
1) Perform all pairwise alignments, note pairwise distances (construct “distance matrix”)
2) Construct pseudo-phylogenetic tree from pairwise distances
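Sequences S1, S2, S3, S4: 6 pairwise alignments

Distance matrix:

|    | S1 | S2 | S3 | S4 |
|----|----|----|----|----|
| S1 |    |    |    |    |
| S2 | 3  |    |    |    |
| S3 | 1  | 3  |    |    |
| S4 | 3  | 2  | 3  |    |

“Guide tree” (leaf order): S1 S3 S4 S2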
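A guide tree can be derived from the distance matrix with a simple greedy clustering. The sketch below uses average linkage (UPGMA-style) as an assumption; the slides do not say which tree-building method is used, and real programs may use a different one such as neighbor joining. Function and variable names are illustrative.

```python
from itertools import combinations


def build_guide_tree(distances):
    """distances maps frozenset({a, b}) to the distance between leaves a and b.
    Returns the tree as nested tuples."""
    leaves = sorted({name for pair in distances for name in pair})
    clusters = [(leaf, [leaf]) for leaf in leaves]  # (subtree, leaves in subtree)

    def average_distance(members_a, members_b):
        total = sum(distances[frozenset((a, b))] for a in members_a for b in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Join the two clusters that are closest on average.
        (tree_a, mem_a), (tree_b, mem_b) = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0][1], pair[1][1]),
        )
        clusters = [c for c in clusters if c[1] is not mem_a and c[1] is not mem_b]
        clusters.append(((tree_a, tree_b), mem_a + mem_b))
    return clusters[0][0]


# The distance matrix from the slide.
DISTANCES = {
    frozenset(("S1", "S2")): 3, frozenset(("S1", "S3")): 1,
    frozenset(("S2", "S3")): 3, frozenset(("S1", "S4")): 3,
    frozenset(("S2", "S4")): 2, frozenset(("S3", "S4")): 3,
}
guide_tree = build_guide_tree(DISTANCES)
```

For the example matrix, S1 and S3 are joined first (distance 1), then S2 and S4 (distance 2), and finally the two pairs are joined.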
![Page 54: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/54.jpg)
Progressive alignment: details
3) Use tree as guide for multiple alignment:
a) Align most similar pair of sequences using dynamic programming
b) Align next most similar pair
c) Align alignments using dynamic programming - preserve gaps
[Figure: guide tree with leaves S1 S3 S4 S2; S1 is aligned with S3, S2 with S4, and the two alignments are then aligned to each other.]
New gap to optimize alignment of (S2,S4) with (S1,S3)
![Page 55: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/55.jpg)
Aligning profiles
[Figure: alignment (S1,S3) + alignment (S2,S4) gives the combined alignment of S1, S3, S2, S4.]
New gap to optimize alignment of (S2,S4) with (S1,S3)
Aligning alignments: each alignment treated as a single sequence (a profile)
Full dynamic programming on two profiles
![Page 56: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/56.jpg)
Scoring profile alignments
One column from each alignment:
...A...
...S...
+
...S...
...T...
AS: 1, AT: 0
SS: 4, ST: 1
Score: (1+0+4+1)/4 = 1.5
Compare each residue in one profile to all residues in second profile. Score is average of all comparisons.
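The averaging rule can be written as a short sketch. The `SCORES` dictionary holds only the four scores quoted on this slide, and the function name is an illustrative choice.

```python
# Substitution scores quoted on the slide for this pair of columns.
SCORES = {("A", "S"): 1, ("A", "T"): 0, ("S", "S"): 4, ("S", "T"): 1}


def profile_column_score(column_a, column_b, scores):
    """Average substitution score over all residue pairs between two profile columns."""
    total = 0
    for a in column_a:
        for b in column_b:
            total += scores[(a, b)] if (a, b) in scores else scores[(b, a)]
    return total / (len(column_a) * len(column_b))


# Column (A, S) from one profile against column (S, T) from the other.
column_score = profile_column_score("AS", "ST", SCORES)
```

This per-column score takes the place of the single substitution score in the pairwise dynamic programming recursion when two profiles are aligned.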
![Page 57: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/57.jpg)
Additional ClustalX heuristics
• Sequence weighting:
– scores from similar groups of sequences are down-weighted
• Variable substitution matrices:
– during alignment ClustalX uses different substitution matrices depending on how similar the sequences/profiles are
• Variable gap penalties:
– gap penalties depend on substitution matrix
– gap penalties depend on similarity of sequences
– reduced gap penalties at existing gaps
– increased gap penalties CLOSE to existing gaps
– reduced gap penalties in hydrophilic stretches (presumed surface loop)
– residue-specific gap penalties
– and more...
![Page 58: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/58.jpg)
Other multiple alignment programs
ClustalW / ClustalX
pileup
multalign
multal
saga
hmmt
DIALIGN
SBpima
MLpima
T-Coffee
...
![Page 59: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/59.jpg)
![Page 60: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/60.jpg)
Global methods (e.g., ClustalX) get into trouble when data is not globally related!!!
ClustalX
Possible solutions:
(1) Cut out conserved regions of interest and THEN align them
(2) Use method that deals with local similarity (e.g. DIALIGN)
![Page 63: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/63.jpg)
• Real life example with ClustalW
![Page 64: Biological Sequence Analysis Chapter 3 Claus Lundegaard](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d475503460f94a2299e/html5/thumbnails/64.jpg)