Assumptions: Life is monophyletic. Biological entities (sequences, taxa) share common ancestry.
![Page 1: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/1.jpg)
GLOBAL PAIRWISE ALIGNMENT
GLOBAL ALIGNMENT OF: 2 NUCLEOTIDE SEQUENCES OR 2 AMINO-ACID SEQUENCES
![Page 2: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/2.jpg)
Assumptions:
Life is monophyletic.
Biological entities (sequences, taxa) share common ancestry.
![Page 3: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/3.jpg)
Any two organisms share a common ancestor in their past.
[Figure: an ancestor giving rise to descendant 1 and descendant 2]
![Page 4: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/4.jpg)
ancestor (~5 MYA)
![Page 5: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/5.jpg)
ancestor (~120 MYA)
![Page 6: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/6.jpg)
ancestor (~1,500 MYA)
![Page 7: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/7.jpg)
(1) Speciation events
(2) Gene duplication
(3) Duplicative transposition
Homologous sequences
![Page 8: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/8.jpg)
Homology: A term coined by Richard Owen in 1843.
Definition: Similarity resulting from common ancestry.
![Page 9: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/9.jpg)
Homology
There are three main types of molecular homology: orthology, paralogy (including ohnology) and xenology.
![Page 10: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/10.jpg)
Homology: General Definition
• Homology designates a qualitative relationship of common descent between entities.
• Two genes are either homologous or they are not!
– It doesn’t make sense to say “two genes are 43% homologous.”
– It doesn’t make sense to say “Linda is 43% pregnant.”
![Page 11: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/11.jpg)
Orthology & Paralogy
• Two genes are orthologs if they originated from a single ancestral gene in the most recent common ancestor of their respective genomes.
• Two genes are paralogs if they are related by gene duplication. Two genes are ohnologs if they are related by gene duplication due to genome duplication.
![Page 12: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/12.jpg)
![Page 13: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/13.jpg)
= Gene death
![Page 14: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/14.jpg)
Xenology is due to horizontal (lateral) gene transfer (HGT or LGT).
XA and XB are xenologs.
Distinguishing orthologs from xenologs is impossible in pairwise genomic comparisons, but possible when multiple genomes are compared.
![Page 15: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/15.jpg)
Orthology, Paralogy, Xenology (Fitch, Trends in Genetics, 2000. 16(5):227-231)
![Page 16: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/16.jpg)
By comparing homologous characters, we can reconstruct the evolutionary events that have led to the formation of the extant sequences from the common ancestor.
Homology
![Page 17: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/17.jpg)
When comparing sequences, we are interested in POSITIONAL HOMOLOGY. We identify POSITIONAL HOMOLOGY through SEQUENCE ALIGNMENT.
Homology
![Page 18: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/18.jpg)
Alignment: A hypothesis concerning positional homology among residues from two or more sequences.
Positional homology = In pairwise alignment, a pair of nucleotides from two homologous sequences that have descended from one nucleotide in the ancestor of the two sequences.
![Page 19: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/19.jpg)
Sequence alignment involves the identification of the correct location of deletions and insertions that have occurred in either of the two lineages since their divergence from a common ancestor.
![Page 20: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/20.jpg)
![Page 21: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/21.jpg)
[Figure: an unknown ancestral sequence gives rise to two descendant sequences, each through unknown events in an unknown order]
The true alignment is unknown.
![Page 22: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/22.jpg)
There are two modes of alignment.
Global alignment: each residue of sequence A is compared with each residue in sequence B. Global alignment algorithms are used in comparative and evolutionary studies.
Local alignment: Determining if sub-segments of one sequence are present in another. Local alignment methods have their greatest utility in database searching and retrieval (e.g., BLAST).
![Page 23: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/23.jpg)
For reasons of computational complexity, sequence alignment is divided into two categories:
Pairwise alignment (i.e., the alignment of two sequences).
Multiple-sequence alignment (i.e., the alignment of three or more sequences).
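Pairwise alignment problems have exact solutions that can be computed efficiently.
Multiple-sequence alignment problems are generally solved only approximately (heuristically), because exact methods become computationally prohibitive as the number of sequences grows.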
![Page 24: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/24.jpg)
A pairwise alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs:
(1) matches = the same nucleotide appears in both sequences.
(2) mismatches = different nucleotides are found in the two sequences.
(3) gaps = a base in one sequence and a null base in the other.
GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
![Page 25: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/25.jpg)
- Two DNA sequences: A and B.
- Lengths are m and n, respectively.
- The number of matched pairs is x.
- The number of mismatched pairs is y.
- Total number of bases in gaps is z.
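As a minimal sketch, the quantities x, y and z can be counted directly from the two rows of an alignment. The function name `count_pairs` and the reading of the example alignment as two 24-character rows are assumptions made for illustration. Because each matched or mismatched pair uses one base from each sequence and each gap position uses one base, the counts should satisfy m + n = 2(x + y) + z.

```python
def count_pairs(row_a, row_b, gap="-"):
    """Return (x, y, z): matched pairs, mismatched pairs, bases opposite a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    x = y = z = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            raise ValueError("a column cannot pair two gaps")
        if a == gap or b == gap:
            z += 1          # one base paired with a null base
        elif a == b:
            x += 1          # match
        else:
            y += 1          # mismatch
    return x, y, z


row_a = "GCGGCCCATCAGGTAGTTGGTG-G"
row_b = "GCGTTCCATC--CTGGTTGGTGTG"
x, y, z = count_pairs(row_a, row_b)
m = len(row_a.replace("-", ""))   # length of sequence A
n = len(row_b.replace("-", ""))   # length of sequence B
print(x, y, z, m + n == 2 * (x + y) + z)
```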
![Page 26: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/26.jpg)
There are internal and terminal gaps.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
![Page 27: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/27.jpg)
A terminal gap may indicate missing data.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
![Page 28: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/28.jpg)
An internal gap indicates that a deletion or an insertion has occurred in one of the two lineages.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
![Page 29: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/29.jpg)
When sequences are compared through alignment, it is impossible to tell whether a deletion has occurred in one sequence or an insertion has occurred in the other. Thus, deletions and insertions are collectively referred to as indels (short for insertion or deletion).
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
![Page 30: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/30.jpg)
The alignment is the first step in many functional and evolutionary studies.
Errors in alignment tend to amplify in later stages of the study.
![Page 31: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/31.jpg)
Motivation for sequence alignment
Function – Similarity may be indicative of similar function.
Evolution – Similarity may be indicative of common ancestry.
![Page 32: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/32.jpg)
Some definitions
![Page 33: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/33.jpg)
Methods of alignment:
1. Manual
2. Dot matrix
3. Distance Matrix
4. Combined (Distance + Manual)
![Page 34: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/34.jpg)
Manual alignment. When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection.
GCG-TCCATCAGGTAGTTGGTGTG
GCGATCCATCAGGTGGTTGGTGTG
![Page 35: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/35.jpg)
Advantages of manual alignment:
(1) use of a powerful and trainable tool (the brain, well… some brains).
(2) ability to integrate additional data, e.g., domain structure, biological function.
![Page 36: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/36.jpg)
![Page 37: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/37.jpg)
Protein Alignment may be guided by Secondary and Tertiary Structures
Homo sapiens DjlA protein
Escherichia coli DjlA protein
![Page 38: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/38.jpg)
Disadvantages of manual alignment:
subjectivity (the algorithm is unspecified)
irreproducibility (the results cannot be independently reproduced)
unscalability (inapplicable to long sequences)
incommensurability (the results cannot be compared to those obtained by other methods)
![Page 39: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/39.jpg)
The dot-matrix method (Gibbs and McIntyre, 1970): The two sequences are written out as column and row headings of a two-dimensional matrix. A dot is put in the dot-matrix plot at a position where the nucleotides in the two sequences are identical.
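The construction described above can be sketched in a few lines. The function names `dot_matrix` and `render`, and the choice of `*` and `.` as plot symbols, are illustrative assumptions rather than part of the original method description.

```python
def dot_matrix(seq_top, seq_left):
    """Boolean matrix: entry [i][j] is True where seq_left[i] == seq_top[j]."""
    return [[a == b for b in seq_top] for a in seq_left]


def render(seq_top, seq_left):
    """Text rendering with seq_top as column headings and seq_left as row headings."""
    matrix = dot_matrix(seq_top, seq_left)
    lines = ["  " + " ".join(seq_top)]
    for base, row in zip(seq_left, matrix):
        lines.append(base + " " + " ".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


print(render("GCGTTCCATC", "GCGGCCATCA"))
```

Runs of dots along a diagonal mark stretches where the two sequences agree without interruption.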
![Page 40: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/40.jpg)
The alignment is defined by a path from the upper-left element to the lower-right element.
![Page 41: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/41.jpg)
There are 4 possible steps in the path:
(1) a diagonal step through a dot = match.
(2) a diagonal step through an empty element of the matrix = mismatch.
(3) a horizontal step = a gap in the sequence on the left of the matrix.
(4) a vertical step = a gap in the sequence on the top of the matrix.
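The four step types can be turned into an alignment mechanically. The sketch below assumes a hypothetical encoding of the path as a string of `D` (diagonal), `H` (horizontal) and `V` (vertical) steps; a diagonal step yields a match or a mismatch depending on the two residues it pairs.

```python
def path_to_alignment(seq_top, seq_left, steps):
    """Convert a path through the dot matrix into two aligned rows.

    D = diagonal step (pairs one residue from each sequence),
    H = horizontal step (gap in the sequence on the left of the matrix),
    V = vertical step (gap in the sequence on the top of the matrix).
    """
    i = j = 0                      # i indexes seq_left, j indexes seq_top
    top_row, left_row = [], []
    for step in steps:
        if step == "D":
            top_row.append(seq_top[j])
            left_row.append(seq_left[i])
            i += 1
            j += 1
        elif step == "H":
            top_row.append(seq_top[j])
            left_row.append("-")
            j += 1
        elif step == "V":
            top_row.append("-")
            left_row.append(seq_left[i])
            i += 1
        else:
            raise ValueError("unknown step: " + repr(step))
    if i != len(seq_left) or j != len(seq_top):
        raise ValueError("path must end at the lower-right element")
    return "".join(top_row), "".join(left_row)


print(path_to_alignment("ACGT", "AGT", "DHDD"))
```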
![Page 42: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/42.jpg)
A dot matrix may become cluttered. With DNA sequences, ~25% of the elements will be occupied by dots by chance alone.
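Under the simplifying assumption (not stated in the slides) that the four nucleotides are equally frequent and that positions are independent, the ~25% figure follows from the probability that two randomly chosen bases are identical:

$$P(\text{dot}) = \sum_{i \in \{A,C,G,T\}} p_i^{2} = 4 \times \left(\tfrac{1}{4}\right)^{2} = \tfrac{1}{4}$$

With unequal base frequencies the sum of squares is larger than 1/4, so the plot would be even more cluttered.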
![Page 43: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/43.jpg)
The number of spurious matches is determined by: window size (how many residues are compared), stringency (the minimum number of matches for a hit), & alphabet size (number of character states). Window size must be an odd number (so that the window is centered on a residue).
window size = 1, stringency = 1, alphabet size = 4
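A windowed dot matrix might be sketched as follows. The function name `windowed_dot_matrix` is hypothetical, and two choices here are assumptions: the window is centered on the compared pair and read along the diagonal, and positions whose window would run off either sequence are simply left empty.

```python
def windowed_dot_matrix(seq_top, seq_left, window=1, stringency=1):
    """Place a dot at [i][j] when at least `stringency` of the `window`
    residue pairs centered on (i, j) along the diagonal are identical."""
    if window % 2 == 0:
        raise ValueError("window size must be an odd number")
    if not 1 <= stringency <= window:
        raise ValueError("stringency must lie between 1 and the window size")
    half = window // 2
    rows, cols = len(seq_left), len(seq_top)
    matrix = [[False] * cols for _ in range(rows)]
    for i in range(half, rows - half):
        for j in range(half, cols - half):
            hits = sum(seq_left[i + k] == seq_top[j + k]
                       for k in range(-half, half + 1))
            matrix[i][j] = hits >= stringency
    return matrix


loose = windowed_dot_matrix("GCGTTCCATC", "GCGGCCATCA", window=1, stringency=1)
strict = windowed_dot_matrix("GCGTTCCATC", "GCGGCCATCA", window=3, stringency=2)
print(sum(map(sum, loose)), sum(map(sum, strict)))
```

With window size 1 and stringency 1 this reduces to the plain dot matrix; raising both suppresses isolated chance matches while keeping diagonal runs.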
![Page 44: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/44.jpg)
window size = 1, stringency = 1, alphabet size = 4
window size = 3, stringency = 2, alphabet size = 4
![Page 45: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/45.jpg)
window size = 1, stringency = 1, alphabet size = 20
![Page 46: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/46.jpg)
Dot-matrix methods:
Advantages: By being a visual representation, and humans being visual animals, the method may unravel information on the evolution of sequences that cannot easily be gleaned from a line alignment.
Disadvantages: May not identify the best possible alignment.
![Page 47: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/47.jpg)
Advantages: Highlighting Information
The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene.
Window size = 60 amino acids; Stringency = 24 matches
![Page 48: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/48.jpg)
Advantages: Highlighting Information
The two pairs of diagonally oriented parallel lines most probably indicate that two small internal duplications occurred in the bacterial gene.
Window size = 60 amino acids; Stringency = 24 matches
![Page 49: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/49.jpg)
Disadvantages:
Not possible to identify the best alignment.
![Page 50: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/50.jpg)
Scoring Matrices & Gap Penalties
![Page 51: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/51.jpg)
The true alignment between two sequences is the one that reflects accurately the evolutionary relationships between the sequences.
Since the true alignment is unknown, in practice we look for the optimal alignment, which is the one in which the numbers of mismatches and gaps are minimized according to certain criteria.
![Page 52: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/52.jpg)
Unfortunately, reducing the number of mismatches results in an increase in the number of gaps, and vice versa.
![Page 53: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/53.jpg)
[Figure legend: matches, mismatches, nucleotides in gaps, gaps]
![Page 54: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/54.jpg)
The scoring scheme comprises a gap penalty and a scoring matrix, M(a,b), that specifies the score for each type of match (a = b) or mismatch (a ≠ b).
The units in a scoring matrix may be the nucleotides in the DNA or RNA sequences, the codons in protein-coding regions, or the amino acids in protein sequences.
![Page 55: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/55.jpg)
DNA scoring matrices are usually simple. In the simplest scheme all mismatches are given the same penalty.
M(a,b) is positive if a = b and negative otherwise.
In more complicated matrices a distinction may be made between transition and transversion mismatches or each type of mismatch may be penalized differently.
$$M(a,b) \begin{cases} > 0 & \text{if } a = b \\ < 0 & \text{if } a \neq b \end{cases}$$
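A DNA scoring matrix that separates transitions from transversions could look like the sketch below. The numeric values (match +5, transition -2, transversion -3) and the flat per-position gap score of -4 are hypothetical, chosen only to illustrate the idea; the function names are likewise illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(a, b, match=5, transition=-2, transversion=-3):
    """M(a, b): positive for a match, negative for a mismatch.
    Transitions (purine<->purine, pyrimidine<->pyrimidine) are penalized
    less than transversions (purine<->pyrimidine)."""
    if a == b:
        return match
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


def score_alignment(row_a, row_b, gap=-4):
    """Sum M(a, b) over paired bases and add `gap` for every gapped position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap
        else:
            total += substitution_score(a, b)
    return total


print(score_alignment("AG-T", "AACT"))
```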
![Page 56: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/56.jpg)
Further complications: Distinguishing among different matches and mismatches.
For example, a mismatched pair consisting of Leu & Ile, which are very similar biochemically to each other, may be given a lesser penalty than a mismatched pair consisting of Arg & Glu, which are very dissimilar from each other.
![Page 57: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/57.jpg)
[Figure: a Leu/Ile mismatch receives a lesser penalty than an Arg/Glu mismatch]
![Page 58: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/58.jpg)
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
![Page 59: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/59.jpg)
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
B = Asx (Asp or Asn)
Z = Glx (Glu or Gln)
X = unknown
* = termination codon
![Page 60: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/60.jpg)
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
The matrix is symmetrical
![Page 61: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/61.jpg)
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
Positive numbers on the diagonal
![Page 62: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/62.jpg)
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
Mismatches are usually penalized
![Page 63: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/63.jpg)
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
Some mismatches are not penalized
![Page 64: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/64.jpg)
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
A few mismatches are even rewarded
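The properties listed on these slides (symmetry, a positive diagonal, penalized, unpenalized and rewarded mismatches) can be illustrated with a handful of entries. The excerpt below hard-codes a few values as they appear in the standard BLOSUM62 table; it is only a fragment for illustration, the helper name `blosum62_excerpt` is hypothetical, and the values should be checked against a full, authoritative copy of the matrix before any real use.

```python
# A small excerpt of BLOSUM62, keyed by unordered residue pairs so that
# lookups are symmetrical by construction. One-letter amino-acid codes.
_EXCERPT = {
    frozenset("L"): 4,    # Leu/Leu, diagonal (match)
    frozenset("I"): 4,    # Ile/Ile, diagonal (match)
    frozenset("R"): 5,    # Arg/Arg, diagonal (match)
    frozenset("E"): 5,    # Glu/Glu, diagonal (match)
    frozenset("LI"): 2,   # Leu/Ile, a rewarded mismatch
    frozenset("RE"): 0,   # Arg/Glu, a mismatch that is not penalized
    frozenset("LR"): -2,  # Leu/Arg, a penalized mismatch
}


def blosum62_excerpt(a, b):
    """Look up the score for residues a and b in the excerpt above."""
    try:
        return _EXCERPT[frozenset((a, b))]
    except KeyError:
        raise KeyError("pair not included in this excerpt: " + a + "/" + b)


for pair in [("L", "I"), ("I", "L"), ("R", "E"), ("L", "R"), ("L", "L")]:
    print(pair, blosum62_excerpt(*pair))
```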
![Page 65: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/65.jpg)
Gap penalty (or cost) is a factor (or a set of factors) by which the gap values (numbers and lengths of gaps) are mathematically manipulated to make the gaps equivalent in value to the mismatches.
The gap penalties are based on our assessment of how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of occurrence of point substitutions.
![Page 66: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/66.jpg)
Mismatches
Gaps
![Page 67: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/67.jpg)
The gap penalty has two components: a gap-opening penalty and a gap-extension penalty.
![Page 68: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/68.jpg)
Three main gap-penalty systems: (1) Fixed gap-penalty system = 0 gap-extension costs.
![Page 69: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/69.jpg)
Three main gap-penalty systems: (2) Linear gap-penalty system = the gap-extension cost is calculated by multiplying the gap length minus 1 by a constant representing the gap-extension penalty for increasing the gap by 1.
![Page 70: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/70.jpg)
Three main gap-penalty systems: (3) Logarithmic gap-penalty system = the gap-extension penalty increases with the logarithm of the gap length, i.e., more slowly than in the linear system.
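The three systems can be compared side by side with a small sketch. The function name `gap_score`, the default values (opening -4, extension -1) and the use of the natural logarithm are assumptions made for illustration; the slides do not fix these details.

```python
import math


def gap_score(length, system, opening=-4.0, extension=-1.0):
    """Score contributed by a single gap of `length` positions.

    fixed:        opening penalty only (zero gap-extension cost)
    linear:       opening + extension * (length - 1)
    logarithmic:  opening + extension * ln(length)
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    if system == "fixed":
        return opening
    if system == "linear":
        return opening + extension * (length - 1)
    if system == "logarithmic":
        return opening + extension * math.log(length)
    raise ValueError("unknown gap-penalty system: " + repr(system))


for length in (1, 2, 5, 20):
    print(length, [round(gap_score(length, s), 2)
                   for s in ("fixed", "linear", "logarithmic")])
```

All three systems agree for a gap of length 1; they differ only in how quickly the penalty grows as the gap lengthens.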
![Page 71: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/71.jpg)
Alignment algorithms
![Page 72: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/72.jpg)
Aim: Given a predetermined set of criteria, find the alignment associated with the best score from among all possible alignments.
The OPTIMAL ALIGNMENT
![Page 73: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/73.jpg)
The number of possible alignments may be astronomical.
$$\binom{n+m}{\min(n,m)} = \frac{(n+m)!}{n!\,m!} \approx \sqrt{\frac{n+m}{2\pi nm}}\;\frac{(n+m)^{n+m}}{n^{n}m^{m}}$$
where n and m are the lengths of the two sequences to be aligned.
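Reading the expression above as a binomial coefficient together with its Stirling approximation, both sides can be evaluated numerically. This is a sketch under that reading; the function names are illustrative, and the approximation is computed in log space to avoid overflow for long sequences.

```python
import math


def exact_count(n, m):
    """Binomial coefficient C(n + m, min(n, m)) = (n + m)! / (n! m!)."""
    return math.comb(n + m, min(n, m))


def stirling_estimate(n, m):
    """sqrt((n+m) / (2*pi*n*m)) * (n+m)^(n+m) / (n^n * m^m), via logarithms."""
    log_value = (0.5 * math.log((n + m) / (2 * math.pi * n * m))
                 + (n + m) * math.log(n + m)
                 - n * math.log(n)
                 - m * math.log(m))
    return math.exp(log_value)


n = m = 20
print(exact_count(n, m), stirling_estimate(n, m))
```

The two values agree to within about one percent at n = m = 20, and the relative error shrinks as the sequences get longer.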
![Page 74: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/74.jpg)
The number of possible alignments may be astronomical.
For example, when two DNA sequences 200 residues long each are compared, there are more than $10^{153}$ possible alignments.
In comparison, the number of protons in the universe is only ~$10^{80}$.
![Page 75: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/75.jpg)
FORTUNATELY:
There are computer algorithms for finding the optimal alignment between two sequences that do not require an exhaustive search of all the possibilities.
![Page 76: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/76.jpg)
The Needleman-Wunsch (1970) algorithm uses Dynamic Programming
![Page 77: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/77.jpg)
Dynamic programming = a computational technique. It is applicable when large searches can be divided into a succession of small stages, such that (1) the solution of the initial search stage is trivial, (2) each partial solution in a later stage can be calculated by reference to only a small number of solutions in an earlier stage, and (3) the last stage contains the overall solution.
![Page 78: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/78.jpg)
Dynamic programming can be applied to problems of alignment because ALIGNMENT SCORES obey the following rules:
S1 x, 1 ySx1, y1S1 x1, 1 y1
![Page 79: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/79.jpg)
Path Graph for aligning two sequences
![Page 80: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/80.jpg)
allowed
![Page 81: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/81.jpg)
not allowed
![Page 82: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/82.jpg)
![Page 83: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/83.jpg)
Scoring scheme
match = +5
mismatch = –3
gap-opening penalty = –4
gap-extension penalty = 0
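A compact global-alignment sketch in the spirit of Needleman-Wunsch is given below, using match = +5 and mismatch = -3 from the scheme above. One simplification is an assumption of this sketch and not the scheme on the slide: every gapped position is charged -4, whereas a true gap-opening/gap-extension scheme needs extra bookkeeping to know whether a gap is being opened or extended. The function name and return format are illustrative.

```python
def needleman_wunsch(seq_a, seq_b, match=5, mismatch=-3, gap=-4):
    """Return (score, aligned_a, aligned_b) for a global alignment.
    Simplification: each gapped position costs `gap`."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1

    # Initialization: the first row and column hold pure-gap prefixes.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill: each cell depends only on its three already-computed neighbors.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,   # diagonal
                              score[i - 1][j] + gap,        # gap in seq_b
                              score[i][j - 1] + gap)        # gap in seq_a

    # Traceback from the lower-right cell to the upper-left cell.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

The three stages mirror the description of dynamic programming: a trivial initial stage, cells computed from a small number of earlier cells, and the overall score sitting in the last cell.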
![Page 84: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/84.jpg)
Matrix initialization
![Page 85: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/85.jpg)
Matrix initialization
0 + match = 5
![Page 86: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/86.jpg)
Matrix initialization
0 + gap = –4
![Page 87: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/87.jpg)
Matrix initialization
0 + gap = –4
![Page 88: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/88.jpg)
Matrix fill
0 + match = 5
![Page 89: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/89.jpg)
Matrix fill
5 + gap = 1
![Page 90: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/90.jpg)
Matrix fill
0 + gap = –4
![Page 91: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/91.jpg)
… and so on and so forth
![Page 92: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/92.jpg)
Complete matrix fill
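The fill stage can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the slides. It assumes that every gap position costs –4 (which is how the worked cells on these slides behave), and it uses the two sequences that appear in the final alignments of this example, GAATTCAGT and GGATCGA. The function name `fill_matrix` is hypothetical.

```python
MATCH, MISMATCH, GAP = 5, -3, -4  # scoring scheme from the slides


def fill_matrix(seq_a, seq_b):
    """Return the dynamic-programming score matrix for a global alignment.

    Assumption: each gap position costs GAP (a linear gap penalty).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Trivial first stage: a prefix aligned against nothing is all gaps.
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + GAP
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + GAP

    # Every later cell depends on only three earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = MATCH if seq_a[i - 1] == seq_b[j - 1] else MISMATCH
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + GAP,       # gap in seq_b
                score[i][j - 1] + GAP,       # gap in seq_a
            )
    return score


matrix = fill_matrix("GAATTCAGT", "GGATCGA")
# The last cell holds the score of the best global alignment.
print(matrix[-1][-1])
```

The three conditions for dynamic programming are visible in the code: the first row and column are trivial, each cell is computed from three neighbours, and the bottom-right cell contains the overall score.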
![Page 93: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/93.jpg)
Trace back
![Page 94: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/94.jpg)
95
The alignment is produced by starting either at the highest score in the rightmost column or the bottom row, or at the bottom rightmost cell, and then proceeding from right to left by following the best pointers.
This stage is called the traceback. The graph of pointers in the traceback is also referred to as the path graph because it defines the paths through the matrix that correspond to the optimal alignment or alignments.
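A traceback can be sketched as follows. This is a minimal illustration under stated assumptions: it starts at the bottom rightmost cell (so terminal gaps are penalized), it charges –4 for every gap position, and instead of storing pointers it re-derives them by checking which neighbouring cell explains the current score. The function names are hypothetical.

```python
MATCH, MISMATCH, GAP = 5, -3, -4  # scoring scheme from the slides


def pair_score(a, b):
    return MATCH if a == b else MISMATCH


def global_align(seq_a, seq_b):
    """Fill the matrix, then trace back from the bottom rightmost cell."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * GAP
    for j in range(1, cols):
        score[0][j] = j * GAP
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(seq_a[i - 1], seq_b[j - 1]),
                score[i - 1][j] + GAP,
                score[i][j - 1] + GAP,
            )

    top, bottom = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        diagonal = (
            i > 0 and j > 0
            and score[i][j] == score[i - 1][j - 1]
            + pair_score(seq_a[i - 1], seq_b[j - 1])
        )
        if diagonal:                       # residue aligned with residue
            top.append(seq_a[i - 1])
            bottom.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            top.append(seq_a[i - 1])       # gap in seq_b
            bottom.append("-")
            i -= 1
        else:
            top.append("-")                # gap in seq_a
            bottom.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, line_a, line_b = global_align("GAATTCAGT", "GGATCGA")
print(best)
print(line_a)
print(line_b)
```

Because the sketch always tries the diagonal move first, it reports only one optimal path. When several pointers are equally good, a different tie-breaking order would recover a different, equally scoring alignment.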
![Page 95: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/95.jpg)
Trace back (if we DO allow terminal gaps)
![Page 96: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/96.jpg)
Trace back (if we DO NOT allow terminal gaps)
![Page 97: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/97.jpg)
Trace back (if we DO NOT allow terminal gaps)
10 + gap ≠ 11; 14 + mismatch = 11; 10 + gap ≠ 11
![Page 98: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/98.jpg)
Trace back (if we DO NOT allow terminal gaps)
10 + gap ≠ 14; 9 + match = 14; 5 + gap ≠ 14
![Page 99: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/99.jpg)
Trace back (if we DO NOT allow terminal gaps)
4 + mismatch ≠ 9; 13 + gap = 9; 0 + gap ≠ 9
![Page 100: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/100.jpg)
Trace back (if we DO NOT allow terminal gaps)
8 + match = 13; 4 + gap ≠ 13; 9 + gap ≠ 13
![Page 101: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/101.jpg)
Trace back (if we DO NOT allow terminal gaps)
–1 + gap ≠ 8; 12 + gap = 8; 3 + match = 8
![Page 102: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/102.jpg)
Trace back (if we DO NOT allow terminal gaps)
7 + gap = 3; –6 + gap ≠ 3; –2 + mismatch ≠ 3
7 + gap ≠ 12; 7 + match = 12; 3 + gap ≠ 12
![Page 103: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/103.jpg)
Trace back (if we DO NOT allow terminal gaps)
…
![Page 104: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/104.jpg)
Trace back (complete)
high road / low road / middle road
![Page 105: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/105.jpg)
Two possible alignments:
GAATTCAGT
GGA-TC-GA
* * ** *

GAATTCAGT
GGAT-C-GA
* ** * *
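That the two alignments are equally good can be checked by scoring them column by column. The sketch below assumes each gap position costs –4, as in the worked example; the function name `alignment_score` is hypothetical.

```python
MATCH, MISMATCH, GAP = 5, -3, -4  # scoring scheme from the slides


def alignment_score(top, bottom):
    """Sum the column scores of an already aligned pair of sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += GAP
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total


print(alignment_score("GAATTCAGT", "GGA-TC-GA"))  # 11
print(alignment_score("GAATTCAGT", "GGAT-C-GA"))  # 11
```

Both alignments contain five matches, two mismatches and two gaps, so both score 5 × 5 – 2 × 3 – 2 × 4 = 11.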
![Page 106: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/106.jpg)
107
Scoring Matrices
Mismatch and gap penalties should be inversely proportional to the frequencies with which changes occur.
![Page 107: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/107.jpg)
108
| | To A | To T | To C | To G | Row totals |
|---|---|---|---|---|---|
| From A | | 3.4 ± 0.7 (3.6 ± 0.7) | 4.5 ± 0.8 (4.8 ± 0.9) | 12.5 ± 1.1 (13.3 ± 1.1) | 20.3 (21.6) |
| From T | 3.3 ± 0.6 (3.5 ± 0.6) | | 13.8 ± 1.9 (14.7 ± 2.0) | 3.3 ± 0.6 (3.5 ± 0.6) | 20.4 (21.7) |
| From C | 4.2 ± 0.5 (4.2 ± 0.5) | 20.7 ± 1.3 (16.4 ± 1.3) | | 4.6 ± 0.6 (4.4 ± 0.6) | 29.5 (25.1) |
| From G | 20.4 ± 1.4 (21.9 ± 1.5) | 4.4 ± 0.6 (4.6 ± 0.6) | 4.9 ± 0.7 (5.2 ± 0.8) | | 29.7 (31.6) |
| Column totals | 27.9 (29.5) | 28.5 (24.6) | 23.2 (23.2) | 20.5 (21.3) | |
Transitions (68%) occur more frequently than transversions (32%).
Mismatch penalties for transitions should be smaller than those for transversions.
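The distinction can be made concrete with a short sketch. The rates below are the first value of each cell in the table above (the values in parentheses are ignored). The transition penalty of –1 is a purely hypothetical number chosen for illustration; only the mismatch value of –3 comes from the earlier scoring scheme.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# (from, to): relative frequency in percent, first value of each table cell
RATES = {
    ("A", "T"): 3.4, ("A", "C"): 4.5, ("A", "G"): 12.5,
    ("T", "A"): 3.3, ("T", "C"): 13.8, ("T", "G"): 3.3,
    ("C", "A"): 4.2, ("C", "T"): 20.7, ("C", "G"): 4.6,
    ("G", "A"): 20.4, ("G", "T"): 4.4, ("G", "C"): 4.9,
}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def transition_share(rates):
    """Percentage of all substitutions that are transitions."""
    total = sum(rates.values())
    transitions = sum(v for (a, b), v in rates.items() if is_transition(a, b))
    return 100 * transitions / total


def nucleotide_score(a, b, match=5, transition=-1, transversion=-3):
    """Score one aligned pair; the transition penalty is an assumed value."""
    if a == b:
        return match
    return transition if is_transition(a, b) else transversion


print(round(transition_share(RATES), 1))
print(nucleotide_score("A", "G"), nucleotide_score("A", "T"))
```

Summing the four transition cells gives a share close to the 68% quoted on the slide; the small difference comes from rounding in the table.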
![Page 108: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/108.jpg)
109
Empirical substitution matrices
PAM (Percent/Point Accepted Mutation)
BLOSUM (BLOcks SUbstitution Matrix)
![Page 109: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/109.jpg)
110
PAM
• Developed by Margaret Dayhoff in 1978.
• Based on comparisons of very similar protein sequences.
![Page 110: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/110.jpg)
111
Log-odds ratios
• A scoring matrix is a table of values that describe the probability of a residue (amino acid or base) pair occurring in an alignment.
• The values in a scoring matrix are log ratios of two probabilities: one is the probability that the pair occurs at random, the other is the empirically observed probability of the pair.
• Because the scores are logarithms of probability ratios, they can be added to give a meaningful score for the entire alignment. The more positive the score, the better the alignment!
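The reason log-odds scores can be added is that a sum of logarithms equals the logarithm of a product. The sketch below illustrates this with three made-up pairs of probabilities; the numbers are hypothetical and are not taken from any real scoring matrix.

```python
import math


def log_odds(observed, random_expectation):
    """Base-10 log of the ratio between observed and random probability."""
    return math.log10(observed / random_expectation)


# Hypothetical (observed, random) probabilities for three aligned pairs
pairs = [(0.02, 0.01), (0.005, 0.01), (0.03, 0.01)]

column_scores = [log_odds(obs, rnd) for obs, rnd in pairs]
alignment_score = sum(column_scores)

# Multiplying the raw ratios and taking one logarithm gives the same number.
combined_ratio = math.prod(obs / rnd for obs, rnd in pairs)

print(round(alignment_score, 3))
print(round(math.log10(combined_ratio), 3))
```

A pair that is seen more often than expected by chance contributes a positive score, and a pair that is seen less often contributes a negative one.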
![Page 111: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/111.jpg)
112
The PAM matrices (Percent Accepted Mutations)
• Align sequences that are at least 85% identical.
– Minimizes ambiguity in alignments and the number of coincident mutations.
• Reconstruct phylogenetic trees and infer ancestral sequences.
• Tally replacements "accepted" by natural selection, in all pairwise comparisons.
– Meaning, the number of times j was replaced by i in all comparisons.
• Compute amino acid mutability (i.e., the propensity of a given amino acid, j, to be replaced).
![Page 112: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/112.jpg)
113
The PAM matrices
• Combine data to produce a Mutation Probability Matrix for one PAM of evolutionary distance, which is used to calculate the Log Odds Matrix for similarity scoring.
• Thus, depending on the protein family used, various PAM matrices result, some of which are good at locating evolutionarily distant conserved mutations and some of which are good at locating evolutionarily close conserved mutations.
![Page 113: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/113.jpg)
114
More on log-odds ratios
In PAM log-odds scores are multiplied by 10 to avoid decimals. Therefore, a PAM score of 2 actually corresponds to a log-odds ratio of 0.2.
$$0.2 = \text{substitution}_{i \to j} = \log_{10}\left(\frac{\text{observed } i \to j \text{ mutation rate}}{\text{expected rate}}\right)$$
The value 0.2 is $\log_{10}$ of the relative expectation value of the mutation. Therefore, the expectation value is $10^{0.2} \approx 1.6$.
So, a PAM score of 2 indicates that (in related sequences) the mutation would be expected to occur 1.6 times more frequently than random.
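The conversion from a PAM score back to a frequency ratio is a one-line calculation. The helper below is a hypothetical illustration; its `scale` argument encodes the factor of 10 mentioned above.

```python
def pam_score_to_ratio(score, scale=10):
    """Convert a scaled log10-odds score back to an observed/expected ratio."""
    return 10 ** (score / scale)


print(round(pam_score_to_ratio(2), 1))   # 1.6
print(round(pam_score_to_ratio(0), 1))   # 1.0
print(round(pam_score_to_ratio(-2), 2))  # 0.63
```

A score of 0 means the replacement occurs exactly as often as expected by chance, and a negative score means it occurs less often than expected.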
![Page 114: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/114.jpg)
115
PAM250
– Calculated for families of related proteins (>85% identity)
– 1 PAM is the amount of evolutionary change that yields, on average, one substitution in 100 amino acid residues
– A positive score signifies a common replacement whereas a negative score signifies an unlikely replacement
– PAM250 matrix assumes/is optimized for sequences separated by 250 PAM, i.e. 250 substitutions in 100 amino acids (longer evolutionary time)
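A distance of 250 substitutions per 100 residues does not mean that every residue differs, because the same site can be hit repeatedly and can even revert. The toy calculation below illustrates this. It is not Dayhoff's procedure: it assumes, purely for illustration, 20 equally frequent amino acids that all replace one another at the same rate, with 1% of sites changing per step. The function name is hypothetical.

```python
def identity_after(n_pam, n_states=20, step=0.01):
    """Probability that a site shows its original residue after n_pam steps.

    Toy model (an assumption, not Dayhoff's data): every residue changes
    with probability `step` per PAM unit, and all replacements are
    equally likely.
    """
    same = 1.0
    for _ in range(n_pam):
        # Either the site stays as it is, or it returns from another state.
        same = same * (1 - step) + (1 - same) * step / (n_states - 1)
    return same


for distance in (1, 60, 120, 250):
    print(distance, round(identity_after(distance), 3))
```

Under these simplifying assumptions about 12% of sites still match after 250 PAM. Real PAM matrices are built from observed, unequal replacement rates, so the actual figure differs, but the qualitative point holds: multiple hits at the same site keep the expected identity well above zero.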
![Page 115: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/115.jpg)
116
PAM250
Sequence alignment matrix that allows 250 accepted point mutations per 100 amino acids. PAM250 is suitable for comparing distantly related sequences, while a lower PAM is suitable for comparing more closely related sequences.
![Page 116: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/116.jpg)
117
Selecting a PAM Matrix
• Low PAM numbers: short sequences, strong local similarities.
• High PAM numbers: long sequences, weak similarities.
– PAM60 for close relations (60% identity)
– PAM120 recommended for general use (40% identity)
– PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices
– PAM40, PAM120, PAM250 recommended.
![Page 117: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/117.jpg)
118
BLOSUM
• Blocks Substitution Matrix
– Steven Henikoff and Jorja G. Henikoff (1992).
• Based on BLOCKS database (www.blocks.fhcrc.org)
– Families of proteins with identical function.
– Highly conserved protein domains.
• Ungapped local alignment to identify motifs
– Each motif is a block of local alignment.
– Counts amino acids observed in same column.
– Symmetrical model of substitution.
![Page 118: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/118.jpg)
119
BLOSUM62
• BLOSUM matrices are based on local alignments (“blocks” or conserved amino acid patterns).
• BLOSUM 62 is a matrix calculated from comparisons of sequences that share no more than 62% identity.
• All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
• BLOSUM 62 is the default matrix in BLAST 2.0.
![Page 119: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/119.jpg)
120
BLOSUM Matrices
• Different BLOSUMn matrices are calculated independently from BLOCKS
• BLOSUMn is based on sequences that are at most n percent identical.
![Page 120: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/120.jpg)
121
BLOSUM62
The procedure for calculating a BLOSUM matrix is based on a likelihood method estimating the occurrence of each possible pairwise substitution. Only aligned blocks are used to calculate the BLOSUMs.
The higher the score, the more closely related the sequences.
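The counting behind a BLOSUM-style matrix can be sketched on a tiny, made-up block. This is a simplified illustration under stated assumptions: it counts all residue pairs within each column of an ungapped block, turns the counts into observed and expected pair frequencies, and reports log-odds scores in half-bit units (2 × log2, rounded). It omits the clustering of similar sequences, and the block and function name are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def blosum_like_scores(block):
    """Log-odds scores from one ungapped block of aligned sequences."""
    # Count every unordered residue pair within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Residue frequencies implied by the pair frequencies.
    residue = defaultdict(float)
    for (a, b), freq in observed.items():
        if a == b:
            residue[a] += freq
        else:
            residue[a] += freq / 2
            residue[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        # Symmetrical model: an unordered mixed pair can arise in two ways.
        expected = residue[a] ** 2 if a == b else 2 * residue[a] * residue[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


toy_block = ["ASA", "ASA", "ASA", "SSA"]  # hypothetical 4-sequence block
print(blosum_like_scores(toy_block))
```

In this toy block identical pairs receive positive scores and the mixed pair receives a negative one, because A and S share a column less often than their overall frequencies would predict. Pairs that never occur in the block get no score at all, which is one reason real matrices need large amounts of data.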
![Page 121: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/121.jpg)
122
Why is BLOSUM62 called BLOSUM62?
Because, within each block, all sequences that shared at least 62% identity with ANY other member of that block were averaged and represented as one sequence.
![Page 122: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/122.jpg)
123
Selecting a BLOSUM Matrix
• For BLOSUMn, a higher n is suitable for sequences which are more similar
– BLOSUM62 recommended for general use
– BLOSUM80 for close relations
– BLOSUM45 for distant relations
![Page 123: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/123.jpg)
124
Equivalent PAM and Blosum matrices
The following matrices are roughly equivalent...
• PAM100 ==> Blosum90 (less divergent)
• PAM120 ==> Blosum80
• PAM160 ==> Blosum60
• PAM200 ==> Blosum52
• PAM250 ==> Blosum45 (more divergent)
Generally speaking...
• The Blosum matrices are best for detecting local alignments.
• The Blosum62 matrix is the best for detecting the majority of weak protein similarities.
• The Blosum45 matrix is the best for detecting long and weak alignments.
![Page 124: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/124.jpg)
125
Comparison of PAM250 and BLOSUM62
The relationship between BLOSUM and PAM substitution matrices:
BLOSUM matrices with higher numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences.
BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins.
If distant relatives of the query sequence are specifically being sought, the matrix can be tailored to that type of search.
![Page 125: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/125.jpg)
126
Scoring matrices commonly used
• PAM250
– Shown to be appropriate for searching for sequences of 17-27% identity.
• BLOSUM62
– Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships.
• BLOSUM50
– Shown to be better for FASTA searches.
![Page 126: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/126.jpg)
127
Effect of gap penalties on amino-acid alignment
Human pancreatic hormone precursor versus chicken pancreatic hormone
(a) Penalty for gaps is 0
(b) Penalty for a gap of size k nucleotides is $w_k = 1 + 0.1k$
(c) The same alignment as in (b), only the similarity between the two sequences is further enhanced by showing pairs of biochemically similar amino acids
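The gap penalty in (b) is a linear function of gap length with a fixed opening cost. A minimal sketch, with a hypothetical function name and parameter names:

```python
def gap_penalty(k, opening=1.0, extension=0.1):
    """Penalty w_k = opening + extension * k for a gap of length k."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return opening + extension * k


for length in (1, 5, 10):
    print(length, round(gap_penalty(length), 1))
```

With these values one gap of length 10 costs 2.0, whereas ten separate gaps of length 1 cost 11.0 in total, so the scheme favours a few long gaps over many short ones. Setting both parameters to 0 reproduces case (a), where gaps are free.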
![Page 127: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/127.jpg)
Alignments: things to keep in mind
“Optimal alignment” means “having the highest possible score, given a substitution matrix and a set of gap penalties”.
This is NOT necessarily the most meaningful alignment.
The assumptions of the algorithm are often wrong:
- substitutions are not equally frequent at all positions,
- it is very difficult to realistically model insertions and deletions.
Pairwise alignment programs ALWAYS produce an alignment (even when it does not make sense to align sequences).
![Page 128: Assumptions: Life is monophyletic Biological entities (sequences, taxa) share common ancestry](https://reader035.vdocument.in/reader035/viewer/2022062810/56815e6e550346895dccefe9/html5/thumbnails/128.jpg)